Performance on Windows is worse than preprint #202

LiuZhexuan · 2024-06-10T07:39:22Z

I am using the latest precompiled binary of SSIDS. I tested the matrices ND/nd3k and PARSEC/Si10H16, and the factorize+solve times are around 1.5 s and 4.5 s respectively. But in the SSIDS preprint "A sparse symmetric indefinite direct solver for GPU architectures", these two matrices are solved in less than 0.5 seconds on 2 E5-2687W CPUs. I am using an i9-14900K and all cores are at full load during the run. Considering that this CPU is 10 years newer and has 24 cores, I think the time should be lower.
Can someone help by testing the same matrices on a similar platform? Thanks!

LiuZhexuan · 2024-06-10T08:23:08Z

I tried spral_ssids.exe on these two matrices just now. The console shows times of ~0.5 s and ~1.1 s for ND/nd3k and PARSEC/Si10H16 respectively. Below is the console log:
D:\Desktop\111\bin> ./spral_ssids nd3k.rb --scale=auction --nrhs 2
Set scaling to Auction
solving for 2 right-hand sides
Reading 'nd3k.rb'...
ok
Forcing topology to 32
Using 0 GPUs
Used order 1
ok
Analyse took 0.405999988
Predict nfact = 1.49E+07
Predict nflop = 2.99E+10
nparts 1
cpu_fl 2.99E+10
gpu_fl 0.00E+00
Factorize...
ok
Factor took 0.485000014
Solve...
ok
Solve took 3.09999995E-02
number bad cmp = 0
fwd error || ||_inf = 5.9573790345268662E-011
bwd error scaled = 6.3086194706343708E-016 6.3086194706343708E-016
cmp: SMFCT
anal: 0.41
fact: 0.49
afact: 1.49E+07
aflop: 2.99E+10
nfact: 1.49E+07
nflop: 2.99E+10
delay: 0
inerti 0 0 9000
2x2piv 454
maxfro 3083
maxsup 2231
not_fi 0
not_se 0

D:\Desktop\111\bin> ./spral_ssids Si10H16.rb --scale=auction --nrhs 1
Set scaling to Auction
solving for 1 right-hand sides
Reading 'Si10H16.rb'...
ok
Forcing topology to 32
Using 0 GPUs
Used order 1
ok
Analyse took 0.296999991
Predict nfact = 3.18E+07
Predict nflop = 8.49E+10
nparts 1
cpu_fl 8.49E+10
gpu_fl 0.00E+00
Factorize...
ok
Factor took 1.01600003
Solve...
ok
Solve took 3.09999995E-02
number bad cmp = 0
fwd error || ||_inf = 5.3225091001252167E-011
bwd error scaled = 9.7885138181843040E-013
cmp: SMFCT
anal: 0.30
fact: 1.02
afact: 3.18E+07
aflop: 8.49E+10
nfact: 3.18E+07
nflop: 8.49E+10
delay: 0
inerti 41 0 17036
2x2piv 904
maxfro 4448
maxsup 3418
not_fi 0
not_se 0
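
To put the two runs above on a common scale, the reported flop counts can be divided by the factorization times to get an achieved flop rate. The following is a minimal sketch; the dictionary layout and the `gflops` helper are illustrative names, and the numbers are copied from the `nflop` and `Factor took` lines of the logs above.

```python
# Values copied from the two spral_ssids.exe logs above.
runs = {
    "ND/nd3k": {"nflop": 2.99e10, "factor_s": 0.485000014},
    "PARSEC/Si10H16": {"nflop": 8.49e10, "factor_s": 1.01600003},
}


def gflops(nflop, seconds):
    """Achieved factorization rate in GFLOP/s (hypothetical helper)."""
    return nflop / seconds / 1e9


# Rate per matrix, keyed by matrix name.
rates = {name: gflops(r["nflop"], r["factor_s"]) for name, r in runs.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} GFLOP/s")
```

This gives roughly 62 GFLOP/s for nd3k and 84 GFLOP/s for Si10H16, which is a more direct figure to compare across machines than wall-clock time alone.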

jfowkes · 2024-06-10T13:33:10Z

The precompiled SSIDS binaries are not optimised and are not intended to be; that would be impossible. If you would like optimised performance, please compile your own version of SPRAL, where you can then use optimised BLAS/LAPACK etc.

LiuZhexuan · 2024-06-10T13:43:33Z

> The precompiled SSIDS binaries are not optimised and are not intended to be; that would be impossible. If you would like optimised performance, please compile your own version of SPRAL, where you can then use optimised BLAS/LAPACK etc.

Thanks for the reply! I will try to use Meson later. But why is the performance of the spral_ssids.exe executable so much better? I think that exe uses the same DLL that my MSVC build uses.

jfowkes · 2024-06-10T13:48:14Z

> Thanks for the reply! I will try to use Meson later. But why is the performance of the spral_ssids.exe executable so much better? I think that exe uses the same DLL that my MSVC build uses.

Indeed, that is very strange. Is it possible that they are using different default SSIDS options/settings?

LiuZhexuan · 2024-06-10T14:39:14Z

> > Thanks for the reply! I will try to use Meson later. But why is the performance of the spral_ssids.exe executable so much better? I think that exe uses the same DLL that my MSVC build uses.
>
> Indeed, that is very strange. Is it possible that they are using different default SSIDS options/settings?

I used the default settings for both the MSVC build and spral_ssids.exe. Attached is my code:
main.zip

jfowkes · 2024-06-10T14:53:23Z

Right, but are you sure the default options are the same for both? In the spral_ssids.exe example above you're passing --scale=auction for auction scaling, but the default when spral_ssids_default_options is called is to use no scaling: https://ralna.github.io/spral/_build/html/C/ssids.html#derived-types

LiuZhexuan · 2024-06-10T15:06:49Z

> Right, but are you sure the default options are the same for both? In the spral_ssids.exe example above you're passing --scale=auction for auction scaling, but the default when spral_ssids_default_options is called is to use no scaling: https://ralna.github.io/spral/_build/html/C/ssids.html#derived-types

I tried again without passing --scale=auction, and the time is almost the same. Below are two test runs on an R5-5600X. Factorization with my MSVC build is still much slower (~4.5 s) than with the exe on this computer.

PS E:\test> .\spral_ssids.exe Si10H16.rb
Reading 'Si10H16.rb'...
ok
Forcing topology to 12
Using 0 GPUs
Used order 1
ok
Analyse took 0.375000000
Predict nfact = 3.18E+07
Predict nflop = 8.49E+10
nparts 1
cpu_fl 8.49E+10
gpu_fl 0.00E+00
Factorize...
ok
Factor took 1.71899998
Solve...
ok
Solve took 1.60000008E-02
number bad cmp = 0
fwd error || ||_inf = 6.3399951955034339E-011
bwd error scaled = 1.8274520805995796E-012
cmp: SMFCT
anal: 0.38
fact: 1.72
afact: 3.18E+07
aflop: 8.49E+10
nfact: 3.18E+07
nflop: 8.49E+10
delay: 0
inerti 41 0 17036
2x2piv 904
maxfro 4448
maxsup 3418
not_fi 0
not_se 0
PS E:\test> .\spral_ssids.exe Si10H16.rb --scale=auction
Set scaling to Auction
Reading 'Si10H16.rb'...
ok
Forcing topology to 12
Using 0 GPUs
Used order 1
ok
Analyse took 0.360000014
Predict nfact = 3.18E+07
Predict nflop = 8.49E+10
nparts 1
cpu_fl 8.49E+10
gpu_fl 0.00E+00
Factorize...
ok
Factor took 1.71800005
Solve...
ok
Solve took 1.60000008E-02
number bad cmp = 0
fwd error || ||_inf = 3.3993030612577968E-011
bwd error scaled = 1.7394675123364451E-012
cmp: SMFCT
anal: 0.36
fact: 1.72
afact: 3.18E+07
aflop: 8.49E+10
nfact: 3.18E+07
nflop: 8.49E+10
delay: 0
inerti 41 0 17036
2x2piv 904
maxfro 4448
maxsup 3418
not_fi 0
not_se 0
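
The effect of the scaling flag in the two runs above can be checked by pulling the timing lines out of each log and comparing them. This is a minimal sketch, not part of SPRAL: `parse_timings` and the variable names are hypothetical, and the log excerpts are copied from the output above.

```python
import re

# Matches lines such as "Factor took 1.71899998" or "Solve took 1.60000008E-02".
TIMING_RE = re.compile(r"^\s*(Analyse|Factor|Solve) took\s+(\S+)\s*$", re.MULTILINE)


def parse_timings(log_text):
    """Return a dict mapping phase name to seconds for one spral_ssids log."""
    return {phase: float(value) for phase, value in TIMING_RE.findall(log_text)}


# Timing lines from the run without --scale=auction.
no_scaling = parse_timings("""\
Analyse took 0.375000000
Factor took 1.71899998
Solve took 1.60000008E-02
""")

# Timing lines from the run with --scale=auction.
auction = parse_timings("""\
Analyse took 0.360000014
Factor took 1.71800005
Solve took 1.60000008E-02
""")

# Absolute difference per phase, in seconds.
diff = {phase: abs(no_scaling[phase] - auction[phase]) for phase in no_scaling}
for phase, seconds in diff.items():
    print(f"{phase}: differs by {seconds:.3f} s")
```

For these two logs the factorization times differ by about a millisecond, so on this matrix the scaling choice alone does not account for the gap.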

jfowkes · 2024-06-10T15:24:21Z

Interesting. spral_ssids.exe is compiled using MinGW's gfortran compiler and not MSVC, so it may be that.

amontoison · 2024-06-12T04:12:44Z

Note that the library libspral.dll provided by the artifact is also compiled with GCC / GFortran / G++ compilers and MinGW.
Precompiled artifacts should have optimized performance for a given architecture (x64, AArch64, etc.), but not for a specific microarchitecture (specific CPU model).
Except for BLAS / LAPACK, I don't think we see differences in practice.

LiuZhexuan · 2024-06-12T10:02:11Z

> Note that the library libspral.dll provided by the artifact is also compiled with GCC / GFortran / G++ compilers and MinGW. Precompiled artifacts should have optimized performance for a given architecture (x64, AArch64, etc.), but not for a specific microarchitecture (specific CPU model). Except for BLAS / LAPACK, I don't think we see differences in practice.

The BLAS provided with the precompiled library is OpenBLAS, which is also used by spral_ssids.exe.

mjacobse added build-system and removed build-system labels Jun 10, 2024
