Performance on Windows is worse than preprint #202

Open
LiuZhexuan opened this issue Jun 10, 2024 · 10 comments

Comments

@LiuZhexuan

I am using the latest precompiled binary of SSIDS. I tested the matrices ND/nd3k and PARSEC/Si10H16, and the factorize+solve times are around 1.5 s and 4.5 s respectively. But in the SSIDS preprint "A sparse symmetric indefinite direct solver for GPU architectures", these two matrices are solved in less than 0.5 seconds on 2 E5-2687W CPUs. I am using an i9-14900K and all cores are at full load while running. Considering this CPU is about 10 years newer and has 24 cores, I would expect the time to be lower.
Can someone help test the same matrices on a similar platform? Thanks!

@LiuZhexuan
Author

LiuZhexuan commented Jun 10, 2024

I just tried spral_ssids.exe on these two matrices. The console shows factorization times of ~0.5 s for ND/nd3k and ~1.1 s for PARSEC/Si10H16. Below is the console log:
D:\Desktop\111\bin> ./spral_ssids nd3k.rb --scale=auction --nrhs 2
Set scaling to Auction
solving for 2 right-hand sides
Reading 'nd3k.rb'...
ok
Forcing topology to 32
Using 0 GPUs
Used order 1
ok
Analyse took 0.405999988
Predict nfact = 1.49E+07
Predict nflop = 2.99E+10
nparts 1
cpu_fl 2.99E+10
gpu_fl 0.00E+00
Factorize...
ok
Factor took 0.485000014
Solve...
ok
Solve took 3.09999995E-02
number bad cmp = 0
fwd error || ||_inf = 5.9573790345268662E-011
bwd error scaled = 6.3086194706343708E-016 6.3086194706343708E-016
cmp: SMFCT
anal: 0.41
fact: 0.49
afact: 1.49E+07
aflop: 2.99E+10
nfact: 1.49E+07
nflop: 2.99E+10
delay: 0
inerti 0 0 9000
2x2piv 454
maxfro 3083
maxsup 2231
not_fi 0
not_se 0

D:\Desktop\111\bin> ./spral_ssids Si10H16.rb --scale=auction --nrhs 1
Set scaling to Auction
solving for 1 right-hand sides
Reading 'Si10H16.rb'...
ok
Forcing topology to 32
Using 0 GPUs
Used order 1
ok
Analyse took 0.296999991
Predict nfact = 3.18E+07
Predict nflop = 8.49E+10
nparts 1
cpu_fl 8.49E+10
gpu_fl 0.00E+00
Factorize...
ok
Factor took 1.01600003
Solve...
ok
Solve took 3.09999995E-02
number bad cmp = 0
fwd error || ||_inf = 5.3225091001252167E-011
bwd error scaled = 9.7885138181843040E-013
cmp: SMFCT
anal: 0.30
fact: 1.02
afact: 3.18E+07
aflop: 8.49E+10
nfact: 3.18E+07
nflop: 8.49E+10
delay: 0
inerti 41 0 17036
2x2piv 904
maxfro 4448
maxsup 3418
not_fi 0
not_se 0
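
To compare these runs with the preprint or with other machines, it can help to turn each log into a factorization rate rather than comparing raw seconds. The following is a minimal Python sketch, not part of SPRAL: the helper name `factor_rate_gflops` is hypothetical, and it assumes the predicted flop count (`Predict nflop`) is a fair stand-in for the work actually done, which seems reasonable here because the logs report `aflop` equal to `nflop` and `delay: 0`.

```python
import re

def factor_rate_gflops(log_text):
    """Parse a spral_ssids console log and return (nflop, seconds, GFLOP/s).

    Hypothetical helper: it only looks for the 'Predict nflop' and
    'Factor took' lines as they appear in the logs quoted in this thread.
    """
    nflop = float(re.search(r"Predict nflop\s*=\s*(\S+)", log_text).group(1))
    seconds = float(re.search(r"Factor took\s+(\S+)", log_text).group(1))
    return nflop, seconds, nflop / seconds / 1e9

# The two relevant lines from each log above (i9-14900K runs).
nd3k_log = """Predict nflop = 2.99E+10
Factor took 0.485000014
"""
si10h16_log = """Predict nflop = 8.49E+10
Factor took 1.01600003
"""

for name, log in [("nd3k", nd3k_log), ("Si10H16", si10h16_log)]:
    nflop, seconds, rate = factor_rate_gflops(log)
    print(f"{name}: {nflop:.2e} flops in {seconds:.3f} s = {rate:.1f} GFLOP/s")
```

With the numbers logged above this gives roughly 62 GFLOP/s for nd3k and 84 GFLOP/s for Si10H16. The same function can be applied to the timing of a program linked against the DLL, provided it prints the same two lines.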

@jfowkes
Contributor

jfowkes commented Jun 10, 2024

The precompiled SSIDS binaries are not optimised and are not intended to be; that would be impossible. If you would like optimised performance, please compile your own version of SPRAL; then you can use an optimised BLAS/LAPACK etc.

@LiuZhexuan
Author

Thanks for the reply! I will try building with Meson later. But why is the performance of the executable spral_ssids.exe so much better? I think that exe uses the same DLL as my MSVC build.

@jfowkes
Contributor

jfowkes commented Jun 10, 2024

Indeed, that is very strange. Is it possible that they are using different default SSIDS options/settings?

@LiuZhexuan
Author

LiuZhexuan commented Jun 10, 2024

I used the default settings for both the MSVC build and spral_ssids.exe. Attached is my code:
main.zip

@jfowkes
Contributor

jfowkes commented Jun 10, 2024

Right, but are you sure the default options are the same for both? In the spral_ssids.exe example above you're passing in --scale=auction for auction scaling, but the default is to use no scaling when spral_ssids_default_options is called: https://ralna.github.io/spral/_build/html/C/ssids.html#derived-types

@LiuZhexuan
Author

LiuZhexuan commented Jun 10, 2024

I tried again without passing --scale=auction; the time is almost the same. Below are two test runs on an R5-5600X. Factorization from my MSVC-built program is still much slower than the exe on this computer (~4.5 s).

PS E:\test> .\spral_ssids.exe Si10H16.rb
Reading 'Si10H16.rb'...
ok
Forcing topology to 12
Using 0 GPUs
Used order 1
ok
Analyse took 0.375000000
Predict nfact = 3.18E+07
Predict nflop = 8.49E+10
nparts 1
cpu_fl 8.49E+10
gpu_fl 0.00E+00
Factorize...
ok
Factor took 1.71899998
Solve...
ok
Solve took 1.60000008E-02
number bad cmp = 0
fwd error || ||_inf = 6.3399951955034339E-011
bwd error scaled = 1.8274520805995796E-012
cmp: SMFCT
anal: 0.38
fact: 1.72
afact: 3.18E+07
aflop: 8.49E+10
nfact: 3.18E+07
nflop: 8.49E+10
delay: 0
inerti 41 0 17036
2x2piv 904
maxfro 4448
maxsup 3418
not_fi 0
not_se 0
PS E:\test> .\spral_ssids.exe Si10H16.rb --scale=auction
Set scaling to Auction
Reading 'Si10H16.rb'...
ok
Forcing topology to 12
Using 0 GPUs
Used order 1
ok
Analyse took 0.360000014
Predict nfact = 3.18E+07
Predict nflop = 8.49E+10
nparts 1
cpu_fl 8.49E+10
gpu_fl 0.00E+00
Factorize...
ok
Factor took 1.71800005
Solve...
ok
Solve took 1.60000008E-02
number bad cmp = 0
fwd error || ||_inf = 3.3993030612577968E-011
bwd error scaled = 1.7394675123364451E-012
cmp: SMFCT
anal: 0.36
fact: 1.72
afact: 3.18E+07
aflop: 8.49E+10
nfact: 3.18E+07
nflop: 8.49E+10
delay: 0
inerti 41 0 17036
2x2piv 904
maxfro 4448
maxsup 3418
not_fi 0
not_se 0

@jfowkes
Contributor

jfowkes commented Jun 10, 2024

Interesting. spral_ssids.exe is compiled using MinGW's gfortran compiler and not MSVC, so that may be the cause.

@amontoison
Member

amontoison commented Jun 12, 2024

Note that the library libspral.dll provided by the artifact is also compiled with GCC / GFortran / G++ compilers and MinGW.
Precompiled artifacts should have optimized performance for a given architecture (x64, AArch64, etc.), but not for a specific microarchitecture (specific CPU model).
Except for BLAS / LAPACK, I don't think we see differences in practice.

@LiuZhexuan
Author

The BLAS provided with the precompiled library is OpenBLAS, which is also used by spral_ssids.exe.
