We wanted to get an idea of what might be different between the DGEMM from gh-146, which is brute-force "for loops," and the faster-performing version in OpenBLAS.
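For concreteness, here is a minimal sketch of the brute-force approach; this is only illustrative of the triple-loop structure, not the actual gh-146 code:

```python
import numpy as np

def dgemm_naive(A, B):
    """Brute-force double-precision matrix multiply: C = A @ B.

    Three nested loops, no blocking, no vectorization; the kind of
    baseline that OpenBLAS is being compared against here.
    """
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((m, n), dtype=np.float64)
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += A[i, p] * B[p, j]
            C[i, j] = acc
    return C
```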
From a quick scan of the OpenBLAS code base at the time of writing, there are a few relevant things to note:
- OpenBLAS uses hand-crafted assembly for many architectures and algorithms; for example, `kernel/x86_64/dgemm_kernel_4x8_haswell.S` has 5000 lines of assembly for DGEMM, so that is clearly optimization at a different level of "tuning" than where we are at the moment. I'm not sure how easy it will be for me to read the assembly and check for specific types of optimizations like pipelining and so on (the blocked-multiply sketch after this list gives a flavor of the idea, without the assembly).
- They also seem to be able to leverage CUDA kernels (e.g., `cuda_dgemm_kernel`), so it may make sense to compare with them on the GPU using a specific compilation of OpenBLAS for that scenario. That is less convenient for the current benchmarks, because SciPy is not GPU-swappable off the shelf; CuPy's cuBLAS bindings or some other Python interface could serve for a comparison (see the CuPy sketch after this list).
- `git grep -E -i "strassen"` returns no results in OpenBLAS, so the use of an algorithm with fundamentally different asymptotic behavior seems unlikely; see also this related discussion agreeing with that: https://stackoverflow.com/a/11421344/2942522
- In short, it seems like Strassen may have substantial constant factors, caching issues, and perhaps even numerical-stability issues that prevent it from being the primary choice despite its asymptotic advantage (this may also be why I don't think it is even mentioned in the IEEE paper we were looking at). The operation-count sketch after this list gives a feel for where the crossover might sit.
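To give a flavor of what those hand-tuned kernels optimize for, here is a minimal cache-blocking sketch. The block size of 64 is an arbitrary assumption for illustration; real kernels choose tile sizes to fit registers and L1/L2 cache and implement the inner tile in vectorized assembly rather than interpreted loops:

```python
import numpy as np

def dgemm_blocked(A, B, block=64):
    """Tiled matrix multiply: the same arithmetic as the naive version,
    but iterating over blocks so each tile of A, B, and C stays hot in
    cache while it is reused.  block=64 is a placeholder choice.
    """
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n), dtype=np.float64)
    for i0 in range(0, m, block):
        for p0 in range(0, k, block):
            for j0 in range(0, n, block):
                # Accumulate the contribution of one pair of tiles;
                # NumPy slicing clips safely at the matrix edges.
                C[i0:i0 + block, j0:j0 + block] += (
                    A[i0:i0 + block, p0:p0 + block]
                    @ B[p0:p0 + block, j0:j0 + block]
                )
    return C
```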
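If we want a GPU data point without recompiling OpenBLAS, something like the following CuPy comparison could work. This is a rough sketch assuming a CUDA-capable machine with CuPy installed; for dense float64 arrays, CuPy's `matmul` is backed by cuBLAS:

```python
import time

import numpy as np
import cupy as cp

n = 2048
A = np.random.rand(n, n)
B = np.random.rand(n, n)

# CPU reference (whatever BLAS NumPy is linked against).
t0 = time.perf_counter()
C_cpu = A @ B
t_cpu = time.perf_counter() - t0

# GPU via CuPy (cuBLAS under the hood for dense float64).
A_gpu = cp.asarray(A)
B_gpu = cp.asarray(B)
cp.matmul(A_gpu, B_gpu)  # warm-up call to exclude one-time init costs
cp.cuda.Stream.null.synchronize()
t0 = time.perf_counter()
C_gpu = cp.matmul(A_gpu, B_gpu)
cp.cuda.Stream.null.synchronize()  # wait for the kernel to finish
t_gpu = time.perf_counter() - t0

print(f"CPU: {t_cpu:.4f} s, GPU: {t_gpu:.4f} s")
print("max abs diff:", float(cp.max(cp.abs(cp.asarray(C_cpu) - C_gpu))))
```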
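For intuition on the Strassen trade-off: it replaces 8 recursive half-size multiplies with 7 at the cost of 18 extra half-size matrix additions, giving O(n^log2(7)) ≈ O(n^2.807) versus O(n^3), but the added additions and memory traffic push the practical crossover out to fairly large n. A back-of-the-envelope operation count (the recursion cutoff of 64 is an arbitrary choice here, and this ignores the caching and stability issues mentioned above):

```python
def classical_ops(n):
    # Classical triple loop: n^3 multiplies plus n^2 * (n - 1) additions.
    return n**3 + n**2 * (n - 1)

def strassen_ops(n, cutoff=64):
    # Strassen on an n x n problem (n a power of two): 7 recursive
    # half-size multiplies plus 18 half-size matrix additions, falling
    # back to the classical count below the (arbitrary) cutoff.
    if n <= cutoff:
        return classical_ops(n)
    half = n // 2
    return 7 * strassen_ops(half, cutoff) + 18 * half**2

for n in [128, 512, 2048, 8192]:
    ratio = classical_ops(n) / strassen_ops(n)
    print(f"n={n}: classical/strassen op ratio = {ratio:.2f}")
```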
Not sure how helpful all of this is, but my initial impression is that low-level, architecture-specific optimizations in assembly drive a lot of the improvement, rather than fancier, asymptotically superior algorithms that are far more complex.