A few notes on DGEMM vs. OpenBLAS #159

Open
tylerjereddy opened this issue Jan 16, 2023 · 0 comments
Comments

@tylerjereddy (Contributor)

We wanted to get an idea of what might be different between the DGEMM from gh-146, which is a brute-force triple "for loop" implementation, and the faster-performing version in OpenBLAS.
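For context, the brute-force version is presumably close to the textbook triple loop. A minimal sketch (assumptions mine: row-major storage, no transpose options, and the function name dgemm_naive, unlike the full BLAS interface):

```c
#include <stddef.h>

/* Textbook DGEMM: C = alpha*A*B + beta*C, row-major, no transposes.
   A is m x k, B is k x n, C is m x n. */
void dgemm_naive(size_t m, size_t n, size_t k,
                 double alpha, const double *A, const double *B,
                 double beta, double *C)
{
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            double acc = 0.0;
            for (size_t p = 0; p < k; p++)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
    }
}
```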

From a quick scan of the OpenBLAS code base at the time of writing, there are a few relevant things to note:

  • OpenBLAS uses hand-crafted assembly for many architectures and algorithms; for example, kernel/x86_64/dgemm_kernel_4x8_haswell.S has 5000 lines of assembly for DGEMM alone. So that is clearly optimization happening at a different level of "tuning" than ours at the moment; I'm not sure how easy it will be for me to read the assembly and check for specific optimizations like software pipelining (see the blocking sketch at the end of this comment for a flavor of what such kernels do)
  • they also seem to be able to leverage CUDA (e.g., cuda_dgemm_kernel), so it may make sense to compare against an OpenBLAS build compiled for the GPU scenario (though this is less convenient for the current benchmarks, because SciPy is not GPU-swappable off the shelf; CuPy's cuBLAS bindings or some other Python interface could serve for the comparison instead; see the cuBLAS sketch after this list)
  • git grep -E -i "strassen" returns no results in OpenBLAS, so the use of an algorithm with fundamentally different asymptotic behavior seems unlikely; see also this related discussion reaching the same conclusion: https://stackoverflow.com/a/11421344/2942522
    • in short, Strassen (which trades the 8 recursive multiplications of conventional blocked multiplication for 7, giving O(n^log2(7)) ≈ O(n^2.81) rather than O(n^3)) seems to suffer from large constant factors, poor cache behavior, and possibly numerical stability issues that keep it from being the primary choice despite the asymptotic advantage (this may also be why it isn't even mentioned in the IEEE paper we were looking at)
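For the GPU comparison floated above, the cuBLAS call itself is short. A sketch, assuming the matrices already live on the device and eliding handle creation, allocation, and error checking (the wrapper name gpu_dgemm is made up; cublasDgemm is the actual cuBLAS entry point):

```c
#include <cublas_v2.h>

/* C = A * B for square N x N column-major matrices resident on the GPU. */
void gpu_dgemm(cublasHandle_t handle, int N,
               const double *dA, const double *dB, double *dC)
{
    const double alpha = 1.0, beta = 0.0;  /* plain product, no accumulate */
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                N, N, N, &alpha, dA, N, dB, N, &beta, dC, N);
}
```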

Not sure how helpful all of this is, but my initial impression is that low-level, architecture-specific assembly optimizations drive most of the improvement, rather than fancier, asymptotically superior algorithms that are far more complex.
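To make the "different level of tuning" point a bit more concrete without reading assembly: even in plain C, simple cache blocking captures one of the memory-hierarchy optimizations that the hand-written kernels layer register blocking, SIMD, and prefetching on top of. A sketch, with NB as a made-up tuning parameter rather than anything taken from OpenBLAS:

```c
#include <stddef.h>

#define NB 64  /* tile size; real kernels tune this per cache/architecture */

/* Cache-blocked multiply: C += A*B, row-major. Working on NB x NB tiles
   keeps the active parts of A, B, and C in cache across the inner loops. */
void dgemm_blocked(size_t m, size_t n, size_t k,
                   const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < m; ii += NB)
      for (size_t pp = 0; pp < k; pp += NB)
        for (size_t jj = 0; jj < n; jj += NB) {
            size_t im = ii + NB < m ? ii + NB : m;
            size_t pm = pp + NB < k ? pp + NB : k;
            size_t jm = jj + NB < n ? jj + NB : n;
            for (size_t i = ii; i < im; i++)
                for (size_t p = pp; p < pm; p++) {
                    double a = A[i * k + p];
                    for (size_t j = jj; j < jm; j++)
                        C[i * n + j] += a * B[p * n + j];
                }
        }
}
```

(Presumably the 4x8 in the Haswell kernel's filename refers to the register-tile dimensions of its micro-kernel, i.e., the same blocking idea pushed down to the register level.)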
