
[BLAS, L2 GeMM] Throughput decreases for specific matrix sizes #197

Open
afzalxo opened this issue Mar 21, 2024 · 0 comments
afzalxo commented Mar 21, 2024

I implemented and ran the gemm_1CU example from here on an Alveo U50 card. The output includes a CSV record of the form

DATA_CSV:,MemWidth,Freq,M,K,N,Ops,KernelCycles,TimeKernelMs,TimeApiMs,EffKernelPct,EffApiPct,PerfKernelTops,PerfApiTops

From this, the kernel GOPS can be computed simply as PerfKernelTops * 1000. I ran the example for a range of square matrix sizes and plotted kernel GOPS (vertical axis) against matrix size (horizontal axis). The resulting performance profile is anomalous, as seen below: throughput drops significantly at matrix sizes that are multiples of 4096.
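For reference, a minimal sketch of how such a sweep can be scripted. The binary name, xclbin path, and argument order below are placeholders rather than the repo's actual CLI, and it assumes the benchmark prints the DATA_CSV header line above followed by a matching DATA_CSV data line:

```python
import subprocess

def kernel_gops(stdout: str) -> float:
    """Parse the DATA_CSV lines printed by the benchmark and return
    kernel GOPS, i.e. PerfKernelTops * 1000."""
    rows = [line.split(",") for line in stdout.splitlines()
            if line.startswith("DATA_CSV:")]
    header, data = rows[0], rows[1]  # first line = column names, second = values
    record = dict(zip(header, data))
    return float(record["PerfKernelTops"]) * 1000.0

# Hypothetical sweep over square sizes; adjust binary name and
# arguments to match your build of the gemm_1CU example.
for n in range(256, 8193, 256):
    out = subprocess.run(
        ["./gemm_bench.exe", "gemm.xclbin", str(n), str(n), str(n)],
        capture_output=True, text=True, check=True,
    ).stdout
    print(n, kernel_gops(out))
```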

[Figure: u50_int16_l2_gemm — kernel GOPS vs. square matrix size (U50, HBM, int16), with sharp dips at multiples of 4096]

I tried several data types (float, int32, int16), two FPGA cards (U50 and U280), and both HBM and DDR memory interfaces, and the results are consistent within each memory interface. With U50 HBM and U280 HBM, the drops occur at matrix sizes that are multiples of 4096, while with U280 DDR they occur at a different set of sizes.

Does anyone have a guess as to why this phenomenon occurs? Since the kernel tiles the large input matrices into submatrices and performs GeMM on those tiles, I would expect throughput to be largely independent of the overall matrix size, so these dips shouldn't happen. A quick stride calculation follows below.
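For context, a back-of-the-envelope calculation of the byte stride between consecutive rows of a fetched tile, assuming row-major storage. The element sizes are standard; the connection to bank/channel interleaving in the comment is speculation on my part, not something I have confirmed:

```python
# For a row-major M x K input, consecutive rows of any tile loaded
# from it are separated by the matrix's row pitch: K * sizeof(dtype).
SIZEOF = {"int16": 2, "int32": 4, "float": 4}

def tile_row_stride_bytes(k: int, dtype: str) -> int:
    return k * SIZEOF[dtype]

for k in (4032, 4096, 4160, 8192):
    s = tile_row_stride_bytes(k, "int16")
    # When K is a multiple of 4096, the stride is a large power of two,
    # so every tile row shares the same low address bits -- the bits a
    # memory controller typically uses to interleave banks/channels.
    print(f"K={k}: stride={s} bytes, power of two: {s & (s - 1) == 0}")
```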
