squarePacked GEMM. #586
Open: madanm3 wants to merge 13 commits into flame:master from madanm3:sqp
madanm3 (Contributor) commented Dec 15, 2021 (edited)
- The squarePacked (sqp) GEMM framework targets real and complex GEMM operations on small, square matrices (m = k = n <= 512, with m a multiple of 8).
- This first-level implementation covers double precision and double-complex precision only.
- The framework uses a set of column-major dgemm kernels, the mx8 set: {8mx6n, 8mx5n, 8mx4n, 8mx3n, 8mx2n, 8mx1n}.
- In dgemm, the A matrix is always packed by default; for AtxB operations, the transpose of A is done while packing.
- These real dgemm kernels are reused in the induced zgemm implementation, which uses the 3m algorithm.
- The 3m implementation in sqp packs all matrices (A, B and C), including their real and imaginary components.
- The new 3m method offers up to 25% gain over the other zgemm implementations in blis for the targeted sizes.
- Although the implementation in this PR is most efficient for small, square matrices, its parameters can be tuned for other sizes and shapes.
- For both real and complex GEMM, multiplication is replaced with addition and subtraction when alpha, beta = +/-1.
- A basic multithreading implementation that splits work items along the m dimension is included; it could be useful for large m with relatively small n and k. It is currently disabled because a more generic implementation, requiring further framework modification, is in progress.
- By default, k is not partitioned, since the focus is on small matrix sizes. This results in a single load and store of the C matrix.
- k partitioning can still be enabled by changing the kx parameter.
- The complete implementation is lightweight and limited to two C files: one for the framework and one for the kernels.
- New kernel sets can be added alongside the mx8 kernel set currently in use (new kernel sets are WIP).
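To make the 3m idea in the description concrete, here is a minimal sketch in C. It is not the code from this PR: the names `ref_dgemm_acc` and `zgemm_3m_sketch` are hypothetical, the matrices are assumed to be square, column-major and already split into real and imaginary parts, and a naive triple loop stands in for the optimized dgemm kernels. It shows how a complex product can be formed from three real products instead of four, using one workspace allocation.

```c
#include <assert.h>
#include <stdlib.h>

/* Naive column-major real GEMM: C += A * B, all n x n.
   Hypothetical stand-in for an optimized dgemm kernel. */
static void ref_dgemm_acc(int n, const double *a, const double *b, double *c)
{
    for (int j = 0; j < n; ++j)
        for (int p = 0; p < n; ++p) {
            double bpj = b[p + j * n];
            for (int i = 0; i < n; ++i)
                c[i + j * n] += a[i + p * n] * bpj;
        }
}

/* 3m sketch for (Ar + i*Ai) * (Br + i*Bi), accumulated into (Cr, Ci):
     P1 = Ar*Br,  P2 = Ai*Bi,  P3 = (Ar+Ai)*(Br+Bi)
     Cr += P1 - P2
     Ci += P3 - P1 - P2
   Returns 0 on success, -1 if the workspace cannot be allocated. */
static int zgemm_3m_sketch(int n,
                           const double *ar, const double *ai,
                           const double *br, const double *bi,
                           double *cr, double *ci)
{
    assert(n > 0);
    size_t len = (size_t)n * (size_t)n;

    /* One zero-initialized workspace holding five n x n buffers. */
    double *ws = calloc(5 * len, sizeof *ws);
    if (ws == NULL)
        return -1;
    double *a_sum = ws;
    double *b_sum = ws + len;
    double *p1 = ws + 2 * len;
    double *p2 = ws + 3 * len;
    double *p3 = ws + 4 * len;

    for (size_t t = 0; t < len; ++t) {
        a_sum[t] = ar[t] + ai[t];
        b_sum[t] = br[t] + bi[t];
    }

    ref_dgemm_acc(n, ar, br, p1);       /* P1 = Ar * Br           */
    ref_dgemm_acc(n, ai, bi, p2);       /* P2 = Ai * Bi           */
    ref_dgemm_acc(n, a_sum, b_sum, p3); /* P3 = (Ar+Ai) * (Br+Bi) */

    /* Each element of C is read and written exactly once here. */
    for (size_t t = 0; t < len; ++t) {
        cr[t] += p1[t] - p2[t];
        ci[t] += p3[t] - p1[t] - p2[t];
    }

    free(ws);
    return 0;
}
```

For example, with n = 1, A = 1 + 2i and B = 3 + 4i, the three products are P1 = 3, P2 = 8 and P3 = 21, giving C = -5 + 10i.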
1. In zgemm, mkernel outperforms nkernel for both m > n and n > m.
2. Irrespective of mu and nu sizes, mkernel is forced for zgemm, based on the analysis done.
Change-Id: Iafb7ddb2519c17cf2225da84d6cc74ed985cc21e
AMD-Internal: [CPUPL-1352]
1. The SquarePacked algorithm focuses on an efficient zgemm/dgemm implementation for square matrix sizes (m=k=n).
2. A variation of the 3m algorithm (3m_sqp) is implemented to allow a single load and store of the C matrix in the kernel.
3. Currently the method supports only m that is a multiple of 8. Residue cases are to be implemented later.
4. The real dgemm kernel (dgemm_sqp) is implemented without alpha and beta multiplication, since real alpha and beta scaling are handled in the 3m_sqp framework.
5. gemm_sqp supports dgemm when alpha = +/-1.0 and beta = 1.0.
Change-Id: I49becaf6079da4be29be5b06057ff4e50770a7d8
AMD-Internal: [CPUPL-1352]
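The alpha = +/-1.0 special case mentioned in this commit can be illustrated with a small sketch. This is an assumption about the general technique, not the PR's kernel code: `update_c_block` is a hypothetical helper that folds an unscaled product into C, replacing the multiply by a plain add or subtract when alpha is exactly +1 or -1 (with beta = 1, so C is only accumulated into).

```c
#include <assert.h>
#include <stddef.h>

/* C := C + alpha * AB over a contiguous block of n elements, where
   ab[] holds the unscaled product. Hypothetical helper: for alpha
   equal to +1 or -1 the scaling multiply is skipped entirely. */
static void update_c_block(size_t n, double alpha,
                           const double *ab, double *c)
{
    assert(n == 0 || (ab != NULL && c != NULL));

    if (alpha == 1.0) {
        for (size_t i = 0; i < n; ++i)
            c[i] += ab[i];            /* add only */
    } else if (alpha == -1.0) {
        for (size_t i = 0; i < n; ++i)
            c[i] -= ab[i];            /* subtract only */
    } else {
        for (size_t i = 0; i < n; ++i)
            c[i] += alpha * ab[i];    /* general case */
    }
}
```

Branching once outside the loop keeps the inner loops free of per-element conditionals, which is usually what a vectorizing compiler or hand-written kernel wants.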
1. Added comments.
AMD-Internal: [CPUPL-1429]
Change-Id: Ie37e24e58cd8bf836038a2258ebd09c3912fab9e
1. bli_malloc replaced with normal malloc, with address alignment done within 3m_sqp.
2. Function added to pack A real, imag and sum.
3. Function added to pack B real, imag and sum.
4. Function added to pack C real, imag, with beta handling.
5. Sum and sub vectorized.
AMD-Internal: [CPUPL-1352]
Change-Id: I514e9efb053d529caef2de413d74d0dac2ceca54
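As an illustration of what "pack real, imag and sum" could look like, here is a minimal scalar sketch. The type `zelem_sketch` and the function `pack_real_imag_sum` are hypothetical names, and the layout (interleaved complex input, column-major with leading dimension `lda`, three contiguous output buffers) is an assumption; the PR's actual routines are vectorized and may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical interleaved double-complex element. */
typedef struct {
    double real;
    double imag;
} zelem_sketch;

/* Pack an m x k column-major complex matrix (leading dimension lda)
   into three contiguous m x k real buffers:
     pr = real part, pi = imaginary part, ps = real + imaginary.
   The sum buffer is the extra operand the 3m method needs. */
static void pack_real_imag_sum(size_t m, size_t k,
                               const zelem_sketch *a, size_t lda,
                               double *pr, double *pi, double *ps)
{
    assert(lda >= m);

    for (size_t p = 0; p < k; ++p) {
        for (size_t i = 0; i < m; ++i) {
            double re = a[i + p * lda].real;
            double im = a[i + p * lda].imag;
            pr[i + p * m] = re;
            pi[i + p * m] = im;
            ps[i + p * m] = re + im;
        }
    }
}
```

Packing removes the leading-dimension stride, so the real kernels can read each buffer with unit stride.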
1. mx1, mx4 kernel addition and framework modification.
2. 8mx6n kernel addition.
3. NULL check added in dgemm_sqp malloc.
4. Mem tracing added.
5. Restricted 3m_sqp to limited matrix sizes.
6. Induced methods disabled temporarily for debug.
AMD-Internal: [CPUPL-1352]
Change-Id: I31671859b32bfbb359687fb7c9056f9eb904c8b2
1. Re-enabling 3m methods for zgemm.
2. Vectorization of pack_sum routines re-enabled with bug fix.
3. 8mx6n kernel added.
AMD-Internal: [CPUPL-1352]
Change-Id: Id9f010ba763afc52d268c2e68805f069919b8810
1. kx partitions added to k loop for dgemm and zgemm.
2. mx loop based threading model added for dgemm as prototype of zgemm.
3. nx loop added for 3m_sqp and dgemm_sqp.
4. Single 3m_sqp workspace allocation with smaller memory footprint.
5. sqp framework done from dgemm and zgemm.
6. sqp kernels moved to separate kernel file.
7. Residue kernel core added to handle mx<8.
8. Multi-instance tuning for 3m_sqp done.
9. User can set env "BLIS_MULTI_INSTANCE" to 1 for better multi-instance behavior of 3m_sqp.
AMD-Internal: [CPUPL-1521]
Change-Id: Ibef50a8a37fe99f164edb4621acb44fc0c86514c
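The effect of the kx parameter on the k loop can be sketched as follows. This is an illustrative reference loop, not the PR's kernel: `gemm_k_partitioned` is a hypothetical name, and the operands are assumed to be contiguous and column-major. Each k partition loads and stores every element of C once, so choosing kx >= k gives the single load and store of C that the PR description refers to.

```c
#include <assert.h>
#include <stddef.h>

/* C (m x n) += A (m x k) * B (k x n), all column-major and contiguous.
   The k dimension is processed in partitions of at most kx columns.
   Hypothetical reference code; a real kernel would be vectorized. */
static void gemm_k_partitioned(size_t m, size_t n, size_t k, size_t kx,
                               const double *a, const double *b, double *c)
{
    assert(kx > 0);

    for (size_t p0 = 0; p0 < k; p0 += kx) {
        size_t kb = (k - p0 < kx) ? (k - p0) : kx; /* partition size */

        for (size_t j = 0; j < n; ++j) {
            for (size_t i = 0; i < m; ++i) {
                double acc = c[i + j * m];         /* one load of C  */
                for (size_t p = p0; p < p0 + kb; ++p)
                    acc += a[i + p * m] * b[p + j * k];
                c[i + j * m] = acc;                /* one store of C */
            }
        }
    }
}
```

With k = 3, calling this with kx = 3 touches C once per element, while kx = 2 touches it twice; the numerical result is the same for the small integer inputs used to check it.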
1. 3m_sqp support for A matrix with conjugate_no_transpose and conjugate_transpose added.
AMD-Internal: [CPUPL-1521]
Change-Id: Ie6e5c49cf86f7d3b95d78705cf445e57f20b3d1f
1. Induced method turned off until the path is fully tested for different alpha, beta conditions.
2. Fix for beta = 0 and C = NaN done.
Change-Id: I5a7bd1393ac245c2ebb72f9a634728af4c0d4000
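The beta = 0 fix most likely concerns the usual BLAS convention that C is overwritten, not scaled, when beta is zero; the commit message does not spell this out, so treat the following as an assumption. The hypothetical helper `scale_c_by_beta` shows why the special case matters: in IEEE arithmetic 0 * NaN is NaN, so multiplying an uninitialized or NaN-filled C by zero would let the NaN leak into the result.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Apply C := beta * C over n contiguous elements.
   Hypothetical helper illustrating the beta == 0 special case:
   C is set to zero explicitly so that NaN or Inf values already
   in C cannot propagate through a multiplication by zero. */
static void scale_c_by_beta(size_t n, double beta, double *c)
{
    assert(n == 0 || c != NULL);

    if (beta == 0.0) {
        for (size_t i = 0; i < n; ++i)
            c[i] = 0.0;               /* overwrite, never multiply */
    } else if (beta != 1.0) {
        for (size_t i = 0; i < n; ++i)
            c[i] *= beta;
    }
    /* beta == 1.0: C is left untouched. */
}
```

A naive `c[i] *= beta` with beta = 0 would leave a NaN entry as NaN, which is the failure mode this kind of fix avoids.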
1. New err_t param in bli_malloc_user added.
2. AOCL_DTL log removed.
This reverts commit 231a464.