Revised calcHTranspose operator #68
Thanks @johnmauff for working on this. It is still unclear to me what you have done to reduce the memory usage but I have some clarification questions first.
src/CostFunction3D.h
uint64_t *IH; // uint64_t
uint32_t *JH; // uint32_t
Why do we define *IH and *JH with different types?
Good question, and it relates to your overall question of why these changes save memory. Values in the JH array range from 0 to nState-1, while values in IH range from 0 to nnz-1. nnz exceeds the 32-bit integer limit, but nState does not. So we can save memory by using 32-bit integers for the JH array, which has nnz elements in total. The previous implementation had multiple 64-bit integer index arrays that had nnz elements. I have added comments to CostFunction3D.h to help clarify the differences in word size.
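To make the arithmetic behind this explanation concrete, here is a minimal sketch rather than the code in this PR. The helper names `fitsIn32Bit` and `bytesSaved` are hypothetical; the only facts taken from the comment above are that the array being narrowed has nnz elements and that its values stay below nState.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: true when every index in [0, nState-1]
// can be stored in a uint32_t without truncation.
bool fitsIn32Bit(std::uint64_t nState) {
    return nState == 0 ||
           nState - 1 <= std::numeric_limits<std::uint32_t>::max();
}

// Hypothetical helper: bytes saved by storing one nnz-element index
// array as uint32_t instead of uint64_t (4 bytes per element).
std::uint64_t bytesSaved(std::uint64_t nnz) {
    return nnz * (sizeof(std::uint64_t) - sizeof(std::uint32_t));
}
```

Under these assumptions, each index array with nnz elements that is narrowed from 64-bit to 32-bit shrinks by half, and a guard like `fitsIn32Bit` is one way to detect a future case where nState outgrows the 32-bit range.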
Aha, I see. Thanks for your clarification and it now makes sense. As long as nState won't exceed the threshold for 32-bit integer in the future SAMURAI production test, this change seems safe to me.
Thanks @johnmauff for addressing my comments. Here is the second round of my questions.
Thanks @johnmauff for addressing all my comments.
I can confirm that I was able to run a 4-panel hurricane case on Casper's H100 GPU with your branch (it takes about 2 hours in total and runtime: Cost3D minimize: 5222).
One minor and optional suggestion: I would prefer to merge my PR #66 first (still waiting for @mmbell's review) so that we can merge it into your PR and test whether it works as expected.
@sjsprecious Thanks for your thorough review. I am fine with waiting for PR #66 to be committed first.
Thank you for making these changes and for the thorough discussion about the reasoning.
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@           Coverage Diff           @@
##             main      #68   +/-   ##
=======================================
  Coverage   32.62%   32.62%
=======================================
  Files          51       51
  Lines       16815    16815
=======================================
  Hits         5486     5486
  Misses      11329    11329
This PR represents a complete rewrite of the calcHTranspose operator. The previous implementation stored only the H matrix and accessed it through several indirect addressing arrays for the calcHTranspose operation. That storage format was non-standard and, on large problems like the hurricane_4panel test case, actually consumed more memory than explicitly storing the H^t matrix. This PR changes the calcHTranspose operator so that the H^t matrix is explicitly stored in CSR format. This change reduces both memory usage for computationally large problems and execution time on the CPU. Because of the reduced memory usage, the hurricane_4panel configuration can be run on an 80 GB H100 GPU and the hurricane case can be run on a 40 GB A100 GPU, both without code changes. This version of the code has been tested on the following platforms and input configurations:
Derecho CPU: beltrami, supercell, hurricane, typhoonChanthu2020, hurricane_4panel
Derecho A100 GPU: beltrami, supercell, hurricane, typhoonChanthu2020
Casper H100 GPU: beltrami, supercell, hurricane, typhoonChanthu2020, hurricane_4panel
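For readers unfamiliar with the idea of explicitly storing H^t in CSR format, as the PR description puts it, the following is a minimal sketch under stated assumptions, not the code in this PR. The `Csr` struct and the `transpose` and `multiply` functions are hypothetical names; the 64-bit row offsets and 32-bit column indices mirror the mixed word sizes discussed in the review, but the actual array layout in CostFunction3D.h may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CSR container with mixed index widths.
struct Csr {
    std::size_t nRows = 0;
    std::size_t nCols = 0;
    std::vector<std::uint64_t> rowPtr; // nRows + 1 offsets, values up to nnz
    std::vector<std::uint32_t> col;    // nnz column indices, each below nCols
    std::vector<double> val;           // nnz nonzero values
};

// Build the transpose of `a` as its own CSR matrix (counting sort by column).
// Assumes a.nRows fits in 32 bits, since rows of `a` become columns of the result.
Csr transpose(const Csr& a) {
    Csr t;
    t.nRows = a.nCols;
    t.nCols = a.nRows;
    t.rowPtr.assign(t.nRows + 1, 0);
    t.col.resize(a.col.size());
    t.val.resize(a.val.size());

    // Count the nonzeros in each column of `a` (each row of the transpose).
    for (std::uint32_t c : a.col) {
        ++t.rowPtr[static_cast<std::size_t>(c) + 1];
    }
    // Prefix sum turns the counts into row offsets.
    for (std::size_t r = 0; r < t.nRows; ++r) {
        t.rowPtr[r + 1] += t.rowPtr[r];
    }
    // Scatter each entry of `a` into its slot in the transpose.
    std::vector<std::uint64_t> next(t.rowPtr.begin(), t.rowPtr.end() - 1);
    for (std::size_t r = 0; r < a.nRows; ++r) {
        for (std::uint64_t k = a.rowPtr[r]; k < a.rowPtr[r + 1]; ++k) {
            const std::uint64_t dst = next[a.col[k]]++;
            t.col[dst] = static_cast<std::uint32_t>(r);
            t.val[dst] = a.val[k];
        }
    }
    return t;
}

// y = A * x using direct row-wise CSR access (no indirect lookup tables).
std::vector<double> multiply(const Csr& a, const std::vector<double>& x) {
    std::vector<double> y(a.nRows, 0.0);
    for (std::size_t r = 0; r < a.nRows; ++r) {
        double sum = 0.0;
        for (std::uint64_t k = a.rowPtr[r]; k < a.rowPtr[r + 1]; ++k) {
            sum += a.val[k] * x[a.col[k]];
        }
        y[r] = sum;
    }
    return y;
}
```

In this sketch the transpose is built once and then reused, so applying H^t becomes a plain row-wise CSR product at the cost of storing a second copy of the nonzero values.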