Takeaways from LUMI Pilot program #421
Thanks for sharing this info.
Ad 1: Can you share some performance data, e.g. how it compares to NVIDIA?
Ad 2: I think for now compiling R from source is the only option. Thankfully I haven't had problems compiling R on any architecture yet.
Ad 3: It's interesting. Setting
Ad 4: I remember you were doing some ADIOS integration. How does ADIOS relate to, for example, Catalyst? And how does it relate to HDF5? Feel free to start another issue on this subject, as I think it doesn't relate much to HIP/LUMI.
cat run.slurm
/users/XXX/select_gpu.sh:
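For context, GPU-binding wrappers on LUMI are typically a one-liner that maps each local MPI rank to one GCD of the MI250X via `ROCR_VISIBLE_DEVICES`. A minimal sketch of such a wrapper follows; the actual contents of `/users/XXX/select_gpu.sh` and the `srun` flags may differ.

```shell
#!/bin/bash
# Minimal GPU-binding wrapper of the kind commonly used on LUMI:
# each local rank is restricted to a single GCD, so the application
# sees exactly one device.
export ROCR_VISIBLE_DEVICES=$SLURM_LOCALID
exec "$@"
```

It would then be invoked from the batch script as something like `srun ./select_gpu.sh ./my_app`, with one task per GCD (e.g. `--ntasks-per-node=8 --gpus-per-node=8`; values are illustrative).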
Some side notes I forgot: those should be turned off by default (or at configure time?). It's thousands of files/lines for big runs:
And maybe we could support data dumps in HDF5 format instead of lots of .pri files? Or default to DUMPNAME/slice.pri. Again, it's lots of files per folder, and it gets complicated to handle when doing restarted simulations; most of the time you have ~2 days' worth. There is also a numbering issue, at least in the .pri files: we assumed that two digits for the rank (up to 99 GPUs) would be enough, and the leading zero is messy to handle in batch scripts ;)
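The leading-zero annoyance mentioned above is a classic shell pitfall: bash arithmetic parses a number with a leading zero as octal, so `08` is a syntax error. A small sketch of the problem and the usual `10#` workaround (the file-name pattern here is illustrative, not necessarily TCLB's exact one):

```shell
# Leading-zero pitfall in batch scripts (file-name pattern is
# illustrative, not TCLB's exact one):
rank=08
# $(( rank + 1 )) would error out: bash parses "08" as invalid octal.
next=$(( 10#$rank + 1 ))            # force base-10: 10#08 == 8
printf 'DUMPNAME_P%02d.pri\n' "$next"   # prints DUMPNAME_P09.pri
```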
Ad printouts and XML: I agree it's cumbersome with many threads. There are a couple of prints to be cleaned up, and the XML files can be fixed to export just a single one.
Ad dumps: the .pri files were just the simplest (and fastest) way of doing things. We can switch to HDF5, but I think it should be an option, as I don't want HDF5 to be a mandatory dependency. As for the file mess, we can come up with some good way to arrange them. You can use the
Ad numbering: it's true that it's designed to have two digits for the rank, but I don't know if changing it now is a good idea. I agree it's a bit messy, but you can use:
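The reply above is cut off after "you can use:". One common way to keep per-rank files in numeric order even if the rank field ever overflows two digits is GNU version sort; this is a hedged guess at the kind of thing meant, not necessarily the originally suggested command, and the file names are made up for illustration:

```shell
# Hypothetical example: list per-rank dump files in numeric order
# even when the rank field grows past two digits.
touch out_P98.pri out_P99.pri out_P100.pri
ls out_P*.pri | sort -V     # -V: version sort, so 98 < 99 < 100
```

A plain lexicographic `sort` would put `out_P100.pri` before `out_P98.pri`; `sort -V` compares the embedded numbers as numbers.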
As for performance, @mdzik, could you run some tests to compare double-precision performance between NVIDIA and AMD? The "theory" is that AMD cards are designed for double precision.
AFAIK I was mistaken - the binary was double precision
TCLB was part of LUMI (https://lumi-supercomputer.eu/lumis-second-pilot-phase-in-full-swing/) Pilot Program which is now ending.
Apart from performance results, there are some issues that might be worth consideration. LUMI is a brand new CRAY/HPE computer with AMD Instinct MI250X 128GB HBM2e cards.
As for results, I ran a 0.8e9-node lattice dissolution simulation for AGU :D That is around half of the 12 cm experimental core at 30 µm resolution.
I still have a few days left; if you want to check something, we could do it.