How to use Memory Load Balance in neurodamus
When running neurodamus on a large circuit, you may hit out-of-memory (OOM) errors caused by the sheer size of the circuit and by an uneven distribution of gid memory during the simulation.
To mitigate these issues, neurodamus includes a memory balancing mechanism that uses the estimation data collected in dry-run mode to balance memory usage across all the nodes.
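To give a feel for what balancing by memory means, here is a minimal sketch of a greedy largest-first heuristic: each gid goes to whichever rank currently holds the least estimated memory. This is only an illustration of the general idea, not necessarily the algorithm neurodamus uses, and the function name `balance_by_memory` and the per-gid estimates are made up for the example.

```python
import heapq


def balance_by_memory(gid_memory, num_ranks):
    """Assign each gid to the currently lightest rank, largest gids first.

    Illustrative only: neurodamus may balance differently internally.
    """
    # Heap of (total_memory, rank); the lightest rank is always on top.
    heap = [(0.0, rank) for rank in range(num_ranks)]
    heapq.heapify(heap)
    allocation = {rank: [] for rank in range(num_ranks)}

    # Placing the largest gids first tends to keep the final totals close.
    by_size = sorted(gid_memory.items(), key=lambda kv: kv[1], reverse=True)
    for gid, mem in by_size:
        total, rank = heapq.heappop(heap)
        allocation[rank].append(gid)
        heapq.heappush(heap, (total + mem, rank))

    totals = {
        rank: sum(gid_memory[g] for g in gids)
        for rank, gids in allocation.items()
    }
    return allocation, totals


# Hypothetical per-gid memory estimates in MB (not real dry-run output).
estimates = {1: 50.0, 2: 30.0, 3: 20.0, 4: 20.0, 5: 10.0}
allocation, totals = balance_by_memory(estimates, num_ranks=2)
print(allocation)
print(totals)
```

With these made-up numbers the two ranks end up with 70 MB and 60 MB, whereas assigning gids round-robin in gid order would give 80 MB and 50 MB.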
Usage is simple: the whole workflow needs just two executions of neurodamus, first in dry-run mode and then in normal simulation mode.
Let's see how!
- Run neurodamus in dry-run mode:
  neurodamus (or special) ... --dry-run
  This runs the dry-run workflow, balances the memory and, at the end of the execution, creates an allocation_r*_c*.pkl.gz file. By default, the balancing targets the number of nodes/ranks that the dry run suggests. However, you can manually specify the number of ranks to distribute over with the --num-target-ranks option. For example, to distribute over 100 ranks, run neurodamus with:
  neurodamus (or special) ... --dry-run --num-target-ranks=100
- Run neurodamus in Memory Load Balance mode: once the allocation_r*_c*.pkl.gz file has been created, run your circuit with the options you would normally use, making sure to add the --lb-mode=Memory option to put neurodamus in Memory Load Balance mode. The allocation file is loaded automatically if it is in the directory you are running from:
  neurodamus (or special) ... --lb-mode=Memory
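The two runs above can be sketched as a small Python helper that assembles both command lines from whatever base command you normally use. Only the `--dry-run`, `--num-target-ranks` and `--lb-mode=Memory` options come from this page; the helper name `build_commands` and the `simulation_config.json` argument are hypothetical placeholders for the elided part of your own invocation (including any MPI launcher you use).

```python
import shlex


def build_commands(base_cmd, num_target_ranks=None):
    """Return (dry_run_cmd, simulation_cmd) as argument lists.

    base_cmd is your usual invocation, e.g. launcher + neurodamus + config.
    """
    dry_run = list(base_cmd) + ["--dry-run"]
    if num_target_ranks is not None:
        # Optional: override the rank count suggested by the dry run.
        dry_run.append(f"--num-target-ranks={num_target_ranks}")

    # Second run: same options as usual, plus Memory Load Balance mode.
    simulate = list(base_cmd) + ["--lb-mode=Memory"]
    return dry_run, simulate


# Hypothetical base command; replace with your real invocation.
base = ["neurodamus", "simulation_config.json"]
dry_run, simulate = build_commands(base, num_target_ranks=100)
print(shlex.join(dry_run))
print(shlex.join(simulate))
```

Both commands should be run from the same directory so that the second run can pick up the allocation file written by the first.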
Keep in mind that neurodamus automatically creates a new, correct allocation file if the number of ranks in the execution does not correspond to any allocation file present in the directory.
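If you want to check up front whether a matching allocation file is already present, a sketch along these lines may help. It assumes that the two numbers in allocation_r*_c*.pkl.gz are the rank count (after `r`) and the cycle count (after `c`); that reading of the file name is an assumption, not something stated on this page, and the helper names are hypothetical.

```python
import re
from pathlib import Path

# Assumed naming scheme: allocation_r<ranks>_c<cycles>.pkl.gz
ALLOC_RE = re.compile(r"^allocation_r(\d+)_c(\d+)\.pkl\.gz$")


def find_allocation_files(directory="."):
    """Map (ranks, cycles) -> path for every allocation file in directory."""
    found = {}
    for path in Path(directory).glob("allocation_r*_c*.pkl.gz"):
        match = ALLOC_RE.match(path.name)
        if match:
            key = (int(match.group(1)), int(match.group(2)))
            found[key] = path
    return found


def has_allocation_for(num_ranks, directory="."):
    """True if some allocation file was (assumedly) built for num_ranks."""
    return any(ranks == num_ranks for ranks, _ in find_allocation_files(directory))
```

For example, `has_allocation_for(100)` reports whether the current directory holds an allocation file whose name indicates 100 ranks under the assumed scheme.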
If this procedure still fails, it might be necessary to rebalance the rank assignment in post-processing. To do so, refer to the memory_load_balance.rst file in the docs directory of this repo.