Multi-node training with different numbers of GPUs on each node, error about world size #52
-
Hi all, I'm trying to train a model using two DGX nodes but with a different number of GPUs on each node. I'm using torchrun to launch the training. This is the shell script I'm running on each node. The only difference between the script on each node is $NPROC, which is 1 on one node and 2 on the other.
Error from the node with NPROC = 1
Error from the node with NPROC = 2, seems like each process prints an error
Long term I think we plan to move to SLURM to orchestrate these jobs, so info on how to run multi-node jobs there is appreciated, but right now we are manually launching on each node, and understanding how to do that would be very helpful. Thanks!
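For context on where the world-size error likely comes from, here is an illustrative sketch (not torchrun's actual source code): with a static launch, each node derives the world size from its own flags as `nnodes * nproc_per_node`, so nodes started with different `--nproc_per_node` values disagree both with each other and with the number of ranks that actually exist.

```python
# Illustrative only: how a static launch derives world size locally.
def static_world_size(nnodes: int, nproc_per_node: int) -> int:
    """World size as a node launched with these flags would expect it."""
    return nnodes * nproc_per_node

node_a = static_world_size(nnodes=2, nproc_per_node=1)  # node with 1 GPU
node_b = static_world_size(nnodes=2, nproc_per_node=2)  # node with 2 GPUs
actual_ranks = 1 + 2  # processes that actually join the job

# Three mutually inconsistent counts: 2, 4, and 3.
print(node_a, node_b, actual_ranks)
```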
Replies: 1 comment
-
I think you would need to change the repo and modify the PyTorch Lightning wrapper if you want to train like this. As far as I can tell, the way it is written assumes the same number of GPUs per node. PyTorch Lightning makes multi-GPU and multi-node training easy, and it is pretty flexible in terms of node topology, but this would need to be implemented manually. This repo works well for me on a multi-node setup using PBS, and there is some code suggesting it has been run on SLURM. Long story short: run on the single node with 2 GPUs to avoid extra work, or refactor the way this repo uses Lightning to support your uneven topology.
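If you do go the manual route, the core bookkeeping an uneven-topology launcher needs is a mapping from (node, local rank) to global rank, plus the true total rank count. The sketch below is hypothetical (`rank_table` and the `[1, 2]` GPU layout are made-up names for illustration), not code from this repo:

```python
# Hypothetical rank bookkeeping for uneven node sizes (illustration only).
def rank_table(gpus_per_node):
    """Map (node_index, local_rank) -> global rank for uneven node sizes."""
    table, rank = {}, 0
    for node, ngpus in enumerate(gpus_per_node):
        for local_rank in range(ngpus):
            table[(node, local_rank)] = rank
            rank += 1
    return table, rank  # rank now equals the true world size

table, world_size = rank_table([1, 2])  # node 0 has 1 GPU, node 1 has 2
print(world_size)      # 3
print(table[(1, 1)])   # 2 — second GPU on the second node is global rank 2
```

A manual launch would export the true total (here `WORLD_SIZE=3`) on every node and each process's entry from this table as `RANK`, rather than letting torchrun derive both from a uniform `nproc_per_node`.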