Multi-node training with different numbers of GPUs on each node, error about world size #52

Answered by fred-dev
Tristan-Kosciuch asked this question in Q&A

I think you would need to modify the PyTorch Lightning wrapping in this repo if you want to train like this. As far as I can tell, the way it is written assumes the same number of GPUs per node. PyTorch Lightning makes multi-GPU and multi-node training easy and is fairly flexible in terms of node topology, but an uneven topology needs to be implemented manually. This repo works well for me on a multi-node setup using PBS, and there is some code suggesting it has also been run on Slurm.

Long story short: run on the single node with 2 GPUs to avoid the extra work, or refactor the way this repo uses Lightning to support your uneven topology.
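For reference, here is a minimal sketch (not this repo's actual code) of the standard PyTorch Lightning multi-node setup. Lightning's built-in launcher derives the world size as `num_nodes * devices`, which is why the wrapper assumes the same number of GPUs on every node; the trainer arguments shown are the usual Lightning ones, and the specific values are placeholders.

```python
# Minimal sketch, assuming a typical PyTorch Lightning DDP setup.
# Lightning computes world_size = num_nodes * devices, so every node
# is expected to expose the same number of GPUs.
import pytorch_lightning as pl

def build_trainer() -> pl.Trainer:
    return pl.Trainer(
        accelerator="gpu",
        devices=2,        # GPUs per node -- must be identical on every node
        num_nodes=2,      # world_size becomes num_nodes * devices = 4
        strategy="ddp",   # distributed data parallel across all ranks
        max_epochs=10,    # placeholder value
    )

# With an uneven topology (e.g. 2 GPUs on one node, 1 on another), the
# computed world size no longer matches the number of processes actually
# launched, which is the kind of world-size error reported here. Supporting
# that would mean setting up the process group / cluster environment by hand
# rather than relying on Lightning's defaults.
```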

Answer selected by Tristan-Kosciuch