Multi-node training with different numbers of GPUs on each node, error about world size #52
-
Hi all, I'm trying to train a model using two DGX nodes but with a different number of GPUs on each node. I'm using torchrun to launch the training. This is the shell script I'm running on each node. The only difference between the script on each node is $NPROC, which is 1 on one node and 2 on the other.
Error from the node with NPROC = 1
Error from the node with NPROC = 2, seems like each process prints an error
Long term I think we plan to move to SLURM to orchestrate these jobs, so info on how to run multi-node jobs there is appreciated, but right now we are manually launching on each node, and understanding how to do that would be very helpful. Thanks!
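For context on where the world-size error likely comes from, here is an illustrative sketch (not torchrun's actual source code): with a static launch, each node derives the world size from its own flags as `nnodes * nproc_per_node`, so nodes started with different `--nproc_per_node` values disagree both with each other and with the number of ranks that actually exist.

```python
# Illustrative only: how a static launch derives world size locally.
def static_world_size(nnodes: int, nproc_per_node: int) -> int:
    """World size as a node launched with these flags would expect it."""
    return nnodes * nproc_per_node

node_a = static_world_size(nnodes=2, nproc_per_node=1)  # node with 1 GPU
node_b = static_world_size(nnodes=2, nproc_per_node=2)  # node with 2 GPUs
actual_ranks = 1 + 2  # processes that actually join the job

# Three mutually inconsistent counts: 2, 4, and 3.
print(node_a, node_b, actual_ranks)
```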
Replies: 1 comment
-
I think you would need to change the repo and modify the PyTorch Lightning wrapper if you want to train like this. As far as I can tell, the way it is written assumes the same number of GPUs per node. PyTorch Lightning makes multi-GPU and multi-node training easy, and it is pretty flexible in terms of node topology, but this would need to be implemented manually. This repo works well for me on a multi-node setup using PBS, and there is some code suggesting it has been run on SLURM. Long story short: run on the single node with 2 GPUs to avoid extra work, or refactor the way this repo uses Lightning to support your uneven topology.
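If you do go the manual route, the core bookkeeping an uneven-topology launcher needs is a mapping from (node, local rank) to global rank, plus the true total rank count. The sketch below is hypothetical (`rank_table` and the `[1, 2]` GPU layout are made-up names for illustration), not code from this repo:

```python
# Hypothetical rank bookkeeping for uneven node sizes (illustration only).
def rank_table(gpus_per_node):
    """Map (node_index, local_rank) -> global rank for uneven node sizes."""
    table, rank = {}, 0
    for node, ngpus in enumerate(gpus_per_node):
        for local_rank in range(ngpus):
            table[(node, local_rank)] = rank
            rank += 1
    return table, rank  # rank now equals the true world size

table, world_size = rank_table([1, 2])  # node 0 has 1 GPU, node 1 has 2
print(world_size)      # 3
print(table[(1, 1)])   # 2 — second GPU on the second node is global rank 2
```

A manual launch would export the true total (here `WORLD_SIZE=3`) on every node and each process's entry from this table as `RANK`, rather than letting torchrun derive both from a uniform `nproc_per_node`.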