n_gpus
Sikan Li committed Jun 28, 2024
1 parent 9a8ea12 commit 79cd9f4
Showing 3 changed files with 4 additions and 3 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -288,7 +288,7 @@ GNS can be trained in parallel on multiple nodes with multiple GPUs.
### Usage

```shell
-mpiexec.hydra -np $NNODES -ppn 1 ../slurm_scripts/launch_helper.sh $DOCKER_IMG_LOCATION
+mpiexec.hydra -np $NNODES -ppn 1 ../slurm_scripts/launch_helper.sh $DOCKER_IMG_LOCATION $n_gpu_per_node
```


2 changes: 1 addition & 1 deletion slurm_scripts/launch_helper.sh
@@ -15,7 +15,7 @@ fi


PRELOAD="/opt/apps/tacc-apptainer/1.1.8/bin/apptainer exec --nv $1 "
-CMD="torchrun --nproc_per_node 4 --nnodes $NNODES --node_rank=$LOCAL_RANK --master_addr=$MAIN_RANK --master_port=1234 train.py"
+CMD="torchrun --nproc_per_node $2 --nnodes $NNODES --node_rank=$LOCAL_RANK --master_addr=$MAIN_RANK --master_port=1234 train.py"

FULL_CMD="$PRELOAD $CMD"
echo "Training command: $FULL_CMD"
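
Note on the `$2` argument: with this change, `launch_helper.sh` passes its second positional argument straight to `--nproc_per_node`, so calling it without a GPU count would leave that flag without a value. The sketch below shows one way the argument could be checked before use. It is not part of the commit; `resolve_nproc` is a hypothetical helper, and falling back to 4 (the value hard-coded before this commit) is an assumption about the desired default.

```shell
#!/bin/sh
# Hypothetical helper, not part of the commit.
# Prints the per-node process count to hand to torchrun:
#   - no argument        -> 4 (the value hard-coded before this commit; assumed default)
#   - positive integer   -> that integer
#   - anything else      -> error message on stderr, non-zero exit status
resolve_nproc() {
  case "${1:-}" in
    '')
      echo 4
      ;;
    *[!0-9]*|0*)
      # Contains a non-digit, or is zero / has a leading zero.
      echo "error: GPU count must be a positive integer, got '$1'" >&2
      return 1
      ;;
    *)
      echo "$1"
      ;;
  esac
}

# Usage sketch: resolve the count from the script's second argument,
# then show the flag that would be passed to torchrun.
NPROC="$(resolve_nproc "${2:-}")"
echo "torchrun --nproc_per_node $NPROC"
```

Keeping the check in a small function leaves the `CMD=` line unchanged apart from substituting `$NPROC` for `$2`.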
3 changes: 2 additions & 1 deletion slurm_scripts/launch_train.sh
@@ -16,4 +16,5 @@ scontrol show hostnames > $NODEFILE
NNODES=$(< $NODEFILE wc -l)

CONTAINER=$1
-mpiexec.hydra -np $NNODES -ppn 1 ../slurm_scripts/launch_helper.sh $CONTAINER
+n_gpu_per_node=$2
+mpiexec.hydra -np $NNODES -ppn 1 ../slurm_scripts/launch_helper.sh $CONTAINER $n_gpu_per_node
