distributed training of jediNET
The base command to run is
python3 TrainingDriver.py --loss categorical_crossentropy --epochs 1 --batch 200 --model examples/example_jedi_torch.py --mode gem --worker-optimizer adam --cache /imdata/
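Before launching, a quick pre-flight check can confirm that the model file and cache directory referenced by the command exist and that the main Python dependencies import. The snippet below is a minimal sketch; the exact set of modules the drivers need is an assumption, not taken from the repository.

```python
# preflight_check.py -- minimal sanity check before launching TrainingDriver.py
# (illustrative sketch; the paths are the ones used in the command above)
import importlib
import os
import sys

for path in ("examples/example_jedi_torch.py", "/imdata/"):
    print(f"{path}: {'OK' if os.path.exists(path) else 'MISSING'}")

# modules the drivers are likely to need (assumption)
for mod in ("mpi4py", "torch", "h5py"):
    try:
        importlib.import_module(mod)
        print(f"import {mod}: OK")
    except ImportError as err:
        print(f"import {mod}: FAILED ({err})")
        sys.exit(1)
```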
On the iBanks cluster, it can be run within a single node with
mpirun --prefix /opt/openmpi-3.1.0 -np 3 --tag-output singularity exec --nv -B /imdata/ -B /storage/ /storage/group/gpu/software/singularity/ibanks/edge.simg python3 TrainingDriver.py --loss categorical_crossentropy --epochs 1 --batch 200 --model examples/example_jedi_torch.py --mode gem --worker-optimizer adam --cache /imdata/
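To confirm that the GPUs are actually visible to each rank inside the container, a small check (an illustrative script, not part of the repository, here called gpu_check.py) can be run with the same mpirun/singularity prefix, replacing the python3 TrainingDriver.py ... part of the command:

```python
# gpu_check.py -- report which GPUs each MPI rank can see (illustrative sketch)
import socket

import torch
from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()
n_gpu = torch.cuda.device_count() if torch.cuda.is_available() else 0
names = [torch.cuda.get_device_name(i) for i in range(n_gpu)]
print(f"rank {rank} on {socket.gethostname()}: {n_gpu} GPU(s) {names}")
```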
To run the hyper-parameter optimization with 2 blocks, each made of 1 master and 2 workers, use
mpirun --prefix /opt/openmpi-3.1.0 -np 7 --tag-output singularity exec --nv -B /imdata/ -B /storage/ /storage/group/gpu/software/singularity/ibanks/edge.simg python3 OptimizationDriver.py --loss categorical_crossentropy --epochs 1000 --batch 200 --model examples/example_jedi_torch.py --early "val_loss,~<,4" --checkpoint jedi-3 --mode gem --worker-optimizer adam --cache /imdata/ --block-size 3
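With --block-size 3, each block takes 3 MPI ranks (1 master + 2 workers), so 2 blocks need 6 ranks; the remaining rank out of -np 7 presumably coordinates the hyper-parameter search. A tiny hypothetical helper (not part of the repository) makes the sizing arithmetic explicit:

```python
# n_procs.py -- illustrative helper for sizing -np (hypothetical)
def n_processes(n_blocks: int, block_size: int) -> int:
    """Total MPI processes: n_blocks blocks of block_size ranks each,
    plus one rank assumed to coordinate the hyper-parameter search."""
    return n_blocks * block_size + 1

# matches the -np 7 used in the command above
assert n_processes(n_blocks=2, block_size=3) == 7
```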
Nothing special is needed on a SLURM-based cluster; only the --cache option might be annoying.
get slurm
module load slurm
get a node
salloc --partition gpu --gres=gpu:v100-32gb:4 --cpus-per-task 5 --nodes 1 --ntasks-per-node 4 --ntasks=4
get software
module load gcc
module load openmpi2
module load python3
module load python3-mpi4py
module load lib/hdf5/1.8.21-openmpi2
module load cuda/10.1.243_418.87.00
module load nccl
run mpi
mpirun -np $SLURM_NTASKS --tag-output python3 OptimizationDriver.py --loss categorical_crossentropy --epochs 1000 --batch 200 --model examples/example_jedi_torch.py --early "val_loss,~<,4" --checkpoint jedi-3 --mode gem --worker-optimizer adam --block-size 3 --epoch 1 --opt-restore --num-iteration 20
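Before launching the full optimization, a short mpi4py check (an illustrative script, not part of the repository, here called mpi_check.py) run under the same mpirun -np $SLURM_NTASKS can confirm that every rank starts and that the MPI world size matches the allocation:

```python
# mpi_check.py -- confirm the MPI world matches the SLURM allocation (illustrative)
import os
import socket

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
print(f"rank {rank}/{size} on {socket.gethostname()}")

if rank == 0:
    print(f"SLURM_NTASKS={os.environ.get('SLURM_NTASKS')}, MPI world size={size}")
```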