
# distributed training of jediNET

vlimant edited this page Oct 11, 2019 · 9 revisions

## command

The base command to run is:

```
python3 TrainingDriver.py --loss categorical_crossentropy --epochs 1 --batch 200 --model examples/example_jedi_torch.py --mode gem --worker-optimizer adam --cache /imdata/
```

On the iBanks cluster, it can be run within a single node (1 master + 2 workers) with:

```
mpirun --prefix /opt/openmpi-3.1.0 -np 3 --tag-output singularity exec --nv -B /imdata/ -B /storage/ /storage/group/gpu/software/singularity/ibanks/edge.simg python3 TrainingDriver.py --loss categorical_crossentropy --epochs 1 --batch 200 --model examples/example_jedi_torch.py --mode gem --worker-optimizer adam --cache /imdata/
```

To run the hyper-parameter optimization with 2 blocks, each consisting of 1 master and 2 workers:

```
mpirun --prefix /opt/openmpi-3.1.0 -np 7 --tag-output singularity exec --nv -B /imdata/ -B /storage/ /storage/group/gpu/software/singularity/ibanks/edge.simg python3 OptimizationDriver.py --loss categorical_crossentropy --epochs 1000 --batch 200 --model examples/example_jedi_torch.py --early "val_loss,~<,4" --checkpoint jedi-3 --mode gem --worker-optimizer adam --cache /imdata/ --block-size 3
```
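The rank counts for the optimization runs follow a simple pattern: with `--block-size 3`, each block occupies 3 MPI processes (1 block master + 2 workers), and one extra top-level process appears to coordinate the blocks (inferred from `-np 7` for 2 blocks above). A minimal sketch of that arithmetic; the helper name `total_ranks` is hypothetical, not part of the tool:

```python
def total_ranks(n_blocks, block_size):
    """MPI ranks needed for an OptimizationDriver run:
    n_blocks blocks of block_size processes each, plus one
    top-level coordinator (assumption inferred from -np 7 above)."""
    return n_blocks * block_size + 1

print(total_ranks(2, 3))  # → 7, matching -np 7 in the command above
```

Note this applies to `OptimizationDriver.py`; the single-node `TrainingDriver.py` run uses `-np 3` with no block coordinator.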

## setup

### iBanks

Nothing special is required; only the `--cache` option (which points at `/imdata/`) might be annoying.
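If `/imdata/` is not available on a given machine, one workaround is to point `--cache` at a directory you create yourself. A minimal sketch; the `CACHE_DIR` variable and the `/tmp/nnlo_cache` default are illustrative, not part of the tool:

```shell
# Pick a cache directory; /tmp/nnlo_cache is an arbitrary illustrative default
CACHE_DIR=${CACHE_DIR:-/tmp/nnlo_cache}
mkdir -p "$CACHE_DIR"
# Confirm it is writable before passing it via --cache
[ -w "$CACHE_DIR" ] && echo "cache ok: $CACHE_DIR"
```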

### Flat-Iron Cluster

#### get slurm

```
module load slurm
```

#### get a node

```
salloc --partition gpu --gres=gpu:v100-32gb:4 --cpus-per-task 5 --nodes 1 --ntasks-per-node 4 --ntasks=4
```
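For non-interactive runs, the same resource request can be expressed as a batch script submitted with `sbatch`. This is a sketch only: the job name and the placeholder body are assumptions, not from the original page.

```shell
#!/bin/bash
# Batch-mode equivalent of the salloc request above (sketch)
#SBATCH --job-name jedinet-train
#SBATCH --partition gpu
#SBATCH --gres=gpu:v100-32gb:4
#SBATCH --cpus-per-task 5
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 4
#SBATCH --ntasks=4

# The "get software" module loads and the mpirun command below would go here
```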

#### get software

```
module load gcc
module load openmpi2
module load python3
module load python3-mpi4py
module load lib/hdf5/1.8.21-openmpi2
module load cuda/10.1.243_418.87.00
module load nccl
```

#### run mpi

```
mpirun -np $SLURM_NTASKS --tag-output python3 OptimizationDriver.py --loss categorical_crossentropy --epochs 1000 --batch 200 --model examples/example_jedi_torch.py --early "val_loss,~<,4" --checkpoint jedi-3 --mode gem --worker-optimizer adam --block-size 3 --epoch 1 --opt-restore --num-iteration 20
```
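`$SLURM_NTASKS` is exported by Slurm inside the allocation (here 4, from the `salloc` request above), so `mpirun` launches one rank per requested task. The same lookup from Python, e.g. in a launcher script, can be sketched as follows; the fallback to 1 for running outside a Slurm allocation is an assumption:

```python
import os

# Slurm exports SLURM_NTASKS inside an allocation; default to 1 outside one
ntasks = int(os.environ.get("SLURM_NTASKS", "1"))
print(f"launching {ntasks} MPI ranks")
```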