Setup
To use the ML GPU cluster, you must submit jobs to Slurm for scheduling, and the working environment must be consistent across the Slurm worker machines. NEC Labs provides NoNFS storage as the default environment:
- the user home directory is local
- ~/Backups is mounted and shared through the ITS NFS (faster)
- /zdata is mounted and shared through the ML department NFS (slow)
It is recommended to install conda under ~/Backups for better performance by specifying /home/ml[i]/user/Backups/miniconda3 as the install location.
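A minimal install sketch, assuming the standard Miniconda installer from repo.anaconda.com and that ~/Backups resolves to /home/ml[i]/user/Backups on your node:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p ~/Backups/miniconda3   # -b: batch mode, -p: install prefix
~/Backups/miniconda3/bin/conda init bash                            # adds activation code to the shell startup file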
If you use a conda virtual environment, make sure it is activated once you are logged in.
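For example, assuming a hypothetical environment named myenv, activation can be made automatic by adding one line to ~/.profile:
conda activate myenv   # myenv is a placeholder; replace with your environment name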
- Create or link files/directories to synchronize in ~/Backups if necessary:
cd ~/Backups
mv ~/.profile . # if present
mv ~/.bash_history . # if present
mv ~/.vimrc . # if present
mv ~/.condarc . # if present
mv ~/.gitconfig . # if present
mv ~/.python_history . # if present
mkdir .conda # if not created yet
mkdir tmp # if not created yet
ln -s /zdata/users/user/projects # link to huge storage on ML NFS
- Edit ~/Backups/.mydesktopln to ensure the corresponding symlinks are created automatically in the home directory (a manual equivalent is sketched after the list):
.profile
.bash_history
.vimrc
.conda
.condarc
.gitconfig
.python_history
miniconda3
projects
tmp
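If the automatic mechanism does not pick up an entry, the same links can be created by hand. A minimal sketch, assuming ~/Backups/.mydesktopln is a plain list of names, one per line:
# Symlink each listed entry from ~/Backups into the home directory,
# skipping names that already exist there.
while read -r name; do
    [ -e "$HOME/$name" ] || ln -s "$HOME/Backups/$name" "$HOME/$name"
done < ~/Backups/.mydesktopln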
A PyTorch program using GPU(s) must be submitted to Slurm for execution. When a job lands on a Slurm worker, its current directory is set to the directory from which the job was submitted, so the directory structure must be consistent across machines; otherwise the job fails immediately.
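A minimal submission sketch, assuming a hypothetical script train.py and one GPU per job (partition and resource names vary by cluster):
#!/bin/bash
#SBATCH --job-name=train          # name shown by squeue
#SBATCH --gres=gpu:1              # request one GPU
#SBATCH --output=slurm-%j.out     # log file; %j expands to the job ID
python train.py
Save this as, say, train.sbatch and submit it with sbatch train.sbatch from the project directory on the login node.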
However, if it is a Python job, the runtime environment may differ from that of local execution: the job's current working directory may not be part of sys.path. The following code segment prepends the current working directory:
#!/usr/bin/env python
import sys, os

# Prepend the job's working directory to the module search path.
sys.path.insert(0, os.getcwd())
...