Setup
To use the ML GPU cluster, you must submit jobs to Slurm for scheduling, and the working environment must be consistent across the Slurm worker machines. NEC Labs provides NoNFS storage as the default environment:
- the user home directory is local
- ~/Backups is mounted and shared through the ITS NFS (faster)
- /zdata is mounted and shared through the ML department NFS (slow)
It is recommended to install conda under ~/Backups for better performance by specifying /home/ml[i]/user/Backups/miniconda3 as the install location.
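A minimal install sketch, assuming the standard Miniconda installer from repo.anaconda.com and that ~/Backups resolves to /home/ml[i]/user/Backups on your node:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p ~/Backups/miniconda3   # -b: batch mode, -p: install prefix
~/Backups/miniconda3/bin/conda init bash                            # adds activation code to the shell startup file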
If you use a conda virtual environment, make sure it is activated once you are logged in.
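For example, assuming a hypothetical environment named myenv, activation can be made automatic by adding one line to ~/.profile:
conda activate myenv   # myenv is a placeholder; replace with your environment name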
- Create or link files/directories to synchronize in ~/Backups if necessary:
cd ~/Backups
mv ~/.profile . # if present
mv ~/.bash_history . # if present
mv ~/.vimrc . # if present
mv ~/.condarc . # if present
mv ~/.gitconfig . # if present
mv ~/.python_history . # if present
mkdir .conda # if not created yet
mkdir tmp # if not created yet
ln -s /zdata/users/user/projects # link to huge storage on ML NFS
- Edit ~/Backups/.mydesktopln to ensure the corresponding symlinks are created automatically in the home directory (a manual equivalent is sketched after the list):
.profile
.bash_history
.vimrc
.conda
.condarc
.gitconfig
.python_history
miniconda3
projects
tmp
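If the automatic mechanism does not pick up an entry, the same links can be created by hand. A minimal sketch, assuming ~/Backups/.mydesktopln is a plain list of names, one per line:
# Symlink each listed entry from ~/Backups into the home directory,
# skipping names that already exist there.
while read -r name; do
    [ -e "$HOME/$name" ] || ln -s "$HOME/Backups/$name" "$HOME/$name"
done < ~/Backups/.mydesktopln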
A PyTorch program using GPU(s) must be submitted to Slurm for execution. When a job lands on a Slurm worker, its current directory is set to the directory from which the job was submitted, so the directory structure must be consistent across machines; otherwise the job fails immediately.
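A minimal submission sketch, assuming a hypothetical script train.py and one GPU per job (partition and resource names vary by cluster):
#!/bin/bash
#SBATCH --job-name=train          # name shown by squeue
#SBATCH --gres=gpu:1              # request one GPU
#SBATCH --output=slurm-%j.out     # log file; %j expands to the job ID
python train.py
Save this as, say, train.sbatch and submit it with sbatch train.sbatch from the project directory on the login node.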
However, if it is a Python job, the runtime environment may differ from that of local execution: the job's current working directory may not be part of sys.path. The following code segment prepends the current working directory:
#!/usr/bin/env python
import sys, os

# Prepend the job's working directory to the module search path.
sys.path.insert(0, os.getcwd())
...