Farley Lai edited this page Apr 30, 2021 · 1 revision

To use the ML GPU cluster, jobs must be submitted to Slurm for scheduling. Because a job may run on any Slurm worker, the working environment must be consistent across worker machines. NEC Labs provides NoNFS storage as the default environment:

  • the user home directory is local
  • ~/Backups is mounted and shared through the ITS NFS (faster)
  • /zdata is mounted and shared through the ML department NFS (slower)

Conda Installation

It is recommended to install conda under ~/Backups for better performance by specifying /home/ml[i]/user/Backups/miniconda3 as the install location. After logging in, make sure the conda virtual environment, if one is used, is activated.
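
As a quick way to confirm that an environment is active after logging in, the following minimal sketch inspects the CONDA_PREFIX and CONDA_DEFAULT_ENV variables that `conda activate` sets. The helper name `active_conda_env` is hypothetical and only for illustration.

```python
import os
from typing import Mapping, Optional


def active_conda_env(environ: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the name of the active conda environment, or None if none is active.

    Hypothetical helper: it relies only on the CONDA_PREFIX and
    CONDA_DEFAULT_ENV variables exported by `conda activate`.
    """
    prefix = environ.get("CONDA_PREFIX")
    if not prefix:
        return None
    # Fall back to the last path component when the name variable is absent.
    return environ.get("CONDA_DEFAULT_ENV") or os.path.basename(prefix)


if __name__ == "__main__":
    env = active_conda_env()
    print(f"active conda environment: {env}" if env else "no conda environment is active")
```

Running this at the top of a job script makes it obvious in the job log when the job started without the intended environment.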

NoNFS Setup

  1. In ~/Backups, create or link the files and directories to synchronize, as needed:
cd ~/Backups
mv ~/.profile .                     # if present
mv ~/.bash_history .                # if present
mv ~/.vimrc .                       # if present
mv ~/.condarc .                     # if present
mv ~/.gitconfig .                   # if present
mv ~/.python_history .              # if present
mkdir .conda                        # if not created yet
mkdir tmp                           # if not created yet
ln -s /zdata/users/user/projects    # link to huge storage on ML NFS
  2. Edit ~/Backups/.mydesktopln to ensure corresponding symlinks are created automatically in the home directory:
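
The contents of .mydesktopln are not reproduced here. Purely as an illustration of the intended effect (entries kept in ~/Backups reappearing as symlinks in the home directory), a minimal sketch might look like the following. The function `link_into_home` and its arguments are hypothetical and are not the mechanism that .mydesktopln itself uses.

```python
import os
from typing import List, Sequence


def link_into_home(backup_dir: str, home_dir: str, names: Sequence[str]) -> List[str]:
    """Symlink each named entry of backup_dir into home_dir.

    Hypothetical illustration only. Entries that are missing from
    backup_dir, or that already exist in home_dir, are skipped.
    Returns the names that were actually linked.
    """
    linked = []
    for name in names:
        source = os.path.join(backup_dir, name)
        target = os.path.join(home_dir, name)
        if not os.path.lexists(source) or os.path.lexists(target):
            continue
        os.symlink(source, target)
        linked.append(name)
    return linked
```

For example, `link_into_home(os.path.expanduser("~/Backups"), os.path.expanduser("~"), [".vimrc", ".gitconfig"])` would link only those of the two files that exist in ~/Backups and are not already present in the home directory.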

PyTorch Job Submission

A PyTorch program using GPU(s) must be submitted to Slurm for execution. When a job lands on a Slurm worker, its working directory is set to the directory from which it was submitted. The directory structure must therefore be consistent across machines; otherwise, the job fails immediately.
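
To make such a failure easier to diagnose, a job could check for the paths it depends on before doing any real work. The sketch below is one possible approach, not part of the cluster setup: `require_paths` is a hypothetical helper, and the path names in the usage comment are placeholders. SLURM_SUBMIT_DIR is the environment variable in which Slurm reports the submission directory.

```python
import os
from typing import Iterable, List


def missing_paths(required: Iterable[str], base: str) -> List[str]:
    """Return the entries of `required` that do not exist under `base`."""
    return [p for p in required if not os.path.exists(os.path.join(base, p))]


def require_paths(required: Iterable[str], base: str) -> None:
    """Raise FileNotFoundError naming every required path missing under `base`."""
    missing = missing_paths(required, base)
    if missing:
        raise FileNotFoundError(f"{base}: missing on this machine: {', '.join(missing)}")


# Possible usage at the top of a job script (path names are placeholders):
#   base = os.environ.get("SLURM_SUBMIT_DIR", os.getcwd())
#   require_paths(["data", "configs"], base)
```

The error then names the missing paths on the worker instead of surfacing later as an unrelated import or file error.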

For a Python job, the runtime environment may also differ from local execution: the current working directory of the job may not be part of sys.path. The following code segment prepends the current working directory:

#!/usr/bin/env python

import sys, os

sys.path.insert(0, os.getcwd())  # prepend the current working directory
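
The same idea can be wrapped in a small helper that is safe to call more than once. This is a sketch under the assumption that the directory should appear exactly once, at the front of sys.path. The name `prepend_cwd_to_sys_path` is hypothetical.

```python
import os
import sys


def prepend_cwd_to_sys_path() -> bool:
    """Put the current working directory first on sys.path.

    Returns True if sys.path was modified, False if the directory
    was already the first entry.
    """
    cwd = os.getcwd()
    if sys.path and sys.path[0] == cwd:
        return False
    # Remove later occurrences so the directory is listed exactly once.
    sys.path[:] = [p for p in sys.path if p != cwd]
    sys.path.insert(0, cwd)
    return True
```

Calling it before any project-local imports lets modules in the submission directory be found on the worker just as they are when running locally from that directory.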