WPI maintains an HPC cluster for teaching and research purposes, known as Turing. It has significant capabilities, but is a shared resource and is still limited in what it can accommodate.
This README contains some essential information to help you get started! Please feel free to give feedback on these instructions by creating an issue on this repo.
As a general overview: Turing is a Linux-based system, running Ubuntu 20.04, with jobs managed by the SLURM scheduler. In general, you will use SSH to connect to turing.wpi.edu, which will connect you to one of the login nodes. The login nodes do not have GPUs or the power to do 'real work'; instead, they are used for managing jobs that run on the cluster's many compute nodes. From there, you can work with SLURM, giving it jobs to be run on the hardware with GPUs. Note that all GPU access is mediated by SLURM -- you can't use any GPU time unless SLURM allocates it to you, but that also guarantees that nobody else will attempt to use a GPU while your code is using it.
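For example, a first session might look like the sketch below. Replace `username` with your WPI username; the partition name in the `srun` line is an assumption, so check the output of `sinfo` for the partitions actually available on Turing.

```bash
# Connect to a login node (use your WPI credentials)
ssh username@turing.wpi.edu

# See which partitions and nodes are available
sinfo

# Check your own queued and running jobs
squeue -u $USER

# Request a short interactive session with one GPU
# (partition name and time limit are illustrative -- check `sinfo`)
srun --partition=short --gres=gpu:1 --time=01:00:00 --pty bash
```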
For most normal work, you will write a short shell script that describes the resources required to run the job, loads the software modules (primarily the CUDA toolkit and a Python environment), and then runs your code. The typical workflow is:
- SSH to the cluster
- Load the required modules, such as Python and CUDA
- Source the pip venv that contains the required libraries. Alternatively, you can use Conda to install and manage packages. (A one-time venv setup sketch follows this list.)
- Submit the job to SLURM via a shell script (an example batch script is shown below)
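If you have not created a venv yet, a one-time setup on a login node might look like this sketch. The module name, venv path, and packages are placeholders for illustration:

```bash
# One-time setup: create a venv and install your dependencies
module load python                 # module name is illustrative -- run `module avail`
python -m venv ~/venvs/myproject   # path is a placeholder
source ~/venvs/myproject/bin/activate
pip install --upgrade pip
pip install torch numpy            # replace with your project's actual requirements
```

Putting the steps together, a minimal batch script might look like the following sketch. The partition name, module names, venv path, and script name (`train.py`) are assumptions -- use `sinfo` and `module avail` to find the real values on Turing:

```bash
#!/bin/bash
#SBATCH --job-name=my_job        # name shown in squeue
#SBATCH --partition=short        # placeholder -- check `sinfo` for real partitions
#SBATCH --gres=gpu:1             # request one GPU (all GPU access goes through SLURM)
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=01:00:00          # walltime limit (HH:MM:SS)
#SBATCH --output=slurm-%j.out    # stdout/stderr; %j expands to the job ID

# Load the software modules (names are illustrative -- run `module avail` to list them)
module load cuda
module load python

# Activate the pip venv with your project's libraries (path is a placeholder)
source ~/venvs/myproject/bin/activate

# Run your code on the allocated compute node
python train.py
```

Submit the script with `sbatch job.sh` and monitor it with `squeue -u $USER`; when the job finishes, its output will be in the file given by `--output`.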
Here is the video tutorial from Ramana walking through the same workflow. Accompanying files for the video tutorial are hosted here.
Acknowledgments: Thanks to Ramana and James for providing this content.