Exx Slurm Setup
This guide gives a quick overview of how to install Slurm and Docker on compute servers.
To manage jobs on the Exx server, we use Slurm Workload Manager. It schedules jobs and manages resources such as GPUs.
Here's a short introduction to Slurm by Bull.
To install Slurm, we follow the instructions from slothparadise. Note that you may need to change this line to a free UID/GID number:
MUNGEUSER=991
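Before creating the munge user, you can check that the chosen id is actually free; a small sketch using getent (991 is the value from the line above, adjust if it is taken):

```shell
# getent exits non-zero when no passwd/group entry exists for the id,
# so this reports whether 991 is already taken on this machine.
MUNGEUSER=991
if getent passwd "$MUNGEUSER" >/dev/null || getent group "$MUNGEUSER" >/dev/null; then
  echo "id $MUNGEUSER is in use, pick another free number"
else
  echo "id $MUNGEUSER is free"
fi
```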
Use the Slurm Configuration Tool to generate the configuration files. Here are the parameters we used:
Put the generated configurations into /etc/slurm.
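For GPUs to be schedulable as a generic resource, slurm.conf and gres.conf need matching GRES entries. A minimal sketch, assuming a single node named exx with two NVIDIA GPUs (the node name, device paths, and counts are placeholders; adjust them to your hardware):

```
# /etc/slurm/gres.conf (hypothetical device paths)
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1

# In /etc/slurm/slurm.conf (hypothetical node line)
GresTypes=gpu
NodeName=exx Gres=gpu:2 State=UNKNOWN
```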
Generate a cgroup configuration file from the example:
cp cgroup.conf.example cgroup.conf
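A cgroup.conf sketch that constrains jobs to their allocated cores, memory, and GPU devices (these are standard cgroup.conf options; enable only what your setup needs):

```
# /etc/slurm/cgroup.conf
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes    # needed for GPU isolation via gres.conf
```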
Configurations for Exx are stored here: willGuimont/exx_slurm_config
References for setting up Slurm accounting:
https://gist.github.com/DaisukeMiyamoto/d1dac9483ff0971d5d9f34000311d312
https://slurm.schedmd.com/accounting.html#mysql-configuration
https://slurm.schedmd.com/accounting.html#database-configuration
sacctmgr add user brian Account=norlab
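The accounting setup described in the links above typically boils down to registering the cluster and an account before adding users; a sketch (the cluster and organization names are placeholders for this guide's setup):

```shell
sacctmgr add cluster exx
sacctmgr add account norlab Description="NorLab account" Organization=norlab
sacctmgr add user <username> Account=norlab
```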
https://www.mail-archive.com/[email protected]/msg04744.html
Check NVML linkage with ldd.
Install Docker Engine and follow the post-installation steps for Linux to allow non-sudo users to use docker.
Start docker on boot:
sudo systemctl enable docker.service
sudo systemctl enable containerd.service
Install the NVIDIA Container Toolkit so containers can access the GPUs:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
sudo yum clean expire-cache
sudo yum install nvidia-container-toolkit -y
Verify that you can see GPUs from docker containers:
docker run --rm --gpus all -e NVIDIA_VISIBLE_DEVICES=all nvidia/cuda:11.0-base nvidia-smi
Add these lines to the crontab with crontab -e to prune unused Docker networks and stopped containers weekly (the -f flag skips the confirmation prompt so the jobs can run unattended):
0 9 * * 3 docker network prune -f
30 9 * * 3 docker container prune -f
useradd -c 'Full name' -m <username> -G docker
Add the following line to the new user's .bashrc:
PATH=$PATH:/usr/local/bin
Add the user to account manager:
sacctmgr add user <username> Account=norlab
Add an example job example_job.sh:
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --time=4-00:00
#SBATCH --job-name=ExampleJob
#SBATCH --output=%x-%j.out
docker run --rm bash -c "echo 'working from slurm'"
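Once example_job.sh is in place, it can be submitted and monitored with the standard Slurm commands:

```shell
sbatch example_job.sh   # queue the job; prints "Submitted batch job <id>"
squeue -u $USER         # list your pending and running jobs
scancel <jobid>         # cancel the job if needed
```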