
Adding a node to the Slurm cluster


Slurm Setup

Services

If slurm and munge are already installed, you might need to remove the users, groups and packages before moving forward (a cleanup sketch is shown below)
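
The exact cleanup depends on how Slurm and Munge were originally installed; a minimal sketch, assuming the apt packages and the users/groups created below:

sudo apt purge slurm-wlm slurm-client munge
sudo userdel -r slurm
sudo userdel -r munge
sudo groupdel slurm
sudo groupdel munge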

Set up Munge and Slurm users and groups:

export MUNGEUSER=991
sudo groupadd -g $MUNGEUSER munge
sudo useradd  -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u $MUNGEUSER -g munge -s /sbin/nologin munge
export SLURMUSER=992
sudo groupadd -g $SLURMUSER slurm
sudo useradd  -m -c "SLURM workload manager" -d /var/lib/slurm -u $SLURMUSER -g slurm -s /bin/bash slurm

Install Munge and Slurm:

sudo apt install slurm-wlm slurm-client munge

Add new node info to server's slurm.conf
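
A hedged example of the lines to add on the server, assuming a worker named worker1 with 16 CPUs, 64 GB of RAM and one GPU (slurmd -C on the worker prints its real values; if a partition already exists, just add the node to its Nodes= list):

NodeName=worker1 CPUs=16 RealMemory=64000 Gres=gpu:1 State=UNKNOWN
PartitionName=batch Nodes=ALL Default=YES MaxTime=INFINITE State=UP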

SSH connection

Connect server to worker through SSH

(Optional) Create a passphrase-less ssh key for Norlab purposes:

ssh-keygen -q -t rsa -b 4096 -N '' -f ~/.ssh/norlab

Share a passphrase-less ssh key:

ssh-copy-id -i ~/.ssh/norlab.pub user@ip

Config files

From server to worker (see the example commands below):

  • Copy /etc/munge/munge.key
  • Copy slurm.conf and cgroup.conf files from /etc/slurm-llnl
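
One possible way to transfer the files, assuming user@worker-ip for the worker (root is needed to read the MUNGE key on the server and to move the files into place on the worker):

# On the server
sudo cat /etc/munge/munge.key | ssh user@worker-ip 'cat > /tmp/munge.key'
scp /etc/slurm-llnl/slurm.conf /etc/slurm-llnl/cgroup.conf user@worker-ip:/tmp/

# On the worker
sudo mv /tmp/munge.key /etc/munge/munge.key
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 400 /etc/munge/munge.key
sudo mv /tmp/slurm.conf /tmp/cgroup.conf /etc/slurm-llnl/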

Add GPU info to /etc/slurm-llnl/gres.conf on worker
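
A minimal gres.conf example for a worker with a single NVIDIA GPU (the Type label is arbitrary and the device file may differ on your machine):

Name=gpu Type=rtx3090 File=/dev/nvidia0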

Enable and start correct services on worker:

sudo systemctl enable munge
sudo systemctl start munge
sudo systemctl enable slurmd
sudo systemctl start slurmd
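
Optionally, check that the MUNGE keys match between worker and server (replace server-hostname; the second command should report STATUS: Success):

munge -n | unmunge
munge -n | ssh server-hostname unmunge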

Restart slurmctld service on server:

sudo systemctl restart slurmctld
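
The new node should now show up in the cluster, for example with:

sinfo -N -l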

Slurm node state

To enable a node for computing on the cluster (kills jobs that are running):

sudo scontrol update nodename=<nodename> state=idle

To re-enable a node for computing on the cluster (keeps jobs running):

sudo scontrol update nodename=<nodename> state=resume

To disable access to a node on the cluster (lets jobs finish):

sudo scontrol update nodename=<nodename> state=drain reason=<reason>
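
To inspect the current state of the nodes and the reason a node is drained or down:

sinfo -R
scontrol show node <nodename>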

Slurm script setup

All IO data should be situated on the server

Each node needs directories consistent with the server's for Slurm deployment

Use rsync to synchronize files, as it's faster and more secure than scp

The slurm script should contain:

  • Initial data transfer for input data
  • Execution of code through a Docker container (with docker run)
  • Final data transfer for output data
    • The Docker container should have the option --rm, to avoid cluttering the worker node
    • Thus, volumes should be used to ensure IO data is external to the container
    • trap and docker wait commands can be used to wait until the end of the container's life, to subsequently transfer the output data back to the server node

Example slurm script

#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=16
#SBATCH --time=2-00:00
#SBATCH --job-name=ubt-m2f
#SBATCH --output=%x-%j.out

SERVER_USER=user-norlab
SERVER_IP=123.456.789.123
INPUT_DIR=dataset
OUTPUT_DIR=outputs

# Pull the input data from the server
rsync -vPr --rsh=ssh "$SERVER_USER@$SERVER_IP:$(pwd)/$INPUT_DIR" .

# --mount type=bind requires the source directory to exist
mkdir -p "$OUTPUT_DIR"

# Build the image and start the training container in the background
docker build -t image_name .
container_id=$(
    docker run --gpus all -e CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES --rm --ipc host --detach \
        --mount type=bind,source="$(pwd)",target=/app \
        --mount type=bind,source="$(pwd)/$INPUT_DIR",target="/app/$INPUT_DIR" \
        --mount type=bind,source="$(pwd)/$OUTPUT_DIR",target="/app/$OUTPUT_DIR" \
        --mount type=bind,source=/dev/shm,target=/dev/shm \
        image_name python train_net.py
)

# Send the output data back to the server
transfer_outputs() {
    rsync -vPr --rsh=ssh "$OUTPUT_DIR" "$SERVER_USER@$SERVER_IP:$(pwd)"
}

# Transfer the outputs when the job exits, once the container has finished
trap transfer_outputs EXIT
docker wait "$container_id"

Deploying work with Slurm

To target a specific GPU with Slurm, exclude the nodes lacking the resources you need

  • --nodelist will not work, as it implies that every listed node should be used

Example slurm deployment

sbatch --exclude=node1,node3 my-script.sh
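
Once submitted, the job can be followed from the server (the output file name comes from the --output=%x-%j.out pattern in the script):

squeue -u $USER
tail -f <job-name>-<jobid>.out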

References

http://web.archive.org/web/20221207035745/https://www.slothparadise.com/how-to-install-slurm-on-centos-7-cluster/

https://nekodaemon.com/2022/09/02/Slurm-Quick-Installation-for-Cluster-on-Ubuntu-20-04/

https://github.com/norlab-ulaval/mask_bev/blob/main/slurm_train.sh
