
CINECA GUIDE

  1. Getting started
  2. Submit a job
  3. Additional info
  4. Conda and git
  5. Singularity
  6. SLURM

Getting started

For registration and account association follow:

https://wiki.u-gov.it/confluence/display/SCAIUS/UG2.1+Getting+started#expand-3Connectingtothecluster

Update (08/09/2023): If you already have an account on CINECA, note that the authentication procedure for logging into the cluster has recently changed:

  • follow this guide to activate 2FA (send an email to [email protected] to get the activation link)
  • follow this guide from point n.3: you will install smallstep to create a new certificate, valid for 12 hours, on your PC
	eval $(ssh-agent) # activate the ssh-agent
	step ssh login '<user-email>' --provisioner cineca-hpc #  obtain the certificate
	
  • Enter your cluster credentials (username and password) and click the "Sign in" button. Keycloak will then ask for the OTP code generated by the Authenticator app
  • Once authenticated, you will see a Success message in your browser, meaning that the certificate has been generated and is available on your PC.

IMPORTANT: the temporary certificate is valid for 12 hours. If you reboot your PC the certificate is lost and you need to obtain a new one by running the "step ssh login ..." command again.
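
Once 2FA and smallstep are set up, a typical daily login looks like the sketch below (login.g100.cineca.it is the Galileo100 login address and is an assumption here; adapt it to your cluster):

        eval $(ssh-agent)                                        # start the ssh-agent
        step ssh login '<user-email>' --provisioner cineca-hpc   # obtain the 12-hour certificate (browser + OTP)
        ssh <username>@login.g100.cineca.it                      # connect to the cluster using the certificate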

Commands and scripts to submit a job on the cluster (CINECA)

./train.sh <num_cpu> <max_walltime> (e.g. ./train.sh 12 24:00:00)

train.sh:

	#!/bin/bash
	# >>> Pulling repos
	...
	sbatch --job-name=job_example --cpus-per-task=${1} --time=${2} --output=./slurm_output/job_example.out --error=./slurm_output/job_example.out train.sbatch
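
Note: Slurm will not create the ./slurm_output directory referenced by --output/--error, so it is safer to create it in train.sh before calling sbatch (a small sketch matching the paths above):

        # create the directory for Slurm logs if it does not exist yet
        mkdir -p ./slurm_output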

train.sbatch:

	#!/bin/bash
	#SBATCH --partition=g100_usr_prod
	#SBATCH --mem=20000M
	#SBATCH --ntasks=1
	#SBATCH --mail-type=ALL
	#SBATCH --mail-user=<your email>

	# >>> IF YOU NEED TO USE A CONTAINER FOLLOW THE CODE BELOW (otherwise use your code here)<<<
	# Load the module
	module load singularity
	# Run the container
	singularity exec --hostname ${SLURM_SUBMIT_HOST}${SLURM_JOB_ID} ./container.sif bash ./container_train.sh

container_train.sh:

	#!/bin/bash
	# >>> Activate the conda environment
	...
	# >>> Execute code
	...
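
A minimal sketch of what container_train.sh could contain (the conda path, environment name, and train.py entry point are assumptions, not part of the original setup):

        #!/bin/bash
        # make `conda activate` available in a non-interactive shell
        source ~/miniconda3/etc/profile.d/conda.sh
        # >>> Activate the conda environment
        conda activate <env_name>
        # >>> Execute code
        python train.py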

CINECA: Additional info

https://wiki.u-gov.it/confluence/display/SCAIUS/UG3.3%3A+GALILEO100+UserGuide

CINECA allows the use of tmux as a terminal multiplexer: https://tmuxcheatsheet.com/

CINECA compute nodes are offline (no internet access) while a job is running. Therefore:

  • pull the repos before submitting the job (e.g. in train.sh)
  • to use a logger (e.g. wandb):
    • set wandb_mode to offline
    • to sync with the server, run inside the wandb folder: wandb sync --include-offline ./offline-*
    • script to synchronize wandb offline runs (assuming a group directory containing more than one run):
            #!/bin/bash

            # argument 1: group directory
            conda activate <env_name>

            RAND_ID=$(python3 -c "import wandb; print(wandb.util.generate_id());")

            echo "Syncing runs $1 to run new id $RAND_ID"

            # first, sync last series of logs to new id
            first_dir=$(ls -t $1 | head -1)
            wandb sync $1/$first_dir/wandb/$(ls -t $1/$first_dir/wandb/ | grep offline | head -1) --id $RAND_ID

            # then all the others + last again to sync hyper-params.
            for dir in $(ls $1)
            do
                run=$(ls $1/$dir/wandb/ | grep offline)
                echo $run
                wandb sync $1/$dir/wandb/$run --id $RAND_ID;
            done
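
    • assuming the script above is saved as sync_wandb.sh (a hypothetical name), run it from the login node with:
            bash sync_wandb.sh <group_directory>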
      

New setup for live tracking of experiments with wandb

  • the login node has to create a reverse proxy to the compute node, while the running job has to wait until this proxy is up before using wandb

  • add this at the beginning of your script (use any port you prefer):

     echo Waiting for the reverse proxy...
     while ! netstat -an | grep 34567 &> /dev/null; do sleep 1; done
     export HTTP_PROXY=socks5://127.0.0.1:34567
     export HTTPS_PROXY=socks5://127.0.0.1:34567
     export SOCK_PROXY=socks5://127.0.0.1:34567
     echo Reverse proxy is up and running!
    
  • this other script must be running for the entire duration of the job; it periodically checks which jobs are running and opens a new proxy for each of them (see the tmux sketch after this list for one way to keep it alive):

     #!/bin/bash

     INTERVAL=10

     while true; do
         # Get the list of nodes where the user has running jobs
         nodes=$(squeue -u $USER -h -t R -o "%N" | uniq)

         for node in $nodes; do
             # Check if a reverse proxy is already set up for this node
             # (n -eq 1 means only the grep itself matched, i.e. no proxy yet)
             n=$(ps -f -u $USER | grep -e "ssh.*$node" | wc -l)
             if [ $n -eq 1 ]; then
                 echo "Creating proxy for $node..."
                 ssh -oStrictHostKeyChecking=no -N -R 34567 -f $node
             fi
         done

         sleep $INTERVAL
     done
    
  • the ssh connection is kept in the background and is killed by CINECA when the job ends

  • this solution works for wandb, Hugging Face Hub, and any library/application that uses requests; it does NOT work for dataset downloads via torchvision
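
  • a simple way to keep the monitoring script alive on the login node is to run it inside a tmux session; a minimal sketch, assuming the script above has been saved as proxy_monitor.sh (hypothetical name):

     tmux new -s reverse-proxy     # open a persistent session on the login node
     bash ./proxy_monitor.sh       # start the monitoring loop from the script above
     # detach with Ctrl-b d; reattach later with: tmux attach -t reverse-proxy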

Installation of conda and git on the cluster

  • Install conda:
        mkdir -p ~/miniconda3
        wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
        bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
        rm -rf ~/miniconda3/miniconda.sh
        ~/miniconda3/bin/conda init bash
        ~/miniconda3/bin/conda init zsh
    
  • Create a conda environment:
        conda create -n <env_name> python=3.8
        conda activate <env_name>
    
  • If singularity is not installed:
        conda install -c conda-forge singularity
    
  • Clone git repositories:
        conda install gh --channel conda-forge
        gh auth login
        <gh token>
        git clone <repo>
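
  • (optional) if the cloned repository ships an environment.yml file, you can recreate its conda environment from it (a sketch; the file name and environment name are assumptions):
        conda env create -f environment.yml -n <env_name>
        conda activate <env_name>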
    

Singularity: Additional info

Clusters (e.g. CINECA HPC) usually do not allow the use of Docker for security reasons; however, it is possible to use Singularity as an alternative. Unlike Docker, Singularity creates a container as a directory inside the original host filesystem. Therefore, if you originally created the Docker container in the path /home/a/b/c, Singularity will virtually create the path /home/a/b/c inside the actual host filesystem. When you use Singularity for the first time, take note of these steps:

  • add to your ~/.bashrc file:
	  export SINGULARITY_CACHEDIR=/scratch/gpfs/$USER/SINGULARITY_CACHE
	  export SINGULARITY_TMPDIR=/tmp
  • pull the docker image <docker_path> and convert it into a singularity image <container>.sif (on your login node):
     module load singularity
     singularity pull <container>.sif docker://<docker_path>
  • NOTE: if you are not able to pull it from the cluster, you can copy a pre-existing .sif file into the cluster
  • test the singularity container using:
       singularity shell <container>.sif
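
  • to make cluster storage areas visible inside the container, bind-mount them with --bind; a minimal sketch, assuming the CINECA variables $WORK and $CINECA_SCRATCH and a train.py entry point:
       module load singularity
       singularity exec --bind $WORK,$CINECA_SCRATCH ./container.sif python train.py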
    

Useful links:

SLURM cheatsheet

  • submit a job sbatch <file_name>.sbatch
  • show all jobs squeue
  • show your jobs squeue -u <username>
  • show job info scontrol show job <job_id>
  • partition status sinfo
  • delete a job scancel <job_id>
  • run an interactive session on a node srun --nodes=1 --tasks-per-node=1 --pty /bin/bash
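
For example, a sketch of an interactive session with explicit resources, reusing the g100_usr_prod partition from the examples above (adjust the partition and values to your needs):

        srun --partition=g100_usr_prod --nodes=1 --ntasks-per-node=1 --cpus-per-task=4 --mem=8G --time=01:00:00 --pty /bin/bash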