ComputeCanada_Resources

Materials related to operating on Compute Canada clusters

English version of this document: click

Pro Tips

Once you are done with the simple introductory guide above and start working more with the CC servers, you will naturally run into numerous problems. Here are some useful things I have learned so far that I would like to share with you (I will update this section, as well as the FAQ, periodically):

  1. Get a short summary of the CPU and memory efficiency of a completed job with seff [1]:

    seff <jobID>
    
  2. Find more detailed information about a completed job with sacct and, optionally, control what it prints using --format [1]:

    $ sacct -j <jobID>
    $ sacct -j <jobID> --format=JobID,JobName,MaxRSS,Elapsed
    

    Use the MaxRSS accounting field to determine how much memory a job needed. The value returned will be the largest resident set size for any of the tasks. If you also want to know which task and node this occurred on, print the MaxRSSTask and MaxRSSNode fields. The sstat command works on a running job much the same way that sacct works on a completed job (see the sketch after this list).

  3. Want to transfer large amounts of data between clusters (e.g. 20 GiB from Graham to Beluga)?

    Use Globus (documentation: https://docs.computecanada.ca/wiki/Globus)! Blazing fast transfers, we are talking 10 GiB in seconds. Forget about opening two Cyberduck windows and dragging and dropping; that takes forever and your computer has to stay awake the whole time. If you prefer the command line, see the sketch after this list.

  4. Use partition-stats to check which nodes are currently available.
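
For tip 2, here is a minimal sketch of what checking memory usage might look like, with a made-up job ID of 12345678 (depending on how the job was launched, sstat may need a step suffix such as .batch):

    # Memory high-water mark of a job that is still running
    sstat -j 12345678 --format=JobID,MaxRSS,MaxRSSTask,MaxRSSNode

    # The same fields for a completed job, from the accounting database
    sacct -j 12345678 --format=JobID,JobName,Elapsed,MaxRSS,MaxRSSTask,MaxRSSNode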
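
For tip 3, if you would rather script a transfer than use the Globus web interface, the separately installed globus-cli tool offers roughly the workflow below. This is only a sketch: the endpoint UUIDs, paths, and search term are placeholders, and it assumes you have already authenticated with globus login:

    # Look up the endpoint UUIDs of the two clusters
    globus endpoint search "computecanada"

    # Recursively transfer a directory from Graham to Beluga
    globus transfer --recursive --label "graham-to-beluga" \
        <GRAHAM_ENDPOINT_UUID>:/project/<user>/data \
        <BELUGA_ENDPOINT_UUID>:/project/<user>/data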

FAQ

  1. Is the waiting time for your jobs on the CC servers really long? Possible reasons:

    • Higher demand on the cluster.
    • Your group has used more than its share of the cluster in the recent past.
    • Improperly specified or non-optimal job requests.

    N.B. Larger and longer jobs are more difficult to schedule, so breaking a job into smaller pieces with checkpointing becomes more important as your jobs grow.

    Once a job is running, you can try srun --jobid=<jobID> --pty bash to jump into the job and then run typical *nix commands such as top to see resource usage (see the sketch at the end of this FAQ).

  2. Parallel programming with the Python multiprocessing module:

    On a cluster, it is very important to use only the cores that are allocated to your job. Launching more processes than the cores you requested will slow down your calculation and possibly overload the compute node. Launching fewer processes than cores will result in wasted resources and idle cores. The correct number of cores to use in your code is determined by the amount of resources you requested from the scheduler [2].

    In your job script, try something like

    #!/bin/bash
    #SBATCH --cpus-per-task=4
    python cubes_parallel.py
    

    And in your Python script, try

    import multiprocessing as mp
    import os

    def worker_func(x):
        """Toy CPU-bound task: cube the input."""
        return x ** 3

    if __name__ == '__main__':
        # Use however many cores Slurm allocated to the job;
        # fall back to 1 when the script is run outside of a job.
        ncpus = int(os.environ.get('SLURM_CPUS_PER_TASK', default=1))
        data = [1, 2, 3, 4, 5, 6]
        with mp.Pool(processes=ncpus) as pool:
            cubes = pool.map(worker_func, data)
        print(cubes)
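
Going back to FAQ 1, attaching to a job you already have running and watching its resource usage might look like the sketch below (the job ID is made up):

    # Open an interactive shell inside the allocation of running job 12345678
    srun --jobid=12345678 --pty bash

    # ...then, inside that shell, watch your processes' CPU and memory usage
    top -u $USER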

Resources

  1. WestGrid scheduling training sessions and materials: link

References

[1] https://docs.computecanada.ca/wiki/Running_jobs#Completed_jobs
[2] https://docs.computecanada.ca/wiki/Python#Parallel_programming_with_the_Python_multiprocessing_module
