
Suggestion to use cpus-per-gpu instead of ntasks in slurm docs #425

Open
WPoelman opened this issue Aug 9, 2024 · 1 comment

Comments


WPoelman commented Aug 9, 2024

In the documentation, all examples use --ntasks (or -n) to specify the number of CPUs needed per GPU. This generally works fine, but external tools (such as submitit) have their own specific interpretation of ntasks, which can lead to issues. It might be better to explicitly use the --cpus-per-gpu Slurm option in the examples to avoid this. In my tests, both options work identically when requesting GPUs on the debug node and on wice.
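For concreteness, the suggested change would look roughly like this in a job script (the account name, cluster name, CPU count, and train.py are placeholders, not values taken from the VSC docs):

```shell
#!/bin/bash
# Current doc style: the CPU count is expressed through the task count.
# Tools such as submitit read --ntasks as "launch this many processes",
# which is usually not what a single-GPU job intends.
##SBATCH --ntasks=18

# Suggested style: tie the CPU count to the GPU explicitly.
#SBATCH --account=<account>
#SBATCH --clusters=wice
#SBATCH --gpus-per-node=1
#SBATCH --cpus-per-gpu=18
#SBATCH --time=01:00:00

srun python train.py
```

With --cpus-per-gpu, the CPU allocation scales with the GPU request rather than with the process count, which sidesteps the ntasks interpretation entirely.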

@moravveji
Contributor

This is an interesting one. A few remarks:

  • As always, there are different ways to request the same resources (TRES and GRES) in Slurm using the supported directives. In the VSC docs, we have opted for the most obvious and generic ones, which cover the majority of the use cases on our clusters.
  • A black-belt user knows best how to exploit more fine-grained resource specifications, e.g. from the official sbatch documentation. We leave such cases out of the official VSC docs and point to the Slurm docs instead; the audience for such specialized use cases is also a minority.
  • We are not going to tune the VSC docs and our Slurm configuration around the specific interpretation of one package (submitit in this case). It is actually the other way around: third-party software that drives the underlying scheduler needs to align its interpretation of the Slurm job submission options/directives with the original ones as documented in the Slurm docs.
  • To me, --cpus-per-gpu is a useful option for multi-GPU jobs, when an advanced user wants to take full control over process distribution. For single-GPU jobs, it does not add much value. Take the following two-node GPU example:
    srun -A <account> -M genius --nodes=2 --ntasks=8 --cpus-per-gpu=1 --gpus-per-node=4 --pty bash -l
    Here you can immediately see how transparently --cpus-per-gpu maps CPUs to GPUs.

So, if you have another comment or question, please let us know. Otherwise, we can perhaps close this issue.
