In the documentation, all examples use `ntasks` or `n` to specify the number of CPUs needed per GPU. This generally works fine, but external tools (such as submitit) have a specific interpretation of `ntasks`, which can lead to issues. It might be better to explicitly use the `cpus-per-gpu` Slurm option in the examples to avoid such issues. Both options work identically in my tests requesting GPUs on the debug node and on wice.
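To make the suggestion concrete, here is a minimal sketch of a single-GPU job script that requests cores with `--cpus-per-gpu` instead of a task count. The partition name, the account name `lp_example`, and the figure of 18 cores per GPU are assumptions for illustration, not values taken from the VSC docs; adjust them to the target cluster.

```shell
#!/bin/bash -l
# Sketch only: partition, account and core count below are assumptions.
#SBATCH --clusters=wice
#SBATCH --partition=gpu
#SBATCH --account=lp_example
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
# One task, so a launcher that starts one process per task starts exactly one.
#SBATCH --ntasks=1
# Cores are tied to the GPU request instead of being expressed as tasks.
#SBATCH --cpus-per-gpu=18
#SBATCH --time=00:10:00

# Print what Slurm exported for this job, to compare against a --ntasks=18 request.
echo "tasks:        ${SLURM_NTASKS:-unset}"
echo "cpus per gpu: ${SLURM_CPUS_PER_GPU:-unset}"
echo "cpus on node: ${SLURM_CPUS_ON_NODE:-unset}"
```

With `--ntasks=18` and no `--cpus-per-gpu`, the same number of cores should be allocated, but the job then has 18 tasks rather than one, which is where a tool that maps tasks to processes can behave differently.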
As always, there are different ways to request the same resources (TRES and GRES) in Slurm using the supported directives. In the VSC docs, we have opted for the most obvious and generic ones, to cover the majority of use cases on our clusters.
A black-belt user knows best how to exploit the resource specifications using more fine-grained options, e.g. from the official sbatch documentation. So we leave such cases out of the official VSC docs, because the Slurm docs already cover them. Also, the audience for such specialized use cases is a minority.
We are not going to tune the VSC docs and our Slurm configuration because of one package's specific interpretation (submitit in this case). It is actually the other way around: third-party software that uses the underlying scheduler needs to align its interpretation of the Slurm job submission options/directives with the original ones as documented in the Slurm docs.
To me, `--cpus-per-gpu` is a useful option for multi-GPU jobs, when an advanced user wants to take full control over process distribution. For single-GPU jobs, it does not offer much added value. Take the following two-node GPU example: