Description
The Physical Resources and Logical Resources section of the Ray docs very explicitly states:
Resource requirements of tasks or actors do NOT impose limits on actual physical resource usage. For example, Ray doesn’t prevent a num_cpus=1 task from launching multiple threads and using multiple physical CPUs.
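To make the quoted point concrete, here is a minimal sketch (assuming ray and numpy are installed; the workload is purely illustrative) of a num_cpus=1 task that fans out to several threads without Ray objecting:

```python
import concurrent.futures

import numpy as np
import ray

ray.init()

@ray.remote(num_cpus=1)
def oversubscribed_task():
    # Each worker thread runs a numpy matmul, which releases the GIL, so this
    # task can genuinely occupy several physical cores despite reserving only
    # one logical CPU -- Ray does nothing to stop it.
    def work(_):
        a = np.random.rand(1000, 1000)
        return float((a @ a).sum())

    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        return sum(pool.map(work, range(8)))

print(ray.get(oversubscribed_task.remote()))
```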
While technically true, this section reads as if num_cpus is strictly for scheduling and has no implications for job performance. However, this is untrue, and this section of the docs contradicts the NOTE in Cluster Resources, which explicitly highlights the interaction of num_cpus and OMP_NUM_THREADS (and, by extension, torch.get_num_threads(), etc.):
Ray sets the environment variable OMP_NUM_THREADS=<num_cpus> if num_cpus is set on the task/actor
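To observe that behavior from inside a task, a sketch along these lines works (assuming ray and torch are installed; whether torch's thread pool reflects the variable can depend on when torch initializes in the worker, so the environment variable itself is the more reliable thing to check):

```python
import os

import ray
import torch

ray.init()

@ray.remote(num_cpus=0.25)
def report_threads():
    # Ray sets OMP_NUM_THREADS in the worker environment; torch's intra-op
    # thread count normally follows it when torch initializes after it is set.
    return {
        "OMP_NUM_THREADS": os.environ.get("OMP_NUM_THREADS"),
        "torch_num_threads": torch.get_num_threads(),
    }

print(ray.get(report_threads.remote()))
# With a fractional num_cpus such as 0.25, the value observed in practice (see
# the comment further down) is OMP_NUM_THREADS=1, i.e. single-threaded
# torch/numpy kernels inside the task.
```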
In practice, lowering OMP_NUM_THREADS can lead to a meaningful degradation in job performance, especially for jobs that rely heavily on torch and numpy.
Of note, Physical Resources and Logical Resources is very high in the docs tree: it's under Ray -> User Guides. Cluster Resources is much lower, under Developer Guides -> Configuring Ray. This adds to the confusion.
One suggestion would be to explicitly mention how num_cpus affects OMP_NUM_THREADS in the Physical Resources and Logical Resources section, or simply link to Cluster Resources from there.
That sounds reasonable, but please make it prominent (e.g. a NOTE callout, akin to how it's mentioned in the Cluster Resources section), as it's a very important exception.
FWIW, in practice, we set num_cpus=0.25 in the mistaken belief that doing so had no performance implications; this caused OMP_NUM_THREADS=1 and was ultimately the source of a 25-30% performance degradation. For use cases that rely on torch/numpy, which I'd imagine are numerous, not knowing about this can be fairly costly.
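For anyone hitting the same thing, here is a minimal sketch of one possible workaround: keep the fractional num_cpus for scheduling, but restore the thread count explicitly. Both runtime_env env_vars and torch.set_num_threads are standard APIs; the value 8 is purely illustrative, and whether an explicitly set OMP_NUM_THREADS takes precedence over Ray's own handling should be verified against the Ray version in use:

```python
import ray
import torch

ray.init()

# num_cpus stays fractional for scheduling; the env var override and the
# explicit torch call attempt to restore a sensible intra-op thread count.
@ray.remote(num_cpus=0.25, runtime_env={"env_vars": {"OMP_NUM_THREADS": "8"}})
def heavy_torch_task(x):
    torch.set_num_threads(8)  # illustrative value, not a recommendation
    return torch.mm(x, x).sum().item()

x = torch.rand(2000, 2000)
print(ray.get(heavy_torch_task.remote(x)))
```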
Link
Physical Resources and Logical Resources
Cluster Resources