[Core] Misleading documentation about num_cpus and physical resources #48867

Open
paul-twelvelabs opened this issue Nov 22, 2024 · 3 comments · May be fixed by #48871
Labels
docs: An issue or change related to documentation
triage: Needs triage (eg: priority, bug/not-bug, and owning component)

Comments

@paul-twelvelabs

paul-twelvelabs commented Nov 22, 2024

Description

The Physical Resources and Logical Resources section of the Ray docs states very explicitly:

Resource requirements of tasks or actors do NOT impose limits on actual physical resource usage. For example, Ray doesn’t prevent a num_cpus=1 task from launching multiple threads and using multiple physical CPUs.

While technically true, this section reads as if num_cpus is strictly for scheduling and has no implications for job performance. That is not the case, and this section of the docs contradicts the NOTE in Cluster Resources, which explicitly highlights the interaction between num_cpus and OMP_NUM_THREADS (and, by extension, torch.get_num_threads(), etc.).

Ray sets the environment variable OMP_NUM_THREADS=<num_cpus> if num_cpus is set on the task/actor

In practice, lowering OMP_NUM_THREADS can lead to a pretty meaningful degradation in job perf, especially for jobs that require torch and numpy.
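
A minimal sketch for checking this on your own cluster, assuming Ray is installed. The helper names `thread_env_snapshot` and `snapshot_in_ray_task` are hypothetical, introduced here only for illustration; `ray.remote`, `ray.get`, and `os.cpu_count` are the only library calls used.

```python
import os


def thread_env_snapshot(environ=None):
    """Return the thread-related settings visible to the current process."""
    env = os.environ if environ is None else environ
    return {
        # The variable the quoted docs say Ray derives from num_cpus.
        "OMP_NUM_THREADS": env.get("OMP_NUM_THREADS"),
        # CPUs the OS reports, independent of Ray's logical resources.
        "os_cpu_count": os.cpu_count(),
    }


def snapshot_in_ray_task(num_cpus):
    """Run thread_env_snapshot inside a Ray task that requests `num_cpus`."""
    import ray  # lazy import so the helper above also works without Ray

    remote_snapshot = ray.remote(num_cpus=num_cpus)(thread_env_snapshot)
    return ray.get(remote_snapshot.remote())
```

Calling `snapshot_in_ray_task(1)` and `snapshot_in_ray_task(4)` and comparing the two `OMP_NUM_THREADS` values against `os_cpu_count` should show whether the logical request is changing the thread count your numeric libraries will see.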

Of note, Physical Resources and Logical Resources is very high in the docs tree: it's under Ray -> User Guides. Cluster Resources is much lower under Developer Guides -> Configuring Ray. This adds to the confusion.

One suggestion would be to explicitly mention how num_cpus affects OMP_NUM_THREADS in the Physical Resources and Logical Resources section. Or, just link to Cluster Resources from there.

Link

Physical Resources and Logical Resources

Cluster Resources

paul-twelvelabs added the docs and triage labels Nov 22, 2024
@Superskyyy
Contributor

I guess adding a hyperlink to that section, flagged as an exception, could suffice for now, since this is the only exception we know of in the current implementation?

@paul-twelvelabs
Author

paul-twelvelabs commented Nov 22, 2024

That sounds reasonable, but please make it pronounced! (e.g. a NOTE callout, or something, akin to how it's mentioned in the Cluster Resources section as it's a very important exception).

FWIW, in practice, we'd set num_cpus=0.25 in the mistaken belief that doing so had no perf implications; this resulted in OMP_NUM_THREADS=1 and was ultimately the source of a 25-30% perf degradation. For use cases that require torch/numpy, which I'd imagine are numerous, not knowing about this can be fairly damning.
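
One possible mitigation for a case like this, sketched under assumptions: set the PyTorch intra-op thread count explicitly inside the task so it no longer depends on whatever OMP_NUM_THREADS the worker inherited. `resolve_thread_count` and `set_torch_threads` are hypothetical helper names, not part of Ray or PyTorch; `torch.set_num_threads` is the existing PyTorch call.

```python
import os


def resolve_thread_count(explicit=None, environ=None):
    """Pick a thread count; an explicit value wins over the environment."""
    if explicit is not None:
        if explicit < 1:
            raise ValueError("thread count must be >= 1")
        return explicit
    env = os.environ if environ is None else environ
    raw = env.get("OMP_NUM_THREADS", "")
    # Fall back to 1 when the variable is unset or not a positive integer.
    return int(raw) if raw.isdigit() and int(raw) > 0 else 1


def set_torch_threads(explicit=None):
    """Apply the resolved count to PyTorch's intra-op thread pool."""
    import torch  # lazy import: only needed where PyTorch is installed

    n = resolve_thread_count(explicit)
    torch.set_num_threads(n)
    return n
```

For example, calling `set_torch_threads(4)` at the top of a task declared with `num_cpus=0.25` would request four intra-op threads regardless of the environment. Note the trade-off: the scheduler still only accounts for 0.25 CPUs, so many such tasks on one node can oversubscribe the physical cores.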

@Superskyyy
Contributor

Let me do that.

@Superskyyy Superskyyy linked a pull request Nov 22, 2024 that will close this issue