Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use shared pool of CUDA streams instead of thread-local pools #1294

Merged
merged 1 commit into from
Nov 21, 2024

Conversation

msimberg
Copy link
Contributor

@msimberg msimberg commented Oct 28, 2024

The idea is that the thread-local streams (three by default) may limit concurrency of kernels within a worker thread. Using a single but larger shared pool of streams should lead to more possible concurrency, especially if many independent kernels are submitted quickly after each other from the same worker thread. At the same time, having e.g. three thread-local streams per thread on a Grace CPU is fake concurrency: CUDA will not actually allow allow a concurrency of 3×72=216. Using a shared pool limits the number of created streams to something a bit more manageable, and does not scale unnecessarily with the number of worker threads. Finally, using the same streams concurrently from different threads should not create thread-safety issues.

This change does not noticeably change the performance on P100, MI250, or H100 GPUs.

@msimberg msimberg self-assigned this Oct 28, 2024
Copy link

codacy-production bot commented Oct 28, 2024

Coverage summary from Codacy

See diff coverage on Codacy

Coverage variation Diff coverage
+0.00% (target: -1.00%) (target: 90.00%)
Coverage variation details
Coverable lines Covered lines Coverage
Common ancestor commit (a6a945b) 18222 13762 75.52%
Head commit (b0bc0cf) 18222 (+0) 13762 (+0) 75.52% (+0.00%)

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details
Coverable lines Covered lines Diff coverage
Pull request (#1294) 0 0 ∅ (not applicable)

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%

See your quality gate settings    Change summary preferences

Codacy stopped sending the deprecated coverage status on June 5th, 2024. Learn more

@msimberg
Copy link
Contributor Author

On old Piz Daint (with P100s) this change does not seem to have a big impact on performance, which is good. I still want to try on H100s before I mark this as ready for review.

@msimberg msimberg force-pushed the cuda-pool-shared-streams branch 5 times, most recently from d00f83e to 024634b Compare November 19, 2024 14:22
@msimberg
Copy link
Contributor Author

On LUMI (MI250) and Alps (GH200) using this with 32 normal and high priority streams in the pool performs very close to the old thread-local streams (three per thread).

@msimberg msimberg added this pull request to the merge queue Nov 21, 2024
Merged via the queue into pika-org:main with commit f528e6f Nov 21, 2024
36 checks passed
@msimberg msimberg deleted the cuda-pool-shared-streams branch November 21, 2024 21:10
@msimberg msimberg added this to the 0.31.0 milestone Nov 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

Successfully merging this pull request may close these issues.

1 participant