Use shared pool of CUDA streams instead of thread-local pools #1294
The idea is that the thread-local streams (three by default) may limit the concurrency of kernels within a worker thread. Using a single, larger shared pool of streams should allow more concurrency, especially when many independent kernels are submitted in quick succession from the same worker thread. At the same time, having e.g. three thread-local streams per thread on a Grace CPU is fake concurrency: CUDA will not actually allow a concurrency of 3×72 = 216 streams. Using a shared pool limits the number of created streams to something more manageable, and does not scale unnecessarily with the number of worker threads. Finally, using the same streams concurrently from different threads should not create thread-safety issues, since CUDA stream handles may be used from multiple host threads.
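For illustration, here is a minimal sketch of what a process-wide shared stream pool with round-robin handout could look like. This is not the implementation in this PR; the names (`SharedStreamPool`, `get_stream`) and the default pool size are hypothetical.

```cpp
// Hypothetical sketch of a process-wide shared CUDA stream pool.
// Not the actual implementation in this PR; names and sizes are illustrative.
#include <cuda_runtime.h>
#include <atomic>
#include <cstddef>
#include <vector>

class SharedStreamPool {
public:
  // Create a fixed number of streams once, shared by all worker threads,
  // instead of a small per-thread pool.
  explicit SharedStreamPool(std::size_t n_streams = 24) : streams_(n_streams) {
    for (auto& s : streams_) {
      cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
    }
  }

  ~SharedStreamPool() {
    for (auto& s : streams_) {
      cudaStreamDestroy(s);
    }
  }

  // Hand out streams round-robin; a cudaStream_t handle may be used
  // concurrently from multiple host threads.
  cudaStream_t get_stream() {
    const std::size_t i = next_.fetch_add(1, std::memory_order_relaxed);
    return streams_[i % streams_.size()];
  }

private:
  std::vector<cudaStream_t> streams_;
  std::atomic<std::size_t> next_{0};
};

// Usage from any worker thread:
//   static SharedStreamPool pool;            // one pool for the whole process
//   cudaStream_t s = pool.get_stream();      // pick a stream for this launch
//   my_kernel<<<grid, block, 0, s>>>(...);   // submit work asynchronously
```

With a scheme like this, the total number of streams is fixed by the pool size rather than growing with the number of worker threads, while independent kernels from the same thread can still land on different streams.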
This change has no noticeable performance impact on P100, MI250, or H100 GPUs.