Use shared pool of CUDA streams instead of thread-local pools #1294
The idea is that the thread-local streams (three by default) may limit the concurrency of kernels within a worker thread. Using a single, larger shared pool of streams should allow more concurrency, especially when many independent kernels are submitted in quick succession from the same worker thread. At the same time, having e.g. three thread-local streams per thread on a Grace CPU is fake concurrency: CUDA will not actually allow a concurrency of 3×72 = 216 streams. Using a shared pool limits the number of created streams to something more manageable, and does not scale unnecessarily with the number of worker threads. Finally, using the same streams concurrently from different threads should not create thread-safety issues, since CUDA stream handles may be used from multiple host threads.
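For illustration, here is a minimal sketch of what a process-wide shared stream pool with round-robin handout could look like. This is not the implementation in this PR; the names (`SharedStreamPool`, `get_stream`) and the default pool size are hypothetical.

```cpp
// Hypothetical sketch of a process-wide shared CUDA stream pool.
// Not the actual implementation in this PR; names and sizes are illustrative.
#include <cuda_runtime.h>
#include <atomic>
#include <cstddef>
#include <vector>

class SharedStreamPool {
public:
  // Create a fixed number of streams once, shared by all worker threads,
  // instead of a small per-thread pool.
  explicit SharedStreamPool(std::size_t n_streams = 24) : streams_(n_streams) {
    for (auto& s : streams_) {
      cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
    }
  }

  ~SharedStreamPool() {
    for (auto& s : streams_) {
      cudaStreamDestroy(s);
    }
  }

  // Hand out streams round-robin; a cudaStream_t handle may be used
  // concurrently from multiple host threads.
  cudaStream_t get_stream() {
    const std::size_t i = next_.fetch_add(1, std::memory_order_relaxed);
    return streams_[i % streams_.size()];
  }

private:
  std::vector<cudaStream_t> streams_;
  std::atomic<std::size_t> next_{0};
};

// Usage from any worker thread:
//   static SharedStreamPool pool;            // one pool for the whole process
//   cudaStream_t s = pool.get_stream();      // pick a stream for this launch
//   my_kernel<<<grid, block, 0, s>>>(...);   // submit work asynchronously
```

With a scheme like this, the total number of streams is fixed by the pool size rather than growing with the number of worker threads, while independent kernels from the same thread can still land on different streams.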
This change has no noticeable performance impact on P100, MI250, or H100 GPUs.