Use command buffers to reduce idle time #2091
I'm not sure the main GPGPU APIs (CUDA, OpenCL, HIP) support command buffers. That's more something you find in Vulkan. My impression is that fine-grained operations are more common (and thus important to handle efficiently) in graphics.

I don't think the overhead of splitting an entry point in twain is very great in Futhark. It involves no additional synchronisation. It does involve re-taking a (probably completely uncontested) CPU lock, but that's nearly free. If you call many entry points without syncing in between, the GPU operations they contain will be enqueued asynchronously.

The risk, of course, is that the entry points contain Futhark operations that require GPU synchronisation. This mostly occurs when Futhark needs to read a scalar (or whatever) in order to do CPU-side control flow. We have made some improvements to avoid this (by shunting sequential code into tiny single-threaded GPU kernels), but there are certainly still cases where this happens.
Command buffers apparently do exist in a more recent version of OpenCL, at least. I guess it all comes back to the elusive Vulkan backend (#1856). In a way Vulkan seems like a really nice compilation target because it gives so much control, but it is far too tedious for most people to write by hand. I suppose it would also solve some platform issues, since it should run on most GPUs.
I'm a noob about the CUDA runtime and not terribly familiar with Futhark, but would CUDA streams / graphs serve as a step towards this goal? Relevant links:
GPUs are hungry pieces of hardware and want a steady supply of commands. Many practical algorithms involve many iterations, where each iteration launches one or more kernels that are by themselves not very large, and the actual kernel runtime can be small compared to the idle time in between. This mostly affects small to medium-sized data sets, but sometimes also subroutines of operations on larger data.
For my particle simulations the issue becomes apparent because I want to run as many time steps per second as possible, with each time step launching many tens of kernels. Smaller systems that I want to run become very inefficient workloads, even though the 'main computation' should saturate the GPU reasonably well.
My knowledge of both the Futhark runtime and low-level GPU programming is quite limited, but I guess a fundamental question is how to divide commands into different buffers. One important consideration is to minimise the penalty of splitting an entry point into two, as composing simpler but more general entry points in the calling language makes for more maintainable code. As I understand it, nothing is in principle promised to be done before a context is synced, so it should be possible to queue kernels of different entry points together without too much issue. How many buffers would be necessary? With three buffers, one could be executing, one waiting, and one being written to, much like in graphics.
This is of course no small task to implement properly, but I think it is something that should eventually be done to improve performance in general and for smaller kernels in particular.