Use command buffers to reduce idle time #2091
I'm not sure the main GPGPU APIs (CUDA, OpenCL, HIP) support command buffers. That's more something you find in Vulkan. My impression is that fine-grained operations are more common (and thus important to handle efficiently) in graphics.

I don't think the overhead of splitting an entry point in twain is very great in Futhark. It involves no additional synchronisation. It does involve re-taking a (probably completely uncontested) CPU lock, but that's nearly free. If you call many entry points without syncing in between, the GPU operations they contain will be enqueued asynchronously.

The risk, of course, is that the entry points contain Futhark operations that require GPU synchronisation. This mostly occurs when Futhark needs to read a scalar (or whatever) in order to do CPU-side control flow. We have made some improvements to avoid this (by shunting sequential code into tiny single-threaded GPU kernels), but there are certainly still cases where this happens.
Command buffers apparently do exist in a more recent version of OpenCL, at least. I guess it all comes back to the elusive Vulkan backend (#1856). In a way Vulkan seems like a really nice compilation target because it gives so much control, but it is far too tedious for most people to write by hand. I suppose it would also solve some platform issues, since it should run on most GPUs.
I'm a noob about the CUDA runtime and not terribly familiar with Futhark, but would CUDA streams / graphs serve as a step towards this goal? Relevant links:
GPUs are hungry pieces of hardware and want a steady supply of commands. Many practical algorithms involve many iterations, where each iteration launches one or more kernels that are by themselves not very large, and the actual kernel runtime can be small compared to the idle time in between. This mostly affects small to medium-sized data sets, but sometimes also subroutines of operations on larger data.
For my particle simulations the issue becomes apparent because I want to run as many time steps per second as possible, with each time step launching many tens of kernels. Smaller systems that I want to run become very inefficient workloads, even though the 'main computation' should saturate the GPU reasonably well.
My knowledge of both the Futhark runtime and low-level GPU programming is quite limited, but I guess a fundamental question is how to divide commands into different buffers. One important consideration is to minimise the penalty of splitting an entry point into two, as composing simpler but more general entry points in the calling language makes for more maintainable code. As I understand it, nothing is in principle promised to be done before a context is synced, so it should be possible to queue kernels of different entry points together without too much issue. How many buffers would be necessary? With three buffers, one could be executing, one waiting, and one being written to, much like in graphics.
This is of course no small task to implement properly, but I think it is something that should eventually be done to improve performance in general and for smaller kernels in particular.