I'm trying to build a mental picture of how concurrency patterns between dispatches are lowered to the runtime API and how the runtimes (in particular CUDA, because that backend doesn't fully implement synchronization yet) might implement that API. It's more of a brain dump than a structured document with concrete questions, unfortunately. I still have to read through some more code and wrap my head around it, so I will likely have more concrete questions then. If any of the points below already seem off, I would appreciate being corrected.
TL;DR of my current understanding: CUDA graphs would leave some (unquantified) amount of performance on the table, because only some concurrency patterns can be represented through the HAL with a single command buffer, and multiple buffers result in multiple graphs, making execution less efficient.
HAL synchronization primitives
The HAL allows expressing dependencies between command buffers as well as between dispatches within a command buffer. Dependencies can be expressed as barriers, affinity masks, fences, and events.
Execution barrier
Defines an execution dependency of everything dispatched to a command buffer after the barrier on everything dispatched before the barrier. Unless sequenced through execution barriers, commands dispatched to a buffer may execute in any order. Barriers can be restricted to a subset of pipeline stages (do we have any examples of a performance gain from specifying a tighter mask?).
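Jumping ahead to the Vulkan mapping below, a tighter mask would look roughly like this minimal sketch, where both sides of the barrier are restricted to the compute stage so that, e.g., in-flight transfer commands are not serialized against it (command_buffer is a placeholder for a VkCommandBuffer in recording state):

```c
// Execution + memory barrier covering only the compute stage: later compute
// dispatches wait on earlier compute dispatches; other stages are unaffected.
VkMemoryBarrier barrier = {
    .sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER,
    .srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT,
    .dstAccessMask = VK_ACCESS_SHADER_READ_BIT,
};
vkCmdPipelineBarrier(
    command_buffer,
    /*srcStageMask=*/VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
    /*dstStageMask=*/VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
    /*dependencyFlags=*/0,
    /*memoryBarrierCount=*/1, &barrier,
    /*bufferMemoryBarrierCount=*/0, NULL,
    /*imageMemoryBarrierCount=*/0, NULL);
```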
Queue Affinity
Queue affinity may be used to define dependencies between command buffers. hal.device.queue takes a bit mask specifying which logical queues may execute the command buffer. Logical queues are implicitly ordered, so two buffers executing with the same single-bit affinity mask are also ordered. On the other hand, executing a buffer with more than one affinity bit set results in no ordering guarantee with any of the other buffers.
Note: could/should hal.fence.signal/await expose queue affinity as well?
Fences
Fences are used to synchronize execution of command buffers. hal.command_buffer.execute chains fences to order execution on the device, and hal.fence.signal/await allows synchronization with the host. Device allocations also chain fences.
Events
There are some traces of an event type, but it seems unused.
Mapping to vendor APIs
The synchronization expressed through the HAL needs to be mapped to the target driver API.
Vulkan mapping
The HAL's execution barriers and fences map directly to the Vulkan concepts of the same name. Fences can also be implemented using timeline semaphores, which generalize both fences and semaphores by inducing a partial order at command granularity.
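A minimal sketch of the timeline semaphore flavor (core in Vulkan 1.2; device and the queue submission that signals the counter are omitted): a single monotonically increasing 64-bit counter covers both device-to-device and device-to-host synchronization.

```c
// Create a timeline semaphore starting at 0.
VkSemaphoreTypeCreateInfo type_info = {
    .sType = VK_STRUCTURE_TYPE_SEMAPHORE_TYPE_CREATE_INFO,
    .semaphoreType = VK_SEMAPHORE_TYPE_TIMELINE,
    .initialValue = 0,
};
VkSemaphoreCreateInfo create_info = {
    .sType = VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO,
    .pNext = &type_info,
};
VkSemaphore timeline;
vkCreateSemaphore(device, &create_info, NULL, &timeline);

// Host-side wait until the device has signaled a value >= 2
// (the submission that signals that value is elided).
uint64_t wait_value = 2;
VkSemaphoreWaitInfo wait_info = {
    .sType = VK_STRUCTURE_TYPE_SEMAPHORE_WAIT_INFO,
    .semaphoreCount = 1,
    .pSemaphores = &timeline,
    .pValues = &wait_value,
};
vkWaitSemaphores(device, &wait_info, UINT64_MAX);
```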
Queue affinity can be implemented using multiple queues per device (side note: NVIDIA drivers only implement one multiplexed physical queue per family). Buffers which are required to execute on the same virtual queue need to be ordered using semaphores.
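For ordering two command buffers that are submitted separately but must execute in sequence, the classic binary-semaphore chaining looks like this sketch (queue_a/queue_b, cmd_a/cmd_b and sem_a_done are placeholders; per the note above, both queues may be backed by the same physical queue on NVIDIA):

```c
VkPipelineStageFlags wait_stage = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT;

// Submission A signals sem_a_done when it completes.
VkSubmitInfo submit_a = {
    .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
    .commandBufferCount = 1,
    .pCommandBuffers = &cmd_a,
    .signalSemaphoreCount = 1,
    .pSignalSemaphores = &sem_a_done,
};
// Submission B waits on sem_a_done before any of its work may start.
VkSubmitInfo submit_b = {
    .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
    .waitSemaphoreCount = 1,
    .pWaitSemaphores = &sem_a_done,
    .pWaitDstStageMask = &wait_stage,
    .commandBufferCount = 1,
    .pCommandBuffers = &cmd_b,
};
vkQueueSubmit(queue_a, 1, &submit_a, VK_NULL_HANDLE);
vkQueueSubmit(queue_b, 1, &submit_b, VK_NULL_HANDLE);
```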
CUDA streams and events
Unlike the commands of a Vulkan buffer, commands of one CUDA stream are implicitly ordered. Therefore, execution barriers between commands scheduled on the same stream are automatically satisfied.
On the other hand, commands not ordered by barriers would need to be scheduled on separate streams to have the opportunity to execute concurrently. Using a large number of streams and synchronizing them as needed using events will hurt performance in the common case of a single sequence of dependent commands. In other words, the runtime needs to dynamically assign commands to streams to minimize the use of events, but preferably without hurting concurrency by introducing false dependencies. The streams at the buffer level should be virtual and map to actual CUDA streams that span multiple buffer executions.
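For reference, the cross-stream ordering primitive this would be built on is the usual record/wait pair (CUDA runtime API; kernel, grid and stream names are placeholders):

```c
#include <cuda_runtime.h>

// kernel_a on stream_a must complete before kernel_b on stream_b starts.
cudaEvent_t a_done;
cudaEventCreateWithFlags(&a_done, cudaEventDisableTiming);

kernel_a<<<grid, block, 0, stream_a>>>(/*...*/);
cudaEventRecord(a_done, stream_a);

// stream_b will not run subsequently enqueued work until a_done has fired;
// the host does not block here.
cudaStreamWaitEvent(stream_b, a_done, 0);
kernel_b<<<grid, block, 0, stream_b>>>(/*...*/);
```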
The HAL normally uses a deferred buffer dispatch model, so the stream assignment could happen when all commands and dependencies are known. But deferred buffer dispatch hurts latency for CUDA streams and there is a one-shot model to avoid this. Cross-buffer stream assignment would need to be fully dynamic without seeing all buffers that will be executed ahead of time.
Assigning commands to streams seems to be complex enough that one would want to avoid computing a close-to-optimal assignment at runtime. However, assigning all commands to a single stream is trivial and optimal for fully ordered command buffers with execution barriers between each command. One possible approach would be to always fully order the commands of each buffer and direct the compiler to use separate buffers for workloads where parallel execution is expected to be beneficial. With this approach, each queue affinity could simply correspond to one CUDA stream.
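A rough sketch of that policy, with made-up helper names (recorded_cmd_t, issue_command) standing in for whatever the CUDA HAL actually records per command:

```c
// One CUDA stream per single-bit queue affinity; a fully ordered command
// buffer is replayed onto the stream selected by its affinity mask.
cudaStream_t streams[MAX_QUEUES];
for (int i = 0; i < MAX_QUEUES; ++i) {
  cudaStreamCreateWithFlags(&streams[i], cudaStreamNonBlocking);
}

void execute_buffer(const recorded_cmd_t* cmds, int count, uint64_t affinity) {
  // Pick the stream for the lowest set affinity bit; barriers between
  // commands become no-ops because stream order already serializes them.
  cudaStream_t stream = streams[__builtin_ctzll(affinity)];
  for (int i = 0; i < count; ++i) {
    issue_command(&cmds[i], stream);  // hypothetical: kernel launch, memcpy, ...
  }
}
```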
Fences would be implemented using events, both for synchronizing device execution across streams as well as for synchronizing with the host.
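Roughly, the two halves of that mapping in terms of the CUDA runtime API (producer_stream/consumer_stream are placeholders):

```c
// Device-side fence signal/wait: record on the producing stream, make the
// consuming stream wait without blocking the host.
cudaEvent_t fence_evt;
cudaEventCreateWithFlags(&fence_evt, cudaEventDisableTiming);
cudaEventRecord(fence_evt, producer_stream);
cudaStreamWaitEvent(consumer_stream, fence_evt, 0);

// Host-side wait on the same fence: either block...
cudaEventSynchronize(fence_evt);
// ...or poll, e.g. from a wait-handle implementation.
if (cudaEventQuery(fence_evt) == cudaSuccess) {
  /* fence reached */
}
```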
CUDA hides memory consistency from the user (e.g. caches are flushed implicitly between commands). If I understand correctly, the CUDA HAL may drop any memory barriers.
CUDA graphs
Each execution barrier can be represented as a complete bipartite graph. Presumably redundant edges will be trimmed by the driver when the graph is instantiated.
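A sketch with the explicit graph-construction API, assuming nodes_before/nodes_after hold the nodes recorded on either side of one barrier (node creation elided); an alternative that avoids the O(n·m) edges would be a single empty node standing in for the barrier.

```c
cudaGraph_t graph;
cudaGraphCreate(&graph, 0);
/* ... cudaGraphAddKernelNode()/cudaGraphAddMemcpyNode() calls elided ... */

// Complete bipartite dependency set: each "after" node depends on each
// "before" node. The driver is expected to prune redundant edges when the
// graph is instantiated.
for (size_t i = 0; i < num_after; ++i) {
  for (size_t j = 0; j < num_before; ++j) {
    cudaGraphAddDependencies(graph, &nodes_before[j], &nodes_after[i], 1);
  }
}

cudaGraphExec_t graph_exec;
cudaGraphInstantiateWithFlags(&graph_exec, graph, 0);
cudaGraphLaunch(graph_exec, stream);
```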
I think the assumption is that each buffer corresponds to a CUDA graph, because the buffer building API is deferred, but buffer execution on the queue is immediate. Cross-buffer dependencies would be handled the same way as for CUDA streams.
Concurrency patterns
The HAL synchronization primitives were mostly inspired by the Vulkan API, and queue affinity by CPU schedulers. This makes mapping to the corresponding APIs easy, but requires more work when the target provides a significantly different API.
I've heard the argument that Vulkan's barriers are the lowest common denominator that maps well to the underlying hardware implementation. I believe this is true for the graphics mode of GPUs, with its relatively long rendering pipeline through a number of fixed-function units whose side effects require fine-grained synchronization. It's been more than a decade since I was last familiar with current GPUs, but I think there is still a largely separate compute mode to run CUDA workloads, which has no such pipeline and a different scheduler (which at a high level simply multiplexes the CUDA streams at kernel granularity).
What seems problematic to me is that the barrier is not generic, in the sense that it can only define bipartite graphs, and that a bipartite graph is awkward and inefficient to represent using streams. But at the end of the day, it's probably more relevant which concurrency patterns one might want to represent and whether those are easy to convey efficiently across the HAL boundary.
Sequence of commands
A single (completely ordered) sequence of commands is the simplest but likely also the most common concurrency pattern. At the HAL level, this would be a single buffer with a barrier between every command. It is simple enough to detect that special case and map it to a single stream or linear graph.
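The graph side of that special case is nearly free via stream capture: replaying the fully ordered buffer onto one stream under capture yields a linear graph (kernel names are placeholders):

```c
cudaStream_t stream;
cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

cudaGraph_t graph;
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
// Replay the fully ordered command buffer; barriers are implicit in
// stream order and add no nodes.
kernel_a<<<grid, block, 0, stream>>>(/*...*/);
kernel_b<<<grid, block, 0, stream>>>(/*...*/);
cudaStreamEndCapture(stream, &graph);

cudaGraphExec_t graph_exec;
cudaGraphInstantiateWithFlags(&graph_exec, graph, 0);
cudaGraphLaunch(graph_exec, stream);
```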
Two (or more) independent sequences of commands need to be modeled using one buffer per sequence, with no fusion opportunity with predecessor and successor buffers. This is reasonable for CUDA streams. With one graph per buffer, though, we would be leaving some performance on the table because we would be instantiating more graphs and scheduling them on more streams than necessary. Is this slight inefficiency acceptable?
Fork / Join
Two (or more) parallel commands with a common predecessor and successor can be represented efficiently with barriers, and we should expect the compiler to, e.g., try to overlap compute and communication commands within a single command buffer. At the HAL level, this corresponds to a series of commands without barriers in between, bracketed by barriers. Again, it would probably not be too hard to detect and special-case the mapping to the corresponding stream pattern. For graphs, it might even be acceptable to build the complete dependency graph and let the driver trim redundant edges.
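On the stream side, the fork/join shape is the standard two-event pattern (all names are placeholders; main_stream/side_stream could be the compute and communication streams mentioned above):

```c
cudaEvent_t fork_evt, join_evt;
cudaEventCreateWithFlags(&fork_evt, cudaEventDisableTiming);
cudaEventCreateWithFlags(&join_evt, cudaEventDisableTiming);

// Fork: the side stream starts only after the common predecessor.
predecessor<<<grid, block, 0, main_stream>>>(/*...*/);
cudaEventRecord(fork_evt, main_stream);
cudaStreamWaitEvent(side_stream, fork_evt, 0);

// The two parallel commands may now run concurrently on their streams.
branch_a<<<grid, block, 0, main_stream>>>(/*...*/);
branch_b<<<grid, block, 0, side_stream>>>(/*...*/);

// Join: the common successor waits for the side stream to catch up.
cudaEventRecord(join_evt, side_stream);
cudaStreamWaitEvent(main_stream, join_evt, 0);
successor<<<grid, block, 0, main_stream>>>(/*...*/);
```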
Arbitrary graph
One could imagine that the compiler builds dependency graphs as one buffer per command and synchronizes their execution using fences. This would be inefficient for Vulkan as well and I expect there would need to be some pass to fuse command buffers. At that point, we would again leave some performance on the table by not combining multiple buffers into one graph.
Alternatives
It seems to me that it could potentially have been simpler and less limiting to implement command dependencies by passing fences (or some other type) between dispatches. This would allow fusion of larger graphs into a single buffer, which the runtime could inspect (because dispatches are a deferred API, whereas executing the buffer is immediate).