This document lists technical details regarding the CUDA implementation of
IREE's Hardware Abstraction Layer (HAL), called a CUDA HAL driver.

Note that there is an existing CUDA HAL driver under the
`iree/hal/drivers/cuda/` directory; what this directory holds is a rewrite of
it. Once this rewrite is mature enough, it will replace the existing one. For
the rewrite rationale, goals, and plans, please see
[Issue #13245][iree-cuda-rewrite].
The IREE HAL uses semaphores to synchronize work between host CPU threads and
device GPU streams. It's a unified primitive that covers all directions--host
to host, host to device, device to host, and device to device--and allows
flexible signal and wait ordering--signal before wait, or wait before signal.
There is also no limit on the number of waits on the same value.
The core state of a HAL semaphore consists of a monotonically increasing
64-bit integer value, which forms a timeline--signaling the semaphore to a
larger value advances the timeline and unblocks work waiting on earlier
values. The semantics closely mirror Vulkan timeline semaphores.
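To make those semantics concrete, here is a minimal sketch of the core state,
using hypothetical names rather than IREE's actual types: signaling only ever
advances the value, and a wait on value N is satisfied once the timeline
reaches N or beyond.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

// Hypothetical core state: just the monotonically increasing timeline value.
typedef struct hypothetical_semaphore_t {
  _Atomic uint64_t value;
} hypothetical_semaphore_t;

// Signaling advances the timeline; it never moves backwards.
static void hypothetical_signal(hypothetical_semaphore_t* s,
                                uint64_t new_value) {
  uint64_t current = atomic_load(&s->value);
  while (current < new_value &&
         !atomic_compare_exchange_weak(&s->value, &current, new_value)) {
    // CAS failure reloads `current`; retry only while still behind.
  }
}

// A wait on a value is satisfied once the timeline reaches it or beyond; any
// number of waiters may target the same value.
static bool hypothetical_wait_is_satisfied(hypothetical_semaphore_t* s,
                                           uint64_t wanted_value) {
  return atomic_load(&s->value) >= wanted_value;
}
```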
In CUDA, there is no direct equivalent primitive providing all the
capabilities needed by the HAL semaphore abstraction:

* `cuStreamWriteValue64()` and `cuStreamWaitValue64()` can implement the HAL
  semaphore's 64-bit integer value signal and wait. Though these operations
  require device pointers and cannot accept pointers to managed memory
  buffers, meaning no support for the host. Additionally, per the spec,
  "synchronization ordering established through these APIs is not visible to
  CUDA. CUDA tasks that are (even indirectly) ordered by these APIs should
  also have that order expressed with CUDA-visible dependencies such as
  events." So they are not suitable for integration with other CUDA
  components.
* `cuSignalExternalSemaphoresAsync()` and `cuWaitExternalSemaphoresAsync()`
  can directly map to Vulkan timeline semaphores. Though these APIs are meant
  to handle external resources--there is no way to create
  `CUexternalSemaphore` objects directly other than via
  `cuImportExternalSemaphore()`.

Therefore, to implement the support, we need to leverage multiple native CPU
or CUDA primitives under the hood.
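For illustration, here is roughly how the stream memory operations above
would express a device-side signal and wait. This is a sketch with
hypothetical helper names; the value must live in device-accessible memory
(e.g., from `cuMemAlloc()`), which is exactly why the host side is
unsupported.

```c
#include <cuda.h>

// Hypothetical device-side signal: write the new timeline value once all
// prior work on the stream completes.
static CUresult sketch_device_signal(CUstream stream,
                                     CUdeviceptr timeline_value,
                                     cuuint64_t new_value) {
  return cuStreamWriteValue64(stream, timeline_value, new_value,
                              CU_STREAM_WRITE_VALUE_DEFAULT);
}

// Hypothetical device-side wait: block later work on the stream until the
// timeline value reaches the wanted value. Note that, per the quoted spec
// text, this ordering is invisible to CUDA itself.
static CUresult sketch_device_wait(CUstream stream,
                                   CUdeviceptr timeline_value,
                                   cuuint64_t wanted_value) {
  return cuStreamWaitValue64(stream, timeline_value, wanted_value,
                             CU_STREAM_WAIT_VALUE_GEQ);
}
```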
#### `CUevent` capabilities

The main synchronization mechanism is the CUDA event--`CUevent`. As a
functionality and integration baseline, we use `CUevent` to implement the
IREE HAL semaphore abstraction.
`CUevent` natively supports the following capabilities:

* Multiple waits (e.g., via `cuEventSynchronize()` or
  `cuGraphAddEventWaitNode()`) for the same `CUevent` signal (e.g., via
  `cuEventRecord()` or `cuGraphAddEventRecordNode()`).
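As a baseline illustration of that capability--one recorded signal observed
by both a GPU stream and the host (function name hypothetical):

```c
#include <cuda.h>

static CUresult sketch_event_baseline(CUstream producer, CUstream consumer) {
  CUevent event = NULL;
  // CU_EVENT_DISABLE_TIMING avoids timing overhead for pure synchronization.
  CUresult result = cuEventCreate(&event, CU_EVENT_DISABLE_TIMING);
  if (result != CUDA_SUCCESS) return result;

  // Signal once: capture all work enqueued to `producer` so far.
  result = cuEventRecord(event, producer);
  // Wait #1: another GPU stream waits (device to device).
  if (result == CUDA_SUCCESS) result = cuStreamWaitEvent(consumer, event, 0);
  // Wait #2: the host waits on the same signal (device to host).
  if (result == CUDA_SUCCESS) result = cuEventSynchronize(event);

  cuEventDestroy(event);
  return result;
}
```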
We need to fill the remaining capability gaps. Before going into details, the
overall approach would be to:
* Map the timeline onto a series of `CUevent`s; instead of standing for the
  whole semaphore, each `CUevent` would just be a "timepoint" on the
  timeline.
* Defer submitting work to the GPU until the semaphore is signaled past the
  desired value on the CPU, or until there is a recorded `CUevent` to wait
  on.
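A rough picture of what such a "timepoint" could look like, using
hypothetical types (the concrete IREE structures are summarized at the end of
this section):

```c
#include <stdint.h>

#include <cuda.h>

// Hypothetical timepoint: marks one value on the timeline, backed by either
// a CPU or a GPU notification primitive.
typedef struct sketch_timepoint_t {
  uint64_t value;  // The timeline value this timepoint signals or waits on.
  union {
    void* cpu_event;    // Host-side notification (an iree_event_t in IREE).
    CUevent gpu_event;  // Device-side notification.
  };
  struct sketch_timepoint_t* next;  // The timeline tracks these in a list.
} sketch_timepoint_t;
```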
Concretely, for a given HAL semaphore, looking at the four directions:
A CPU thread signals the semaphore timeline to a new value.
If there are CPU waits, it is purely on the CPU side. We just need to use
common CPU notification mechanisms. In IREE we have `iree_event_t` wrapping
various low-level OS primitives for it. So we can just use that to represent
a wait timepoint. We need to keep track of all CPU wait timepoints in the
timeline. After a new signaled value, go through the timeline and notify all
those waiting on earlier values.
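A sketch of that notification pass, with hypothetical names; it assumes
`iree_event_set()` is the call that wakes threads blocked on an
`iree_event_t`, represented opaquely here:

```c
#include <stdint.h>

// Hypothetical host wait timepoint: a wanted value plus the iree_event_t to
// notify once the timeline passes it.
typedef struct sketch_cpu_wait_t {
  uint64_t wanted_value;
  void* event;  // Stands in for an iree_event_t*.
  struct sketch_cpu_wait_t* next;
} sketch_cpu_wait_t;

void sketch_iree_event_set(void* event);  // Stands in for iree_event_set().

// After signaling a new value, walk the tracked CPU wait timepoints and wake
// everyone waiting on that value or an earlier one.
static void sketch_notify_cpu_waits(sketch_cpu_wait_t* head,
                                    uint64_t new_value) {
  for (sketch_cpu_wait_t* tp = head; tp != NULL; tp = tp->next) {
    if (tp->wanted_value <= new_value) sketch_iree_event_set(tp->event);
  }
}
```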
If there are GPU waits, given that there is no way to signal a `CUevent` from
the CPU, one way to handle this is to cache and defer the submission batches
ourselves until the CPU signals past the desired value. To support this, we
would need to implement a deferred/pending actions queue.
GPU signals can only go through a `CUevent` object, which has a binary state.
We need to advance the timeline too. One way is to use `cuLaunchHostFunc()`
to perform the advance from the CPU side. This additionally would mean we can
reuse the logic from CPU signaling to unblock CPU waits.
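A sketch of that host-function path (helper names hypothetical): the callback
runs on the CPU once prior stream work completes and simply feeds into the
same signaling logic used for CPU signals.

```c
#include <stdint.h>
#include <stdlib.h>

#include <cuda.h>

// Hypothetical: the CPU signaling path from above, advancing the timeline
// value and notifying CPU waiters (and the pending actions queue).
void sketch_cpu_signal(void* semaphore, uint64_t new_value);

typedef struct sketch_signal_ctx_t {
  void* semaphore;  // Stands in for the HAL semaphore object.
  uint64_t new_value;
} sketch_signal_ctx_t;

// Runs on a CPU thread owned by CUDA after all prior work on the stream has
// completed.
static void CUDA_CB sketch_advance_timeline(void* user_data) {
  sketch_signal_ctx_t* ctx = (sketch_signal_ctx_t*)user_data;
  sketch_cpu_signal(ctx->semaphore, ctx->new_value);
  free(ctx);
}

// Enqueue a GPU-side signal that advances the 64-bit timeline via the CPU.
static CUresult sketch_enqueue_gpu_signal(CUstream stream, void* semaphore,
                                          uint64_t new_value) {
  sketch_signal_ctx_t* ctx = (sketch_signal_ctx_t*)malloc(sizeof(*ctx));
  if (!ctx) return CUDA_ERROR_OUT_OF_MEMORY;
  ctx->semaphore = semaphore;
  ctx->new_value = new_value;
  return cuLaunchHostFunc(stream, sketch_advance_timeline, ctx);
}
```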
For GPU waits, we can also leverage the same logic--using CPU signaling to
unblock deferred GPU queue actions. Though this works, it is not performant,
given that the CPU is involved in GPU-internal synchronization. We want to
use `CUevent` instead:
* We keep track of all GPU signals in the timeline. Once we see a GPU wait
  request, we can look for an already-recorded `CUevent` signaling past the
  desired value and wait on that `CUevent`. This would also guarantee the
  requirement of `CUevent`--recording should happen before waiting.
* We can use the same `CUevent` to unblock multiple GPU waits. That's
  allowed, though it would mean we need to be careful regarding `CUevent`
  lifetime management. Here we can use reference counting to see how many
  timepoints are using it and automatically return it to a pool once done.
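A sketch of that lifetime scheme, again with hypothetical names: every
timepoint sharing the `CUevent` holds a reference, and the last release
recycles the event rather than destroying it.

```c
#include <stdatomic.h>

#include <cuda.h>

struct sketch_event_pool_t;  // Hypothetical pool type.
void sketch_event_pool_return(struct sketch_event_pool_t* pool, void* event);

// Hypothetical refcounted wrapper: one CUevent shared by many timepoints.
typedef struct sketch_refcounted_event_t {
  _Atomic int ref_count;
  CUevent event;
  struct sketch_event_pool_t* pool;  // Where to return the event when done.
} sketch_refcounted_event_t;

static void sketch_event_retain(sketch_refcounted_event_t* e) {
  atomic_fetch_add(&e->ref_count, 1);
}

static void sketch_event_release(sketch_refcounted_event_t* e) {
  // The timepoint that drops the count to zero recycles the event.
  if (atomic_fetch_sub(&e->ref_count, 1) == 1) {
    sketch_event_pool_return(e->pool, e);
  }
}
```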
Another problem is that, per the `cuLaunchHostFunc()` doc, "the function will
be called after currently enqueued work and will block work added after it."
We don't want this blocking behavior involving the host. So we can use a
dedicated `CUstream` for launching the host function, and have that stream
wait on the `CUevent` recorded on the original stream. We can also handle
resource deallocation together there.
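Putting it together, a sketch of the dedicated-stream pattern (names
hypothetical): the original stream stays unblocked, while the host function's
"blocks work added after it" behavior only affects the dedicated stream.

```c
#include <cuda.h>

static CUresult sketch_signal_via_dedicated_stream(
    CUstream original_stream, CUstream host_fn_stream, CUevent event,
    CUhostFn host_fn, void* user_data) {
  // Mark completion of the work enqueued so far on the original stream.
  CUresult result = cuEventRecord(event, original_stream);
  // The dedicated stream waits on that event...
  if (result == CUDA_SUCCESS) {
    result = cuStreamWaitEvent(host_fn_stream, event, 0);
  }
  // ...and hosts the callback, so only this stream is blocked behind it.
  // Resource deallocation can be handled in the same callback.
  if (result == CUDA_SUCCESS) {
    result = cuLaunchHostFunc(host_fn_stream, host_fn, user_data);
  }
  return result;
}
```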
To summarize, we need the following data structures to implement the HAL
semaphore:

* `iree_event_t`: CPU notification mechanism wrapping low-level OS
  primitives. Used by host wait timepoints.
* `iree_event_pool_t`: a pool for CPU `iree_event_t` objects to recycle.
* `iree_hal_cuda2_event_t`: GPU notification mechanism wrapping a `CUevent`
  and a reference count. Used by device signal and wait timepoints.
  Associates with an `iree_hal_cuda2_event_pool_t` pool--returns to the pool
  directly once the reference count goes to 0.
* `iree_hal_cuda2_event_pool_t`: a pool for GPU `iree_hal_cuda2_event_t`
  objects to recycle.
* `iree_hal_cuda2_timepoint_t`: an object that wraps a CPU `iree_event_t` or
  GPU `iree_hal_cuda2_event_t` to represent a wait/signal timepoint on a
  timeline.
* `iree_hal_cuda2_timepoint_pool_t`: a pool for `iree_hal_cuda2_timepoint_t`
  objects to recycle. This pool builds upon the CPU and GPU event pools--it
  acquires CPU/GPU event objects there.
* `iree_hal_cuda2_timeline_semaphore_t`: contains a list of CPU wait and GPU
  wait/signal timepoints.
* `iree_hal_cuda2_queue_action_t`: a pending queue action (kernel launch or
  stream-ordered allocation).
* `iree_hal_cuda2_pending_queue_actions_t`: a data structure to manage
  pending queue actions. It provides APIs to enqueue actions and advance the
  queue on demand--queue actions are released to the GPU when all their wait
  semaphores are signaled past the desired value, or when there is a
  `CUevent` object already recorded to some `CUstream` to wait on.
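To close, a sketch of how advancing the pending queue actions could look (all
names hypothetical, not the actual `iree_hal_cuda2_pending_queue_actions_t`
API): on enqueue and on every semaphore advance, ready actions are unlinked
and submitted.

```c
#include <stdbool.h>

typedef struct sketch_action_t {
  // Wait semaphores/values and the payload (kernel launch or stream-ordered
  // allocation) would live here.
  struct sketch_action_t* next;
} sketch_action_t;

typedef struct sketch_pending_actions_t {
  sketch_action_t* head;  // Actions not yet released to the GPU.
} sketch_pending_actions_t;

// Hypothetical helpers: an action is ready when every wait semaphore is
// signaled past the desired value or has a recorded CUevent to wait on.
bool sketch_action_is_ready(const sketch_action_t* action);
void sketch_action_submit(sketch_action_t* action);

// Called on enqueue and whenever a semaphore timeline advances.
static void sketch_pending_actions_advance(sketch_pending_actions_t* queue) {
  for (sketch_action_t** link = &queue->head; *link != NULL;) {
    sketch_action_t* action = *link;
    if (sketch_action_is_ready(action)) {
      *link = action->next;  // Unlink and release to the GPU.
      sketch_action_submit(action);
    } else {
      link = &action->next;
    }
  }
}
```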