| # IREE CUDA HAL Driver |
| |
This document lists technical details regarding the CUDA implementation of
IREE's [Hardware Abstraction Layer (HAL)][iree-hal], called the CUDA HAL
driver.
| |
| Note that there is an existing CUDA HAL driver under the |
[`iree/hal/drivers/cuda/`][iree-cuda] directory; what this directory holds is
a rewrite of it. Once this rewrite is mature enough, it will replace the
| existing one. For the rewrite rationale, goals, and plans, please see |
| [Issue #13245][iree-cuda-rewrite]. |
| |
| ## Synchronization |
| |
| ### HAL Semaphore |
| |
The IREE HAL uses semaphores to synchronize work between host CPU threads and
device GPU streams. It is a unified primitive that covers all
directions--host to host, host to device, device to host, and device to
device--and allows flexible signal and wait ordering--signal before wait, or
wait before signal. There is also no limit on the number of waits on the same
value.
| |
| The core state of a HAL semaphore consists of a monotonically increasing 64-bit |
| integer value, which forms a timeline--signaling the semaphore to a larger |
| value advances the timeline and unblocks work waiting on some earlier values. |
The semantics closely mirror
[Vulkan timeline semaphores][vulkan-timeline-semaphore].
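
To make the timeline semantics concrete, here is a minimal sketch using
IREE's HAL C API; the exact signatures (e.g., whether
`iree_hal_semaphore_create()` takes additional flags) may differ across IREE
versions:

```c
#include "iree/hal/api.h"

static iree_status_t timeline_example(iree_hal_device_t* device) {
  // The timeline starts at 0.
  iree_hal_semaphore_t* semaphore = NULL;
  IREE_RETURN_IF_ERROR(
      iree_hal_semaphore_create(device, /*initial_value=*/0ull, &semaphore));

  // Signaling to a larger value advances the timeline; it unblocks *all*
  // waiters on values <= 2, whether they started waiting before or after
  // this signal, and any number of waiters may target the same value.
  IREE_RETURN_IF_ERROR(iree_hal_semaphore_signal(semaphore, 2ull));

  // This wait returns immediately since the timeline already reached 2;
  // waiting on, say, 3 would block until another thread or the device
  // signals past it.
  IREE_RETURN_IF_ERROR(
      iree_hal_semaphore_wait(semaphore, 2ull, iree_infinite_timeout()));

  iree_hal_semaphore_release(semaphore);
  return iree_ok_status();
}
```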
| |
In CUDA, there is no direct equivalent primitive providing all the
capabilities needed by the HAL semaphore abstraction:
| |
* [Stream memory operations][cu-mem-ops] provide `cuStreamWriteValue64()` and
  `cuStreamWaitValue64()`, which can implement HAL semaphore 64-bit integer
  value signal and wait (see the sketch after this list). However, these
  operations require device pointers and cannot accept pointers to managed
  memory buffers, meaning no support for the host. Additionally, per the spec,
  "synchronization ordering established through these APIs is not visible to
  CUDA. CUDA tasks that are (even indirectly) ordered by these APIs should
  also have that order expressed with CUDA-visible dependencies such as
  events." So they are not suitable for integration with other CUDA
  components.
* For [external resource interoperability][cu-external-resource], we have APIs
  like `cuSignalExternalSemaphoresAsync()` and
  `cuWaitExternalSemaphoresAsync()`, which can directly map to Vulkan timeline
  semaphores. However, these APIs are meant to handle external
  resources--there is no way to create `CUexternalSemaphore` objects directly
  other than via `cuImportExternalSemaphore()`.
| |
Therefore, to implement HAL semaphore support, we need to leverage multiple
native CPU or CUDA primitives under the hood.
| |
| #### `CUevent` capabilities |
| |
| The main synchronization mechanism is [CUDA event--`CUevent`][cu-event]. |
| As a functionality and integration baseline, we use `CUevent` to implement the |
| IREE HAL semaphore abstraction. |
| |
`CUevent` natively supports the following capabilities, illustrated by the
sketch after the list:
| |
| * State: binary; either unsignaled or signaled. There can exist multiple |
| waits (e.g., via `cuEventSynchronize()` or `cuGraphAddEventWaitNode()`) for |
| the same `CUevent` signal (e.g., via `cuEventRecord()` or |
| `cuGraphAddEventRecordNode()`). |
* Ordering: must be signal before wait. Waiting before the signal would mean
  waiting on an empty set of work, or on previously recorded work.
| * Direction: device to device, device to host. |
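
A minimal sketch of these native capabilities:

```c
#include <cuda.h>

void cuevent_example(CUstream stream_a, CUstream stream_b) {
  CUevent event;
  cuEventCreate(&event, CU_EVENT_DISABLE_TIMING);

  // Signal: capture the work enqueued so far on stream_a.
  cuEventRecord(event, stream_a);

  // Device-to-device wait: stream_b blocks until the captured work completes.
  cuStreamWaitEvent(stream_b, event, CU_EVENT_WAIT_DEFAULT);
  // Device-to-host wait: the calling host thread blocks on the same event.
  cuEventSynchronize(event);

  cuEventDestroy(event);
}
```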
| |
We need to fill the remaining capability gaps. Before going into details,
the overall approach is as follows:
| |
| * State: we need a 64-bit integer value timeline. Given the binary state of |
| a `CUevent`, each `CUevent` would just be a "timepoint" on the timeline. |
| * Ordering: we need to defer releasing the workload to the GPU until the |
| semaphore waits are reached on the host, or we can have some device |
| `CUevent` to wait on. |
* Direction: host to host and host to device are missing; we can support them
  with host synchronization mechanisms.
| |
| #### Signal to wait analysis |
| |
Concretely, for a given HAL semaphore, we look at the four directions:
| |
| ##### CPU signal |
| |
| A CPU thread signals the semaphore timeline to a new value. |
| |
If there are CPU waits, it is purely on the CPU side. We just need to use
common CPU notification mechanisms. In IREE we have `iree_event_t` wrapping
various low-level OS primitives for this, so we can use it to represent a wait
timepoint. We need to keep track of all CPU wait timepoints in the timeline.
After signaling a new value, we go through the timeline and notify all those
waiting on earlier values, as sketched below.
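
A minimal sketch of that notification pass, assuming a hypothetical timepoint
list; the actual IREE data structures differ in detail:

```c
#include <stdint.h>

#include "iree/base/internal/wait_handle.h"  // iree_event_t (location may vary)

// Hypothetical list node tracking one host waiter on the timeline.
typedef struct cpu_wait_timepoint_t {
  uint64_t value;                     // Timeline value this waiter needs.
  iree_event_t event;                 // OS-level event the waiter blocks on.
  struct cpu_wait_timepoint_t* next;
} cpu_wait_timepoint_t;

// Called (under the semaphore lock) after the timeline advanced to
// |new_value|: wake every host waiter whose target value is now reached.
static void notify_cpu_waiters(cpu_wait_timepoint_t* head,
                               uint64_t new_value) {
  for (cpu_wait_timepoint_t* tp = head; tp != NULL; tp = tp->next) {
    if (tp->value <= new_value) iree_event_set(&tp->event);
  }
}
```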
| |
If there are GPU waits, given that there is no way to signal a `CUevent` from
the CPU, one way to handle this is to cache and defer the submission batches
ourselves until the CPU signals past the desired value. To support this, we
need to implement a deferred/pending actions queue.
| |
| ##### GPU signal |
| |
GPU signals can only go through a `CUevent` object, which has a binary state.
We need to advance the timeline too. One way is to use `cuLaunchHostFunc()`
to perform the advance from the CPU side, as sketched below. This additionally
means we can reuse the logic from CPU signaling to unblock CPU waits.
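
A minimal sketch of this signaling path; `signal_ctx_t` and the commented-out
advance call are hypothetical stand-ins for the real timeline-advance logic:

```c
#include <cuda.h>

// Hypothetical context passed to the host function.
typedef struct {
  void* semaphore;               // The HAL semaphore to advance.
  unsigned long long new_value;  // Timeline value to advance to.
} signal_ctx_t;

// Runs on a CPU thread once all work enqueued on the stream so far completes.
static void CUDA_CB host_advance(void* user_data) {
  signal_ctx_t* ctx = (signal_ctx_t*)user_data;
  // Reuse the CPU signaling logic: advance the timeline and wake CPU waiters,
  // e.g., hypothetical_semaphore_advance(ctx->semaphore, ctx->new_value);
  (void)ctx;
}

void gpu_signal_example(CUstream stream, signal_ctx_t* ctx) {
  cuLaunchHostFunc(stream, host_advance, ctx);
}
```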
| |
For GPU waits, we can also leverage the same logic--using CPU signaling to
unblock deferred GPU queue actions. Though this is not performant, given that
the CPU is involved in GPU-internal synchronization. We want to use `CUevent`
instead:
| |
* We keep track of all GPU signals in the timeline. Once we see a GPU wait
  request, we scan the timeline to find a GPU signal that advances the
  timeline past the desired value, and use that for waiting instead.
* We may not see the GPU signal before seeing GPU wait requests. In that
  case, we also keep track of all GPU waits in the timeline. Later, once we
  see either a CPU signal or a GPU signal advancing past the waited value, we
  can handle them accordingly--submitting immediately or associating the
  `CUevent`. This also guarantees the `CUevent` requirement--recording should
  happen before waiting.
* We can use the same `CUevent` to unblock multiple GPU waits. That is
  allowed, though it means we need to be careful regarding `CUevent` lifetime
  management. Here we can use reference counting to track how many timepoints
  are using it and automatically return it to a pool once done, as sketched
  after this list.
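
A minimal sketch of such reference counting; the pool type and its return
function are hypothetical stand-ins:

```c
#include <cuda.h>
#include <stdatomic.h>

typedef struct event_pool_t event_pool_t;  // Hypothetical pool type.

// A CUevent bundled with a reference count so one event can serve multiple
// wait timepoints and be recycled once the last one is done.
typedef struct refcounted_event_t {
  CUevent event;
  atomic_int ref_count;
  event_pool_t* pool;  // Owning pool to return to.
} refcounted_event_t;

void refcounted_event_retain(refcounted_event_t* e) {
  atomic_fetch_add(&e->ref_count, 1);
}

void refcounted_event_release(refcounted_event_t* e) {
  if (atomic_fetch_sub(&e->ref_count, 1) == 1) {
    // Last user done: recycle the CUevent instead of destroying it,
    // e.g., hypothetical event_pool_return(e->pool, e);
  }
}
```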
| |
Another problem is that, per the `cuLaunchHostFunc()` documentation, "the
function will be called after currently enqueued work and will block work
added after it." We do not want this blocking behavior to involve the host.
So we can use a dedicated `CUstream` for launching the host function, and have
it wait on the `CUevent` from the original stream, as sketched below. We can
also handle resource deallocation there.
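
A minimal sketch of this dedicated-stream arrangement:

```c
#include <cuda.h>

// The host function only blocks |host_stream|, so work enqueued later on
// |main_stream| proceeds without waiting for the CPU callback.
void signal_via_dedicated_stream(CUstream main_stream, CUstream host_stream,
                                 CUevent event, CUhostFn advance_fn,
                                 void* user_data) {
  // Capture the point on the main stream we want to signal at.
  cuEventRecord(event, main_stream);
  // Order the host callback after that point, but on the dedicated stream.
  cuStreamWaitEvent(host_stream, event, CU_EVENT_WAIT_DEFAULT);
  cuLaunchHostFunc(host_stream, advance_fn, user_data);
}
```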
| |
| #### Data structures |
| |
To summarize, we need the following data structures to implement the HAL
semaphore:
| |
| * `iree_event_t`: CPU notification mechanism wrapping low-level OS primitives. |
| Used by host wait timepoints. |
| * `iree_event_pool_t`: a pool for CPU `iree_event_t` objects to recycle. |
* `iree_hal_cuda2_event_t`: GPU notification mechanism wrapping a `CUevent`
  and a reference count. Used by device signal and wait timepoints. Associated
  with an `iree_hal_cuda2_event_pool_t` pool--returns to the pool directly
  once the reference count goes to 0.
| * `iree_hal_cuda2_event_pool_t`: a pool for GPU `iree_hal_cuda2_event_t` objects |
| to recycle. |
| * `iree_hal_cuda2_timepoint_t`: an object that wraps a CPU `iree_event_t` or |
| GPU `iree_hal_cuda2_event_t` to represent wait/signal of a timepoint on a |
| timeline. |
| * `iree_hal_cuda2_timepoint_pool_t`: a pool for `iree_hal_cuda2_timepoint_t` |
| objects to recycle. This pool builds upon the CPU and GPU event pool--it |
| acquires CPU/GPU event objects there. |
* `iree_hal_cuda2_timeline_semaphore_t`: contains a list of CPU wait and GPU
  wait/signal timepoints.
| * `iree_hal_cuda2_queue_action_t`: a pending queue action (kernel launch or |
| stream-ordered allocation). |
* `iree_hal_cuda2_pending_queue_actions_t`: a data structure to manage
  pending queue actions. It provides APIs to enqueue actions and advance the
  queue on demand--queue actions are released to the GPU when all their wait
  semaphores are signaled past the desired value, or when there is a `CUevent`
  object already recorded to some `CUstream` to wait on. A sketch follows the
  list.
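
A minimal sketch of how such a pending actions queue could be advanced;
`action_is_ready()` and `action_submit()` are hypothetical placeholders for
the real readiness check and submission logic:

```c
#include <stdbool.h>

typedef struct queue_action_t {
  struct queue_action_t* next;
  // Kernel launch or stream-ordered allocation, plus the wait semaphores and
  // values gating it (elided).
} queue_action_t;

// Placeholder: true once every wait semaphore is signaled past its value or
// has a CUevent already recorded to some CUstream that we can wait on.
static bool action_is_ready(queue_action_t* action) {
  (void)action;
  return true;
}

// Placeholder: release the action to the GPU.
static void action_submit(queue_action_t* action) { (void)action; }

// Called whenever a semaphore timeline advances: submit every action whose
// waits are now satisfiable and keep the rest pending.
static queue_action_t* advance_pending_actions(queue_action_t* head) {
  queue_action_t** link = &head;
  while (*link != NULL) {
    queue_action_t* action = *link;
    if (action_is_ready(action)) {
      *link = action->next;  // Unlink from the pending list.
      action_submit(action);
    } else {
      link = &action->next;
    }
  }
  return head;
}
```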
| |
| |
| [iree-hal]: https://github.com/openxla/iree/tree/main/runtime/src/iree/hal |
| [iree-cuda]: https://github.com/openxla/iree/tree/main/runtime/src/iree/hal/drivers/cuda |
[iree-cuda-rewrite]: https://github.com/openxla/iree/issues/13245
| [vulkan-timeline-semaphore]: https://www.khronos.org/blog/vulkan-timeline-semaphores |
| [cu-mem-ops]: https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEMOP.html |
| [cu-external-resource]: https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__EXTRES__INTEROP.html |
| [cu-event]: https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__EVENT.html |