# Metal HAL Driver
This document lists technical details regarding the Metal HAL driver. Note that
the Metal HAL driver is a work in progress; this document is expected to be
updated along the way.
IREE provides a [Hardware Abstraction Layer (HAL)][iree-hal] as a common
interface to different compute accelerators. IREE HAL's design draws inspiration
from modern GPU architectures and APIs, so implementing a HAL driver using a
modern GPU API is generally straightforward. This applies to the Metal HAL
driver.
## Overall Design Choices
### Metal Versions
The Metal HAL driver expects Metal 2+. Metal 2 introduces useful features like
argument buffers, Metal Performance Shaders, and others that can improve
performance and simplify the IREE HAL implementation. Metal 2 was released in
late 2017 and has been supported since macOS High Sierra and iOS 11; it is
already the dominant version in use
([macOS][macos-version-share], [iOS][ios-version-share]).
### Programming Languages and Libraries
The Metal HAL driver lives under the [`iree/hal/metal/`][iree-metal] directory.
Header (`.h`) and implementation (`.mm`) files are put adjacent to each other.
The Metal framework only exposes Objective-C and Swift APIs. The Metal HAL
driver needs to inherit from the common HAL abstraction classes, which are
written in C++, so we use [Objective-C++][objcxx] to implement the Metal HAL
driver. The headers try to stick to pure C/C++ syntax where possible, except for
`#import <Metal/Metal.h>` and the use of Metal `id` types.
### Object Lifetime Management
Objective-C uses reference counting to track object lifetime and manage memory.
This is traditionally done manually by sending `retain` and `release` messages
to Objective-C objects. Modern Objective-C allows developers to opt in to
[Automatic Reference Counting][objc-arc], which lets the compiler automatically
deduce and insert `retain`/`release` calls to ease the burden of manual
management.
We don't use ARC in the Metal HAL driver given that IREE has its own object
[refcount][iree-refptr] and lifetime management mechanism. Metal HAL GPU objects
are tracked with that mechanism to stay consistent with the rest of IREE. Each
Metal HAL GPU object `retain`s the underlying Metal `id<MTL*>` object on
construction and `release`s it on destruction.
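As a minimal sketch, wrapping a Metal object without ARC looks roughly like the
following; the class and member names are illustrative, not the actual IREE
implementation.

```objc
// A minimal sketch of wrapping a Metal object without ARC. The class and
// member names are illustrative, and the file must be compiled without
// -fobjc-arc for the explicit retain/release to be allowed.
#import <Metal/Metal.h>

class MetalBufferWrapper {
 public:
  explicit MetalBufferWrapper(id<MTLBuffer> buffer) : buffer_([buffer retain]) {}
  ~MetalBufferWrapper() { [buffer_ release]; }

  id<MTLBuffer> handle() const { return buffer_; }

 private:
  id<MTLBuffer> buffer_;  // Manually retained; released on destruction.
};
```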
## GPU Objects
Metal is one of the main modern GPU APIs that provide explicit control over
the hardware. The mapping between IREE HAL classes and Metal protocols is
relatively straightforward:
IREE HAL Class | Metal Protocol
:----------------------------------------: | :------------:
[`hal::Driver`][hal-driver] | N/A
[`hal::Device`][hal-device] | [`MTLDevice`][mtl-device]
[`hal::CommandQueue`][hal-command-queue] | [`MTLCommandQueue`][mtl-command-queue]
[`hal::CommandBuffer`][hal-command-buffer] | [`MTLCommandBuffer`][mtl-command-buffer]
[`hal::Semaphore`][hal-semaphore] | [`MTLSharedEvent`][mtl-shared-event]
[`hal::Allocator`][hal-allocator] | N/A
[`hal::Buffer`][hal-buffer] | [`MTLBuffer`][mtl-buffer]
[`hal::Executable`][hal-executable] | [`MTLLibrary`][mtl-library]
[`hal::ExecutableCache`][hal-executable-cache] | N/A
[`hal::DescriptorSetLayout`][hal-descriptor-set-layout] | N/A
[`hal::DescriptorSet`][hal-descriptor-set] | N/A
[`hal::ExecutableLayout`][hal-executable-layout] | N/A
In the following subsections, we go over each pair to provide more details.
### Driver
There is no native driver abstraction in Metal. IREE's Metal HAL driver still
provides a [`hal::metal::MetalDriver`][metal-driver] subclass inheriting from
the common [`hal::Driver`][hal-driver] class. `hal::metal::MetalDriver` just
`retain`s all available Metal devices in the system during its lifetime, to
provide the same interface as other HAL drivers.
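For illustration, device enumeration on macOS can look roughly like the sketch
below. `MTLCopyAllDevices()` is macOS-only; iOS would instead use
`MTLCreateSystemDefaultDevice()`. The helper name and structure are assumptions,
not the actual IREE code.

```objc
// A sketch of enumerating and retaining Metal devices (macOS-only API).
#import <Metal/Metal.h>
#include <vector>

std::vector<id<MTLDevice>> EnumerateMetalDevices() {
  std::vector<id<MTLDevice>> devices;
  NSArray<id<MTLDevice>>* all_devices = MTLCopyAllDevices();  // Caller-owned ("Copy" rule).
  for (id<MTLDevice> device in all_devices) {
    devices.push_back([device retain]);  // Keep each device alive past the array.
  }
  [all_devices release];
  return devices;
}
```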
### Device
[`hal::metal::MetalDevice`][metal-device] inherits [`hal::Device`][hal-device]
to provide the interface to a Metal GPU device by wrapping an `id<MTLDevice>`.
Upon construction, `hal::metal::MetalDevice` creates and retains one queue for
both dispatch and transfer during its lifetime.
Metal requires command buffers to be created from a `MTLCommandQueue`, while in
IREE HAL, command buffers are created directly from the `hal::Device`.
`hal::metal::MetalDevice` chooses the proper queue for creating the command
buffer under the hood.
### Command queue
The IREE HAL command queue follows Vulkan for modelling submission. Specifically,
`hal::CommandQueue::Submit()` takes a `SubmissionBatch`, which contains a list
of waiting `hal::Semaphore`s, a list of command buffers, and a list of signaling
`hal::Semaphore`s. There is no direct mapping in Metal, so
[`hal::metal::MetalCommandQueue`][metal-command-queue] performs the submission
in three steps (see the sketch after the list):
1. Create a new `MTLCommandBuffer` to `encodeWaitForEvent:value:` for all
waiting `hal::Semaphore`s and commit this command buffer.
1. Commit all command buffers in the `SubmissionBatch`.
1. Create a new `MTLCommandBuffer` to `encodeSignalEvent:value:` for all
signaling `hal::Semaphore`s and commit this command buffer.
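A rough sketch of the three-step submission, assuming the semaphore wrappers
expose their underlying `id<MTLSharedEvent>` and payload value; the struct and
helper signature are illustrative stand-ins for IREE's `SubmissionBatch`.

```objc
// A rough sketch of the three-step submission described above.
#import <Metal/Metal.h>
#include <cstdint>
#include <vector>

struct SemaphoreValue {
  id<MTLSharedEvent> event;
  uint64_t value;
};

void SubmitBatch(id<MTLCommandQueue> queue,
                 const std::vector<SemaphoreValue>& wait_semaphores,
                 const std::vector<id<MTLCommandBuffer>>& command_buffers,
                 const std::vector<SemaphoreValue>& signal_semaphores) {
  // 1. Encode all waits into a dedicated command buffer and commit it.
  id<MTLCommandBuffer> wait_buffer = [queue commandBuffer];
  for (const auto& wait : wait_semaphores) {
    [wait_buffer encodeWaitForEvent:wait.event value:wait.value];
  }
  [wait_buffer commit];

  // 2. Commit the batch's own command buffers.
  for (id<MTLCommandBuffer> buffer : command_buffers) {
    [buffer commit];
  }

  // 3. Encode all signals into another dedicated command buffer and commit it.
  id<MTLCommandBuffer> signal_buffer = [queue commandBuffer];
  for (const auto& signal : signal_semaphores) {
    [signal_buffer encodeSignalEvent:signal.event value:signal.value];
  }
  [signal_buffer commit];
}
```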
There is also no direct `WaitIdle()` for
[`MTLCommandQueue`][mtl-command-queue]s. `hal::metal::MetalCommandQueue`
implements `WaitIdle()` by committing an empty `MTLCommandBuffer` and
registering a completion handler on it that signals a semaphore to wake the
current thread, which is put to sleep waiting on that semaphore.
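A sketch of this `WaitIdle()` approach, using `dispatch_semaphore` to stand in
for IREE's own wait primitive:

```objc
// A sketch of WaitIdle(): commit an empty command buffer and block until its
// completion handler fires.
#import <Metal/Metal.h>
#include <dispatch/dispatch.h>

void WaitQueueIdle(id<MTLCommandQueue> queue) {
  dispatch_semaphore_t done = dispatch_semaphore_create(0);
  id<MTLCommandBuffer> command_buffer = [queue commandBuffer];
  [command_buffer addCompletedHandler:^(id<MTLCommandBuffer> cb) {
    dispatch_semaphore_signal(done);  // Everything committed earlier has finished.
  }];
  [command_buffer commit];
  dispatch_semaphore_wait(done, DISPATCH_TIME_FOREVER);
  dispatch_release(done);  // Manual release since this sketch does not use ARC.
}
```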
### Command buffer
In Metal, commands are recorded into a command buffer with different kinds of
[command encoders][mtl-command-encoder]: `MTLRenderCommandEncoder`,
`MTLComputeCommandEncoder`, `MTLBlitCommandEncoder`, and
`MTLParallelRenderCommandEncoder`. Each encoder has its own create/end call;
there is no overall begin/end call for the whole command buffer. So even though
[`hal::metal::MetalCommandBuffer`][metal-command-buffer] implements overall
`Begin()`/`End()` calls, under the hood it may create a new command encoder for
a specific API call.
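A sketch of that lazy encoder management, with illustrative helper names and
the currently active encoder passed by reference:

```objc
// A sketch of lazily creating/ending a compute encoder inside the command
// buffer wrapper; names are illustrative, not the actual IREE code.
#import <Metal/Metal.h>

id<MTLComputeCommandEncoder> GetOrBeginComputeEncoder(
    id<MTLCommandBuffer> command_buffer,
    id<MTLComputeCommandEncoder>& active_encoder) {
  if (active_encoder == nil) {
    // An encoder is created right before the first command that needs it.
    active_encoder = [[command_buffer computeCommandEncoder] retain];
  }
  return active_encoder;
}

void EndComputeEncoder(id<MTLComputeCommandEncoder>& active_encoder) {
  if (active_encoder != nil) {
    // Ended when switching to another encoder kind or at End().
    [active_encoder endEncoding];
    [active_encoder release];
    active_encoder = nil;
  }
}
```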
### Timeline semaphore
[`hal::Semaphore`][hal-semaphore] allows host->device, device->host, host->host,
and device->device synchronization. It maps to the Vulkan timeline semaphore. In
the Metal world, the counterpart is [`MTLSharedEvent`][mtl-shared-event]. Most
of the `hal::Semaphore` APIs are simple to implement in
[`MetalSharedEvent`][metal-shared-event], with `Wait()` as an exception: a
listener is registered on the `MTLSharedEvent` with
`notifyListener:atValue:block:` to signal a semaphore that wakes the current
thread, which is put to sleep waiting on that semaphore.
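A sketch of the blocking `Wait()` on an `MTLSharedEvent`; timeout handling is
omitted and `dispatch_semaphore` again stands in for the real wait primitive.

```objc
// A sketch of a blocking wait on an MTLSharedEvent via a notification listener.
#import <Metal/Metal.h>
#include <dispatch/dispatch.h>

void WaitSharedEvent(id<MTLSharedEvent> event, uint64_t value) {
  if (event.signaledValue >= value) return;  // Already reached the target value.
  dispatch_semaphore_t wake = dispatch_semaphore_create(0);
  MTLSharedEventListener* listener = [[MTLSharedEventListener alloc] init];
  [event notifyListener:listener
                atValue:value
                  block:^(id<MTLSharedEvent> e, uint64_t v) {
                    dispatch_semaphore_signal(wake);  // Wake the sleeping thread.
                  }];
  dispatch_semaphore_wait(wake, DISPATCH_TIME_FOREVER);
  [listener release];
  dispatch_release(wake);
}
```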
### Allocator
At the moment the Metal HAL driver has a very simple
[`hal::Allocator`][hal-allocator] implementation: it just wraps a `MTLDevice`
and redirects all allocation requests to it, with no page/pool/slab
sub-allocation whatsoever. This is only meant to get us started. In the future
we should use a better memory allocation library, probably by layering the
[Vulkan Memory Allocator][vma] on top of [`MTLHeap`][mtl-heap].
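The pass-through allocation boils down to something like the following sketch;
how `options` is derived from the requested IREE memory type is covered in
[Memory Management](#memory-management) below.

```objc
// A minimal sketch of the pass-through allocation; the helper name is
// illustrative and `options` carries the already-selected storage mode.
#import <Metal/Metal.h>

id<MTLBuffer> AllocateDeviceBuffer(id<MTLDevice> device, NSUInteger byte_length,
                                   MTLResourceOptions options) {
  // Goes straight to the device; no sub-allocation out of an MTLHeap yet.
  return [device newBufferWithLength:byte_length options:options];
}
```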
### Buffer
IREE [`hal::Buffer`][hal-buffer] maps to Metal `MTLBuffer`. See
[Memory Management](#memory-management) for more details.
### Executable
IREE [`hal::Executable`][hal-executable] represents a GPU program archive with
a driver-defined format. It maps naturally to Metal [`MTLLibrary`][mtl-library].
An entry point in a `MTLLibrary` is a [`MTLFunction`][mtl-function]. We define
[`hal::metal::MetalKernelLibrary`][metal-kernel-library] to wrap around a
`MTLLibrary`, its `MTLFunction`s, and also `MTLComputePipelineState` objects
constructed from `MTLFunction`s.
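A sketch of resolving an entry point into a compute pipeline state is shown
below; the helper name is illustrative and error handling is trimmed.

```objc
// A sketch of looking up an entry point and building its compute pipeline state.
#import <Metal/Metal.h>

id<MTLComputePipelineState> CreatePipelineForEntryPoint(id<MTLDevice> device,
                                                        id<MTLLibrary> library,
                                                        NSString* entry_point) {
  id<MTLFunction> function = [library newFunctionWithName:entry_point];
  if (!function) return nil;  // No such entry point in the library.
  NSError* error = nil;
  id<MTLComputePipelineState> pipeline =
      [device newComputePipelineStateWithFunction:function error:&error];
  [function release];  // The real MetalKernelLibrary keeps the MTLFunction around too.
  return pipeline;     // nil (with `error` populated) on failure.
}
```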
### Executable cache
IREE [`hal::ExecutableCache`][hal-executable-cache] models a cache of prepared
GPU executables for a particular device. At the moment the Metal HAL driver
does not perform any caching of GPU programs; it simply reads the program from
the FlatBuffer and hands it over to the Metal framework.
### DescriptorSetLayout, DescriptorSet, ExecutableLayout
See [Resource descriptors](#resource-descriptors) for more details.
## Compute Pipeline
### Shader/kernel compilation
Metal has [Metal Shading Language (MSL)][msl-spec] for authoring graphics
shaders and compute kernels. MSL source code can be directly consumed by the
Metal framework at run-time; it can also be compiled first into an opaque
library using [command-line tools][msl-cl-library] at build-time.
IREE uses compilers to compile ML models expressed with high-level op semantics
down to a GPU-native source format. This is also the case for the Metal HAL
driver. Metal does not provide an open intermediate language, so we reuse the
[SPIR-V code generation pipeline][spirv-codegen] and then cross compile the
generated SPIR-V into MSL source with [SPIRV-Cross][spirv-cross]. This is
actually a fairly common practice for targeting multiple GPU APIs in the
graphics programming world. For example, the Vulkan implementation on
macOS/iOS, [MoltenVK][moltenvk], does the same for shaders/kernels. The path
is quite robust, as demonstrated by various games running on top of MoltenVK.
Therefore, in IREE we have a [`MetalSPIRVTargetBackend`][metal-spirv-target],
which pulls in the normal MHLO-to-Linalg and Linalg-to-SPIR-V passes to form
the compilation pipeline. The difference is that a suitable SPIR-V target
environment is needed to drive the compilation, which one can derive from the
Metal GPU families being targeted. (Not implemented yet; TODO for the future.)
The serialization step also differs from
[`VulkanSPIRVTargetBackend`][vulkan-spirv-target]: following the normal
SPIR-V serialization step, we additionally need to invoke SPIRV-Cross to
cross compile the generated SPIR-V into MSL, and then compile and/or serialize
the MSL source/library.
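For reference, the SPIR-V to MSL step can be done with the SPIRV-Cross C++ API
roughly as sketched below; the real compiler-side code also needs to configure
further MSL options and handle resource binding assignment.

```objc
// A sketch of cross compiling SPIR-V into MSL with SPIRV-Cross.
#include <cstdint>
#include <string>
#include <vector>

#include "spirv_msl.hpp"  // From the SPIRV-Cross project.

std::string CrossCompileSpirvToMsl(std::vector<uint32_t> spirv_binary) {
  spirv_cross::CompilerMSL compiler(std::move(spirv_binary));
  spirv_cross::CompilerMSL::Options options;
  options.platform = spirv_cross::CompilerMSL::Options::macOS;  // Or iOS.
  compiler.set_msl_options(options);
  return compiler.compile();  // Returns the generated MSL source.
}
```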
IREE uses [FlatBuffer][flatbuffer] to encode the whole workload module,
including both GPU shaders/kernels (called executables in IREE terminology) and
CPU scheduling logic. The GPU executables are embedded as part of the module's
FlatBuffer, which is [`mmap`][mmap]ped when IREE runs.
For the Metal HAL driver, it means we need to embed the MSL kernels inside the
module FlatBuffer. Right now we just encode the MSL source strings and compile
them at Metal run-time. In the future this should be changed to allow encoding
the library instead.
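Compiling the embedded MSL source at run-time then comes down to
`newLibraryWithSource:options:error:`, roughly as in this sketch; the chosen
MSL language version is an assumption.

```objc
// A sketch of compiling MSL source extracted from the module FlatBuffer at run-time.
#import <Metal/Metal.h>

id<MTLLibrary> CompileMslSource(id<MTLDevice> device, NSString* msl_source) {
  MTLCompileOptions* options = [[MTLCompileOptions alloc] init];
  options.languageVersion = MTLLanguageVersion2_0;  // Assumed MSL version.
  NSError* error = nil;
  id<MTLLibrary> library = [device newLibraryWithSource:msl_source
                                                options:options
                                                  error:&error];
  [options release];
  return library;  // nil (with `error` populated) if compilation fails.
}
```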
### Workgroup/threadgroup size
When dispatching a compute kernel in Metal, we need to specify the number of
threadgroups in the grid and the number of threads per threadgroup. Both are
3-D vectors. IREE HAL, which follows Vulkan, calls them the workgroup count and
workgroup size, respectively.
In the Vulkan programming model, the workgroup count and workgroup size are
specified in different places: the former is given when invoking
[`vkCmdDispatch()`][vulkan-cmd-dispatch], while the latter is encoded in the
dispatched SPIR-V code. This split does not match the Metal model, where we
specify both in the API with `dispatchThreads:threadsPerThreadgroup:`.
As described in [shader/kernel compilation](#shader-kernel-compilation), MSL
kernels are cross compiled from SPIR-V code and then embedded in the module
FlatBuffer. The module FlatBuffer provides us a way to convey the
threadgroup/workgroup size information extracted from the SPIR-V code: we
encode an additional 3-D vector for each entry point and use it as the
threadgroup size when later dispatching the `MTLFunction` corresponding to the
entry point.
### Resource descriptors
A descriptor is an opaque handle pointing to a resource that is accessed in
the compute kernel. IREE's HAL is inspired by the Vulkan API; it models several
concepts related to GPU resource management explicitly:
* [`hal::DescriptorSetLayout`][hal-descriptor-set-layout]: a schema for
describing an array of descriptor bindings. Each descriptor binding specifies
the resource type, access mode and other information.
* [`hal::DescriptorSet`][hal-descriptor-set]: a concrete set of resources that
gets bound to a compute pipeline in a batch. It must match the
`DescriptorSetLayout` describing its layout. A `DescriptorSet` can be thought of
as the "object" of the `DescriptorSetLayout` "class".
* [`hal::ExecutableLayout`][hal-executable-layout]: a schema for describing all
the resources accessed by a compute pipeline. It includes zero or more
`DescriptorSetLayout`s and (optional) push constants.
One can create `DescriptorSetLayout`, `DescriptorSet`, and `ExecutableLayout`
objects beforehand to avoid overhead during tight compute loops and to amortize
costs by sharing these objects. However, this doesn't fully match Metal's
paradigm.
In the Metal framework, the closest concept to `DescriptorSet` is the [argument
buffer][mtl-argument-buffer]. There are no direct counterparts to
`DescriptorSetLayout` and `ExecutableLayout`; rather, the layout is implicitly
encoded in Metal shaders as MSL structs. The APIs for creating argument buffers
do not encourage early creation without pipelines: one typically creates them
for each `MTLFunction`. Besides, unlike Vulkan, where different descriptor sets
can have the same binding number, in Metal even if we have multiple argument
buffers, the indices for resources live in the same namespace and are typically
assigned sequentially. That means we need to remap `DescriptorSet`s with a set
number greater than zero by applying an offset to each of their bindings.
All of this means it's better to defer the creation of the argument buffer
until the point of compute pipeline creation and dispatch. Therefore, although
the Metal HAL driver provides the implementation for `DescriptorSet`
(i.e., `hal::metal::MetalArgumentBuffer`), `DescriptorSetLayout` (i.e.,
`hal::metal::MetalArgumentBufferLayout`), and `ExecutableLayout` (i.e.,
`hal::metal::MetalPipelineArgumentBufferLayout`), they are just containers
holding the information up until the [command buffer
dispatch](#command-buffer-dispatch) time.
With the above said, the overall idea is still to map one descriptor set to one
argument buffer. It just means we need to condense and remap the bindings.
### Command buffer dispatch
`MetalCommandBuffer::Dispatch()` performs the following steps with the currently
active `MTLComputeCommandEncoder` (see the sketch after the list):
1. Bind the `MTLComputePipelineState` for the current entry function queried
from `MetalKernelLibrary`.
1. For each bound descriptor set at set #`S`:
1. Create a [`MTLArgumentEncoder`][mtl-argument-encoder] for encoding an
associated argument `MTLBuffer`.
1. For each bound resource buffer at binding #`B` in this descriptor set,
encode it to the argument buffer index #`B` with
`setBuffer:offset:atIndex:` and inform the `MTLComputeCommandEncoder`
that the dispatch will use this resource with `useResource:usage:`.
1. Set the argument `MTLBuffer` to buffer index #`S`.
1. Dispatch with `dispatchThreadgroups:threadsPerThreadgroup:`.
(TODO: condense and remap bindings)
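A rough sketch of this flow is shown below; caching, the binding
condensing/remapping mentioned above, and lifetime details are omitted, and the
helper signature is illustrative rather than the actual IREE code.

```objc
// A rough sketch of MetalCommandBuffer::Dispatch() for a single descriptor set.
#import <Metal/Metal.h>
#include <vector>

void DispatchWithArgumentBuffer(id<MTLDevice> device,
                                id<MTLComputeCommandEncoder> encoder,
                                id<MTLComputePipelineState> pipeline_state,
                                id<MTLFunction> entry_function,
                                NSUInteger set_number,
                                const std::vector<id<MTLBuffer>>& bindings,
                                MTLSize workgroup_count, MTLSize workgroup_size) {
  // 1. Bind the pipeline state queried from MetalKernelLibrary.
  [encoder setComputePipelineState:pipeline_state];

  // 2. Encode the descriptor set's buffers into an argument buffer.
  id<MTLArgumentEncoder> arg_encoder =
      [entry_function newArgumentEncoderWithBufferIndex:set_number];
  id<MTLBuffer> arg_buffer = [device newBufferWithLength:arg_encoder.encodedLength
                                                 options:MTLResourceStorageModeShared];
  [arg_encoder setArgumentBuffer:arg_buffer offset:0];
  for (NSUInteger binding = 0; binding < bindings.size(); ++binding) {
    [arg_encoder setBuffer:bindings[binding] offset:0 atIndex:binding];
    // Declare usage so Metal makes the indirectly referenced buffer resident.
    [encoder useResource:bindings[binding]
                   usage:MTLResourceUsageRead | MTLResourceUsageWrite];
  }
  // Expose the argument buffer itself at buffer index #S.
  [encoder setBuffer:arg_buffer offset:0 atIndex:set_number];

  // 3. Dispatch with the threadgroup size recorded for this entry point.
  [encoder dispatchThreadgroups:workgroup_count threadsPerThreadgroup:workgroup_size];

  [arg_encoder release];
  [arg_buffer release];  // The committed command buffer keeps it alive until completion.
}
```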
## Memory Management
### Storage type
Metal provides four [`MTLStorageMode`][mtl-storage-mode] options:
* `MTLStorageModeShared`: The resource is stored in system memory and is
accessible to both the CPU and the GPU.
* `MTLStorageModeManaged`: The CPU and GPU may maintain separate copies of the
resource, and any changes must be explicitly synchronized.
* `MTLStorageModePrivate`: The resource can be accessed only by the GPU.
* `MTLStorageModeMemoryless`: The resource’s contents can be accessed only by the
GPU and only exist temporarily during a render pass.
Among them, `MTLStorageModeManaged` is only available on macOS.
IREE HAL defines several [`MemoryType`][hal-buffer] bits. They need to map to
the above storage modes (see the sketch after the list):
* If `kDeviceLocal` but not `kHostVisible`, `MTLStorageModePrivate` is chosen.
* If `kDeviceLocal` and `kHostVisible`:
* If macOS, `MTLStorageModeManaged` can be chosen.
* Otherwise, `MTLStorageModeShared` is chosen.
* If not `kDeviceLocal` but `kDeviceVisible`, `MTLStorageModeShared` is chosen.
* If not `kDeviceLocal` and not `kDeviceVisible`, `MTLStorageModeShared` is
chosen. (TODO: We should probably use host buffer here.)
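A sketch of this mapping, where the boolean parameters stand in for the
corresponding `MemoryType` bits and are purely illustrative:

```objc
// A sketch of mapping IREE memory types to Metal storage modes.
#import <Metal/Metal.h>
#include <TargetConditionals.h>

MTLResourceOptions SelectStorageMode(bool device_local, bool host_visible) {
  if (device_local && !host_visible) return MTLResourceStorageModePrivate;
  if (device_local && host_visible) {
#if TARGET_OS_OSX
    return MTLResourceStorageModeManaged;  // Managed storage is macOS-only.
#else
    return MTLResourceStorageModeShared;
#endif
  }
  // Not device-local (whether device-visible or not): shared for now.
  return MTLResourceStorageModeShared;
}
```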
IREE HAL also allows creating buffers with the `kHostCoherent` bit. These may
still be backed by `MTLStorageModeManaged` `MTLBuffer`s on macOS. To respect the
`kHostCoherent` contract, the Metal HAL driver performs the necessary
`Invalidate`/`Flush` operations automatically under the hood.
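On macOS those operations map to `didModifyRange:` for flushing CPU writes and
a blit `synchronizeResource:` for invalidating (pulling back GPU writes),
roughly as sketched below; the helper names and the synchronous wait are
illustrative, not the actual IREE implementation.

```objc
// A sketch of flush/invalidate for MTLStorageModeManaged buffers on macOS.
#import <Metal/Metal.h>

void FlushMappedRange(id<MTLBuffer> buffer, NSUInteger offset, NSUInteger length) {
  if (buffer.storageMode == MTLStorageModeManaged) {
    [buffer didModifyRange:NSMakeRange(offset, length)];  // CPU -> GPU.
  }
}

void InvalidateMappedBuffer(id<MTLCommandQueue> queue, id<MTLBuffer> buffer) {
  if (buffer.storageMode != MTLStorageModeManaged) return;
  // GPU -> CPU: a blit "synchronize" must run on the GPU timeline.
  id<MTLCommandBuffer> command_buffer = [queue commandBuffer];
  id<MTLBlitCommandEncoder> blit = [command_buffer blitCommandEncoder];
  [blit synchronizeResource:buffer];
  [blit endEncoding];
  [command_buffer commit];
  [command_buffer waitUntilCompleted];
}
```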
[macos-version-share]: https://gs.statcounter.com/macos-version-market-share/desktop/worldwide
[ios-version-share]: https://developer.apple.com/support/app-store/
[iree-hal]: https://github.com/google/iree/tree/main/iree/hal
[iree-metal]: https://github.com/google/iree/tree/main/iree/hal/metal
[iree-refptr]: https://github.com/google/iree/blob/main/iree/base/ref_ptr.h
[hal-allocator]: https://github.com/google/iree/blob/main/iree/hal/allocator.h
[hal-buffer]: https://github.com/google/iree/blob/main/iree/hal/buffer.h
[hal-command-queue]: https://github.com/google/iree/blob/main/iree/hal/command_queue.h
[hal-command-buffer]: https://github.com/google/iree/blob/main/iree/hal/command_buffer.h
[hal-descriptor-set]: https://github.com/google/iree/blob/main/iree/hal/descriptor_set.h
[hal-descriptor-set-layout]: https://github.com/google/iree/blob/main/iree/hal/descriptor_set_layout.h
[hal-executable-layout]: https://github.com/google/iree/blob/main/iree/hal/executable_layout.h
[hal-device]: https://github.com/google/iree/blob/main/iree/hal/device.h
[hal-driver]: https://github.com/google/iree/blob/main/iree/hal/driver.h
[hal-executable]: https://github.com/google/iree/blob/main/iree/hal/executable.h
[hal-executable-cache]: https://github.com/google/iree/blob/main/iree/hal/executable_cache.h
[hal-semaphore]: https://github.com/google/iree/blob/main/iree/hal/semaphore.h
[metal-command-queue]: https://github.com/google/iree/blob/main/iree/hal/metal/metal_command_queue.h
[metal-command-buffer]: https://github.com/google/iree/blob/main/iree/hal/metal/metal_command_buffer.h
[metal-device]: https://github.com/google/iree/blob/main/iree/hal/metal/metal_device.h
[metal-driver]: https://github.com/google/iree/blob/main/iree/hal/metal/metal_driver.h
[metal-kernel-library]: https://github.com/google/iree/blob/main/iree/hal/metal/metal_kernel_library.h
[metal-shared-event]: https://github.com/google/iree/blob/main/iree/hal/metal/metal_shared_event.h
[metal-spirv-target]: https://github.com/google/iree/tree/hal-metal/iree/compiler/Dialect/HAL/Target/MetalSPIRV
[mtl-argument-buffer]: https://developer.apple.com/documentation/metal/buffers/about_argument_buffers?language=objc
[mtl-argument-encoder]: https://developer.apple.com/documentation/metal/mtlargumentencoder?language=objc
[mtl-buffer]: https://developer.apple.com/documentation/metal/mtlbuffer?language=objc
[mtl-command-buffer]: https://developer.apple.com/documentation/metal/mtlcommandbuffer?language=objc
[mtl-command-encoder]: https://developer.apple.com/documentation/metal/mtlcommandencoder?language=objc
[mtl-command-queue]: https://developer.apple.com/documentation/metal/mtlcommandqueue?language=objc
[mtl-device]: https://developer.apple.com/documentation/metal/mtldevice?language=objc
[mtl-function]: https://developer.apple.com/documentation/metal/mtlfunction?language=objc
[mtl-heap]: https://developer.apple.com/documentation/metal/mtlheap?language=objc
[mtl-library]: https://developer.apple.com/documentation/metal/mtllibrary?language=objc
[mtl-shared-event]: https://developer.apple.com/documentation/metal/mtlsharedevent?language=objc
[mtl-storage-mode]: https://developer.apple.com/documentation/metal/mtlstoragemode?language=objc
[msl-spec]: https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf
[msl-cl-library]: https://developer.apple.com/documentation/metal/libraries/building_a_library_with_metal_s_command-line_tools?language=objc
[objc-arc]: https://en.wikipedia.org/wiki/Automatic_Reference_Counting
[objcxx]: https://en.wikipedia.org/wiki/Objective-C#Objective-C++
[flatbuffer]: https://google.github.io/flatbuffers/
[mmap]: https://en.wikipedia.org/wiki/Mmap
[moltenvk]: https://github.com/KhronosGroup/MoltenVK
[spirv-codegen]: https://google.github.io/iree/design-docs/codegen-passes
[spirv-cross]: https://github.com/KhronosGroup/SPIRV-Cross
[vma]: https://github.com/GPUOpen-LibrariesAndSDKs/VulkanMemoryAllocator
[vulkan-spirv-target]: https://github.com/google/iree/tree/hal-metal/iree/compiler/Dialect/HAL/Target/VulkanSPIRV
[vulkan-cmd-dispatch]: https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/vkCmdDispatch.html