[metal] Add technical details README file
diff --git a/experimental/metal/README.md b/experimental/metal/README.md
new file mode 100644
index 0000000..0ca61ac
--- /dev/null
+++ b/experimental/metal/README.md
@@ -0,0 +1,351 @@
+# IREE Metal HAL Driver
+
+This document lists technical details regarding the Metal HAL driver. Note that
+the Metal HAL driver is a work in progress and experimental; this document is
+expected to be updated along the way.
+
+IREE provides a [Hardware Abstraction Layer (HAL)][iree-hal] as a common
+interface to different compute accelerators. IREE HAL's design draws inspiration
+from modern GPU architecture and APIs; so implementing a HAL driver using modern
+GPU APIs is generally straightforward. This applies to the Metal HAL driver.
+
+## Overall Design Choices
+
+### Metal Versions
+
+Currently the Metal HAL driver expects Metal 3 capabilities. Metal 3 was
+released in late 2022 and is supported on macOS Ventura, iOS 16, and later.
+It [covers][metal-feature-set] recent Apple silicon GPUs, including A13+ and
+M1+ chips, among others.
+
+In the future, we expect to expand support to cover Metal 2 capabilities.
+Metal 2 introduces useful features like argument buffers that are necessary
+for performance and make the IREE HAL implementation simpler. Metal 2 was
+released in late 2017 and is supported on macOS High Sierra, iOS 11, and
+later; it is already dominant ([macOS][macos-version-share],
+[iOS][ios-version-share]).
+
+### Programming Languages and Libraries
+
+The Metal HAL driver lives under the [`experimental/metal/`][iree-metal]
+directory right now. Once more stable, it will be graduated into the
+`runtime/src/iree/hal/drivers/` directory like other HAL drivers.
+
+The Metal framework only exposes Objective-C and Swift APIs, while the Metal
+HAL driver needs to implement the common HAL abstraction definitions, which
+are in C. To minimize dependencies and binary size and to increase
+performance, we use Metal's Objective-C API for implementing the Metal HAL
+driver. Header (`.h`) and implementation (`.m`) files are placed adjacent to
+each other.
+
+### Object Lifetime Management
+
+Objective-C uses reference counting for tracking object lifetime and managing
+memory. This is traditionally done manually by sending `retain` and `release`
+messages to Objective-C objects. Modern Objective-C allows developers to opt
+in to [Automatic Reference Counting][objc-arc], which lets the compiler
+automatically deduce and insert `retain`/`release` calls where possible to
+ease the burden of manual management.
+
+We don't use ARC in the Metal HAL driver given that IREE has its own object
+refcount and lifetime management mechanism. Metal HAL GPU objects are tracked
+with that mechanism to be consistent with other HAL drivers. Each Metal HAL
+GPU object `retain`s the underlying Metal `id<MTL*>` object on construction
+and `release`s it on destruction.
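+
+As a minimal sketch of this pattern (the struct and function names below are
+illustrative, not the actual driver code):
+
+```objc
+#import <Metal/Metal.h>
+
+// Wrapper around a Metal buffer using manual retain/release (no ARC).
+// The IREE resource header (vtable + refcount) is omitted for brevity.
+typedef struct iree_hal_metal_buffer_t {
+  id<MTLBuffer> buffer;  // Holds a +1 reference for the wrapper's lifetime.
+} iree_hal_metal_buffer_t;
+
+static void iree_hal_metal_buffer_wrap(id<MTLBuffer> metal_buffer,
+                                       iree_hal_metal_buffer_t* out_buffer) {
+  // Keep the Metal object alive as long as the IREE HAL object lives.
+  out_buffer->buffer = [metal_buffer retain];
+}
+
+static void iree_hal_metal_buffer_destroy(iree_hal_metal_buffer_t* buffer) {
+  // Called once IREE's own refcount for the HAL object reaches zero.
+  [buffer->buffer release];
+}
+```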
+
+## GPU Objects
+
+Metal is one of the main modern GPU APIs that provide explicit control over
+the hardware. The mapping between IREE HAL classes and Metal protocols is
+relatively straightforward:
+
+IREE HAL Class | Metal Protocol
+:-------------------------------------------------------------: | :--------------------------------------:
+[`iree_hal_driver_t`][hal-driver] | N/A
+[`iree_hal_device_t`][hal-device] | [`MTLDevice`][mtl-device]
+[`iree_hal_command_buffer_t`][hal-command-buffer] | [`MTLCommandBuffer`][mtl-command-buffer]
+[`iree_hal_semaphore_t`][hal-semaphore] | [`MTLSharedEvent`][mtl-shared-event]
+[`iree_hal_allocator_t`][hal-allocator] | N/A
+[`iree_hal_buffer_t`][hal-buffer] | [`MTLBuffer`][mtl-buffer]
+[`iree_hal_executable_t`][hal-executable] | [`MTLLibrary`][mtl-library]
+[`iree_hal_executable_cache_t`][hal-executable-cache] | N/A
+[`iree_hal_descriptor_set_layout_t`][hal-descriptor-set-layout] | N/A
+[`iree_hal_pipeline_layout_t`][hal-pipeline-layout] | N/A
+
+In the following subsections, we go over each pair to provide more details.
+
+### Driver
+
+There is no native driver abstraction in Metal. IREE's Metal HAL driver still
+provides an [`iree_hal_metal_driver_t`][metal-driver] struct to implement the
+common [`iree_hal_driver_t`][hal-driver] struct. `iree_hal_metal_driver_t`
+simply `retain`s all available Metal devices in the system during its
+lifetime, to guarantee that we return the same `id<MTLDevice>` for device
+querying and creation.
+
+### Device
+
+[`iree_hal_metal_device_t`][metal-device] implements [`iree_hal_device_t`][hal-device]
+to provide the interface to a Metal GPU device by wrapping an `id<MTLDevice>`.
+Upon construction, `iree_hal_metal_device_t` creates and retains one queue for
+both dispatch and transfer during its lifetime. In the future we expect to
+support multiple queues for better concurrency.
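+
+A minimal sketch of the wrapping (field and function names are illustrative):
+
+```objc
+#import <Metal/Metal.h>
+
+// Illustrative device wrapper: one id<MTLDevice> plus a single command queue
+// shared by dispatch and transfer work.
+typedef struct iree_hal_metal_device_t {
+  id<MTLDevice> device;       // Retained for the device's lifetime.
+  id<MTLCommandQueue> queue;  // Created on construction, used for all work.
+} iree_hal_metal_device_t;
+
+static void iree_hal_metal_device_initialize(
+    id<MTLDevice> metal_device, iree_hal_metal_device_t* out_device) {
+  out_device->device = [metal_device retain];
+  // Methods prefixed with `new` already return a +1 reference under MRR.
+  out_device->queue = [metal_device newCommandQueue];
+}
+
+static void iree_hal_metal_device_deinitialize(
+    iree_hal_metal_device_t* device) {
+  [device->queue release];
+  [device->device release];
+}
+```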
+
+#### Command buffer submission
+
+In IREE HAL, command buffers are created directly from the
+`iree_hal_device_t` and are also submitted there directly via
+`iree_hal_device_queue_execute()`. Each submission takes a batch of command
+buffers, together with a list of waiting `iree_hal_semaphore_t`s and a list
+of signaling `iree_hal_semaphore_t`s. There is no direct mapping of such a
+structure in Metal, so we perform the submission in three steps:
+
+1. Create a new `MTLCommandBuffer` to `encodeWaitForEvent:value` for all
+ waiting `iree_hal_semaphore_t`s and commit this command buffer.
+1. Commit all command buffers in the submission batch.
+1. Create a new `MTLCommandBuffer` to `encodeSignalEvent:value` for all
+ signaling `iree_hal_semaphore_t`s and commit this command buffer.
+
+Such submission enables asynchronous execution of the workload on the GPU.
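+
+The following sketch illustrates these three steps; the helper signature and
+parameter names are assumptions for illustration, not the driver's actual
+API:
+
+```objc
+#import <Metal/Metal.h>
+
+// Illustrative three-step submission: wait, execute batch, signal.
+static void SubmitBatch(id<MTLCommandQueue> queue,
+                        NSArray<id<MTLSharedEvent>>* wait_events,
+                        const uint64_t* wait_values,
+                        NSArray<id<MTLCommandBuffer>>* batch,
+                        NSArray<id<MTLSharedEvent>>* signal_events,
+                        const uint64_t* signal_values) {
+  // 1. A command buffer that waits on all semaphores before the batch runs.
+  id<MTLCommandBuffer> wait_cb = [queue commandBuffer];
+  for (NSUInteger i = 0; i < wait_events.count; ++i) {
+    [wait_cb encodeWaitForEvent:wait_events[i] value:wait_values[i]];
+  }
+  [wait_cb commit];
+
+  // 2. The already-recorded command buffers of the batch, in order.
+  for (id<MTLCommandBuffer> command_buffer in batch) {
+    [command_buffer commit];
+  }
+
+  // 3. A command buffer that signals all semaphores once the batch is done.
+  id<MTLCommandBuffer> signal_cb = [queue commandBuffer];
+  for (NSUInteger i = 0; i < signal_events.count; ++i) {
+    [signal_cb encodeSignalEvent:signal_events[i] value:signal_values[i]];
+  }
+  [signal_cb commit];
+}
+```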
+
+#### Queue-ordered allocation
+
+Queue-ordered asynchronous allocations via `iree_hal_device_queue_alloc` are
+not fully supported yet; they currently translate to a blocking wait plus an
+allocation.
+
+#### Collectives
+
+Collectives support is not yet implemented.
+
+#### Profiling
+
+The Metal HAL driver supports profiling via `MTLCaptureManager`. We can
+either capture to a trace file or capture directly into Xcode.
+
+To profile from the command line, pass `--device_profiling_mode=queue
+--device_profiling_file=/path/to/metal.gputrace` to IREE binaries.
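+
+Under the hood, a queue capture to a `.gputrace` file roughly looks like the
+following sketch (the helper name and path are placeholders; capturing to a
+file typically also requires enabling Metal capture, e.g. via the
+`METAL_CAPTURE_ENABLED=1` environment variable):
+
+```objc
+#import <Metal/Metal.h>
+
+// Start capturing all work submitted to `queue` into a .gputrace document.
+static BOOL StartQueueCapture(id<MTLCommandQueue> queue, NSString* path) {
+  MTLCaptureDescriptor* descriptor = [[MTLCaptureDescriptor alloc] init];
+  descriptor.captureObject = queue;
+  descriptor.destination = MTLCaptureDestinationGPUTraceDocument;
+  descriptor.outputURL = [NSURL fileURLWithPath:path];
+  NSError* error = nil;
+  BOOL started = [[MTLCaptureManager sharedCaptureManager]
+      startCaptureWithDescriptor:descriptor error:&error];
+  [descriptor release];
+  return started;
+}
+
+// ... submit and wait for the workload, then:
+// [[MTLCaptureManager sharedCaptureManager] stopCapture];
+```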
+
+### Command buffer
+
+Command buffers are where the IREE HAL and the Metal API differ the most.
+
+IREE HAL command buffers follow the flat Vulkan recording model, where all memory
+or dispatch commands are recorded into a command buffer directly.
+Unlike Vulkan, Metal adopts a multi-level command recording model:
+memory/dispatch commands are not recorded into a command buffer directly;
+rather, they must go through an additional level of blit/compute encoders.
+Implementing IREE's HAL on Metal therefore requires switching encoders for
+interleaved memory and dispatch commands.
+Additionally, certain IREE HAL API features have no direct mapping in the
+Metal APIs, e.g., various forms of IREE HAL execution/memory barriers.
+Translating them requires looking at both previous and next commands to
+decide the proper mapping.
+
+For these reasons, it's beneficial to have a complete view of the full
+command buffer and extra flexibility during recording, in order to fix up
+past commands or inspect future commands.
+
+Therefore, to implement IREE HAL command buffers on Metal, we record in two
+steps using a linked list of command segments: first we create segments to
+keep track of all IREE HAL commands and their associated data; then, when
+finalizing the command buffer, we iterate through all the segments and record
+their contents into a proper `MTLCommandBuffer`. A linked list lets us
+organize the command sequence with low overhead, and deferred recording gives
+us the complete picture of the command buffer when recording actually starts.
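+
+The data structure could look roughly like the following sketch; the real
+segment types and payloads in the driver may differ:
+
+```objc
+// Hypothetical shape of the deferred-recording segment list.
+typedef enum {
+  IREE_HAL_METAL_SEGMENT_FILL_BUFFER,
+  IREE_HAL_METAL_SEGMENT_COPY_BUFFER,
+  IREE_HAL_METAL_SEGMENT_DISPATCH,
+  IREE_HAL_METAL_SEGMENT_BARRIER,
+} iree_hal_metal_segment_type_t;
+
+typedef struct iree_hal_metal_command_segment_t {
+  struct iree_hal_metal_command_segment_t* next;  // Singly linked list.
+  iree_hal_metal_segment_type_t type;
+  // Per-type payload follows: buffers, offsets, pipeline state, workgroup
+  // counts, barrier scopes, etc.
+} iree_hal_metal_command_segment_t;
+
+// On finalization: walk the list once, switch between blit/compute encoders
+// only when the segment type requires it, and record everything into a
+// single MTLCommandBuffer.
+```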
+
+The Metal HAL driver right now only supports one-shot command buffers, which
+map directly to `MTLCommandBuffer`s.
+
+#### Fill/copy/update buffer
+
+Metal APIs for filling and copying buffers have alignment restrictions on the
+offset and length, while `iree_hal_command_buffer_{fill|copy|update}_buffer()`
+is more flexible. For cases that aren't directly supported by the Metal APIs,
+we use [polyfill compute kernels][metal-builtin-kernels] to perform the
+memory operation using GPU threads.
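+
+As an illustration of the decision for fills (the helper, the exact alignment
+constants, and the fallback entry point are assumptions, not the driver's
+actual code):
+
+```objc
+#import <Metal/Metal.h>
+
+// Use the built-in blit fill when the pattern and alignment allow it;
+// otherwise fall back to a polyfill compute kernel.
+static void FillBuffer(id<MTLCommandBuffer> command_buffer,
+                       id<MTLBuffer> target, NSUInteger offset,
+                       NSUInteger length, uint8_t pattern_byte,
+                       BOOL pattern_is_single_byte) {
+  // Assumed restriction for the blit path; see the Metal documentation for
+  // the exact offset/length requirements.
+  BOOL aligned = (offset % 4 == 0) && (length % 4 == 0);
+  if (pattern_is_single_byte && aligned) {
+    id<MTLBlitCommandEncoder> blit = [command_buffer blitCommandEncoder];
+    [blit fillBuffer:target
+               range:NSMakeRange(offset, length)
+               value:pattern_byte];
+    [blit endEncoding];
+  } else {
+    // Polyfill path: dispatch a small compute kernel (see builtin/) that
+    // writes the 1/2/4-byte pattern from GPU threads.
+    // DispatchFillPolyfill(command_buffer, target, offset, length, ...);
+  }
+}
+```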
+
+### Semaphore
+
+[`iree_hal_semaphore_t`][hal-semaphore] allows host->device, device->host,
+host->host, and device->device synchronization. It maps to Vulkan timeline
+semaphores. In the Metal world, the counterpart is
+[`MTLSharedEvent`][mtl-shared-event]. Most of the `iree_hal_semaphore_t` APIs
+are simple to implement in the [Metal shared event][metal-shared-event]
+implementation, with `iree_hal_semaphore_wait()` as an exception: a listener
+is registered on the `MTLSharedEvent` with `notifyListener:atValue:block:`;
+when it fires, it signals a semaphore that wakes the current thread, which
+was put to sleep waiting on that semaphore.
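+
+A minimal sketch of such a blocking wait (the listener queue handling and the
+helper shape are illustrative; a real implementation would reuse the listener
+and support deadlines):
+
+```objc
+#import <Metal/Metal.h>
+#import <dispatch/dispatch.h>
+
+// Block the calling thread until `event` reaches `value`.
+static void WaitForValue(id<MTLSharedEvent> event, uint64_t value) {
+  if (event.signaledValue >= value) return;  // Already satisfied.
+
+  dispatch_queue_t notify_queue =
+      dispatch_queue_create("iree.metal.semaphore", DISPATCH_QUEUE_SERIAL);
+  MTLSharedEventListener* listener =
+      [[MTLSharedEventListener alloc] initWithDispatchQueue:notify_queue];
+  dispatch_semaphore_t wakeup = dispatch_semaphore_create(0);
+
+  // The block runs once the shared event reaches `value`; it wakes the
+  // thread sleeping on `wakeup` below.
+  [event notifyListener:listener
+                atValue:value
+                  block:^(id<MTLSharedEvent> e, uint64_t v) {
+                    dispatch_semaphore_signal(wakeup);
+                  }];
+  dispatch_semaphore_wait(wakeup, DISPATCH_TIME_FOREVER);
+
+  dispatch_release(wakeup);
+  [listener release];
+  dispatch_release(notify_queue);
+}
+```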
+
+### Allocator
+
+At the moment the Metal HAL driver has a very simple
+[`iree_hal_allocator_t`][hal-allocator] implementation: it wraps an
+`MTLDevice` and redirects all allocation requests to it, with no paging,
+pooling, or sub-allocation. This is meant to be used together with common
+allocator layers like the caching allocator.
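+
+A sketch of the pass-through allocation (the storage mode selection is
+simplified; the real heuristics depend on the requested memory type and
+usage):
+
+```objc
+#import <Metal/Metal.h>
+
+// Allocate a buffer straight from the MTLDevice.
+static id<MTLBuffer> AllocateBuffer(id<MTLDevice> device, NSUInteger size,
+                                    BOOL host_visible) {
+  // Host-visible memory uses shared storage (unified memory on Apple
+  // silicon); device-local-only memory uses private storage.
+  MTLResourceOptions options = host_visible ? MTLResourceStorageModeShared
+                                            : MTLResourceStorageModePrivate;
+  return [device newBufferWithLength:size options:options];  // +1 reference.
+}
+```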
+
+### Buffer
+
+IREE [`iree_hal_buffer_t`][hal-buffer] maps to Metal `MTLBuffer`. See
+[Memory Management](#memory-management) for more details.
+
+### Executable
+
+IREE [`iree_hal_executable_t`][hal-executable] represents a GPU program archive with
+a driver-defined format. It maps naturally to Metal [`MTLLibrary`][mtl-library].
+An entry point in a `MTLLibrary` is a [`MTLFunction`][mtl-function]. We define
+[`iree_hal_metal_kernel_params_t`][metal-kernel-library] to wrap a
+`MTLLibrary`, its `MTLFunction`s, and the `MTLComputePipelineState` objects
+constructed from those `MTLFunction`s.
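+
+A sketch of how the per-entry-point objects might be populated (the struct
+fields and the helper are illustrative; only the
+`iree_hal_metal_kernel_params_t` name comes from the driver):
+
+```objc
+#import <Metal/Metal.h>
+
+// Illustrative per-entry-point bundle: library, function, pipeline state.
+typedef struct {
+  id<MTLLibrary> library;
+  id<MTLFunction> function;
+  id<MTLComputePipelineState> pipeline_state;
+} iree_hal_metal_kernel_params_t;
+
+static BOOL CreateKernelParams(id<MTLDevice> device, id<MTLLibrary> library,
+                               NSString* entry_point,
+                               iree_hal_metal_kernel_params_t* out_params) {
+  NSError* error = nil;
+  id<MTLFunction> function = [library newFunctionWithName:entry_point];
+  if (!function) return NO;
+  id<MTLComputePipelineState> pipeline =
+      [device newComputePipelineStateWithFunction:function error:&error];
+  if (!pipeline) {
+    [function release];
+    return NO;
+  }
+  out_params->library = [library retain];
+  out_params->function = function;  // Already +1 from newFunctionWithName:.
+  out_params->pipeline_state = pipeline;
+  return YES;
+}
+```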
+
+### Executable cache
+
+IREE [`iree_hal_executable_cache_t`][hal-executable-cache] models a cache of
+prepared GPU executables for a particular device. At the moment the Metal HAL
+driver does not perform any caching of GPU programs; it simply reads the
+program from the FlatBuffer and hands it over to the Metal driver.
+
+### Descriptor set / pipeline layout
+
+See [Resource descriptors](#resource-descriptors) for more details.
+
+## Compute Pipeline
+
+### Shader/kernel compilation
+
+Metal has [Metal Shading Language (MSL)][msl-spec] for authoring graphics
+shaders and compute kernels. MSL source code can be directly consumed by the
+Metal framework at run-time; it can also be compiled first into an opaque
+library using [command-line tools][msl-cl-library] at build-time.
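+
+For the run-time path, consuming MSL source boils down to something like this
+sketch (the helper name and options are placeholders):
+
+```objc
+#import <Metal/Metal.h>
+
+// Compile MSL source into a library at run-time.
+static id<MTLLibrary> CompileMSLSource(id<MTLDevice> device,
+                                       NSString* msl_source) {
+  MTLCompileOptions* options = [[MTLCompileOptions alloc] init];
+  NSError* error = nil;
+  id<MTLLibrary> library = [device newLibraryWithSource:msl_source
+                                                options:options
+                                                  error:&error];
+  [options release];
+  // Returns nil on failure; `error` carries the compiler diagnostics.
+  return library;
+}
+```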
+
+IREE uses compilers to compile ML models expressed with high-level op semantics
+down to GPU native source format. This is also the case for the Metal HAL
+driver. Metal does not provide an open intermediate language; we reuse the
+SPIR-V code generation pipeline and then cross compile the generated SPIR-V into
+MSL source with [SPIRV-Cross][spirv-cross]. This is a fairly common practice
+for targeting multiple GPU APIs in the graphics programming world; for
+example, the Vulkan implementation on macOS/iOS, [MoltenVK][moltenvk], does
+the same for shaders/kernels. The path is quite robust, as demonstrated by
+the various games running on top of MoltenVK.
+
+Therefore, in IREE, we have a [`MetalSPIRVTargetBackend`][metal-spirv-target],
+which pulls in the common SPIR-V passes to form the compilation pipeline.
+The main difference is providing a suitable SPIR-V target environment to
+drive the compilation, which one can derive from the Metal GPU family being
+targeted. The serialization step also differs from
+[`VulkanSPIRVTargetBackend`][vulkan-spirv-target]: after the normal SPIR-V
+serialization step, we additionally invoke SPIRV-Cross to cross-compile the
+generated SPIR-V into MSL, and then compile and/or serialize the MSL
+source/library.
+
+IREE uses [FlatBuffers][flatbuffer] to encode the whole workload module,
+including both the GPU shaders/kernels (called executables in IREE
+terminology) and the CPU scheduling logic. The GPU executables are embedded
+as part of the module's FlatBuffer, which is [`mmap`][mmap]ped when IREE runs.
+
+For the Metal HAL driver, this means we need to embed the MSL kernels inside
+the module FlatBuffer. Right now we can either encode the MSL source strings
+and compile them at Metal run-time, or directly encode a compiled library
+instead.
+
+### Workgroup/threadgroup size
+
+When dispatching a compute kernel in Metal, we need to specify the number of
+threadgroups in the grid and the number of threads in a threadgroup; both are
+3-D vectors. IREE HAL, which follows Vulkan, calls them the workgroup count
+and workgroup size, respectively.
+
+In the Vulkan programming model, the workgroup count and workgroup size are
+specified in different places: the former is given when invoking
+[`vkCmdDispatch()`][vulkan-cmd-dispatch], while the latter is encoded in the
+dispatched SPIR-V code. This split does not match the Metal model, where both
+are specified in the API with `dispatchThreads:threadsPerThreadgroup:`.
+
+As said in [shader/kernel compilation](#shader-kernel-compilation), MSL
+kernels are cross-compiled from SPIR-V code and then embedded in the module
+FlatBuffer. The module FlatBuffer gives us a way to convey the
+threadgroup/workgroup size information extracted from the SPIR-V code: we
+encode an additional 3-D vector for each entry point and use it as the
+threadgroup size when later dispatching the `MTLFunction` corresponding to
+that entry point.
+
+### Resource descriptors
+
+A descriptor is an opaque handle pointing to a resource that is accessed in
+the compute kernel. IREE's HAL models several concepts related to GPU resource
+management explicitly:
+
+* [`iree_hal_descriptor_set_layout_t`][hal-descriptor-set-layout]: a schema for
+ describing an array of descriptor bindings. Each descriptor binding specifies
+ the resource type, access mode and other information.
+* [`iree_hal_pipeline_layout_t`][hal-pipeline-layout]: a schema for describing
+ all the resources accessed by a compute pipeline. It includes zero or more
+ descriptor set layouts and (optionally) push constants.
+
+However, this does not fully match Metal's paradigm. In the Metal framework,
+the closest concept to descriptor sets is [argument buffers][mtl-argument-buffer].
+There is no direct correspondence to descriptor set layouts and pipeline
+layouts; rather, the layout is implicitly encoded in Metal shaders as MSL
+structs. The APIs for creating argument buffers do not encourage early
+creation without pipelines: one typically creates them for each `MTLFunction`.
+
+All of this means it's better to defer creating argument buffers until
+compute pipeline creation and dispatch. Therefore, the Metal HAL driver's
+`iree_hal_metal_descriptor_set_layout_t` and
+`iree_hal_metal_pipeline_layout_t` are just containers holding the
+information needed for recording
+[command buffer dispatch](#command-buffer-dispatch).
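+
+To make the deferral concrete, here is a sketch of building the argument
+buffer for a descriptor set at dispatch-recording time; the helper and its
+parameters are illustrative:
+
+```objc
+#import <Metal/Metal.h>
+
+// Build the argument buffer for descriptor set #`set` of an entry point.
+// The encoder must come from the entry point's MTLFunction, which is why
+// layout objects alone cannot create it ahead of time.
+static id<MTLBuffer> MakeArgumentBuffer(id<MTLDevice> device,
+                                        id<MTLFunction> function,
+                                        NSUInteger set) {
+  id<MTLArgumentEncoder> encoder =
+      [function newArgumentEncoderWithBufferIndex:set];
+  id<MTLBuffer> buffer =
+      [device newBufferWithLength:encoder.encodedLength
+                          options:MTLResourceStorageModeShared];
+  [encoder setArgumentBuffer:buffer offset:0];
+  // ... setBuffer:offset:atIndex: for each binding in the set ...
+  [encoder release];
+  return buffer;  // +1; released once the command buffer retains it.
+}
+```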
+
+### Command buffer dispatch
+
+Metal HAL driver command buffer dispatch recording performs the following
+steps with the currently active `MTLComputeCommandEncoder` (see the sketch
+after this list):
+
+1. Bind the `MTLComputePipelineState` for the current entry function.
+1. Encode the push constants using `setBytes:length:atIndex:`.
+1. For each bound descriptor set at set #`S`:
+ 1. Create a [`MTLArgumentEncoder`][mtl-argument-encoder] for encoding an
+ associated argument `MTLBuffer`.
+ 1. For each bound resource buffer at binding #`B` in this descriptor set,
+ encode it to the argument buffer index #`B` with
+ `setBuffer:offset:atIndex:` and inform the `MTLComputeCommandEncoder`
+ that the dispatch will use this resource with `useResource:usage:`.
+ 1. Set the argument `MTLBuffer` to buffer index #`S`.
+1. Dispatch with `dispatchThreadgroups:threadsPerThreadgroup:`.
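+
+The following sketch puts the steps above together for a single descriptor
+set. Every name in it is illustrative rather than the driver's actual API,
+and it relies on the command buffer retaining referenced resources (the
+default in Metal), so local references can be dropped after encoding:
+
+```objc
+#import <Metal/Metal.h>
+
+// Record one dispatch into the active compute encoder.
+static void RecordDispatch(id<MTLDevice> device,
+                           id<MTLComputeCommandEncoder> encoder,
+                           id<MTLComputePipelineState> pipeline,
+                           id<MTLFunction> function,
+                           const void* push_constants,
+                           NSUInteger push_constants_size,
+                           NSUInteger push_constants_buffer_index,
+                           NSArray<id<MTLBuffer>>* bindings, NSUInteger set,
+                           MTLSize workgroup_count, MTLSize workgroup_size) {
+  // 1. Bind the compute pipeline for the entry point.
+  [encoder setComputePipelineState:pipeline];
+  // 2. Push constants are passed as inline bytes.
+  [encoder setBytes:push_constants
+             length:push_constants_size
+            atIndex:push_constants_buffer_index];
+  // 3. Encode descriptor set #`set` into an argument buffer.
+  id<MTLArgumentEncoder> arg_encoder =
+      [function newArgumentEncoderWithBufferIndex:set];
+  id<MTLBuffer> arg_buffer =
+      [device newBufferWithLength:arg_encoder.encodedLength
+                          options:MTLResourceStorageModeShared];
+  [arg_encoder setArgumentBuffer:arg_buffer offset:0];
+  for (NSUInteger binding = 0; binding < bindings.count; ++binding) {
+    // Offsets fixed at 0 for simplicity.
+    [arg_encoder setBuffer:bindings[binding] offset:0 atIndex:binding];
+    [encoder useResource:bindings[binding]
+                   usage:MTLResourceUsageRead | MTLResourceUsageWrite];
+  }
+  [encoder setBuffer:arg_buffer offset:0 atIndex:set];
+  // 4. Launch: workgroup count from the HAL dispatch call, workgroup size
+  //    from the 3-D vector recorded in the module FlatBuffer.
+  [encoder dispatchThreadgroups:workgroup_count
+          threadsPerThreadgroup:workgroup_size];
+  [arg_encoder release];
+  [arg_buffer release];
+}
+```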
+
+[metal-feature-set]: https://developer.apple.com/metal/Metal-Feature-Set-Tables.pdf
+[macos-version-share]: https://gs.statcounter.com/macos-version-market-share/desktop/worldwide
+[ios-version-share]: https://developer.apple.com/support/app-store/
+[iree-hal]: https://github.com/openxla/iree/tree/main/runtime/src/iree/hal
+[iree-metal]: https://github.com/openxla/iree/tree/main/experimental/metal
+[hal-allocator]: https://github.com/openxla/iree/blob/main/runtime/src/iree/hal/allocator.h
+[hal-buffer]: https://github.com/openxla/iree/blob/main/runtime/src/iree/hal/buffer.h
+[hal-command-buffer]: https://github.com/openxla/iree/blob/main/runtime/src/iree/hal/command_buffer.h
+[hal-descriptor-set-layout]: https://github.com/openxla/iree/blob/main/runtime/src/iree/hal/pipeline_layout.h
+[hal-pipeline-layout]: https://github.com/openxla/iree/blob/main/runtime/src/iree/hal/pipeline_layout.h
+[hal-device]: https://github.com/openxla/iree/blob/main/runtime/src/iree/hal/device.h
+[hal-driver]: https://github.com/openxla/iree/blob/main/runtime/src/iree/hal/driver.h
+[hal-executable]: https://github.com/openxla/iree/blob/main/runtime/src/iree/hal/executable.h
+[hal-executable-cache]: https://github.com/openxla/iree/blob/main/runtime/src/iree/hal/executable_cache.h
+[hal-semaphore]: https://github.com/openxla/iree/blob/main/runtime/src/iree/hal/semaphore.h
+[metal-device]: https://github.com/openxla/iree/tree/main/experimental/metal/metal_device.h
+[metal-driver]: https://github.com/openxla/iree/tree/main/experimental/metal/metal_driver.h
+[metal-kernel-library]: https://github.com/openxla/iree/tree/main/experimental/metal/metal_kernel_library.h
+[metal-shared-event]: https://github.com/openxla/iree/tree/main/experimental/metal/metal_shared_event.h
+[metal-spirv-target]: https://github.com/openxla/iree/tree/main/compiler/src/iree/compiler/Dialect/HAL/Target/MetalSPIRV
+[metal-builtin-kernels]: https://github.com/openxla/iree/tree/main/experimental/metal/builtin/
+[mtl-argument-buffer]: https://developer.apple.com/documentation/metal/buffers/about_argument_buffers?language=objc
+[mtl-argument-encoder]: https://developer.apple.com/documentation/metal/mtlargumentencoder?language=objc
+[mtl-buffer]: https://developer.apple.com/documentation/metal/mtlbuffer?language=objc
+[mtl-command-buffer]: https://developer.apple.com/documentation/metal/mtlcommandbuffer?language=objc
+[mtl-command-encoder]: https://developer.apple.com/documentation/metal/mtlcommandencoder?language=objc
+[mtl-device]: https://developer.apple.com/documentation/metal/mtldevice?language=objc
+[mtl-function]: https://developer.apple.com/documentation/metal/mtlfunction?language=objc
+[mtl-library]: https://developer.apple.com/documentation/metal/mtllibrary?language=objc
+[mtl-shared-event]: https://developer.apple.com/documentation/metal/mtlsharedevent?language=objc
+[mtl-storage-mode]: https://developer.apple.com/documentation/metal/mtlstoragemode?language=objc
+[msl-spec]: https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf
+[msl-cl-library]: https://developer.apple.com/documentation/metal/libraries/building_a_library_with_metal_s_command-line_tools?language=objc
+[objc-arc]: https://en.wikipedia.org/wiki/Automatic_Reference_Counting
+[flatbuffer]: https://google.github.io/flatbuffers/
+[mmap]: https://en.wikipedia.org/wiki/Mmap
+[moltenvk]: https://github.com/KhronosGroup/MoltenVK
+[spirv-cross]: https://github.com/KhronosGroup/SPIRV-Cross
+[vulkan-spirv-target]: https://github.com/openxla/iree/tree/main/compiler/src/iree/compiler/Dialect/HAL/Target/VulkanSPIRV
+[vulkan-cmd-dispatch]: https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/vkCmdDispatch.html