[HAL/AMDGPU] Initial host-side AMDGPU HAL implementation (#24298)

This PR lands IREE's native AMDGPU HAL driver: a direct HSA/ROCR backend that owns queue submission, packet construction, memory placement, command-buffer recording/replay, profiling, counters, device-library selection, and future scheduling policy inside IREE instead of routing normal execution through HIP. The cost is ~70kLoC, but that buys IREE direct ownership of AMD GPU execution instead of routing through HIP streams and HIP graphs.

The critical unlocks happen because IREE already knows the real program structure that HIP tries to guess at: explicit semaphore frontiers, queue affinity, memory types, binding tables, reusable command-buffer blocks, executable metadata, profiling scopes, and replay captures. The native driver turns that structure directly into AQL packets and queue-local completion state, which lets us do things HIP cannot naturally express: low-overhead dynamic command buffers, heterogeneous HAL device groups, future remote execution, device-side fixup/scheduling, and profiling/replay from the same command model.

The early numbers show the shape of the win: ~12.5x lower submit overhead for cross-queue dependency edges, ~22x lower dynamic graph construction tax versus HIP graphs on a 512-dispatch chain, and ~20x lower steady-state host CPU time on queue-heavy submission paths.

This is v0, but it is already the architecture we want to optimize: fewer compatibility layers, more explicit contracts, and a path where AMD GPUs participate in the full HAL ecosystem instead of living behind a HIP-shaped abstraction boundary.

This is intentionally a large PR. The driver is not a thin shim around one runtime call; it is the runtime boundary for AMD GPUs.
The branch contains the native driver plus the AMDGPU-specific hardening that made the final shape reviewable: command-buffer replay cleanup, queue/pool integration, profiling producers, target-library selection, device capability handling, tests, and developer documentation.

The headlines:

- IREE now has a native AMDGPU execution path based on HSA queues and AQL packets.
- The driver can run normal HAL dispatches and reusable HAL command buffers without HIP streams or HIP graphs.
- The command-buffer representation is designed as a durable block program that can be replayed by host processors now and device-side processors later.
- The profiling path can expose queue, dispatch, executable, counter, device metric, and ATT/SQTT trace data through the HAL profile tooling.
- The hot paths are structured so static production replay does not pay for optional profiling, trace, upload, or future device-fixup machinery.

## Why Not HIP

HIP is a useful compatibility layer and comparison point, but it is not the right abstraction boundary for the runtime work IREE wants to do. IREE needs to be able to control:

- how HAL queue operations become AQL/PM4 packets;
- where kernargs, command-buffer templates, transient buffers, and staging records live;
- how semaphore dependencies map to queue frontiers and completion epochs;
- how reusable command buffers are recorded, validated, replayed, and profiled;
- where host work ends and queue-ordered device work begins;
- how to capture profiling data without turning the production queue into a debug path; and
- how to evolve toward device-side command-buffer scheduling and fixup.

HIP graphs are especially awkward for IREE's dynamic command-buffer use case. They can be expensive to construct, hard to introspect, and difficult to shape around IREE's own async allocation and replay contracts. The native driver gives IREE a graph-like reusable command stream while keeping the command stream in IREE's own ABI.
## Design Principles

The implementation follows a few constraints that are worth making explicit for review.

**Own the production hot path.** Queue submission, command-buffer replay, kernarg formation, packet publication, and completion are explicit IREE code. Optional features are allowed only when they do not tax the default path. For example, profiling, ATT/SQTT capture, queue-control upload rings, and future device-side fixup all have opt-in storage and control flow.

**Record facts once.** Command buffers are allowed to do work while recording and finalizing so replay can be simple. Binding counts, patch counts, packet counts, barrier requirements, prepublication eligibility, rodata references, and block terminators are recorded in the command-buffer program instead of rediscovered by scanning command records during submission.

**Keep host and device processors pointed at the same ABI.** The AMDGPU command buffer is a block program, not a host-only replay script. The current host AQL block processor consumes that program; future device-side processors should consume the same block format for command-buffer continuations, scheduling, and kernarg fixup.

**Separate invariant clusters.** The driver is split by subsystem rather than growing one giant queue file. There are distinct files for queue submission, queue waits, command-buffer block processing, command-buffer replay, profiling augmentation, staging/file paths, memory operations, executable handling, topology, device capabilities, and utility rings.

**Fail loud on unsupported strategies.** Unsupported memory paths, command forms, profiling modes, and device capabilities should fail with a concrete status instead of silently falling back through the wrong mechanism.

**Make platform/device variation explicit.** The code names the places where HSA memory-pool access, HDP publication, topology links, target IDs, device-library coverage, Linux KFD metrics, and optional ROCm profiling libraries affect behavior.
## Architecture Overview

### Driver And Device Model

The driver dynamically loads HSA/ROCR, discovers CPU and GPU agents, and creates logical HAL devices over one or more physical AMDGPU agents. The main object split is:

- driver: HSA discovery, option parsing, and logical-device creation;
- logical device: HAL-facing device object and shared runtime state;
- physical device: one HSA GPU agent with queues, memory pools, executable cache, device-library selection, profiling state, device metrics, and topology facts;
- host queue: HSA queue plus IREE's AQL, kernarg, notification, completion, and reclaim state; and
- virtual queue: the internal interface used so command-buffer, direct dispatch, memory, file, and profiling paths route through one queue contract.

Device selection supports all visible AMDGPU agents by default, single-device selection, UUID-based selection, ordinal selection, and multi-device logical devices. The topology code records HSA memory-pool access, link class, NUMA distance, coherency, atomics, and interop capability facts so future placement and transfer strategies can reason about PCIe, xGMI, and other link types without hard-coded assumptions.

### Executables And Device Libraries

AMDGPU executables are loaded from HSACO/code-object data and matched against the selected physical device. The runtime also embeds AMDGPU device libraries used for builtin operations such as fill/copy helpers, timestamp helpers, and dispatch-side utilities.

The device-library target map is single-sourced from generated target metadata. Builds can select exact targets, LLVM generic targets, TheRock-style generic families, or product bundles. This keeps package size and device coverage under explicit build-system control while letting the runtime fail clearly when a required target was not embedded.

### Memory, Pools, And Publication

The driver integrates with the HAL pool substrate and AMDGPU HSA memory pools instead of treating all buffers as generic allocations.
The implementation distinguishes:

- device-local memory;
- CPU-visible fine-grained host memory;
- CPU-visible coarse-grained device memory;
- queue-owned kernarg memory;
- optional queue-control upload memory;
- transient allocation pools;
- file/staging storage; and
- host-side block/slab pools used by queue and profiling data structures.

HDP publication is represented as a selected capability of the memory path, not as an ad hoc flush sprinkled through dispatch code. If CPU writes to memory that the GPU will consume require publication on a device, the queue-owned memory path knows how to publish those writes before the relevant packet headers become visible.

The default queue-control upload ring is disabled until a production consumer opts in. That keeps the future device-side fixup path available without charging every queue an unused HSA allocation.

### Queue Submission And Completion

Host queues own an HSA AQL queue and maintain:

- an AQL ring view for packet reservation/publication;
- a kernarg ring for queue-owned dispatch arguments;
- an epoch/notification ring mapping GPU completions to HAL semaphore signals;
- a queue frontier snapshot for dependency tracking;
- one completion thread that drains queue epochs and publishes user-visible semaphore completions;
- optional PM4 IB slots indexed by AQL packet id on hardware that supports AQL PM4 packets; and
- optional profiling/counter/trace state.

Submission is serialized per queue, but independent queues do not synchronize with each other. The queue submission path reserves AQL packets, kernargs, and notification entries before publishing headers. If admission fails, reclaim is routed through the same notification/reclaim machinery instead of inventing a parallel cleanup path.
HAL ordering is represented by semaphore/frontier dependencies, not by assuming FIFO execution. The queue frontier machinery lets the driver elide redundant waits when the dependency is already known to be satisfied, while preserving correctness when the frontier overflows or cannot prove elision.

### Direct Dispatch And Builtin Operations

Direct `queue_dispatch` resolves executable metadata, validates dispatch shape, forms kernargs, retains the executable/buffer resources required by the submission, and emits AQL packets through the common queue submission path.

Queue buffer operations are implemented through explicit strategies. Builtin device kernels cover fill/copy/update paths and are selected based on alignment, size, and available device-library kernels. The code leaves room for SDMA, PM4, P2P, and future direct-storage strategies without conflating those with the current kernel-dispatch path.

### Command Buffers

The AMDGPU command-buffer ABI is the center of the rewrite. Recorded command buffers are stored as a program of blocks. Each block has a fixed header with command counts, binding-source counts, packet/kernarg worst case, rodata extent, dispatch/profile-marker counts, barrier metadata, and a terminator. Commands include barriers, dispatches, fills, copies, updates, profile markers, branches, conditional branches, and returns.

The important split is:

- the command buffer owns the durable block program and rodata;
- the AQL block processor consumes one block and writes reserved packet/kernarg storage;
- host queue replay is the container/orchestration layer that initializes a processor, invokes blocks, handles continuations, and integrates with semaphores/reclaim; and
- profiling processors are separate variants that augment replay only when profiling was explicitly requested.

This shape is deliberate. A block processor is close to a small interpreter over the block ABI.
It is suitable for dedicated tests today and for device-side processor variants later. Host queue code should not need to know how every command body becomes AQL packets.

Replay hot paths are specialized:

- static reusable dispatches can use prepublished kernargs;
- all-dynamic dispatches use a direct binding-pointer scatter path;
- mixed static/dynamic reusable dispatches use immutable templates plus recorded dynamic patch sources;
- indirect dispatch parameters stay on the generic path where required; and
- profile-disabled replay bypasses profile sidecars and trace/counter logic.

Dynamic binding sources retain the original `queue_execute` binding table slot for the entire command-buffer lifetime. There is no per-block binding remap sidecar, and no finalization scan that rewrites binding slots. Future device-side fixup should consume recorded patch records directly: `patch offset + binding table slot + binding offset`.

### Profiling, Counters, Traces, And Replay

The driver is a first-class producer for the HAL-native profiling and replay stack. Supported profiling/data modes include:

- host-side memory and queue events;
- device-side queue timestamps;
- per-dispatch timestamps;
- executable/export metadata;
- hardware/software counters;
- queue-range PMC sampling;
- device metrics from platform-specific sources;
- filtered ATT/SQTT executable traces through dynamically loaded ROCm profiling libraries; and
- replay captures that can be run, benchmarked, dumped, and profiled outside the original application.

Normal execution does not require ROCm profiling libraries. The aqlprofile path is dynamically loaded only for modes that need counters or executable traces. Linux-specific device-metric support is isolated behind a platform source so the core driver remains structured for future Windows and macOS HSA support.
## Performance Evidence

The main apples-to-apples GPU comparison uses the SDXL CLIP prompt encoder: a real sharktank workload with 792 dispatches, 28 executables, and enough queue traffic to exercise command-buffer replay and host/runtime overhead. Post-cleanup optimized non-Tracy medians:

| Shape | AMDGPU wall | HIP stream wall | AMDGPU vs stream | HIP graph wall | AMDGPU vs graph | AMDGPU host CPU |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| c1/d1 | 10.9508 ms | 11.5456 ms | 5.15% faster | 11.6199 ms | 5.76% faster | 0.618 ms |
| c1/d16 | 0.7035 ms/item | 0.7311 ms/item | 3.78% faster | 0.7335 ms/item | 4.09% faster | 0.036 ms/item |
| c2/d16 | 0.7073 ms/item | 0.7298 ms/item | 3.08% faster | 0.7330 ms/item | 3.50% faster | 0.037 ms/item |
| c4/d16 | 0.7066 ms/item | 0.7278 ms/item | 2.92% faster | 0.7288 ms/item | 3.05% faster | 0.037 ms/item |
| c8/d16 | 0.7058 ms/item | 0.7322 ms/item | 3.60% faster | 0.7333 ms/item | 3.75% faster | 0.038 ms/item |

The broader model spread is consistent with the same story: native AMDGPU is usually ahead of HIP stream, usually ahead of HIP graph when HIP graph can import the workload, and uses much less host CPU on queue-heavy paths. Representative additional rows:

| Workload | Shape | AMDGPU | HIP stream | HIP graph | Notes |
| --- | --- | ---: | ---: | ---: | --- |
| MNIST-12 | c1/d1 | 0.0978 ms | 0.1423 ms | 0.1425 ms | Small classifier, high runtime-overhead sensitivity. |
| SqueezeNet 1.0 | c1/d1 | 1.1428 ms | 1.2043 ms | 1.1988 ms | Compact CNN. |
| toy CLIP bf16 | c1/d1 | 0.2227 ms | 0.2578 ms | 0.2597 ms | Transformer-ish toy encoder. |
| MobileNetV2-12 | c1/d1 | 1.8462 ms | 1.9316 ms | crash | Depthwise/mobile CNN; HIP graph crashes locally. |
| TinyYOLOv2-8 | c1/d1 | 7.6516 ms | 8.0490 ms | 8.5600 ms | Object detection graph. |
| ResNet50-v1-12 | c1/d1 | 9.5364 ms | 9.6900 ms | import fails | HIP graph node limit. |
| SDXL scheduled UNet | c1/d1 body | 204.36 ms | 215.19 ms | 216.43 ms | Direct `run_forward` body. |
| SDXL CLIP prompt encoder | c8/d16 | 0.692 ms | 0.721 ms | 0.725 ms | Byte-identical HSACO/no-prefetch row. |

We also compared raw C HAL command-buffer construction/replay against raw C HIP graph construction/launch for a 512 dispatch/barrier chain, avoiding VM overhead on both sides:

| Path | Prebuilt wall | Dynamic wall | Extra wall | Extra wall / dispatch | Extra CPU / dispatch |
| --- | ---: | ---: | ---: | ---: | ---: |
| HAL command buffer, validated | 2096.4 us | 2177.0 us | 80.5 us | 0.157 us | 0.582 us |
| HAL command buffer, unvalidated | 2096.4 us | 2143.3 us | 46.9 us | 0.092 us | 0.526 us |
| HIP graph | 2983.7 us | 4022.9 us | 1039.3 us | 2.030 us | 2.308 us |

That is the key dynamic-command-buffer result: unvalidated HAL command-buffer recording/replay adds tens of microseconds for the 512-pair chain, while HIP graph construction adds about a millisecond in the same harness.

Queue-stress microbenchmarks isolate the pathological submission streams that large distributed and graph-style applications care about. The current-head HAL rows below use the checked-in AMDGPU `queue_benchmark` built optimized with release ThinLTO/O3/native flags, pinned to one CPU and one local RDNA3 GPU. HIP rows use the matching HIP event ping-pong harness on the same CPU/GPU pin. The end-to-end rows measure 512 cross-queue dependency edges plus one public host-visible completion:

| Shape | AMDGPU end-to-end / edge | HIP end-to-end / edge | Read |
| --- | ---: | ---: | --- |
| Cross-queue dependency edge | 4.58 us | 11.20 us | AMDGPU is 2.4x faster. |
| Edge + 4-byte device copy | 11.65 us | 14.62 us | AMDGPU is 1.25x faster. |
| Edge + 4-byte device fill | 10.98 us | 15.20 us | AMDGPU is 1.38x faster. |
| Edge + tiny dispatch | 10.55 us | 14.59 us | AMDGPU is 1.38x faster. |
| Edge + no-op dispatch packet | 4.56 us | n/a | AMDGPU stays near the pure dependency floor when payload work is empty. |

The pure submit-only dependency row is the sharpest host-path comparison: AMDGPU submits a cross-queue dependency edge for about 0.42 us/edge, while HIP events cost about 5.23 us/edge in the same pinned harness. That is about 12.5x less host-side submission overhead for the synchronization pattern used by tensor-parallel and pipeline-parallel programs.

This is not just an implementation-speed comparison. HIP stream events and HIP graphs sit above a compatibility runtime that has to rediscover intent from streams, events, graph nodes, kernel parameters, and raw pointer arguments. IREE already has that intent in structured HAL commands: explicit semaphore frontiers, queue affinity, binding tables, memory types, command-buffer blocks, and executable metadata. The AMDGPU HAL can turn those contracts directly into AQL packets and queue-local completion state without routing every operation through HIP's public stream/event/graph abstraction.

That structural difference is why the CPU-time story is as important as the wall-time story. On the SDXL CLIP prompt encoder, AMDGPU runs the steady-state batched path with roughly 0.036-0.038 ms/item of host CPU time while HIP stream and HIP graph paths are around 0.74-0.76 ms/item. That is a roughly 20x host CPU reduction on the queue-heavy path. On systems with many accelerators, expensive prefill/decode scheduling, or small CPU budgets, that difference is the difference between the CPU being orchestration glue and the CPU becoming the bottleneck.

The same abstraction boundary is also what lets HAL scale beyond HIP's world model. HAL command buffers, semaphores, queue affinity, memory files, and device groups can describe local GPUs, CPU devices, remote devices, and heterogeneous execution without changing the program's synchronization model.
The upcoming remote HAL work can use the same command/dependency concepts across process or machine boundaries; HIP cannot represent that kind of heterogeneous or remote execution graph without collapsing it back into host-side framework logic. This rewrite puts AMDGPU on the same HAL substrate as local-task, local-sync, profiling, replay, and future remote execution instead of treating AMD GPUs as a HIP-shaped island.

Tracy and Perfetto captures were used as structural evidence for queue shape, host/runtime gaps, worker behavior, dispatch timing, counter ranges, and device metric sampling. Non-Tracy optimized runs are the source of the wall-time numbers above.

## Portability And Hardware Coverage

The current implementation has been exercised primarily on local RDNA3/gfx1100 Linux hardware, but the code is structured for broader AMDGPU support. Cross-device preparation in this PR includes:

- target ID parsing and generated target maps for exact, generic, family, and product-bundle device-library selection;
- explicit HSA memory-pool access and link-topology modeling;
- CPU-visible device-coarse memory capability selection with HDP publication;
- queue-owned kernarg publication policy;
- PM4 capability detection and AQL PM4 IB infrastructure where supported;
- generic device-library target selection instead of hard-coding gfx1100; and
- tests around target IDs, code-object target selection, topology, memory access, device-library lookup, and PM4/AQL emitters.

Cross-platform preparation includes:

- dynamic HSA loading instead of a direct link dependency;
- platform-isolated Linux KFD/device-metric support;
- optional dynamic loading of ROCm profiling libraries;
- public HAL abstractions for profiling/replay rather than AMDGPU-only tool hooks; and
- explicit failure for unsupported platform features.

This PR does not claim every modern RDNA/CDNA target is fully proven.
It gives us the driver architecture, target map, and capability seams required to harden that matrix as more hardware and platform HSA stacks become available.

## Forward-Looking Work Enabled By This Shape

Several important features are intentionally not completed in this PR, but the landed architecture is designed around them.

**Device-side dynamic kernarg fixup.** Dynamic command buffers currently patch queue-owned kernargs on the host. The planned production path is to upload a small per-submission binding table/control record and dispatch a device-side fixup kernel that copies template kernargs and patches dynamic qwords before payload dispatches execute. The recorded command-buffer patch records already carry the essential facts: target patch location, original binding-table slot, and binding offset.

**Device-side command-buffer scheduling.** The block-program ABI gives us a clean path to device-side processors. A device queue can invoke block processors, advance command-buffer continuations, and schedule independent blocks without forcing host queue code to understand every command body.

**Command-buffer control flow.** The ABI already reserves branch, conditional branch, and return terminators. Host replay currently supports the subset needed by the landed workloads; the representation is intentionally shaped so richer control flow can become an execution feature rather than a new command-buffer format.

**Binding-table-indirect dispatch ABI.** A future dispatch ABI may avoid dynamic kernarg pointer fixup by passing an invocation-local binding table base and loading buffer pointers indirectly in kernels. That needs compiler/runtime experiments to measure the cost of an extra scalar load versus raw pointer kernargs, but the current direct binding-table slot invariant is compatible with that direction.

**PM4-backed queues and operations.** The driver now has PM4 emitters, PM4 program utilities, capability detection, and AQL PM4 IB slots on supported hardware.
That creates room for PM4-backed waits, transfers, profiling snippets, and potentially lower-level queue strategies where HSA/AQL alone is not the best mechanism.

**Transfer strategy expansion.** Current transfer paths use explicit builtin device kernels and staging strategies. The queue/file/memory split leaves room for SDMA, P2P, direct storage, and topology-aware copy selection without rewriting the core queue completion path.

**Broader profiling.** CDNA devices should expose richer counter options than the initial local setup. The queue-range PMC and profile-bundle infrastructure are meant to scale into that environment without changing the normal execution path.

## Review Guide

Good entry points for review:

- `runtime/src/iree/hal/drivers/amdgpu/README.md`: user-facing driver overview, build flags, runtime selection, profiling, and target-library notes.
- `runtime/src/iree/hal/drivers/amdgpu/api.h`: public driver/device options.
- `runtime/src/iree/hal/drivers/amdgpu/driver.c`: driver registration, HSA loading, and device creation.
- `runtime/src/iree/hal/drivers/amdgpu/logical_device.c`: HAL device methods, profiling/replay integration, and physical-device orchestration.
- `runtime/src/iree/hal/drivers/amdgpu/physical_device.c`: HSA agent setup, queue creation, memory pools, executable caches, device libraries, profiling, and topology state.
- `runtime/src/iree/hal/drivers/amdgpu/host_queue.c`: queue ownership, completion thread, submission state, and reclaim lifetime.
- `runtime/src/iree/hal/drivers/amdgpu/host_queue_submission.c`: common submission admission, publication, and failure/reclaim path.
- `runtime/src/iree/hal/drivers/amdgpu/aql_command_buffer.c`: command-buffer recording, layout, prepublication, dynamic binding strategy, and block construction.
- `runtime/src/iree/hal/drivers/amdgpu/abi/command_buffer.h`: durable command-buffer block ABI.
- `runtime/src/iree/hal/drivers/amdgpu/aql_block_processor.c`: unprofiled AQL block processor.
- `runtime/src/iree/hal/drivers/amdgpu/aql_block_processor_profile.c`: profiling-augmented block processor.
- `runtime/src/iree/hal/drivers/amdgpu/host_queue_command_buffer*.c`: host replay orchestration, block submission, packet policy, scratch storage, and profiling integration.
- `runtime/src/iree/hal/drivers/amdgpu/profile_*.c`: profile producers for events, metadata, counters, device metrics, and traces.
- `runtime/src/iree/hal/drivers/amdgpu/device/*.c`: embedded device-side helper kernels and host-side packet/kernarg formation helpers.
- `runtime/src/iree/hal/drivers/amdgpu/util/*.c`: HSA loading, target IDs, code-object metadata, rings, signals, PM4/AQL emitters, topology, and KFD utilities.

## Validation

Validation covered both source-level unit tests and workload-level evidence:

- focused AMDGPU unit tests for HSA loading, target IDs, code-object metadata, device libraries, topology, capabilities, pools, signals, rings, emitters, executables, semaphores, allocators, command buffers, block processors, host queue submission, staging, profiling metadata/events, and CTS backends;
- AMDGPU HAL CTS dispatch/executable coverage;
- focused Linux Bazel ASAN builds/tests for the AMDGPU runtime targets;
- focused CMake configure/build/test coverage for AMDGPU runtime libraries and generated CTS artifacts;
- Windows and macOS CMake validation of the shared HAL/async/profile/replay substrate that this driver depends on;
- SDXL CLIP correctness on both visible local AMDGPU devices with the same weights, inputs, and expected outputs used for CPU validation;
- SDXL CLIP, SDXL UNet, model-spread, command-buffer-vs-HIP-graph, Tracy, Perfetto, device-metrics, PMC, and ATT/SQTT profiling runs; and
- pre-commit formatting/check generation hooks for the final branch.

The performance numbers in this PR are from optimized non-Tracy runs on my machine, YMMV.
Tracy, Perfetto, counters, and device metrics were used to explain structure and validate behavior, not as the source of wall-clock claims.
IREE (Intermediate Representation Execution Environment, pronounced as "eerie") is an MLIR-based end-to-end compiler and runtime that lowers Machine Learning (ML) models to a unified IR that scales up to meet the needs of the datacenter and down to satisfy the constraints and special considerations of mobile and edge deployments.
See our website for project details, user guides, and instructions on building from source.
Release notes are published on GitHub releases.
| Package | Release status |
|---|---|
| GitHub release (stable) | |
| GitHub release (nightly) | |
| iree-base-compiler | |
| iree-base-runtime | |
For more details on the release process, see https://iree.dev/developers/general/release-management/.
| Operating system | Build status |
|---|---|
| Linux | |
| macOS | |
| Windows | |
For the full list of workflows see https://iree.dev/developers/general/github-actions/.
See our website for more information.
Community meeting recordings: IREE YouTube channel
| Date | Title | Recording | Slides |
|---|---|---|---|
| 2025-06-10 | Data-Tiling in IREE: Achieving High Performance Through Compiler Design (AsiaLLVM) | recording | slides |
| 2025-05-17 | Introduction to GPU architecture and IREE's GPU CodeGen Pipeline | recording | slides |
| 2025-02-12 | The Long Tail of AI: SPIR-V in IREE and MLIR (Vulkanised) | recording | slides |
| 2024-10-01 | Unveiling the Inner Workings of IREE: An MLIR-Based Compiler for Diverse Hardware | recording | |
| 2021-06-09 | IREE Runtime Design Tech Talk | recording | slides |
| 2020-08-20 | IREE CodeGen (MLIR Open Design Meeting) | recording | slides |
| 2020-03-18 | Interactive HAL IR Walkthrough | recording | |
| 2020-01-31 | End-to-end MLIR Workflow in IREE (MLIR Open Design Meeting) | recording | slides |
IREE is licensed under the terms of the Apache 2.0 License with LLVM Exceptions. See LICENSE for more information.