[HAL/AMDGPU] Initial host-side AMDGPU HAL implementation (#24298)

This PR lands IREE's native AMDGPU HAL driver: a direct HSA/ROCR backend
that owns queue submission, packet construction, memory placement,
command-buffer recording/replay, profiling, counters, device-library
selection, and future scheduling policy inside IREE instead of routing
normal execution through HIP. The cost is ~70kLoC, but that buys IREE
direct ownership of AMD GPU execution rather than a dependency on HIP
streams and HIP graphs.

The critical unlocks happen because IREE already knows the real program
structure that HIP has to guess at: explicit semaphore frontiers, queue
affinity, memory types, binding tables, reusable command-buffer blocks,
executable metadata, profiling scopes, and replay captures. The native
driver turns that structure directly into AQL packets and queue-local
completion state, which lets us do things HIP cannot naturally express:
low-overhead dynamic command buffers, heterogeneous HAL device groups,
future remote execution, device-side fixup/scheduling, and
profiling/replay from the same command model.

The early numbers show the shape of the win: ~12.5x lower submit
overhead for cross-queue dependency edges, ~22x lower dynamic graph
construction tax versus HIP graphs on a 512-dispatch chain, and ~20x
lower steady-state host CPU time on queue-heavy submission paths. This
is v0, but it is already the architecture we want to optimize: fewer
compatibility layers, more explicit contracts, and a path where AMD GPUs
participate in the full HAL ecosystem instead of living behind a
HIP-shaped abstraction boundary.

This is intentionally a large PR. The driver is not a thin shim around
one runtime call; it is the runtime boundary for AMD GPUs. The branch
contains the native driver plus the AMDGPU-specific hardening that made
the final shape reviewable: command-buffer replay cleanup, queue/pool
integration, profiling producers, target-library selection, device
capability handling, tests, and developer documentation.

The headlines:

- IREE now has a native AMDGPU execution path based on HSA queues and
AQL packets.
- The driver can run normal HAL dispatches and reusable HAL command
buffers without HIP streams or HIP graphs.
- The command-buffer representation is designed as a durable block
program that can be replayed by host processors now and device-side
processors later.
- The profiling path can expose queue, dispatch, executable, counter,
device metric, and ATT/SQTT trace data through the HAL profile tooling.
- The hot paths are structured so static production replay does not pay
for optional profiling, trace, upload, or future device-fixup machinery.

## Why

HIP is a useful compatibility layer and comparison point, but it is not
the right abstraction boundary for the runtime work IREE wants to do.

IREE needs to be able to control:

- how HAL queue operations become AQL/PM4 packets;
- where kernargs, command-buffer templates, transient buffers, and
staging records live;
- how semaphore dependencies map to queue frontiers and completion
epochs;
- how reusable command buffers are recorded, validated, replayed, and
profiled;
- where host work ends and queue-ordered device work begins;
- how to capture profiling data without turning the production queue
into a debug path; and
- how to evolve toward device-side command-buffer scheduling and fixup.

HIP graphs are especially awkward for IREE's dynamic command-buffer use
case. They can be expensive to construct, hard to introspect, and
difficult to shape around IREE's own async allocation and replay
contracts. The native driver gives IREE a graph-like reusable command
stream while keeping the command stream in IREE's own ABI.

## Design Principles

The implementation follows a few constraints that are worth making
explicit for review.

**Own the production hot path.** Queue submission, command-buffer
replay, kernarg formation, packet publication, and completion are
explicit IREE code. Optional features are allowed only when they do not
tax the default path. For example, profiling, ATT/SQTT capture,
queue-control upload rings, and future device-side fixup all have opt-in
storage and control flow.

**Record facts once.** Command buffers are allowed to do work while
recording and finalizing so replay can be simple. Binding counts, patch
counts, packet counts, barrier requirements, prepublication eligibility,
rodata references, and block terminators are recorded in the
command-buffer program instead of rediscovered by
scanning command records during submission.

**Keep host and device processors pointed at the same ABI.** The AMDGPU
command buffer is a block program, not a host-only replay script. The
current host AQL block processor consumes that program; future
device-side processors should consume the same block format for
command-buffer continuations, scheduling, and kernarg fixup.

**Separate invariant clusters.** The driver is split by subsystem rather
than growing one giant queue file. There are distinct files for queue
submission, queue waits, command-buffer block processing, command-buffer
replay, profiling augmentation, staging/file paths, memory operations,
executable handling, topology, device capabilities, and utility rings.

**Fail loud on unsupported strategies.** Unsupported memory paths,
command forms, profiling modes, and device capabilities should fail with
a concrete status instead of silently falling back through the wrong
mechanism.

**Make platform/device variation explicit.** The code names the places
where HSA memory-pool access, HDP publication, topology links, target
IDs, device-library coverage, Linux KFD metrics, and optional ROCm
profiling libraries affect behavior.

## Architecture Overview

### Driver And Device Model

The driver dynamically loads HSA/ROCR, discovers CPU and GPU agents, and
creates logical HAL devices over one or more physical AMDGPU agents.

The main object split is:

- driver: HSA discovery, option parsing, and logical-device creation;
- logical device: HAL-facing device object and shared runtime state;
- physical device: one HSA GPU agent with queues, memory pools,
executable cache, device-library selection, profiling state, device
metrics, and topology facts;
- host queue: HSA queue plus IREE's AQL, kernarg, notification,
completion, and reclaim state; and
- virtual queue: the internal interface used so command-buffer, direct
dispatch, memory, file, and profiling paths route through one queue
contract.

Device selection defaults to all visible AMDGPU agents and also supports
single-device, ordinal-based, UUID-based, and multi-device
logical-device configurations. The topology code records HSA memory-pool
access, link class, NUMA distance, coherency, atomics, and interop
capability facts so future placement and transfer strategies can reason
about PCIe, xGMI, and other link types without hard-coded assumptions.

### Executables And Device Libraries

AMDGPU executables are loaded from HSACO/code-object data and matched
against the selected physical device. The runtime also embeds AMDGPU
device libraries used for builtin operations such as fill/copy helpers,
timestamp helpers, and dispatch-side utilities.

The device-library target map is single-sourced from generated target
metadata. Builds can select exact targets, LLVM generic targets,
TheRock-style generic families, or product bundles. This keeps package
size and device coverage under explicit build-system control while
letting the runtime fail clearly when a required target was not
embedded.

### Memory, Pools, And Publication

The driver integrates with the HAL pool substrate and AMDGPU HSA memory
pools instead of treating all buffers as generic allocations.

The implementation distinguishes:

- device-local memory;
- CPU-visible fine-grained host memory;
- CPU-visible coarse-grained device memory;
- queue-owned kernarg memory;
- optional queue-control upload memory;
- transient allocation pools;
- file/staging storage; and
- host-side block/slab pools used by queue and profiling data
structures.

HDP publication is represented as a selected capability of the memory
path, not as an ad hoc flush sprinkled through dispatch code. If CPU
writes to memory that the GPU will consume require publication on a
device, the queue-owned memory path knows how to publish those writes
before the relevant packet headers become
visible.

The default queue-control upload ring is disabled until a production
consumer opts in. That keeps the future device-side fixup path available
without charging every queue an unused HSA allocation.

### Queue Submission And Completion

Host queues own an HSA AQL queue and maintain:

- an AQL ring view for packet reservation/publication;
- a kernarg ring for queue-owned dispatch arguments;
- an epoch/notification ring mapping GPU completions to HAL semaphore
signals;
- a queue frontier snapshot for dependency tracking;
- one completion thread that drains queue epochs and publishes
user-visible semaphore completions;
- optional PM4 IB slots indexed by AQL packet id on hardware that
supports AQL PM4 packets; and
- optional profiling/counter/trace state.

Submission is serialized per queue, but independent queues do not
synchronize with each other. The queue submission path reserves AQL
packets, kernargs, and notification entries before publishing headers.
If admission fails, reclaim is routed through the same
notification/reclaim machinery instead of inventing a
parallel cleanup path.

HAL ordering is represented by semaphore/frontier dependencies, not by
assuming FIFO execution. The queue frontier machinery lets the driver
elide redundant waits when the dependency is already known to be
satisfied, while preserving correctness when the frontier overflows or
cannot prove elision.

### Direct Dispatch And Builtin Operations

Direct `queue_dispatch` resolves executable metadata, validates dispatch
shape, forms kernargs, retains the executable/buffer resources required
by the submission, and emits AQL packets through the common queue
submission path.

Queue buffer operations are implemented through explicit strategies.
Builtin device kernels cover fill/copy/update paths and are selected
based on alignment, size, and available device-library kernels. The code
leaves room for SDMA, PM4, P2P, and future direct-storage strategies
without conflating those with the current kernel-dispatch path.

### Command Buffers

The AMDGPU command-buffer ABI is the center of the rewrite.

Recorded command buffers are stored as a program of blocks. Each block
has a fixed header with command counts, binding-source counts,
packet/kernarg worst case, rodata extent, dispatch/profile-marker
counts, barrier metadata, and a terminator. Commands include barriers,
dispatches, fills, copies, updates,
profile markers, branches, conditional branches, and returns.

The important split is:

- the command buffer owns the durable block program and rodata;
- the AQL block processor consumes one block and writes reserved
packet/kernarg storage;
- host queue replay is the container/orchestration layer that
initializes a processor, invokes blocks, handles continuations, and
integrates with semaphores/reclaim; and
- profiling processors are separate variants that augment replay only
when profiling was explicitly requested.

This shape is deliberate. A block processor is close to a small
interpreter over the block ABI. It is suitable for dedicated tests today
and for device-side processor variants later. Host queue code should not
need to know how every command body becomes AQL packets.

Replay hot paths are specialized:

- static reusable dispatches can use prepublished kernargs;
- all-dynamic dispatches use a direct binding-pointer scatter path;
- mixed static/dynamic reusable dispatches use immutable templates plus
recorded dynamic patch sources;
- indirect dispatch parameters stay on the generic path where required;
and
- profile-disabled replay bypasses profile sidecars and trace/counter
logic.

Dynamic binding sources retain the original `queue_execute` binding
table slot for the entire command-buffer lifetime. There is no per-block
binding remap sidecar, and no finalization scan that rewrites binding
slots. Future
device-side fixup should consume recorded patch records directly: `patch
offset + binding table slot + binding offset`.

### Profiling, Counters, Traces, And Replay

The driver is a first-class producer for the HAL-native profiling and
replay stack.

Supported profiling/data modes include:

- host-side memory and queue events;
- device-side queue timestamps;
- per-dispatch timestamps;
- executable/export metadata;
- hardware/software counters;
- queue-range PMC sampling;
- device metrics from platform-specific sources;
- filtered ATT/SQTT executable traces through dynamically loaded ROCm
profiling libraries; and
- replay captures that can be run, benchmarked, dumped, and profiled
outside the original application.

Normal execution does not require ROCm profiling libraries. The
aqlprofile path is dynamically loaded only for modes that need counters
or executable traces. Linux-specific device-metric support is isolated
behind a platform source so the core driver remains structured for
future Windows and macOS HSA support.

## Performance Evidence

The main apples-to-apples GPU comparison uses the SDXL CLIP prompt
encoder: a real sharktank workload with 792 dispatches, 28 executables,
and enough queue traffic to exercise command-buffer replay and
host/runtime overhead.

Post-cleanup optimized non-Tracy medians:

| Shape | AMDGPU wall | HIP stream wall | AMDGPU vs stream | HIP graph wall | AMDGPU vs graph | AMDGPU host CPU |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| c1/d1 | 10.9508 ms | 11.5456 ms | 5.15% faster | 11.6199 ms | 5.76% faster | 0.618 ms |
| c1/d16 | 0.7035 ms/item | 0.7311 ms/item | 3.78% faster | 0.7335 ms/item | 4.09% faster | 0.036 ms/item |
| c2/d16 | 0.7073 ms/item | 0.7298 ms/item | 3.08% faster | 0.7330 ms/item | 3.50% faster | 0.037 ms/item |
| c4/d16 | 0.7066 ms/item | 0.7278 ms/item | 2.92% faster | 0.7288 ms/item | 3.05% faster | 0.037 ms/item |
| c8/d16 | 0.7058 ms/item | 0.7322 ms/item | 3.60% faster | 0.7333 ms/item | 3.75% faster | 0.038 ms/item |

The broader model spread is consistent with the same story: native
AMDGPU is usually ahead of HIP stream, usually ahead of HIP graph when
HIP graph can import the workload, and uses much less host CPU on
queue-heavy paths.

Representative additional rows:

| Workload | Shape | AMDGPU | HIP stream | HIP graph | Notes |
| --- | --- | ---: | ---: | ---: | --- |
| MNIST-12 | c1/d1 | 0.0978 ms | 0.1423 ms | 0.1425 ms | Small classifier, high runtime-overhead sensitivity. |
| SqueezeNet 1.0 | c1/d1 | 1.1428 ms | 1.2043 ms | 1.1988 ms | Compact CNN. |
| toy CLIP bf16 | c1/d1 | 0.2227 ms | 0.2578 ms | 0.2597 ms | Transformer-ish toy encoder. |
| MobileNetV2-12 | c1/d1 | 1.8462 ms | 1.9316 ms | crash | Depthwise/mobile CNN; HIP graph crashes locally. |
| TinyYOLOv2-8 | c1/d1 | 7.6516 ms | 8.0490 ms | 8.5600 ms | Object detection graph. |
| ResNet50-v1-12 | c1/d1 | 9.5364 ms | 9.6900 ms | import fails | HIP graph node limit. |
| SDXL scheduled UNet | c1/d1 body | 204.36 ms | 215.19 ms | 216.43 ms | Direct `run_forward` body. |
| SDXL CLIP prompt encoder | c8/d16 | 0.692 ms | 0.721 ms | 0.725 ms | Byte-identical HSACO/no-prefetch row. |

We also compared raw C HAL command-buffer construction/replay against
raw C HIP graph construction/launch for a 512 dispatch/barrier chain,
avoiding VM overhead on both sides:

| Path | Prebuilt wall | Dynamic wall | Extra wall | Extra wall / dispatch | Extra CPU / dispatch |
| --- | ---: | ---: | ---: | ---: | ---: |
| HAL command buffer, validated | 2096.4 us | 2177.0 us | 80.5 us | 0.157 us | 0.582 us |
| HAL command buffer, unvalidated | 2096.4 us | 2143.3 us | 46.9 us | 0.092 us | 0.526 us |
| HIP graph | 2983.7 us | 4022.9 us | 1039.3 us | 2.030 us | 2.308 us |

That is the key dynamic-command-buffer result: unvalidated HAL
command-buffer recording/replay adds tens of microseconds for the
512-pair chain, while HIP graph construction adds about a millisecond in
the same harness.

Queue-stress microbenchmarks isolate the pathological submission streams
that large distributed and graph-style applications care about. The
current-head HAL rows below use the checked-in AMDGPU `queue_benchmark`
built optimized with release ThinLTO/O3/native flags, pinned to one CPU
and one local RDNA3 GPU. HIP rows use the matching HIP event ping-pong
harness on the same CPU/GPU pin. The end-to-end rows measure 512
cross-queue dependency edges plus one public host-visible completion:

| Shape | AMDGPU end-to-end / edge | HIP end-to-end / edge | Read |
| --- | ---: | ---: | --- |
| Cross-queue dependency edge | 4.58 us | 11.20 us | AMDGPU is 2.4x faster. |
| Edge + 4-byte device copy | 11.65 us | 14.62 us | AMDGPU is 1.25x faster. |
| Edge + 4-byte device fill | 10.98 us | 15.20 us | AMDGPU is 1.38x faster. |
| Edge + tiny dispatch | 10.55 us | 14.59 us | AMDGPU is 1.38x faster. |
| Edge + no-op dispatch packet | 4.56 us | n/a | AMDGPU stays near the pure dependency floor when payload work is empty. |

The pure submit-only dependency row is the sharpest host-path
comparison: AMDGPU submits a cross-queue dependency edge for about 0.42
us/edge, while HIP events cost about 5.23 us/edge in the same pinned
harness. That is about 12.5x less host-side submission overhead for the
synchronization pattern used by tensor-parallel and pipeline-parallel
programs.

This is not just an implementation-speed comparison. HIP stream events
and HIP graphs sit above a compatibility runtime that has to rediscover
intent from streams, events, graph nodes, kernel parameters, and raw
pointer arguments. IREE already has that intent in structured HAL
commands: explicit semaphore frontiers, queue affinity, binding tables,
memory types, command-buffer blocks, and executable metadata. The AMDGPU
HAL can turn those contracts directly into AQL packets and queue-local
completion state without routing every operation
through HIP's public stream/event/graph abstraction.

That structural difference is why the CPU-time story is as important as
the wall-time story. On the SDXL CLIP prompt encoder, AMDGPU runs the
steady-state batched path with roughly 0.036-0.038 ms/item of host CPU
time while HIP stream and HIP graph paths are around 0.74-0.76 ms/item.
That is a roughly 20x host CPU reduction on the queue-heavy path. On
systems with many accelerators, expensive prefill/decode scheduling, or
small CPU budgets, that difference is the difference between the CPU
being orchestration glue and the CPU becoming the
bottleneck.

The same abstraction boundary is also what lets HAL scale beyond HIP's
world model. HAL command buffers, semaphores, queue affinity, memory
files, and device groups can describe local GPUs, CPU devices, remote
devices, and heterogeneous execution without changing the program's
synchronization model. The upcoming remote HAL work can use the same
command/dependency concepts across process or machine boundaries; HIP
cannot represent that kind of heterogeneous or remote execution graph
without collapsing it back into host-side framework logic. This rewrite
puts AMDGPU on the same HAL substrate as local-task, local-sync,
profiling, replay, and future remote execution instead of treating AMD
GPUs as a HIP-shaped island.

Tracy and Perfetto captures were used as structural evidence for queue
shape, host/runtime gaps, worker behavior, dispatch timing, counter
ranges, and device metric sampling. Non-Tracy optimized runs are the
source of the wall-time numbers above.

## Portability And Hardware Coverage

The current implementation has been exercised primarily on local
RDNA3/gfx1100 Linux hardware, but the code is structured for broader
AMDGPU support.

Cross-device preparation in this PR includes:

- target ID parsing and generated target maps for exact, generic,
family, and product-bundle device-library selection;
- explicit HSA memory-pool access and link-topology modeling;
- CPU-visible device-coarse memory capability selection with HDP
publication;
- queue-owned kernarg publication policy;
- PM4 capability detection and AQL PM4 IB infrastructure where
supported;
- generic device-library target selection instead of hard-coding
gfx1100; and
- tests around target IDs, code-object target selection, topology,
memory access, device-library lookup, and PM4/AQL emitters.

Cross-platform preparation includes:

- dynamic HSA loading instead of a direct link dependency;
- platform-isolated Linux KFD/device-metric support;
- optional dynamic loading of ROCm profiling libraries;
- public HAL abstractions for profiling/replay rather than AMDGPU-only
tool hooks; and
- explicit failure for unsupported platform features.

This PR does not claim every modern RDNA/CDNA target is fully proven. It
gives us the driver architecture, target map, and capability seams
required to harden that matrix as more hardware and platform HSA stacks
become available.

## Forward-Looking Work Enabled By This Shape

Several important features are intentionally not completed in this PR,
but the landed architecture is designed around them.

**Device-side dynamic kernarg fixup.** Dynamic command buffers currently
patch queue-owned kernargs on the host. The planned production path is
to upload a small per-submission binding table/control record and
dispatch a device-side fixup kernel that copies template kernargs and
patches dynamic qwords before
payload dispatches execute. The recorded command-buffer patch records
already carry the essential facts: target patch location, original
binding-table slot, and binding offset.

**Device-side command-buffer scheduling.** The block-program ABI gives
us a clean path to device-side processors. A device queue can invoke
block processors, advance command-buffer continuations, and schedule
independent blocks without forcing host queue code to understand every
command body.

**Command-buffer control flow.** The ABI already reserves branch,
conditional branch, and return terminators. Host replay currently
supports the subset needed by the landed workloads; the representation
is intentionally shaped so richer control flow can become an execution
feature rather than a new command-buffer
format.

**Binding-table-indirect dispatch ABI.** A future dispatch ABI may avoid
dynamic kernarg pointer fixup by passing an invocation-local binding
table base and loading buffer pointers indirectly in kernels. That needs
compiler/runtime experiments to measure the cost of an extra scalar load
versus raw pointer kernargs, but the current direct binding-table slot
invariant is compatible with that direction.

**PM4-backed queues and operations.** The driver now has PM4 emitters,
PM4 program utilities, capability detection, and AQL PM4 IB slots on
supported hardware. That creates room for PM4-backed waits, transfers,
profiling snippets, and potentially lower-level queue strategies where
HSA/AQL alone is not the best mechanism.

**Transfer strategy expansion.** Current transfer paths use explicit
builtin device kernels and staging strategies. The queue/file/memory
split leaves room for SDMA, P2P, direct storage, and topology-aware copy
selection without rewriting the core queue completion path.

**Broader profiling.** CDNA devices should expose richer counter options
than the initial local setup. The queue-range PMC and profile-bundle
infrastructure are meant to scale into that environment without changing
the normal execution path.

## Review Guide

Good entry points for review:

- `runtime/src/iree/hal/drivers/amdgpu/README.md`: user-facing driver
overview, build flags, runtime selection, profiling, and target-library
notes.
- `runtime/src/iree/hal/drivers/amdgpu/api.h`: public driver/device
options.
- `runtime/src/iree/hal/drivers/amdgpu/driver.c`: driver registration,
HSA loading, and device creation.
- `runtime/src/iree/hal/drivers/amdgpu/logical_device.c`: HAL device
methods, profiling/replay integration, and physical-device
orchestration.
- `runtime/src/iree/hal/drivers/amdgpu/physical_device.c`: HSA agent
setup, queue creation, memory pools, executable caches, device
libraries, profiling, and topology state.
- `runtime/src/iree/hal/drivers/amdgpu/host_queue.c`: queue ownership,
completion thread, submission state, and reclaim lifetime.
- `runtime/src/iree/hal/drivers/amdgpu/host_queue_submission.c`: common
submission admission, publication, and failure/reclaim path.
- `runtime/src/iree/hal/drivers/amdgpu/aql_command_buffer.c`:
command-buffer recording, layout, prepublication, dynamic binding
strategy, and block construction.
- `runtime/src/iree/hal/drivers/amdgpu/abi/command_buffer.h`: durable
command-buffer block ABI.
- `runtime/src/iree/hal/drivers/amdgpu/aql_block_processor.c`:
unprofiled AQL block processor.
- `runtime/src/iree/hal/drivers/amdgpu/aql_block_processor_profile.c`:
profiling-augmented block processor.
- `runtime/src/iree/hal/drivers/amdgpu/host_queue_command_buffer*.c`:
host replay orchestration, block submission, packet policy, scratch
storage, and profiling integration.
- `runtime/src/iree/hal/drivers/amdgpu/profile_*.c`: profile producers
for events, metadata, counters, device metrics, and traces.
- `runtime/src/iree/hal/drivers/amdgpu/device/*.c`: embedded device-side
helper kernels and host-side packet/kernarg formation helpers.
- `runtime/src/iree/hal/drivers/amdgpu/util/*.c`: HSA loading, target
IDs, code-object metadata, rings, signals, PM4/AQL emitters, topology,
and KFD utilities.

## Validation

Validation covered both source-level unit tests and workload-level
evidence:

- focused AMDGPU unit tests for HSA loading, target IDs, code-object
metadata, device libraries, topology, capabilities, pools, signals,
rings, emitters, executables, semaphores, allocators, command buffers,
block processors, host queue submission, staging, profiling
metadata/events, and CTS backends;
- AMDGPU HAL CTS dispatch/executable coverage;
- focused Linux Bazel ASAN builds/tests for the AMDGPU runtime targets;
- focused CMake configure/build/test coverage for AMDGPU runtime
libraries and generated CTS artifacts;
- Windows and macOS CMake validation of the shared
HAL/async/profile/replay substrate that this driver depends on;
- SDXL CLIP correctness on both visible local AMDGPU devices with the
same weights, inputs, and expected outputs used for CPU validation;
- SDXL CLIP, SDXL UNet, model-spread, command-buffer-vs-HIP-graph,
Tracy, Perfetto, device-metrics, PMC, and ATT/SQTT profiling runs; and
- pre-commit formatting/check generation hooks for the final branch.

The performance numbers in this PR are from optimized non-Tracy runs on
my machine, YMMV. Tracy, Perfetto, counters, and device metrics were
used to explain structure and validate behavior, not as the source of
wall-clock claims.