[HAL/Replay] Adding a HAL capture API and replay tool. (#24288)
> Risk here _should_ be relatively low - this is mostly just
iree/hal/replay/, the command line tools, and the wiring for replay
capture, with few changes outside. Probably just potential packaging
issues. This was a day of work and is enough for the tuning my agents were
running; if anyone else starts using it we can clean things up.
Embedding _should_ mostly work but some corner cases (importing buffers,
external memory, etc) are not going to work - embedders should have
modes that do semantically-aware logic (cloning/snapshotting, etc)
around this instead of trying to make this do crazy dirty page tracking
via guard pages and such.
This PR adds HAL-level replay capture and replay tooling. A capture
records the HAL resource graph and operation stream issued by a program;
replay then lets us run, benchmark, profile, dump, remap, or substitute
pieces of that workload without re-running the original VM/Python/etc.
invocation.
The important product shape is that replay sits at the HAL boundary. A
capture contains devices, allocators, executable caches, executable
payloads, command buffers, semaphores, buffers, files, host-visible
writes, queue operations, and payload ranges. Replay reconstructs those
HAL objects and reissues supported operations against a caller-provided
device group.
## Why
Device performance work needs reproducible HAL workloads. Model
invocation time can include VM startup, argument parsing, input setup,
parameter archive plumbing, queue operations, command-buffer recording,
executable loads, and synchronization. For driver and runtime work, most
of that is noise. Replay captures the device-facing HAL stream below the
VM so we can debug, benchmark, and profile it directly. When IREE is
driven from Python bindings or other framework layers, it is often
impossible to get precise reproducers or benchmarks with those layers in
the loop.
This also makes kernel iteration practical. The captured replay can
preserve the workload shape while substituting executable payloads at
replay time. That lets us answer "what happens if only this kernel
changes?" without rebuilding or re-invoking the original model stack.
Replay also composes with device profiling: `.ireereplay` says what HAL
work to run, and `.ireeprof` says what happened while it ran.
## What Lands
- `runtime/src/iree/hal/replay/` with the versioned `.ireereplay`
container, reader/writer helpers, lightweight digests, text and JSONL
dump projections, and replay execution.
- Recorder wrappers for HAL device groups and resources. Tooling swaps
in the recorder only when `--device_replay_output` is present. Normal
execution uses the unwrapped device group and does not allocate replay
state or wrap HAL resources.
- `iree-run-replay`, `iree-benchmark-replay`, and `iree-dump-replay`.
- Runtime Python package wrappers, CMake install/symlink plumbing,
package-test coverage, and a lit smoke test so the replay tools ship
with the same package surface as the existing run/benchmark/dump module
tools.
- `iree-dump-replay --format=text|jsonl`. Both projections validate the
replay container and report payload/blob data as byte ranges in the
original `.ireereplay` file instead of materializing large bytes in text
output.
- External file capture policies:
- `reference`: record path and identity metadata without copying bytes.
- `capture-ranges`: embed bytes read by recorded queue reads.
- `capture-all`: embed full imported files.
- `fail`: reject fd-backed files to force hermeticity decisions.
- Referenced-file validation modes:
- `identity`: cheap length/device/inode/mtime metadata.
- `digest`: opt-in full-file digest at capture and replay time.
- Replay path remapping for referenced external files.
- Executable substitution for kernel iteration.
- Named replay scope markers. `iree-run-module` captures standard
`init`, `execute`, and `deinit` phases; `iree-benchmark-module` captures
the Google Benchmark timed region as repeated `execute` scopes;
embedders can record their own scopes; and `iree-benchmark-replay
--replay_scope=name` reports just the selected phase while still
executing the full replay stream around it.
- Prepared replay plans. Repeated replay execution validates and lowers
the file stream once, then executes the prepared record plan for each
benchmark iteration.
- `iree-benchmark-replay` profiling support where profile flushes are
kept out of the timed benchmark region.
- Tool-owned help text and `--agents_md` output. The replay runtime
library does not carry CLI help; `iree-run-replay --agents_md` owns the
shared replay playbook, while capture, benchmark, and dump tools print
only tool-specific guidance that points back to it.
- Website docs for the device replay workflow and profiling story.
## Tool Flow
Capture from normal module tooling:
```shell
iree-run-module \
--device=local-task \
--module=/tmp/model.vmfb \
--function=main \
--device_replay_output=/tmp/model.ireereplay \
--device_replay_file_policy=reference
```
Replay, benchmark, and inspect:
```shell
iree-run-replay --device=local-task /tmp/model.ireereplay
iree-benchmark-replay \
--device=local-task \
--benchmark_min_time=50x \
/tmp/model.ireereplay
iree-dump-replay --format=jsonl /tmp/model.ireereplay
```
Capture from benchmark-module to preserve the same region Google
Benchmark times:
```shell
iree-benchmark-module \
--device=local-task \
--module=/tmp/model.vmfb \
--function=main \
--input=... \
--benchmark_min_time=100x \
--device_replay_output=/tmp/model-benchmark.ireereplay
iree-benchmark-replay \
--device=local-task \
--replay_scope=execute \
/tmp/model-benchmark.ireereplay
```
Capture replay profiling without charging profile flushes to the
benchmarked iteration:
```shell
iree-benchmark-replay \
--device=local-task \
--benchmark_min_time=50x \
--device_profiling_mode=queue-events,host-execution,command-region-events,memory-events,executable-metadata \
--device_profiling_output=/tmp/model-replay.ireeprof \
/tmp/model.ireereplay
```
## Replay Scopes
Replay scopes are metadata records in the `.ireereplay` stream. They are
not HAL operations, they do not change normal replay execution, and they
are only observed by replay consumers that register a scope callback or
inspect dumps.
`iree-run-module` emits three standard scopes when capture is enabled:
| Scope | Meaning |
| --- | --- |
| `init` | HAL device wrapping and HAL module construction after the recorder is created. |
| `execute` | Input parsing/staging, function invocation, waits, output processing, and output reads. |
| `deinit` | VM context/resource teardown before the recorder is closed. |
`iree-benchmark-module` emits one `execute` scope for each benchmark
iteration. The scope boundaries match the region where Google Benchmark
timing is resumed. For asynchronous benchmarks, setup and cleanup remain
outside the scope just as they remain outside the reported benchmark
time.
Embedders can add their own application-level phases around a
replay-wrapped device group:
```c
IREE_RETURN_IF_ERROR(
iree_hal_replay_recorder_scope_begin(recorder, IREE_SV("prefill")));
/* Run the application phase using the replay-wrapped HAL device group. */
IREE_RETURN_IF_ERROR(
iree_hal_replay_recorder_scope_end(recorder, IREE_SV("prefill")));
```
Scoped benchmarking still executes the complete replay each iteration.
The benchmark row is registered with Google Benchmark manual timing and
reports the accumulated wall-clock interval between matching scope
begin/end markers:
```shell
iree-benchmark-replay \
--device=local-task \
--benchmark_min_time=50x \
--replay_scope=execute \
/tmp/model.ireereplay
```
Note that because init is replayed, if the target has any warmup costs the
per-invocation iteration count must be high enough to cover them.
Scope records are visible in dumps:
```shell
iree-dump-replay --format=jsonl /tmp/model.ireereplay | \
jq 'select(.payload_type=="replay_scope") | {operation, name: .payload.name}'
```
## Executable Substitution
Replay can replace captured executable payloads at execution time. The
same syntax is supported by `iree-run-replay` and
`iree-benchmark-replay`:
```shell
iree-run-replay \
--device=amdgpu \
--replay_executable_substitution=4=/tmp/new-kernel.hsaco \
/tmp/model.ireereplay
iree-benchmark-replay \
--device=amdgpu \
--benchmark_min_time=50x \
--replay_executable_substitution=4@amdgcn-amd-amdhsa--gfx1100=/tmp/new-kernel.hsaco \
/tmp/model.ireereplay
iree-run-replay \
--device=local-task \
--replay_executable_substitution=all@embedded-elf-x86_64=/tmp/replacement.so \
/tmp/model.ireereplay
```
Selectors are either `EXECUTABLE_ID`, `EXECUTABLE_ID@FORMAT`, `all`, or
`all@FORMAT`. Substitution is strict: replay validates the captured
executable metadata, export layout, and backend format expectations
instead of silently accepting a replacement with the wrong ABI shape.
Ergonomics improvements are needed and will happen based on feedback; the
mechanism itself is solid.
## Fidelity Contracts
Replay fails loudly when it cannot reproduce the captured HAL work.
These failures are part of the contract:
- Missing or identity-mismatched external files mean replay might point
at the wrong parameter archive. Fix the path with `--replay_file_remap`
or restore the referenced file.
- Persistent host write maps without an observable flush or unmap
boundary are rejected because replay cannot see the final byte contents.
- Host calls, channels, collectives, allocator import/export, and opaque
external handles are visible in dumps and fail in strict execution until
they have replay semantics.
- Imported or exported external buffers are not replayed as best-effort
snapshots because the application can mutate them outside observable HAL
map, flush, or update operations.
- Target topology matters. Select a device group whose device count and
capabilities match the captured workload.
These constraints keep replay useful for correctness and performance
work: a successful replay means the HAL stream was actually reproduced,
not that unsupported operations were skipped.
## Evidence
Toy LLaMA 3.1 `prefill_bs1`, captured and replayed on `local-task`:
| Property | Value |
| --- | ---: |
| Replay size | 6.4 MiB |
| Records | 308 |
| Objects | 70 |
| Operations | 237 |
| Unsupported records | 0 |
| Strict replay supported | yes |
| Referenced external files | 1 |
| External file bytes | 3,117,056 |
| Inline file bytes | 16,384 |
| Captured read bytes | 0 |
Operation mix:
| Operation | Count |
| --- | ---: |
| `command_buffer.dispatch` | 75 |
| `command_buffer.execution_barrier` | 37 |
| `device.queue_alloca` | 33 |
| `device.queue_read` | 31 |
| `device.create_semaphore` | 24 |
| `allocator.allocate_buffer` | 5 |
| `buffer.map_range` / `buffer.unmap_range` | 4 / 4 |
| `device.queue_execute` | 4 |
| `command_buffer.begin` / `command_buffer.end` | 2 / 2 |
| `device.create_command_buffer` | 2 |
| `executable_cache.prepare_executable` | 1 |
Baseline without replay, using the same VMFB, parameter archive,
function, and inputs through `iree-benchmark-module` on `local-task`:
| Benchmark | Iterations per rep | Median wall | Median CPU | CV wall | CV CPU |
| --- | ---: | ---: | ---: | ---: | ---: |
| `BM_prefill_bs1/process_time/real_time` | 1000 | 0.552 ms | 7.391 ms | 0.96% | 0.38% |
Scoped replay of a benchmark-module capture, using
`--replay_scope=execute` so only the Google Benchmark timed region is
reported:
| Capture shape | Executor | Replay reps | Median scoped wall | Per execute scope | Median CPU | CV wall | CV CPU |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: |
| 100 captured execute scopes | prepared plan | 10 | 55.801 ms | 0.558 ms | 7.402 ms | 0.89% | 0.48% |
The scoped replay comparison is the relevant one for hot function
timing. Single-scope captures include first-iteration runtime/device
warmup. Repeated scopes with the prepared replay plan match the
benchmark-module median on this machine while bypassing the VM and
executing the exact captured HAL stream. It's not perfect (turns out, the
VM isn't slow :) but close enough for confidence. Of course, if the
original application/framework embedding IREE commits sins of its own
while capturing the replay we will not replicate them; the replay should
be considered best-case performance once any application-layer issues are
eliminated.
The replay executor keeps common small transient lists inline: command
buffer dispatch bindings, queue binding tables, semaphore lists, and
one-entry barrier lists avoid per-record heap allocation.
Trace/profile validation was also captured for the same replay workload.
The profile render produced a Perfetto trace with 52,405 records,
including 14,400 host execution slices, 7,500 command-operation
instants, 13,800 queue instants, 10,100 memory instants, and 3,400
counter samples. `iree-profile` and the replay mechanism are designed to
work together and produce near-identical results, though a perfect match
will not always be possible.
ci-extra: all