)]}'
{
  "commit": "dfe81344abeed40c6c8bcb664eb35f113deee19c",
  "tree": "5ec34b900fd8d1e4af1c5f85c55a464ed21edba5",
  "parents": [
    "835d1b88ab3f3a7cdd0fc952c28337143602c2d8",
    "b42f44c0d93ea2e1c2198378cd2f795d5cf7af3c"
  ],
  "author": {
    "name": "Ben Vanik",
    "email": "ben.vanik@gmail.com",
    "time": "Wed Apr 29 22:20:15 2026 -0700"
  },
  "committer": {
    "name": "GitHub",
    "email": "noreply@github.com",
    "time": "Wed Apr 29 22:20:15 2026 -0700"
  },
  "message": "[HAL/AMDGPU] Initial host-side AMDGPU HAL implementation (#24298)\n\nThis PR lands IREE\u0027s native AMDGPU HAL driver: a direct HSA/ROCR backend\nthat owns queue submission, packet construction, memory placement,\ncommand-buffer recording/replay, profiling, counters, device-library\nselection, and future scheduling policy inside IREE instead of routing\nnormal execution through HIP. The cost is ~70kLoC but that gives IREE\ndirect ownership of AMD GPU execution instead of routing through HIP\nstreams and HIP graphs. The critical unlocks happen because IREE already\nknows the real program structure that HIP tries to guess at: explicit\nsemaphore frontiers, queue affinity, memory types, binding tables,\nreusable command-buffer blocks, executable metadata, profiling scopes,\nand replay captures. The native driver turns that structure directly\ninto AQL packets and queue-local completion state, which lets us do\nthings HIP cannot naturally express: low-overhead dynamic command\nbuffers, heterogeneous HAL device groups, future remote execution,\ndevice-side fixup/scheduling, and profiling/replay from the same command\nmodel. The early numbers show the shape of the win: ~12.5x lower submit\noverhead for cross-queue dependency edges, ~22x lower dynamic graph\nconstruction tax versus HIP graphs on a 512-dispatch chain, and ~20x\nlower steady-state host CPU time on queue-heavy submission paths. This\nis v0, but it is already the architecture we want to optimize: fewer\ncompatibility layers, more explicit contracts, and a path where AMD GPUs\nparticipate in the full HAL ecosystem instead of living behind a\nHIP-shaped abstraction boundary.\n\nThis is intentionally a large PR. The driver is not a thin shim around\none runtime call; it is the runtime boundary for AMD GPUs. 
The branch\ncontains the native driver plus the AMDGPU-specific hardening that made\nthe final shape reviewable: command-buffer replay cleanup, queue/pool\nintegration, profiling producers, target-library selection, device\ncapability handling, tests, and developer documentation.\n\nThe headlines:\n\n- IREE now has a native AMDGPU execution path based on HSA queues and\nAQL packets.\n- The driver can run normal HAL dispatches and reusable HAL command\nbuffers without HIP streams or HIP graphs.\n- The command-buffer representation is designed as a durable block\nprogram that can be replayed by host processors now and device-side\nprocessors later.\n- The profiling path can expose queue, dispatch, executable, counter,\ndevice metric, and ATT/SQTT trace data through the HAL profile tooling.\n- The hot paths are structured so static production replay does not pay\nfor optional profiling, trace, upload, or future device-fixup machinery.\n\n## Why\n\nHIP is a useful compatibility layer and comparison point, but it is not\nthe right abstraction boundary for the runtime work IREE wants to do.\n\nIREE needs to be able to control:\n\n- how HAL queue operations become AQL/PM4 packets;\n- where kernargs, command-buffer templates, transient buffers, and\nstaging records live;\n- how semaphore dependencies map to queue frontiers and completion\nepochs;\n- how reusable command buffers are recorded, validated, replayed, and\nprofiled;\n- where host work ends and queue-ordered device work begins;\n- how to capture profiling data without turning the production queue\ninto a debug path; and\n- how to evolve toward device-side command-buffer scheduling and fixup.\n\nHIP graphs are especially awkward for IREE\u0027s dynamic command-buffer use\ncase. They can be expensive to construct, hard to introspect, and\ndifficult to shape around IREE\u0027s own async allocation and replay\ncontracts. 
The native driver gives IREE a graph-like reusable command\nstream while keeping the command stream in IREE\u0027s own ABI.\n\n## Design Principles\n\nThe implementation follows a few constraints that are worth making\nexplicit for review.\n\n**Own the production hot path.** Queue submission, command-buffer\nreplay, kernarg formation, packet publication, and completion are\nexplicit IREE code. Optional features are allowed only when they do not\ntax the default path. For example, profiling, ATT/SQTT capture,\nqueue-control upload rings, and future device-side fixup all have opt-in\nstorage and control flow.\n\n**Record facts once.** Command buffers are allowed to do work while\nrecording and finalizing so replay can be simple. Binding counts, patch\ncounts, packet counts, barrier requirements, prepublication eligibility,\nrodata references, and block terminators are recorded in the\ncommand-buffer program instead of rediscovered by\nscanning command records during submission.\n\n**Keep host and device processors pointed at the same ABI.** The AMDGPU\ncommand buffer is a block program, not a host-only replay script. The\ncurrent host AQL block processor consumes that program; future\ndevice-side processors should consume the same block format for\ncommand-buffer continuations, scheduling, and kernarg fixup.\n\n**Separate invariant clusters.** The driver is split by subsystem rather\nthan growing one giant queue file. 
There are distinct files for queue\nsubmission, queue waits, command-buffer block processing, command-buffer\nreplay, profiling augmentation, staging/file paths, memory operations,\nexecutable handling, topology, device capabilities, and utility rings.\n\n**Fail loud on unsupported strategies.** Unsupported memory paths,\ncommand forms, profiling modes, and device capabilities should fail with\na concrete status instead of silently falling back through the wrong\nmechanism.\n\n**Make platform/device variation explicit.** The code names the places\nwhere HSA memory-pool access, HDP publication, topology links, target\nIDs, device-library coverage, Linux KFD metrics, and optional ROCm\nprofiling libraries affect behavior.\n\n## Architecture Overview\n\n### Driver And Device Model\n\nThe driver dynamically loads HSA/ROCR, discovers CPU and GPU agents, and\ncreates logical HAL devices over one or more physical AMDGPU agents.\n\nThe main object split is:\n\n- driver: HSA discovery, option parsing, and logical-device creation;\n- logical device: HAL-facing device object and shared runtime state;\n- physical device: one HSA GPU agent with queues, memory pools,\nexecutable cache, device-library selection, profiling state, device\nmetrics, and topology facts;\n- host queue: HSA queue plus IREE\u0027s AQL, kernarg, notification,\ncompletion, and reclaim state; and\n- virtual queue: the internal interface used so command-buffer, direct\ndispatch, memory, file, and profiling paths route through one queue\ncontract.\n\nDevice selection supports all visible AMDGPU agents by default,\nsingle-device selection, UUID-based selection, ordinal selection, and\nmulti-device logical devices. 
The topology code records HSA memory-pool\naccess, link class, NUMA distance, coherency, atomics, and interop\ncapability facts so future placement and transfer strategies can reason\nabout PCIe, xGMI, and other link types without hard-coded assumptions.\n\n### Executables And Device Libraries\n\nAMDGPU executables are loaded from HSACO/code-object data and matched\nagainst the selected physical device. The runtime also embeds AMDGPU\ndevice libraries used for builtin operations such as fill/copy helpers,\ntimestamp helpers, and dispatch-side utilities.\n\nThe device-library target map is single-sourced from generated target\nmetadata. Builds can select exact targets, LLVM generic targets,\nTheRock-style generic families, or product bundles. This keeps package\nsize and device coverage under explicit build-system control while\nletting the runtime fail clearly when a required target was not\nembedded.\n\n### Memory, Pools, And Publication\n\nThe driver integrates with the HAL pool substrate and AMDGPU HSA memory\npools instead of treating all buffers as generic allocations.\n\nThe implementation distinguishes:\n\n- device-local memory;\n- CPU-visible fine-grained host memory;\n- CPU-visible coarse-grained device memory;\n- queue-owned kernarg memory;\n- optional queue-control upload memory;\n- transient allocation pools;\n- file/staging storage; and\n- host-side block/slab pools used by queue and profiling data\nstructures.\n\nHDP publication is represented as a selected capability of the memory\npath, not as an ad hoc flush sprinkled through dispatch code. If CPU\nwrites to memory that the GPU will consume require publication on a\ndevice, the queue-owned memory path knows how to publish those writes\nbefore the relevant packet headers become\nvisible.\n\nThe default queue-control upload ring is disabled until a production\nconsumer opts in. 
That keeps the future device-side fixup path available\nwithout charging every queue an unused HSA allocation.\n\n### Queue Submission And Completion\n\nHost queues own an HSA AQL queue and maintain:\n\n- an AQL ring view for packet reservation/publication;\n- a kernarg ring for queue-owned dispatch arguments;\n- an epoch/notification ring mapping GPU completions to HAL semaphore\nsignals;\n- a queue frontier snapshot for dependency tracking;\n- one completion thread that drains queue epochs and publishes\nuser-visible semaphore completions;\n- optional PM4 IB slots indexed by AQL packet id on hardware that\nsupports AQL PM4 packets; and\n- optional profiling/counter/trace state.\n\n\u003e \u003cimg width\u003d"403" height\u003d"222" alt\u003d"image"\nsrc\u003d"https://github.com/user-attachments/assets/11a9ef26-dc32-427c-a01e-7969fd24ec2d"\n/\u003e (kind of, consider this the reference implementation)\n\nSubmission is serialized per queue, but independent queues do not\nsynchronize with each other. The queue submission path reserves AQL\npackets, kernargs, and notification entries before publishing headers.\nIf admission fails, reclaim is routed through the same\nnotification/reclaim machinery instead of inventing a\nparallel cleanup path.\n\nHAL ordering is represented by semaphore/frontier dependencies, not by\nassuming FIFO execution. 
The queue frontier machinery lets the driver\nelide redundant waits when the dependency is already known to be\nsatisfied, while preserving correctness when the frontier overflows or\ncannot prove elision.\n\n### Direct Dispatch And Builtin Operations\n\nDirect `queue_dispatch` resolves executable metadata, validates dispatch\nshape, forms kernargs, retains the executable/buffer resources required\nby the submission, and emits AQL packets through the common queue\nsubmission path.\n\nQueue buffer operations are implemented through explicit strategies.\nBuiltin device kernels cover fill/copy/update paths and are selected\nbased on alignment, size, and available device-library kernels. The code\nleaves room for SDMA, PM4, P2P, and future direct-storage strategies\nwithout conflating those with the current kernel-dispatch path.\n\n### Command Buffers\n\nThe AMDGPU command-buffer ABI is the center of the rewrite.\n\nRecorded command buffers are stored as a program of blocks. Each block\nhas a fixed header with command counts, binding-source counts,\npacket/kernarg worst case, rodata extent, dispatch/profile-marker\ncounts, barrier metadata, and a terminator. Commands include barriers,\ndispatches, fills, copies, updates,\nprofile markers, branches, conditional branches, and returns.\n\nThe important split is:\n\n- the command buffer owns the durable block program and rodata;\n- the AQL block processor consumes one block and writes reserved\npacket/kernarg storage;\n- host queue replay is the container/orchestration layer that\ninitializes a processor, invokes blocks, handles continuations, and\nintegrates with semaphores/reclaim; and\n- profiling processors are separate variants that augment replay only\nwhen profiling was explicitly requested.\n\nThis shape is deliberate. A block processor is close to a small\ninterpreter over the block ABI. It is suitable for dedicated tests today\nand for device-side processor variants later. 
Host queue code should not\nneed to know how every command body becomes AQL packets.\n\nReplay hot paths are specialized:\n\n- static reusable dispatches can use prepublished kernargs;\n- all-dynamic dispatches use a direct binding-pointer scatter path;\n- mixed static/dynamic reusable dispatches use immutable templates plus\nrecorded dynamic patch sources;\n- indirect dispatch parameters stay on the generic path where required;\nand\n- profile-disabled replay bypasses profile sidecars and trace/counter\nlogic.\n\nDynamic binding sources retain the original `queue_execute` binding\ntable slot for the entire command-buffer lifetime. There is no per-block\nbinding remap sidecar, and no finalization scan that rewrites binding\nslots. Future\ndevice-side fixup should consume recorded patch records directly: `patch\noffset + binding table slot + binding offset`.\n\n### Profiling, Counters, Traces, And Replay\n\nThe driver is a first-class producer for the HAL-native profiling and\nreplay stack.\n\nSupported profiling/data modes include:\n\n- host-side memory and queue events;\n- device-side queue timestamps;\n- per-dispatch timestamps;\n- executable/export metadata;\n- hardware/software counters;\n- queue-range PMC sampling;\n- device metrics from platform-specific sources;\n- filtered ATT/SQTT executable traces through dynamically loaded ROCm\nprofiling libraries; and\n- replay captures that can be run, benchmarked, dumped, and profiled\noutside the original application.\n\nNormal execution does not require ROCm profiling libraries. The\naqlprofile path is dynamically loaded only for modes that need counters\nor executable traces. 
Linux-specific device-metric support is isolated\nbehind a platform source so the core driver remains structured for\nfuture Windows and macOS HSA support.\n\n## Performance Evidence\n\nThe main apples-to-apples GPU comparison uses the SDXL CLIP prompt\nencoder: a real sharktank workload with 792 dispatches, 28 executables,\nand enough queue traffic to exercise command-buffer replay and\nhost/runtime overhead.\n\nPost-cleanup optimized non-Tracy medians:\n\n| Shape | AMDGPU wall | HIP stream wall | AMDGPU vs stream | HIP graph wall | AMDGPU vs graph | AMDGPU host CPU |\n| --- | ---: | ---: | ---: | ---: | ---: | ---: |\n| c1/d1 | 10.9508 ms | 11.5456 ms | 5.15% faster | 11.6199 ms | 5.76% faster | 0.618 ms |\n| c1/d16 | 0.7035 ms/item | 0.7311 ms/item | 3.78% faster | 0.7335 ms/item | 4.09% faster | 0.036 ms/item |\n| c2/d16 | 0.7073 ms/item | 0.7298 ms/item | 3.08% faster | 0.7330 ms/item | 3.50% faster | 0.037 ms/item |\n| c4/d16 | 0.7066 ms/item | 0.7278 ms/item | 2.92% faster | 0.7288 ms/item | 3.05% faster | 0.037 ms/item |\n| c8/d16 | 0.7058 ms/item | 0.7322 ms/item | 3.60% faster | 0.7333 ms/item | 3.75% faster | 0.038 ms/item |\n\nThe broader model spread is consistent with the same story: native\nAMDGPU is usually ahead of HIP stream, usually ahead of HIP graph when\nHIP graph can import the workload, and uses much less host CPU on\nqueue-heavy paths.\n\nRepresentative additional rows:\n\n| Workload | Shape | AMDGPU | HIP stream | HIP graph | Notes |\n| --- | --- | ---: | ---: | ---: | --- |\n| MNIST-12 | c1/d1 | 0.0978 ms | 0.1423 ms | 0.1425 ms | Small classifier, high runtime-overhead sensitivity. |\n| SqueezeNet 1.0 | c1/d1 | 1.1428 ms | 1.2043 ms | 1.1988 ms | Compact CNN. |\n| toy CLIP bf16 | c1/d1 | 0.2227 ms | 0.2578 ms | 0.2597 ms | Transformer-ish toy encoder. |\n| MobileNetV2-12 | c1/d1 | 1.8462 ms | 1.9316 ms | crash | Depthwise/mobile CNN; HIP graph crashes locally. |\n| TinyYOLOv2-8 | c1/d1 | 7.6516 ms | 8.0490 ms | 8.5600 ms | Object detection graph. |\n| ResNet50-v1-12 | c1/d1 | 9.5364 ms | 9.6900 ms | import fails | HIP graph node limit. |\n| SDXL scheduled UNet | c1/d1 body | 204.36 ms | 215.19 ms | 216.43 ms | Direct `run_forward` body. |\n| SDXL CLIP prompt encoder | c8/d16 | 0.692 ms | 0.721 ms | 0.725 ms | Byte-identical HSACO/no-prefetch row. |\n\nWe also compared raw C HAL command-buffer construction/replay against\nraw C HIP graph construction/launch for a 512 dispatch/barrier chain,\navoiding VM overhead on both sides:\n\n| Path | Prebuilt wall | Dynamic wall | Extra wall | Extra wall / dispatch | Extra CPU / dispatch |\n| --- | ---: | ---: | ---: | ---: | ---: |\n| HAL command buffer, validated | 2096.4 us | 2177.0 us | 80.5 us | 0.157 us | 0.582 us |\n| HAL command buffer, unvalidated | 2096.4 us | 2143.3 us | 46.9 us | 0.092 us | 0.526 us |\n| HIP graph | 2983.7 us | 4022.9 us | 1039.3 us | 2.030 us | 2.308 us |\n\nThat is the key dynamic-command-buffer result: unvalidated HAL\ncommand-buffer recording/replay adds tens of microseconds for the\n512-pair chain, while HIP graph construction adds about a millisecond in\nthe same harness.\n\nQueue-stress microbenchmarks isolate the pathological submission streams\nthat large distributed and graph-style applications care about. The\ncurrent-head HAL rows below use the checked-in AMDGPU `queue_benchmark`\nbuilt optimized with release ThinLTO/O3/native flags, pinned to one CPU\nand one local RDNA3 GPU. HIP rows use the matching HIP event ping-pong\nharness on the same CPU/GPU pin. The end-to-end rows measure 512\ncross-queue dependency edges plus one public host-visible completion:\n\n| Shape | AMDGPU end-to-end / edge | HIP end-to-end / edge | Read |\n| --- | ---: | ---: | --- |\n| Cross-queue dependency edge | 4.58 us | 11.20 us | AMDGPU is 2.4x faster. |\n| Edge + 4-byte device copy | 11.65 us | 14.62 us | AMDGPU is 1.25x faster. |\n| Edge + 4-byte device fill | 10.98 us | 15.20 us | AMDGPU is 1.38x faster. |\n| Edge + tiny dispatch | 10.55 us | 14.59 us | AMDGPU is 1.38x faster. |\n| Edge + no-op dispatch packet | 4.56 us | n/a | AMDGPU stays near the pure dependency floor when payload work is empty. |\n\nThe pure submit-only dependency row is the sharpest host-path\ncomparison: AMDGPU submits a cross-queue dependency edge for about 0.42\nus/edge, while HIP events cost about 5.23 us/edge in the same pinned\nharness. That is about 12.5x less host-side submission overhead for the\nsynchronization pattern used by tensor-parallel and pipeline-parallel\nprograms.\n\nThis is not just an implementation-speed comparison. HIP stream events\nand HIP graphs sit above a compatibility runtime that has to rediscover\nintent from streams, events, graph nodes, kernel parameters, and raw\npointer arguments. IREE already has that intent in structured HAL\ncommands: explicit semaphore frontiers, queue affinity, binding tables,\nmemory types, command-buffer blocks, and executable metadata. The AMDGPU\nHAL can turn those contracts directly into AQL packets and queue-local\ncompletion state without routing every operation\nthrough HIP\u0027s public stream/event/graph abstraction.\n\nThat structural difference is why the CPU-time story is as important as\nthe wall-time story. On the SDXL CLIP prompt encoder, AMDGPU runs the\nsteady-state batched path with roughly 0.036-0.038 ms/item of host CPU\ntime while HIP stream and HIP graph paths are around 0.74-0.76 ms/item.\nThat is a roughly 20x host CPU reduction on the queue-heavy path. On\nsystems with many accelerators, expensive prefill/decode scheduling, or\nsmall CPU budgets, that difference is the difference between the CPU\nbeing orchestration glue and the CPU becoming the\nbottleneck.\n\nThe same abstraction boundary is also what lets HAL scale beyond HIP\u0027s\nworld model. 
HAL command buffers, semaphores, queue affinity, memory\nfiles, and device groups can describe local GPUs, CPU devices, remote\ndevices, and heterogeneous execution without changing the program\u0027s\nsynchronization model. The upcoming remote HAL work can use the same\ncommand/dependency concepts across process or machine boundaries; HIP\ncannot represent that kind of heterogeneous or remote execution graph\nwithout collapsing it back into host-side framework logic. This rewrite\nputs AMDGPU on the same HAL substrate as local-task, local-sync,\nprofiling, replay, and future remote execution instead of treating AMD\nGPUs as a HIP-shaped island.\n\nTracy and Perfetto captures were used as structural evidence for queue\nshape, host/runtime gaps, worker behavior, dispatch timing, counter\nranges, and device metric sampling. Non-Tracy optimized runs are the\nsource of the wall-time numbers above.\n\n## Portability And Hardware Coverage\n\nThe current implementation has been exercised primarily on local\nRDNA3/gfx1100 Linux hardware, but the code is structured for broader\nAMDGPU support.\n\nCross-device preparation in this PR includes:\n\n- target ID parsing and generated target maps for exact, generic,\nfamily, and product-bundle device-library selection;\n- explicit HSA memory-pool access and link-topology modeling;\n- CPU-visible device-coarse memory capability selection with HDP\npublication;\n- queue-owned kernarg publication policy;\n- PM4 capability detection and AQL PM4 IB infrastructure where\nsupported;\n- generic device-library target selection instead of hard-coding\ngfx1100; and\n- tests around target IDs, code-object target selection, topology,\nmemory access, device-library lookup, and PM4/AQL emitters.\n\nCross-platform preparation includes:\n\n- dynamic HSA loading instead of a direct link dependency;\n- platform-isolated Linux KFD/device-metric support;\n- optional dynamic loading of ROCm profiling libraries;\n- public HAL abstractions for 
profiling/replay rather than AMDGPU-only\ntool hooks; and\n- explicit failure for unsupported platform features.\n\nThis PR does not claim every modern RDNA/CDNA target is fully proven. It\ngives us the driver architecture, target map, and capability seams\nrequired to harden that matrix as more hardware and platform HSA stacks\nbecome available.\n\n## Forward-Looking Work Enabled By This Shape\n\nSeveral important features are intentionally not completed in this PR,\nbut the landed architecture is designed around them.\n\n**Device-side dynamic kernarg fixup.** Dynamic command buffers currently\npatch queue-owned kernargs on the host. The planned production path is\nto upload a small per-submission binding table/control record and\ndispatch a device-side fixup kernel that copies template kernargs and\npatches dynamic qwords before\npayload dispatches execute. The recorded command-buffer patch records\nalready carry the essential facts: target patch location, original\nbinding-table slot, and binding offset.\n\n**Device-side command-buffer scheduling.** The block-program ABI gives\nus a clean path to device-side processors. A device queue can invoke\nblock processors, advance command-buffer continuations, and schedule\nindependent blocks without forcing host queue code to understand every\ncommand body.\n\n**Command-buffer control flow.** The ABI already reserves branch,\nconditional branch, and return terminators. Host replay currently\nsupports the subset needed by the landed workloads; the representation\nis intentionally shaped so richer control flow can become an execution\nfeature rather than a new command-buffer\nformat.\n\n**Binding-table-indirect dispatch ABI.** A future dispatch ABI may avoid\ndynamic kernarg pointer fixup by passing an invocation-local binding\ntable base and loading buffer pointers indirectly in kernels. 
That needs\ncompiler/runtime experiments to measure the cost of an extra scalar load\nversus raw pointer kernargs, but the current direct binding-table slot\ninvariant is compatible with that direction.\n\n**PM4-backed queues and operations.** The driver now has PM4 emitters,\nPM4 program utilities, capability detection, and AQL PM4 IB slots on\nsupported hardware. That creates room for PM4-backed waits, transfers,\nprofiling snippets, and potentially lower-level queue strategies where\nHSA/AQL alone is not the best mechanism.\n\n**Transfer strategy expansion.** Current transfer paths use explicit\nbuiltin device kernels and staging strategies. The queue/file/memory\nsplit leaves room for SDMA, P2P, direct storage, and topology-aware copy\nselection without rewriting the core queue completion path.\n\n**Broader profiling.** CDNA devices should expose richer counter options\nthan the initial local setup. The queue-range PMC and profile-bundle\ninfrastructure are meant to scale into that environment without changing\nthe normal execution path.\n\n## Review Guide\n\nGood entry points for review:\n\n- `runtime/src/iree/hal/drivers/amdgpu/README.md`: user-facing driver\noverview, build flags, runtime selection, profiling, and target-library\nnotes.\n- `runtime/src/iree/hal/drivers/amdgpu/api.h`: public driver/device\noptions.\n- `runtime/src/iree/hal/drivers/amdgpu/driver.c`: driver registration,\nHSA loading, and device creation.\n- `runtime/src/iree/hal/drivers/amdgpu/logical_device.c`: HAL device\nmethods, profiling/replay integration, and physical-device\norchestration.\n- `runtime/src/iree/hal/drivers/amdgpu/physical_device.c`: HSA agent\nsetup, queue creation, memory pools, executable caches, device\nlibraries, profiling, and topology state.\n- `runtime/src/iree/hal/drivers/amdgpu/host_queue.c`: queue ownership,\ncompletion thread, submission state, and reclaim lifetime.\n- `runtime/src/iree/hal/drivers/amdgpu/host_queue_submission.c`: common\nsubmission 
admission, publication, and failure/reclaim path.\n- `runtime/src/iree/hal/drivers/amdgpu/aql_command_buffer.c`:\ncommand-buffer recording, layout, prepublication, dynamic binding\nstrategy, and block construction.\n- `runtime/src/iree/hal/drivers/amdgpu/abi/command_buffer.h`: durable\ncommand-buffer block ABI.\n- `runtime/src/iree/hal/drivers/amdgpu/aql_block_processor.c`:\nunprofiled AQL block processor.\n- `runtime/src/iree/hal/drivers/amdgpu/aql_block_processor_profile.c`:\nprofiling-augmented block processor.\n- `runtime/src/iree/hal/drivers/amdgpu/host_queue_command_buffer*.c`:\nhost replay orchestration, block submission, packet policy, scratch\nstorage, and profiling integration.\n- `runtime/src/iree/hal/drivers/amdgpu/profile_*.c`: profile producers\nfor events, metadata, counters, device metrics, and traces.\n- `runtime/src/iree/hal/drivers/amdgpu/device/*.c`: embedded device-side\nhelper kernels and host-side packet/kernarg formation helpers.\n- `runtime/src/iree/hal/drivers/amdgpu/util/*.c`: HSA loading, target\nIDs, code-object metadata, rings, signals, PM4/AQL emitters, topology,\nand KFD utilities.\n\n## Validation\n\nValidation covered both source-level unit tests and workload-level\nevidence:\n\n- focused AMDGPU unit tests for HSA loading, target IDs, code-object\nmetadata, device libraries, topology, capabilities, pools, signals,\nrings, emitters, executables, semaphores, allocators, command buffers,\nblock processors, host queue submission, staging, profiling\nmetadata/events, and CTS backends;\n- AMDGPU HAL CTS dispatch/executable coverage;\n- focused Linux Bazel ASAN builds/tests for the AMDGPU runtime targets;\n- focused CMake configure/build/test coverage for AMDGPU runtime\nlibraries and generated CTS artifacts;\n- Windows and macOS CMake validation of the shared\nHAL/async/profile/replay substrate that this driver depends on;\n- SDXL CLIP correctness on both visible local AMDGPU devices with the\nsame weights, inputs, and expected outputs 
used for CPU validation;\n- SDXL CLIP, SDXL UNet, model-spread, command-buffer-vs-HIP-graph,\nTracy, Perfetto, device-metrics, PMC, and ATT/SQTT profiling runs; and\n- pre-commit formatting/check generation hooks for the final branch.\n\nThe performance numbers in this PR are from optimized non-Tracy runs on\nmy machine; YMMV. Tracy, Perfetto, counters, and device metrics were\nused to explain structure and validate behavior, not as the source of\nwall-clock claims.",
  "tree_diff": []
}
