{
  "log": [
    {
      "commit": "1d87665ee3f32be1d918c8da57069e31aad2e3cb",
      "tree": "cf965efb733d01503849c29bb461c96038b3945d",
      "parents": [
        "e1568329de107aeb5f4fe7e7ff36e9fe99fa5397"
      ],
      "author": {
        "name": "Lukas Sommer",
        "email": "lukas.sommer@amd.com",
        "time": "Fri May 08 16:14:06 2026 +0200"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Fri May 08 16:14:06 2026 +0200"
      },
      "message": "[VectorDistribute] Analysis to propagate promotion information (#24227)\n\nAdd a sparse backward dataflow analysis to propagate promotion types\nfrom `to_layout` operations on operands to compute ops backward to the\norigin of the operand, e.g., a read from memory.\n\nThe motivation is to use this analysis from transformation passes to\nfind loads from memory that should be promoted, e.g., using\ndirect-to-LDS and then transform the load accordingly.\n\nThe lattice of the analysis has three states: an undefined (initial)\nstate, a defined state with a concrete promotion type stored as\nattribute and an overdefined (top) state. If two defined lattice values\nwith different promotion types meet, the resulting lattice value is\noverdefined.\n\nThis is part of https://github.com/iree-org/iree/issues/23782.\n\nAssisted-by: Claude Code and Codex\n\nSigned-off-by: Lukas Sommer \u003clukas.sommer@amd.com\u003e"
    },
    {
      "commit": "e1568329de107aeb5f4fe7e7ff36e9fe99fa5397",
      "tree": "c838a6134757d0bf8969a7acb48a0428cb8ddb29",
      "parents": [
        "8107036f152829cd3a03615734ede3d855c106cd"
      ],
      "author": {
        "name": "Aaron St George",
        "email": "aaronstgeorge@gmail.com",
        "time": "Fri May 08 06:24:11 2026 -0600"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Fri May 08 05:24:11 2026 -0700"
      },
      "message": "Enable building without google benchmark dependency  (#24167)\n\nWhen building in TheRock we need to [pull the third party google\nbenchmark\ndependency](https://github.com/ROCm/TheRock/blob/497af57abf59b11237b55c4ebd553b89d00d488b/build_tools/fetch_sources.py#L661)\neven though the build in TheRock builds [without\ntests](https://github.com/ROCm/TheRock/blob/497af57abf59b11237b55c4ebd553b89d00d488b/iree-libs/CMakeLists.txt#L19-L38)\nand does not require `iree-benchmark-module`.\n\nThis PR adds an option to build IREE without benchmark dependency,\nallowing us to remove the dep from TheRock. Removing the dep limits\noperational risk and allows us to sidestep a configure failure coming\nfrom benchmark CMake setup in ASan builds (a compiler invocation inside\na capability check fails to find the ASan runtime library).\n\n---------\n\nCo-authored-by: Claude Opus 4.7 (1M context) \u003cnoreply@anthropic.com\u003e"
    },
    {
      "commit": "8107036f152829cd3a03615734ede3d855c106cd",
      "tree": "e7ab80588beeda6c09902d04073a503bf7970ea5",
      "parents": [
        "597629e7b3bccd8ffce6e542e3230d0c5bc84e16"
      ],
      "author": {
        "name": "Zhuoran Yin",
        "email": "zhuoryin@amd.com",
        "time": "Fri May 08 08:17:49 2026 -0400"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Fri May 08 08:17:49 2026 -0400"
      },
      "message": "[Codegen][GPU] Remove the single-iteration workaround and distribute pad-fused copies (#24116)\n\nPreviously, pad-fused `linalg.copy` operations were wrapped in a\nsingle-iteration `scf.forall` (TODO #23365), restricting the DMA load to\none warp. After single-iteration loop removal, this produced an `scf.if`\npredicated on `thread_id` which blocked software pipelining — the\npipeliner cannot handle `gather_to_lds` inside conditional blocks. On a\n1134×2048×150000 bf16 GEMM (LHS transposed), this caused a ~36%\ndirect-load regression.\n\nThis PR does the following:\n1. Remove single-iteration workaround in `tileAtSubgroupLevel`:\nPad-fused copies now use `computeSubgroupTileSizes` for multi-warp\ndistribution, same as non-padded copies.\n2. Propagate source offsets in `createDMAInForall`: When a `tensor.pad`\nis detected, trace from the tiled `extract_slice` back to the original\npad to recover warp offsets. Create a new `tensor.extract_slice` from\nthe pre-pad source with offsets and sizes clamped to actual source\nbounds via `arith.minsi/maxsi/subi`. The DMA\u0027s `in_bounds` attribute\nsignals which dimensions may read OOB, relying on hardware\n(`fat_raw_buffer`) to return zeros.\n3. Guard `ConvertPadFusionCopyToCoalescedDMA`: Skip copies already\ndistributed into warp-mapped foralls, making this pattern a fallback for\nedge cases.\n\n**Before** (single warp, blocks pipelining):\n```\nscf.forall (%warp) \u003d (0, 0) to (4, 64) step (4, 64)   // 1 iteration\n  coalesced_gather_dma %pre_pad_source into %full_init\n```\n\n**After** (all warps participate, pipelining enabled):\n```\nscf.forall (%warp) \u003d (0, 0) to (4, 64) step (1, 64)   // 4 iterations\n  %clamped \u003d extract_slice %pre_pad_source[minsi(%warp, %src_dim), 0]\n                                          [minsi(%remaining, 1), 64]\n  coalesced_gather_dma %clamped into %warp_init in_bounds [false, true]\n```\n\n---------\n\nSigned-off-by: jerryyin \u003czhuoryin@amd.com\u003e"
    },
    {
      "commit": "597629e7b3bccd8ffce6e542e3230d0c5bc84e16",
      "tree": "1730a9717d320193ca1117b97ada14496913cbc3",
      "parents": [
        "98ddf0c9c524082762d7c93182ccd63d36e31aec"
      ],
      "author": {
        "name": "Lukas Sommer",
        "email": "lukas.sommer@amd.com",
        "time": "Fri May 08 09:07:59 2026 +0200"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Fri May 08 09:07:59 2026 +0200"
      },
      "message": "[Codegen] Limit async scope in pipelining (#24350)\n\nCurrently, the pipelining assumes that all `amdgpu.gather_to_lds`\noperations are part of the loop that is being pipelined and marks them\nas async, with waits inserted in the pipelined loop.\n\nThis assumption was fine for the workloads so far, but with work\nunderway to enable pipelining for attention, this no longer holds, as\nthe load of e.g. the `Q` matrix can be outside of the loop. Operations\noutside the loop can also be marked async, but that requires inserting\nmarks and waits before the loop.\n\nThis PR fixes this by recording a marker in the block before the loop\nand later on inserting marks and waits for async operations before the\nmarker.\n\nThis is part of https://github.com/iree-org/iree/issues/23782.\n\nAssisted-by: Claude Code and Codex\n\n---------\n\nSigned-off-by: Lukas Sommer \u003clukas.sommer@amd.com\u003e"
    },
    {
      "commit": "98ddf0c9c524082762d7c93182ccd63d36e31aec",
      "tree": "09ce8be087d95a62fe1507e180e23ed4b4c8e879",
      "parents": [
        "c0f5d4be28ddae1362da84aefc71467b5cb2c2c2"
      ],
      "author": {
        "name": "Lukas Sommer",
        "email": "lukas.sommer@amd.com",
        "time": "Fri May 08 08:50:37 2026 +0200"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Fri May 08 08:50:37 2026 +0200"
      },
      "message": "[VectorExt] Change `to_layout` `shared_memory_conversion` (#24377)\n\nChange how `shared_memory_conversion` is represented on the `to_layout`\noperation. So far, a unit attribute was used to simply turn shared\nmemory conversion on/off. Now, the `shared_memory_conversion` takes an\nattribute, which allows us to attach for example the GPU promotion type\n(e.g. `use_global_load_dma`) to the operation.\n\nThe motivation is to use this additional information in analyses such as\nhttps://github.com/iree-org/iree/pull/24227 as part of\nhttps://github.com/iree-org/iree/issues/23782.\n\nThe representation is `AnyAttr` to (a) avoid a dependency from\n`VectorExt` on `IREEGPU` and (b) to allow other pipelines to use the\nattribute differently.\n\nThe PR also adds some verification to the `LoweringConfigAttr` to avoid\nhaving to repeatedly check that the list of promoted operands and\npromotion types has the same length.\n\nAssisted-by: Codex\n\n---------\n\nSigned-off-by: Lukas Sommer \u003clukas.sommer@amd.com\u003e"
    },
    {
      "commit": "c0f5d4be28ddae1362da84aefc71467b5cb2c2c2",
      "tree": "98aa386d1eec16d6c8b636cf5b3c5f8a4d7e4944",
      "parents": [
        "8cc05f517efb04e4302e8e35e73ed598f9dc234f"
      ],
      "author": {
        "name": "Jakub Kuderski",
        "email": "jakub@nod-labs.com",
        "time": "Thu May 07 21:57:37 2026 -0400"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Thu May 07 21:57:37 2026 -0400"
      },
      "message": "[Codegen] Move remaining pipelines to `iree_codegen` attrs (#24398)\n\nFollow-up to the LLVMGPU, SPIRV, and LLVMCPU pipeline migrations\n(#23816, #23851, #23864). Same goal: get backend-specific pipeline\nidentifiers out of the global `DispatchLoweringPassPipeline` enum and\nbehind `PipelineAttrInterface`.\n\nThis moves the remaining non-CPU/GPU pipeline cases to `iree_codegen`:\n`#iree_codegen.vmvx_pipeline`,\n`#iree_codegen.transform_dialect_codegen`, and\n`#iree_codegen.no_pipeline`. It keeps custom pass pipelines represented\nas `#iree_codegen.pass_pipeline\u003c...\u003e`.\n\nThese attrs intentionally stay in `iree_codegen`. `vmvx_pipeline` does\nnot move to a backend dialect because IREE does not have a VMVX codegen\ndialect today; it can become a backend-owned enum attr if such a dialect\nis introduced later. `transform_dialect_codegen` and `no_pipeline` are\nalso codegen-level sentinels, so modeling the remaining cases as unit\nattrs avoids creating dialect surface just for pipeline spelling.\n\nAfter this, `TranslationInfoAttr` and `iree_codegen.smt.constraints`\nparse pipeline attributes directly through the interface. The old enum,\nkeyword compatibility parser, helper predicates, and textual parsing in\nPython dialect tests can be removed.\n\nIssue: https://github.com/iree-org/iree/issues/23535\n\nAssisted-by: codex"
    },
    {
      "commit": "8cc05f517efb04e4302e8e35e73ed598f9dc234f",
      "tree": "8e0b28e6469a6c8310ad2d7e8fc975d823bdf527",
      "parents": [
        "16191ce4a0e641e5212ec0aef75600ce3c8b60c8"
      ],
      "author": {
        "name": "Han-Chung Wang",
        "email": "hanhan0912@gmail.com",
        "time": "Thu May 07 16:51:41 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Thu May 07 16:51:41 2026 -0700"
      },
      "message": "[CPU] Improve tiling config for elementwise ops with dynamic shapes. (#24383)\n\nThe default distribution tile sizes were 64s for dynamic shape, which\nalso requires a fixup in elementwise ops. Otherwise, the runtime\noverheads dominates the performance. We identified the issue for static\ncases before, and the recent study also points out that it is needed for\ndynamic shapes. See https://github.com/iree-org/iree/issues/24012 for\nmore details.\n\nNote that it is not common in full model workload because they are\nmostly fused with producers, so it was not on our radar until today.\n\nSigned-off-by: hanhanW \u003chanhan0912@gmail.com\u003e"
    },
    {
      "commit": "16191ce4a0e641e5212ec0aef75600ce3c8b60c8",
      "tree": "670e82539c626b748d0861ff4379f5bf6485192a",
      "parents": [
        "0fe0ca8f7c7daf9af6e6da1df995bdd1098a8c9c"
      ],
      "author": {
        "name": "Zhewen Yu",
        "email": "zhewenyu@amd.com",
        "time": "Fri May 08 00:43:05 2026 +0100"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Thu May 07 23:43:05 2026 +0000"
      },
      "message": "Revert \"[Codegen] Enable DMA by default for F16/BF16 Gemm on gfx950 (#24373)\" (#24395)\n\nThis reverts commit 4f990431a73902c02288fd5892ddf4540b72998b, due to\nperformance regression on some 1x1 conv.\n\nSigned-off-by: Yu-Zhewen \u003czhewenyu@amd.com\u003e"
    },
    {
      "commit": "0fe0ca8f7c7daf9af6e6da1df995bdd1098a8c9c",
      "tree": "622cd8da8d264e21f6d747073db7ee531d08d8d8",
      "parents": [
        "d62d69b27352204b026883498be964e0606c0eec"
      ],
      "author": {
        "name": "Keshav Vinayak Jha",
        "email": "31160700+keshavvinayak01@users.noreply.github.com",
        "time": "Fri May 08 03:05:33 2026 +0530"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Fri May 08 03:05:33 2026 +0530"
      },
      "message": "[Torch][LinalgExt] Support GQA in torch.hop_flex_attention lowering (#24313)\n\nThis PR adds grouped-query attention support to the\ntorch.hop_flex_attention lowering in IREE’s Torch input pipeline.\n\nWhen query, key, and value have different head counts, the lowering now\nexpands key and value heads to match the query head count before\nemitting `iree_linalg_ext.online_attention`.\n\nThis is similar to how the torch-mlir \u003d\u003e TMTensor lowering handles this\ncase.\n\n---------\n\nSigned-off-by: Keshav Vinayak Jha \u003ckeshavvinayakjha@gmail.com\u003e\nCo-authored-by: GPT-5 Codex \u003cnoreply@openai.com\u003e"
    },
    {
      "commit": "d62d69b27352204b026883498be964e0606c0eec",
      "tree": "05c1e4cd3a4199902b6e05cb358c975c97acf688",
      "parents": [
        "c4da71ca6bf0ea7d45fe1453e76bd20a17ea26dd"
      ],
      "author": {
        "name": "Jakub Kuderski",
        "email": "jakub@nod-labs.com",
        "time": "Thu May 07 16:49:35 2026 -0400"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Thu May 07 20:49:35 2026 +0000"
      },
      "message": "[Compiler] Use Repeated\u003cT\u003e for repeated value ranges. NFC. (#24392)\n\nReplace repeated `SmallVector\u003cValue/Type\u003e(n, x)` temporaries with\n`llvm::Repeated\u003cValue/Type\u003e` where the result is only consumed as a\n`ValueRange` or `TypeRange`. This avoids needless memory allocations.\n\nThis mirrors the MLIR-side refactoring in llvm/llvm-project#188846.\n\nAssisted-by: codex"
    },
    {
      "commit": "c4da71ca6bf0ea7d45fe1453e76bd20a17ea26dd",
      "tree": "d7dc2e69c266d5f18a00a85811effbbc6f2660eb",
      "parents": [
        "fac9d3d9d9b1ea55757c3ae94e9b03ff50b6d4f0"
      ],
      "author": {
        "name": "Benoit Jacob",
        "email": "jacob.benoit.1@gmail.com",
        "time": "Thu May 07 16:28:45 2026 -0400"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Thu May 07 20:28:45 2026 +0000"
      },
      "message": "[Codegen][CPU] Add a type-polymorphic generic-scalar MMA fallback. (#24389)\n\nAdds two new `MMAIntrinsic` values, `MMA_GENERIC_SCALAR_1x1x1_REG8` and\n`_REG16`, that the data-tiling cost model picks when no element-type-\nspecific intrinsic on the target supports the matmul\u0027s (LHS, RHS, ACC)\ntypes. This intentionally breaks the \"an MMAIntrinsic enum value pins\ndown a specific element-type triple\" invariant in exchange for not\nhaving to add one enum value per supported triple. Element types live on\nnew `DataTiledMMAAttr.{lhs,rhs,acc}_type` parameters, populated by the\ncost model only when the chosen intrinsic is one of the polymorphic\nvariants.\n\nThe cost model picks `_REG16` on 64-bit ISAs (x86_64, AArch64, RISC-V)\nand `_REG8` on 32-bit ISAs. The number is a register-budget for the\nunroll heuristic — one element of any width occupies one register, but\nthe architectural register file the lowering ends up in (GPR or SIMD-\nscalar lane) is up to LLVM. The budget is encoded in the low byte of the\nenum value, so `chooseUnrolling` can read it back.\n\nSince the intrinsic is 1×1×1, the operand tiles after `intrinsics_m` /\n`intrinsics_n` / `intrinsics_k` are simple row-major (M, K) / (N, K) /\n(M, N) — `linalg.mmt4d`-shaped.\n`DataTiledMMAAttr::buildUnderlyingOperations` therefore short-circuits\nthe swizzle/distribute pipeline for these intrinsics and emits a single\n`vector.contract` directly, with `arith.extf` / `arith.extsi` widening\nnarrow LHS/RHS to ACC\u0027s element type. For sub-byte LHS/RHS types\n`chooseUnrolling` also picks the smallest power-of-two `intrinsics_k`\nsuch that K*lhsBits and K*rhsBits are byte-aligned (e.g. K\u003d2 for i4/f4,\nK\u003d4 for f6, K\u003d8 for i1).\n\nProgress towards #24323"
    },
    {
      "commit": "fac9d3d9d9b1ea55757c3ae94e9b03ff50b6d4f0",
      "tree": "6e5603ae64ba5c0ad58c4e8aa30e9a40de4cd59e",
      "parents": [
        "f4fb944902742e77d7708bcc8d130aa81f49401b"
      ],
      "author": {
        "name": "Alan Li",
        "email": "me@alanli.org",
        "time": "Thu May 07 15:56:10 2026 -0400"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Thu May 07 15:56:10 2026 -0400"
      },
      "message": "[INTEGRATION] Bump llvm to 0f3ca6bb9  (#24390)\n\nUpdated code base to accomondate 2 upstream llvm changes:\n* https://github.com/llvm/llvm-project/pull/196082 : update\n`ValueBoundsOpInterface` options struct.\n* https://github.com/llvm/llvm-project/pull/191821 : introduced\n`mlir::Complex\u003cT\u003e` for non-float complex.\n\nAlso updated stablehlo and torch-mlir for the interface changes."
    },
    {
      "commit": "f4fb944902742e77d7708bcc8d130aa81f49401b",
      "tree": "41364274d1e28d974e66be813945c5905936f096",
      "parents": [
        "1c14508943bc6c16f531aa3bb2955f133897876c"
      ],
      "author": {
        "name": "Alan Li",
        "email": "me@alanli.org",
        "time": "Thu May 07 14:06:31 2026 -0400"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Thu May 07 14:06:31 2026 -0400"
      },
      "message": "[ROCM] Workaround LLVM #194924 partial-unroll regression (#24379)\n\nllvm/llvm-project#194924 added:\n\n```\nUP.Runtime \u003d true;\nUP.PartialThreshold \u003d UP.Threshold / 4;   // \u003d 75 for AMDGPU\n```\n\nto `AMDGPUTTIImpl::getUnrollingPreferences`. The second line gates\ncompile-time partial unrolling, suppressing the partial-unroll-by-2 of\nsmall constant-trip-count reduction loops.\n\nAs a temporary measurement this patch:\nRestore the prior un-overridden LLVM default of 150."
    },
    {
      "commit": "1c14508943bc6c16f531aa3bb2955f133897876c",
      "tree": "a1e1d8b96ec34d2657b9a2f5487572d80a902736",
      "parents": [
        "40305344b4f448ccd70400050a40f9994a2fcd63"
      ],
      "author": {
        "name": "Erick Ochoa Lopez",
        "email": "erick.ochoalopez@amd.com",
        "time": "Thu May 07 08:54:19 2026 -0400"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Thu May 07 08:54:19 2026 -0400"
      },
      "message": "[Codegen] Canonicalize transfer_{read,write} vector\u003c1xT\u003e (#24382)\n\nvector.transfer_read and vector.transfer_writes\u0027s permutation maps are\nirrelevant with vector\u003c1xT\u003e. This pattern unblocks lowering to\nvector.load and vector.store.\n\nAssisted-By: Claude Opus 4.6"
    },
    {
      "commit": "40305344b4f448ccd70400050a40f9994a2fcd63",
      "tree": "f86a935109e389f2b94ecaface7a92b108870090",
      "parents": [
        "ac49ab6d060183e94ec01e24915b5a0010f3410d"
      ],
      "author": {
        "name": "Alan Li",
        "email": "me@alanli.org",
        "time": "Wed May 06 18:14:59 2026 -0400"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Wed May 06 18:14:59 2026 -0400"
      },
      "message": "[INTEGRATION] Bump llvm to 3ed76d05a78d (#24376)\n\nAlso drops `test_reversesequence_time` from `cpu_llvm_sync` xfails,\nwhich is fixed by https://github.com/llvm/llvm-project/pull/195359"
    },
    {
      "commit": "ac49ab6d060183e94ec01e24915b5a0010f3410d",
      "tree": "6340aaa2eb4f7dff3cd48aaf8a101868752571b2",
      "parents": [
        "4f990431a73902c02288fd5892ddf4540b72998b"
      ],
      "author": {
        "name": "Jakub Kuderski",
        "email": "jakub@nod-labs.com",
        "time": "Wed May 06 17:14:19 2026 -0400"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Wed May 06 21:14:19 2026 +0000"
      },
      "message": "[ROCm] Drop deprecated --iree-hip flag aliases (#24381)\n\nThese aliases were deprecated in\nhttps://github.com/iree-org/iree/pull/23420 on 2026-02-06. Three months\n(89 days) have passed, so keep only the --iree-rocm-* flags and\nIREE_ROCM_TEST_TARGET_CHIP."
    },
    {
      "commit": "4f990431a73902c02288fd5892ddf4540b72998b",
      "tree": "d5873749815a8d57c29a3cfc4839f03396a5269c",
      "parents": [
        "5fbfe29bc4ca9e3e0c93931fbe394ac21537a72f"
      ],
      "author": {
        "name": "Zhewen Yu",
        "email": "zhewenyu@amd.com",
        "time": "Wed May 06 22:08:25 2026 +0100"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Wed May 06 22:08:25 2026 +0100"
      },
      "message": "Reapply \"[Codegen] Enable DMA by default for F16/BF16 Gemm on gfx950 (#24117)\" (#24235) (#24373)\n\nThis reverts commit 75ffbc37144de79cc9428f97827251b2242b230f.\n\nThe previously reported numerical issues have now been resolved through\nthe following changes:\n- https://github.com/iree-org/iree/pull/24241\n- https://github.com/iree-org/iree/pull/24242\n- https://github.com/iree-org/iree/pull/24254\n\n---------\n\nSigned-off-by: Yu-Zhewen \u003czhewenyu@amd.com\u003e"
    },
    {
      "commit": "5fbfe29bc4ca9e3e0c93931fbe394ac21537a72f",
      "tree": "42b0d01c00f4dfceff26855455f8a4f3d5954678",
      "parents": [
        "d2f4f441a638c31d2b4bdb1165dc98418a069df2"
      ],
      "author": {
        "name": "Keshav Vinayak Jha",
        "email": "31160700+keshavvinayak01@users.noreply.github.com",
        "time": "Thu May 07 02:27:06 2026 +0530"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Wed May 06 20:57:06 2026 +0000"
      },
      "message": "[LinalgExt] Fix attention NaN for fully-masked rows (#24178)\n\nAttention softmax normalization produced `NaN` for fully-masked rows.\n\n  The failing case is:\n  - masked scores are all `-inf`\n  - softmax numerator becomes `0`\n  - softmax denominator becomes `0`\n  - final normalization computes `0 / 0`\n\nPyTorch SDPA uses `_safe_softmax`, which explicitly zeroes fully-masked\nrows, so IREE should produce `0` here instead of `NaN`.\n\nThis PR handles that in both attention lowering paths:\n\n- Standalone `iree_linalg_ext.attention` decomposition clamps the row\nsoftmax denominator with `max(sum, 1)` before `P / sum`.\n- Online attention finalization keeps the existing unmasked `(1 / sum) *\nx` IR unchanged.\n- Masked online attention guards the existing finalization loop so `sum\n\u003d\u003d 0` yields `0` instead of `NaN`, avoiding an extra row-level pass.\n\nFor non-fully-masked rows, the softmax denominator is unchanged: after\nmax subtraction, at least one term is `exp(0) \u003d 1`, so `sum \u003e\u003d 1`.\n\n  Fixes #24175.\n\n---------\n\nSigned-off-by: Keshav Vinayak Jha \u003ckeshavvinayakjha@gmail.com\u003e"
    },
    {
      "commit": "d2f4f441a638c31d2b4bdb1165dc98418a069df2",
      "tree": "1e3b426d518c5d2bd525cb95d7997d6cf13bf256",
      "parents": [
        "9092be0805ca810e80dc3f1f06958f895fe019a6"
      ],
      "author": {
        "name": "Keshav Vinayak Jha",
        "email": "31160700+keshavvinayak01@users.noreply.github.com",
        "time": "Thu May 07 02:03:54 2026 +0530"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Wed May 06 20:33:54 2026 +0000"
      },
      "message": "Bump iree-org/torch-mlir@46cbd27f7c (#24380)\n\nPick up llvm/torch-mlir#4562, which adds the optional enable_gqa\nattribute to `torch.hop_flex_attention`.\n\nHelps with https://github.com/iree-org/iree/pull/24313\n\nSigned-off-by: Keshav Vinayak Jha \u003ckeshavvinayakjha@gmail.com\u003e"
    },
    {
      "commit": "9092be0805ca810e80dc3f1f06958f895fe019a6",
      "tree": "730553e7c9b3fa0dcfa32195478a054b9eb8aaa9",
      "parents": [
        "6efc2ca92a246bf9d696abd5c5f0b8a654cfb247"
      ],
      "author": {
        "name": "Bangtian Liu",
        "email": "liubangtian@gmail.com",
        "time": "Wed May 06 13:13:53 2026 -0400"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Wed May 06 17:13:53 2026 +0000"
      },
      "message": "[InputConversion] Lower AtenArgmax/AtenArgmin to iree_linalg_ext.arg_compare (#24291)\n\n#24347 has provided the fallback for the arg_compare op (unsupported by\nthe VectorDistribute) to the TileAndFuse pipeline.\n\nThis PR adds a Torch input conversion pattern that lowers\n`torch.aten.argmax` / `torch.aten.argmin` to\n`iree_linalg_ext.arg_compare`, replacing the prior decomposition path\nthrough generic `linalg.generic` reductions, and cleans up the matching\nONNX `select_last_index` xfails that now pass on Valkan O0.\n\nAssisted-by: [Claude Code](https://claude.ai/code)\n\n---------\n\nSigned-off-by: Bangtian Liu \u003cliubangtian@gmail.com\u003e"
    },
    {
      "commit": "6efc2ca92a246bf9d696abd5c5f0b8a654cfb247",
      "tree": "8b16500b700c1b7d6aa5fe86f3eb29f7f57698c0",
      "parents": [
        "ca7e06373768634a9a4a857ce7dd543f299886b2"
      ],
      "author": {
        "name": "Erick Ochoa Lopez",
        "email": "erick.ochoalopez@amd.com",
        "time": "Wed May 06 11:28:33 2026 -0400"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Wed May 06 11:28:33 2026 -0400"
      },
      "message": "Generalize FoldMaskedTransferRaw and add FoldTransferReadOfEmptyTensor (#24301)\n\n* Generalizes FoldMaskedTransferRAW from (masked, masked) to (unmasked,\nunmasked), (unmasked, masked), (masked, unmasked).\n* Adds pattern to fold transfer_read(tensor.empty)) -\u003e ub.poison\n* This allows intermediary index tensors to be folded after\nvectorization.\n* The test pipeline_vector_distribute_reduction_gfx942.mlir needed to be\ncorrected. The empty tensors are now folded, but the test was wrong,\nonline_attention should not have had empty tensors as operands to begin\nwith. All passes that create online_attention fill the operands with\neither 0 or -1. So we do that here as well.\n\nFixes #24294\n\n----\n\nAssisted-By: Claude Opus 4.6"
    },
    {
      "commit": "ca7e06373768634a9a4a857ce7dd543f299886b2",
      "tree": "e10d26579041fdc9b76baf162e78e7c5e8ad222f",
      "parents": [
        "c6525dd763ae1414b3715b6beb56957e2383a41c"
      ],
      "author": {
        "name": "Lukas Sommer",
        "email": "lukas.sommer@amd.com",
        "time": "Wed May 06 17:18:51 2026 +0200"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Wed May 06 17:18:51 2026 +0200"
      },
      "message": "[VectorDistribute] Lower and distribute `async_dma` (#24299)\n\nPass to distribute and lower `async_dma` operations at the workgroup\nlevel to `amdgpu.gather_to_lds` operations at the thread-level (with\nthreads in each subgroup collaborating).\n\nThe pass shares helpers with the existing GPU pass to distribute\noperations based on layouts, but as the `async_dma` operation does not\nhave `vector` operands or results, lowering and distribution are\nimplemented as a separate pass. The changes to\n`GPUNestedLayoutDistributionPatterns.cpp` are therefore mainly a code\nmove extracting shared helpers to the new\n`GPUNestedLayoutUtils.[h|cpp]`.\n\nThe basic idea of the distribution is to construct a (nested) layout\nthat represents how the data-transfer is split across subgroups and\nthreads to perform the full transfer with direct-to-LDS compatible\noperations. The layout is constructed in stages:\n1. We choose the DMA size for the given target that fulfills the\nrequirements and determine the element tile based on the size of the\ntransfer per thread from the DMA size (`distributeFromInnermost`).\n2. The element tile is given by the number of threads in subgroup\n(`distributeFromInnermost`).\n3. Outer tile is always all-ones.\n4. We distribute the transfer to the configured number of subgroups\n(`distributeFromOutermost`).\n5. Whatever is left after these steps ends up as the batch tile of each\nthread.\n\nOnce we have that layout, we can use the shared helpers for the\nmechanics of distributing the operation.\n\nThe distribution fails if any of the requirements are not met. This is\nmostly a defensive check, the pass inserting the `async_dma` operations\n(will be added in a different PR) should only insert `async_dma`\noperations if the prerequisites can be met with the available DMA sizes\nfor the transfer shape etc. Therefore, the pass also fails if any of the\n`async_dma` operations could not be distributed and lowered.\n\nSwizzling and gather semantics are not part of this PR and will be added\nin follow-up PRs.\n\nThis is part of https://github.com/iree-org/iree/issues/23782.\n\nAssisted-by: Claude Code and Codex\n\n---------\n\nSigned-off-by: Lukas Sommer \u003clukas.sommer@amd.com\u003e"
    },
    {
      "commit": "c6525dd763ae1414b3715b6beb56957e2383a41c",
      "tree": "a7a8d0e1ae7f55627ec7fa293c59d67e9f25e7ae",
      "parents": [
        "26083307b8753a11870b8caf590afea15c9429fa"
      ],
      "author": {
        "name": "Lukas Sommer",
        "email": "lukas.sommer@amd.com",
        "time": "Wed May 06 17:17:29 2026 +0200"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Wed May 06 17:17:29 2026 +0200"
      },
      "message": "[Codegen] Duplicate operations in tile size analysis (#24246)\n\nThe tile size analysis considers some operations such as `linalg.fill`\nas duplicatable to avoid poisoning the tile size analysis in case such\neasy-to-duplicate operations are used by multiple users with different\ntile sizes after CSE.\n\nSo far, that duplication was never materialized. Now, the\nmaterialization pass after analysis clones duplicatable operation with\nthe requested tile-size attributes.\n\nThis PR also fixes missing tile-size attributes on operations that have\ntheir tile size fully defined by the analysis on the result. So far, the\nmaterialization only considered operands, now it also uses result\ninformation.\n\nThis is part of https://github.com/iree-org/iree/issues/24221.\n\nAssisted-by: Codex\n\n---------\n\nSigned-off-by: Lukas Sommer \u003clukas.sommer@amd.com\u003e"
    },
    {
      "commit": "26083307b8753a11870b8caf590afea15c9429fa",
      "tree": "ab7f68550d84190479ab3383584a2e61ce674659",
      "parents": [
        "2725310623260a45ef6dbf667e80d1c6ac8af7dd"
      ],
      "author": {
        "name": "Lukas Sommer",
        "email": "lukas.sommer@amd.com",
        "time": "Wed May 06 08:04:51 2026 +0200"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Wed May 06 08:04:51 2026 +0200"
      },
      "message": "[IREEGPU] Bufferize `async_dma` (#24300)\n\nImplements bufferization for `iree_gpu.async_dma` through an external\nmodel interface implementation.\n\nThis is part of https://github.com/iree-org/iree/issues/23782.\n\nAssisted-by: Claude Code and Codex\n\n---------\n\nSigned-off-by: Lukas Sommer \u003clukas.sommer@amd.com\u003e"
    },
    {
      "commit": "2725310623260a45ef6dbf667e80d1c6ac8af7dd",
      "tree": "93c37e4410a8eb8bc4a5ed20249e8dfd66e8bfcb",
      "parents": [
        "a1b8b7246f72e7d1b54d44a245692c252feb56f5"
      ],
      "author": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Tue May 05 22:17:10 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Tue May 05 22:17:10 2026 -0700"
      },
      "message": "[AMDGPU] Roll-up of AMDGPU HAL improvements for CDNA support (#24359)\n\nMostly just hardening for different PM4 modes and minor performance\noptimizations found during benchmarking. Besides the profiling quirk the\nentire HAL worked first try on an MI300X. Neat."
    },
    {
      "commit": "a1b8b7246f72e7d1b54d44a245692c252feb56f5",
      "tree": "75385733f1fd24c2e9c1b4c0190a77541c41ec29",
      "parents": [
        "f9562feeaf1eaab5f2d5ed7a3265029d21e9fcea"
      ],
      "author": {
        "name": "Han-Chung Wang",
        "email": "hanhan0912@gmail.com",
        "time": "Tue May 05 20:38:46 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Tue May 05 20:38:46 2026 -0700"
      },
      "message": "[test][nfc] Add regression tests about strided vector.gather back. (#24370)\n\nIt was failing on RISC-V because it explicitly exlcudes the lowering\n(which contains the proper fix) in\nhttps://github.com/iree-org/iree/commit/25dcacfcbcd4ec91260920d7296ad558aa03ef84\n\nIt seems like it is converted to `llvm.*.gather` ops and relies on LLVM\nbackend to pick the corresponding instruction. This basically bypasses\nthe vector level optimization in MLIR which is not fixable in IREE.\n\nRelated issue: https://github.com/iree-org/iree/issues/24342\n\n---------\n\nSigned-off-by: hanhanW \u003chanhan0912@gmail.com\u003e"
    },
    {
      "commit": "f9562feeaf1eaab5f2d5ed7a3265029d21e9fcea",
      "tree": "8e95dae7e823e7138fa50096cb7859c8b1cc55bc",
      "parents": [
        "915c6ea4d5d5af191c40cb6f0168fbaf4684eb2f"
      ],
      "author": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Tue May 05 20:14:03 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Tue May 05 20:14:03 2026 -0700"
      },
      "message": "[HAL] Preserve tensor import effects when folding (#24364)\n\nHAL tensor imports carry optional wait fences and consume markers that\nare part of the availability and ownership contract for the imported\ntensor. Folding through an import/export pair erased that contract,\nallowing an async wrapper that returns an imported buffer view unchanged\nto discard the wait fence and signal its output fence immediately.\n\nKeep the fold only for imports with no sequencing or ownership effects.\nImports with wait_fence, consume, or byte offsets remain explicit so\nlater conversion can materialize the required timepoint await or\nownership transfer instead of reducing the wrapper to an immediate\nhost-side fence signal.\n\nAdd canonicalization coverage for both import-to-export and\nexport-to-import pairs so the plain effect-free fold remains available\nwhile sequenced imports are preserved.\n\nFound while triaging #24324."
    },
    {
      "commit": "915c6ea4d5d5af191c40cb6f0168fbaf4684eb2f",
      "tree": "9d5a6e565cfb378f1b26579558b5ff9fc71e3dd8",
      "parents": [
        "b63db901bf90751c9b1f5f8b812f7ce6e6af9d83"
      ],
      "author": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Tue May 05 20:13:06 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Tue May 05 20:13:06 2026 -0700"
      },
      "message": "[HAL] Add executable global lookup buffers (#24336)\n\nIntroduces iree_hal_executable_lookup_global_by_name as the public HAL\nAPI for resolving executable-owned globals. The API returns a normal\niree_hal_buffer_t so callers can use existing queue copies, mapping\nchecks, and command-buffer bindings instead of reaching around HAL with\nraw device pointers.\n\nAMDGPU resolves HSA variable symbols for the selected queue-affinity\ndevice and wraps the variable address as non-owning executable-backed\nstorage. HIP and CUDA use their module global lookup APIs and wrap\nexternal device pointers with the same lifetime rule. Returned buffers\nretain the executable until the buffer is released.\n\nAll executable vtables now implement the hook explicitly. Backends\nwithout global support fail with UNIMPLEMENTED, and replay recording\nfails loudly rather than returning an unrecorded buffer object.\n\n(cherry pick from #24049 with CUDA + AMDGPU support as well as HIP)\n\nCo-authored-by: Andrew Woloszyn \u003candrew.woloszyn@gmail.com\u003e"
    },
    {
      "commit": "b63db901bf90751c9b1f5f8b812f7ce6e6af9d83",
      "tree": "fe9773d90db77bcc80b0b4674429ed86fc77ad84",
      "parents": [
        "b0149471b41ac365001d36ef0c9a248b2ecd5311"
      ],
      "author": {
        "name": "Bangtian Liu",
        "email": "liubangtian@gmail.com",
        "time": "Tue May 05 18:00:40 2026 -0400"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Tue May 05 18:00:40 2026 -0400"
      },
      "message": "[LLVMGPU] Add TileAndFuse fallback for iree_linalg_ext.arg_compare (#24347)\n\nThis PR is a short-term fix: send arg_compare ops unsupported by the\n`VectorDistribute `pipeline to `TileAndFuse`\n- adds a `setArgCompareConfig` lowering-config function that routes\n`iree_linalg_ext.arg_compare` ops to the `LLVMGPUTileAndFuse` pipeline\nwhen `setReductionConfig` (the VectorDistribute path) rejects them.\n- gates `iree_linalg_ext.arg_compare` vectorization in the `TileAndFuse`\npipeline only, while the resulting `iree_vector_ext.arg_compare` only\nhas a lowering through the nested-layout distribution patterns owned by\nthe VectorDistribute pipeline.\n\nIssue: #24309 \nAssisted-by: [Claude Code](https://claude.ai/code)\n\n---------\n\nSigned-off-by: Bangtian Liu \u003cliubangtian@gmail.com\u003e"
    },
    {
      "commit": "b0149471b41ac365001d36ef0c9a248b2ecd5311",
      "tree": "d762c8d0dcb609eff7bffa11062c5d53210f8835",
      "parents": [
        "020f6bed41819712f47bf63702fa3472d464807e"
      ],
      "author": {
        "name": "Alan Li",
        "email": "me@alanli.org",
        "time": "Tue May 05 13:18:29 2026 -0400"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Tue May 05 13:18:29 2026 -0400"
      },
      "message": "[INTEGRATION] bump llvm @ 8be29edc2 (#24363)"
    },
    {
      "commit": "020f6bed41819712f47bf63702fa3472d464807e",
      "tree": "61d767884e756e39d0cbf28cd71e3ec1e259bae4",
      "parents": [
        "4f6cd98737594897c7e9141452294229eb38a77f"
      ],
      "author": {
        "name": "Benoit Jacob",
        "email": "jacob.benoit.1@gmail.com",
        "time": "Tue May 05 09:59:06 2026 -0400"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Tue May 05 09:59:06 2026 -0400"
      },
      "message": "[Codegen][CPU] Lower data-tiled inner_tiled in VirtualVectorLoweringPass. (#24358)\n\n`iree_codegen.inner_tiled` is a high-level vector op (a multi-dim\ncontraction whose shape is described by its kind), and lowering it to\nper-intrinsic ops is \"high-level vector -\u003e lower-level dialect\", which\nis exactly what `LLVMCPUVirtualVectorLoweringPass` already does for\n`vector.contract`, `vector.multi_reduction`, `vector.transfer_*`, and\nthe IREE VectorExt ops. So land the CPU `inner_tiled` lowering in\nthat pass rather than in a separate pass.\n\nThree pattern populates from `IREE::Codegen` (introduced in PR\nhttps://github.com/iree-org/iree/pull/24351 and #24357) join the\ntarget-independent\nblock of `LLVMCPUVirtualVectorLoweringPass`:\n\n  - `IREE::Codegen::populateUnrollInnerTiledPatterns` unrolls any\n    non-unit iter dim (typically a residual reduction tile) to unit.\n  - `IREE::Codegen::populateDropInnerTiledUnitDimsPatterns` drops the\n    (now-unit) iter domain so the lowered op has empty iter bounds,\n    which is the precondition of...\n  - `IREE::Codegen::populateLowerInnerTiledPatterns` replaces the\n    iter-free vector-semantics `inner_tiled` with the per-intrinsic\n    ops emitted by its kind\u0027s `buildUnderlyingOperations` — for\n    `IREE::CPU::DataTiledMMAAttr` those are `llvm.call_intrinsic` ops.\n\nThese patterns only match `iree_codegen.inner_tiled`, so they are\northogonal to the existing patterns in this pass; the\n`DropVectorUnitDimsPass` that runs immediately before this one only\ntouches `vector.*` ops and similarly does not interact.\n\nAfter the greedy pattern application, run\n`IREE::Util::eliminateHoistableConversions` to cancel the inverse-pair\n`util.hoistable_conversion` ops the inner_tiled lowering wraps around\nthe per-intrinsic ACC distribute/reassemble. GPU\u0027s `LowerIREEGPUOps`\npass calls this for the same reason; the introducing pattern set is\nthe right place to clean up after itself.\n\n`LLVMCPUVirtualVectorLoweringPass` runs post-bufferization. That is\nfine for `iree_codegen.inner_tiled`: by then the op is already in\nvector semantics (lifted by `GenericVectorizationPass` via the\n`InnerTiledOpVectorizationModel` `VectorizableOpInterface` external\nmodel), and bufferization does not touch vectors.\n\nUpdates a stale forward reference in `KernelDispatch.cpp` left over\nfrom https://github.com/iree-org/iree/pull/24328 — the comment described\nthe unit-iter-dim drop and per-\nintrinsic lower patterns as firing inside a pass named\n`LLVMCPULowerInnerTiledPass`, which never landed; they now fire\ninside `LLVMCPUVirtualVectorLoweringPass`.\n\nProgress towards https://github.com/iree-org/iree/issues/24323\n\n---------\n\nSigned-off-by: Benoit Jacob \u003cjacob.benoit.1@gmail.com\u003e"
    },
    {
      "commit": "4f6cd98737594897c7e9141452294229eb38a77f",
      "tree": "a1524aadb656a56e65f30a9b92859d3b06503e4d",
      "parents": [
        "d7eed39e91daf3f5701fed39205fd28b42a8afb5"
      ],
      "author": {
        "name": "Keshav Vinayak Jha",
        "email": "31160700+keshavvinayak01@users.noreply.github.com",
        "time": "Tue May 05 16:18:12 2026 +0530"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Tue May 05 16:18:12 2026 +0530"
      },
      "message": "[DispatchCreation][LinalgExt] Add OnlineAttentionOp support in dispatch formation and reshape fusion (#24068)\n\nThis PR teaches dispatch creation and the LinalgExt reshape-fusion\ninfrastructure about OnlineAttentionOp, which is missing from the paths\ncurrently specialized only for `AttentionOp`. Torch input conversion was\nsplit out to #24177, and rewrite porting was split to #24123. Codegen\nsupport landed in #24110.\n\n---------\n\nSigned-off-by: Keshav Vinayak Jha \u003ckeshavvinayakjha@gmail.com\u003e"
    },
    {
      "commit": "d7eed39e91daf3f5701fed39205fd28b42a8afb5",
      "tree": "98a61273f1a3268be279637c6e947753bd0939e0",
      "parents": [
        "76a8215fe42ad71f690ae0d97094cff453286dec"
      ],
      "author": {
        "name": "Keshav Vinayak Jha",
        "email": "31160700+keshavvinayak01@users.noreply.github.com",
        "time": "Tue May 05 13:34:47 2026 +0530"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Tue May 05 13:34:47 2026 +0530"
      },
      "message": "[LinalgExt] Fix attention index remap after unit-dim folding (#24349)\n\nRemap `iree_linalg_ext.index` ops in attention regions when unit loop\ndims are dropped so later decomposition cannot create out-of-range\n`linalg.index` ops.\n\nAlso updated stale doc of `iree_linalg_ext.index` op, since\n`OnlineAttention` support was already added.\n\nSigned-off-by: Keshav Vinayak Jha \u003ckeshavvinayakjha@gmail.com\u003e\nCo-authored-by: GPT-5 Codex \u003cnoreply@openai.com\u003e"
    },
    {
      "commit": "76a8215fe42ad71f690ae0d97094cff453286dec",
      "tree": "8f06e16a0c806a32049be96a5c1a98b7c8adf7a3",
      "parents": [
        "18a49cbac4d4673af41fec597e5056b41a191ecf"
      ],
      "author": {
        "name": "dependabot[bot]",
        "email": "49699333+dependabot[bot]@users.noreply.github.com",
        "time": "Mon May 04 22:16:55 2026 -0400"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Mon May 04 22:16:55 2026 -0400"
      },
      "message": "Bump dawidd6/action-download-artifact from 20 to 21 in the github-actions group (#24360)"
    },
    {
      "commit": "18a49cbac4d4673af41fec597e5056b41a191ecf",
      "tree": "188aada81c3bd0bacd4de48feb7e2ddade411421",
      "parents": [
        "2ba8b6f581c0f5eb9eef0f34c0aa1fc96794ae5a"
      ],
      "author": {
        "name": "Benoit Jacob",
        "email": "jacob.benoit.1@gmail.com",
        "time": "Mon May 04 16:58:39 2026 -0400"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Mon May 04 20:58:39 2026 +0000"
      },
      "message": "[Codegen] NFC: Lift InnerTiledOp unroll pattern to Codegen. (#24357)\n\nFollow-up to the previous NFC, moving the third (and last)\n`iree_codegen.inner_tiled` rewrite pattern out of\n`Codegen/Dialect/GPU/Transforms/`:\n\n- `UnrollInnerTiledPattern` (unrolls along iter dimensions until each\nintrinsic invocation has a unit iter shape)\n\nThe pattern itself is dialect-neutral; only the convenience no-args\npopulate function applied a default config (matmul-like LHS-reuse\ntraversal order, unit native shape) that happens to be a good fit for\nany architecture, including CPU.\n\nRenames the populate functions:\n\n- `IREE::GPU::populateIREEGPUVectorUnrollPatterns` (with-options form)\n-\u003e `IREE::Codegen::populateUnrollInnerTiledPatterns`\n- `IREE::GPU::populateIREEGPUVectorUnrollPatterns` (no-args form) -\u003e\n`IREE::Codegen::populateUnrollInnerTiledPatterns`\n\nRenames the GPU-specific helper `gpuMatmulLikeUnrollOrder` to\n`matmulLikeUnrollOrder` (its content is dialect-neutral). The\n`kUnrollAccDistribute` / `kUnrollAccReassemble` HoistableConversionOp\ntag constants move with the pattern.\n\nUpdates the in-tree callers in `IREEGPUExtensions.cpp` and\n`UnrollToIntrinsics.cpp`, and adjusts BUILD.bazel/CMakeLists.txt deps.\n\nStrict NFC; no functional or behavioral change.\n\nProgress towards #24323"
    },
    {
      "commit": "2ba8b6f581c0f5eb9eef0f34c0aa1fc96794ae5a",
      "tree": "3279503eda64def822ba673cfc7e101568077de6",
      "parents": [
        "f2b08972582ac376f57f9c9b65068363d5134123"
      ],
      "author": {
        "name": "Nirvedh Meshram",
        "email": "96096277+nirvedhmeshram@users.noreply.github.com",
        "time": "Mon May 04 14:59:09 2026 -0500"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Mon May 04 19:59:09 2026 +0000"
      },
      "message": "Revert \"[LLVMGPU] Fall back to scalar lowering for tiny attention shapes (#24239)\" (#24356)\n\nThis breaks a test in MI355 CI see example\nhttps://github.com/iree-org/iree/actions/runs/25333663639/job/74274843822#step:12:661,\nreverting while we make a fix.\n\nThis reverts commit 81f4decfba8e2b8d43e9f55084802638ef7e55bb."
    },
    {
      "commit": "f2b08972582ac376f57f9c9b65068363d5134123",
      "tree": "bd500158addde88082c67a264433eaf0eab86944",
      "parents": [
        "e623d00597ae4be465f404a34e971e2ebd8e3475"
      ],
      "author": {
        "name": "Han-Chung Wang",
        "email": "hanhan0912@gmail.com",
        "time": "Mon May 04 12:48:29 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Mon May 04 12:48:29 2026 -0700"
      },
      "message": "[Codegen] Fix iree-compile --debug crash on CPU/GPU codegen pass options (#24190)\n\nUsing CPUCodegenOptions / GPUCodegenOptions as MLIR pass options crashes\nwith `--debug` (or any pass-pipeline print path): the default\n`llvm::cl::parser\u003cT\u003e` inherits `generic_parser_base`, so MLIR\u0027s\nOptionParser alias resolves to `GenericOptionParser\u003cT\u003e`, whose\nfindArgStrForValue hits `llvm_unreachable(\"unknown data value for\noption\")` on struct-typed values\n([mlir/include/mlir/Pass/PassOptions.h](https://github.com/llvm/llvm-project/blob/99457c368586b1debf49f55b3a0684317f5f298d/mlir/include/mlir/Pass/PassOptions.h#L158-L170)).\n\nFix by specializing `llvm::cl::parser\u003cT\u003e` to inherit `basic_parser\u003cT\u003e`\ninstead, and providing `operator\u003c\u003c` in the type\u0027s associated namespace\n(`mlir::iree_compiler`) so ADL can find it. This routes printing through\n`printOptionValue / has_stream_operator`, which accepts our streaming\nthat prints `opaque`. The parse path is also a no-op — these options\nnever flow in from pass pipeline strings.\n\nFixes https://github.com/iree-org/iree/issues/24074\n\nAssisted-by: Claude Code\n\n---------\n\nSigned-off-by: hanhanW \u003chanhan0912@gmail.com\u003e"
    },
    {
      "commit": "e623d00597ae4be465f404a34e971e2ebd8e3475",
      "tree": "041812f5f3f853398f6ae1e8b19fa1242da3a1e9",
      "parents": [
        "d097bad562a7e941d5522c58c82f4e0a26907378"
      ],
      "author": {
        "name": "Alan Li",
        "email": "me@alanli.org",
        "time": "Mon May 04 14:59:31 2026 -0400"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Mon May 04 14:59:31 2026 -0400"
      },
      "message": "[INTEGRATION] Bump llvm to f306525759 (#24354)"
    },
    {
      "commit": "d097bad562a7e941d5522c58c82f4e0a26907378",
      "tree": "57f91009f4c64cb4ff23e5acc9903bcd574ef8c8",
      "parents": [
        "11926304c49d708d6a953568aa2444cff63b2d0d"
      ],
      "author": {
        "name": "Ian Wood",
        "email": "ianwood2024@u.northwestern.edu",
        "time": "Mon May 04 10:37:33 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Mon May 04 10:37:33 2026 -0700"
      },
      "message": "[DispatchCreation] Prevent fill-\u003escatter cloning (#24214)\n\nOps like `iree_linalg_ext.scatter` and `tensor.insert_slice` only write\na subset of their `outs`/`dest`; their iteration space does not cover\nthe init (the indexing map is null, or the operand has no mapping into\nthe affine iteration space). A producer that materializes a full-tensor\nwrite has no shared iteration space with such a consumer and therefore\ncannot be tile-fused with it.\n\nBlock these clones in `getCloneableOps`:\n- `isUnfusableInit` identifies init operands that producers cannot be\nfused through: `tensor.insert_slice.dest` and any\n`LinalgFusionOpInterface` DPS init whose indexing map is null (e.g.\nscatter\u0027s `original`).\n- `collectUnfusableInitSources` uses `getBackwardSlice` to walk from\neach such init through cloneable chains, recording only producers that\nneed to be materialized. Views/reshapes (`tensor.extract_slice`,\n`tensor.empty`, `tensor.expand_shape`, `tensor.collapse_shape`) are\ntraversed but not blocked, so the llama scatter-on-extracted-slice\npattern still fuses.\n\nFixes iree-org/iree#24071.\n\n---------\n\nSigned-off-by: Ian Wood \u003cianwood@u.northwestern.edu\u003e"
    },
    {
      "commit": "11926304c49d708d6a953568aa2444cff63b2d0d",
      "tree": "bd7146f857b72c80e82cb9ef758c14c4c6c6598e",
      "parents": [
        "d358e811732bbdbb19d795e2ca1e380897ac332b"
      ],
      "author": {
        "name": "Benoit Jacob",
        "email": "jacob.benoit.1@gmail.com",
        "time": "Mon May 04 12:44:02 2026 -0400"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Mon May 04 16:44:02 2026 +0000"
      },
      "message": "[Codegen] NFC: Lift InnerTiledOp lower \u0026 drop-unit-dim patterns to Codegen. (#24351)\n\nMoves the two `iree_codegen.inner_tiled` rewrite patterns and their\npopulate functions out of `Codegen/Dialect/GPU/Transforms/` into\n`Codegen/Dialect/Codegen/Transforms/LowerInnerTiled.cpp`:\n\n- `LowerInnerTiledPattern` (replaces an iter-bound-empty `inner_tiled`\nwith the per-intrinsic ops emitted by its kind\u0027s\n`buildUnderlyingOperations`)\n- `DropInnerTiledUnitDimsPattern` (folds unit iteration bounds and pulls\nthe corresponding extract/broadcast pairs out of the loop)\n\nBoth patterns operate on `IREE::Codegen::InnerTiledOp` and dispatch\nthrough the kind\u0027s interface, so there is nothing GPU-specific about\nthem. Lifting them lets CPU pass authors share the same lowering logic\nwithout a `Codegen/Dialect/GPU/...` include.\n\nRenames the populate functions accordingly:\n\n- `IREE::GPU::populateIREEGPULowerInnerTiledPatterns` -\u003e\n`IREE::Codegen::populateLowerInnerTiledPatterns`\n- `IREE::GPU::populateIREEGPUDropUnitDimsPatterns` -\u003e\n`IREE::Codegen::populateDropInnerTiledUnitDimsPatterns`\n\nThe `kShapeCastToIntrinsic` / `kShapeCastFromIntrinsic` and\n`kDropUnitDims` / `kAddUnitDims` HoistableConversionOp tag constants\nmove with the patterns. The remaining tag pair on the GPU side\n(`kUnrollAccDistribute` / `kUnrollAccReassemble`) stays in\n`Codegen/Dialect/GPU/Transforms/Transforms.cpp` since it\u0027s still only\nused by the GPU unroll path.\n\nUpdates the existing in-tree callers in `LowerIREEGPUOps.cpp`,\n`IREEGPUExtensions.cpp`, and `UnrollToIntrinsics.cpp`, and adjusts\nBUILD.bazel/CMakeLists.txt deps so GPU\u0027s transform libraries depend on\n`IREECodegenTransforms`.\n\nStrict NFC; no functional or behavioral change.\n\nProgress towards #24323"
    },
    {
      "commit": "d358e811732bbdbb19d795e2ca1e380897ac332b",
      "tree": "22e28b469bcde8f593bdc599a8981c5220d0178d",
      "parents": [
        "d4e04f7e8f90b95e4d70e76cc42732aa43f06bf5"
      ],
      "author": {
        "name": "Benoit Jacob",
        "email": "jacob.benoit.1@gmail.com",
        "time": "Mon May 04 10:35:10 2026 -0400"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Mon May 04 14:35:10 2026 +0000"
      },
      "message": "[Codegen][CPU] Lower inner_tiled to llvm.call_intrinsic. (#24345)\n\nImplements CPU\u0027s `DataTiledMMAAttr::buildUnderlyingOperations` by\nreusing the shared `Codegen::buildDataTiledMMAUnderlyingOperations` from\n#24326. CPU passes a callback, `createCpuMmaIntrinsicCall`, that emits\nan `llvm.call_intrinsic` for each per-intrinsic invocation. The function\ndispatches on the `MMAIntrinsic` enum value through a switch.\n\nProgress towards #24323\n\n---------\n\nCo-authored-by: Han-Chung Wang \u003chanhan0912@gmail.com\u003e"
    },
    {
      "commit": "d4e04f7e8f90b95e4d70e76cc42732aa43f06bf5",
      "tree": "a4448f206d87f65fffa4fdb004f8d1f85032f150",
      "parents": [
        "81f4decfba8e2b8d43e9f55084802638ef7e55bb"
      ],
      "author": {
        "name": "Florian Walbroel",
        "email": "walbroel@roofline.ai",
        "time": "Mon May 04 08:51:12 2026 +0200"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Mon May 04 08:51:12 2026 +0200"
      },
      "message": "compiler/plugins/input/TOSA: fix: TOSA arith lowering must handle apply scale introduced by linalg lowering (#24121)\n\nTOSA to linalg lowering may (re-)introduce some additional TOSA\noperations as not all TOSA operations can be lowering to linalg. These\nneed a following run of TOSA to arith lowering for element wise\noperations. The current pass pipeline is missing the option to enable\nlowering of tosa.apply_scale, which may be introduced during the\nlowering to linalg. This causes errors in the later stage of the\ncompilation flow such as vectorization.\n\nSigned-off-by: Florian Walbroel \u003cwalbroel@roofline.ai\u003e"
    },
    {
      "commit": "81f4decfba8e2b8d43e9f55084802638ef7e55bb",
      "tree": "d43cf4c07b105092e314d44af7f0ed2cf99bdac7",
      "parents": [
        "fdf1392b84f8f89fb87cd62d839b773469b8e685"
      ],
      "author": {
        "name": "Keshav Vinayak Jha",
        "email": "31160700+keshavvinayak01@users.noreply.github.com",
        "time": "Mon May 04 12:10:43 2026 +0530"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Mon May 04 06:40:43 2026 +0000"
      },
      "message": "[LLVMGPU] Fall back to scalar lowering for tiny attention shapes (#24239)\n\nThe attention `VectorDistribute` configs (both the MMA-intrinsic path\nand the memory-bound reduction path) assume head dimensions and K2 reach\na certain size. For shapes below that threshold (e.g. Q\u003dK\u003dV\u003d[2,2,2,2]\nf16), the reduction path still succeeds at emitting a `VectorDistribute`\nconfig, but the tile sizes it picks produce vector ops whose shapes the\nlayout engine cannot support, causing the failure in\nhttps://github.com/iree-org/iree/issues/24221\n\nAdd early bailouts for the shapes that cannot be tiled cleanly.\n\n---------\n\nSigned-off-by: Keshav Vinayak Jha \u003ckeshavvinayakjha@gmail.com\u003e\nCo-authored-by: Lukas Sommer \u003clsommer@amd.com\u003e\nCo-authored-by: GPT-5 \u003cnoreply@openai.com\u003e"
    },
    {
      "commit": "fdf1392b84f8f89fb87cd62d839b773469b8e685",
      "tree": "47c21ad71fc8a9b89703e888a93299d8442b9942",
      "parents": [
        "09350084469f7c49e09d3f4348bdd706627eacd4"
      ],
      "author": {
        "name": "Keshav Vinayak Jha",
        "email": "31160700+keshavvinayak01@users.noreply.github.com",
        "time": "Mon May 04 11:18:37 2026 +0530"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Mon May 04 11:18:37 2026 +0530"
      },
      "message": "[DispatchCreation] Allow fusion of multi-result producers (#24169)\n\nEnables consumer fusion for multi-result producers like\n`iree_linalg_ext.online_attention` whose results flow into a single\nconsumer via operands with different ranks (e.g. acc and sum of the\nnormalization `linalg.generic`).\n\nSupports https://github.com/iree-org/iree/pull/24068\n\n---------\n\nSigned-off-by: Keshav Vinayak Jha \u003ckeshavvinayakjha@gmail.com\u003e"
    },
    {
      "commit": "09350084469f7c49e09d3f4348bdd706627eacd4",
      "tree": "9c3c8c10a4376c37b220a650ea7e75bf8330af9a",
      "parents": [
        "f4d1908e89fece7e0bf5cebd606b69a9483817f3"
      ],
      "author": {
        "name": "Vivian Zhang",
        "email": "zhyuhang88@gmail.com",
        "time": "Fri May 01 21:37:30 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Fri May 01 21:37:30 2026 -0700"
      },
      "message": "[DispatchCreation] Tighten scatter-skip predicate in CollapseDimensions (#24334)\n\nPR #24034 skipped collapse for any `linalg.generic` with a\n`tensor.extract` and a `linalg.index` in its body, in order to avoid an\nexpensive delinearization on strided-scatter generics produced by\n`ConvertStridedInsertSliceToGeneric`. That predicate was too broad and\ncaught the RoPE + FP8 dispatch in Llama-8B, whose fused body also has\n`tensor.extract` + `linalg.index` (rotate_half lookup) but is not a\nstrided scatter; the collateral block prevents the `4×batch×seq` outer\ncollapse and produces a 5-D dispatch (`4x2048x8x2x64`) that tiles ~3.5x\nslower on gfx942 than the pre-#24034 `8192x8x128` shape.\n\nTighten the predicate to require all three signals of the\nstrided-scatter shape: empty `ins` operands (source captured via\n`tensor.extract`), at least one `linalg.index`, and a `tensor.extract`\nwhose result is consumed by an `arith.select` (the bounds check).\n\nFixes: https://github.com/iree-org/iree/issues/24322\n\n---------\n\nSigned-off-by: yzhang93 \u003czhyuhang88@gmail.com\u003e\nCo-authored-by: Claude \u003cnoreply@anthropic.com\u003e"
    },
    {
      "commit": "f4d1908e89fece7e0bf5cebd606b69a9483817f3",
      "tree": "872e5fcb34d591273dcf19ccd3d6e2b24723b42e",
      "parents": [
        "7098bdfec0fa62ce250e9a99b0977dbb86b4fcf0"
      ],
      "author": {
        "name": "Zhewen Yu",
        "email": "zhewenyu@amd.com",
        "time": "Fri May 01 22:48:38 2026 +0100"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Fri May 01 22:48:38 2026 +0100"
      },
      "message": "[Codegen][DMA] Fix unaligned swizzle offset computation in gather-to-lds lowering (#24241)\n\nThe inverse XOR swizzle applied to DMA source offsets was incorrect in\ntwo cases:\n\n1. **Subgroup base offset**: When a subgroup\u0027s transfer size is not a\nmultiple of the swizzle period, different subgroups sharing the same\nlocal offsets but occupying different rows would get identical swizzled\naddresses.\n\nFix: incorporate the subgroup\u0027s base offset within the full allocation\nbefore swizzling.\n\n2. **Access-width alignment**: When `elementsPerLane \u003c accessWidth`, the\ninteger division inside `swizzleOffset` truncates offsets that differ\nonly within an access-width group to the same value.\n\nFix: strip the sub-accesswidth remainder before swizzling and restore it\nafter. This fix is applied directly in `swizzleOffset` for both XOR and\nrotate_rows swizzles. While rotate_rows isn\u0027t currently used with DMA,\nthe access-width alignment issue affects both swizzle types.\n\nBoth issues caused numerical mismatches for BF16 batch matmuls using DMA\nwith XOR swizzle enabled.\n\nAssisted-by: Cursor (Claude)\n\n---------\n\nSigned-off-by: Yu-Zhewen \u003czhewenyu@amd.com\u003e"
    },
    {
      "commit": "7098bdfec0fa62ce250e9a99b0977dbb86b4fcf0",
      "tree": "acdc8c56d1be8c5f11545f2523a2572df787250f",
      "parents": [
        "316c1c12f43db0cc7a6b0e5979a14f8329dda756"
      ],
      "author": {
        "name": "Han-Chung Wang",
        "email": "hanhan0912@gmail.com",
        "time": "Fri May 01 11:52:04 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Fri May 01 18:52:04 2026 +0000"
      },
      "message": "[LLVMGPU][nfc] Modernize the rest of LLVMGPU pipeline tests. (#24341)\n\nIt moves the tests to use module-scope, which gets rid of unnecessary\nhal ops.\n\nSigned-off-by: hanhanW \u003chanhan0912@gmail.com\u003e"
    },
    {
      "commit": "316c1c12f43db0cc7a6b0e5979a14f8329dda756",
      "tree": "d449429db4ba98b178fa3f8b9fe19120452a3554",
      "parents": [
        "8fc32e0e1e968d4ccf038667a759d3ad54962d05"
      ],
      "author": {
        "name": "Han-Chung Wang",
        "email": "hanhan0912@gmail.com",
        "time": "Fri May 01 11:32:33 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Fri May 01 11:32:33 2026 -0700"
      },
      "message": "[LLVMGPU][nfc] Modernize vector distribution pipeline tests. (#24340)\n\nhttps://github.com/iree-org/iree/commit/9ed4a4ec7e676c7fb89e3e493af6856f52cab12c\naccidentally reverts the previous changes that bring all the pipeline\ntests to module-scope. The revision reworks on it.\n\nAll the checks are the same, and some dispatch_config op checks are\nadded (for workgroup size, etc.). The additional cleanup changes are:\n- Remove stale comment for `hal.executable.target`. No idea why it was\nchecked in.\n- Drop multiple blank lines.\n\n---------\n\nSigned-off-by: hanhanW \u003chanhan0912@gmail.com\u003e"
    },
    {
      "commit": "8fc32e0e1e968d4ccf038667a759d3ad54962d05",
      "tree": "4dece432bb0702b5a2604e36e3dce102b674cc7f",
      "parents": [
        "3be9dc6479c55e5838cf9ee404592d890893c005"
      ],
      "author": {
        "name": "Vivian Zhang",
        "email": "zhyuhang88@gmail.com",
        "time": "Fri May 01 10:32:21 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Fri May 01 10:32:21 2026 -0700"
      },
      "message": "[DispatchCreation] Refactor and add low-parallelism split reduction parameter set (#24293)\n\nIntroduce a `low-parallelism` flag that selects a split reduction\nparameter set tuned for targets with limited concurrent workgroup\nparallelism (e.g. RDNA-class GPUs), distinct from the existing\nhigh-parallelism default (CDNA-class). The matmul and convolution\nlimit-parallel-loops helpers are restructured into a single top-down\noutput size ladder with reduction size sub-rules in each band.\n\nRepresentative benchmark results:\n\nRDNA4 (RX 9070 XT) conv:\n| Speedup | Baseline (us) | This PR (us) | Saved (us) | Shape |\n|--------:|--------------:|-------------:|-----------:|-------|\n| 2.21x | 6234.3 | 2825.4 | 3408.9 | -n 32 -c 256 -H 25 -W 25 -k 2376 -y\n3 -x 3 -F 4 |\n| 1.54x | 6537.7 | 4234.0 | 2303.7 | -n 32 -c 256 -H 25 -W 25 -k 2376 -y\n3 -x 3 -F 2 |\n| 1.24x | 10359.2 | 8330.5 | 2028.6 | -n 12 -c 224 -H 470 -W 725 -k 224\n-u 2 -v 2 -g 4 -F 4 |\n| 1.62x | 5210.9 | 3220.0 | 1990.9 | -n 5 -c 224 -H 470 -W 725 -k 224 -u\n2 -v 2 -g 4 -F 4 |\n| 1.43x | 3660.4 | 2568.0 | 1092.4 | -n 4 -c 224 -H 470 -W 725 -k 224 -u\n2 -v 2 -g 4 -F 4 |\n\nRDNA4 1x1 conv:\n| Speedup | Baseline (us) | This PR (us) | Saved (us) | Shape |\n|--------:|--------------:|-------------:|-----------:|-------|\n| 4.12x | 45.2 | 11.0 | 34.2 | -n 16 -c 64 -H 24 -W 16 -k 192 -F 4 |\n| 2.27x | 61.3 | 27.0 | 34.3 | -n 16 -c 96 -H 48 -W 32 -k 96 -F 4 |\n| 2.30x | 30.9 | 13.4 | 17.5 | -n 16 -c 48 -H 24 -W 16 -k 192 -F 4 |\n\nCDNA4 (MI355) conv:\n| Speedup | Baseline (us) | This PR (us) | Saved (us) | Shape |\n|--------:|--------------:|-------------:|-----------:|-------|\n| 2.60x | 48.7 | 18.7 | 30.0 | -n 16 -c 96 -H 48 -W 32 -k 96 -y 3 -x 1\n-F 4 |\n| 1.81x | 4198.3 | 2319.7 | 1878.6 | -n 12 -c 224 -H 470 -W 725 -k 224\n-u 2 -v 2 -g 4 -F 4 |\n| 1.71x | 3287.0 | 1924.0 | 1363.0 | -n 10 -c 224 -H 470 -W 725 -k 224\n-u 2 -v 2 -g 4 -F 4 |\n| 1.49x | 3218.2 | 2160.6 | 1057.6 | -n 12 -c 224 -H 235 -W 363 -k 224\n-g 4 -F 4 |\n\nCDNA4 1x1 conv:\n| Speedup | Baseline (us) | This PR (us) | Saved (us) | Shape |\n|--------:|--------------:|-------------:|-----------:|-------|\n| 2.54x | 1499.0 | 589.5 | 909.5 | -n 10 -c 448 -H 118 -W 182 -k 896 -F\n4 |\n| 2.06x | 1703.2 | 826.4 | 876.8 | -n 12 -c 448 -H 118 -W 182 -k 896 -F\n4 |\n\n---------\n\nSigned-off-by: yzhang93 \u003czhyuhang88@gmail.com\u003e\nCo-authored-by: Claude \u003cnoreply@anthropic.com\u003e"
    },
    {
      "commit": "3be9dc6479c55e5838cf9ee404592d890893c005",
      "tree": "6afb7563f5cdb29bbdb8b9be64dc20b926d131d9",
      "parents": [
        "d13374fbbe3687331139f498640ff5932de1e212"
      ],
      "author": {
        "name": "Erick Ochoa Lopez",
        "email": "erick.ochoalopez@amd.com",
        "time": "Fri May 01 12:58:38 2026 -0400"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Fri May 01 12:58:38 2026 -0400"
      },
      "message": "Refactor vector.multi_reduction into flattening, unrolling, and lowering passes. (#24183)\n\n* Adds an initial vector flattening pass.\n* At the moment, the only operation that is flattened is\nvector.multi_reduction, but others will be added later.\n* Adds vector unrolling for vector.multi_reduction.\n* Adds a vector.multi_reduction lowering pass\n\n---------\n\nCo-authored-by: Eric \u003c55723758+efric@users.noreply.github.com\u003e"
    },
    {
      "commit": "d13374fbbe3687331139f498640ff5932de1e212",
      "tree": "b7273bdb5a7a7d18c1339bfaf1c209fe2b72b6f0",
      "parents": [
        "dd5a6e3582b57d454c7d19c494ec2e083148958a"
      ],
      "author": {
        "name": "Nirvedh Meshram",
        "email": "96096277+nirvedhmeshram@users.noreply.github.com",
        "time": "Fri May 01 11:24:52 2026 -0500"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Fri May 01 11:24:52 2026 -0500"
      },
      "message": "Bump llvm to llvm-project@88e5eeb292f (#24339)"
    },
    {
      "commit": "dd5a6e3582b57d454c7d19c494ec2e083148958a",
      "tree": "7bba5d2594320e7a3e520571ce55294b8bba1811",
      "parents": [
        "c40c7a31ac4177da14f2e3ba77f1d9efabd11946"
      ],
      "author": {
        "name": "Muzammiluddin Syed",
        "email": "muzasyed@amd.com",
        "time": "Fri May 01 12:15:36 2026 -0400"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Fri May 01 12:15:36 2026 -0400"
      },
      "message": "[Codegen] Support pack/unpack/linalg generic transpose in CombineLayoutTransformation (#24273)\n\nWhen performing writes to LDS for unaligned gemm operands, we lower the\npreceding `tensor.extract_slice` -\u003e `tensor.pad` -\u003e `linalg.copy` to a\nmasked `vector.transfer_read` in a `scf.for` loop\n\nHowever, when this chain of ops inside the lds promotion `scf.for` loops\nincludes a `linalg.generic (transpose)` we produce scalar writes with\n`memref.map_load`.\n\nThis happens because the raising of the linalg.generic to a\nlinalg.transpose is erroneously not firing prior to vectorization in\n`CombineSourceLayoutTransformationPass` due to\n`simplifyComplexRelayoutOps` not recognizing linalg.generic (transpose)\nwhen inside a for loop.\n\nThis change removes `simplifyComplexRelayoutOps` entirely and performs\nthe folding of unpack/pack ops and linalg.generic (transpose) ops\ndirectly without relying on a preliminary simplification of these ops\nprior to folding.\n\n---------\n\nSigned-off-by: Muzammiluddin Syed \u003cmuzasyed@amd.com\u003e"
    },
    {
      "commit": "c40c7a31ac4177da14f2e3ba77f1d9efabd11946",
      "tree": "910c49e053a4a08ad617ef8c2d3e6eefabd6d15c",
      "parents": [
        "174808a20722d663412ff69d7cd8ad4805282f35"
      ],
      "author": {
        "name": "Benoit Jacob",
        "email": "jacob.benoit.1@gmail.com",
        "time": "Fri May 01 12:02:44 2026 -0400"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Fri May 01 16:02:44 2026 +0000"
      },
      "message": "[Codegen][CPU] Teach the lowering strategy about inner_tiled. (#24328)\n\nAdds a `setRootConfig(InnerTiledOp)` overload in `KernelDispatch.cpp`\nthat:\n\n  * routes `iree_codegen.inner_tiled` roots through the\n    `Mmt4dTilingExpert` pipeline. The DataTiling pipeline is the\n    pack/unpack-focused pipeline and not all the modern lowerings (e.g.\n    tile-and-fuse, post-bufferize vector lowering) kick in there;\n    Mmt4dTilingExpert has them, plus its `GenericVectorizationPass` step\n    is where vector-semantics lifting of `inner_tiled` happens via the\n    in-tree `VectorizableOpInterface` external model. The previous\n    commit schedules `LLVMCPULowerInnerTiledPass` right after\n    `GenericVectorizationPass` in this pipeline.\n  * sets `distribution \u003d (1, 1, 0)` (one inner tile per workgroup in\n    the M and N parallel iter dims, K not distributed);\n  * sets `vector_common_parallel \u003d (1, 1, 0)` and\n    `vector_reduction \u003d (0, 0, 1)` so the K dim is tiled to a real\n    `scf.for` instead of being fully unrolled inside the\n    `LLVMCPULowerInnerTiledPass` greedy fixed point.\n\nWithout this commit, a `linalg.matmul` with\n`--iree-opt-data-tiling --iree-llvmcpu-enable-inner-tiled` ended up on\nthe `Default` pipeline (which has no inner_tiled lowering hook) and\neither failed bufferization (for non-trivial K) or compiled in time\nexponential in K (after the previous \"unroll inner_tiled\" commit\nturned the survival into a compile-time blow-up). With this commit,\na 256x256x256 f32 matmul on znver5 now compiles end-to-end in seconds:\nthe workgroup body is a tight scf.for(K_iter) wrapping a single\n`llvm.call_intrinsic \"llvm.fma.v16f32\"` per intrinsics_m unroll.\n\nDistribution tile sizing is deliberately conservative (1 inner tile\nper parallel dim) — the mmt4d cost-model-driven sizing heuristic\nshould be ported to inner_tiled as a follow-up.\n\nProgress towards #24323"
    },
    {
      "commit": "174808a20722d663412ff69d7cd8ad4805282f35",
      "tree": "6903da1a6450b1e18e0e105b7d414cce74fbf4fc",
      "parents": [
        "83a30bbaedfd1aa202f3547092b795f3c3ed2699"
      ],
      "author": {
        "name": "Bangtian Liu",
        "email": "liubangtian@gmail.com",
        "time": "Fri May 01 11:44:29 2026 -0400"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Fri May 01 11:44:29 2026 -0400"
      },
      "message": "[LinalgExt] Fix ArgCompareOp::generateResultTileValue for producer fusion (#24317)\n\nThis PR fixes `ArgCompareOp::generateResultTileValue` to correctly map\noutput-rank coordinates to input-rank coordinates by re-inserting the\nreduction dimension, enabling producer fusion with downstream consumers.\n\nAssisted-by:  [Claude Code](https://claude.ai/code)\n\nSigned-off-by: Bangtian Liu \u003cliubangtian@gmail.com\u003e"
    },
    {
      "commit": "83a30bbaedfd1aa202f3547092b795f3c3ed2699",
      "tree": "e3e4599ac13ddd10eef69f94e0abcb4a2551b47c",
      "parents": [
        "404b958eb516a3b7e90d0189336f2731c561febf"
      ],
      "author": {
        "name": "Han-Chung Wang",
        "email": "hanhan0912@gmail.com",
        "time": "Fri May 01 08:11:58 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Fri May 01 08:11:58 2026 -0700"
      },
      "message": "Update CODEOWNERS for spreading review responsibility (#24332)\n\n- Add bangtianliu to LinalgExt since he\u0027s been contributing new\nLinalgExt ops and transformations.\n- Add a new `Codegen/**/*Vector*` category and add myself and active\nowners and reviewers to code owners (based on my observation).\n\nNote that the new category is added after `Codegen/Common` so only new\nowners will show up on the new matching. From Github\u0027s doc:\n\n```\n# Order is important; the last matching pattern takes the most\n# precedence. When someone opens a pull request that only\n# modifies JS files, only @js-owner and not the global\n# owner(s) will be requested for a review.\n*.js    @js-owner #This is an inline comment.\n```\n\nIt adds the owners to all vectorization and vector transformations,\nexcept VectorExt (which is a later rule that owned by Groverkss)."
    },
    {
      "commit": "404b958eb516a3b7e90d0189336f2731c561febf",
      "tree": "169cce4fc066ae756fa87ab59fd166ab42a2470d",
      "parents": [
        "0a73681ecf10b74eda69cfc73bd32c8b2095c45c"
      ],
      "author": {
        "name": "Benoit Jacob",
        "email": "jacob.benoit.1@gmail.com",
        "time": "Fri May 01 10:37:19 2026 -0400"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Fri May 01 10:37:19 2026 -0400"
      },
      "message": "[Codegen] NFC: Lift DataTiledMMA inner_tiled lowering helpers into MMAUtils. (#24326)\n\nMoves the body of `DataTiledMMAAttr::buildUnderlyingOperations` and its\nprivate helpers (`incrementIndices`, `flattenVector`,\n`distributeMmaFragmentToIntrinsics`) out of the GPU dialect into a new\n`Codegen/Utils/MMAUtils.{h,cpp}`. The shared\n`buildDataTiledMMAUnderlyingOperations` takes a callback for the\narchitecture-specific MMA op emission, so future CPU support can reuse\neverything except the per-intrinsic op creation.\n\nAlso consolidates the HoistableConversionOp tag string literals\n(kDataTiledAcc{Distribute,Reassemble},\nkRdna3{Interleave,Deinterleave}Acc, kVDMFMA{Interleave,Deinterleave}Acc)\ninto the new header so that paired tags — which match by string — can no\nlonger drift between translation units.\n\nGPU\u0027s `DataTiledMMAAttr::buildUnderlyingOperations` is now a thin\nwrapper that supplies a callback delegating to the existing\n`createMmaOp`. GPU\u0027s `DataTiledScaledMMAAttr::buildUnderlyingOperations`\nkeeps using the helpers via `using` declarations.\n\nNo functional change.\n\nProgress towards #24323"
    },
    {
      "commit": "0a73681ecf10b74eda69cfc73bd32c8b2095c45c",
      "tree": "c02181eef2332a3c4a5790b26772d300687412e3",
      "parents": [
        "01c52ebad4c926e588695c41c39b7b0da3573ed9"
      ],
      "author": {
        "name": "Benoit Jacob",
        "email": "jacob.benoit.1@gmail.com",
        "time": "Fri May 01 10:36:49 2026 -0400"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Fri May 01 10:36:49 2026 -0400"
      },
      "message": "[Codegen][CPU] Fix RHS indexing map in materialize-encoding inner_tiled lowering. (#24325)\n\nThe CPU encoding-materialization path (`lowerContractionOpToInnerTiled`)\nemits an `iree_codegen.inner_tiled` op with the standard matmul indexing\nmaps `[(d0, d2), (d2, d1), (d0, d1)]`, but the RHS pack it lowers to\nuses `outer_dims_perm \u003d [1, 0]` so the packed RHS comes out with outer\ndims in `(N_iter, K_iter)` order — i.e. mmt4d-style, not standard-matmul\nstyle. The two interpretations of the same operand shape disagree, and\nthe inner_tiled verifier rejects the op with:\n\n    error: \u0027iree_codegen.inner_tiled\u0027 op shape does not match\n           iteration bounds\n\nSwitching the RHS map to `(d0, d1, d2) -\u003e (d1, d2)` matches what the\npack actually produces and lets the verifier project a consistent\niteration domain across all three operands. The semantics are still a\nvalid contraction (one parallel + one reduction on each input, two\nparallels on the output), just with the RHS walked in mmt4d order.\n\nWithout this fix, data-tiled `linalg.matmul` -\u003e `inner_tiled` lowering\non CPU fails the verifier on the very first dispatch and never gets near\ncodegen; with it, end-to-end compile gets past verification all the way\nto bufferization (where a separate, pre-existing pipeline gap takes\nover: there is no CPU pass yet that vectorizes/lowers `inner_tiled`\nbefore bufferize).\n\nProgress towards #24323"
    },
    {
      "commit": "01c52ebad4c926e588695c41c39b7b0da3573ed9",
      "tree": "cbefe607a6c0ee2c2afc85dab116365f1d9766a4",
      "parents": [
        "64031ddf1981a07830d2fd2d9a19149d2382d5f3"
      ],
      "author": {
        "name": "Abhishek Varma",
        "email": "abhvarma@amd.com",
        "time": "Fri May 01 10:27:21 2026 +0530"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Fri May 01 10:27:21 2026 +0530"
      },
      "message": "[DispatchCreation] Fuse scalar reductions with their parallel consumers (#24166)\n\nPatterns of the shape :-\n```\n  %s \u003d reduce(%x)\n  %res \u003d elementwise(%x, %s)\n```\nwere being emitted as two dispatches instead of one as observed\n[here](https://github.com/iree-org/iree/issues/24148).\n\nThe fusion check in `getRootParallelLoopToOpMap` rejected the consumer\nbecause the composed root-to-consumer map had all-zero results (e.g. ()\n-\u003e (0, 0) for a consumer broadcasting the reduction\u0027s scalar).\nThat check is meant to catch consumers that don\u0027t actually depend on any\nof the root\u0027s parallel loops.\n\nBut when the root is a full reduction to a scalar, it has no parallel\nloops to begin with - so the map comes out all zeros for any consumer,\nand the check ends up rejecting fusion.\n\nThis PR narrows the rejection to fire only when the root has at least\none parallel dim, which lets the above pattern fuse into a single\ndispatch.\n\nFixes: https://github.com/iree-org/iree/issues/24148 (for `Qwen MoE`)\n\nSigned-off-by: Abhishek Varma \u003cabhvarma@amd.com\u003e"
    },
    {
      "commit": "64031ddf1981a07830d2fd2d9a19149d2382d5f3",
      "tree": "249300dfc9662d8b5fc2f5d41dd5d86985a66aa9",
      "parents": [
        "e6139f68ae0edd77d65777f9ceca5b02939c8ec9"
      ],
      "author": {
        "name": "Han-Chung Wang",
        "email": "hanhan0912@gmail.com",
        "time": "Thu Apr 30 17:20:15 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Thu Apr 30 17:20:15 2026 -0700"
      },
      "message": "Reapply \"[Codegen] Use local binders for optimization flags in codegen (#24220)\" (#24333)\n\nPreviously, I made a workaround in\nhttps://github.com/iree-org/iree/commit/1b75890ac8093dd9b74b8110833bd2a46b3da7f0.\nThe main reason is that all the codegen pipeline tests were anchor on a\nsingle pass (e.g., `XXXLowerExecutableTargetPass`), which makes plumbing\nthrough options infeasible.\n\nAfter migrating to the real pipeline transformations, we are able to use\nthe option in whole pipeline tests; we can drop the old workaround from\niree-opt changes.\n\nThe revision introduces the option for each backend (default `O0`) and\nuse local binders to apply the optimization flags. It drops the\n`XXX::FromFlags::get()` uses from Codegen.\n\nThe additional change is updating the default value in `Passes.td` to\nnot apply optimization level. It was a workaround when we abused the\npass for optimization control. Now all the pipeline tests use textual\npass pipelines, so we no longer need the workaround.\n\nci-extra: linux_x64_clang_tsan\n\n---------\n\nSigned-off-by: hanhanW \u003chanhan0912@gmail.com\u003e"
    },
    {
      "commit": "e6139f68ae0edd77d65777f9ceca5b02939c8ec9",
      "tree": "5e5e701a6befc92541ef46b2b3f8be6c0bae68e0",
      "parents": [
        "d055923f48285fe3047829854fca74d18bad9f14"
      ],
      "author": {
        "name": "Erick Ochoa Lopez",
        "email": "erick.ochoalopez@amd.com",
        "time": "Thu Apr 30 18:50:00 2026 -0400"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Thu Apr 30 18:50:00 2026 -0400"
      },
      "message": "[CI] Ease contention on self hosted machines (#24316)\n\nAfter https://github.com/iree-org/iree/pull/24194 we have no need to use\na benchmarking machine for cpu jobs in torch ops. So let\u0027s just use the\ngithub action ones for torch_ops cpu and help reduce contention on self\nhosted machines."
    },
    {
      "commit": "d055923f48285fe3047829854fca74d18bad9f14",
      "tree": "c018d55a7d3bdec99f3b348d8bf5cec4ce24710a",
      "parents": [
        "7247601b1967516ef7fd5cc5ae55abd5ced2397a"
      ],
      "author": {
        "name": "Nirvedh Meshram",
        "email": "96096277+nirvedhmeshram@users.noreply.github.com",
        "time": "Thu Apr 30 15:55:35 2026 -0500"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Thu Apr 30 15:55:35 2026 -0500"
      },
      "message": "Bump stablehlo to stablehlo@806a6844dfd92cca (#24330)\n\nDrops local patch since upstream stablehlo has caught up.\n\nSigned-off-by: Nirvedh Meshram \u003cnirvedh@gmail.com\u003e"
    },
    {
      "commit": "7247601b1967516ef7fd5cc5ae55abd5ced2397a",
      "tree": "322113cd61df8b4f1c536563a570072b05465438",
      "parents": [
        "ce12fef09b38b20e9251546458f97b6d416c296b"
      ],
      "author": {
        "name": "Max191",
        "email": "44243577+Max191@users.noreply.github.com",
        "time": "Thu Apr 30 16:30:48 2026 -0400"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Thu Apr 30 16:30:48 2026 -0400"
      },
      "message": "[LLVMGPU][ROCDL] Add pass to group global loads for better instruction scheduling (#24247)\n\nAdds LLVMGPUGroupGlobalLoadsPass which moves global loads in the same\nblock to be adjacent to each other when they are separated by pure\naddress computation ops. The pass moves each load along with its\ntransitive dependency chain to be right after the preceding global load.\n\nThis improves performance in situations where LLVM is not able to\nconvert address computation into a single base + constant offset. In\nsuch cases, instruction scheduling can become pessimistic and each\nglobal load needs to be waited on before the next is issued. With this\ninstruction reordering, all global loads are issued together after\naddress computation is completed.\n\nBased on benchmarks with this change alone, we don\u0027t have any cases in\nour suite of kernels that runs into this issue today. However, some\nconvolution shapes run into the issue after the changes in\nhttps://github.com/iree-org/iree/pull/24245, and this PR prevents such\nregressions.\n\nThis is only enabled for ROCDL in this PR, because we don\u0027t have any\ndata points to support adding it to other pipelines yet.\n\n---------\n\nSigned-off-by: Max Dawkins \u003cmax.dawkins@gmail.com\u003e\nCo-authored-by: Claude Opus 4.7 (1M context) \u003cnoreply@anthropic.com\u003e"
    },
    {
      "commit": "ce12fef09b38b20e9251546458f97b6d416c296b",
      "tree": "aff210ca6ba714f91d023d674cc598317fb2f752",
      "parents": [
        "967b794b6474ca56ca33e7f1cc246659f9f8bfbc"
      ],
      "author": {
        "name": "Rob Suderman",
        "email": "rob.suderman@gmail.com",
        "time": "Thu Apr 30 13:17:32 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Thu Apr 30 13:17:32 2026 -0700"
      },
      "message": "Bump iree-org/torch-mlir@d2768f876d (#24320)"
    },
    {
      "commit": "967b794b6474ca56ca33e7f1cc246659f9f8bfbc",
      "tree": "678b98a99ba192b8536a6805d1994979e8778ff1",
      "parents": [
        "fcbd569042fb7e4d206a5efe5a1651773678d6d0"
      ],
      "author": {
        "name": "Han-Chung Wang",
        "email": "hanhan0912@gmail.com",
        "time": "Thu Apr 30 13:09:28 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Thu Apr 30 20:09:28 2026 +0000"
      },
      "message": "[CPU] Add ContiguousMemrefGather1DToConditionalLoads vector lowering. (#24327)\n\nUpstream emits, per lane, a\n`delinearize(linearize(offsets, shape) + idx, shape)` to produce N-D\nload indices. It recovers the flatten index behavior from vectorization,\nwhich corrects the indices for strided memrefs. Because the \"OOB access\"\nneeds to take strides into account, and it recovers the logical indices.\nIt is a correct form for loading ops.\n\nHowever, it adds additional computation, which is not easy to remove,\nwhen it is a contiguous memref. For such cases, the vector.load -\u003e LLVM\ndialect lowering linearizes the indices based on the strides, which\nresults in the same physical address. Thus, the correction is not\nneeded, because we rely on the vector -\u003e LLVM lowering for the fixup.\n\nFrom the upstream doc, the value is target specific, so it is hard to\nupstream this \"optimization\".\n\nThe root cause is the flatten index behavior of vectorization, and it is\nnot an easy fix today.\n\nSigned-off-by: hanhanW \u003chanhan0912@gmail.com\u003e"
    },
    {
      "commit": "fcbd569042fb7e4d206a5efe5a1651773678d6d0",
      "tree": "50540e8883e424afcd03e21a2b3de5620c4f8097",
      "parents": [
        "9f7a14e222a499f1f083471fa821ead4e16b40a5"
      ],
      "author": {
        "name": "Nirvedh Meshram",
        "email": "96096277+nirvedhmeshram@users.noreply.github.com",
        "time": "Thu Apr 30 14:55:38 2026 -0500"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Thu Apr 30 14:55:38 2026 -0500"
      },
      "message": "Bump LLVM to llvm-project@6f1e6e47bdf (#24314)\n\nAPI changes from llvm-project@c1a236091832 — [mlir][python] expose\nremaining Location inspection API\n([#192630](https://github.com/llvm/llvm-project/pull/192630/changes))\n\nCarrying local patch for stablehlo because of llvm-project@1823355d06b8\n-\nhttps://github.com/iree-org/stablehlo/commit/fb869da27148c08c7f24602c3007fd5832a14cf3\n\n---------\n\nSigned-off-by: Nirvedh Meshram \u003cnirvedh@gmail.com\u003e"
    },
    {
      "commit": "9f7a14e222a499f1f083471fa821ead4e16b40a5",
      "tree": "2b51cafd3fcdeef6a8b9c84a98969b7bb17da08f",
      "parents": [
        "5872dc257388ec42499f7961094d7b664d492d70"
      ],
      "author": {
        "name": "Erick Ochoa Lopez",
        "email": "erick.ochoalopez@amd.com",
        "time": "Thu Apr 30 14:17:12 2026 -0400"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Thu Apr 30 14:17:12 2026 -0400"
      },
      "message": "[CI] Update iree-test-suite ref (#24304)\n\n* Enables benchmarking in torch_ops\n* Adds tolerance_factor to torch_ops and torch_models\n* Update/Remove deprecated flags\n\nChanges in IREE:\n* Updates golden times in torch_models to be divided by 1.1 (to account\nfor tolerance_factor)\n* Updates golden times in torch_ops by looking at the current timings.\n* For benchmark AB/8192x8192xf32_bench in\ntorch_ops_gpu_hip_gfx1201_O3.json there\u0027s a regression that has been\nnarrowed down to https://github.com/llvm/llvm-project/pull/184138.\nReverting it fixes this one, but further impact has not been studied\nyet.\n* For benchmarking torch_ops use rocprofv3"
    },
    {
      "commit": "5872dc257388ec42499f7961094d7b664d492d70",
      "tree": "76f160693a5dee3fe8e9a4b4724ee0f5eab384dd",
      "parents": [
        "0380544b17da75c1e757e96bd747aa326fe3675c"
      ],
      "author": {
        "name": "Benoit Jacob",
        "email": "jacob.benoit.1@gmail.com",
        "time": "Thu Apr 30 14:09:05 2026 -0400"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Thu Apr 30 14:09:05 2026 -0400"
      },
      "message": "[Codegen][CPU] Pick inner-tiled unroll factors from a register budget. (#24303)\n\nReplace the hard-coded `intrinsics_m \u003d intrinsics_n \u003d 1` from #24289\nwith a simple cost model based on maximizing arithmetic intensity\nunder the constraint of fitting in register space.\n\nAI: Claude Opus 4.7\n\nSigned-off-by: Benoit Jacob \u003cjacob.benoit.1@gmail.com\u003e"
    },
    {
      "commit": "0380544b17da75c1e757e96bd747aa326fe3675c",
      "tree": "bdd5635190947bc805d1fedbdedacf6ee946076d",
      "parents": [
        "a79bb7bfba94f7e26020d04359b95fc005ef0bb6"
      ],
      "author": {
        "name": "Lukas Sommer",
        "email": "lukas.sommer@amd.com",
        "time": "Thu Apr 30 10:18:28 2026 +0200"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Thu Apr 30 10:18:28 2026 +0200"
      },
      "message": "[IREEGPU] Define and expand `subgroup_scan` (#24188)\n\nThe upstream GPU dialect currently does not expose a `subgroup_scan`\noperation as analog to `subgroup_scan`.\n\nHaving the `subgroup_scan` operation allows us to lower associative\nscans across the threads in incremental steps rather than doing\ndistribution and full lowering to cross-subgroup operations in a single\nstep.\n\nThe associative scan semantics of the operation is similar to\n`vector.scan` and supports inclusive and exclusive scan. The subgroup\nsemantics is similar to `subgroup_scan` where applicable.\n\nThe expansion of the operation into lower-level vector operations and\ncross-subgroup operations (e.g., shuffle) uses the [Hillis-Steele\nalgorithm](https://en.wikipedia.org/wiki/Prefix_sum#Algorithm_1:_Shorter_span,_more_parallel).\n\nThis is part of https://github.com/iree-org/iree/issues/24186.\n\nAssisted-by: Claude Code and Codex\n\n---------\n\nSigned-off-by: Lukas Sommer \u003clukas.sommer@amd.com\u003e"
    },
    {
      "commit": "a79bb7bfba94f7e26020d04359b95fc005ef0bb6",
      "tree": "2057b0a21cfc824ceb20ec994a8bc904839d2494",
      "parents": [
        "dfe81344abeed40c6c8bcb664eb35f113deee19c"
      ],
      "author": {
        "name": "Vivian Zhang",
        "email": "zhyuhang88@gmail.com",
        "time": "Wed Apr 29 22:52:02 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Wed Apr 29 22:52:02 2026 -0700"
      },
      "message": "[LLVMGPU] Remove unused `--iree-codegen-llvmgpu-use-unaligned-gemm-vector-distribution` flag (#24308)"
    },
    {
      "commit": "dfe81344abeed40c6c8bcb664eb35f113deee19c",
      "tree": "5ec34b900fd8d1e4af1c5f85c55a464ed21edba5",
      "parents": [
        "835d1b88ab3f3a7cdd0fc952c28337143602c2d8",
        "b42f44c0d93ea2e1c2198378cd2f795d5cf7af3c"
      ],
      "author": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Wed Apr 29 22:20:15 2026 -0700"
      },
      "committer": {
        "name": "GitHub",
        "email": "noreply@github.com",
        "time": "Wed Apr 29 22:20:15 2026 -0700"
      },
      "message": "[HAL/AMDGPU] Initial host-side AMDGPU HAL implementation (#24298)\n\nThis PR lands IREE\u0027s native AMDGPU HAL driver: a direct HSA/ROCR backend\nthat owns queue submission, packet construction, memory placement,\ncommand-buffer recording/replay, profiling, counters, device-library\nselection, and future scheduling policy inside IREE instead of routing\nnormal execution through HIP. The cost is ~70kLoC but that gives IREE\ndirect ownership of AMD GPU execution instead of routing through HIP\nstreams and HIP graphs. The critical unlocks happen because IREE already\nknows the real program structure that HIP tries to guess at: explicit\nsemaphore frontiers, queue affinity, memory types, binding tables,\nreusable command-buffer blocks, executable metadata, profiling scopes,\nand replay captures. The native driver turns that structure directly\ninto AQL packets and queue-local completion state, which lets us do\nthings HIP cannot naturally express: low-overhead dynamic command\nbuffers, heterogeneous HAL device groups, future remote execution,\ndevice-side fixup/scheduling, and profiling/replay from the same command\nmodel. The early numbers show the shape of the win: ~12.5x lower submit\noverhead for cross-queue dependency edges, ~22x lower dynamic graph\nconstruction tax versus HIP graphs on a 512-dispatch chain, and ~20x\nlower steady-state host CPU time on queue-heavy submission paths. This\nis v0, but it is already the architecture we want to optimize: fewer\ncompatibility layers, more explicit contracts, and a path where AMD GPUs\nparticipate in the full HAL ecosystem instead of living behind a\nHIP-shaped abstraction boundary.\n\nThis is intentionally a large PR. The driver is not a thin shim around\none runtime call; it is the runtime boundary for AMD GPUs. 
The branch\ncontains the native driver plus the AMDGPU-specific hardening that made\nthe final shape reviewable: command-buffer replay cleanup, queue/pool\nintegration, profiling producers, target-library selection, device\ncapability handling, tests, and developer documentation.\n\nThe headlines:\n\n- IREE now has a native AMDGPU execution path based on HSA queues and\nAQL packets.\n- The driver can run normal HAL dispatches and reusable HAL command\nbuffers without HIP streams or HIP graphs.\n- The command-buffer representation is designed as a durable block\nprogram that can be replayed by host processors now and device-side\nprocessors later.\n- The profiling path can expose queue, dispatch, executable, counter,\ndevice metric, and ATT/SQTT trace data through the HAL profile tooling.\n- The hot paths are structured so static production replay does not pay\nfor optional profiling, trace, upload, or future device-fixup machinery.\n\n## Why\n\nHIP is a useful compatibility layer and comparison point, but it is not\nthe right abstraction boundary for the runtime work IREE wants to do.\n\nIREE needs to be able to control:\n\n- how HAL queue operations become AQL/PM4 packets;\n- where kernargs, command-buffer templates, transient buffers, and\nstaging records live;\n- how semaphore dependencies map to queue frontiers and completion\nepochs;\n- how reusable command buffers are recorded, validated, replayed, and\nprofiled;\n- where host work ends and queue-ordered device work begins;\n- how to capture profiling data without turning the production queue\ninto a debug path; and\n- how to evolve toward device-side command-buffer scheduling and fixup.\n\nHIP graphs are especially awkward for IREE\u0027s dynamic command-buffer use\ncase. They can be expensive to construct, hard to introspect, and\ndifficult to shape around IREE\u0027s own async allocation and replay\ncontracts. The native driver gives IREE a graph-like reusable command\nstream while keeping the command stream in IREE\u0027s own ABI.\n\n## Design Principles\n\nThe implementation follows a few constraints that are worth making\nexplicit for review.\n\n**Own the production hot path.** Queue submission, command-buffer\nreplay, kernarg formation, packet publication, and completion are\nexplicit IREE code. Optional features are allowed only when they do not\ntax the default path. For example, profiling, ATT/SQTT capture,\nqueue-control upload rings, and future device-side fixup all have opt-in\nstorage and control flow.\n\n**Record facts once.** Command buffers are allowed to do work while\nrecording and finalizing so replay can be simple. Binding counts, patch\ncounts, packet counts, barrier requirements, prepublication eligibility,\nrodata references, and block terminators are recorded in the\ncommand-buffer program instead of rediscovered by\nscanning command records during submission.\n\n**Keep host and device processors pointed at the same ABI.** The AMDGPU\ncommand buffer is a block program, not a host-only replay script. The\ncurrent host AQL block processor consumes that program; future\ndevice-side processors should consume the same block format for\ncommand-buffer continuations, scheduling, and kernarg fixup.\n\n**Separate invariant clusters.** The driver is split by subsystem rather\nthan growing one giant queue file. 
There are distinct files for queue\nsubmission, queue waits, command-buffer block processing, command-buffer\nreplay, profiling augmentation, staging/file paths, memory operations,\nexecutable handling, topology, device capabilities, and utility rings.\n\n**Fail loud on unsupported strategies.** Unsupported memory paths,\ncommand forms, profiling modes, and device capabilities should fail with\na concrete status instead of silently falling back through the wrong\nmechanism.\n\n**Make platform/device variation explicit.** The code names the places\nwhere HSA memory-pool access, HDP publication, topology links, target\nIDs, device-library coverage, Linux KFD metrics, and optional ROCm\nprofiling libraries affect behavior.\n\n## Architecture Overview\n\n### Driver And Device Model\n\nThe driver dynamically loads HSA/ROCR, discovers CPU and GPU agents, and\ncreates logical HAL devices over one or more physical AMDGPU agents.\n\nThe main object split is:\n\n- driver: HSA discovery, option parsing, and logical-device creation;\n- logical device: HAL-facing device object and shared runtime state;\n- physical device: one HSA GPU agent with queues, memory pools,\nexecutable cache, device-library selection, profiling state, device\nmetrics, and topology facts;\n- host queue: HSA queue plus IREE\u0027s AQL, kernarg, notification,\ncompletion, and reclaim state; and\n- virtual queue: the internal interface used so command-buffer, direct\ndispatch, memory, file, and profiling paths route through one queue\ncontract.\n\nDevice selection supports all visible AMDGPU agents by default,\nsingle-device selection, UUID-based selection, ordinal selection, and\nmulti-device logical devices. The topology code records HSA memory-pool\naccess, link class, NUMA distance, coherency, atomics, and interop\ncapability facts so future placement and transfer strategies can reason\nabout PCIe, xGMI, and other link types without hard-coded assumptions.\n\n### Executables And Device Libraries\n\nAMDGPU executables are loaded from HSACO/code-object data and matched\nagainst the selected physical device. The runtime also embeds AMDGPU\ndevice libraries used for builtin operations such as fill/copy helpers,\ntimestamp helpers, and dispatch-side utilities.\n\nThe device-library target map is single-sourced from generated target\nmetadata. Builds can select exact targets, LLVM generic targets,\nTheRock-style generic families, or product bundles. This keeps package\nsize and device coverage under explicit build-system control while\nletting the runtime fail clearly when a required target was not\nembedded.\n\n### Memory, Pools, And Publication\n\nThe driver integrates with the HAL pool substrate and AMDGPU HSA memory\npools instead of treating all buffers as generic allocations.\n\nThe implementation distinguishes:\n\n- device-local memory;\n- CPU-visible fine-grained host memory;\n- CPU-visible coarse-grained device memory;\n- queue-owned kernarg memory;\n- optional queue-control upload memory;\n- transient allocation pools;\n- file/staging storage; and\n- host-side block/slab pools used by queue and profiling data\nstructures.\n\nHDP publication is represented as a selected capability of the memory\npath, not as an ad hoc flush sprinkled through dispatch code. 
If CPU\nwrites to memory that the GPU will consume require publication on a\ndevice, the queue-owned memory path knows how to publish those writes\nbefore the relevant packet headers become\nvisible.\n\nThe default queue-control upload ring is disabled until a production\nconsumer opts in. That keeps the future device-side fixup path available\nwithout charging every queue an unused HSA allocation.\n\n### Queue Submission And Completion\n\nHost queues own an HSA AQL queue and maintain:\n\n- an AQL ring view for packet reservation/publication;\n- a kernarg ring for queue-owned dispatch arguments;\n- an epoch/notification ring mapping GPU completions to HAL semaphore\nsignals;\n- a queue frontier snapshot for dependency tracking;\n- one completion thread that drains queue epochs and publishes\nuser-visible semaphore completions;\n- optional PM4 IB slots indexed by AQL packet id on hardware that\nsupports AQL PM4 packets; and\n- optional profiling/counter/trace state.\n\nSubmission is serialized per queue, but independent queues do not\nsynchronize with each other. The queue submission path reserves AQL\npackets, kernargs, and notification entries before publishing headers.\nIf admission fails, reclaim is routed through the same\nnotification/reclaim machinery instead of inventing a\nparallel cleanup path.\n\nHAL ordering is represented by semaphore/frontier dependencies, not by\nassuming FIFO execution. The queue frontier machinery lets the driver\nelide redundant waits when the dependency is already known to be\nsatisfied, while preserving correctness when the frontier overflows or\ncannot prove elision.\n\n### Direct Dispatch And Builtin Operations\n\nDirect `queue_dispatch` resolves executable metadata, validates dispatch\nshape, forms kernargs, retains the executable/buffer resources required\nby the submission, and emits AQL packets through the common queue\nsubmission path.\n\nQueue buffer operations are implemented through explicit strategies.\nBuiltin device kernels cover fill/copy/update paths and are selected\nbased on alignment, size, and available device-library kernels. The code\nleaves room for SDMA, PM4, P2P, and future direct-storage strategies\nwithout conflating those with the current kernel-dispatch path.\n\n### Command Buffers\n\nThe AMDGPU command-buffer ABI is the center of the rewrite.\n\nRecorded command buffers are stored as a program of blocks. Each block\nhas a fixed header with command counts, binding-source counts,\npacket/kernarg worst case, rodata extent, dispatch/profile-marker\ncounts, barrier metadata, and a terminator. Commands include barriers,\ndispatches, fills, copies, updates,\nprofile markers, branches, conditional branches, and returns.\n\nThe important split is:\n\n- the command buffer owns the durable block program and rodata;\n- the AQL block processor consumes one block and writes reserved\npacket/kernarg storage;\n- host queue replay is the container/orchestration layer that\ninitializes a processor, invokes blocks, handles continuations, and\nintegrates with semaphores/reclaim; and\n- profiling processors are separate variants that augment replay only\nwhen profiling was explicitly requested.\n\nThis shape is deliberate. A block processor is close to a small\ninterpreter over the block ABI. 
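\n\nIn rough C, the processor shape could look like the sketch below; the type and function names are illustrative assumptions, not the landed ABI (the real command and terminator layouts live in `abi/command_buffer.h`, the real processor in `aql_block_processor.c`):\n\n```c\n#include \u003cstdint.h\u003e\n\n/* Illustrative command/block layout; hypothetical names, not the\n * landed ABI. */\ntypedef enum { CMD_BARRIER, CMD_DISPATCH, CMD_RETURN } cmd_kind_t;\ntypedef struct { cmd_kind_t kind; } cmd_t;\ntypedef struct {\n  uint32_t command_count;  /* recorded at finalize, never rediscovered */\n  const cmd_t* commands;\n} block_t;\n\n/* Consume one block: turn each recorded command into reserved\n * packet/kernarg writes; the terminator ends the block. */\nstatic void process_block(const block_t* block) {\n  for (uint32_t i \u003d 0; i \u003c block-\u003ecommand_count; i++) {\n    const cmd_t* cmd \u003d block-\u003ecommands + i;\n    switch (cmd-\u003ekind) {\n      case CMD_BARRIER:  /* emit AQL barrier packet */ break;\n      case CMD_DISPATCH: /* form kernargs, write dispatch packet */ break;\n      case CMD_RETURN:   return; /* terminator: block complete */\n    }\n  }\n}\n```\n\n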
It is suitable for dedicated tests today\nand for device-side processor variants later. Host queue code should not\nneed to know how every command body becomes AQL packets.\n\nReplay hot paths are specialized:\n\n- static reusable dispatches can use prepublished kernargs;\n- all-dynamic dispatches use a direct binding-pointer scatter path;\n- mixed static/dynamic reusable dispatches use immutable templates plus\nrecorded dynamic patch sources;\n- indirect dispatch parameters stay on the generic path where required;\nand\n- profile-disabled replay bypasses profile sidecars and trace/counter\nlogic.\n\nDynamic binding sources retain the original `queue_execute` binding\ntable slot for the entire command-buffer lifetime. There is no per-block\nbinding remap sidecar, and no finalization scan that rewrites binding\nslots. Future\ndevice-side fixup should consume recorded patch records directly: `patch\noffset + binding table slot + binding offset`.\n\n### Profiling, Counters, Traces, And Replay\n\nThe driver is a first-class producer for the HAL-native profiling and\nreplay stack.\n\nSupported profiling/data modes include:\n\n- host-side memory and queue events;\n- device-side queue timestamps;\n- per-dispatch timestamps;\n- executable/export metadata;\n- hardware/software counters;\n- queue-range PMC sampling;\n- device metrics from platform-specific sources;\n- filtered ATT/SQTT executable traces through dynamically loaded ROCm\nprofiling libraries; and\n- replay captures that can be run, benchmarked, dumped, and profiled\noutside the original application.\n\nNormal execution does not require ROCm profiling libraries. The\naqlprofile path is dynamically loaded only for modes that need counters\nor executable traces. Linux-specific device-metric support is isolated\nbehind a platform source so the core driver remains structured for\nfuture Windows and macOS HSA support.\n\n## Performance Evidence\n\nThe main apples-to-apples GPU comparison uses the SDXL CLIP prompt\nencoder: a real sharktank workload with 792 dispatches, 28 executables,\nand enough queue traffic to exercise command-buffer replay and\nhost/runtime overhead.\n\nPost-cleanup optimized non-Tracy medians:\n\n| Shape | AMDGPU wall | HIP stream wall | AMDGPU vs stream | HIP graph\nwall | AMDGPU vs graph | AMDGPU host CPU |\n| --- | ---: | ---: | ---: | ---: | ---: | ---: |\n| c1/d1 | 10.9508 ms | 11.5456 ms | 5.15% faster | 11.6199 ms | 5.76%\nfaster | 0.618 ms |\n| c1/d16 | 0.7035 ms/item | 0.7311 ms/item | 3.78% faster | 0.7335\nms/item | 4.09% faster | 0.036 ms/item |\n| c2/d16 | 0.7073 ms/item | 0.7298 ms/item | 3.08% faster | 0.7330\nms/item | 3.50% faster | 0.037 ms/item |\n| c4/d16 | 0.7066 ms/item | 0.7278 ms/item | 2.92% faster | 0.7288\nms/item | 3.05% faster | 0.037 ms/item |\n| c8/d16 | 0.7058 ms/item | 0.7322 ms/item | 3.60% faster | 0.7333\nms/item | 3.75% faster | 0.038 ms/item |\n\nThe broader model spread is consistent with the same story: native\nAMDGPU is usually ahead of HIP stream, usually ahead of HIP graph when\nHIP graph can import the workload, and uses much less host CPU on\nqueue-heavy paths.\n\nRepresentative additional rows:\n\n| Workload | Shape | AMDGPU | HIP stream | HIP graph | Notes |\n| --- | --- | ---: | ---: | ---: | --- |\n| MNIST-12 | c1/d1 | 0.0978 ms | 0.1423 ms | 0.1425 ms | Small\nclassifier, high runtime-overhead sensitivity. |\n| SqueezeNet 1.0 | c1/d1 | 1.1428 ms | 1.2043 ms | 1.1988 ms | Compact\nCNN. 
|\n| toy CLIP bf16 | c1/d1 | 0.2227 ms | 0.2578 ms | 0.2597 ms |\nTransformer-ish toy encoder. |\n| MobileNetV2-12 | c1/d1 | 1.8462 ms | 1.9316 ms | crash |\nDepthwise/mobile CNN; HIP graph crashes locally. |\n| TinyYOLOv2-8 | c1/d1 | 7.6516 ms | 8.0490 ms | 8.5600 ms | Object\ndetection graph. |\n| ResNet50-v1-12 | c1/d1 | 9.5364 ms | 9.6900 ms | import fails | HIP\ngraph node limit. |\n| SDXL scheduled UNet | c1/d1 body | 204.36 ms | 215.19 ms | 216.43 ms |\nDirect `run_forward` body. |\n| SDXL CLIP prompt encoder | c8/d16 | 0.692 ms | 0.721 ms | 0.725 ms |\nByte-identical HSACO/no-prefetch row. |\n\nWe also compared raw C HAL command-buffer construction/replay against\nraw C HIP graph construction/launch for a 512 dispatch/barrier chain,\navoiding VM overhead on both sides:\n\n| Path | Prebuilt wall | Dynamic wall | Extra wall | Extra wall /\ndispatch | Extra CPU / dispatch |\n| --- | ---: | ---: | ---: | ---: | ---: |\n| HAL command buffer, validated | 2096.4 us | 2177.0 us | 80.5 us |\n0.157 us | 0.582 us |\n| HAL command buffer, unvalidated | 2096.4 us | 2143.3 us | 46.9 us |\n0.092 us | 0.526 us |\n| HIP graph | 2983.7 us | 4022.9 us | 1039.3 us | 2.030 us | 2.308 us |\n\nThat is the key dynamic-command-buffer result: unvalidated HAL\ncommand-buffer recording/replay adds tens of microseconds for the\n512-pair chain, while HIP graph construction adds about a millisecond in\nthe same harness.\n\nQueue-stress microbenchmarks isolate the pathological submission streams\nthat large distributed and graph-style applications care about. The\ncurrent-head HAL rows below use the checked-in AMDGPU `queue_benchmark`\nbuilt optimized with release ThinLTO/O3/native flags, pinned to one CPU\nand one local RDNA3 GPU. HIP rows use the matching HIP event ping-pong\nharness on the same CPU/GPU pin. The end-to-end rows measure 512\ncross-queue dependency edges plus one public host-visible completion:\n\n| Shape | AMDGPU end-to-end / edge | HIP end-to-end / edge | Read |\n| --- | ---: | ---: | --- |\n| Cross-queue dependency edge | 4.58 us | 11.20 us | AMDGPU is 2.4x\nfaster. |\n| Edge + 4-byte device copy | 11.65 us | 14.62 us | AMDGPU is 1.25x\nfaster. |\n| Edge + 4-byte device fill | 10.98 us | 15.20 us | AMDGPU is 1.38x\nfaster. |\n| Edge + tiny dispatch | 10.55 us | 14.59 us | AMDGPU is 1.38x faster. |\n| Edge + no-op dispatch packet | 4.56 us | n/a | AMDGPU stays near the\npure dependency floor when payload work is empty. |\n\nThe pure submit-only dependency row is the sharpest host-path\ncomparison: AMDGPU submits a cross-queue dependency edge for about 0.42\nus/edge, while HIP events cost about 5.23 us/edge in the same pinned\nharness. That is about 12.5x less host-side submission overhead for the\nsynchronization pattern used by tensor-parallel and pipeline-parallel\nprograms.\n\nThis is not just an implementation-speed comparison. HIP stream events\nand HIP graphs sit above a compatibility runtime that has to rediscover\nintent from streams, events, graph nodes, kernel parameters, and raw\npointer arguments. IREE already has that intent in structured HAL\ncommands: explicit semaphore frontiers, queue affinity, binding tables,\nmemory types, command-buffer blocks, and executable metadata. The AMDGPU\nHAL can turn those contracts directly into AQL packets and queue-local\ncompletion state without routing every operation\nthrough HIP\u0027s public stream/event/graph abstraction.\n\nThat structural difference is why the CPU-time story is as important as\nthe wall-time story. 
On the SDXL CLIP prompt encoder, AMDGPU runs the\nsteady-state batched path with roughly 0.036-0.038 ms/item of host CPU\ntime while HIP stream and HIP graph paths are around 0.74-0.76 ms/item.\nThat is a roughly 20x host CPU reduction on the queue-heavy path. On\nsystems with many accelerators, expensive prefill/decode scheduling, or\nsmall CPU budgets, that difference is the difference between the CPU\nbeing orchestration glue and the CPU becoming the\nbottleneck.\n\nThe same abstraction boundary is also what lets HAL scale beyond HIP\u0027s\nworld model. HAL command buffers, semaphores, queue affinity, memory\nfiles, and device groups can describe local GPUs, CPU devices, remote\ndevices, and heterogeneous execution without changing the program\u0027s\nsynchronization model. The upcoming remote HAL work can use the same\ncommand/dependency concepts across process or machine boundaries; HIP\ncannot represent that kind of heterogeneous or remote execution graph\nwithout collapsing it back into host-side framework logic. This rewrite\nputs AMDGPU on the same HAL substrate as local-task, local-sync,\nprofiling, replay, and future remote execution instead of treating AMD\nGPUs as a HIP-shaped island.\n\nTracy and Perfetto captures were used as structural evidence for queue\nshape, host/runtime gaps, worker behavior, dispatch timing, counter\nranges, and device metric sampling. Non-Tracy optimized runs are the\nsource of the wall-time numbers above.\n\n## Portability And Hardware Coverage\n\nThe current implementation has been exercised primarily on local\nRDNA3/gfx1100 Linux hardware, but the code is structured for broader\nAMDGPU support.\n\nCross-device preparation in this PR includes:\n\n- target ID parsing and generated target maps for exact, generic,\nfamily, and product-bundle device-library selection;\n- explicit HSA memory-pool access and link-topology modeling;\n- CPU-visible device-coarse memory capability selection with HDP\npublication;\n- queue-owned kernarg publication policy;\n- PM4 capability detection and AQL PM4 IB infrastructure where\nsupported;\n- generic device-library target selection instead of hard-coding\ngfx1100; and\n- tests around target IDs, code-object target selection, topology,\nmemory access, device-library lookup, and PM4/AQL emitters.\n\nCross-platform preparation includes:\n\n- dynamic HSA loading instead of a direct link dependency;\n- platform-isolated Linux KFD/device-metric support;\n- optional dynamic loading of ROCm profiling libraries;\n- public HAL abstractions for profiling/replay rather than AMDGPU-only\ntool hooks; and\n- explicit failure for unsupported platform features.\n\nThis PR does not claim every modern RDNA/CDNA target is fully proven. It\ngives us the driver architecture, target map, and capability seams\nrequired to harden that matrix as more hardware and platform HSA stacks\nbecome available.\n\n## Forward-Looking Work Enabled By This Shape\n\nSeveral important features are intentionally not completed in this PR,\nbut the landed architecture is designed around them.\n\n**Device-side dynamic kernarg fixup.** Dynamic command buffers currently\npatch queue-owned kernargs on the host. The planned production path is\nto upload a small per-submission binding table/control record and\ndispatch a device-side fixup kernel that copies template kernargs and\npatches dynamic qwords before\npayload dispatches execute. 
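\n\nA minimal sketch of that fixup step, assuming hypothetical record and argument names (the landed patch records carry the equivalent facts):\n\n```c\n#include \u003cstddef.h\u003e\n#include \u003cstdint.h\u003e\n#include \u003cstring.h\u003e\n\n/* Hypothetical patch record; field names are illustrative only. */\ntypedef struct {\n  uint32_t kernarg_qword;   /* target patch location in the template */\n  uint32_t binding_slot;    /* original queue_execute binding-table slot */\n  uint64_t binding_offset;  /* byte offset into the bound buffer */\n} patch_record_t;\n\n/* Copy template kernargs, then patch the dynamic qwords from the\n * per-submission binding table before payload dispatches run. */\nstatic void fixup_kernargs(uint64_t* kernargs, const uint64_t* tmpl,\n                           size_t qword_count,\n                           const uint64_t* binding_table,\n                           const patch_record_t* patches,\n                           size_t patch_count) {\n  memcpy(kernargs, tmpl, qword_count * sizeof(uint64_t));\n  for (size_t i \u003d 0; i \u003c patch_count; i++) {\n    const patch_record_t* p \u003d patches + i;\n    kernargs[p-\u003ekernarg_qword] \u003d\n        binding_table[p-\u003ebinding_slot] + p-\u003ebinding_offset;\n  }\n}\n```\n\n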
**Device-side command-buffer scheduling.** The block-program ABI gives\nus a clean path to device-side processors. A device queue can invoke\nblock processors, advance command-buffer continuations, and schedule\nindependent blocks without forcing host queue code to understand every\ncommand body.\n\n**Command-buffer control flow.** The ABI already reserves branch,\nconditional branch, and return terminators. Host replay currently\nsupports the subset needed by the landed workloads; the representation\nis intentionally shaped so richer control flow can become an execution\nfeature rather than a new command-buffer format.\n\n**Binding-table-indirect dispatch ABI.** A future dispatch ABI may avoid\ndynamic kernarg pointer fixup by passing an invocation-local binding\ntable base and loading buffer pointers indirectly in kernels. That needs\ncompiler/runtime experiments to measure the cost of an extra scalar load\nversus raw pointer kernargs, but the current direct binding-table slot\ninvariant is compatible with that direction.\n\n
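To make that trade concrete, here is a schematic kernel-side contrast\nwith entirely hypothetical names; nothing below is an implemented ABI.\n\n```c\n#include \u003cstdint.h\u003e\n\n// Hypothetical invocation-local binding table: one qword base address\n// per binding-table slot, uploaded once per submission.\ntypedef struct {\n  uint64_t slots[64];  // sized by recorded binding capacity in practice\n} binding_table_t;\n\n// Direct form (today): every dynamic binding is a patched pointer kernarg.\nvoid axpy_direct(const float* x, float* y, float a, uint32_t i) {\n  y[i] +\u003d a * x[i];\n}\n\n// Indirect form (sketch): one table-base kernarg plus slot ordinals; each\n// buffer pointer costs one extra scalar load from the table.\nvoid axpy_indirect(const binding_table_t* table, uint32_t slot_x,\n                   uint32_t slot_y, float a, uint32_t i) {\n  const float* x \u003d (const float*)(uintptr_t)table-\u003eslots[slot_x];\n  float* y \u003d (float*)(uintptr_t)table-\u003eslots[slot_y];\n  y[i] +\u003d a * x[i];\n}\n```\n\n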
**PM4-backed queues and operations.** The driver now has PM4 emitters,\nPM4 program utilities, capability detection, and AQL PM4 IB slots on\nsupported hardware. That creates room for PM4-backed waits, transfers,\nprofiling snippets, and potentially lower-level queue strategies where\nHSA/AQL alone is not the best mechanism.\n\n**Transfer strategy expansion.** Current transfer paths use explicit\nbuiltin device kernels and staging strategies. The queue/file/memory\nsplit leaves room for SDMA, P2P, direct storage, and topology-aware copy\nselection without rewriting the core queue completion path.\n\n**Broader profiling.** CDNA devices should expose richer counter options\nthan the initial local setup. The queue-range PMC and profile-bundle\ninfrastructure are meant to scale into that environment without changing\nthe normal execution path.\n\n## Review Guide\n\nGood entry points for review:\n\n- `runtime/src/iree/hal/drivers/amdgpu/README.md`: user-facing driver\noverview, build flags, runtime selection, profiling, and target-library\nnotes.\n- `runtime/src/iree/hal/drivers/amdgpu/api.h`: public driver/device\noptions.\n- `runtime/src/iree/hal/drivers/amdgpu/driver.c`: driver registration,\nHSA loading, and device creation.\n- `runtime/src/iree/hal/drivers/amdgpu/logical_device.c`: HAL device\nmethods, profiling/replay integration, and physical-device\norchestration.\n- `runtime/src/iree/hal/drivers/amdgpu/physical_device.c`: HSA agent\nsetup, queue creation, memory pools, executable caches, device\nlibraries, profiling, and topology state.\n- `runtime/src/iree/hal/drivers/amdgpu/host_queue.c`: queue ownership,\ncompletion thread, submission state, and reclaim lifetime.\n- `runtime/src/iree/hal/drivers/amdgpu/host_queue_submission.c`: common\nsubmission admission, publication, and failure/reclaim path.\n- `runtime/src/iree/hal/drivers/amdgpu/aql_command_buffer.c`:\ncommand-buffer recording, layout, prepublication, dynamic binding\nstrategy, and block construction.\n- `runtime/src/iree/hal/drivers/amdgpu/abi/command_buffer.h`: durable\ncommand-buffer block ABI.\n- `runtime/src/iree/hal/drivers/amdgpu/aql_block_processor.c`:\nunprofiled AQL block processor.\n- `runtime/src/iree/hal/drivers/amdgpu/aql_block_processor_profile.c`:\nprofiling-augmented block processor.\n- `runtime/src/iree/hal/drivers/amdgpu/host_queue_command_buffer*.c`:\nhost replay orchestration, block submission, packet policy, scratch\nstorage, and profiling integration.\n- `runtime/src/iree/hal/drivers/amdgpu/profile_*.c`: profile producers\nfor events, metadata, counters, device metrics, and traces.\n- `runtime/src/iree/hal/drivers/amdgpu/device/*.c`: embedded device-side\nhelper kernels and host-side packet/kernarg formation helpers.\n- `runtime/src/iree/hal/drivers/amdgpu/util/*.c`: HSA loading, target\nIDs, code-object metadata, rings, signals, PM4/AQL emitters, topology,\nand KFD utilities.\n\n## Validation\n\nValidation covered both source-level unit tests and workload-level\nevidence:\n\n- focused AMDGPU unit tests for HSA loading, target IDs, code-object\nmetadata, device libraries, topology, capabilities, pools, signals,\nrings, emitters, executables, semaphores, allocators, command buffers,\nblock processors, host queue submission, staging, profiling\nmetadata/events, and CTS backends;\n- AMDGPU HAL CTS dispatch/executable coverage;\n- focused Linux Bazel ASAN builds/tests for the AMDGPU runtime targets;\n- focused CMake configure/build/test coverage for AMDGPU runtime\nlibraries and generated CTS artifacts;\n- Windows and macOS CMake validation of the shared\nHAL/async/profile/replay substrate that this driver depends on;\n- SDXL CLIP correctness on both visible local AMDGPU devices with the\nsame weights, inputs, and expected outputs used for CPU validation;\n- SDXL CLIP, SDXL UNet, model-spread, command-buffer-vs-HIP-graph,\nTracy, Perfetto, device-metrics, PMC, and ATT/SQTT profiling runs; and\n- pre-commit formatting/check generation hooks for the final branch.\n\nThe performance numbers in this PR are from optimized non-Tracy runs on\nmy machine; YMMV. Tracy, Perfetto, counters, and device metrics were\nused to explain structure and validate behavior, not as the source of\nwall-clock claims."
    },
    {
      "commit": "b42f44c0d93ea2e1c2198378cd2f795d5cf7af3c",
      "tree": "5ec34b900fd8d1e4af1c5f85c55a464ed21edba5",
      "parents": [
        "bf7d2b5765e143f8b12198b4f1c9489f3f4459ea"
      ],
      "author": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Wed Apr 29 13:53:20 2026 -0700"
      },
      "committer": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Wed Apr 29 13:53:20 2026 -0700"
      },
      "message": "[HAL/AMDGPU] Use status matcher in notification test\n\nUse the IREE status test macro for notification-ring callback status validation instead of checking the raw boolean status predicate. This keeps AMDGPU tests aligned with the runtime test style and preserves the diagnostic status payload on failure.\n"
    },
    {
      "commit": "bf7d2b5765e143f8b12198b4f1c9489f3f4459ea",
      "tree": "901b839f6ebb70b775cbc6b0e6dc7fa732504bb4",
      "parents": [
        "3ac416476e87e377125895077721b4115fd0247a"
      ],
      "author": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Tue Apr 28 18:37:57 2026 -0700"
      },
      "committer": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Wed Apr 29 13:46:13 2026 -0700"
      },
      "message": "[HAL/AMDGPU] Disable queue upload rings by default\n\nQueue-control upload rings are intended for device-side command-buffer fixup and future device-queue control records, but there is no production consumer in the current landing slice. Allocating 64 KiB of HSA memory for every host queue is therefore future-strategy storage on the default path.\n\nMake the default upload capacity zero and treat zero as an explicit disabled state at the logical, physical, and host queue option boundaries. Non-zero capacities still have the same power-of-two contract, and the queue only initializes the upload ring when a caller opts in.\n\nThis keeps the upload ring utility and reclaim plumbing available for the device-side fixup work without charging ordinary dispatch or static command-buffer replay for an unused per-queue allocation.\n"
    },
    {
      "commit": "3ac416476e87e377125895077721b4115fd0247a",
      "tree": "7cb0d1890506c89106e9b450a8c5e02f2e6c7d7e",
      "parents": [
        "8d437bdd18242f02d4caa99b62ced2267ea9c43d"
      ],
      "author": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Sat Apr 25 20:55:28 2026 -0700"
      },
      "committer": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Wed Apr 29 13:46:13 2026 -0700"
      },
      "message": "[HAL/AMDGPU] Remove dynamic binding slot sidecars\n\nDynamic command-buffer binding sources should keep one meaning for their entire lifetime: the slot field indexes the queue_execute binding table. The removed sidecar pass changed that field during command-buffer end(), turning it into a block-local dense pointer-table ordinal after scanning finalized blocks.\n\nDrop that finalization pass and its retained per-block sidecar list. Host replay now resolves binding table entries into raw base pointers indexed by the original binding slots, and multi-block replay resolves that table once in the replay continuation before invoking block submissions.\n\nThis keeps command-buffer construction from walking recorded blocks just to discover binding usage, removes the slot-remap representation, and leaves future device-side kernarg fixup able to consume recorded patch slot/offset records directly.\n"
    },
    {
      "commit": "8d437bdd18242f02d4caa99b62ced2267ea9c43d",
      "tree": "85eeb604518147bcf359d1b80aa6680be6ebbadc",
      "parents": [
        "237efe61c9a35abde7cebeb0987cbd322ec8c66d"
      ],
      "author": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Sat Apr 25 18:39:44 2026 -0700"
      },
      "committer": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Wed Apr 29 13:46:12 2026 -0700"
      },
      "message": "[HAL/AMDGPU] Specialize all-dynamic dispatch replay\n\nAll-dynamic command-buffer dispatches still used the generic HAL kernarg replay strategy. That interpreter has to branch over raw static, queue_execute dynamic, and deferred static-buffer sources for each binding even when recording already proved every dispatch binding is dynamic.\n\nAdd a DYNAMIC_BINDINGS kernarg strategy for non-indirect all-dynamic HAL dispatches. Recording keeps the compact inline tail payload instead of building a zero-filled kernarg template, while base and profiled AQL block processors use a straight pointer scatter from the dense block-sidecar binding pointer table.\n\nMixed static/dynamic reusable dispatches continue to use patched templates, and indirect parameter sources stay on the generic path because they resolve from the binding table directly.\n"
    },
    {
      "commit": "237efe61c9a35abde7cebeb0987cbd322ec8c66d",
      "tree": "40eea41c607e8634fbfda4b70c93b391c24cc1cf",
      "parents": [
        "34c5ba3f2559cb927fa07d709334e77e560e1887"
      ],
      "author": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Sat Apr 25 18:33:58 2026 -0700"
      },
      "committer": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Wed Apr 29 13:46:12 2026 -0700"
      },
      "message": "[HAL/AMDGPU] Compact dynamic binding pointer replay\n\nDynamic command-buffer replay was still caching resolved binding pointers in a sparse array indexed by the original queue_execute binding slot. That kept replay scratch storage tied to the command buffer maximum binding slot even after finalization had built a compact per-block dynamic slot sidecar.\n\nRewrite dynamic dispatch binding-source slots at command-buffer finalization to dense sidecar ordinals. The sidecar continues to store the original queue_execute binding slots, while host replay resolves only those used slots into a compact pointer table consumed by both the base and profiling block processors.\n\nStatic and fully prepublished blocks keep the same no-sidecar path, while dynamic replay gets a denser host-side representation for future queue-upload and device-fixup publication.\n"
    },
    {
      "commit": "34c5ba3f2559cb927fa07d709334e77e560e1887",
      "tree": "e7506bf26fcf037426d8a2bd37dfff65d7f9db09",
      "parents": [
        "2fc0cacf4081c3fb24ac89f66374c90f58eceef5"
      ],
      "author": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Fri Apr 24 22:32:50 2026 -0700"
      },
      "committer": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Wed Apr 29 13:46:12 2026 -0700"
      },
      "message": "[HAL/AMDGPU] Bake dynamic binding slots into command buffers\n\nDynamic command-buffer replay resolved dispatch binding pointers by scanning every block binding-source record on each queue_execute submission, then filtering for dynamic non-indirect sources. That made a recording-time fact part of the replay hot path and kept the device-fixup path from having a crisp block-level signal for whether dynamic binding metadata exists.\n\nBuild a per-block dynamic binding slot sidecar when AQL command buffers are finalized. The block ABI spends its reserved byte on a HAS_DYNAMIC_BINDING_SLOTS flag, so static and fully prepublished blocks can skip the lookup entirely while dynamic blocks resolve only the compact recorded slot list before invoking the block processor.\n\nTighten the replay binding-table contract while touching the path: a dynamic slot must fit both the command buffer recorded binding capacity and the queue_execute binding table actually supplied by the caller before the host indexes the table.\n"
    },
    {
      "commit": "2fc0cacf4081c3fb24ac89f66374c90f58eceef5",
      "tree": "4c4792ef42303d6ebe98ef84b505cbc4ab819661",
      "parents": [
        "85117aadcc9da0ac1c7aefd79394cbbd9ddcbfb2"
      ],
      "author": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Fri Apr 24 14:35:34 2026 -0700"
      },
      "committer": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Wed Apr 29 13:46:12 2026 -0700"
      },
      "message": "[HAL/AMDGPU] Sample counter ranges on a profile queue\n\nCounter range profiling was hard-coded to host queue 0. That is also the deterministic target for IREE_HAL_QUEUE_AFFINITY_ANY, so a 1ms periodic flush could not interrupt a long submitted workload: the range stop/start packets sat behind the work they were supposed to sample.\n\nRoute counter range enable/start/flush through a small queue-selection helper and use the final host queue for range sampling, with queue 0 as the one-queue fallback. This keeps the default queue available for ordinary submissions while letting the profiling flusher run near its requested cadence on devices with multiple host queues.\n\nA 100-iteration SDXL prompt-encoder capture at a 1ms flush interval moved from 204 device-time-range samples with 5.47ms average range duration to 1102 samples with 0.98ms average range duration. Add coverage that device-time-range counter samples are accepted by the test sink and are emitted on the selected profile queue.\n"
    },
    {
      "commit": "85117aadcc9da0ac1c7aefd79394cbbd9ddcbfb2",
      "tree": "9eeaf0a278be17b4c095a8ee58310948ac3fdfbb",
      "parents": [
        "6ebc9c9279690b46e420643cad237233a6ea4195"
      ],
      "author": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Fri Apr 24 14:19:10 2026 -0700"
      },
      "committer": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Wed Apr 29 13:46:12 2026 -0700"
      },
      "message": "[HAL/AMDGPU] Fix patched-template profile metadata\n\nPatched-template command-buffer dispatches store only the dynamic binding patch sources in the block sidecar. The operation metadata path was walking binding_count entries from that compact sidecar, which is only valid for full HAL binding-source lists. Mixed static/dynamic dispatches could therefore read past the block allocation when profiling forced retained command-buffer metadata.\n\nClassify patched-template binding flags directly from the dispatch strategy and patch-source count instead of treating binding_count as a sidecar length. This preserves static/dynamic operation attribution without touching the queue hot path.\n\nExtend the mixed dynamic dispatch test to retain profile metadata and assert that the registered dispatch operation reports both static and dynamic bindings, covering the sidecar shape that exposed the ASAN failure.\n"
    },
    {
      "commit": "6ebc9c9279690b46e420643cad237233a6ea4195",
      "tree": "4f1a505bd78273cbcce224d4db01769b1fdc4df8",
      "parents": [
        "d664e03a7a15e9cd96ac5213bc2313ad65fb7434"
      ],
      "author": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Tue Apr 28 11:45:39 2026 -0700"
      },
      "committer": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Wed Apr 29 13:46:12 2026 -0700"
      },
      "message": "[HAL/AMDGPU] Add queue-range counter profiling\n\nAdd a separate counter-ranges profiling data family and CLI mode for low-disturbance AMDGPU PMC capture. The existing counters mode remains dispatch-scoped and continues to require retained command-buffer metadata, while counter-ranges avoids dispatch metadata, dispatch event storage, and command-buffer profiling sidecars.\n\nAMDGPU counter sessions now distinguish dispatch samples from queue-carried physical-device ranges. Dispatch sample resources are still enabled on every host queue when requested. Range resources are materialized only on the first host queue for each physical device so device-global PMCs are not started and stopped by overlapping queues. Each range queue owns two pre-created banks, stops the active bank on flush/end, optionally restarts the alternate bank in the same queue-ordered reservation, and writes device-time-range counter samples after the cold flush wait completes.\n\nRange-only profiles now emit begin/end clock correlations so one-shot captures have a real device-clock fit. The Perfetto renderer projects device_time_range counter samples onto separate range-counter tracks, keeping them distinct from dispatch-scoped attribution counters.\n"
    },
    {
      "commit": "d664e03a7a15e9cd96ac5213bc2313ad65fb7434",
      "tree": "923983bbf9f4155ada02ee33e6eabaed97866469",
      "parents": [
        "7953fbda7ba4ed05a3bf72b77f33a23df3683bb9"
      ],
      "author": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Tue Apr 28 11:45:39 2026 -0700"
      },
      "committer": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Wed Apr 29 13:46:12 2026 -0700"
      },
      "message": "[HAL/AMDGPU] Initialize queue upload rings\n\nThread a per-host-queue upload-ring capacity through the public logical-device options, physical-device options, and host queue initialization. Each host queue now eagerly creates its queue-control upload ring from the same HSA memory pool and host-write publication policy selected for queue-owned kernargs, keeping descriptor lifetime cold and simple.\n\nKeep existing dispatch and command-buffer replay paths on the exact no-upload submission helper. I intentionally did not add a generic upload request argument to the shared kernel-submission path: until the device-side fixup consumer exists, that would be a new branch on every current submission for no production value. The first upload-using path should enter through a specialized admission shape so static replay keeps paying zero upload checks.\n\nGroup kernel-submission kernarg and queue-upload reclaim state into named aggregates while touching the submission representation. This avoids growing another bag of top-level fields and keeps the reclaim watermark contract explicit without increasing the hot submission state.\n"
    },
    {
      "commit": "7953fbda7ba4ed05a3bf72b77f33a23df3683bb9",
      "tree": "cbf975bc40344cea61582ffa85d236c4195fd450",
      "parents": [
        "593764a5257f511e2565aef090c833e6ddf1c35f"
      ],
      "author": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Tue Apr 28 11:45:39 2026 -0700"
      },
      "committer": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Wed Apr 29 13:46:12 2026 -0700"
      },
      "message": "[HAL/AMDGPU] Track upload ring reclaim positions\n\nExtend notification-ring reclaim entries so queue-owned upload bytes can retire through the same completion epoch as kernargs. The new reclaim-position API reports both kernarg and upload ring watermarks, while the existing kernarg-only wrappers remain for current callers and tests that do not care about upload storage.\n\nThread the upload watermark through kernel-shaped host queue submissions, including the failed-submission noop path that plugs already-reserved AQL slots. Host queue drain and teardown now reclaim all queue-owned ring positions through one helper; the upload watermark stays zero until a submission path actually allocates upload spans, so static command-buffer replay remains untouched.\n\nAdd notification-ring coverage for reporting both queue-owned watermarks across zero-signal epochs.\n"
    },
    {
      "commit": "593764a5257f511e2565aef090c833e6ddf1c35f",
      "tree": "b0a2ea68e5405c819dcce06664d83ac7047586a2",
      "parents": [
        "48712810f9e12303ffffa2ff4caf067406280d70"
      ],
      "author": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Tue Apr 28 11:45:39 2026 -0700"
      },
      "committer": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Wed Apr 29 13:46:12 2026 -0700"
      },
      "message": "[HAL/AMDGPU] Add queue upload ring primitive\n\nAdd a byte-granular queue upload ring utility for small device-visible control records. The ring reserves contiguous aligned spans, tracks logical byte positions for epoch reclaim, and uses the same host-write publication policy as queue-owned kernargs so future device-side fixup inputs can share the HDP/no-op publication decision.\n\nKeep the primitive under util:queue_primitives instead of host_queue so command-buffer replay does not grow another private lifetime system. This commit only adds the reusable allocator and tests; production queue paths are unchanged until the notification/admission wiring lands.\n\nRegenerate the CMake metadata for the new Bazel target.\n"
    },
    {
      "commit": "48712810f9e12303ffffa2ff4caf067406280d70",
      "tree": "98259edaffc2b3eb7d1830448942e1993791e6eb",
      "parents": [
        "c6caae608db50011f31aa71edf4cccc9708ef6d7"
      ],
      "author": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Tue Apr 28 11:45:38 2026 -0700"
      },
      "committer": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Wed Apr 29 13:46:12 2026 -0700"
      },
      "message": "[HAL/AMDGPU] Record mixed dynamic kernarg templates\n\nAdd a PATCHED_TEMPLATE command-buffer kernarg strategy for reusable direct dispatches that have both static and dynamic HAL ABI bindings. Recording now writes an immutable rodata kernarg template with static binding pointers, constants, and implicit-arg bytes already populated, plus a compact patch list containing only the dynamic binding table slot and destination kernarg qword.\n\nHost block replay copies the template into queue-owned kernarg storage and then runs a tight scatter loop over those dynamic patch records. Static reusable dispatches stay on the existing PREPUBLISHED path, while one-shot and all-dynamic dispatches keep using the inline HAL form for now so we do not add template bytes without eliminating binding-table resolution.\n\nKeep the ABI record size fixed by repacking the binding-source record, and cover the new shape with an end-to-end command-buffer queue execution test that mixes a static input binding with a dynamic output binding.\n"
    },
    {
      "commit": "c6caae608db50011f31aa71edf4cccc9708ef6d7",
      "tree": "262958959aa50599e4a11c2b899e23c32af92f91",
      "parents": [
        "f60c162ba9c0a28ef1ce2a70e2d3df603deb979c"
      ],
      "author": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Tue Apr 28 11:45:38 2026 -0700"
      },
      "committer": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Wed Apr 29 13:46:12 2026 -0700"
      },
      "message": "[HAL/AMDGPU] Test external buffer fail-loud contracts\n\nMake the cross-platform external-buffer stance explicit in allocator coverage. AMDGPU supports host-allocation import through the HSA memory-lock path, but device-allocation, opaque fd, and opaque Win32 imports are not implemented, and export is unavailable.\n\nThe tests lock that down so future Windows/macOS HSA work has to replace a deliberate fail-loud contract instead of accidentally inheriting an enum-shaped stub.\n"
    },
    {
      "commit": "f60c162ba9c0a28ef1ce2a70e2d3df603deb979c",
      "tree": "ad9ecd35176e2f69ff35e9f0b653764f097ae68b",
      "parents": [
        "1d278204d82545ea540f8fccb96cc944bbd11a5a"
      ],
      "author": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Tue Apr 28 11:45:38 2026 -0700"
      },
      "committer": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Wed Apr 29 13:46:12 2026 -0700"
      },
      "message": "[HAL/AMDGPU] Centralize physical topology edge selection\n\nMove the physical source/destination topology edge decision table into physical_device_capabilities.*, separating HSA fact collection from policy selection. logical_device.c now queries memory-pool access and link-hop records, then feeds a pure selector that records coarse/fine access, grant-required peer access, PCIe/xGMI/etc link flags, coherency, 32-bit and 64-bit atomics, link class/cost/NUMA distance, and derived HAL topology modes/capabilities.\n\nThis keeps the queue, command-buffer, and copy hot paths free of new buffer-snooping or recurring validation while giving future SDMA/P2P strategy selection named cold-path facts to consume. Unsupported copy strategies remain feature slots rather than implicit queue-path branches.\n\nAdd synthetic coverage for xGMI, PCIe without coherent/system-atomic support, multi-hop worst-case collapse, grant-required peer memory, no-access host-staged fallback, and invalid HSA fact inputs.\n"
    },
    {
      "commit": "1d278204d82545ea540f8fccb96cc944bbd11a5a",
      "tree": "8673fc6452906bd92b875657728c9c09121b3521",
      "parents": [
        "fc1dcfe5f06cddef23d47faf619086fec557d5bb"
      ],
      "author": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Tue Apr 28 11:45:38 2026 -0700"
      },
      "committer": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Wed Apr 29 13:46:12 2026 -0700"
      },
      "message": "[HAL/AMDGPU] Split device metrics source sampling\n\nMove platform-neutral device-metrics session lifecycle, record packing, metadata emission, and sample chunk writing into profile_device_metrics.c, and move the Linux sysfs/gpu_metrics implementation into a dedicated source leaf. The common source boundary now carries only profile metadata, metric ids, sample ids, sample builders, and opaque implementation state, so the profile emitter no longer knows about Linux paths, sysfs slots, or file descriptors.\n\nKeep Linux behavior explicit in profile_device_metrics_linux.c. It owns PCI sysfs path discovery, hwmon/scalar file discovery, gpu_metrics parsing, unavailable-read handling, and non-Linux UNIMPLEMENTED stubs for device-metrics source initialization. The source state is grouped by discovery and open-file responsibilities instead of keeping fd/path bags in the shared representation.\n\nTighten initialization ownership while splitting the file. Source initialization now cleans any partially initialized platform state before returning failure, while the session tracks only the successfully initialized prefix. This avoids cleanup walking never-initialized source storage and gives future platform metric sources the same ownership contract.\n"
    },
    {
      "commit": "fc1dcfe5f06cddef23d47faf619086fec557d5bb",
      "tree": "840f7987916d4292bf74a9520899136b579bd266",
      "parents": [
        "f8505e9c1eb6cac68f8c7c548399791142ed47a5"
      ],
      "author": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Tue Apr 28 11:45:38 2026 -0700"
      },
      "committer": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Wed Apr 29 13:46:12 2026 -0700"
      },
      "message": "[HAL/AMDGPU] Abstract profile device clock sampling\n\nMove profiling clock-correlation sampling behind util/device_clock. The system object now owns a platform clock source instead of a raw KFD descriptor, logical devices sample through that source, and physical devices carry the HSA driver_uid without baking the Linux KFD name into the core device identity.\n\nKeep util/kfd as the Linux ioctl transport. It now returns raw AMDKFD_IOC_GET_CLOCK_COUNTERS values, while device_clock owns the generic validation and source-type dispatch so future Windows or macOS HSA support has a named unavailable/source boundary instead of spreading platform branches through profiling consumers.\n\nEmit clock-correlation chunks only for profiling data families that consume HSA/device timestamps. Host-only queue events, memory events, executable metadata, and device metrics no longer require a device clock source, which keeps unrelated profiling modes alive on platforms without an equivalent clock-correlation API.\n"
    },
    {
      "commit": "f8505e9c1eb6cac68f8c7c548399791142ed47a5",
      "tree": "09931c16cf4e33ab1666d262c51ea6d12ba3b4dd",
      "parents": [
        "b383a528d83cc73620af5e32e73017589fddc37f"
      ],
      "author": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Tue Apr 28 11:45:38 2026 -0700"
      },
      "committer": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Wed Apr 29 13:46:12 2026 -0700"
      },
      "message": "[HAL] Extract profile event ring utility\n\nMove the host-side lossy profile event ring mechanics shared by local profiling and AMDGPU queue/memory profiling into hal/utils/profile_event_ring. The helper owns only record positions, power-of-two wrapping, event id assignment, dropped-record accounting, snapshot spans, and commit-after-write bookkeeping; callers still own storage, synchronization, profile metadata, and sink sequencing.\n\nConvert local profile rings and AMDGPU host-side queue/memory event streams to the utility. AMDGPU dispatch and queue-device profiling rings intentionally stay explicit because they are exact device-visible timelines tied to packet publication, reservation cancellation, completion-signal harvest, counters, and trace slots. Hiding those behind the lossy host-ring helper would blur the important queue contract instead of simplifying it.\n\nThe AMDGPU host-side event flush path now writes the same two-span snapshot style as local profiling instead of allocating and copying a temporary contiguous buffer. This keeps profiling-disabled paths unchanged while removing duplicate cold-path machinery.\n"
    },
    {
      "commit": "b383a528d83cc73620af5e32e73017589fddc37f",
      "tree": "36ed0407ea0d616b588f3f9a2e10cc62b4c94c50",
      "parents": [
        "14c82d948993ae6056c5a6f503a05d0326b8cc24"
      ],
      "author": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Tue Apr 28 11:45:38 2026 -0700"
      },
      "committer": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Wed Apr 29 13:46:12 2026 -0700"
      },
      "message": "[HAL/AMDGPU] Stage generated inputs through coarse memory\n\nMark AMDGPU DEVICE_LOCAL|HOST_VISIBLE allocations as low-performance in the allocator compatibility query. Generic HAL buffer generation already probes a host-visible variant first, and without this hint it generated benchmark file inputs directly into fine-grained GPU-local memory instead of producing host data and transferring into coarse device-local memory.\n\nThis keeps the policy in the allocator instead of adding AMDGPU branches to buffer_view_util. Explicit coherent host-visible allocations remain valid and queue-usable; they just stop being selected as the fast path for dispatch inputs when a staged copy can produce the requested DEVICE_LOCAL buffer.\n"
    },
    {
      "commit": "14c82d948993ae6056c5a6f503a05d0326b8cc24",
      "tree": "eaa67a7760a1b013055be155c7bba6c2ee3f48ce",
      "parents": [
        "c156d33baba00e1b06d38a1ef933d203abf6ae28"
      ],
      "author": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Tue Apr 28 11:45:38 2026 -0700"
      },
      "committer": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Wed Apr 29 13:46:12 2026 -0700"
      },
      "message": "[HAL/AMDGPU] Document command-buffer fence policy\n\nRecord the command-buffer fence-scope invariant at the recording and replay choke points. Recording resolves HAL visibility scopes once into compact command flags, while block replay only applies additive submission overlays for waits, queue-owned kernargs, and terminal signal release.\n\nThis closes the remaining policy gap without changing packet formation or adding hot-path validation. The comments make the zero-rediscovery contract explicit so future packet-policy work does not drift back toward submit-time operand scans or broad internal AGENT barriers.\n"
    },
    {
      "commit": "c156d33baba00e1b06d38a1ef933d203abf6ae28",
      "tree": "86029e51eb44a48457da09a17e9a9c2a33273261",
      "parents": [
        "55eea4a7a2a194002482f6583f55a3b33c7d671b"
      ],
      "author": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Tue Apr 28 11:45:38 2026 -0700"
      },
      "committer": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Wed Apr 29 13:46:12 2026 -0700"
      },
      "message": "[HAL/AMDGPU] Mark grant-required peer memory\n\nTreat HSA DISALLOWED_BY_DEFAULT as grantable, not already native. The memory-pool selector now maps it to COPY and tags the topology edge with PEER_ACCESS_REQUIRES_GRANT, so generic scheduling never sees NATIVE direct buffer access unless HSA reports ALLOWED_BY_DEFAULT.\n\nKeep the grant policy cold and explicit. Positive physical capabilities such as peer coherency and atomics are still intersected across composite logical-device pairs, while grant requirements are unioned because any physical pair can constrain the generic edge. Allocation grants remain placement-scoped through the existing access policy, and the new topology bit is the hook for a future peer grant/import strategy without buffer snooping or hot-path HSA queries.\n"
    },
    {
      "commit": "55eea4a7a2a194002482f6583f55a3b33c7d671b",
      "tree": "c13a3fb2f91b99fa85c3906dbbc97db1ef71c5d6",
      "parents": [
        "9b634df35b1c55ab085ba6a4d769582d2033f768"
      ],
      "author": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Tue Apr 28 11:45:38 2026 -0700"
      },
      "committer": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Wed Apr 29 13:46:12 2026 -0700"
      },
      "message": "[HAL/AMDGPU] Separate SVM facts from peer flags\n\nRecord AMDGPU SVM/HMM memory-system facts as cold capability state instead of deriving peer behavior from SVM_ACCESSIBLE_BY_DEFAULT. System info now keeps SVM support, default pageable access, and XNACK mode grouped together, and each physical device records SVM direct-host access plus selected fine/coarse device-local placement facts.\n\nAdd a generic SHARED_VIRTUAL_ADDRESS device capability and map it to the existing topology edge bit without letting it select NATIVE buffer modes. AMDGPU now maps SVM_SUPPORTED to SHARED_VIRTUAL_ADDRESS and SVM_ACCESSIBLE_BY_DEFAULT to UNIFIED_MEMORY, while peer addressability and coherency stay owned by per-pool/per-link refinement.\n"
    },
    {
      "commit": "9b634df35b1c55ab085ba6a4d769582d2033f768",
      "tree": "c5d42676a25deb00bb0a5deff0e4a2251a906c64",
      "parents": [
        "fb363f690cf903df27c73483dd08c1d26f28b3c7"
      ],
      "author": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Tue Apr 28 11:45:38 2026 -0700"
      },
      "committer": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Wed Apr 29 13:46:12 2026 -0700"
      },
      "message": "[HAL/AMDGPU] Split kernarg benchmark counters\n\nReport command-buffer kernarg accounting with the same distinction the replay path actually cares about: logical payload bytes, prepublished storage span, and queue-ring reserved bytes. The old queue_kernarg_bytes counter looked like the hot path had no kernarg cost for zero-binding dispatches, even though queue replay still reserves at least one 64-byte kernarg block per non-prepublished dispatch.\n\nKeep the accumulator grouped by representation concept instead of growing another run of similarly prefixed locals. The benchmark walk remains cold instrumentation after the measured loop; command-buffer recording, finalization, and block-processor replay are unchanged.\n"
    },
    {
      "commit": "fb363f690cf903df27c73483dd08c1d26f28b3c7",
      "tree": "9c1e188204e3a3730137aa93c7581914f4a8b16f",
      "parents": [
        "75fbea92b84990d3aed491e81bd456795e170a66"
      ],
      "author": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Tue Apr 28 11:45:38 2026 -0700"
      },
      "committer": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Wed Apr 29 13:46:12 2026 -0700"
      },
      "message": "[HAL/AMDGPU] Report prepublished kernarg replay counters\n\nAdd command-buffer program counters to the AMDGPU queue benchmark that distinguish prepublished dispatch kernargs from queue-time kernarg replay. The reporting path walks finalized command records after the benchmark loop and records prepublished dispatch count, logical prepublished bytes, materialized storage span, queue-kernarg dispatch count, and queue-kernarg bytes.\n\nThis is intentionally cold benchmark instrumentation only: command-buffer recording, finalization, and block-processor replay are unchanged. The counters give the prepublished-kernarg workstream a direct guardrail for whether a row actually removed queue-time kernarg traffic before we interpret timing differences.\n"
    },
    {
      "commit": "75fbea92b84990d3aed491e81bd456795e170a66",
      "tree": "31e7f47c6e1240d919bbff7ea570ee53f7ac6337",
      "parents": [
        "a81bd5eafe337d85cc7b75c2119cd7052d8f587e"
      ],
      "author": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Tue Apr 28 11:45:38 2026 -0700"
      },
      "committer": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Wed Apr 29 13:46:12 2026 -0700"
      },
      "message": "[HAL/AMDGPU] Record prepublished kernarg totals\n\nHoist prepublished kernarg materialization accounting into command-buffer recording. Recording now updates the immutable-template count, payload length, and maximum alignment when a prepublished dispatch template is appended, so end() can skip the old rodata classification pass when no templates were recorded.\n\nKeep the representation grouped by responsibility: the prepublished kernarg state now has separate storage, templates, and materialized aggregates instead of a flat set of buffer fields. Replay still consumes the same finalized payload_reference byte offsets, so the block-processor hot path and command-buffer ABI shape stay unchanged.\n"
    },
    {
      "commit": "a81bd5eafe337d85cc7b75c2119cd7052d8f587e",
      "tree": "257a2e108217503a2200ba2d5e49386d4191aa48",
      "parents": [
        "2a36353580a1125881a4290e31441454b4657221"
      ],
      "author": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Fri Apr 24 08:21:08 2026 -0700"
      },
      "committer": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Wed Apr 29 13:46:12 2026 -0700"
      },
      "message": "[HAL/AMDGPU] Document profiling and replay workflows\n\nMove the AMDGPU README expansion out of the generic profile-render packaging slice and land it after the AMDGPU device-library and profiling pieces it describes.\n"
    },
    {
      "commit": "2a36353580a1125881a4290e31441454b4657221",
      "tree": "cd99fd3b9ec90c310b2b0afac7df859e01b5493c",
      "parents": [
        "b905b7d3a5b471d5851b7179bc6bafb99104b712"
      ],
      "author": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Tue Apr 28 11:45:38 2026 -0700"
      },
      "committer": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Wed Apr 29 13:46:12 2026 -0700"
      },
      "message": "[HAL/AMDGPU] Split device-library target selection\n\nMove device-library target candidate construction out of the HSA loading file and into a pure util leaf. device_library.c now stays focused on agent ISA enumeration, embedded file lookup, and code object loading, while device_library_target owns the cold selection policy for exact, feature-bearing, and generic fallback target strings.\n\nAdd focused coverage for feature-bearing ISA candidate ordering, gfx12.5 generic-family fallback, and whole-segment embedded file architecture matching. The selection path can strip features to search for fallback filenames, but the test keeps that as an ordered lookup policy instead of letting prefix matching declare feature-bearing generic targets compatible with stripped binaries.\n\nDocument the current generated target-map audit for gfx9-4, gfx11, gfx12, and gfx12.5 so future architecture additions have a visible single-source policy boundary.\n"
    },
    {
      "commit": "b905b7d3a5b471d5851b7179bc6bafb99104b712",
      "tree": "c55b1129365292a282ea4efe859ec90dfbb3c9fc",
      "parents": [
        "a809ed5a48384757bf3c40f57c6210a71a8db98a"
      ],
      "author": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Tue Apr 28 11:45:38 2026 -0700"
      },
      "committer": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Wed Apr 29 13:46:12 2026 -0700"
      },
      "message": "[HAL/AMDGPU] Test executable target inference\n\nAdd a cold executable test that exercises the target-ID inference boundary directly. Raw HSACO inference now has coverage for V5 ELF feature flags, and wrapped AMDGPU flatbuffer inference is tested with intentionally stale metadata so the embedded ELF remains the source of load truth.\n\nRefresh the public executable and flatbuffer schema comments to describe the current target-ID contract. ExecutableDef.isa is producer metadata carrying a feature-bearing target ID when available; runtime compatibility recovers the authoritative code-object target from ELF flags instead of trusting that flatbuffer label.\n"
    },
    {
      "commit": "a809ed5a48384757bf3c40f57c6210a71a8db98a",
      "tree": "5e38dcf0a4f39eb05d96f07e9b94e7d8a4351bc0",
      "parents": [
        "cdf2f71625563b3d4a77f2efbe579065d5a18684"
      ],
      "author": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Tue Apr 28 11:45:38 2026 -0700"
      },
      "committer": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Wed Apr 29 13:46:12 2026 -0700"
      },
      "message": "[HAL/AMDGPU] Name ISA commonality agents\n\nInclude GPU agent and ISA ordinals in the parsed target-ID commonality diagnostic. The previous slice reported the processor/SRAMECC/XNACK mismatch reasons, but the topology error still lacked the exact agents being compared; that made the failure less actionable on mixed multi-GPU systems.\n\nThis keeps the KFD SRAMECC policy boundary clean: ROCr constructs and filters HSA agents before IREE sees them, and IREE now reports any exposed mixed target modes as a HAL-device topology incompatibility with the concrete agent ordinals and feature reasons.\n"
    },
    {
      "commit": "cdf2f71625563b3d4a77f2efbe579065d5a18684",
      "tree": "cba013433264a3388e95ca9996f5d4c78a4a2291",
      "parents": [
        "dfe46445885610588dd47e75fd799b44059db23b"
      ],
      "author": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Tue Apr 28 11:45:37 2026 -0700"
      },
      "committer": {
        "name": "Ben Vanik",
        "email": "ben.vanik@gmail.com",
        "time": "Wed Apr 29 13:46:12 2026 -0700"
      },
      "message": "[HAL/AMDGPU] Model target feature support\n\nExtend the generated AMDGPU target map so exact target rows also carry XNACK/SRAMECC feature-support bits. The target-ID parser now uses that shared table to normalize known unsupported features to UNSUPPORTED while preserving supported but unspecified modes as ANY, which keeps the ROCr wildcard-vs-explicit compatibility distinction intact without adding another hand-written target database.\n\nRecord the physical-device HSA ISA identity as one nested isa field instead of a loose processor buffer plus duplicated gfx IP version. Queue/profile policy now consumes isa.target_id.version, and system info caches HSA_AMD_SYSTEM_INFO_XNACK_ENABLED as the process-wide KFD-bound XNACK mode. That gives later topology and executable-load code a named cold-path place to ask about agent identity instead of rediscovering feature state.\n\nTighten multi-GPU ISA commonality diagnostics by comparing parsed target IDs and reporting processor, generic-version, SRAMECC, and XNACK mismatches by name. Mixed feature modes were previously only visible as raw string differences, which was technically useful but not the invariant we need for day-0 CDNA/RDNA support.\n"
    }
  ],
  "next": "dfe46445885610588dd47e75fd799b44059db23b"
}
