)]}'
{
  "commit": "cc50107f19976cf2244ee152074ffc0c46587491",
  "tree": "f00940ab07c9aeb27bfd84b03f74572adb9388cf",
  "parents": [
    "6e6d8541b3d7b78cca8146436c35527b1966098e"
  ],
  "author": {
    "name": "Benoit Jacob",
    "email": "jacob.benoit.1@gmail.com",
    "time": "Tue Jun 02 12:08:29 2026 -0400"
  },
  "committer": {
    "name": "GitHub",
    "email": "noreply@github.com",
    "time": "Tue Jun 02 12:08:29 2026 -0400"
  },
  "message": "[Codegen][CPU] Route inner_tiled broadcast into m_bcst-foldable slot. (#24516)\n\nThe CPU `inner_tiled` lowering replicates whichever of LHS/RHS has fewer\nlanes up to the other\u0027s lane count before calling the LLVM intrinsic.\nThe previous lowering emitted this as `vector.broadcast` to a\n`(replicate, K)` 2-D shape followed by `vector.shape_cast` to flat, with\na comment claiming the x86 backend\u0027s instruction selector would recover\nthe `{1toN}` broadcast-from-memory form on its own.\n\nEmpirically that did not work for bf16 matmul codegen on Zen 4: every\n`vdpbf16ps` instruction was preceded by a separate `vbroadcastss`,\ndoubling the per-row uop count of the hot inner loop. Two structural\nreasons:\n\n1. The IR shape mattered. LLVM\u0027s x86 ISel `m_bcst` patterns key on the\ncanonical `_mm512_set1_ps`-style splat: a scalar fed into `insertelement\n\u003cN x T\u003e poison, T, 0` followed by `shufflevector \u003cN x T\u003e, poison, \u003cN x\ni32\u003e zeroinitializer`, with `T` a float. Our `vector.broadcast` to a\n`(replicate, K)` 2-D shape + `vector.shape_cast` lowered to a different\nshufflevector pattern (or a direct `\u003cK x elem\u003e -\u003e \u003cN*K x elem\u003e`\ninterleaved shuffle) that did not pattern-match.\n\n2. The intrinsic operand position mattered. The ISA-level `m_bcst` EVEX\noperand is on the *third* source of `dpbf16ps`/`vpdpwssd`/ `pmaddwd`,\nand on the `b` operand (second multiplicand) of FMA\u0027s `a*b+c`. We passed\nthe broadcasted operand into the LHS slot, putting it where ISel cannot\nfold a memory broadcast.\n\nRewrite the replication to bitcast the source to a 1-lane vector of\nwidth `K * elem_bits` (with a float lane type when that width is 32 or\n64 bits, matching the `_mm512_set1_ps` shape), extract the scalar,\n`vector.broadcast` it to `replicate` lanes, then bitcast back. 
Track\nwhether the broadcast landed on lhs and, for the symmetric LLVM\nintrinsics, route the broadcasted operand into the m_bcst-foldable slot.\nFor K\u003d1 the bitcast pair is a no-op LLVM elides. vpdpbusd is asymmetric\n(UI8 must stay in the second slot); its existing sign-aware routing\nhappens to put the broadcast in the m_bcst slot precisely in the two\norientations where the ISA allows the fold, so no change needed there.\n\nMeasured on a 4096×4096 dynamic-shape bf16×bf16 -\u003e f32 matmul on Zen 4\n(avx512_bf16, no AMX), with `--iree-opt-data-tiling\n--iree-llvmcpu-enable-inner-tiled`:\n\n- All 29 `vdpbf16ps` in the inner loop now use the `{1to16}`\nmemory-broadcast form (vs 0 before); all 29 separate `vbroadcastss` are\ngone.\n- End-to-end matmul: 80.8 ms -\u003e 62.7 ms (1.29x faster, 12.4 it/s -\u003e\n16.0 it/s), closing ~60% of the gap to the precompiled mmt4d ukernel\n(50.5 ms).\n\nProgress towards #24515.\n\nSigned-off-by: Benoit Jacob \u003cjacob.benoit.1@gmail.com\u003e\nCo-authored-by: Claude Opus 4.7 \u003cnoreply@anthropic.com\u003e",
  "tree_diff": [
    {
      "type": "modify",
      "old_id": "081dce6fdb41bc919334e075c2179a0e4e3e5c33",
      "old_mode": 33188,
      "old_path": "compiler/src/iree/compiler/Codegen/Dialect/CPU/IR/IREECPUAttrs.cpp",
      "new_id": "bddf8f7af49c7493b6b5497d5a0282c633fcdcf7",
      "new_mode": 33188,
      "new_path": "compiler/src/iree/compiler/Codegen/Dialect/CPU/IR/IREECPUAttrs.cpp"
    },
    {
      "type": "modify",
      "old_id": "43cf44dfdcc333b59850977a1b805f4b1e686ed8",
      "old_mode": 33188,
      "old_path": "compiler/src/iree/compiler/Codegen/Dialect/CPU/IR/test/lower_inner_tiled.mlir",
      "new_id": "990017dbf1e476641faf7722c1a50e9bd7989453",
      "new_mode": 33188,
      "new_path": "compiler/src/iree/compiler/Codegen/Dialect/CPU/IR/test/lower_inner_tiled.mlir"
    }
  ]
}
