)]}'
{
  "commit": "dda3d329125adb3eec77ac6b27bc33bfaffd04cd",
  "tree": "e5ab847b1a4077e1439b67e06318040a678c6c7c",
  "parents": [
    "cc50107f19976cf2244ee152074ffc0c46587491"
  ],
  "author": {
    "name": "Benoit Jacob",
    "email": "jacob.benoit.1@gmail.com",
    "time": "Tue Jun 02 12:08:42 2026 -0400"
  },
  "committer": {
    "name": "GitHub",
    "email": "noreply@github.com",
    "time": "Tue Jun 02 12:08:42 2026 -0400"
  },
  "message": "[Codegen][CPU] Flatten contiguous trailing dims of transfers before unrolling. (#24517)\n\n`VectorTransferLoweringPass` applies the MLIR transfer-lowering patterns\nwith `maxTransferRank\u003d1` plus full-unroll, which fully unrolls any\nrank-N\u003e1 `vector.transfer_read`/`transfer_write` to multiple rank-1\ntransfers (one per index of the outer dim). For multi-dim tiles whose\ntrailing dims are contiguous in memory, this unrolls a single wide load\ninto many narrow ones, which then have to be reassembled into a wide\nvector via a chain of `shufflevector`s in the hot inner loop.\n\nExample surfacing the cost: a 4096x4096 dynamic-shape bf16xbf16-\u003ef32\nmatmul with `--iree-llvmcpu-enable-inner-tiled` on Zen 4 lowered to\ninner_tiled with N\u003d16, K_inner\u003d2. The RHS for one K-step is a\n`vector\u003c16x2xbf16\u003e` from a contiguous 64-byte slice. Unrolling to 16\nseparate `\u003c2 x bfloat\u003e` loads forced a sequence of `vpermt2d`/\n`vpermt2q` per K-iteration in the inner loop to rebuild the wide RHS\nregister — accounting for ~3 cycles of extra work per K-step on top of\nthe 29 dpbf16ps doing the real work.\n\nApply `populateFlattenVectorTransferPatterns` *before* the\nrank-reduction patterns. It rewrites a multi-dim transfer with\ncontiguous trailing dims into a transfer on a `memref.collapse_shape`\nview + a `vector.shape_cast`, so the read ends up as a single 1-D\ntransfer over the collapsed view and lowers to one wide `vector.load`.\nPer-fragment effect on the matmul benchmark above: 80.8 ms -\u003e 67.1 ms\n(1.20x). Combined with the m_bcst-fold broadcast routing in a sibling\ncommit, end-to-end gets to 53.4 ms (within 5% of the precompiled mmt4d\nukernel at 50.9 ms).\n\nTest fallout: two pipelines now lower a per-row pack-tile load into a\nsingle wide load over a collapsed-memref view rather than one load per\nrow (`aligned_unpack_generic` in pipeline_pack_unpack_tests) / write a\nconstant `vector\u003c4x2xi1\u003e` mask as a single flat `vector\u003c8xi1\u003e` store\n(`transpose_mask` in vector_lowering). The new IR is strictly fewer ops\nin both cases; updated the CHECK lines to match.\n\nProgress towards #24515.\n\n---------\n\nSigned-off-by: Benoit Jacob \u003cjacob.benoit.1@gmail.com\u003e\nCo-authored-by: Claude Opus 4.7 \u003cnoreply@anthropic.com\u003e",
  "tree_diff": [
    {
      "type": "modify",
      "old_id": "b9b5c7712119b58dc1c58c6a2c3390c7bee3f864",
      "old_mode": 33188,
      "old_path": "compiler/src/iree/compiler/Codegen/Common/BUILD.bazel",
      "new_id": "81167e58463bc456a8118da4c47e9498c810ddfd",
      "new_mode": 33188,
      "new_path": "compiler/src/iree/compiler/Codegen/Common/BUILD.bazel"
    },
    {
      "type": "modify",
      "old_id": "3bf27c372d8b1304a0fd7077efe1cdacdc53dd77",
      "old_mode": 33188,
      "old_path": "compiler/src/iree/compiler/Codegen/Common/CMakeLists.txt",
      "new_id": "a3f5d46293dd1597efd6a8375859f755a6d95eb9",
      "new_mode": 33188,
      "new_path": "compiler/src/iree/compiler/Codegen/Common/CMakeLists.txt"
    },
    {
      "type": "modify",
      "old_id": "e82bbe852e7f792b05784b70a827647eef3fc5d5",
      "old_mode": 33188,
      "old_path": "compiler/src/iree/compiler/Codegen/Common/VectorTransferLowering.cpp",
      "new_id": "13aade2df31c6abbeeedd13351b1df42aca8a65b",
      "new_mode": 33188,
      "new_path": "compiler/src/iree/compiler/Codegen/Common/VectorTransferLowering.cpp"
    },
    {
      "type": "modify",
      "old_id": "81c2b9dd17ae21b26790db8388ff3e3dc6eaa88b",
      "old_mode": 33188,
      "old_path": "compiler/src/iree/compiler/Codegen/LLVMCPU/test/pipeline_pack_unpack_tests.mlir",
      "new_id": "70f844e56cdbecd92d543cf66badd95443cfdc0f",
      "new_mode": 33188,
      "new_path": "compiler/src/iree/compiler/Codegen/LLVMCPU/test/pipeline_pack_unpack_tests.mlir"
    },
    {
      "type": "modify",
      "old_id": "ec74808292daaaba14cf3887df1303dcc021070d",
      "old_mode": 33188,
      "old_path": "compiler/src/iree/compiler/Codegen/LLVMCPU/test/vector_lowering.mlir",
      "new_id": "26ef0ab0317edc96e68b59bac00b585b1cef7c06",
      "new_mode": 33188,
      "new_path": "compiler/src/iree/compiler/Codegen/LLVMCPU/test/vector_lowering.mlir"
    }
  ]
}
