[Codegen][CPU] Flatten contiguous trailing dims of transfers before unrolling. (#24517)

`VectorTransferLoweringPass` applies the MLIR transfer-lowering patterns
with `maxTransferRank=1` plus full-unroll, which fully unrolls any
`vector.transfer_read`/`transfer_write` of rank greater than 1 into
rank-1 transfers (one per index of the outer dims). For multi-dim tiles
whose trailing dims are contiguous in memory, this splits what could be
a single wide load into many narrow ones, which then have to be
reassembled into a wide vector via a chain of `shufflevector`s in the
hot inner loop.

Example surfacing the cost: a 4096x4096 dynamic-shape bf16xbf16->f32
matmul with `--iree-llvmcpu-enable-inner-tiled` on Zen 4 lowered to
inner_tiled with N=16, K_inner=2. The RHS for one K-step is a
`vector<16x2xbf16>` from a contiguous 64-byte slice. Unrolling to 16
separate `<2 x bfloat>` loads forced a sequence of `vpermt2d`/`vpermt2q`
per K-iteration in the inner loop to rebuild the wide RHS register,
costing ~3 cycles of extra work per K-step on top of the 29
`vdpbf16ps` doing the real work.
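
The arithmetic in the example above can be checked with a small sketch.
The helper names below are hypothetical (they are not MLIR or IREE APIs);
they only model how many rank-1 transfers full unrolling produces and how
many bytes the tile covers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not an MLIR or IREE API): number of rank-1 transfers
// left once every outer dim of a vector shape has been fully unrolled.
int64_t unrolledTransferCount(const std::vector<int64_t> &shape) {
  assert(!shape.empty() && "expected a ranked vector shape");
  int64_t count = 1;
  // Multiply all dims except the innermost one.
  for (std::size_t i = 0; i + 1 < shape.size(); ++i)
    count *= shape[i];
  return count;
}

// Hypothetical helper: total size in bytes of a vector with the given shape
// and element width in bits (assumes the total bit count is a multiple of 8).
int64_t tileBytes(const std::vector<int64_t> &shape, int64_t elemBits) {
  int64_t elems = 1;
  for (int64_t dim : shape)
    elems *= dim;
  return elems * elemBits / 8;
}
```

For the `vector<16x2xbf16>` RHS tile, `unrolledTransferCount({16, 2})`
gives the 16 narrow loads and `tileBytes({16, 2}, 16)` gives the 64-byte
slice that one wide load could cover instead.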

Apply `populateFlattenVectorTransferPatterns` *before* the
rank-reduction patterns. It rewrites a multi-dim transfer with
contiguous trailing dims into a transfer on a `memref.collapse_shape`
view + a `vector.shape_cast`, so the read ends up as a single 1-D
transfer over the collapsed view and lowers to one wide `vector.load`.
Flattening is gated on the trailing dim being narrower than the target
pointer size (taken from the target's `data_layout`, 64 bits by
default), so rows that already fill a register are left to plain rank
reduction. Per-fragment effect on the matmul benchmark above: 80.8 ms
-> 67.1 ms (1.20x). Combined with the m_bcst-fold broadcast routing in
a sibling commit, end-to-end gets to 53.4 ms (within 5% of the
precompiled mmt4d ukernel at 50.9 ms).
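
The pass in this change only flattens when the trailing dim is narrower
than the pointer size. A minimal model of that decision, assuming a
strict less-than comparison and ignoring the contiguity check that the
real patterns also perform, might look like the following; the function
name is hypothetical and is not part of MLIR.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model of the gating, not the MLIR pattern itself.
// `shape` is the vector shape, `elemBits` the element width in bits, and
// `thresholdBits` the flattening threshold (the pointer size in this change,
// 64 by default).
bool shouldFlattenTransfer(const std::vector<int64_t> &shape, int64_t elemBits,
                           int64_t thresholdBits) {
  assert(elemBits > 0 && "element width must be positive");
  // Rank-0 and rank-1 transfers have nothing to flatten.
  if (shape.size() < 2)
    return false;
  // Assumption: "narrower than" means a strict less-than comparison.
  return shape.back() * elemBits < thresholdBits;
}
```

Under this model `vector<16x2xbf16>` (32-bit trailing dim) and
`vector<4x2xi1>` (2-bit trailing dim) are flattened against a 64-bit
threshold, while `vector<16x16xf32>` (512-bit trailing dim) is not.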

Test fallout: `transpose_mask` in vector_lowering now writes a constant
`vector<4x2xi1>` mask as a single flat `vector<8xi1>` store over a
collapsed-memref view; the new IR there is strictly fewer ops, and the
CHECK line is updated to match. `aligned_unpack_generic` in
pipeline_pack_unpack_tests keeps one `vector<16xf32>` load per row,
because its `vector<16x16xf32>` tile has a 512-bit trailing dim that the
flattening leaves alone; a comment in the test records why.

Progress towards #24515.

---------

Signed-off-by: Benoit Jacob <jacob.benoit.1@gmail.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
diff --git a/compiler/src/iree/compiler/Codegen/Common/BUILD.bazel b/compiler/src/iree/compiler/Codegen/Common/BUILD.bazel
index b9b5c77..81167e5 100644
--- a/compiler/src/iree/compiler/Codegen/Common/BUILD.bazel
+++ b/compiler/src/iree/compiler/Codegen/Common/BUILD.bazel
@@ -235,6 +235,7 @@
         "//compiler/src/iree/compiler/Dialect/Util/Transforms",
         "//compiler/src/iree/compiler/Utils",
         "//llvm-external-projects/iree-dialects:IREELinalgTransformDialect",
+        "@llvm-project//llvm:Core",
         "@llvm-project//llvm:Support",
         "@llvm-project//mlir:AMDGPUDialect",
         "@llvm-project//mlir:AMDGPUTransforms",
diff --git a/compiler/src/iree/compiler/Codegen/Common/CMakeLists.txt b/compiler/src/iree/compiler/Codegen/Common/CMakeLists.txt
index 3bf27c3..a3f5d46 100644
--- a/compiler/src/iree/compiler/Codegen/Common/CMakeLists.txt
+++ b/compiler/src/iree/compiler/Codegen/Common/CMakeLists.txt
@@ -181,6 +181,7 @@
     ::PassHeaders
     ::PassesIncGen
     IREELinalgTransformDialect
+    LLVMCore
     LLVMSupport
     MLIRAMDGPUDialect
     MLIRAMDGPUTransforms
diff --git a/compiler/src/iree/compiler/Codegen/Common/VectorTransferLowering.cpp b/compiler/src/iree/compiler/Codegen/Common/VectorTransferLowering.cpp
index e82bbe8..13aade2 100644
--- a/compiler/src/iree/compiler/Codegen/Common/VectorTransferLowering.cpp
+++ b/compiler/src/iree/compiler/Codegen/Common/VectorTransferLowering.cpp
@@ -5,6 +5,8 @@
 // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 
 #include "iree/compiler/Codegen/Common/Passes.h"
+#include "iree/compiler/Dialect/HAL/IR/HALTypes.h"
+#include "llvm/IR/DataLayout.h"
 #include "mlir/Conversion/VectorToSCF/VectorToSCF.h"
 #include "mlir/Dialect/Affine/IR/AffineOps.h"
 #include "mlir/Dialect/SCF/IR/SCF.h"
@@ -38,6 +40,32 @@
   MLIRContext *ctx = &getContext();
   mlir::FunctionOpInterface funcOp = getOperation();
 
+  // Flatten contiguous trailing dims of multi-dim transfers when the trailing
+  // dim is narrower than the target's natural word (the pointer size), so a
+  // packed `<16x2xbf16>` (32-bit innermost) lowers to one wide load instead
+  // of 16 narrow loads the rank reduction below would reassemble with a
+  // chain of shuffles. Sub-word loads in bulk are uniformly pathological;
+  // word-and-up loads (`<2xf32>` ... `<16xf32>`) are already fine and
+  // flattening *them* fuses register-sized rows into an oversized 1-D
+  // transfer + a `vector.shape_cast` re-split (extracts), regressing whole-
+  // model .vmfb size for no benefit. This is *not* `native_vector_size`:
+  // that is the *widest* useful vector, not the smallest non-pathological
+  // load.
+  unsigned pointerBits = 64;
+  if (auto targetAttr = IREE::HAL::ExecutableTargetAttr::lookup(funcOp)) {
+    if (auto attr =
+            targetAttr.getConfiguration().getAs<StringAttr>("data_layout")) {
+      if (!attr.getValue().empty()) {
+        pointerBits = llvm::DataLayout(attr.getValue()).getPointerSizeInBits();
+      }
+    }
+  }
+  {
+    RewritePatternSet patterns(ctx);
+    vector::populateFlattenVectorTransferPatterns(patterns, pointerBits);
+    (void)applyPatternsGreedily(funcOp, std::move(patterns));
+  }
+
   RewritePatternSet patterns(ctx);
   // Explicitly materialize the mask on transfer_read/transfer_write.
   // Assume we don't have 4 GB vectors.
diff --git a/compiler/src/iree/compiler/Codegen/LLVMCPU/test/pipeline_pack_unpack_tests.mlir b/compiler/src/iree/compiler/Codegen/LLVMCPU/test/pipeline_pack_unpack_tests.mlir
index 81c2b9d..70f844e 100644
--- a/compiler/src/iree/compiler/Codegen/LLVMCPU/test/pipeline_pack_unpack_tests.mlir
+++ b/compiler/src/iree/compiler/Codegen/LLVMCPU/test/pipeline_pack_unpack_tests.mlir
@@ -80,6 +80,9 @@
 // CHECK-LABEL:     func.func @aligned_unpack_generic
 // CHECK:             %[[SRC:.+]] = hal.interface.binding.subspan {{.*}} : memref<24x32x16x16xf32, #hal.descriptor_type<storage_buffer>>
 // CHECK:             %[[ASSUMED_SRC:.+]] = memref.assume_alignment %[[SRC]], 64
+// The unpack source tile is `vector<16x16xf32>`: its trailing dim is a full
+// 512-bit `vector<16xf32>`, so transfer flattening leaves it alone and plain
+// rank reduction lowers it to one `vector<16xf32>` load per row.
 // CHECK-COUNT-15:        vector.load %[[ASSUMED_SRC]]
 // CHECK:                 %[[LAST_LOAD:.+]] = vector.load %[[ASSUMED_SRC]]
 // CHECK:                 %[[IN_0:.+]] = vector.broadcast %{{.+}} : vector<16xf32> to vector<16x16xf32>
diff --git a/compiler/src/iree/compiler/Codegen/LLVMCPU/test/vector_lowering.mlir b/compiler/src/iree/compiler/Codegen/LLVMCPU/test/vector_lowering.mlir
index ec74808..26ef0ab 100644
--- a/compiler/src/iree/compiler/Codegen/LLVMCPU/test/vector_lowering.mlir
+++ b/compiler/src/iree/compiler/Codegen/LLVMCPU/test/vector_lowering.mlir
@@ -155,7 +155,9 @@
 //   CHECK-NOT:   vector.shuffle
 //   CHECK-DAG:   %[[MASK:.+]] = arith.constant dense<true>
 //   CHECK-DAG:   %[[OUTPUT:.+]] = hal.interface.binding.subspan
-//       CHECK:   vector.store %[[MASK]], %[[OUTPUT]]
+// VectorTransferLoweringPass flattens the contiguous 4x2 trailing dims of
+// the store into a single `vector<8xi1>` store over the collapsed memref.
+//       CHECK:   vector.store %[[MASK]], %{{.+}}
 
 // -----