[GPU][DT] Fix LHS operand offset calculation for DataTiledMMAAttr (#21808)

`DistributeInnerTiledToLanesPass` creates `AffineDelinearizeIndexOp` to
delinearize the thread index within a workgroup and generate offsets for
slicing the operands of InnerTiled operations.
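
For intuition, `affine.delinearize_index` maps a flat thread id to a
multi-dimensional index by successive division and modulo over a basis of
sizes. Below is a minimal standalone sketch of that arithmetic; the
`delinearize` helper is illustrative only, not the pass's actual code:

```cpp
#include <cstdint>
#include <vector>

// Mimics affine.delinearize_index with no outer bound: for a basis of N
// sizes it produces N + 1 results, where the leading result is the
// unbounded quotient. E.g. delinearize(77, {2, 4, 4, 4}) == {0, 1, 0, 3, 1}.
std::vector<int64_t> delinearize(int64_t linearIdx,
                                 const std::vector<int64_t> &basis) {
  std::vector<int64_t> indices(basis.size() + 1);
  for (size_t i = basis.size(); i > 0; --i) {
    indices[i] = linearIdx % basis[i - 1];
    linearIdx /= basis[i - 1];
  }
  indices[0] = linearIdx; // Unbounded leading quotient.
  return indices;
}
```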

Currently, the delinearization for `DataTiledMMAAttr` relies solely on
the operand swizzle. However, for the LHS, this swizzle does not include
the subGroupsN dimension, which results in incorrect offsets whenever
both subGroupsM and subGroupsN are greater than 1.
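
Concretely, take the configuration from the updated test below:
subGroupsM = subGroupsN = 2, a 256-thread workgroup (four subgroups of 64 in
a 2×2 grid), and an LHS cross-thread basis of (2, 4, 4, 4), which spans only
128 threads. Continuing the illustrative sketch above:

```cpp
#include <cassert>

int main() {
  // Thread 64 sits in subgroup (M, N) = (0, 1) of the 2x2 subgroup grid.
  // N-subgroups share the same LHS data, so it must get the same LHS
  // offsets as thread 0. But over the 128-thread basis (2, 4, 4, 4), the
  // delinearization assigns it a wrong subGroupsM offset:
  auto ids = delinearize(64, {2, 4, 4, 4});
  assert(ids[1] == 1); // Used as the subGroupsM offset: should be 0.
  return 0;
}
```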

This PR fixes the issue by inserting a dummy subGroupsN dimension when
delinearizing for the LHS, ensuring that offsets are generated correctly
over the full subGroupsM × subGroupsN thread space.
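
With the dummy dimension inserted at index 1 of the basis, the
delinearization covers all 256 threads, and the dummy index is erased before
the offsets are used. Again continuing the sketch (illustrative, not the
pass's code):

```cpp
#include <cassert>

int main() {
  // Fixed basis: dummy subGroupsN inserted at index 1 -> (2, 2, 4, 4, 4),
  // spanning the full 2 x 2 x 64 = 256-thread space.
  auto ids = delinearize(64, {2, 2, 4, 4, 4});
  assert(ids[1] == 0); // Correct subGroupsM offset for thread 64.
  assert(ids[2] == 1); // Dummy subGroupsN index; erased before slicing,
                       // so thread 64 now matches thread 0 as intended.
  return 0;
}
```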

Note that no special handling is required for the RHS operand, since
subGroupsM is treated as an implicit leading dimension and therefore
omitted anyway.
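
The same arithmetic shows why: the RHS basis from the test, (2, 4, 16),
leads with subGroupsN, so subGroupsM lands entirely in the discarded
unbounded quotient and all M-subgroups receive identical RHS offsets. One
last continuation of the sketch:

```cpp
#include <cassert>

int main() {
  // Threads 0 and 128 sit in subgroups (M, N) = (0, 0) and (1, 0); they
  // should (and do) get identical RHS offsets, since M-subgroups share
  // the same RHS data.
  auto t0 = delinearize(0, {2, 4, 16});
  auto t128 = delinearize(128, {2, 4, 16});
  assert(t128[0] == 1); // subGroupsM is absorbed by the discarded quotient.
  for (int i = 1; i <= 3; ++i)
    assert(t0[i] == t128[i]);
  return 0;
}
```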

Closes #21789

Signed-off-by: Yu-Zhewen <zhewenyu@amd.com>
diff --git a/compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/IREEGPUAttrs.cpp b/compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/IREEGPUAttrs.cpp
index 9b294ea..82e2c25 100644
--- a/compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/IREEGPUAttrs.cpp
+++ b/compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/IREEGPUAttrs.cpp
@@ -884,14 +884,25 @@
   // Normally, that distinction is irrelevant here: we just delinearize the
   // thread-id over all cross-thread dimensions.
   //
-  // There is one case that makes things more complicated, encountered so far
-  // only on RDNA3. That is when some intrinsic has multiple (so far, 2) threads
-  // reading the same data. This redundancy is not encoded in the TileSwizzle
-  // structures that we are using here. Instead, in that case, the thread grid
-  // (as encoded in the TileSwizzle) is smaller than the subgroup size. In that
-  // case, there is an implied thread-distribution-only dimension along which
-  // multiple threads read exactly the same data.
-  // So we need to distinguish layoutThreadSizes vs. distributionThreadSizes.
+  // There are, however, two special cases that require inserting dummy
+  // dimensions into the sizes we delinearize over; the corresponding
+  // delinearized indices are later removed when constructing tile offsets:
+  //
+  // 1. Encountered so far only on RDNA3: some intrinsics use multiple
+  //    threads (currently 2) to read the same data. This redundancy is not
+  //    represented in the TileSwizzle structures we rely on. In these cases,
+  //    the thread grid encoded by TileSwizzle is *smaller* than the subgroup
+  //    size, and an implicit "distribution-only" dimension exists along which
+  //    multiple threads map to identical data. To handle this, we distinguish
+  //    layoutThreadSizes from distributionThreadSizes.
+  //
+  // 2. LHS delinearization when both subGroupsM and subGroupsN > 1.
+  //    Although subGroupsN is not part of the LHS swizzle, we must still
+  //    delinearize over the combined subGroupsM × subGroupsN space. By
+  //    contrast, RHS does *not* need special handling, since subGroupsM can be
+  //    treated as an implicit leading dimension and omitted anyway.
+
+  // Handle the RDNA3 special case.
   SmallVector<int64_t> layoutThreadSizes =
       sliceSwizzledShape(swizzle, [](TileSwizzle::Dim d) {
         return d.kind == TileSwizzle::Dim::Kind::CrossThread;
@@ -920,6 +931,17 @@
         getSubgroupSize() / intrinsicLayoutThreadBound);
   }
 
+  // Handle the subgroupsM/N special case.
+  int64_t subgroupsM = getSubgroupsM();
+  int64_t subgroupsN = getSubgroupsN();
+  bool needsLhsSubgroupNDim = (fragment == IREE::GPU::MMAFragment::Lhs) &&
+                              subgroupsM > 1 && subgroupsN > 1;
+  const int lhsSubgroupNDimIdx = 1;
+  if (needsLhsSubgroupNDim) {
+    distributionThreadSizes.insert(
+        distributionThreadSizes.begin() + lhsSubgroupNDimIdx, subgroupsN);
+  }
+
   // Obtain the offsets from delinearization along the distributionThreadSizes.
   // Use a delinearize without outer bound and throw away its initial result
   // to get clamping behavior.
@@ -931,13 +953,16 @@
           ->getResults()
           .drop_front();
 
+  // Erase the delinearized indices that correspond to the dummy
+  // dimensions we inserted above. This is what causes multiple threads
+  // (which differ only in the indices being discarded here) to read
+  // exactly the same data.
   if (hasDistributionOnlyDim) {
-    // Erase the delinearized index that corresponds to the extra distribution
-    // dimension that we had inserted above. This is what causes multiple
-    // threads (which only differed in the index being discarded here) to read
-    // exactly the same data.
     tileOffsets.erase(tileOffsets.begin() + distributionOnlyDimIdx);
   }
+  if (needsLhsSubgroupNDim) {
+    tileOffsets.erase(tileOffsets.begin() + lhsSubgroupNDimIdx);
+  }
 
   // Strides are trivial: each slice is contiguous along the *expanded* dims
   // even if it may not be contiguous in the flattened layout.
diff --git a/compiler/src/iree/compiler/Codegen/Dialect/GPU/Transforms/test/distribute_inner_tiled_to_lanes.mlir b/compiler/src/iree/compiler/Codegen/Dialect/GPU/Transforms/test/distribute_inner_tiled_to_lanes.mlir
index d954ec3..f068a4e 100644
--- a/compiler/src/iree/compiler/Codegen/Dialect/GPU/Transforms/test/distribute_inner_tiled_to_lanes.mlir
+++ b/compiler/src/iree/compiler/Codegen/Dialect/GPU/Transforms/test/distribute_inner_tiled_to_lanes.mlir
@@ -551,9 +551,9 @@
 //  CHECK-SAME:   %[[RHS:[A-Za-z0-9]+]]
 //  CHECK-SAME:   %[[ACC:[A-Za-z0-9]+]]
 //       CHECK:   scf.forall (%[[THREAD_ID:.+]]) in (256) shared_outs(%[[ACC_ARG:.+]] = %[[ACC]]) -> (tensor<1x1x2x2x4x16x4xf32>)
-//   CHECK-DAG:     %[[LHS_IN_IDS:.+]]:5 = affine.delinearize_index %[[THREAD_ID]] into (2, 4, 4, 4)
+//   CHECK-DAG:     %[[LHS_IN_IDS:.+]]:6 = affine.delinearize_index %[[THREAD_ID]] into (2, 2, 4, 4, 4)
 //   CHECK-DAG:     %[[LHS_SLICE:.+]] = tensor.extract_slice %[[LHS]]
-//  CHECK-SAME:       [0, 0, %[[LHS_IN_IDS]]#1, %[[LHS_IN_IDS]]#2, %[[LHS_IN_IDS]]#3, %[[LHS_IN_IDS]]#4, 0] [1, 1, 1, 1, 1, 1, 4] [1, 1, 1, 1, 1, 1, 1]
+//  CHECK-SAME:       [0, 0, %[[LHS_IN_IDS]]#1, %[[LHS_IN_IDS]]#3, %[[LHS_IN_IDS]]#4, %[[LHS_IN_IDS]]#5, 0] [1, 1, 1, 1, 1, 1, 4] [1, 1, 1, 1, 1, 1, 1]
 //   CHECK-DAG:     %[[IN_IDS:.+]]:4 = affine.delinearize_index %[[THREAD_ID]] into (2, 4, 16)
 //   CHECK-DAG:     %[[RHS_SLICE:.+]] = tensor.extract_slice %[[RHS]]
 //  CHECK-SAME:       [0, 0, %[[IN_IDS]]#1, %[[IN_IDS]]#2, %[[IN_IDS]]#3, 0] [1, 1, 1, 1, 1, 4] [1, 1, 1, 1, 1, 1]