)]}'
{
  "commit": "bb542eee65fa0a498963df1f2ee2f205a3dd8bd0",
  "tree": "47f50e3aa6b9e57d5bff0a363fade09019f89bdf",
  "parents": [
    "20c83470b4186944e14c67488d295f3f0f33ecce"
  ],
  "author": {
    "name": "Stanley Winata",
    "email": "68087699+raikonenfnu@users.noreply.github.com",
    "time": "Thu Oct 31 16:56:00 2024 -0700"
  },
  "committer": {
    "name": "GitHub",
    "email": "noreply@github.com",
    "time": "Thu Oct 31 16:56:00 2024 -0700"
  },
  "message": "[LLVMGPU] Add Virtual MFMA layout that maximizes load through adjusted K-width (#18930)\n\nThe main use case for the virtual intrinsics is to change the layout of\r\nintrinsics in the K dimension, such that we can coalesce reads from shared\r\nmemory to registers.\r\n\r\nCurrently, the \"native\" intrinsics need to enforce the \"native\" layout\r\n(i.e. read 4 elements per thread for MFMA_F32_16x16x16). However, since we\r\nknow that K-dim is a reduction dimension, which is associative, we can\r\nread the data in a way that is not \"native\"/\"correct\" but is \"faster\"\r\n(more elements per read). As long as we match the K-dim on both lhs and\r\nrhs, we will still get correct results (i.e. read 8 contiguous elements\r\nper thread from shared memory along dimension K and then slice them for\r\ntwo MFMA_F32_16x16x16 intrinsics).\r\n\r\nAn IR example for this: if we want to do a 16x16x32 (MxNxK) matmul with\r\nMFMA_F32_16x16x16_F16 intrinsics, on lane 0 we used to have something\r\nlike:\r\n\r\n```\r\nlhs_0 \u003d read(lhs_shared_mem[0:4])\r\nrhs_0 \u003d read(rhs_shared_mem[0:4])\r\nmma_0 \u003d vector.contract(lhs_0, rhs_0)\r\n\r\n(16 offset since MFMA_F32_16x16x16_F16 has intrinsic K size of 16)\r\nlhs_1 \u003d read(lhs_shared_mem[16 + 0 : 16 + 4])\r\nrhs_1 \u003d read(rhs_shared_mem[16 + 0 : 16 + 4])\r\nmma_1 \u003d vector.contract(lhs_1, rhs_1, mma_0)\r\n```\r\n\r\nWith this optimization, it turns into something like:\r\n\r\n```\r\nlhs_reg \u003d read(lhs_shared_mem[0:8])\r\nrhs_reg \u003d read(rhs_shared_mem[0:8])\r\n\r\nlhs_0 \u003d slice(lhs_reg, [0 : 4])\r\nrhs_0 \u003d slice(rhs_reg, [0 : 4])\r\nmma_0 \u003d vector.contract(lhs_0, rhs_0)\r\n\r\nlhs_1 \u003d slice(lhs_reg, [4 : 8])\r\nrhs_1 \u003d slice(rhs_reg, [4 : 8])\r\nmma_1 \u003d vector.contract(lhs_1, rhs_1, mma_0)\r\n```\r\n\r\nCurrently, we are plumbing it in as MMA intrinsic enums for two variants\r\nof unrolled k \u003d\u003d 2 on the F16s (per discussion with @qedawkins and\r\n@Groverkss), as they are the easiest and least tangled way to\r\nintegrate/plumb through. In the future, though, we can expose this\r\nattribute as k-width to maximize generality.\r\n\r\n---------\r\n\r\nSigned-off-by: Stanley Winata \u003cstanley.winata@amd.com\u003e",
  "tree_diff": [
    {
      "type": "modify",
      "old_id": "489b8a0cb6709902c1ea1f9da098e92d6b6394f0",
      "old_mode": 33188,
      "old_path": "compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_nested_layout_contract_amdgpu.mlir",
      "new_id": "db39c0b15742d83b0949349ffd081b2c411e39cc",
      "new_mode": 33188,
      "new_path": "compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_nested_layout_contract_amdgpu.mlir"
    },
    {
      "type": "modify",
      "old_id": "93a2ca762b5181acef07325ff11d4cd361c9859b",
      "old_mode": 33188,
      "old_path": "compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/IREEGPUAttrs.cpp",
      "new_id": "e53f915434aa66a6741e94a7380d69f615bfd444",
      "new_mode": 33188,
      "new_path": "compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/IREEGPUAttrs.cpp"
    },
    {
      "type": "modify",
      "old_id": "d04e9fefe5b9eda6c2df73983be82d78ad00c733",
      "old_mode": 33188,
      "old_path": "compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/IREEGPUAttrs.td",
      "new_id": "bbb79628e1d3478a4fdd27b12e7e11f35ef3fe11",
      "new_mode": 33188,
      "new_path": "compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/IREEGPUAttrs.td"
    },
    {
      "type": "modify",
      "old_id": "9d4ac2e9a4e173bb8aeb1abf79c10f627b9d88f1",
      "old_mode": 33188,
      "old_path": "compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/IREEGPUEnums.td",
      "new_id": "1afdf0d235be670ab5ab4c82a6d2dfef397bad10",
      "new_mode": 33188,
      "new_path": "compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/IREEGPUEnums.td"
    },
    {
      "type": "modify",
      "old_id": "ede2d0bcf7b81a650a018997763de70ac98020e3",
      "old_mode": 33188,
      "old_path": "compiler/src/iree/compiler/Codegen/LLVMGPU/KernelConfig.cpp",
      "new_id": "b4567e32938d0cbee93d7e0f57d475dfc77a5491",
      "new_mode": 33188,
      "new_path": "compiler/src/iree/compiler/Codegen/LLVMGPU/KernelConfig.cpp"
    },
    {
      "type": "modify",
      "old_id": "7e1ab62101b385868bb7b44cab5579e1468abe41",
      "old_mode": 33188,
      "old_path": "compiler/src/iree/compiler/Codegen/LLVMGPU/test/ROCDL/pipeline_vector_distribute_gfx940.mlir",
      "new_id": "cedec2d21f2f256215e89f8ab43dbfcf71defa17",
      "new_mode": 33188,
      "new_path": "compiler/src/iree/compiler/Codegen/LLVMGPU/test/ROCDL/pipeline_vector_distribute_gfx940.mlir"
    },
    {
      "type": "modify",
      "old_id": "cd6f8ebea6d3613cad1032f3d73fbf850791bed8",
      "old_mode": 33188,
      "old_path": "tests/e2e/matmul/generate_e2e_matmul_tests.py",
      "new_id": "dd387f31141f93115d82153546ffebafb9e55fb6",
      "new_mode": 33188,
      "new_path": "tests/e2e/matmul/generate_e2e_matmul_tests.py"
    }
  ]
}
