)]}'
{
  "commit": "bb1c561cdb25e411071a84cd8173ba107c61c9d3",
  "tree": "7b8472561ea7caf29bfa7d46b937d632dcb9f779",
  "parents": [
    "a7bac5d9f5fab902d906f941a0f1002a809d3f35"
  ],
  "author": {
    "name": "Benoit Jacob",
    "email": "jacob.benoit.1@gmail.com",
    "time": "Thu Jan 09 12:32:45 2025 -0500"
  },
  "committer": {
    "name": "GitHub",
    "email": "noreply@github.com",
    "time": "Thu Jan 09 12:32:45 2025 -0500"
  },
  "message": "Erase all address spaces and get inlined ukernels (#19646)\n\nThe `LLVMGPUCastAddressSpaceFunction` pass was selectively erasing the\nshared memory address space from pointers around Call ops to achieve\ninlining. This PR generalizes that to erasing all address spaces after\nchecking with its original author that there wasn\u0027t anything intentional\nhere:\n[discord](https://discord.com/channels/689900678990135345/1282818085153407038/1326577591557296272)\n\nThis has the intended effect of allowing AMDGPU ukernels to get inlined\ninto their callers.\n\nThere is a side benefit of not having to duplicate ukernels for the\nvarious combinations of address spaces of their pointer parameters. This\nbenefit will be partly rolled back if and when we do assembly ukernels,\nas these will need to know the address spaces to write different\ninstructions, but at least for C ukernels it is nice.\n\nIt was counter-intuitive to me that erasing address spaces was possible\nat all. The key is that these ukernels only get compiled to LLVM IR, not\nto ISA, and the resulting IR gets inlined into a caller where the\naddrspacecast was done and where the actual address space is known.\nAfter inlining, the compiler is still able to propagate the actual\naddress spaces all the way into the inlined ukernel code.\n\nFor the current `multi_mma` ukernel there was no immediate problem. The\nchanges to it in this PR are reaping the benefits of inlining: now the\n`unroll_*` parameters become compile-time constants after inlining so we\nget to simply declare our accumulator tile as a VLA and let it get\nspecialized to a normal fixed-size array. No need anymore to use an\narbitrary fixed size array and try to guard that with assertions.\n\nFor the existing `argmax` ukernels, the inlining revealed a preexisting\nissue: these ukernels are reductions to a single scalar and instead of\nreturning it by value, write their result value to an output buffer\n(which happens to be LDS memory, but the address space doesn\u0027t matter).\nThe problem was that there was no synchronization between the thread\nwriting the value in the ukernel, and the threads reading the value in\nthe caller. Solved by adding a `__threadfence_block()`, which compiles\nto almost nothing in ISA (s_waitcnt, which we have anyway around memory\naccesses) but prevents IR rewrites removing the loads from the output\nbuffer.\n\nI added `__threadfence_block()` to common.h, copied from AMD device\nlibrary headers, along with a few other synchronization functions which\nwe anticipate will be useful in other ukernels. `__syncthreads` is not\nused in this PR.\n\nSigned-off-by: Benoit Jacob \u003cjacob.benoit.1@gmail.com\u003e",
  "tree_diff": [
    {
      "type": "modify",
      "old_id": "d046986cc9b54c0a3abfdf2bb55440b9361b9d41",
      "old_mode": 33188,
      "old_path": "compiler/plugins/target/ROCM/builtins/ukernel/common.h",
      "new_id": "3113643ca1d18c743e0959c0bd2612eff6bc1b30",
      "new_mode": 33188,
      "new_path": "compiler/plugins/target/ROCM/builtins/ukernel/common.h"
    },
    {
      "type": "modify",
      "old_id": "4a6beefa919859152b6ea6a10fc0d65cd546821f",
      "old_mode": 33188,
      "old_path": "compiler/plugins/target/ROCM/builtins/ukernel/iree_uk_amdgpu_argmax_f16i32.c",
      "new_id": "0edf2744c654af37ed591717c142b620d9fc18cb",
      "new_mode": 33188,
      "new_path": "compiler/plugins/target/ROCM/builtins/ukernel/iree_uk_amdgpu_argmax_f16i32.c"
    },
    {
      "type": "modify",
      "old_id": "33c1522d143dabe178a787831d79dab385506edb",
      "old_mode": 33188,
      "old_path": "compiler/plugins/target/ROCM/builtins/ukernel/iree_uk_amdgpu_argmax_f16i64.c",
      "new_id": "552ab87254d3c04ddaa93d1b31154108104b6468",
      "new_mode": 33188,
      "new_path": "compiler/plugins/target/ROCM/builtins/ukernel/iree_uk_amdgpu_argmax_f16i64.c"
    },
    {
      "type": "modify",
      "old_id": "f39d6237279976f7a296203bdc4479cf7986c3d4",
      "old_mode": 33188,
      "old_path": "compiler/plugins/target/ROCM/builtins/ukernel/iree_uk_amdgpu_argmax_f32i32.c",
      "new_id": "ec0c4c363df9b94ad10e7fbe01c089521c076507",
      "new_mode": 33188,
      "new_path": "compiler/plugins/target/ROCM/builtins/ukernel/iree_uk_amdgpu_argmax_f32i32.c"
    },
    {
      "type": "modify",
      "old_id": "d6a9afbcf2d6a937903fd2d64b8a52a911f5ee13",
      "old_mode": 33188,
      "old_path": "compiler/plugins/target/ROCM/builtins/ukernel/iree_uk_amdgpu_argmax_f32i64.c",
      "new_id": "40e7cae7d809643cf819fa9b48e872cf66acf45d",
      "new_mode": 33188,
      "new_path": "compiler/plugins/target/ROCM/builtins/ukernel/iree_uk_amdgpu_argmax_f32i64.c"
    },
    {
      "type": "modify",
      "old_id": "9029a86ddb592deeb6f23dbc1c2456c3dd0bf4e1",
      "old_mode": 33188,
      "old_path": "compiler/plugins/target/ROCM/builtins/ukernel/iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8.c",
      "new_id": "064bedbbfa9236ffce0213fba27f3d5af402cc8e",
      "new_mode": 33188,
      "new_path": "compiler/plugins/target/ROCM/builtins/ukernel/iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8.c"
    },
    {
      "type": "modify",
      "old_id": "aad0618e54b44f0afa96717789abf710ce656be9",
      "old_mode": 33188,
      "old_path": "compiler/src/iree/compiler/Codegen/LLVMGPU/LLVMGPUCastAddressSpaceFunction.cpp",
      "new_id": "1df82fdbc66f6310994b6729f913e798fff00de0",
      "new_mode": 33188,
      "new_path": "compiler/src/iree/compiler/Codegen/LLVMGPU/LLVMGPUCastAddressSpaceFunction.cpp"
    }
  ]
}
