tree 47f50e3aa6b9e57d5bff0a363fade09019f89bdf
parent 20c83470b4186944e14c67488d295f3f0f33ecce
author Stanley Winata <68087699+raikonenfnu@users.noreply.github.com> 1730418960 -0700
committer GitHub <noreply@github.com> 1730418960 -0700
gpgsig -----BEGIN PGP SIGNATURE-----
 
 wsFcBAABCAAQBQJnJBkQCRC1aQ7uu5UhlAAAdkAQABmzBwRMxACQ/DrFt4Tb4I7j
 UQjiCeKBUIAMp2TnnfoVlD7t+Ir6v7551P4/NuYoSs58HhseemaDKf4zC/fNjyOr
 oLNTkOGK5/0RtYAY1rUbrWOuxgwEpVKqnWBttqKAPP2e6R6NxDdQGye/5Xu+tA/6
 2gmRe9/lFwPgDhD7pRMkZpv6iNWjeelnCHEsdAeg7eGDydukZu68/q+c+odcMifq
 IPuzQ3hYL+l1RMgUvMD62YzU2XDLyyfE2idaXeIbDEKSdA76sZIRVyPEq/NgE8iO
 1TYGWyh4ElDdZMcpzgVuv/pyHjAmQeMaftLwvEEW26xymJyYknJqJOgkR+q85YHD
 IpVKSe8ANQoQRQSG7dfGZ9ChDyU9DqTrTjKnxbuzlRFXmdOoKd8USmh6nGWekL1S
 vf0dIRoGeEipfk+d6uwCgcBRlTVVNKkTHFjGrpKnmUVgFOp80frGn5kKa4pwqavJ
 /l4iKppaxYOy37bnL8JhoANTtTA7VGCURRBcrs9pNPzLFM3ErtsFn2A3mkR4Obfd
 K99dutXwhvlf8bpXX1d673bbCZBwMsl+9hIBOyReksAWOb1vX0qFFZRaF6rexljP
 HSpjcD7FCqs5+qNf9b+caW6h1idogr/YWJJ+MQ3Nw2t24lQQIOx0OqPZ47CwWqJc
 a2mKgS5iNr0hgRcH4Mpc
 =xsga
 -----END PGP SIGNATURE-----
 

[LLVMGPU] Add Virtual MFMA layout that maximizes load through adjusted K-width (#18930)

The main use case for virtual intrinsics is to change the layout of
intrinsics along the K dimension, so that we can coalesce reads from
shared memory into registers.

Currently, the "native" intrinsics need to enforce the "native" layout
(i.e., read 4 elements per thread for MFMA_F32_16x16x16). However, since
the K-dim is a reduction dimension, and the reduction is associative, we
can read the data in a non-"native" but faster way (more elements per
read). As long as the K-dim layout matches on both lhs and rhs, we still
get correct results (e.g., read 8 contiguous elements per thread from
shared memory along dimension K and then slice them into the operands of
two MFMA_F32_16x16x16 intrinsics).

As an IR example, if we want to do a 16x16x32 (MxNxK) matmul with
MFMA_F32_16x16x16_F16 intrinsics, on lane 0 we used to have something
like:

```
lhs_0 = read(lhs_shared_mem[0:4])
rhs_0 = read(rhs_shared_mem[0:4])
mma_0 = vector.contract(lhs_0, rhs_0)

// Offset by 16 since MFMA_F32_16x16x16_F16 has an intrinsic K size of 16.
lhs_1 = read(lhs_shared_mem[16 + 0 : 16 + 4])
rhs_1 = read(rhs_shared_mem[16 + 0 : 16 + 4])
mma_1 = vector.contract(lhs_1, rhs_1, mma_0)
```

With this optimization, this turns into something like:

```
lhs_reg = read(lhs_shared_mem[0:8])
rhs_reg = read(rhs_shared_mem[0:8])

lhs_0 = slice(lhs_reg, [0 : 4])
rhs_0 = slice(rhs_reg, [0 : 4])
mma_0 = vector.contract(lhs_0, rhs_0)

lhs_1 = slice(lhs_reg, [4 : 8])
rhs_1 = slice(rhs_reg, [4 : 8])
mma_1 = vector.contract(lhs_1, rhs_1, mma_0)
```
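
To sanity check why the wider read is still correct, here is a minimal
Python sketch (not part of this change) that models each intrinsic as a
partial reduction over 16 K indices. It assumes the 64 lanes form 4
groups of 16 and that each group owns one contiguous K slice; the helper
names `native_k_indices`, `virtual_k_indices` and `matmul_in_steps` are
hypothetical and exist only for illustration.

```python
import random

M, N, K = 16, 16, 32
INTRINSIC_K = 16   # K size of one MFMA_F32_16x16x16_F16
LANE_GROUPS = 4    # assumption: 64 lanes / 16 rows, one K slice per group
PER_LANE = INTRINSIC_K // LANE_GROUPS  # 4 elements per lane per intrinsic


def native_k_indices(step):
    """K indices consumed by intrinsic `step` when each lane reads 4 elements."""
    return [step * INTRINSIC_K + g * PER_LANE + i
            for g in range(LANE_GROUPS) for i in range(PER_LANE)]


def virtual_k_indices(step, unroll=2):
    """K indices consumed by intrinsic `step` when each lane reads
    PER_LANE * unroll contiguous elements and slices out chunk `step`."""
    wide = PER_LANE * unroll  # 8 contiguous elements per lane
    return [g * wide + step * PER_LANE + i
            for g in range(LANE_GROUPS) for i in range(PER_LANE)]


def matmul_in_steps(lhs, rhs, k_indices_for_step, steps=2):
    """Accumulate one K=16 partial product per step, like chained MFMAs."""
    acc = [[0] * N for _ in range(M)]
    for step in range(steps):
        ks = k_indices_for_step(step)
        for m in range(M):
            for n in range(N):
                acc[m][n] += sum(lhs[m][k] * rhs[k][n] for k in ks)
    return acc


# Integer inputs keep the comparison exact regardless of summation order.
rng = random.Random(0)
lhs = [[rng.randint(-4, 4) for _ in range(K)] for _ in range(M)]
rhs = [[rng.randint(-4, 4) for _ in range(N)] for _ in range(K)]

reference = [[sum(lhs[m][k] * rhs[k][n] for k in range(K)) for n in range(N)]
             for m in range(M)]
native = matmul_in_steps(lhs, rhs, native_k_indices)
virtual = matmul_in_steps(lhs, rhs, virtual_k_indices)
```

Both `native` and `virtual` equal `reference` because the two intrinsics
together still cover every K index exactly once, and lhs and rhs use the
same K indices in each step. With floating point inputs the two orderings
sum in a different order, so results may differ in rounding.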

Currently, we are plumbing it in as MMA intrinsic enums for two variants
of unrolled k == 2 on the F16s (per discussion with @qedawkins and
@Groverkss), as that is the easiest and least tangled way to integrate
it. In the future we can expose this attribute as a k-width to maximize
generality.

---------

Signed-off-by: Stanley Winata <stanley.winata@amd.com>