tree a8f0c93f9f38b51bd72389038a56c7c98c9347ce
parent d01fb2341c12d81ff54455712c9a33dea70b13ff
author Quinn Dawkins <quinn.dawkins@gmail.com> 1718915502 -0400
committer GitHub <noreply@github.com> 1718915502 -0400
gpgsig -----BEGIN PGP SIGNATURE-----
 
 wsFcBAABCAAQBQJmdJGuCRC1aQ7uu5UhlAAAeHUQABulj27ZlHQabVvY08TRFQet
 Ae43a8MNwzjAwxjptHrDV1/AY2Yb+/7jKMVAY+Ozs3A64LHYMDZlO0Ifi4bAzM2w
 pTtP0NfrakZauOZPwtjnFi+gtLAbJdaRnjEkluGBCfC9hfrnvuCTN0pKJiu/vLuX
 SVi2grLHgtON7VZDCuToyGyTk7DRWCIN+LT3h+wTM2Pd5BApS58euh7NcdQeNK1h
 N9uMDF+yvO8Vim/SQf5zQZphyjiQx3YvV+uFbA4AVcW2v3F/0O2j3WfzJ5wOJ0k7
 e8vpV5XT8LSYoLk56tpmhqAtvacJq3vKA+mAzPj1fVb8jWZbi8lGteJ8PjsYl130
 IQTyzPUGHjAz+QOMEjQWxwmoTsphSUn1oirLOsy/ymLmH6/+N5YaiaZH+NErU/i8
 +ldSiPRV/KCjPnZxT1w6AEmj96XrOnEylA4tD6c2sJXxcTmziNTc6+O/XSmhAthJ
 bEzqvopA0OTB/0hTac7In4DZjFNeGtwV80OiHIqQiqGPplLfM3lMWZt6+YkjFxJS
 OsHzpsLI1jEdr9Qr207lpF1I2PAG9na5hyRs2EBhi6NaQpZfePIbMMrCiTYvw1GO
 TTYQP2QtQiYna1T4xjn+FbqI26YZFeVGKgg0/lEAHGf4ze3QNeRWR0tfApdloWYb
 tHtiL3BSyd17QHtjIuc5
 =O48m
 -----END PGP SIGNATURE-----
 

[Codegen][GPU] Update greedy tile + fuse pipeline to generate mfma (#17617)

This adds intrinsic packing and reshape propagation patterns to
LLVMGPUTileAndFuse so that the pipeline can generate mfma operations. It
also adds a few passes that invoke the patterns the pipeline needs to
generate (good) code.

1. PropagateReshapesByExpansion to propagate reshapes introduced after
decomposing tensor.pack/unpack towards the edges of the kernel in the
hopes that the destination can line up properly.
2. IREE::GPU::PackToIntrinsics to pack based on the mma kind specified
in the lowering config.
3. IREE::GPU::DistributeMmaToLanes to distribute iree_gpu.multi_mma ops
to lanes, similar to another tiling level.

There are a few known outstanding issues.

1. We run `ConvertToDestinationPassingStyle` twice to re-link the kernel
destination with the body after decomposing `tensor.unpack`. This is to
work around an issue with EliminateEmptyTensors being unable to analyze
`flow.dispatch.tensor.store` ops with slicing behavior properly. After
workgroup distribution is refactored to generate an scf.forall, this
needs to be revisited.
2. iree_gpu.shuffle_tensor lowering to `tensor.insert_slice` is still
broken. This will need to be reworked to support dynamic shapes.
3. Currently, because of the way the layout works, only MFMA_16x16x16
works. To support other layouts we will need another level of expanding
to the intrinsic implicit layout and then propagating those
expand_shapes. This will likely need to happen after reduction tiling
unless we want to teach tile + fuse to swap tensor.expand_shape ops with
tensor.extract_slice.