Adding Batch Matmul and Matmul with Split-K to IREE Dispatch Profiler (#13396)

This PR upstreams the addition of batch_matmul and matmul-with-split-k variants to the IREE dispatch profiler. It partially addresses issues
#13281 and #13290 by adding the means to functionally verify and performance-profile these operations.
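For context, a split-k matmul partitions the reduction (K) dimension into several slices whose partial products are summed at the end. A tiny numpy illustration of the idea (illustrative shapes only, not profiler code):

```python
import numpy as np

# Illustrative shapes; the profiler's generated shapes differ.
m, n, k, split_k_slices = 128, 128, 512, 4
lhs = np.random.rand(m, k).astype(np.float32)
rhs = np.random.rand(k, n).astype(np.float32)

# Each slice reduces over a contiguous chunk of K; the partial results are summed afterwards.
chunk = k // split_k_slices
partials = [lhs[:, i * chunk:(i + 1) * chunk] @ rhs[i * chunk:(i + 1) * chunk, :]
            for i in range(split_k_slices)]
result = np.sum(partials, axis=0)

assert np.allclose(result, lhs @ rhs, rtol=1e-3)
```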


## Current Usage of IREE Dispatch Profiler 
Note that the usage is expected to remain largely as documented below, though minor changes are likely as this work progresses.

## Generating and compiling the dispatch MLIR to vmfb
```bash
# Generate default shapes for matmul, batch_matmul, and matmul with split-k
build-debug $ python3 ../iree/experimental/dispatch_profiler/generator.py

# Compile the generated dispatches to vmfb
build-debug $ python3 ../iree/experimental/dispatch_profiler/compile.py
```
## Functional verification and performance profiling of the generated dispatches

```
# Run a single matmul dispatch `matmul_1024x512x2048_f16t_f16t_f16t_tile_config_128x64_32x5_tensorcore_mmasync`

build-debug $ python3 ../iree/experimental/dispatch_profiler/profiler.py --dispatches=matmul_1024x512x2048_f16t_f16t_f16t_tile_config_128x64_32x5_tensorcore_mmasync
---------------------------------------------------------------- 
Dispatch      : matmul_1024x512x2048_f16t_f16t_f16t_tile_config_128x64_32x5_tensorcore_mmasync
Provider      : IREE Codegen
OpKind        : OperationKind.Matmul
Operation     : matmul_1024x512x2048_f16t_f16t_f16t
Configuration : tile_config_128x64_32x5_tensorcore_mmasync
Arguments     : --batch_count=1 --m=1024 --n=512 --k=2048 --lhs=f16t --rhs=f16t --result=f16t
                --split_k_mode=N/A --split_k_slices=N/A
Verification  : SUCCESS
Runtime(ms)   : 0.023
GFLOPs        : 93368.85

# Sweep the tuning configurations of a matmul operation and dump the results to a csv file

build-debug $ python3 ../iree/experimental/dispatch_profiler/profiler.py --dispatches=matmul_1024x512x2048_f16t_f16t_f16t_tile_config_*_tensorcore_mmasync --output=data.csv
---------------------------------------------------------------- 
Dispatch      : matmul_1024x512x2048_f16t_f16t_f16t_tile_config_256x128_32x3_tensorcore_mmasync
Provider      : IREE Codegen
OpKind        : OperationKind.Matmul
Operation     : matmul_1024x512x2048_f16t_f16t_f16t
Configuration : tile_config_256x128_32x3_tensorcore_mmasync
Arguments     : --batch_count=1 --m=1024 --n=512 --k=2048 --lhs=f16t --rhs=f16t --result=f16t
                --split_k_mode=N/A --split_k_slices=N/A
Verification  : SUCCESS
Runtime(ms)   : 0.058
GFLOPs        : 37025.58
---------------------------------------------------------------- 
...
Dispatch      : matmul_1024x512x2048_f16t_f16t_f16t_tile_config_128x128_64x4_tensorcore_mmasync
Provider      : IREE Codegen
OpKind        : OperationKind.Matmul
Operation     : matmul_1024x512x2048_f16t_f16t_f16t
Configuration : tile_config_128x128_64x4_tensorcore_mmasync
Arguments     : --batch_count=1 --m=1024 --n=512 --k=2048 --lhs=f16t --rhs=f16t --result=f16t
                --split_k_mode=N/A --split_k_slices=N/A
Verification  : SUCCESS
Runtime(ms)   : 0.031
GFLOPs        : 69273.67
---------------------------------------------------------------- 
Dispatch      : matmul_1024x512x2048_f16t_f16t_f16t_tile_config_128x128_32x5_tensorcore_mmasync
Provider      : IREE Codegen
OpKind        : OperationKind.Matmul
Operation     : matmul_1024x512x2048_f16t_f16t_f16t
Configuration : tile_config_128x128_32x5_tensorcore_mmasync
Arguments     : --batch_count=1 --m=1024 --n=512 --k=2048 --lhs=f16t --rhs=f16t --result=f16t
                --split_k_mode=N/A --split_k_slices=N/A
Verification  : SUCCESS
Runtime(ms)   : 0.031
GFLOPs        : 69273.67
---------------------------------------------------------------- 
Dispatch      : matmul_1024x512x2048_f16t_f16t_f16t_tile_config_128x64_32x5_tensorcore_mmasync
Provider      : IREE Codegen
OpKind        : OperationKind.Matmul
Operation     : matmul_1024x512x2048_f16t_f16t_f16t
Configuration : tile_config_128x64_32x5_tensorcore_mmasync
Arguments     : --batch_count=1 --m=1024 --n=512 --k=2048 --lhs=f16t --rhs=f16t --result=f16t
                --split_k_mode=N/A --split_k_slices=N/A
Verification  : SUCCESS
Runtime(ms)   : 0.023
GFLOPs        : 93368.85
---------------------------------------------------------------- 

Writing performance report to data.csv


# Run batch_matmul

$ python3 ../iree/experimental/dispatch_profiler/profiler.py --dispatches=batch_matmul* --output=data.csv
---------------------------------------------------------------- 
Dispatch      : batch_matmul_16x512x64x512_f16t_f16t_f16t_tile_config_128x64_32x5_tensorcore_mmasync
Provider      : IREE Codegen
OpKind        : OperationKind.BatchMatmul
Operation     : batch_matmul_16x512x64x512_f16t_f16t_f16t
Configuration : tile_config_128x64_32x5_tensorcore_mmasync
Arguments     : --batch_count=16 --m=512 --n=64 --k=512 --lhs=f16t --rhs=f16t --result=f16t
                --split_k_mode=N/A --split_k_slices=N/A
Verification  : SUCCESS
Runtime(ms)   : 0.009
GFLOPs        : 59652.32
---------------------------------------------------------------- 
```
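The reported GFLOPs figures are consistent with the usual 2 * batch_count * m * n * k flop count divided by the measured runtime. A quick sanity check in Python (the formula is inferred from the numbers above, not taken from the profiler source):

```python
# matmul_1024x512x2048 at 0.023 ms
flops = 2 * 1 * 1024 * 512 * 2048
print(flops / 0.023e-3 / 1e9)  # ~93368.85 GFLOPs, matching the report above

# batch_matmul_16x512x64x512 at 0.009 ms
flops = 2 * 16 * 512 * 64 * 512
print(flops / 0.009e-3 / 1e9)  # ~59652.32 GFLOPs
```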

## Reference cache
Note that the numpy reference is run once and then cached. For a large matmul the reference run can take ~30 sec, while a run that hits the cache completes in under 5 sec, as shown by the timings below (a sketch of the caching idea follows them).

```bash
# Running numpy reference
$ time python3 ../iree/experimental/dispatch_profiler/profiler.py --dispatches=matmul_3456x1024x2048_f16t_f16t_f16t_tile_config_128x128_32x5_tensorcore_mmasync --output=data.csv
---------------------------------------------------------------- 
Dispatch      : matmul_3456x1024x2048_f16t_f16t_f16t_tile_config_128x128_32x5_tensorcore_mmasync
Provider      : IREE Codegen
OpKind        : OperationKind.Matmul
Operation     : matmul_3456x1024x2048_f16t_f16t_f16t
Configuration : tile_config_128x128_32x5_tensorcore_mmasync
Arguments     : --batch_count=1 --m=3456 --n=1024 --k=2048 --lhs=f16t --rhs=f16t --result=f16t
                --split_k_mode=N/A --split_k_slices=N/A
Verification  : SUCCESS
Runtime(ms)   : 0.062
GFLOPs        : 233798.62
Writing performance report to data.csv

real	0m30.083s
user	0m29.486s
sys	0m1.634s


# Skipping the reference run because the reference .npy files are already in `reference_cache`
$ time python3 ../iree/experimental/dispatch_profiler/profiler.py --dispatches=matmul_3456x1024x2048_f16t_f16t_f16t_tile_config_128x128_32x5_tensorcore_mmasync --output=data.csv
---------------------------------------------------------------- 
Dispatch      : matmul_3456x1024x2048_f16t_f16t_f16t_tile_config_128x128_32x5_tensorcore_mmasync
Provider      : IREE Codegen
OpKind        : OperationKind.Matmul
Operation     : matmul_3456x1024x2048_f16t_f16t_f16t
Configuration : tile_config_128x128_32x5_tensorcore_mmasync
Arguments     : --batch_count=1 --m=3456 --n=1024 --k=2048 --lhs=f16t --rhs=f16t --result=f16t
                --split_k_mode=N/A --split_k_slices=N/A
Verification  : SUCCESS
Runtime(ms)   : 0.062
GFLOPs        : 233798.62
Writing performance report to data.csv

real	0m4.779s
user	0m4.263s
sys	0m1.617s


```
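The caching scheme amounts to computing the numpy reference once and reusing the saved result on subsequent runs. A minimal sketch of the idea, assuming the cache stores `.npy` files keyed by the operation name (the actual layout of `reference_cache` may differ):

```python
import os
import numpy as np

def cached_reference_matmul(lhs, rhs, cache_dir, op_name):
    """Return the numpy reference result, computing it only on a cache miss."""
    cache_path = os.path.join(cache_dir, f"{op_name}_reference.npy")
    if os.path.exists(cache_path):
        return np.load(cache_path)  # fast path: reuse the cached reference
    result = lhs.astype(np.float32) @ rhs.astype(np.float32)  # slow path for large shapes
    os.makedirs(cache_dir, exist_ok=True)
    np.save(cache_path, result)
    return result
```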

## Rough edges and notes for future progress
- This PR also includes code updates to pass matmul problem shapes as command-line arguments to the generator. This lets the user pass a comma-separated list and/or a Python-style range of values (a sketch of how such a spec expands is shown after the folder listing below).

Try the command below and see what gets generated:
```bash
$ python3 ../iree/experimental/dispatch_profiler/generator.py --problem-m=768,512:1024:128 --problem-n=768 --problem-k=32:512:64
```

One can check the `generated` folder as follows:
```
build-debug $ ls -R generated/
generated/:
linalg  reference_cache

generated/linalg:
batch_matmul  matmul  matmul_splitk
...
```
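The range syntax used above (`start:stop:step`, optionally mixed with comma-separated values) could be expanded roughly as follows. This is a hypothetical helper meant only to illustrate the accepted format, not the generator's actual parsing code; in particular, whether the end of the range is inclusive is an assumption here:

```python
def expand_problem_dim(spec):
    """Expand a spec such as "768,512:1024:128" into a list of dimension values."""
    values = []
    for part in spec.split(","):
        if ":" in part:
            start, stop, step = (int(x) for x in part.split(":"))
            values.extend(range(start, stop, step))  # assumes a Python-style half-open range
        else:
            values.append(int(part))
    return values

print(expand_problem_dim("768,512:1024:128"))  # [768, 512, 640, 768, 896]
```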

- Passing operation-specific command-line arguments (matmul being our first example) is not yet where I would like it to be. I am experimenting with various ways to use argparse in a reduced version
[here](https://github.com/manishucsd/python_explorations/tree/main/argparse).

I would like the interface to look roughly like the commands below (a sketch of one possible argparse approach follows the example). Suggestions?
```
# Run all the operations that were previously generated
$ profiler.py 

# general help
$ profiler.py --help

# operation-specific help
$ profiler.py --op-kind matmul --help

# Generate matmul
$ generator.py --op-kind=matmul --m=768 --n=512 --k=1024

# operation-specific run on a shape that is already generated
$ profiler.py --op-kind matmul --m=768 --n=512 --k=1024
```
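One way to get operation-specific flags out of argparse is a two-stage parse: first read `--op-kind`, then add the arguments that apply to the selected operation. A rough sketch of the pattern, not the profiler's actual implementation:

```python
import argparse

def parse_args(argv=None):
    # Stage 1: figure out which operation is being requested.
    base = argparse.ArgumentParser(add_help=False)
    base.add_argument("--op-kind", default="matmul", choices=["matmul", "batch_matmul"])
    known, _ = base.parse_known_args(argv)

    # Stage 2: add only the flags that make sense for that operation.
    parser = argparse.ArgumentParser(parents=[base])
    parser.add_argument("--m", type=int)
    parser.add_argument("--n", type=int)
    parser.add_argument("--k", type=int)
    if known.op_kind == "batch_matmul":
        parser.add_argument("--batch-count", type=int)
    return parser.parse_args(argv)

print(parse_args(["--op-kind", "matmul", "--m", "768", "--n", "512", "--k", "1024"]))
```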