| # IREE Dispatch Profiler |
| |
| The IREE Dispatch Profiler is a Python-based tool designed to achieve two primary objectives: functional verification and performance profiling for individual dispatches, such as matrix multiplication, batch matrix multiplication, and convolutions. This tool ensures that performance optimizations maintain functionality and provides a convenient way to quantitatively measure performance. Additionally, the tool offers dispatch generation and compilation capabilities. In summary, the IREE dispatch profiler accomplishes the following: |
| |
| - Auto-generation of MLIR dispatches (e.g., matmul, batch_matmul, convolutions, fused dispatches). |
| - Compilation of generated MLIR dispatches into binaries (vmfb). |
| - Functional verification against Python-based reference implementations. |
| - Performance profiling and reporting. |
| |
| ## Definitions |
| |
| - Operation: An operation structure captures and refers to the functional description of an operation. For example, a Matmul operation includes the datatype, layout, and matrix multiplication problem shape. |
| - Tuning Configuration: Tuning configurations are attributes applied to the IREE compilation flow that can alter the performance of the compiled dispatch without affecting its functionality. |
| - Dispatch: A dispatch is a combination of an operation and its corresponding tuning configuration. |
| |
| ## Auto-generation of MLIR Dispatches |
| |
| IREE dispatch profiler provides [`generator.py`](generator.py) that can be used to generate dispatches. Please find a sample run below: |
| |
| ```bash |
| $ python3 dispatch_profiler/generator.py --generated-dir </path/to/create/`generated`/dir> |
| [Generating]: ./generated/linalg/matmul/matmul_128x128x256_f16t_f16t_f16t/matmul_128x128x256_f16t_f16t_f16t.mlir |
| Emitting tuning configuration : tile_config_128x128_64x4_tensorcore_mmasync |
| Emitting tuning configuration : tile_config_128x128_32x5_tensorcore_mmasync |
| Emitting tuning configuration : tile_config_128x64_32x5_tensorcore_mmasync |
| Emitting tuning configuration : tile_config_64x64_64x5_tensorcore_mmasync |
| Emitting tuning configuration : tile_config_64x64_32x10_tensorcore_mmasync |
| ... |
| ``` |
| |
| This creates a `generated` folder containing dispatches organized in folders as `mlir_dialect/operation_name/`. The folder includes an .mlir file with all the dispatches for an operation. |
| |
| The `generator.py` script serves as a generator for implemented operation data types, using a predefined list of problem shapes. You can also provide specific matrix multiplication shapes of interest. Examples are provided below. |
| |
| #### Generating user-specified matmul shape `768x512x1024` |
| |
| ```bash |
| python3 ../iree/experimental/dispatch_profiler/generator.py --generated-dir </path/to/create/`generated`/dir> --problem-m=768 --problem-n=512 --problem-k=1024 |
| ... |
| [Generating]: ./generated/linalg/matmul/matmul_768x512x1024_f16t_f16t_f16t/matmul_768x512x1024_f16t_f16t_f16t.mlir |
| [Generating]: ./generated/linalg/matmul/matmul_768x512x1024_f32t_f32t_f32t/matmul_768x512x1024_f32t_f32t_f32t.mlir |
| ... |
| ``` |
| |
| #### Generate a user-specified sweep of matmul shapes |
| |
| Generate matmuls where M ranges from 64 to 1024 in increments of 128, N varies from 64 to 1024 in steps of 128, and K is fixed at 4096. |
| |
| ```bash |
| $ python3 ../iree/experimental/dispatch_profiler/generator.py --generated-dir </path/to/create/`generated`/dir> --problem-m=64:1024:128 --problem-n=64:1024:128 --problem-k=4096 |
| ... |
| ``` |
| |
| ## Compilation of generated MLIR dispatches into binaries (vmfb) |
| |
| IREE dispatch profiler provies `compile.py` that trigges `iree-compile` with appropiate compilation flags. The output of `iree-compile` vmfb files are placed in `mlir_dialect/operation_path/operation_name.mlir`. The `compiler.py` uses all the possible cpus on your machine to compile all different generated mlir source files. |
| |
| ```bash |
| python3 ../iree/experimental/dispatch_profiler/compile.py --build-dir </path/to/iree/build/dir> --generated-dir </path/to/create/`generated`/dir> |
| ``` |
| |
| Compiles all the generated source mlir dispatches. One can check the generated dispatched folder to find the vmfb files. |
| |
| ```bash |
| $ ls ./generated/linalg/matmul/matmul_64x64x4096_f16t_f16t_f16t/ |
| iree_compile_cmd_stdout.mlir matmul_64x64x4096_f16t_f16t_f16t.mlir matmul_64x64x4096_f16t_f16t_f16t_profile.vmfb matmul_64x64x4096_f16t_f16t_f16t_verify.vmfb |
| ``` |
| |
| ## Functional verification and performance profiling |
| |
| The tool provides [`profiler.py`](profiler.py) script which can be used to trigger both verification and profiler for all the compiled dispatches. Please find some example profiling commandlines below: |
| |
| ### Functional verification and performance profiling of a _single_ dispatch |
| |
| ``` |
| $ python3 profiler.py --build-dir </path/to/iree/build/dir> --generated-dir </path/to/create/`generated`/dir> --dispatches=matmul_3456x1024x2048_f16t_f16t_f16t_tile_config_128x128_32x5_tensorcore_mmasync --verification-enabled=true --profiling-enabled=true |
| ---------------------------------------------------------------- |
| Dispatch : matmul_3456x1024x2048_f16t_f16t_f16t_tile_config_128x128_32x5_tensorcore_mmasync |
| Provider : IREE Codegen |
| OpKind : OperationKind.Matmul |
| Operation : matmul_3456x1024x2048_f16t_f16t_f16t |
| Configuration : tile_config_128x128_32x5_tensorcore_mmasync |
| Arguments : --batch_count=1 --m=3456 --n=1024 --k=2048 --lhs=f16t --rhs=f16t --result=f16t |
| --split_k_mode=N/A --split_k_slices=N/A |
| Verification : SUCCESS |
| Runtime(ms) : 0.062 |
| GFLOPs : 233798.62 |
| ``` |
| |
| ### Performance profiling _single_ dispatch |
| |
| Verification, particularly for large matrix multiplications, can be time-consuming when using a CPU-based numpy reference. To prioritize profiling speed and when functional correctness is assured, disable verification using `--verification-enabled=false`. |
| |
| ```bash |
| python3 profiler.py --build-dir </path/to/iree/build/dir> --generated-dir </path/to/create/`generated`/dir> --dispatches=matmul_3456x1024x2048_f16t_f16t_f16t_tile_config_128x128_32x5_tensorcore_mmasync --verification-enabled=false --profiling-enabled=true |
| ``` |
| |
| ### Performance profile _single_ operation and _sweep_ tunning configurations |
| |
| The `--dispatch` option accepts a comma-separated list of regex patterns to profile all tuning configurations generated for a operation. The command-line argument is formatted as `--dispatch=<regex>,<regex>`. Additionally, you can export the profiled output to a CSV file for further analysis using `--output=<filepath>`. |
| |
| ```bash |
| $ python3 profiler.py --build-dir </path/to/iree/build/dir> --generated-dir </path/to/create/`generated`/dir> --dispatches=matmul_3456x1024x2048_f16t_f16t_f16t_*_tensorcore_mmasync --verification-enabled=false --profiling-enabled=true --output=data.csv |
| ---------------------------------------------------------------- |
| Dispatch : matmul_3456x1024x2048_f16t_f16t_f16t_tile_config_128x256_32x3_tensorcore_mmasync |
| Provider : IREE Codegen |
| OpKind : OperationKind.Matmul |
| Operation : matmul_3456x1024x2048_f16t_f16t_f16t |
| Configuration : tile_config_128x256_32x3_tensorcore_mmasync |
| Arguments : --batch_count=1 --m=3456 --n=1024 --k=2048 --lhs=f16t --rhs=f16t --result=f16t |
| --split_k_mode=N/A --split_k_slices=N/A |
| Verification : Not verified |
| Runtime(ms) : 0.062 |
| GFLOPs : 233798.62 |
| ---------------------------------------------------------------- |
| Dispatch : matmul_3456x1024x2048_f16t_f16t_f16t_tile_config_128x128_64x4_tensorcore_mmasync |
| Provider : IREE Codegen |
| OpKind : OperationKind.Matmul |
| Operation : matmul_3456x1024x2048_f16t_f16t_f16t |
| Configuration : tile_config_128x128_64x4_tensorcore_mmasync |
| Arguments : --batch_count=1 --m=3456 --n=1024 --k=2048 --lhs=f16t --rhs=f16t --result=f16t |
| --split_k_mode=N/A --split_k_slices=N/A |
| Verification : Not verified |
| Runtime(ms) : 0.064 |
| GFLOPs : 226492.42 |
| ---------------------------------------------------------------- |
| ... |
| ---------------------------------------------------------------- |
| Dispatch : matmul_3456x1024x2048_f16t_f16t_f16t_tile_config_64x64_32x10_tensorcore_mmasync |
| Provider : IREE Codegen |
| OpKind : OperationKind.Matmul |
| Operation : matmul_3456x1024x2048_f16t_f16t_f16t |
| Configuration : tile_config_64x64_32x10_tensorcore_mmasync |
| Arguments : --batch_count=1 --m=3456 --n=1024 --k=2048 --lhs=f16t --rhs=f16t --result=f16t |
| --split_k_mode=N/A --split_k_slices=N/A |
| Verification : Not verified |
| Runtime(ms) : 0.103 |
| GFLOPs : 140733.15 |
| |
| Writing performance report to data.csv |
| |
| ``` |
| |
| ### Performance profiling a large matmul targetting _F16_ and _F32_ datatype |
| |
| Another example showcasing the use of `--dispatch` to profile a matmul_3456x1024x2048 targetting F16 and F32 NVIDIA A100 Tensor Cores. |
| |
| ```bash |
| $ python3 profiler.py --build-dir </path/to/iree/build/dir> --generated-dir </path/to/create/`generated`/dir> --dispatches=matmul_3456x1024x2048_f16t_f16t_f16t_tile_config_128x128_32x5_tensorcore_mmasync,matmul_3456x1024x2048_f32t_f32t_f32t_tile_config_128x128_16x5_tensorcore_mmasync |
| ---------------------------------------------------------------- |
| Dispatch : matmul_3456x1024x2048_f16t_f16t_f16t_tile_config_128x128_32x5_tensorcore_mmasync |
| Provider : IREE Codegen |
| OpKind : OperationKind.Matmul |
| Operation : matmul_3456x1024x2048_f16t_f16t_f16t |
| Configuration : tile_config_128x128_32x5_tensorcore_mmasync |
| Arguments : --batch_count=1 --m=3456 --n=1024 --k=2048 --lhs=f16t --rhs=f16t --result=f16t |
| --split_k_mode=N/A --split_k_slices=N/A |
| Verification : SUCCESS |
| Runtime(ms) : 0.062 |
| GFLOPs : 233798.62 |
| ---------------------------------------------------------------- |
| Dispatch : matmul_3456x1024x2048_f32t_f32t_f32t_tile_config_128x128_16x5_tensorcore_mmasync |
| Provider : IREE Codegen |
| OpKind : OperationKind.Matmul |
| Operation : matmul_3456x1024x2048_f32t_f32t_f32t |
| Configuration : tile_config_128x128_16x5_tensorcore_mmasync |
| Arguments : --batch_count=1 --m=3456 --n=1024 --k=2048 --lhs=f32t --rhs=f32t --result=f32t |
| --split_k_mode=N/A --split_k_slices=N/A |
| Verification : SUCCESS |
| Runtime(ms) : 0.122 |
| GFLOPs : 118815.69 |
| ---------------------------------------------------------------- |
| ``` |