The IREE Dispatch Profiler is a Python-based tool designed to achieve two primary objectives: functional verification and performance profiling for individual dispatches, such as matrix multiplication, batch matrix multiplication, and convolutions. The tool helps ensure that performance optimizations preserve functional correctness and provides a convenient way to measure performance quantitatively. Additionally, the tool offers dispatch generation and compilation capabilities. In summary, the IREE dispatch profiler generates dispatches, compiles them, verifies their functional correctness, and profiles their performance.
The IREE dispatch profiler provides generator.py, which can be used to generate dispatches. A sample run is shown below:
$ python3 dispatch_profiler/generator.py --generated-dir </path/to/create/`generated`/dir>
[Generating]: ./generated/linalg/matmul/matmul_128x128x256_f16t_f16t_f16t/matmul_128x128x256_f16t_f16t_f16t.mlir
Emitting tuning configuration : tile_config_128x128_64x4_tensorcore_mmasync
Emitting tuning configuration : tile_config_128x128_32x5_tensorcore_mmasync
Emitting tuning configuration : tile_config_128x64_32x5_tensorcore_mmasync
Emitting tuning configuration : tile_config_64x64_64x5_tensorcore_mmasync
Emitting tuning configuration : tile_config_64x64_32x10_tensorcore_mmasync
...
This creates a generated folder containing dispatches organized in folders as mlir_dialect/operation_name/. The folder includes an .mlir file with all the dispatches for an operation.
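The folder layout can be sketched in a few lines of Python. This is only an illustration inferred from the sample output above, where the source file lives at `generated/<dialect>/<operation kind>/<operation name>/<operation name>.mlir`. The helper names `matmul_operation_name` and `dispatch_source_path` are hypothetical and are not part of the profiler.

```python
from pathlib import PurePosixPath


def matmul_operation_name(m, n, k, lhs="f16t", rhs="f16t", result="f16t"):
    """Build a name shaped like the ones in the sample output (hypothetical helper)."""
    return f"matmul_{m}x{n}x{k}_{lhs}_{rhs}_{result}"


def dispatch_source_path(generated_dir, dialect, op_kind, operation_name):
    """Return the .mlir path, assuming the layout seen in the sample run."""
    return (
        PurePosixPath(generated_dir)
        / dialect
        / op_kind
        / operation_name
        / f"{operation_name}.mlir"
    )


if __name__ == "__main__":
    name = matmul_operation_name(128, 128, 256)
    print(dispatch_source_path("generated", "linalg", "matmul", name))
```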
The generator.py script generates dispatches for the implemented operations and data types, using a predefined list of problem shapes. You can also provide specific matrix multiplication shapes of interest. Examples are provided below.
Generate a matmul with problem shape 768x512x1024.
$ python3 ../iree/experimental/dispatch_profiler/generator.py --generated-dir </path/to/create/`generated`/dir> --problem-m=768 --problem-n=512 --problem-k=1024
...
[Generating]: ./generated/linalg/matmul/matmul_768x512x1024_f16t_f16t_f16t/matmul_768x512x1024_f16t_f16t_f16t.mlir
[Generating]: ./generated/linalg/matmul/matmul_768x512x1024_f32t_f32t_f32t/matmul_768x512x1024_f32t_f32t_f32t.mlir
...
Generate matmuls where M ranges from 64 to 1024 in increments of 128, N varies from 64 to 1024 in steps of 128, and K is fixed at 4096.
$ python3 ../iree/experimental/dispatch_profiler/generator.py --generated-dir </path/to/create/`generated`/dir> --problem-m=64:1024:128 --problem-n=64:1024:128 --problem-k=4096
...
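To see which problem shapes a `start:end:step` argument describes, the sketch below expands each range and takes the cross product of M, N, and K. It is not the generator's own parsing code: the function names are hypothetical, and treating the end value as an inclusive upper bound is an assumption (for the ranges above the result is the same either way).

```python
from itertools import product


def expand_range(spec):
    """Expand "N" or "start:end:step" into a list of ints.

    Assumption: `end` is an inclusive upper bound.
    """
    parts = [int(p) for p in spec.split(":")]
    if len(parts) == 1:
        return parts
    if len(parts) != 3:
        raise ValueError(f"expected N or start:end:step, got {spec!r}")
    start, end, step = parts
    return list(range(start, end + 1, step))


def problem_shapes(m_spec, n_spec, k_spec):
    """Return every (M, N, K) combination described by the three specs."""
    return list(
        product(expand_range(m_spec), expand_range(n_spec), expand_range(k_spec))
    )


if __name__ == "__main__":
    print(expand_range("64:1024:128"))
    print(len(problem_shapes("64:1024:128", "64:1024:128", "4096")))
```

Starting at 64 with a step of 128, the largest value produced is 960, because the next value (1088) exceeds the bound of 1024. Each of M and N therefore takes 8 values, giving 64 shapes with K fixed at 4096.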
The IREE dispatch profiler provides compile.py, which triggers iree-compile with the appropriate compilation flags. The vmfb files produced by iree-compile are placed next to the generated source, in the same folder as mlir_dialect/operation_path/operation_name.mlir. The compile.py script uses all available CPUs on your machine to compile the generated MLIR source files in parallel.
$ python3 ../iree/experimental/dispatch_profiler/compile.py --build-dir </path/to/iree/build/dir> --generated-dir </path/to/create/`generated`/dir>
This compiles all the generated MLIR source dispatches. Check a generated dispatch folder to find the vmfb files.
$ ls ./generated/linalg/matmul/matmul_64x64x4096_f16t_f16t_f16t/
iree_compile_cmd_stdout.mlir
matmul_64x64x4096_f16t_f16t_f16t.mlir
matmul_64x64x4096_f16t_f16t_f16t_profile.vmfb
matmul_64x64x4096_f16t_f16t_f16t_verify.vmfb
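The idea of using every CPU to compile many source files at once can be illustrated with a small worker pool. This is a minimal sketch of the general technique, not the implementation in compile.py. The function `run_all` is hypothetical, and the demo commands are placeholders, because the actual iree-compile flags are not shown in this document.

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_all(commands, max_workers=None):
    """Run each command (a list of argv strings) concurrently.

    Returns the exit codes in the same order as `commands`. Threads are
    enough here because the real work happens in the child processes.
    """
    workers = max_workers or os.cpu_count() or 1

    def run(cmd):
        return subprocess.run(cmd, capture_output=True, text=True).returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run, commands))


if __name__ == "__main__":
    # Placeholder commands standing in for one compile invocation per
    # generated .mlir file.
    demo = [[sys.executable, "-c", f"print({i})"] for i in range(4)]
    print(run_all(demo))
```

A non-zero exit code in the returned list identifies which command failed, so the caller can report it without stopping the remaining jobs.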
The tool provides the profiler.py script, which can be used to trigger both verification and profiling for all the compiled dispatches. Some example profiling command lines are shown below:
$ python3 profiler.py --build-dir </path/to/iree/build/dir> --generated-dir </path/to/create/`generated`/dir> --dispatches=matmul_3456x1024x2048_f16t_f16t_f16t_tile_config_128x128_32x5_tensorcore_mmasync --verification-enabled=true --profiling-enabled=true
----------------------------------------------------------------
Dispatch : matmul_3456x1024x2048_f16t_f16t_f16t_tile_config_128x128_32x5_tensorcore_mmasync
Provider : IREE Codegen
OpKind : OperationKind.Matmul
Operation : matmul_3456x1024x2048_f16t_f16t_f16t
Configuration : tile_config_128x128_32x5_tensorcore_mmasync
Arguments : --batch_count=1 --m=3456 --n=1024 --k=2048 --lhs=f16t --rhs=f16t --result=f16t --split_k_mode=N/A --split_k_slices=N/A
Verification : SUCCESS
Runtime(ms) : 0.062
GFLOPs : 233798.62
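The reported GFLOPs value is consistent with counting 2 x M x N x K floating-point operations for a matmul (one multiply and one add per inner-product term) and dividing by the runtime. The sketch below reproduces that arithmetic. The formula is inferred from the sample output rather than taken from the profiler source, and the function name `matmul_gflops` is hypothetical.

```python
def matmul_gflops(m, n, k, runtime_ms, batch_count=1):
    """Throughput in GFLOP/s, assuming 2*M*N*K operations per matmul."""
    flops = 2 * batch_count * m * n * k
    seconds = runtime_ms * 1e-3
    return flops / seconds / 1e9


if __name__ == "__main__":
    # Shape and runtime taken from the sample output above.
    print(f"{matmul_gflops(3456, 1024, 2048, 0.062):.2f}")
```

For the run above, 2 x 3456 x 1024 x 2048 is about 14.5 billion operations; at 0.062 ms that works out to roughly 233,800 GFLOP/s, in line with the reported figure.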
Verification, particularly for large matrix multiplications, can be time-consuming when using a CPU-based numpy reference. When functional correctness is already assured and profiling speed is the priority, disable verification using --verification-enabled=false.
$ python3 profiler.py --build-dir </path/to/iree/build/dir> --generated-dir </path/to/create/`generated`/dir> --dispatches=matmul_3456x1024x2048_f16t_f16t_f16t_tile_config_128x128_32x5_tensorcore_mmasync --verification-enabled=false --profiling-enabled=true
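To give a sense of what reference-based verification involves, the sketch below computes a matmul on the CPU and compares a result against it element by element within a tolerance. It is an illustration only: it uses plain Python lists rather than numpy to stay dependency-free, the function names are hypothetical, and the tolerance values are assumptions rather than the profiler's settings.

```python
import math


def reference_matmul(lhs, rhs):
    """Naive matmul on nested lists; cost grows as M*N*K."""
    m, k, n = len(lhs), len(rhs), len(rhs[0])
    return [
        [sum(lhs[i][p] * rhs[p][j] for p in range(k)) for j in range(n)]
        for i in range(m)
    ]


def matches_reference(result, reference, rel_tol=1e-3, abs_tol=1e-3):
    """True if both matrices have the same shape and close elements."""
    if len(result) != len(reference):
        return False
    for row_r, row_e in zip(result, reference):
        if len(row_r) != len(row_e):
            return False
        for r, e in zip(row_r, row_e):
            if not math.isclose(r, e, rel_tol=rel_tol, abs_tol=abs_tol):
                return False
    return True


if __name__ == "__main__":
    expected = reference_matmul([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]])
    print(matches_reference([[19.0, 22.0], [43.0, 50.0]], expected))
```

Because the reference performs M x N x K multiply-adds on the CPU, its cost scales with the problem size, which is why skipping it speeds up profiling-only runs.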
The --dispatches option accepts a comma-separated list of regex patterns, which makes it possible to profile all tuning configurations generated for an operation. The command-line argument is formatted as --dispatches=<regex>,<regex>. Additionally, you can export the profiled output to a CSV file for further analysis using --output=<filepath>.
$ python3 profiler.py --build-dir </path/to/iree/build/dir> --generated-dir </path/to/create/`generated`/dir> --dispatches=matmul_3456x1024x2048_f16t_f16t_f16t_*_tensorcore_mmasync --verification-enabled=false --profiling-enabled=true --output=data.csv
----------------------------------------------------------------
Dispatch : matmul_3456x1024x2048_f16t_f16t_f16t_tile_config_128x256_32x3_tensorcore_mmasync
Provider : IREE Codegen
OpKind : OperationKind.Matmul
Operation : matmul_3456x1024x2048_f16t_f16t_f16t
Configuration : tile_config_128x256_32x3_tensorcore_mmasync
Arguments : --batch_count=1 --m=3456 --n=1024 --k=2048 --lhs=f16t --rhs=f16t --result=f16t --split_k_mode=N/A --split_k_slices=N/A
Verification : Not verified
Runtime(ms) : 0.062
GFLOPs : 233798.62
----------------------------------------------------------------
Dispatch : matmul_3456x1024x2048_f16t_f16t_f16t_tile_config_128x128_64x4_tensorcore_mmasync
Provider : IREE Codegen
OpKind : OperationKind.Matmul
Operation : matmul_3456x1024x2048_f16t_f16t_f16t
Configuration : tile_config_128x128_64x4_tensorcore_mmasync
Arguments : --batch_count=1 --m=3456 --n=1024 --k=2048 --lhs=f16t --rhs=f16t --result=f16t --split_k_mode=N/A --split_k_slices=N/A
Verification : Not verified
Runtime(ms) : 0.064
GFLOPs : 226492.42
----------------------------------------------------------------
...
----------------------------------------------------------------
Dispatch : matmul_3456x1024x2048_f16t_f16t_f16t_tile_config_64x64_32x10_tensorcore_mmasync
Provider : IREE Codegen
OpKind : OperationKind.Matmul
Operation : matmul_3456x1024x2048_f16t_f16t_f16t
Configuration : tile_config_64x64_32x10_tensorcore_mmasync
Arguments : --batch_count=1 --m=3456 --n=1024 --k=2048 --lhs=f16t --rhs=f16t --result=f16t --split_k_mode=N/A --split_k_slices=N/A
Verification : Not verified
Runtime(ms) : 0.103
GFLOPs : 140733.15
Writing performance report to data.csv
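Selecting dispatches from a comma-separated pattern list can be sketched as follows. The pattern in the command above uses `*` as a wildcard, so this sketch assumes shell-style matching through Python's `fnmatch` module. That is an assumption made for illustration: the profiler's actual matching logic may differ, and `select_dispatches` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def select_dispatches(names, patterns_arg):
    """Return the names that match any pattern in a comma-separated list.

    Assumption: patterns use shell-style wildcards, where `*` matches any
    run of characters.
    """
    patterns = [p for p in patterns_arg.split(",") if p]
    return [n for n in names if any(fnmatchcase(n, p) for p in patterns)]


if __name__ == "__main__":
    prefix = "matmul_3456x1024x2048_f16t_f16t_f16t_"
    names = [
        prefix + "tile_config_128x256_32x3_tensorcore_mmasync",
        prefix + "tile_config_64x64_32x10_tensorcore_mmasync",
        "matmul_3456x1024x2048_f32t_f32t_f32t_tile_config_128x128_16x5_tensorcore_mmasync",
    ]
    for name in select_dispatches(names, prefix + "*_tensorcore_mmasync"):
        print(name)
```

With these inputs, the two f16 dispatches are selected and the f32 dispatch is left out.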
Another example shows the use of --dispatches to profile a matmul_3456x1024x2048 targeting F16 and F32 NVIDIA A100 Tensor Cores.
$ python3 profiler.py --build-dir </path/to/iree/build/dir> --generated-dir </path/to/create/`generated`/dir> --dispatches=matmul_3456x1024x2048_f16t_f16t_f16t_tile_config_128x128_32x5_tensorcore_mmasync,matmul_3456x1024x2048_f32t_f32t_f32t_tile_config_128x128_16x5_tensorcore_mmasync
----------------------------------------------------------------
Dispatch : matmul_3456x1024x2048_f16t_f16t_f16t_tile_config_128x128_32x5_tensorcore_mmasync
Provider : IREE Codegen
OpKind : OperationKind.Matmul
Operation : matmul_3456x1024x2048_f16t_f16t_f16t
Configuration : tile_config_128x128_32x5_tensorcore_mmasync
Arguments : --batch_count=1 --m=3456 --n=1024 --k=2048 --lhs=f16t --rhs=f16t --result=f16t --split_k_mode=N/A --split_k_slices=N/A
Verification : SUCCESS
Runtime(ms) : 0.062
GFLOPs : 233798.62
----------------------------------------------------------------
Dispatch : matmul_3456x1024x2048_f32t_f32t_f32t_tile_config_128x128_16x5_tensorcore_mmasync
Provider : IREE Codegen
OpKind : OperationKind.Matmul
Operation : matmul_3456x1024x2048_f32t_f32t_f32t
Configuration : tile_config_128x128_16x5_tensorcore_mmasync
Arguments : --batch_count=1 --m=3456 --n=1024 --k=2048 --lhs=f32t --rhs=f32t --result=f32t --split_k_mode=N/A --split_k_slices=N/A
Verification : SUCCESS
Runtime(ms) : 0.122
GFLOPs : 118815.69
----------------------------------------------------------------