| commit | cb31cbdaa666dc15d822aaec25846599d260a000 | |
|---|---|---|
| author | Manish Gupta <manigupta@google.com> | Tue May 09 15:21:13 2023 -0700 |
| committer | GitHub <noreply@github.com> | Tue May 09 15:21:13 2023 -0700 |
| tree | c1dac24acd965f9a1ad0b7007ab8786ef17d7bfc | |
| parent | ebf8e51ee9bd0165e2e63cfc8cbb28edde44077f | |
Adding Batch Matmul and Matmul with Split-K to IREE Dispatch Profiler (#13396)

This PR upstreams the addition of batch_matmul and matmul with the split-k variation to the IREE dispatch profiler. It partially addresses issues #13281 and #13290 by adding the means to functionally verify and performance-profile these operations.

## Current Usage of IREE Dispatch Profiler

The usage is mostly going to remain as per the notes below, but expect minor changes as this work progresses.

## Generating and compiling the generated MLIR to vmfb

```bash
# Generate default shapes for matmul, batch_matmul, and split-k
build-debug $ python3 ../iree/experimental/dispatch_profiler/generator.py

# Compile the generated dispatches
build-debug $ python3 ../iree/experimental/dispatch_profiler/compile.py
```

## Functional verification and performance profiling of the generated dispatches

```
# Run a single matmul dispatch `matmul_1024x512x2048_f16t_f16t_f16t_tile_config_128x64_32x5_tensorcore_mmasync`
build-debug $ python3 ../iree/experimental/dispatch_profiler/profiler.py --dispatches=matmul_1024x512x2048_f16t_f16t_f16t_tile_config_128x64_32x5_tensorcore_mmasync
----------------------------------------------------------------
Dispatch : matmul_1024x512x2048_f16t_f16t_f16t_tile_config_128x64_32x5_tensorcore_mmasync
Provider : IREE Codegen
OpKind : OperationKind.Matmul
Operation : matmul_1024x512x2048_f16t_f16t_f16t
Configuration : tile_config_128x64_32x5_tensorcore_mmasync
Arguments : --batch_count=1 --m=1024 --n=512 --k=2048 --lhs=f16t --rhs=f16t --result=f16t --split_k_mode=N/A --split_k_slices=N/A
Verification : SUCCESS
Runtime(ms) : 0.023
GFLOPs : 93368.85

# Run a matmul operation across tuning configurations and dump the results to a csv file
build-debug $ python3 ../iree/experimental/dispatch_profiler/profiler.py --dispatches=matmul_1024x512x2048_f16t_f16t_f16t_tile_config_*_tensorcore_mmasync --output=data.csv
----------------------------------------------------------------
Dispatch : matmul_1024x512x2048_f16t_f16t_f16t_tile_config_256x128_32x3_tensorcore_mmasync
Provider : IREE Codegen
OpKind : OperationKind.Matmul
Operation : matmul_1024x512x2048_f16t_f16t_f16t
Configuration : tile_config_256x128_32x3_tensorcore_mmasync
Arguments : --batch_count=1 --m=1024 --n=512 --k=2048 --lhs=f16t --rhs=f16t --result=f16t --split_k_mode=N/A --split_k_slices=N/A
Verification : SUCCESS
Runtime(ms) : 0.058
GFLOPs : 37025.58
----------------------------------------------------------------
...
```
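An aside on the reported numbers before the remainder of the sweep output: the GFLOPs figure is consistent with the standard 2·m·n·k FLOP count for a matmul divided by the measured runtime. A minimal sketch of that arithmetic, under the assumption (not confirmed by this PR) that the profiler derives the number exactly this way:

```python
# Hedged sketch: reproduce the GFLOPs figure reported above from the
# dispatch shape and measured runtime. 2*m*n*k is the conventional matmul
# FLOP count (one multiply + one add per multiply-accumulate).
m, n, k = 1024, 512, 2048
runtime_ms = 0.023

flops = 2 * m * n * k                       # total floating-point operations
gflops = flops / (runtime_ms * 1e-3) / 1e9  # FLOPs / seconds / 1e9
print(f"{gflops:.2f}")                      # -> 93368.85, matching the report above
```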
```
Dispatch : matmul_1024x512x2048_f16t_f16t_f16t_tile_config_128x128_64x4_tensorcore_mmasync
Provider : IREE Codegen
OpKind : OperationKind.Matmul
Operation : matmul_1024x512x2048_f16t_f16t_f16t
Configuration : tile_config_128x128_64x4_tensorcore_mmasync
Arguments : --batch_count=1 --m=1024 --n=512 --k=2048 --lhs=f16t --rhs=f16t --result=f16t --split_k_mode=N/A --split_k_slices=N/A
Verification : SUCCESS
Runtime(ms) : 0.031
GFLOPs : 69273.67
----------------------------------------------------------------
Dispatch : matmul_1024x512x2048_f16t_f16t_f16t_tile_config_128x128_32x5_tensorcore_mmasync
Provider : IREE Codegen
OpKind : OperationKind.Matmul
Operation : matmul_1024x512x2048_f16t_f16t_f16t
Configuration : tile_config_128x128_32x5_tensorcore_mmasync
Arguments : --batch_count=1 --m=1024 --n=512 --k=2048 --lhs=f16t --rhs=f16t --result=f16t --split_k_mode=N/A --split_k_slices=N/A
Verification : SUCCESS
Runtime(ms) : 0.031
GFLOPs : 69273.67
----------------------------------------------------------------
Dispatch : matmul_1024x512x2048_f16t_f16t_f16t_tile_config_128x64_32x5_tensorcore_mmasync
Provider : IREE Codegen
OpKind : OperationKind.Matmul
Operation : matmul_1024x512x2048_f16t_f16t_f16t
Configuration : tile_config_128x64_32x5_tensorcore_mmasync
Arguments : --batch_count=1 --m=1024 --n=512 --k=2048 --lhs=f16t --rhs=f16t --result=f16t --split_k_mode=N/A --split_k_slices=N/A
Verification : SUCCESS
Runtime(ms) : 0.023
GFLOPs : 93368.85
----------------------------------------------------------------
Writing performance report to data.csv

# Run batch_matmul
$ python3 ../iree/experimental/dispatch_profiler/profiler.py --dispatches=batch_matmul* --output=data.csv
----------------------------------------------------------------
Dispatch : batch_matmul_16x512x64x512_f16t_f16t_f16t_tile_config_128x64_32x5_tensorcore_mmasync
Provider : IREE Codegen
OpKind : OperationKind.BatchMatmul
Operation : batch_matmul_16x512x64x512_f16t_f16t_f16t
Configuration : tile_config_128x64_32x5_tensorcore_mmasync
Arguments : --batch_count=16 --m=512 --n=64 --k=512 --lhs=f16t --rhs=f16t --result=f16t --split_k_mode=N/A --split_k_slices=N/A
Verification : SUCCESS
Runtime(ms) : 0.009
GFLOPs : 59652.32
----------------------------------------------------------------
```

## Reference cache

Note that the reference is run once and then cached. For a large matmul, computing the reference can take ~30 s, while a run that hits the cache completes within ~4 s.
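The caching idea is simple to picture: compute the numpy reference result once, persist it as an `.npy` file, and reload it on later runs. A minimal sketch of the idea; the helper name and cache layout below are illustrative assumptions, not the profiler's actual code:

```python
import os
import numpy as np

def get_reference_result(cache_dir, op_name, lhs, rhs):
    """Return the reference matmul result, computing it only on a cache miss.

    `get_reference_result`, `cache_dir`, and the file naming are hypothetical,
    for illustration; the real profiler's reference_cache layout may differ.
    """
    cache_path = os.path.join(cache_dir, f"{op_name}_reference.npy")
    if os.path.exists(cache_path):
        return np.load(cache_path)   # fast path: reload cached reference
    result = lhs @ rhs               # slow path: full numpy reference matmul
    os.makedirs(cache_dir, exist_ok=True)
    np.save(cache_path, result)
    return result
```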
```bash
# Running the numpy reference
$ time python3 ../iree/experimental/dispatch_profiler/profiler.py --dispatches=matmul_3456x1024x2048_f16t_f16t_f16t_tile_config_128x128_32x5_tensorcore_mmasync --output=data.csv
----------------------------------------------------------------
Dispatch : matmul_3456x1024x2048_f16t_f16t_f16t_tile_config_128x128_32x5_tensorcore_mmasync
Provider : IREE Codegen
OpKind : OperationKind.Matmul
Operation : matmul_3456x1024x2048_f16t_f16t_f16t
Configuration : tile_config_128x128_32x5_tensorcore_mmasync
Arguments : --batch_count=1 --m=3456 --n=1024 --k=2048 --lhs=f16t --rhs=f16t --result=f16t --split_k_mode=N/A --split_k_slices=N/A
Verification : SUCCESS
Runtime(ms) : 0.062
GFLOPs : 233798.62
Writing performance report to data.csv

real    0m30.083s
user    0m29.486s
sys     0m1.634s

# Skipping the reference run as the reference npy files are in `reference_cache`
$ time python3 ../iree/experimental/dispatch_profiler/profiler.py --dispatches=matmul_3456x1024x2048_f16t_f16t_f16t_tile_config_128x128_32x5_tensorcore_mmasync --output=data.csv
----------------------------------------------------------------
Dispatch : matmul_3456x1024x2048_f16t_f16t_f16t_tile_config_128x128_32x5_tensorcore_mmasync
Provider : IREE Codegen
OpKind : OperationKind.Matmul
Operation : matmul_3456x1024x2048_f16t_f16t_f16t
Configuration : tile_config_128x128_32x5_tensorcore_mmasync
Arguments : --batch_count=1 --m=3456 --n=1024 --k=2048 --lhs=f16t --rhs=f16t --result=f16t --split_k_mode=N/A --split_k_slices=N/A
Verification : SUCCESS
Runtime(ms) : 0.062
GFLOPs : 233798.62
Writing performance report to data.csv

real    0m4.779s
user    0m4.263s
sys     0m1.617s
```

## Rough edges and notes for future progress

- This PR also updates the generator to accept matmul problem shapes as command-line arguments. The user can pass a comma-separated list and/or a Python-style range of values (a sketch of this parsing follows at the end of this section). Try the command below and inspect what gets generated:

```bash
$ python3 ../iree/experimental/dispatch_profiler/generator.py --problem-m=768,512:1024:128 --problem-n=768 --problem-k=32:512:64
```

One can check the `generated` folder as follows:

```
build-debug $ ls -R generated/
generated/:
linalg  reference_cache

generated/linalg:
batch_matmul  matmul  matmul_splitk
...
```

- Passing operation-specific command-line arguments (matmul being our first example) is not yet where I imagined it would be. I am experimenting with various ways to use argparse in a reduced version [here](https://github.com/manishucsd/python_explorations/tree/main/argparse); one possible subcommand layout is sketched after this list. Suggestions are welcome. The envisioned interface:

```
# Run all the operations that were previously generated
$ profiler.py

# General help
$ profiler.py --help

# Operation-specific help
$ profiler.py --op-kind matmul --help

# Generate matmul
$ generator.py --op-kind=matmul --m=768 --n=512 --k=1024

# Operation-specific run on a shape that is already generated
$ profiler.py --op-kind matmul --m=768 --n=512 --k=1024
```
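For the shape-argument syntax above, here is a minimal sketch of how a value string such as `768,512:1024:128` could be expanded, assuming Python `range()` semantics (exclusive stop); the helper name and exact semantics are illustrative, not necessarily what the generator implements:

```python
def parse_shape_values(spec):
    """Expand a shape spec like "768,512:1024:128" into a list of ints.

    Hypothetical helper for illustration: comma-separated entries, each
    either a single value or a Python-style start:stop:step range
    (exclusive stop, as in range()); the real generator may differ.
    """
    values = []
    for entry in spec.split(","):
        if ":" in entry:
            start, stop, step = (int(v) for v in entry.split(":"))
            values.extend(range(start, stop, step))
        else:
            values.append(int(entry))
    return values

print(parse_shape_values("768,512:1024:128"))  # [768, 512, 640, 768, 896]
```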
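For the operation-specific arguments, one common argparse pattern is a subcommand per operation kind, each with its own flags and `--help` text. A hedged sketch of how such an interface could be structured; this is one possible design for discussion, not the layout this PR settles on, and the mapping between `--op-kind` and subcommands is an open choice:

```python
import argparse

def build_parser():
    # Illustrative sketch: one subparser per operation kind, so that
    # operation-specific help prints only that operation's flags.
    parser = argparse.ArgumentParser(prog="profiler.py")
    subparsers = parser.add_subparsers(dest="op_kind")

    matmul = subparsers.add_parser("matmul", help="profile matmul dispatches")
    matmul.add_argument("--m", type=int, required=True)
    matmul.add_argument("--n", type=int, required=True)
    matmul.add_argument("--k", type=int, required=True)

    batch_matmul = subparsers.add_parser(
        "batch_matmul", help="profile batch_matmul dispatches")
    batch_matmul.add_argument("--batch-count", type=int, required=True)
    batch_matmul.add_argument("--m", type=int, required=True)
    batch_matmul.add_argument("--n", type=int, required=True)
    batch_matmul.add_argument("--k", type=int, required=True)

    return parser

args = build_parser().parse_args(
    ["matmul", "--m", "768", "--n", "512", "--k", "1024"])
print(args)  # Namespace(op_kind='matmul', m=768, n=512, k=1024)
```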
IREE (Intermediate Representation Execution Environment, pronounced as “eerie”) is an MLIR-based end-to-end compiler and runtime that lowers Machine Learning (ML) models to a unified IR that scales up to meet the needs of the datacenter and down to satisfy the constraints and special considerations of mobile and edge deployments.
See our website for project details, user guides, and instructions on building from source.
IREE is still in its early phase. We have settled on the overarching infrastructure and are actively improving various software components as well as project logistics. It is still quite far from ready for everyday use and is made available without any support at the moment. With that said, we welcome feedback through any of our communication channels!
IREE is licensed under the terms of the Apache 2.0 License with LLVM Exceptions. See LICENSE for more information.