| # Benchmark IREE and TFLite |
| |
| We use our end-to-end TensorFlow integration tests to test compilation and |
| numerical accuracy, and to generate compilation and benchmarking artifacts. |
| This allows us to validate that our benchmarks are behaving as we expect them |
| to, and to run them using valid inputs for each model. |
| |
This guide assumes that you can run the TensorFlow integration tests. See
| [this doc](https://google.github.io/iree/developing-iree/tensorflow-integrations) |
| for more information. That doc also covers writing new tests, which you'll need |
| to do if you'd like to benchmark a new TensorFlow model. |
| |
| ## 1. Run IREE's E2E TensorFlow tests to generate the benchmarking artifacts |
| |
| The `get_e2e_artifacts.py` script compiles and tests all of our integrated |
| TensorFlow models, and gathers their compilation and benchmarking artifacts in |
| `/tmp/iree/modules/`. |
| |
| ```shell |
| # By default `get_e2e_artifacts.py` will run all of our test suites, including |
| # those that take a long time to complete, so we specify |
| # `--test_suites=e2e_tests` to only run the smaller tests. |
| python3 ./scripts/get_e2e_artifacts.py --test_suites=e2e_tests |
| ``` |
| |
| Each test/module has a folder with the following artifacts (filtered to only |
| include those relevant for benchmarking): |
| |
| ```shell |
| # Example for a generic module `ModuleName`: |
| /tmp/iree/modules/ModuleName |
| ├── iree_vmla # Or any other IREE backend. |
| │ ├── compiled.vmfb |
| │ │ # A flatbuffer containing IREE's compiled code. |
| │ └── traces |
| │ # Directory with a trace for each unittest in vision_model_test.py. |
| │ ├── traced_function_1 |
| │ │ # Directory storing logs and serialization for a specific trace. |
| │ │ └── flagfile |
| │ │ # An Abseil flagfile containing arguments |
| │ │ # iree-benchmark-module needs to benchmark this trace. |
| │ ├── traced_function_2 |
| │ └── ... |
| └── tflite |
| ├── module_method_1.tflite |
| │ # A method on ModuleName compiled to bytes with TFLite, which can |
| │ # be used by the TFLite's benchmark_model binary. |
| ├── module_method_2.tflite |
| ├── ... |
| └── traces |
| ├── traced_function_1 |
| │ └── graph_path |
| │ # In general, a trace's name does not have to match the name |
| │ # of the method(s) on the tf.Module that it calls. This file |
| │ # points to the correct module_method_*.tflite graph file |
| │ # for TFLite's benchmark_model to use. |
| ├── traced_function_2 |
| └── ... |
| |
| # Example for MatrixOpsStaticModule: |
| /tmp/iree/modules/MatrixOpsStaticModule |
| ├── iree_llvmjit |
| │ ├── compiled.vmfb |
| │ └── traces |
| │ ├── basic_matmul |
| │ │ └── flagfile |
| │ ├── matmul_broadcast_singleton_dimension |
| │ │ └── flagfile |
| │ ├── matmul_lhs_batch |
| │ │ └── flagfile |
| │ └── matmul_rhs_batch |
| │ └── flagfile |
| ├── iree_vmla |
| │ ├── compiled.vmfb |
| │ └── traces # ...same as iree_llvmjit/traces above. |
| ├── iree_vulkan |
| │ ├── compiled.vmfb |
| │ └── traces # ...same as iree_llvmjit/traces above. |
| └── tflite |
| ├── basic_matmul.tflite |
| ├── matmul_broadcast_singleton_dimension.tflite |
| ├── matmul_lhs_batch.tflite |
| ├── matmul_rhs_batch.tflite |
| └── traces |
| ├── basic_matmul |
| │ └── graph_path |
| ├── matmul_broadcast_singleton_dimension |
| │ └── graph_path |
| ├── matmul_lhs_batch |
| │ └── graph_path |
| └── matmul_rhs_batch |
| └── graph_path |
| ``` |
| |
| ### Optional: Compile the Keras Applications Vision tests |
| |
| The vision tests take a while to run, so we exclude them from the CI and |
| wildcard expansion. They can be run by invoking the following test suite: |
| |
| ```shell |
| python3 ./scripts/get_e2e_artifacts.py --test_suites=vision_external_tests |
| ``` |
| |
The previous command compiles `MobileNet`, `MobileNetV2` and `ResNet50` with
`cifar10` and `imagenet` weights for all backends. The artifacts generated by
this test suite differ slightly from those above: they are organized under
`/tmp/iree/modules/ModelName/Dataset/backends` instead of just
`/tmp/iree/modules/ModelName/backends`.
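
For example, after running the vision suite the `ResNet50` artifacts are laid
out roughly as follows (an illustrative sketch; each backend directory mirrors
the generic example above):

```shell
/tmp/iree/modules/ResNet50
├── cifar10
│   ├── iree_vmla
│   │   ├── compiled.vmfb
│   │   └── traces
│   │       └── predict
│   │           └── flagfile
│   ├── iree_llvmjit  # ...same layout as iree_vmla above.
│   ├── iree_vulkan   # ...same layout as iree_vmla above.
│   └── tflite
│       ├── predict.tflite
│       └── traces
│           └── predict
│               └── graph_path
└── imagenet
    └── ...           # ...same layout as cifar10 above.
```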
| |
| ### Optional: Manually get the benchmarking artifacts for a specific test |
| |
| You can manually get the benchmarking artifacts for a specific test by using |
| `bazel run` on the `_manual` binary target we create for each test. This will |
| automatically store the benchmarking artifacts in `/tmp/iree/modules/`. |
| |
| ```shell |
| bazel run //integrations/tensorflow/e2e:matrix_ops_static_test_manual -- \ |
| --target_backends=iree_vmla,tflite |
| ``` |
| |
| ## 2. Benchmarking IREE on desktop |
| |
| ### 2.1 Optional: Build the `iree-benchmark-module` |
| |
This step is optional, but it allows you to run the benchmarks without having
`bazel` running at the same time.
| |
| ```shell |
| bazel build -c opt //iree/tools:iree-benchmark-module |
| ``` |
| |
This creates `bazel-bin/iree/tools/iree-benchmark-module`. The rest of the guide
uses this binary, but you could also use
`bazel run //iree/tools:iree-benchmark-module` in its place if you prefer.
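
For example, the benchmark in section 2.2 below could equivalently be run with
`bazel run` (note the `--` separating `bazel`'s own flags from the binary's
flags):

```shell
bazel run -c opt //iree/tools:iree-benchmark-module -- \
  --flagfile="/tmp/iree/modules/MatrixOpsStaticModule/iree_vmla/traces/matmul_lhs_batch/flagfile"
```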
| |
| ### 2.2 Benchmark the model on IREE |
| |
The E2E tests generate a flagfile with all of the information that
`iree-benchmark-module` needs to benchmark each trace. Specifically, the
flagfile provides the following flags:
| |
| | Flag | Description | |
| |-------------------|--------------------------------------------------| |
| | --module_file | Absolute path to the IREE compiled VM flatbuffer | |
| | --function_inputs | A comma delimited string of input tensors | |
| | --driver | The backend driver to use for the benchmark | |
| | --entry_function | The method on the TensorFlow module to benchmark | |
| |
| You can find the flagfile to benchmark a specific TensorFlow module on a |
| specific IREE backend and trace at the following path: |
| |
| ```shell |
| /tmp/iree/modules/ModuleName/backend/traces/trace_name/flagfile |
| ``` |
| |
| For example, if we wanted to benchmark a static left-hand-side batched matmul |
| using `MatrixOpsStaticModule` on VMLA we would run the following command: |
| |
| ```shell |
| ./bazel-bin/iree/tools/iree-benchmark-module \ |
| --flagfile="/tmp/iree/modules/MatrixOpsStaticModule/iree_vmla/traces/matmul_lhs_batch/flagfile" |
| ``` |
| |
| If you ran the Keras Applications vision test suite, then you'll be able to |
| benchmark `ResNet50`, `MobileNet` or `MobileNetV2` with `cifar10` or `imagenet` |
| weights. For example: |
| |
| ```shell |
| ./bazel-bin/iree/tools/iree-benchmark-module \ |
| --flagfile="/tmp/iree/modules/ResNet50/cifar10/iree_vmla/traces/predict/flagfile" |
| ``` |
| |
| ## 3. Benchmarking TFLite on desktop |
| |
| ### 3.1 Build TFLite's `benchmark_model` binary |
| |
| ```shell |
| # Enter the TensorFlow Bazel workspace. |
| cd third_party/tensorflow/ |
| |
| # Build the benchmark_model binary without RUY... |
| bazel build --copt=-mavx2 -c opt \ |
| //tensorflow/lite/tools/benchmark:benchmark_model |
| |
| # ...or build the benchmark_model binary with RUY. This will overwrite the |
| # previous binary unless you move it. |
| bazel build --copt=-mavx2 -c opt \ |
| --define=tflite_with_ruy=true \ |
| //tensorflow/lite/tools/benchmark:benchmark_model |
| |
| # The binary can now be found in the following directory: |
| ls bazel-bin/tensorflow/lite/tools/benchmark/ |
| ``` |
| |
| ### 3.2 Benchmark the model on TFLite |
| |
| TFLite doesn't support flagfiles, so we need to manually pass the path to the |
| graph file via `cat`. TFLite will generate fake inputs for the model. |
| |
| Using `MatrixOpsStaticModule`'s left-hand-side batched matmul again as an |
| example we can run the benchmark as follows: |
| |
| ```shell |
| # Run within `third_party/tensorflow/`. |
| ./bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model \ |
| --graph=$(cat "/tmp/iree/modules/MatrixOpsStaticModule/tflite/traces/matmul_lhs_batch/graph_path") \ |
| --warmup_runs=1 \ |
| --num_threads=1 \ |
| --num_runs=100 \ |
| --enable_op_profiling=true |
| ``` |
| |
| ## 4. Benchmarking IREE on Android |
| |
| ### 4.1 Prepare the benchmarking tools |
| |
| IREE only supports compiling to Android with CMake. Documentation on setting up |
| your environment to cross-compile to Android can be found |
| [here](https://google.github.io/iree/get-started/getting-started-android-cmake). |
| |
| ```shell |
| # After following the instructions above up to 'Build all targets', the |
| # iree-benchmark-module binary should be in the following directory: |
| ls build-android/iree/tools/ |
| |
# Copy the benchmarking binary to the phone.
| adb push build-android/iree/tools/iree-benchmark-module /data/local/tmp |
| |
| # Allow executing benchmarking file as a program. |
| adb shell chmod +x /data/local/tmp/iree-benchmark-module |
| ``` |
| |
### 4.2 Push IREE's compilation / benchmarking artifacts to the device
| |
| In this example we'll only copy over the files we need to benchmark a single |
| module on a single backend, but you can easily copy all of the modules over |
| as well. |
| |
| Using `MatrixOpsStaticModule`'s left-hand-side batched matmul again as an |
| example: |
| |
| ```shell |
# Make a directory on the device for the module/backend pair we want to
# benchmark.
adb shell mkdir -p /data/local/tmp/MatrixOpsStaticModule/iree_vmla/
| |
| # Transfer the files. |
| adb push /tmp/iree/modules/MatrixOpsStaticModule/iree_vmla/* \ |
| /data/local/tmp/MatrixOpsStaticModule/iree_vmla/ |
| ``` |
| |
| ### 4.3 Benchmark the module |
| |
| ```shell |
adb shell /data/local/tmp/iree-benchmark-module \
  --flagfile="/data/local/tmp/MatrixOpsStaticModule/iree_vmla/traces/matmul_lhs_batch/flagfile" \
  --module_file="/data/local/tmp/MatrixOpsStaticModule/iree_vmla/compiled.vmfb"
| ``` |
| |
Note: Because the flagfile uses absolute paths, the `--module_file` flag must be
specified manually whenever the location of the compiled flatbuffer
(`compiled.vmfb`) changes. The flagfile still takes care of specifying the input
data, driver and entry function, however.
| |
| ## 5. Benchmarking TFLite on Android |
| |
| ### 5.1 Prepare the benchmarking tools |
| |
| There are three options for getting TFLite's `benchmark_model` binary for |
| Android. |
| |
| The first two are to build it directly, either in a |
| [`docker` container](https://www.tensorflow.org/lite/guide/build_android#set_up_build_environment_using_docker) |
| or |
[in your own environment](https://www.tensorflow.org/lite/guide/build_android#set_up_build_environment_without_docker). Assuming you can build
TensorFlow for Android, you can build the TFLite `benchmark_model` binary in the
following configurations:
| |
| ```shell |
| # Build the benchmark_model binary without any add-ons. |
| bazel build -c opt \ |
| --config=android_arm64 \ |
| --cxxopt='--std=c++17' \ |
| //tensorflow/lite/tools/benchmark:benchmark_model |
| |
# Copy the benchmarking binary to the phone and allow execution.
| adb push bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model \ |
| /data/local/tmp |
| adb shell chmod +x /data/local/tmp/benchmark_model |
| ``` |
| |
| ```shell |
# Build the benchmark_model binary with RUY.
bazel build -c opt \
| --config=android_arm64 \ |
| --cxxopt='--std=c++17' \ |
| --define=tflite_with_ruy=true \ |
| --copt=-DTFLITE_WITH_RUY_GEMV \ |
| //tensorflow/lite/tools/benchmark:benchmark_model |
| |
| # Rename the binary for comparison with the standard benchmark_model. |
| mv bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model \ |
| bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model_plus_ruy |
| adb push bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model_plus_ruy \ |
| /data/local/tmp/ |
| adb shell chmod +x /data/local/tmp/benchmark_model_plus_ruy |
| ``` |
| |
| ```shell |
| # Build the benchmark_model binary with flex. |
| bazel build -c opt \ |
| --config=android_arm64 \ |
| --cxxopt='--std=c++17' \ |
| //tensorflow/lite/tools/benchmark:benchmark_model_plus_flex |
| |
# Copy the benchmarking binary to the phone and allow execution.
| adb push bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model_plus_flex \ |
| /data/local/tmp |
| adb shell chmod +x /data/local/tmp/benchmark_model_plus_flex |
| ``` |
| |
| Alternatively, you can download and install the |
| [Android Benchmark App](https://www.tensorflow.org/lite/performance/measurement#android_benchmark_app). If you choose to install the app then |
| you'll have to modify the benchmarking commands below slightly, as shown in |
| [this example](https://www.tensorflow.org/lite/performance/measurement#run_benchmark). |
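
For instance, with the app installed, the first benchmark in the next section
would be launched roughly as follows (a sketch based on the linked
instructions; the flags are forwarded through the `--es args` intent extra):

```shell
adb shell am start -S \
  -n org.tensorflow.lite.benchmark/.BenchmarkModelActivity \
  --es args '"--graph=/data/local/tmp/MatrixOpsStaticModule/tflite/matmul_lhs_batch.tflite --num_threads=1 --num_runs=10"'
```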
| |
| ### 5.2 Run the benchmark |
| |
| ```shell |
# Copy the data over to the phone.
adb shell mkdir -p /data/local/tmp/MatrixOpsStaticModule/tflite
| adb push /tmp/iree/modules/MatrixOpsStaticModule/tflite/* \ |
| /data/local/tmp/MatrixOpsStaticModule/tflite/ |
| ``` |
| |
| ```shell |
| # Benchmark with TFLite. |
| adb shell taskset f0 /data/local/tmp/benchmark_model \ |
| --graph=/data/local/tmp/MatrixOpsStaticModule/tflite/matmul_lhs_batch.tflite \ |
| --warmup_runs=1 \ |
| --num_threads=1 \ |
| --num_runs=10 \ |
| --enable_op_profiling=true |
| ``` |
| |
| ```shell |
| # Benchmark with TFLite + RUY. |
| adb shell taskset f0 /data/local/tmp/benchmark_model_plus_ruy \ |
| --graph=/data/local/tmp/MatrixOpsStaticModule/tflite/matmul_lhs_batch.tflite \ |
| --warmup_runs=1 \ |
| --num_threads=1 \ |
| --num_runs=10 \ |
| --enable_op_profiling=true |
| ``` |
| |
| ```shell |
| # Benchmark with TFLite + Flex. |
| adb shell taskset f0 /data/local/tmp/benchmark_model_plus_flex \ |
| --graph=/data/local/tmp/MatrixOpsStaticModule/tflite/matmul_lhs_batch.tflite \ |
| --warmup_runs=1 \ |
| --num_threads=1 \ |
| --num_runs=10 \ |
| --enable_op_profiling=true |
| ``` |
| |
| ```shell |
| # Benchmark with TFLite running on GPU. |
| adb shell taskset f0 /data/local/tmp/benchmark_model \ |
| --graph=/data/local/tmp/MatrixOpsStaticModule/tflite/matmul_lhs_batch.tflite \ |
| --warmup_runs=1 \ |
| --num_threads=1 \ |
| --num_runs=10 \ |
| --enable_op_profiling=true \ |
| --use_gpu=true |
| ``` |
| |
Running the benchmark on the GPU won't provide op profiling information. To get
detailed profiling information for the GPU, you can run the following script:
| |
| ```shell |
| # Op profiling on GPU using OpenCL backend. |
| sh tensorflow/lite/delegates/gpu/cl/testing/run_performance_profiling.sh \ |
| -m /data/local/tmp/MatrixOpsStaticModule/tflite/matmul_lhs_batch.tflite |
| ``` |
| |
| Note: You will have to manually specify the TFLite graph that you want to |
| benchmark, as the `graph_path` file assumes that the graph has not moved. The |
| name of the `.tflite` graph that you need to benchmark _may_ be different from |
| the name of the trace that you want to benchmark, but you can use `cat` on |
| the `graph_path` file to verify the correct `.tflite` filename if you're unsure. |
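
For example:

```shell
# Print the absolute path of the .tflite graph that this trace benchmarks.
cat /tmp/iree/modules/MatrixOpsStaticModule/tflite/traces/matmul_lhs_batch/graph_path
```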