| # IREE Performance Dashboard |
| |
| This documentation explains IREE's performance dashboard (https://perf.iree.dev). |
A [Buildkite pipeline](https://buildkite.com/iree/iree-benchmark) runs benchmarks
on each commit to the `main` branch and posts the results to the dashboard.
| |
| ## Benchmarking philosophy |
| |
| Benchmarking and interpreting results properly is a delicate task. We can record |
| metrics from various parts of a system, but depending on what we are trying to |
evaluate, those numbers may or may not be relevant. For example, for somebody
working solely on better kernel code generation, the end-to-end model inference
latency is unlikely to be meaningful, given that it also includes runtime
overhead. The environment can also vary between benchmark runs in
uncontrollable ways, causing instability in the results. This is especially
true for mobile and embedded systems, where a tight compromise is made between
performance and thermal/battery limits. Because so many aspects can affect the
results, it's worth noting the general guidelines for IREE benchmarking as
context before going into details.
| |
The overarching goal of benchmarking here is to track IREE's performance
progress and guard against regressions. The benchmarks are therefore meant to
measure the performance of IREE _itself_, not the absolute capability of the
exercised hardware. To fulfill this goal, we follow these guidelines:
| |
| * We choose representative real-world models with varying characteristics. |
| * We cover different IREE backends and different modes for each backend so that |
| folks working on different components can find the metrics they need. |
| |
| ## Model benchmark specification |
| |
| Each benchmark in IREE has a unique identifier with the following format: |
| |
| ``` |
| <model-name> `[` <model-tag>.. `]` `(` <model-source> `)` <benchmark-mode>.. |
| `with` <iree-driver> |
| `@` <device-name> `(` <target-architecture> `)` |
| ``` |
| |
| The following subsections explain possible choices in each field. |
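
For example, composing these fields yields an identifier like the following
(a hypothetical composition for illustration; the exact spelling on the
dashboard may differ slightly):

```
MobileNetV2 [f32,imagenet] (TFLite) 3-thread,big-core,full-inference
with Dylib
@ Pixel-4 (CPU-ARMv8.2-A)
```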
| |
| ### Model source |
| |
This field specifies the original model source; a sample import command is
sketched after the listing:
| |
| ``` |
| ├── TensorFlow |
| │ * models authored in TensorFlow and imported with `iree-import-tf` |
| └── TFLite |
| * models converted to TensorFlow Lite and imported with `iree-import-tflite` |
| ``` |
| |
| ### Model name |
| |
| This field specifies the input model: |
| |
| * `DeepLabV3` [[source](https://tfhub.dev/tensorflow/lite-model/deeplabv3/1/default/1)]: |
| Vision model for semantic image segmentation. |
| Characteristics: convolution, feedforward NN. |
| * `MobileBERT` [[source](https://github.com/google-research/google-research/tree/master/mobilebert)]: |
  NLP model for question answering.
| Characteristics: matmul, attention, feedforward NN. |
| * `MobileNetV2` [[source](https://www.tensorflow.org/api_docs/python/tf/keras/applications/MobileNetV2)]: |
| Vision model for image classification. |
  Characteristics: convolution, feedforward NN.
| * `MobileNetV3Small` [[source](https://www.tensorflow.org/api_docs/python/tf/keras/applications/MobileNetV3Small)]: |
| Vision model for image classification. |
| Characteristics: convolution, feedforward NN. |
| * `MobileSSD` [[source](https://www.tensorflow.org/lite/performance/gpu#demo_app_tutorials)]: |
| Vision model for object detection. |
| Characteristics: convolution, feedforward NN. |
| * `PoseNet` [[source](https://tfhub.dev/tensorflow/lite-model/posenet/mobilenet/float/075/1/default/1)]: |
| Vision model for pose estimation. |
| Characteristics: convolution, feedforward NN. |
| |
| ### Model tag |
| |
| This field specifies the model variant. It depends on the model, but here are |
| some examples: |
| |
* `f32`: the model operates on 32-bit floating-point types.
| * `imagenet`: the model takes ImageNet-sized inputs (224x224x3). |
| |
| ### IREE driver |
| |
This field specifies the IREE HAL driver; a sample benchmark invocation is
sketched after the list:
| |
* [`Dylib`](https://google.github.io/iree/deployment-configurations/cpu-dylib/):
  For CPU via dynamic library. Kernels contain CPU native instructions AOT
  compiled using LLVM. This driver issues workloads to the CPU asynchronously
  and supports multithreading.
* [`Dylib-Sync`](https://google.github.io/iree/deployment-configurations/cpu-dylib/):
  For CPU via dynamic library. Kernels contain CPU native instructions AOT
  compiled using LLVM. This driver issues workloads to the CPU synchronously.
| * [`VMVX`](https://github.com/google/iree/issues/5123): |
| For CPU via dynamic library. Kernels contain vector-level intrinsics that |
| are backed by fast implementations ([WIP](https://github.com/google/iree/issues/5819)). |
  This driver issues workloads to the CPU asynchronously and supports
  multithreading.
| * [`Vulkan`](https://google.github.io/iree/deployment-configurations/gpu-vulkan/): |
  For GPU via Vulkan. Kernels contain SPIR-V. This driver issues workloads to
  the GPU via the Vulkan API.
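
The driver is chosen at runtime when invoking the benchmark tool. Below is a
minimal sketch using `iree-benchmark-module`; the module path, entry function
name, and input shape are placeholders, and the flag spellings reflect the
tool at the time of writing and may change:

```
# Benchmark a compiled module through the CPU dynamic-library driver.
iree-benchmark-module \
  --driver=dylib \
  --module_file=module.vmfb \
  --entry_function=main \
  --function_input=1x224x224x3xf32
```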
| |
| ### Device name and target architecture |
| |
| These two fields are tightly coupled. They specify the device and hardware |
| target for executing the benchmark. |
| |
| Right now there are two Android devices: |
| |
| * `Pixel-4`: Google Pixel 4 running Android 11. The SoC is |
| [Snapdragon 855](https://www.qualcomm.com/products/snapdragon-855-plus-and-860-mobile-platform), |
| with 1+3+4 ARMv8.2 CPU cores and Adreno 640 GPU. |
| * `SM-G980F`: Samsung Galaxy S20 running Android 11. The SoC is |
| [Exynos 990](https://www.samsung.com/semiconductor/minisite/exynos/products/mobileprocessor/exynos-990/), |
| with 2+2+4 ARMv8.2 CPU cores and Mali G77 MP11 GPU. |
| |
Therefore the target architectures are (a compilation sketch follows the list):
| |
* `CPU-ARMv8.2-A`: can benchmark all CPU-based IREE backends and drivers.
| * `GPU-Adreno-640`: can benchmark IREE Vulkan with Adreno target triples. |
| * `GPU-Mali-G77`: can benchmark IREE Vulkan with Mali target triples. |
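
These architectures correspond to compiler flags used when building the
benchmark artifacts. A hedged sketch with `iree-translate` (the target-triple
values shown are representative assumptions, not an exhaustive list):

```
# CPU: AOT-compile kernels to native ARMv8.2-A code for Android.
iree-translate -iree-mlir-to-vm-bytecode-module \
  -iree-hal-target-backends=dylib-llvm-aot \
  -iree-llvm-target-triple=aarch64-none-linux-android29 \
  model.mlir -o module.vmfb

# GPU: compile SPIR-V kernels tuned for a specific Vulkan target.
iree-translate -iree-mlir-to-vm-bytecode-module \
  -iree-hal-target-backends=vulkan-spirv \
  -iree-vulkan-target-triple=adreno-unknown-android11 \
  model.mlir -o module.vmfb
```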
| |
| ### Benchmark mode |
| |
This field further specifies the benchmark variant for the same input model
and target architecture. It controls important aspects like:
| |
* `*-core`: specifies the CPU core flavor (big or little cores).
| * `*-thread`: specifies the number of threads for CPU. |
| * `full-inference`: measures the latency for one full inference. Note that this |
| does not include the IREE system initialization time. |
| * `kernel-execution`: measures only kernel execution latency for GPU. Note that |
| this is only possible for feedforward NN models that can be put into one |
| command buffer. |
| |
`*-core` and `*-thread` together determine the `taskset` affinity mask used
when benchmarking IREE backends and drivers on CPU, as illustrated after the
list. For example,
| |
| * `1-thread,big-core` would mean `taskset 80`. |
| * `1-thread,little-core` would mean `taskset 08`. |
| * `3-thread,big-core` would mean `taskset f0`. |
| * `3-thread,little-core` would mean `taskset 0f`. |
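
Each mask is a hexadecimal CPU-affinity bitmask: bit N selects core N. On
these SoCs the little cores occupy the low bits and the big cores the high
bits, so `80` (binary `10000000`) pins execution to core 7 only, while `f0`
(binary `11110000`) allows any of cores 4-7. A sketch of applying such a mask
on a device (paths and flag spellings are placeholders):

```
# Pin the benchmark to the single big core (mask 80 = core 7 only).
adb shell taskset 80 /data/local/tmp/iree-benchmark-module \
  --driver=dylib \
  --module_file=/data/local/tmp/module.vmfb \
  --entry_function=main
```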