This documentation explains IREE's performance dashboard (https://perf.iree.dev). A Buildkite pipeline runs benchmarks on each commit to the main branch and posts the results to the dashboard.
Benchmarking and interpreting results properly is a delicate task. We can record metrics from various parts of a system, but depending on what we are trying to evaluate, those numbers may or may not be relevant. For example, for somebody working solely on better kernel code generation, the end-to-end model inference latency is unlikely to be meaningful, given that it also includes runtime overhead. The environment can also vary per benchmark run in uncontrollable ways, causing instability in the results. This is especially true for mobile and embedded systems, where a tight compromise is made between performance and thermal/battery limits. Too many aspects can affect benchmarking results. So before going into details, it's worth noting the general guidelines for IREE benchmarking as context.
The overarching goal of benchmarking here is to track IREE's performance progress and guard against regressions. The benchmarks are therefore meant to characterize the performance of IREE itself, not the absolute capability of the hardware being exercised; that goal shapes how the benchmarks below are defined and run.
Each benchmark in IREE has a unique identifier with the following format:
<model-name> `[` <model-tag>.. `]` `(` <model-source> `)` <benchmark-mode>.. `with` <iree-driver> `@` <device-name> `(` <target-architecture> `)`
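For illustration, composing these fields could produce an identifier like the following (a constructed example using values described below, not necessarily one that appears on the dashboard):

```
MobileNetV2 [f32,imagenet] (TFLite) 3-thread,big-core,full-inference with local-task @ Pixel-4 (CPU-ARMv8.2-A)
```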
The following subsections explain possible choices in each field.
This field specifies the original model source:

TFLite
: Models originally in TensorFlow Lite Flatbuffer format.

This field specifies the input model:
DeepLabV3
[source]
: Vision model for semantic image segmentation. Characteristics: convolution, feedforward NN.

MobileBERT
[source]
: NLP model for Q&A. Characteristics: matmul, attention, feedforward NN.

MobileNetV2
[source]
: Vision model for image classification. Characteristics: convolution, feedforward NN.

MobileNetV3Small
[source]
: Vision model for image classification. Characteristics: convolution, feedforward NN.

MobileSSD
[source]
: Vision model for object detection. Characteristics: convolution, feedforward NN.

PoseNet
[source]
: Vision model for pose estimation. Characteristics: convolution, feedforward NN.

This field specifies the model variant. It depends on the model, but here are some examples:
f32
: the model works on float types.

imagenet
: the model takes ImageNet-sized inputs (224x224x3).

This field specifies the IREE HAL driver:
local-task
: For CPU via the local task system. Kernels contain CPU native instructions AOT compiled using LLVM. This driver issues workloads to the CPU asynchronously and supports multithreading.

local-sync
: For CPU via the local ‘sync’ device. Kernels contain CPU native instructions AOT compiled using LLVM. This driver issues workloads to the CPU synchronously.

Vulkan
: For GPU via Vulkan. Kernels contain SPIR-V. This driver issues workloads to the GPU via the Vulkan API.

These two fields are tightly coupled. They specify the device and hardware target for executing the benchmark.
Right now there are three Android devices:
Pixel-4
: Google Pixel 4 running Android 11. The SoC is Snapdragon 855, with 1+3+4 ARMv8.2 CPU cores and Adreno 640 GPU.

Pixel-6
: Google Pixel 6 running Android 12. The SoC is Google Tensor, with 2+2+4 ARMv8 CPU cores and Mali G78 GPU.

SM-G980F
: Samsung Galaxy S20 running Android 11. The SoC is Exynos 990, with 2+2+4 ARMv8.2 CPU cores and Mali G77 MP11 GPU.

Therefore the target architectures are:
CPU-ARMv8.2-A
: can benchmark all CPU-based IREE backends and drivers.

GPU-Adreno-640
: can benchmark IREE Vulkan with Adreno target triples.

GPU-Mali-G77
: can benchmark IREE Vulkan with Mali target triples.

GPU-Mali-G78
: can benchmark IREE Vulkan with Mali target triples.

This field further specifies the benchmark variant, given the same input model and target architecture. It controls important aspects like:
*-core
: specifies the core flavor for CPU.

*-thread
: specifies the number of threads for CPU.

full-inference
: measures the latency for one full inference. Note that this does not include the IREE system initialization time.

kernel-execution
: measures only kernel execution latency for GPU. Note that this is only possible for feedforward NN models that can be put into one command buffer.

`*-core` and `*-thread` together determine the `taskset` mask used for benchmarking IREE backends and drivers on CPU. For example:
* `1-thread,big-core` would mean `taskset 80`.
* `1-thread,little-core` would mean `taskset 08`.
* `3-thread,big-core` would mean `taskset f0`.
* `3-thread,little-core` would mean `taskset 0f`.
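To make the masks concrete, here is a minimal Python sketch (not part of the benchmark tooling) that decodes a `taskset` affinity mask into the CPU indices it selects, assuming the typical Pixel 4 numbering where CPUs 0-3 are the little cores and CPUs 4-7 are the big cores:

```python
def cpus_in_mask(mask: int) -> list[int]:
    """Return the CPU indices enabled by a taskset affinity mask."""
    return [cpu for cpu in range(mask.bit_length()) if mask & (1 << cpu)]

# Assumed Pixel 4 (Snapdragon 855) numbering: CPUs 0-3 little, CPUs 4-7 big/prime.
print(cpus_in_mask(0x80))  # [7]          -> 1-thread,big-core
print(cpus_in_mask(0x08))  # [3]          -> 1-thread,little-core
print(cpus_in_mask(0xF0))  # [4, 5, 6, 7] -> 3-thread,big-core
print(cpus_in_mask(0x0F))  # [0, 1, 2, 3] -> 3-thread,little-core
```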