This guide describes the recommended high-level architecture and steps for adding hardware-specific optimized kernels to TFLite Micro (TFLM).
The goal, both for these optimizations and for the process we recommend for getting them merged into the TFLM codebase, is a measurable and documented performance improvement on a benchmark of interest.
Once the optimizations are merged they will, of course, be used for more than that benchmark, but the context for why they were added remains important.
1. Pick a benchmark that you would like to measure the performance for.
2. Do the groundwork and architecture needed to be able to add in optimizations for your target (more details in the software architecture section).
3. Create one pull request for each optimized kernel, with the PR description clearly stating the commands that were used to measure the performance improvement (see the example below).
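For example, a PR description might show the benchmark being run with and without the optimized kernel. The invocation below is only a sketch: it assumes the keyword benchmark target (`run_keyword_benchmark`) and uses placeholder values for `TARGET` and `OPTIMIZED_KERNEL_DIR`; the exact targets and flags may differ for your port.

```
# Baseline latency with the reference kernels only:
make -f tensorflow/lite/micro/tools/make/Makefile TARGET=<target> run_keyword_benchmark

# Latency with the optimized kernels enabled:
make -f tensorflow/lite/micro/tools/make/Makefile TARGET=<target> OPTIMIZED_KERNEL_DIR=<optimized_kernel_dir> run_keyword_benchmark
```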
We would like to explicitly point out (as have others) that the reference kernel implementations are not performant and there are plenty of opportunities to speed them up. This is by design: the reference kernels are meant to be a shared starting point that is then optimized in target-specific kernel implementations.
Two previous discussions on this topic can be found in PR #42477 and PR #45227.
Our current point of view on this topic is that while optimizing shared reference code in a portable manner is attractive, we are making an explicit choice to not go down that path and instead rely on target-specific optimized implementations. The TFLM codebase has a growing list of optimized kernel implementations, and we are investing in making the process of adding new implementations smoother.
The optimized kernel architecture is composed of three modules: the optimized NN library, the optimized kernels that act as the glue between TFLM and that library, and the build system integration. Each is described below.
The first module, the optimized NN library, uses knowledge of the hardware and compiler to implement the underlying operations. Examples of this are CMSIS-NN from ARM and NNLib from Cadence.
The main benefit of keeping the NN library's API separate from TFLM is that the two can be developed and used independently of each other.
The second module consists of the optimized kernels themselves: (hopefully thin) wrappers that act as the glue between TFLM and the NN library.
The goal here is to delegate as much work as possible to the NN library while still allowing the two APIs (TFLM and NN library) to be independent of each other. If there is a performance degradation due to this (for example, unnecessary memory copies) then we can evaluate those on a case-by-case basis.
This code will be reviewed and merged in the TFLM GitHub repository and must follow the development style of the TFLM codebase.
Some amount of refactoring of the existing code may be needed to ensure that code is suitably shared between the reference and optimized kernels. There is currently no fixed recipe for this refactor and we will evaluate on a case-by-case basis during the PR review.
For example, this was the approach taken to add an optimized fully_connected implementation for the Xtensa Fusion F1.
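As a rough illustration of how such a kernel is exercised during development, the commands below run the fully_connected kernel test first against the reference implementation and then against the Xtensa-optimized one. The `TARGET_ARCH` and `XTENSA_CORE` values are placeholders, and the exact test target name may differ slightly across TFLM versions.

```
# Run the fully_connected kernel test with the reference implementation:
make -f tensorflow/lite/micro/tools/make/Makefile test_kernel_fully_connected_test

# Run the same test with the Xtensa-optimized implementation (placeholder values):
make -f tensorflow/lite/micro/tools/make/Makefile TARGET=xtensa TARGET_ARCH=<arch> XTENSA_CORE=<core> OPTIMIZED_KERNEL_DIR=xtensa test_kernel_fully_connected_test
```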
The third module is the build system integration. It is the least well defined of the three, but we strongly recommend the following:
1. A single target makefile.inc for all the architectures that you would like to support, along with an optional target-specific system_setup.cc. See cortex_m_generic_makefile.inc and xtensa_makefile.inc as examples.
2. A single ext_libs.inc (and associated scripts) that downloads any external dependencies, including the NN library (the existing cmsis_nn and xtensa integrations can serve as examples).
3. The optimized kernels themselves, which live in a kernels subdirectory (e.g. kernels/cmsis_nn and kernels/xtensa).
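Once these pieces are in place, an optimized build is selected purely through make variables. As a concrete illustration (the exact flags for your own port will differ), the existing CMSIS-NN kernels can be built for a Cortex-M4 along these lines:

```
# Build the static library with the CMSIS-NN optimized kernels for a Cortex-M4.
make -f tensorflow/lite/micro/tools/make/Makefile TARGET=cortex_m_generic TARGET_ARCH=cortex-m4 OPTIMIZED_KERNEL_DIR=cmsis_nn microlite
```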
There are two development workflows that the TFLM team would like to encourage and support:

1. Export a static library and headers into a target-specific development environment:

    ```
    make -f tensorflow/lite/micro/tools/make/Makefile TARGET=<target> OPTIMIZED_KERNEL_DIR=<optimize_dir> microlite
    ```

2. Integrate TFLM with an IDE. This has historically been done using the TFLM Makefile's support for project generation. However, given the learning curve and high maintenance overhead, we are moving away from supporting project generation via the Makefile and encourage future IDE integrations to be done outside of the TFLM Makefiles. The TFLM team is currently working through the details on this topic.
The kernel tests are the primary method of ensuring that the optimized kernel implementations are accurate.
Currently, most of the tests require the optimized kernels to be bit-exact to the quantized reference implementation. We can revisit this requirement if it ends up imposing a significant latency cost.
We strongly encourage optimized kernel implementations to have an associated continuous build that runs through all the unit tests and publishes a build badge to the TFLM community supported builds table. Running all the unit tests once a day (for example, with the command sketched below) is often a good place to start.
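A nightly CI job can be as simple as invoking the top-level test target with your optimized kernel directory enabled. This is a sketch with placeholder values for `TARGET` and `OPTIMIZED_KERNEL_DIR`; the exact flags (for example, any toolchain-specific variables) will depend on your port.

```
# Hypothetical nightly CI step: build and run all TFLM unit tests with the
# optimized kernels enabled (placeholder values shown).
make -f tensorflow/lite/micro/tools/make/Makefile TARGET=<target> OPTIMIZED_KERNEL_DIR=<optimized_kernel_dir> test
```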