| Status | Proposed |
| :--- | :--- |
| RFC # | 46767 |
| Author(s) | Daniel Situnayake (dan@edgeimpulse.com) |
| Sponsor | Pete Warden (petewarden@google.com) |
| Updated | 2021-01-28 |
TensorFlow Lite has kernel implementations that support 8 bit quantized weights with 16 bit activations. We wish to port these implementations to TensorFlow Lite for Microcontrollers. The increased precision available for activations can improve accuracy for some quantized models.
Arm have agreed to support the initiative by adding the necessary 16x8 APIs to CMSIS-NN and porting the CMSIS-NN kernels.
Some networks that suffer unacceptable degradation when quantized with 8 bit weights and 8 bit activations perform adequately when quantized with 8 bit weights and 16 bit activations. The TensorFlow Lite documentation states the following:
> [16x8 quantization] mode can improve accuracy of the quantized model significantly, when activations are sensitive to the quantization, while still achieving almost 3-4x reduction in model size. Moreover, this fully quantized model can be consumed by integer-only hardware accelerators.
Edge Impulse, a company that deploys TensorFlow Lite for Microcontrollers as part of its embedded machine learning pipeline, has gathered feedback from customers with production models for which 8 bit quantization results in unacceptable degradation but for which 16x8 quantization delivers acceptable accuracy.
While 16x8 quantization is well supported within TensorFlow Lite, it is not currently supported within TensorFlow Lite for Microcontrollers. Porting the TensorFlow Lite reference kernels is relatively straightforward and will improve adoption of TensorFlow Lite for Microcontrollers among users for whom degradation is too severe with full 8 bit quantization.
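For reference, 16x8 mode is already available in the TensorFlow Lite converter as part of post-training quantization. The snippet below is a minimal sketch of enabling it; the saved-model path, input shape, and representative dataset are placeholder assumptions, not part of this RFC.

```python
import numpy as np
import tensorflow as tf

def representative_dataset():
  # Placeholder calibration data; match your model's real input shape and distribution.
  for _ in range(100):
    yield [np.random.rand(1, 96, 96, 1).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")  # placeholder path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Request 8 bit weights with 16 bit activations (the "16x8" mode).
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.EXPERIMENTAL_TFLITE_BUILTINS_ACTIVATIONS_INT16_WEIGHTS_INT8
]
tflite_16x8_model = converter.convert()
```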
The headline would be “16x8 kernels improve accuracy for quantized models on microcontrollers without increasing model size”.
Users would benefit in the following ways:
We propose that the 16x8 kernels be ported from the TensorFlow Lite reference kernels to TensorFlow Lite for Microcontrollers, following the process in the Porting TensorFlow Lite Ops to Micro guide.
We wish to ensure that the following kernels are compatible with 16x8 mode:
Adding the 16x8 kernels directly to TFLM alongside the existing kernels would increase the default code size by an unacceptable amount. Instead, we will make use of the kernel registration API currently under development by the TFLM team. The use of this is demonstrated in the Keyword benchmark code. By doing this, the end user can decide which kernels and dependencies they want to include (e.g. 8 bit, 16x8, or float32).
For example, the following could be registered:
```c++
// Support for all datatypes
op_resolver->AddFullyConnected(tflite::Register_FULLY_CONNECTED());

// Support for 8 bit quantized models
op_resolver->AddFullyConnected(tflite::Register_FULLY_CONNECTED_INT8());

// Support for 16x8 quantized models
op_resolver->AddFullyConnected(tflite::Register_FULLY_CONNECTED_INT16X8());
```
This means that kernels not currently using this registration API will need to be refactored to use it. Currently only FullyConnected uses the API.
The following associated tasks will be required to support this work:
* A new benchmark in `tensorflow/lite/micro/benchmarks` that demonstrates the use of the ops that provide a 16x8 kernel.

The work will be broken down into a series of pull requests, some for the benchmarks and some for each kernel.
Benchmark pull requests:
* A benchmark in `tensorflow/lite/micro/benchmarks` that attempts to run a 16x8 model that includes the kernels mentioned in this RFC. The model’s weights and biases can be random. The benchmark should use the `MicroMutableOpResolver`. The PR should include the Colab used to generate the model (a sketch is given after this list).
* Logging of memory usage via the `RecordingMemoryAllocator`.
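A rough sketch of what such a Colab might contain is shown below. The layer choices, input shape, and output filename are illustrative assumptions, not part of this RFC: a small Keras model using some of the ops of interest is left with its randomly initialized weights and converted with the same 16x8 converter settings shown earlier.

```python
import numpy as np
import tensorflow as tf

# A small model containing some of the ops of interest; the weights are left
# at their random initial values since the benchmark only measures execution.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 3, activation="relu", input_shape=(32, 32, 1)),
    tf.keras.layers.AveragePooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

def representative_dataset():
  # Random calibration data is sufficient for a benchmark model.
  for _ in range(100):
    yield [np.random.rand(1, 32, 32, 1).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.EXPERIMENTAL_TFLITE_BUILTINS_ACTIVATIONS_INT16_WEIGHTS_INT8
]
with open("benchmark_16x8.tflite", "wb") as f:
  f.write(converter.convert())
```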
For each kernel:
Note that @njeffrie from the TF Lite Micro team also plans to prepare PR(s) for the kernels that are of interest internally (without using the kernel variant registration API for binary size). This will provide some quick examples of porting the kernels.