Online Memory Allocation Overview in TensorFlow Lite Micro

This document outlines how “online” memory is managed in TensorFlow Lite Micro (TFLM).

Arena

Online memory planning strategically places allocations in a single uint8_t buffer array. The buffer is split into two main sections: the “head” and the “tail”. Generally, non-persistent allocations are placed in the “head” and persistent allocations are placed in the “tail”. More details about the arena can be found here.
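
As an illustration, a minimal sketch of how an application typically provides the arena (the 16 KB size and 16-byte alignment below are placeholder assumptions; the right size is model-specific):

    #include <cstddef>
    #include <cstdint>

    // The application owns the arena; TFLM carves the "head" (non-persistent)
    // and "tail" (persistent) sections out of this single buffer. kArenaSize
    // is a placeholder and must be tuned per model.
    constexpr size_t kArenaSize = 16 * 1024;
    alignas(16) static uint8_t tensor_arena[kArenaSize];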

Existing buffers in the flatbuffer

The TFLite flatbuffer model contains a variety of information required to run a model in TFLite or TFLM. The TFLM online memory planner will walk the main subgraph and find all tensors required for the model (represented as TfLiteTensor and TfLiteEvalTensor C structs at runtime). Persistent tensors in the flatbuffer (e.g. weight tensors) point at buffers inlined in the flatbuffer. These buffers are reused during online memory planning: instead of copying the data into the arena, the corresponding C structs point back at the buffer packed into the flatbuffer.
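
As a sketch of that relationship, the generated flatbuffer API can be used to see which tensors reference inlined data (the helper function below is hypothetical):

    #include "tensorflow/lite/schema/schema_generated.h"

    // Hypothetical helper: counts tensors in the main subgraph whose buffer
    // entry carries inlined data (e.g. weights). TFLM points the runtime C
    // structs at this data instead of copying it into the arena.
    int CountFlatbufferBackedTensors(const tflite::Model* model) {
      const tflite::SubGraph* subgraph = model->subgraphs()->Get(0);
      int count = 0;
      for (const tflite::Tensor* tensor : *subgraph->tensors()) {
        const tflite::Buffer* buffer = model->buffers()->Get(tensor->buffer());
        if (buffer->data() != nullptr && buffer->data()->size() > 0) {
          ++count;
        }
      }
      return count;
    }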

Model Init Phase

Online model allocation begins either with the first call to MicroInterpreter::Invoke() or with an explicit call to MicroInterpreter::AllocateTensors(). The MicroInterpreter instance invokes MicroAllocator::StartModelAllocation(), which begins pulling data out of the serialized flatbuffer and walking through the main subgraph.
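
A typical way to trigger this phase explicitly is shown below (g_model_data and the two registered ops are application-specific assumptions; tensor_arena and kArenaSize are from the arena sketch above):

    #include "tensorflow/lite/micro/micro_interpreter.h"
    #include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
    #include "tensorflow/lite/schema/schema_generated.h"

    const tflite::Model* model = tflite::GetModel(g_model_data);

    tflite::MicroMutableOpResolver<2> resolver;  // sized for two op types
    resolver.AddFullyConnected();
    resolver.AddSoftmax();

    tflite::MicroInterpreter interpreter(model, resolver, tensor_arena,
                                         kArenaSize);
    // Runs StartModelAllocation() and the rest of the online plan now,
    // rather than implicitly on the first Invoke().
    TfLiteStatus status = interpreter.AllocateTensors();
    if (status != kTfLiteOk) {
      // Arena too small or model unsupported; handle the failure here.
    }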

The method MicroAllocator::StartModelAllocation() begins allocation in the following order:

  • Initializes internal state for scratch buffer allocations.
  • Allocates a list of TfLiteEvalTensor C structs based on the number of tensors in the subgraph.
    • Allocations are persistent and stored in the tail section.
    • Tensors that reference buffers in the flatbuffer are assigned at this point.
  • Allocates a list of TfLiteRegistration and TfLiteNode C structs for every operator in the model subgraph.
    • Allocations are persistent and stored in the tail section.
  • Walks back through the list of subgraph operators and assigns all C structs with relevant information from the flatbuffer.

At the conclusion of this phase, the operator kernel implementations are ready for calls to the TfLiteRegistration::init() function. The MicroInterpreter walks through the operator list and invokes every operator implementation that provides this function. Typically, an implementation returns a pointer to the object that is stored in the user_data field of the corresponding TfLiteNode struct.
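
A sketch of this common init() pattern follows; OpData is a hypothetical per-op struct, and real kernels define their own:

    struct OpData {
      int scratch_index;  // filled in later, during prepare
    };

    // Matches the TfLiteRegistration::init signature. The returned pointer
    // is stored by the framework in node->user_data. Persistent buffers are
    // placed in the arena tail and live as long as the interpreter.
    void* Init(TfLiteContext* context, const char* buffer, size_t length) {
      return context->AllocatePersistentBuffer(context, sizeof(OpData));
    }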

Model Prepare Phase

After the interpreter has initialized all operator kernels, it makes another pass through the subgraph, this time calling each operator implementation that provides a TfLiteRegistration::prepare() function. Kernels use this phase in TFLM to verify capabilities from model information, validate shapes, request any scratch buffers they will need (through TfLiteContext::RequestScratchBufferInArena()), and calculate quantization runtime data.

At this time, operator implementations request tensor data through the TfLiteTensor C struct. This struct is heavier and carries the additional information that operators need during this phase of initialization. Internally, TFLM allocates these instances per request in the temp section: the space between the head and the tail in the arena. During the prepare phase, nothing has yet been placed in the head section, so this extra space between the head and tail is used to allocate buffers that remain available until MicroAllocator::ResetTempAllocations() is called. Additional information is available here.
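
For example, a prepare() implementation can borrow a temp TfLiteTensor to inspect shapes, then hand it back (the helper names follow recent TFLM versions and may differ in older ones; the 4-D shape check is an arbitrary example):

    #include "tensorflow/lite/micro/micro_context.h"

    TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
      tflite::MicroContext* micro_context = tflite::GetMicroContext(context);

      // Temp allocation between head and tail; valid only during prepare.
      TfLiteTensor* input = micro_context->AllocateTempInputTensor(node, 0);
      TF_LITE_ENSURE(context, input != nullptr);
      TF_LITE_ENSURE_EQ(context, input->dims->size, 4);

      // Hand the temp instance back before returning.
      micro_context->DeallocateTempTfLiteTensor(input);
      return kTfLiteOk;
    }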

NOTE: The TfLiteTensor struct is only available in TFLM during TfLiteRegistration::prepare(); after this allocation phase, tensor data can only be accessed via a TfLiteEvalTensor struct.
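
A minimal eval-time sketch of that restriction, using the TFLM kernel utility helpers:

    #include "tensorflow/lite/micro/kernels/kernel_util.h"

    // After the prepare phase, only the lightweight TfLiteEvalTensor view
    // remains available to kernels.
    TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
      const TfLiteEvalTensor* input =
          tflite::micro::GetEvalInput(context, node, 0);
      TfLiteEvalTensor* output =
          tflite::micro::GetEvalOutput(context, node, 0);

      const float* in = tflite::micro::GetTensorData<float>(input);
      float* out = tflite::micro::GetTensorData<float>(output);
      // ... kernel math over in/out elided ...
      return kTfLiteOk;
    }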

Additionally, at this time each operator implementation may request scratch buffers through TfLiteContext::RequestScratchBufferInArena(). These requests are limited to kMaxScratchBuffersPerOp per operator and are stored in an instance variable while that operator's prepare block runs. All requests are eventually moved to the head section when the interpreter moves on to the next operator.
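
Extending the hypothetical prepare() sketch above (shape checks elided; the 1024-byte size is an arbitrary assumption), a request might look like this:

    TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
      OpData* data = static_cast<OpData*>(node->user_data);
      // Record the request; TFLM later folds it into the head-section plan.
      TF_LITE_ENSURE_STATUS(context->RequestScratchBufferInArena(
          context, /*bytes=*/1024, &data->scratch_index));
      return kTfLiteOk;
    }

During eval, the saved index is resolved to a pointer into the planned head section with context->GetScratchBuffer(context, data->scratch_index).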

After each call to TfLiteRegistration::prepare(), the MicroInterpreter calls MicroAllocator::FinishPrepareNodeAllocations(). This method resets temp allocations and stores all outstanding scratch buffer requests in the head section of the arena.

After all operators have been prepared, the MicroInterpreter calls MicroAllocator::FinishModelAllocation() to begin finalizing the online memory plan.

Finish Model Allocation Phase

The last phase of online memory planning is handled in MicroAllocator::FinishModelAllocation(). This function performs the following tasks:

  • Allocates space in the tail for all persistent buffer requests that are currently in the head.
  • Commits the static memory plan:
    • Uses the GreedyMemoryPlanner to optimize the non-persistent space in the head.
    • Optimizes for the operator that requires the largest buffer in bytes.
    • Allocates pointers in the tail that point at planned offsets in the shared head space.
    • Sets the size of the head based on the result of GreedyMemoryPlanner::GetMaximumMemorySize().
  • Allocates variable tensor buffers in the tail section.

Once TFLM has finalized online model allocation, all buffers are prepared and inference can run at full speed. Operator implementations can no longer allocate scratch buffers after this point.
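
Assuming the interpreter from the earlier sketches, the final footprint of the plan can be inspected after allocation succeeds, which is useful for right-sizing the arena:

    #include "tensorflow/lite/micro/micro_log.h"

    // Only meaningful once AllocateTensors() has completed successfully.
    MicroPrintf("Arena used: %d bytes",
                static_cast<int>(interpreter.arena_used_bytes()));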