runtime/src/iree/tokenizer/README.md

IREE Tokenizer

A high-performance streaming tokenizer library for ML inference and training, written in C. Loads HuggingFace tokenizer.json and OpenAI tiktoken .tiktoken files, producing identical output to the respective reference implementations, with full support for BPE, WordPiece, and Unigram models.

What This Is

The IREE tokenizer is a format-agnostic tokenization engine with a pull-based streaming architecture. It processes arbitrarily long input with fixed memory, produces tokens incrementally, and tracks byte-exact offsets through the full normalization pipeline for training use cases.

The tokenizer is thread-safe and allocation-free in the hot path: all buffers are caller-provided, all state is in fixed-size structs, and the tokenizer object itself is immutable after construction. This makes it suitable for high-throughput inference servers (pool state objects, share the tokenizer across threads) as well as embedded systems (statically allocate everything).

Key Properties

  • Streaming-first: encode(text) is just initialize() -> feed(text) -> finalize(). Chunked input produces identical output regardless of chunk boundaries.
  • Bounded memory: Fixed-size state (~1-2KB) plus a caller-provided transform buffer. No hidden allocations during encode/decode. Processes unbounded-length input with constant memory.
  • Offset tracking: Every token maps back to its exact byte range in the original input, propagated forward through normalization (NFC, lowercase, etc.) using run-length encoded offset maps. Essential for training data provenance and RAG chunking.
  • HuggingFace compatible: Bit-exact match against the HuggingFace tokenizers library across all supported normalizers, pre-tokenizers, models, and decoders.
  • Tiktoken compatible: Loads OpenAI tiktoken .tiktoken files (cl100k_base, o200k_base, r50k_base, p50k_base) and produces token-for-token identical output to the tiktoken Python library. Merge pairs are reconstructed from ranks via BPE simulation — verified 100% accurate against HuggingFace's explicit merge lists.
  • Safe for untrusted input: All arithmetic uses checked overflow helpers. No undefined behavior on malformed UTF-8, pathological regex patterns, or adversarial input.

Performance

Throughput in MiB/s (one-shot encode and decode; pool rotation over 64 corpus copies to defeat cache warming; Clang -O3 -march=native with ThinLTO; AMD EPYC at 5.4 GHz). Corpora: 593KB ASCII (Sherlock Holmes), 1.8MB CJK (a Chinese classic), 1.4MB code (concatenated C source):

Model         Algo       Vocab   ASCII Enc  CJK Enc  Code Enc  ASCII Dec  CJK Dec  Code Dec
GPT-2         BPE        50K     28         15       44        587        1,588    630
Llama 3       BPE        128K    42         9        54        709        1,236    781
Gemma 2       BBPE       256K    40         6        71        521        3,366    215
Qwen 2.5      BPE        151K    24         8        44        736        1,270    835
BLOOM         BPE        250K    24         7        26        219        847      62
Mistral NeMo  BPE        131K    39         10       39        601        1,221    780
DeepSeek V3   BPE        129K    13         10       16        658        1,667    2,125
Whisper       BPE        51K     26         11       43        588        1,439    596
BERT          WordPiece  30K     44         36       40        814        3,061    792
T5            Unigram    32K     18         3        43        646        1,808    556

Streaming encode produces bit-exact output regardless of buffer size and tracks one-shot throughput closely. DeepSeek V3 streaming encode on ASCII (same settings as above, pool rotation, Sherlock Holmes corpus):

Buffer Size  Throughput   Tokens   vs One-Shot
1 KB         10.7 MiB/s   143,053  0.82x
4 KB         10.7 MiB/s   143,053  0.82x
16 KB        11.6 MiB/s   143,053  0.89x
64 KB        11.6 MiB/s   143,053  0.89x
One-Shot     13.1 MiB/s   143,053  1.00x

Throughput is nearly flat across buffer sizes. The 0.82-0.89x streaming ratio is the cost of the pass-through probe: when a segmenter child finds no matches (e.g., a Numbers split applied to English text), the pipeline probes the final child's DFA to find the natural segment boundary before expanding through the full child chain. This avoids forcing word splits at arbitrary buffer boundaries.

Running Benchmarks

# Single model (HuggingFace JSON):
iree-bazel-run --copt=-O3 --copt=-march=native --features=thin_lto \
  //runtime/src/iree/tokenizer/tools:comprehensive_benchmark -- \
  --tokenizer_json=path/to/tokenizer.json --benchmark_min_time=1s

# Single model (tiktoken — encoding inferred from filename):
iree-bazel-run --copt=-O3 --copt=-march=native --features=thin_lto \
  //runtime/src/iree/tokenizer/tools:comprehensive_benchmark -- \
  --tokenizer_json=path/to/cl100k_base.tiktoken --benchmark_min_time=1s

# All 10 models (downloads tokenizers automatically):
runtime/src/iree/tokenizer/tools/run_benchmarks.sh --benchmark_min_time=1s

The --rotate flag allocates 64 copies of each corpus at separate heap addresses and cycles through them. The total working set exceeds L3 cache, defeating cache warming. Off by default.

Architecture

Streaming Encode Pipeline

Input ──┐
(bytes) │
        v
 ┌─────────────┐       (by config)
 │ UTF-8       │  OR  ┌─────────────┐
 │ Decoder     │      │ ByteLevel   │
 └──────┬──────┘      │ Encoder     │
        │             └──────┬──────┘
        └───────┬────────────┘
                │ codepoints (batched)
                v
 ┌──────────────────────────────┐
 │ Normalizer Chain             │
 │ (NFC, lowercase, strip, ...) │
 └──────────────┬───────────────┘
                │ codepoints (batched)
                v
         ┌─────────────┐
         │ UTF-8       │
         │ Encoder     │
         └──────┬──────┘
                │ bytes
                v
 ┌───────────────────────────────────┐
 │ Transform Buffer (ring buffer)    │
 │ [normalized bytes]                │
 └────────────────┬──────────────────┘
                  │
                  v
 ┌──────────────────────────────┐
 │ Segmenter                    │
 │ (Metaspace, ByteLevel, BERT, │
 │  whitespace, regex split)    │
 └──────────────┬───────────────┘
                │ segments (byte ranges)
                v
 ┌──────────────────────────────┐
 │ BPE / WordPiece / Unigram    │
 │ (subword tokenization)       │
 └──────────────┬───────────────┘
                │
                v
        Token IDs + Offsets

The pipeline is pull-based: output buffer capacity drives processing. When the caller calls feed(), the encoder pulls data through each stage only as needed to fill the output. Each stage maintains a small batch buffer (default 64 codepoints) to amortize indirect call overhead. For 1M codepoints through 4 pipeline stages, this means ~63K indirect calls instead of 4M.

Streaming Decode Pipeline

Token IDs ──┐
            v
 ┌─────────────────────────────┐
 │ Vocabulary Lookup           │
 │ (id -> token string)        │
 └──────────────┬──────────────┘
                │ token strings
                v
         ┌─────────────┐
         │ UTF-8       │
         │ Decoder     │
         └──────┬──────┘
                │ codepoints (batched)
                v
 ┌──────────────────────────────┐
 │ Decoder Chain                │
 │ (ByteFallback, Metaspace,    │
 │  WordPiece strip, CTC, ...)  │
 └──────────────┬───────────────┘
                │ codepoints (batched)
                v
         ┌─────────────┐
         │ UTF-8       │
         │ Encoder     │
         └──────┬──────┘
                │
                v
           Output bytes

Data Flow

User                    Tokenizer                 Pipeline Stages
 │                          │                            │
 │  initialize(state,       │                            │
 │    buffer, capacity)     │                            │
 │─────────────────────────>│                            │
 │                          │                            │
 │  feed(state, chunk1)     │                            │
 │─────────────────────────>│  pull codepoints           │
 │                          │───────────────────────────>│
 │                          │  batched codepoints        │
 │                          │<───────────────────────────│
 │                          │  [transform, split, BPE]   │
 │  (partial tokens)        │                            │
 │<─────────────────────────│                            │
 │                          │                            │
 │  feed(state, chunk2)     │                            │
 │─────────────────────────>│            ...             │
 │  (more tokens)           │                            │
 │<─────────────────────────│                            │
 │                          │                            │
 │  finalize(state)         │  flush remaining           │
 │─────────────────────────>│───────────────────────────>│
 │  (final tokens)          │<───────────────────────────│
 │<─────────────────────────│                            │
 │                          │                            │
 │  deinitialize(state)     │                            │
 │─────────────────────────>│                            │

Design Choices

Cache and Locality Efficiency

The pipeline is organized to keep working data in L1/L2 cache:

  • Batched pull model: Each stage processes 64 items at a time from a small internal buffer (~256 bytes), which fits in L1. The upstream indirect call happens once per batch, not once per item.
  • Ring buffer transform: Normalized bytes flow through a power-of-two ring buffer. The segmenter reads from one region while the normalizer writes to another, keeping both hot in cache.
  • Compact vocabulary: Token strings are stored contiguously in a single allocation with offset indexing. Vocabulary lookup is a single indexed load, not a hash table chase.
  • Merge hash table: BPE merge lookups use an open-addressing hash table with linear probing, keeping probe sequences cache-line-local.

Algorithmic Efficiency

  • BPE: Priority-queue based merge with a fixed-size heap. Processes one segment at a time with O(n log n) merges where n is the segment length (typically 5-20 bytes, not the full input length). The heap is embedded in the encode state, not heap-allocated.
  • Trie-based vocabulary: Token lookup during encoding uses a trie for longest-prefix matching, giving O(k) lookup where k is the max token length (bounded, typically < 50 bytes).
  • DFA regex: The regex engine compiles patterns to DFA at tokenizer load time. Matching is O(n) in input length with no backtracking, making it safe for untrusted input patterns.

Memory Consumption

The tokenizer uses a fixed, small amount of memory for streaming state:

Component           Size      Notes
Encode state        ~1-2 KB   Pipeline stage batch buffers + state
Transform buffer    4-64 KB   Caller-provided, tunable
Decode state        ~1 KB     Vocab string batch buffer + decoder state
Tokenizer (shared)  Varies    Vocabulary + merge table + regex DFA

The encode and decode state sizes are independent of input length. A 1-byte input and a 1-GB input use the same state allocation. This makes the tokenizer suitable for streaming/speculative decoding where you process tokens incrementally without ever needing to re-decode previous output.

Composability with Streaming Workflows

Because the tokenizer processes input incrementally with bounded state:

  • Speculative decoding: Decode tentative tokens, discard on misprediction, continue from the same state. No re-decode of accepted prefix.
  • Streaming chat: Feed each LLM-generated token to the decoder as it arrives. UTF-8 output is produced as soon as complete characters are available.
  • Chunked encoding: Feed input in arbitrary-sized chunks (network packets, file read buffers). The tokenizer handles chunk boundaries transparently, including mid-codepoint splits.
  • Multi-turn context: Control the AT_INPUT_START flag to handle Metaspace prepend behavior correctly across conversation turns without re-encoding history.

Features

Tokenization Models

  • BPE (Byte-Pair Encoding): GPT-2, Llama, Gemma, Qwen, Mistral, DeepSeek, BLOOM, Whisper, etc.
  • WordPiece: BERT and variants (DistilBERT, ELECTRA, MobileBERT, etc.)
  • Unigram: T5, XLM-RoBERTa, CamemBERT, etc.

Normalizers

NFC, NFD, NFKD, lowercase, strip, strip accents, BERT normalization, precompiled (HuggingFace format), prepend, replace, regex replace, sequences of the above.

Segmenters (Pre-tokenizers)

ByteLevel (GPT-2), Metaspace (SentencePiece), BERT (whitespace + punctuation + CJK), whitespace, digits, punctuation, regex split, sequences of the above. All five HuggingFace split behaviors (Removed, Isolated, MergedWithPrevious, MergedWithNext, Contiguous) are supported.

Decoders

ByteLevel, ByteFallback, Metaspace, WordPiece (## strip), CTC, Replace, Strip, sequences of the above.

Post-processing

Template-based special token insertion (BOS, EOS, CLS, SEP) with HuggingFace-compatible template pair support and type ID assignment.

Format Loaders

HuggingFace tokenizer.json: Full coverage of the HuggingFace tokenizers library's JSON format, including all normalizer types, pre-tokenizer configurations, model types (BPE, WordPiece, Unigram), decoder chains, added tokens, and post-processor templates. A single tokenizer.json is self-contained — it carries the vocabulary, merge list, regex patterns, special tokens, and all pipeline configuration.

OpenAI tiktoken .tiktoken: Loads the .tiktoken BPE vocabulary format used by GPT-4, GPT-4o, GPT-3, and Codex models. Tiktoken files store only base64-encoded byte tokens with integer ranks — no explicit merge list, no regex pattern, no special tokens. The loader reconstructs the full BPE merge table from ranks via simulation: for each multi-byte token at rank R, it simulates BPE encoding of that token's raw bytes using only merges with rank < R, yielding the exact merge pair. Predefined configs provide the regex pattern and special tokens for each standard encoding (cl100k_base, o200k_base, r50k_base, p50k_base). Cross-validated against HuggingFace's explicit merge lists with 100% token-for-token accuracy.

Offset Tracking

Forward-propagated byte offsets through the full normalization pipeline. Offsets are stored as run-length encoded maps (O(discontinuities) memory, not O(input_length)), making them practical for arbitrarily long inputs.

Regex Engine

Self-contained regex compiler and DFA execution engine. Compiles patterns at tokenizer load time; no runtime regex dependency. O(n) matching with no backtracking.

Directory Layout

tokenizer/
  tokenizer.h/c          Core API: encode, decode, streaming state
  types.h                Shared types (token_id_t, offset_t, etc.)
  model.h/c              Model vtable interface
  normalizer.h/c         Normalizer vtable interface
  segmenter.h/c          Segmenter vtable interface
  decoder.h/c            Decoder vtable interface
  postprocessor.h/c      Template-based special token insertion
  special_tokens.h/c     Added token matching (pre/post normalization)

  model/                 Tokenization algorithms
    bpe.h/c              BPE with priority-queue merge
    wordpiece.h/c        WordPiece greedy longest-match
    unigram.h/c          Unigram (Viterbi) model

  normalizer/            Codepoint transforms
    nfc.c, nfd.c, nfkd.c, lowercase.c, strip.c, strip_accents.c,
    bert.c, precompiled.c, prepend.c, replace.c, regex_replace.c,
    sequence.c, passthrough.c

  segmenter/             Pre-tokenization (word boundaries)
    bert.c, whitespace.c, metaspace.c, digits.c, punctuation.c,
    split.c, sequence.c, passthrough.c

  decoder/               Token-to-text transforms
    byte_level.c, byte_fallback.c, metaspace.c, wordpiece.c,
    ctc.c, replace.c, strip.c, sequence.c, passthrough.c

  vocab/                 Vocabulary management
    vocab.h/c            Vocab lookup (id->string, string->id)
    vocab_builder.h/c    Incremental vocab construction
    vocab_hash.h/c       String-to-ID hash table
    vocab_merge_hash.h/c BPE merge pair hash table
    vocab_trie.h/c       Trie for longest-prefix matching

  regex/                 Self-contained regex engine
    compile.h/c          Pattern -> NFA -> DFA compilation
    exec.h/c             DFA execution
    internal/            Lexer, parser, NFA, DFA internals

  format/                Format loaders (tokenizer core is format-agnostic)
    huggingface/         HuggingFace tokenizer.json loader
      tokenizer_json.h/c   Top-level JSON parser
      model_json.h/c       BPE/WordPiece/Unigram JSON
      normalizer_json.h/c  Normalizer chain JSON
      segmenter_json.h/c   Pre-tokenizer JSON
      decoder_json.h/c     Decoder chain JSON
      added_tokens_json.h/c  Special token JSON
      postprocessor_json.h/c Post-processor template JSON
    tiktoken/            OpenAI tiktoken .tiktoken loader
      tiktoken.h/c         Parser, vocab/merge reconstruction, assembly

  tools/                 Benchmarks and test infrastructure
    comprehensive_benchmark.cc  Google Benchmark suite
    run_benchmarks.sh           Downloads models, runs full suite
    huggingface_smoketest.py    Correctness tests against HuggingFace
    run_smoketest.sh            Wrapper for smoketest dependencies

  testing/               Test utilities
    scoped_resource.h    RAII wrappers for test cleanup

The format/ directory contains only loaders: they parse a specific file format and populate the tokenizer builder. The core tokenizer has no knowledge of any file format. Future formats (SentencePiece protobuf, a custom mmap format for zero-copy loading) would be additional subdirectories under format/.

Quick Start

#include "iree/tokenizer/format/huggingface/tokenizer_json.h"
#include "iree/tokenizer/format/tiktoken/tiktoken.h"
#include "iree/tokenizer/tokenizer.h"

// Load from HuggingFace JSON:
iree_tokenizer_t* tokenizer = NULL;
iree_status_t status = iree_tokenizer_from_huggingface_json(
    json_contents, iree_allocator_system(), &tokenizer);

// -- OR load from tiktoken (e.g., cl100k_base for GPT-4):
// iree_status_t status = iree_tokenizer_from_tiktoken(
//     tiktoken_contents, iree_tokenizer_tiktoken_config_cl100k_base(),
//     iree_allocator_system(), &tokenizer);

// One-shot encode.
iree_tokenizer_token_id_t tokens[1024];
iree_host_size_t count = 0;
status = iree_tokenizer_encode(
    tokenizer, iree_make_cstring_view("Hello, world!"),
    IREE_TOKENIZER_ENCODE_FLAG_NONE,
    iree_tokenizer_make_token_output(tokens, NULL, NULL, 1024),
    iree_allocator_system(), &count);

// One-shot decode.
char text[4096];
iree_host_size_t text_length = 0;
status = iree_tokenizer_decode(
    tokenizer, iree_tokenizer_make_token_id_list(tokens, count),
    IREE_TOKENIZER_DECODE_FLAG_NONE,
    iree_make_mutable_string_view(text, sizeof(text)),
    iree_allocator_system(), &text_length);

iree_tokenizer_free(tokenizer);

Streaming Encode

// Calculate required state size.
iree_host_size_t state_size = 0;
iree_tokenizer_encode_state_calculate_size(tokenizer, &state_size);

// Allocate state + transform buffer (caller controls placement).
uint8_t* state_storage = malloc(state_size);
size_t buffer_size = iree_tokenizer_transform_buffer_recommended_size(4096);
uint8_t* transform_buffer = malloc(buffer_size);

// Initialize streaming state.
iree_tokenizer_encode_state_t* state = NULL;
iree_tokenizer_encode_state_initialize(
    tokenizer,
    iree_make_byte_span(state_storage, state_size),
    iree_make_byte_span(transform_buffer, buffer_size),
    iree_tokenizer_offset_run_list_empty(),
    IREE_TOKENIZER_ENCODE_FLAG_AT_INPUT_START,
    &state);

// Feed chunks as they arrive (network, file, etc.).
while (has_more_input()) {
    iree_string_view_t chunk = get_next_chunk();
    while (chunk.size > 0) {
        iree_host_size_t bytes_consumed = 0, tokens_written = 0;
        iree_tokenizer_encode_state_feed(
            state, chunk, output, &bytes_consumed, &tokens_written);
        chunk.data += bytes_consumed;
        chunk.size -= bytes_consumed;
        process_tokens(output, tokens_written);
    }
}

// Flush remaining tokens.
iree_host_size_t final_count = 0;
iree_tokenizer_encode_state_finalize(state, output, &final_count);
process_tokens(output, final_count);

iree_tokenizer_encode_state_deinitialize(state);
free(transform_buffer);
free(state_storage);

Streaming Decode (Token-at-a-Time)

// Allocate decode state (sized analogously to the encode state above).
iree_tokenizer_decode_state_t* state = NULL;
iree_tokenizer_decode_state_initialize(
    tokenizer, IREE_TOKENIZER_DECODE_FLAG_NONE,
    iree_make_byte_span(state_storage, state_size), &state);

// Feed each token as the LLM generates it.
char text[64];  // declared outside the loop so finalize() below can reuse it
while (llm_has_more_tokens()) {
    iree_tokenizer_token_id_t token = llm_sample_next();
    iree_host_size_t consumed = 0, written = 0;
    iree_tokenizer_decode_state_feed(
        state, iree_tokenizer_make_token_id_list(&token, 1),
        iree_make_mutable_string_view(text, sizeof(text)),
        &consumed, &written);
    if (written > 0) {
        stream_to_user(text, written);  // UTF-8 safe for display
    }
}

iree_host_size_t final_bytes = 0;
iree_tokenizer_decode_state_finalize(
    state, iree_make_mutable_string_view(text, sizeof(text)), &final_bytes);
iree_tokenizer_decode_state_deinitialize(state);