A high-performance streaming tokenizer library for ML inference and training, written in C. Loads HuggingFace tokenizer.json and OpenAI tiktoken .tiktoken files, producing identical output to the respective reference implementations, with full support for BPE, WordPiece, and Unigram models.
The IREE tokenizer is a format-agnostic tokenization engine with a pull-based streaming architecture. It processes arbitrarily long input with fixed memory, produces tokens incrementally, and tracks byte-exact offsets through the full normalization pipeline for training use cases.
The tokenizer is thread-safe and allocation-free in the hot path: all buffers are caller-provided, all state is in fixed-size structs, and the tokenizer object itself is immutable after construction. This makes it suitable for high-throughput inference servers (pool state objects, share the tokenizer across threads) as well as embedded systems (statically allocate everything).
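Below is a minimal sketch of that sharing pattern, assuming a pthread-style worker; the worker struct and thread plumbing are illustrative, while the encode call mirrors the one-shot usage example later in this README:

```c
#include <pthread.h>

#include "iree/tokenizer/tokenizer.h"

// Illustrative job type: each worker encodes its own input into its own
// output buffer while sharing a single, immutable tokenizer instance.
typedef struct {
  iree_tokenizer_t* tokenizer;  // shared across threads, read-only after load
  iree_string_view_t text;      // this worker's input
} encode_job_t;

// pthread start routine: no allocation, no shared mutable state.
static void* encode_worker(void* arg) {
  encode_job_t* job = (encode_job_t*)arg;
  iree_tokenizer_token_id_t tokens[1024];  // per-thread output storage
  iree_host_size_t count = 0;
  iree_status_t status = iree_tokenizer_encode(
      job->tokenizer, job->text, IREE_TOKENIZER_ENCODE_FLAG_NONE,
      iree_tokenizer_make_token_output(tokens, NULL, NULL, 1024),
      iree_allocator_system(), &count);
  // Handle |status| and consume the |count| tokens here.
  (void)status;
  return NULL;
}
```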
encode(text) is just initialize() -> feed(text) -> finalize(). Chunked input produces identical output regardless of chunk boundaries.

The tiktoken loader reads .tiktoken files (cl100k_base, o200k_base, r50k_base, p50k_base) and produces token-for-token identical output to the tiktoken Python library. Merge pairs are reconstructed from ranks via BPE simulation, verified 100% accurate against HuggingFace's explicit merge lists.

Throughput in MiB/s (one-shot encode and decode, pool rotation with 64 copies defeating cache warming, Clang -O3 -march=native with ThinLTO, AMD EPYC 5.4 GHz). Corpus: 593KB ASCII (Sherlock Holmes), 1.8MB CJK (Chinese classic), 1.4MB Code (concatenated C source):
| Model | Algo | Vocab | ASCII Enc | CJK Enc | Code Enc | ASCII Dec | CJK Dec | Code Dec |
|---|---|---|---|---|---|---|---|---|
| GPT-2 | BPE | 50K | 28 | 15 | 44 | 587 | 1,588 | 630 |
| Llama 3 | BPE | 128K | 42 | 9 | 54 | 709 | 1,236 | 781 |
| Gemma 2B | BPE | 256K | 40 | 67 | 15 | 213 | 366 | 215 |
| Qwen 2.5 | BPE | 151K | 24 | 8 | 44 | 736 | 1,270 | 835 |
| BLOOM | BPE | 250K | 24 | 7 | 2 | 621 | 984 | 762 |
| Mistral NeMo | BPE | 131K | 39 | 10 | 39 | 601 | 1,221 | 780 |
| DeepSeek V3 | BPE | 129K | 13 | 10 | 16 | 658 | 1,667 | 2,125 |
| Whisper | BPE | 51K | 26 | 11 | 43 | 588 | 1,439 | 596 |
| BERT | WordPiece | 30K | 44 | 36 | 40 | 814 | 3,061 | 792 |
| T5 | Unigram | 32K | 18 | 34 | 3 | 646 | 1,808 | 556 |
Streaming encode produces bit-exact output regardless of buffer size and tracks one-shot throughput closely. DeepSeek V3 streaming encode on ASCII (same settings as above, pool rotation, Sherlock Holmes corpus):
| Buffer Size | Throughput | Tokens | vs One-Shot |
|---|---|---|---|
| 1 KB | 10.7 MiB/s | 143,053 | 0.82x |
| 4 KB | 10.7 MiB/s | 143,053 | 0.82x |
| 16 KB | 11.6 MiB/s | 143,053 | 0.89x |
| 64 KB | 11.6 MiB/s | 143,053 | 0.89x |
| One-Shot | 13.1 MiB/s | 143,053 | 1.00x |
Throughput is flat across all buffer sizes. The 0.82-0.89x streaming ratio is the cost of the pass-through probe: when a segmenter child finds no matches (e.g., Numbers split on English text), the pipeline probes the final child's DFA to determine the natural segment boundary before expanding through the full child chain. This avoids forcing word splits at arbitrary buffer boundaries.
```sh
# Single model (HuggingFace JSON):
iree-bazel-run --copt=-O3 --copt=-march=native --features=thin_lto \
  //runtime/src/iree/tokenizer/tools:comprehensive_benchmark -- \
  --tokenizer_json=path/to/tokenizer.json --benchmark_min_time=1s

# Single model (tiktoken — encoding inferred from filename):
iree-bazel-run --copt=-O3 --copt=-march=native --features=thin_lto \
  //runtime/src/iree/tokenizer/tools:comprehensive_benchmark -- \
  --tokenizer_json=path/to/cl100k_base.tiktoken --benchmark_min_time=1s

# All 10 models (downloads tokenizers automatically):
runtime/src/iree/tokenizer/tools/run_benchmarks.sh --benchmark_min_time=1s
```
The --rotate flag allocates 64 copies of each corpus at separate heap addresses and cycles through them. The total working set exceeds L3 cache, defeating cache warming. Off by default.
Input ──┐
(bytes) │
v
┌─────────────┐ (by config)
│ UTF-8 │ OR ┌─────────────┐
│ Decoder │ │ ByteLevel │
└──────┬──────┘ │ Encoder │
│ └──────┬──────┘
└───────┬────────────┘
│ codepoints (batched)
v
┌──────────────────────────────┐
│ Normalizer Chain │
│ (NFC, lowercase, strip, ...) │
└──────────────┬───────────────┘
│ codepoints (batched)
v
┌─────────────┐
│ UTF-8 │
│ Encoder │
└──────┬──────┘
│ bytes
v
┌───────────────────────────────────┐
│ Transform Buffer (ring buffer) │
│ [normalized bytes] │
└────────────────┬──────────────────┘
│
v
┌──────────────────────────────┐
│ Segmenter │
│ (Metaspace, ByteLevel, BERT, │
│ whitespace, regex split) │
└──────────────┬───────────────┘
│ segments (byte ranges)
v
┌──────────────────────────────┐
│ BPE / WordPiece / Unigram │
│ (subword tokenization) │
└──────────────┬───────────────┘
│
v
Token IDs + Offsets
The pipeline is pull-based: output buffer capacity drives processing. When the caller calls feed(), the encoder pulls data through each stage only as needed to fill the output. Each stage maintains a small batch buffer (default 64 codepoints) to amortize indirect call overhead. For 1M codepoints through 4 pipeline stages, this means ~63K indirect calls instead of 4M.
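The following sketch illustrates the batching idea only; the names (`stage_t`, `pull`, `BATCH_CAPACITY`) are hypothetical and not the library's actual vtable interface:

```c
#include <stddef.h>
#include <stdint.h>

#define BATCH_CAPACITY 64  // default batch size from the text above

// Hypothetical pull-based stage: each call pulls up to one batch of
// codepoints from its upstream stage, so indirect-call overhead is paid
// once per batch rather than once per codepoint.
typedef struct stage_t stage_t;
struct stage_t {
  // Pulls up to |capacity| codepoints into |out|; returns the count produced.
  size_t (*pull)(stage_t* stage, uint32_t* out, size_t capacity);
  stage_t* upstream;
  uint32_t batch[BATCH_CAPACITY];  // small per-stage scratch, stays in L1
};

// A pass-through stage: one indirect call upstream per batch of up to 64
// codepoints. 1M codepoints through 4 stages is roughly 4 * (1M / 64),
// about 63K indirect calls instead of 4M.
static size_t passthrough_pull(stage_t* stage, uint32_t* out, size_t capacity) {
  size_t count = capacity < BATCH_CAPACITY ? capacity : BATCH_CAPACITY;
  count = stage->upstream->pull(stage->upstream, stage->batch, count);
  for (size_t i = 0; i < count; ++i) out[i] = stage->batch[i];  // transform here
  return count;
}
```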
Token IDs ──┐
v
┌─────────────────────────────┐
│ Vocabulary Lookup │
│ (id -> token string) │
└──────────────┬──────────────┘
│ token strings
v
┌─────────────┐
│ UTF-8 │
│ Decoder │
└──────┬──────┘
│ codepoints (batched)
v
┌──────────────────────────────┐
│ Decoder Chain │
│ (ByteFallback, Metaspace, │
│ WordPiece strip, CTC, ...) │
└──────────────┬───────────────┘
│ codepoints (batched)
v
┌─────────────┐
│ UTF-8 │
│ Encoder │
└──────┬──────┘
│
v
Output bytes
User                        Tokenizer                     Pipeline Stages
  │                             │                              │
  │ initialize(state,           │                              │
  │   buffer, capacity)         │                              │
  │────────────────────────────>│                              │
  │                             │                              │
  │ feed(state, chunk1)         │                              │
  │────────────────────────────>│ pull codepoints              │
  │                             │─────────────────────────────>│
  │                             │ batched codepoints           │
  │                             │<─────────────────────────────│
  │                             │ [transform, split, BPE]      │
  │ (partial tokens)            │                              │
  │<────────────────────────────│                              │
  │                             │                              │
  │ feed(state, chunk2)         │                              │
  │────────────────────────────>│ ...                          │
  │ (more tokens)               │                              │
  │<────────────────────────────│                              │
  │                             │                              │
  │ finalize(state)             │ flush remaining              │
  │────────────────────────────>│─────────────────────────────>│
  │ (final tokens)              │<─────────────────────────────│
  │<────────────────────────────│                              │
  │                             │                              │
  │ deinitialize(state)         │                              │
  │────────────────────────────>│                              │
The pipeline is organized to keep working data in L1/L2 cache.
The tokenizer uses a fixed, small amount of memory for streaming state:
| Component | Size | Notes |
|---|---|---|
| Encode state | ~1-2 KB | Pipeline stage batch buffers + state |
| Transform buffer | 4-64 KB | Caller-provided, tunable |
| Decode state | ~1 KB | Vocab string batch buffer + decoder state |
| Tokenizer (shared) | Varies | Vocabulary + merge table + regex DFA |
The encode and decode state sizes are independent of input length. A 1-byte input and a 1-GB input use the same state allocation. This makes the tokenizer suitable for streaming/speculative decoding where you process tokens incrementally without ever needing to re-decode previous output.
Because the tokenizer processes input incrementally with bounded state, new input can be appended without re-processing what came before; the AT_INPUT_START flag handles Metaspace prepend behavior correctly across conversation turns without re-encoding history.

Normalizers: NFC, NFD, NFKD, lowercase, strip, strip accents, BERT normalization, precompiled (HuggingFace format), prepend, replace, regex replace, sequences of the above.
Segmenters (pre-tokenizers): ByteLevel (GPT-2), Metaspace (SentencePiece), BERT (whitespace + punctuation + CJK), whitespace, digits, punctuation, regex split, sequences of the above. All five HuggingFace split behaviors (Removed, Isolated, MergedWithPrevious, MergedWithNext, Contiguous) are supported.
Decoders: ByteLevel, ByteFallback, Metaspace, WordPiece (## strip), CTC, Replace, Strip, sequences of the above.
Post-processor: Template-based special token insertion (BOS, EOS, CLS, SEP) with HuggingFace-compatible template pair support and type ID assignment.
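For illustration, standard HuggingFace template semantics for a BERT-style sentence pair look like this (example input is hypothetical):

```
Template:   [CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1

Input pair: ("how are you", "fine thanks")
Tokens:     [CLS]  how  are  you  [SEP]  fine  thanks  [SEP]
Type IDs:     0     0    0    0     0      1      1      1
```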
HuggingFace tokenizer.json: Full coverage of the HuggingFace tokenizers library's JSON format, including all normalizer types, pre-tokenizer configurations, model types (BPE, WordPiece, Unigram), decoder chains, added tokens, and post-processor templates. A single tokenizer.json is self-contained — it carries the vocabulary, merge list, regex patterns, special tokens, and all pipeline configuration.
OpenAI tiktoken .tiktoken: Loads the .tiktoken BPE vocabulary format used by GPT-4, GPT-4o, GPT-3, and Codex models. Tiktoken files store only base64-encoded byte tokens with integer ranks — no explicit merge list, no regex pattern, no special tokens. The loader reconstructs the full BPE merge table from ranks via simulation: for each multi-byte token at rank R, it simulates BPE encoding of that token's raw bytes using only merges with rank < R, yielding the exact merge pair. Predefined configs provide the regex pattern and special tokens for each standard encoding (cl100k_base, o200k_base, r50k_base, p50k_base). Cross-validated against HuggingFace's explicit merge lists with 100% token-for-token accuracy.
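A minimal sketch of that rank-simulation idea follows; `lookup_rank()` and the 64-byte token limit are assumptions for the sketch, not the loader's actual implementation:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: returns the rank of the byte string
// [data, data + length) in the loaded .tiktoken vocabulary, or -1 if absent.
extern int64_t lookup_rank(const uint8_t* data, size_t length);

// Recovers the merge pair for a multi-byte token with rank |token_rank| by
// simulating BPE over its raw bytes using only merges learned earlier
// (rank < token_rank). When two pieces remain, their boundary is the merge.
static bool recover_merge(const uint8_t* bytes, size_t length,
                          int64_t token_rank, size_t* out_left_length) {
  enum { kMaxTokenBytes = 64 };  // sketch-only limit
  size_t starts[kMaxTokenBytes];
  if (length < 2 || length > kMaxTokenBytes) return false;
  size_t piece_count = length;
  for (size_t i = 0; i < length; ++i) starts[i] = i;  // one piece per byte
  while (piece_count > 2) {
    int64_t best_rank = token_rank;  // only merges with rank < token_rank apply
    size_t best_index = SIZE_MAX;
    for (size_t i = 0; i + 1 < piece_count; ++i) {
      size_t begin = starts[i];
      size_t end = (i + 2 < piece_count) ? starts[i + 2] : length;
      int64_t rank = lookup_rank(bytes + begin, end - begin);
      if (rank >= 0 && rank < best_rank) {
        best_rank = rank;
        best_index = i;
      }
    }
    if (best_index == SIZE_MAX) return false;  // no applicable merge found
    for (size_t i = best_index + 1; i + 1 < piece_count; ++i) {
      starts[i] = starts[i + 1];  // drop the boundary at starts[best_index + 1]
    }
    --piece_count;
  }
  *out_left_length = starts[1];  // left half of the merge = bytes[0 .. starts[1])
  return true;
}
```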
Forward-propagated byte offsets through the full normalization pipeline. Offsets are stored as run-length encoded maps (O(discontinuities) memory, not O(input_length)), making them practical for arbitrarily long inputs.
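An illustrative run-length layout (not the library's actual structures) shows why memory scales with discontinuities rather than input length:

```c
#include <stddef.h>

// One run per contiguous region where normalized offsets track source
// offsets linearly; a new run starts only at each normalization discontinuity.
typedef struct {
  size_t normalized_start;  // offset in the normalized byte stream
  size_t source_start;      // corresponding offset in the original input
  size_t length;            // bytes covered by this run
} offset_run_t;

// Maps a normalized offset back to a source offset by scanning runs.
// (A real implementation would binary-search the sorted run list.)
static size_t map_offset(const offset_run_t* runs, size_t run_count,
                         size_t normalized_offset) {
  if (run_count == 0) return normalized_offset;
  for (size_t i = 0; i < run_count; ++i) {
    size_t end = runs[i].normalized_start + runs[i].length;
    if (normalized_offset < end) {
      return runs[i].source_start +
             (normalized_offset - runs[i].normalized_start);
    }
  }
  return runs[run_count - 1].source_start + runs[run_count - 1].length;
}
```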
Self-contained regex compiler and DFA execution engine. Compiles patterns at tokenizer load time; no runtime regex dependency. O(n) matching with no backtracking.
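The execution model is the usual table-driven DFA walk; the structures below are illustrative, not the engine's actual types:

```c
#include <stddef.h>
#include <stdint.h>

// Illustrative compiled DFA: one transition lookup per input byte,
// no backtracking, O(n) over the input.
typedef struct {
  const uint16_t* transitions;  // state_count x 256 next-state table
  const uint8_t* accepting;     // accepting[state] != 0 if the state accepts
  uint16_t start_state;
  uint16_t state_count;
} dfa_t;

// Returns the length of the longest match beginning at data[0], or 0 if none.
static size_t dfa_longest_match(const dfa_t* dfa, const uint8_t* data,
                                size_t length) {
  uint16_t state = dfa->start_state;
  size_t longest = 0;
  for (size_t i = 0; i < length; ++i) {
    state = dfa->transitions[(size_t)state * 256 + data[i]];
    if (dfa->accepting[state]) longest = i + 1;
  }
  return longest;
}
```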
tokenizer/
tokenizer.h/c Core API: encode, decode, streaming state
types.h Shared types (token_id_t, offset_t, etc.)
model.h/c Model vtable interface
normalizer.h/c Normalizer vtable interface
segmenter.h/c Segmenter vtable interface
decoder.h/c Decoder vtable interface
postprocessor.h/c Template-based special token insertion
special_tokens.h/c Added token matching (pre/post normalization)
model/ Tokenization algorithms
bpe.h/c BPE with priority-queue merge
wordpiece.h/c WordPiece greedy longest-match
unigram.h/c Unigram (Viterbi) model
normalizer/ Codepoint transforms
nfc.c, nfd.c, nfkd.c, lowercase.c, strip.c, strip_accents.c,
bert.c, precompiled.c, prepend.c, replace.c, regex_replace.c,
sequence.c, passthrough.c
segmenter/ Pre-tokenization (word boundaries)
bert.c, whitespace.c, metaspace.c, digits.c, punctuation.c,
split.c, sequence.c, passthrough.c
decoder/ Token-to-text transforms
byte_level.c, byte_fallback.c, metaspace.c, wordpiece.c,
ctc.c, replace.c, strip.c, sequence.c, passthrough.c
vocab/ Vocabulary management
vocab.h/c Vocab lookup (id->string, string->id)
vocab_builder.h/c Incremental vocab construction
vocab_hash.h/c String-to-ID hash table
vocab_merge_hash.h/c BPE merge pair hash table
vocab_trie.h/c Trie for longest-prefix matching
regex/ Self-contained regex engine
compile.h/c Pattern -> NFA -> DFA compilation
exec.h/c DFA execution
internal/ Lexer, parser, NFA, DFA internals
format/ Format loaders (tokenizer core is format-agnostic)
huggingface/ HuggingFace tokenizer.json loader
tokenizer_json.h/c Top-level JSON parser
model_json.h/c BPE/WordPiece/Unigram JSON
normalizer_json.h/c Normalizer chain JSON
segmenter_json.h/c Pre-tokenizer JSON
decoder_json.h/c Decoder chain JSON
added_tokens_json.h/c Special token JSON
postprocessor_json.h/c Post-processor template JSON
tiktoken/ OpenAI tiktoken .tiktoken loader
tiktoken.h/c Parser, vocab/merge reconstruction, assembly
tools/ Benchmarks and test infrastructure
comprehensive_benchmark.cc Google Benchmark suite
run_benchmarks.sh Downloads models, runs full suite
huggingface_smoketest.py Correctness tests against HuggingFace
run_smoketest.sh Wrapper for smoketest dependencies
testing/ Test utilities
scoped_resource.h RAII wrappers for test cleanup
The format/ directory contains only loaders: they parse a specific file format and populate the tokenizer builder. The core tokenizer has no knowledge of any file format. Future formats (SentencePiece protobuf, a custom mmap format for zero-copy loading) would be additional subdirectories under format/.
#include "iree/tokenizer/format/huggingface/tokenizer_json.h" #include "iree/tokenizer/format/tiktoken/tiktoken.h" #include "iree/tokenizer/tokenizer.h" // Load from HuggingFace JSON: iree_tokenizer_t* tokenizer = NULL; iree_status_t status = iree_tokenizer_from_huggingface_json( json_contents, iree_allocator_system(), &tokenizer); // -- OR load from tiktoken (e.g., cl100k_base for GPT-4): // iree_status_t status = iree_tokenizer_from_tiktoken( // tiktoken_contents, iree_tokenizer_tiktoken_config_cl100k_base(), // iree_allocator_system(), &tokenizer); // One-shot encode. iree_tokenizer_token_id_t tokens[1024]; iree_host_size_t count = 0; status = iree_tokenizer_encode( tokenizer, iree_make_cstring_view("Hello, world!"), IREE_TOKENIZER_ENCODE_FLAG_NONE, iree_tokenizer_make_token_output(tokens, NULL, NULL, 1024), iree_allocator_system(), &count); // One-shot decode. char text[4096]; iree_host_size_t text_length = 0; status = iree_tokenizer_decode( tokenizer, iree_tokenizer_make_token_id_list(tokens, count), IREE_TOKENIZER_DECODE_FLAG_NONE, iree_make_mutable_string_view(text, sizeof(text)), iree_allocator_system(), &text_length); iree_tokenizer_free(tokenizer);
```c
// Calculate required state size.
iree_host_size_t state_size = 0;
iree_tokenizer_encode_state_calculate_size(tokenizer, &state_size);

// Allocate state + transform buffer (caller controls placement).
uint8_t* state_storage = malloc(state_size);
size_t buffer_size = iree_tokenizer_transform_buffer_recommended_size(4096);
uint8_t* transform_buffer = malloc(buffer_size);

// Initialize streaming state.
iree_tokenizer_encode_state_t* state = NULL;
iree_tokenizer_encode_state_initialize(
    tokenizer, iree_make_byte_span(state_storage, state_size),
    iree_make_byte_span(transform_buffer, buffer_size),
    iree_tokenizer_offset_run_list_empty(),
    IREE_TOKENIZER_ENCODE_FLAG_AT_INPUT_START, &state);

// Feed chunks as they arrive (network, file, etc.).
// |output| is a token output description created with
// iree_tokenizer_make_token_output(), as in the one-shot example above.
while (has_more_input()) {
  iree_string_view_t chunk = get_next_chunk();
  while (chunk.size > 0) {
    iree_host_size_t bytes_consumed = 0, tokens_written = 0;
    iree_tokenizer_encode_state_feed(
        state, chunk, output, &bytes_consumed, &tokens_written);
    chunk.data += bytes_consumed;
    chunk.size -= bytes_consumed;
    process_tokens(output, tokens_written);
  }
}

// Flush remaining tokens.
iree_host_size_t final_count = 0;
iree_tokenizer_encode_state_finalize(state, output, &final_count);
process_tokens(output, final_count);

iree_tokenizer_encode_state_deinitialize(state);
free(transform_buffer);
free(state_storage);
```
```c
// |state_storage|/|state_size| are caller-allocated, sized for decode state.
iree_tokenizer_decode_state_t* state = NULL;
iree_tokenizer_decode_state_initialize(
    tokenizer, IREE_TOKENIZER_DECODE_FLAG_NONE,
    iree_make_byte_span(state_storage, state_size), &state);

// Feed each token as the LLM generates it.
char text[64];
while (llm_has_more_tokens()) {
  iree_tokenizer_token_id_t token = llm_sample_next();
  iree_host_size_t consumed = 0, written = 0;
  iree_tokenizer_decode_state_feed(
      state, iree_tokenizer_make_token_id_list(&token, 1),
      iree_make_mutable_string_view(text, sizeof(text)),
      &consumed, &written);
  if (written > 0) {
    stream_to_user(text, written);  // UTF-8 safe for display
  }
}

iree_host_size_t final_bytes = 0;
iree_tokenizer_decode_state_finalize(
    state, iree_make_mutable_string_view(text, sizeof(text)), &final_bytes);
iree_tokenizer_decode_state_deinitialize(state);
```