| commit | 30c1ffd32dff89080e72b57bc29e0d400bc58c91 | |
|---|---|---|
| author | Ben Vanik <ben.vanik@gmail.com> | Sun Jan 11 19:56:14 2026 -0800 |
| committer | GitHub <noreply@github.com> | Sun Jan 11 19:56:14 2026 -0800 |
| tree | b2e18da127c5f8a39e3af17845bb4cdc850c94e8 | |
| parent | 4bb9b7110a03b049d798d0e1b16cf4a34abbc0c2 | |

Unicode normalization and JSON parsing enhancements. (#23088)

## Summary

Adds foundational Unicode and JSON parsing infrastructure required for HuggingFace tokenizer support:

- **Unicode normalization** - NFD decomposition and NFC-like composition for BERT-style tokenizer preprocessing
- **Enhanced JSON parsing** - Full Unicode escape support and new convenience APIs for parsing tokenizer.json files

## Changes

### Unicode Normalization (`runtime/src/iree/base/internal/unicode.c`)

Adds Unicode normalization capabilities for tokenizer preprocessing:

| Addition | Description |
|----------|-------------|
| CCC table | ~850 entries for canonical ordering of combining marks |
| NFC composition pairs | ~930 entries for canonical composition (base + combining → precomposed) |
| `iree_unicode_ccc()` | Returns CCC for any codepoint (0 for starters, non-zero for combining marks) |
| `iree_unicode_compose_pair()` | Looks up (base, combining) → composed codepoint |
| `iree_unicode_compose()` | Applies canonical ordering and composition to UTF-8 strings |

The composition function uses chunk-based processing with a small fixed internal buffer (128 bytes), avoiding the earlier requirement of 4× input size for scratch space.

> **Design Note**: This implements composition-only, not full NFC (which requires `Compose(Decompose(input))`). This is sufficient for tokenizer use cases since BERT uses NFD + accent stripping rather than NFC, and GPT-2/LLaMA use no normalizer. Real-world input is typically already NFC from OS input methods.

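
To make the "canonical ordering, then pairwise composition" description concrete, here is a minimal Python sketch of the general technique. It is not the IREE C implementation: the names `ccc`, `canonical_order`, `compose`, and the four-entry `COMPOSE_PAIRS` table are hypothetical stand-ins, the combining classes come from Python's `unicodedata` rather than a generated table, and the sketch works on whole strings instead of fixed-size chunks.

```python
import unicodedata

# Hypothetical miniature pair table for illustration only; the commit's
# generated table is described as having ~930 entries.
COMPOSE_PAIRS = {
    ("e", "\u0301"): "\u00e9",       # e + combining acute      -> U+00E9
    ("a", "\u0302"): "\u00e2",       # a + combining circumflex -> U+00E2
    ("a", "\u0323"): "\u1ea1",       # a + combining dot below  -> U+1EA1
    ("\u1ea1", "\u0302"): "\u1ead",  # U+1EA1 + circumflex      -> U+1EAD
}


def ccc(ch):
    """Canonical combining class: 0 for starters, non-zero for combining marks."""
    return unicodedata.combining(ch)


def canonical_order(text):
    """Stably sorts each run of combining marks by ascending CCC."""
    out, run = [], []
    for ch in text:
        if ccc(ch) == 0:
            out.extend(sorted(run, key=ccc))  # sorted() is stable
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=ccc))
    return "".join(out)


def compose(text):
    """Composition only: reorders marks, then merges (starter, mark) pairs."""
    out = []
    starter_idx = -1  # index in `out` of the most recent starter, if any
    last_ccc = 0      # CCC of the most recently appended character
    for ch in canonical_order(text):
        c = ccc(ch)
        if starter_idx >= 0:
            adjacent = len(out) - 1 == starter_idx
            # A character is blocked from the starter when an intervening
            # character has an equal or higher combining class.
            blocked = (not adjacent) and last_ccc >= c
            if not blocked:
                composed = COMPOSE_PAIRS.get((out[starter_idx], ch))
                if composed is not None:
                    out[starter_idx] = composed
                    continue
        if c == 0:
            starter_idx = len(out)
        out.append(ch)
        last_ccc = c
    return "".join(out)
```

In this sketch, `compose("a\u0302\u0323")` reorders the marks to dot-below then circumflex and yields the single codepoint U+1EAD, matching `unicodedata.normalize("NFC", ...)` for that input. The composition-only caveat from the design note also shows up: `compose("\u00e2\u0323")` leaves the input unchanged because nothing decomposes U+00E2 first, whereas full NFC would produce U+1EAD.
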
### JSON Parsing Enhancements (`runtime/src/iree/base/internal/json.c`)

- **Locale-independent number parsing** - Removes `ctype.h` dependency for consistent behavior
- **Full Unicode escape support** - `iree_json_unescape_string()` handles `\uXXXX` and UTF-16 surrogate pairs for codepoints > U+FFFF
- **New lookup APIs**:
  - `iree_json_try_lookup_object_value()` - Returns empty string view for missing keys (not an error)
  - `iree_json_enumerate_array_typed()` - Array enumeration with value type inference
  - `iree_json_array_length()` and `iree_json_array_get()` - Direct indexed access
- **Performance optimization** - Uses single-character prefix functions instead of `strncmp` for delimiter checking

### String View Utilities (`runtime/src/iree/base/string_view.h`)

- `iree_string_view_starts_with_char()` - Check if string starts with a single character
- `iree_string_view_consume_prefix_char()` - Remove leading char if it matches

These inline functions avoid `strncmp` overhead for common single-character operations in parsers.

### Unicode Table Generator (`build_tools/scripts/unicode_tables_gen.py`)

- Parses `DerivedNormalizationProps.txt` for composition exclusions
- Generates CCC entries and NFC composition pairs
- Reports statistics for new table types

## Statistics

```
10 files changed, 3054 insertions(+), 54 deletions(-)
```

Key additions:

- ~850 CCC table entries (~2.5KB)
- ~930 NFC composition pairs (~11KB)
- ~600 new lines of comprehensive test coverage

## Testing

- Comprehensive unit tests for Unicode normalization including edge cases (empty strings, ASCII-only, multiple combining marks, boundary conditions)
- Extended JSON tests for Unicode escape handling, surrogate pairs, and new APIs
- All existing tests continue to pass

## Motivation

This is pre-work for the tokenizer implementation. HuggingFace tokenizer.json files require:

1. Full Unicode escape decoding (for special tokens like `\u0120` representing ` ` with preceding space marker)
2. NFD/NFC normalization for BERT-style tokenizers that strip accents

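
The Unicode escape decoding requirement can be illustrated with a short Python sketch of how `\uXXXX` escapes and UTF-16 surrogate pairs are commonly decoded. This is an assumption-laden illustration, not the behavior of `iree_json_unescape_string()`: the function name `unescape_json_string`, the helper `_read_hex4`, and the choice to reject unpaired surrogates with `ValueError` are all hypothetical.

```python
# Two-character JSON escapes and the characters they stand for.
_SIMPLE_ESCAPES = {
    '"': '"', "\\": "\\", "/": "/",
    "b": "\b", "f": "\f", "n": "\n", "r": "\r", "t": "\t",
}
_HEX_DIGITS = set("0123456789abcdefABCDEF")


def _read_hex4(s, i):
    """Parses exactly four hex digits starting at index i."""
    digits = s[i:i + 4]
    if len(digits) != 4 or not set(digits) <= _HEX_DIGITS:
        raise ValueError("invalid \\u escape at offset %d" % (i - 2))
    return int(digits, 16)


def unescape_json_string(s):
    """Decodes the escapes in the contents of a JSON string (quotes removed)."""
    out = []
    i = 0
    while i < len(s):
        ch = s[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(s):
            raise ValueError("dangling backslash at end of string")
        esc = s[i + 1]
        if esc in _SIMPLE_ESCAPES:
            out.append(_SIMPLE_ESCAPES[esc])
            i += 2
            continue
        if esc != "u":
            raise ValueError("unknown escape at offset %d" % i)
        unit = _read_hex4(s, i + 2)
        i += 6
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: a \uDC00-\uDFFF escape must follow immediately.
            if not s.startswith("\\u", i):
                raise ValueError("high surrogate without low surrogate")
            low = _read_hex4(s, i + 2)
            if not 0xDC00 <= low <= 0xDFFF:
                raise ValueError("high surrogate without low surrogate")
            i += 6
            # Combine the two 10-bit halves into a codepoint above U+FFFF.
            unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
        elif 0xDC00 <= unit <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        out.append(chr(unit))
    return "".join(out)
```

For example, the escaped text `\u0120the` decodes to U+0120 followed by `the`, and the surrogate pair `\ud83d\ude00` decodes to the single codepoint U+1F600. Handling every escape in one pass, rather than searching for `\u` directly, keeps an escaped backslash followed by a literal `u0041` from being misread as a Unicode escape.
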
IREE (Intermediate Representation Execution Environment, pronounced as “eerie”) is an MLIR-based end-to-end compiler and runtime that lowers Machine Learning (ML) models to a unified IR that scales up to meet the needs of the datacenter and down to satisfy the constraints and special considerations of mobile and edge deployments.
See our website for project details, user guides, and instructions on building from source.
Release notes are published on GitHub releases.
| Package | Release status |
|---|---|
| GitHub release (stable) | |
| GitHub release (nightly) | |
| iree-base-compiler | |
| iree-base-runtime | |
For more details on the release process, see https://iree.dev/developers/general/release-management/.
| Operating system | Build status |
|---|---|
| Linux | |
| macOS | |
| macOS | |
For the full list of workflows see https://iree.dev/developers/general/github-actions/.
See our website for more information.
Community meeting recordings: IREE YouTube channel
| Date | Title | Recording | Slides |
|---|---|---|---|
| 2025-06-10 | Data-Tiling in IREE: Achieving High Performance Through Compiler Design (AsiaLLVM) | recording | slides |
| 2025-05-17 | Introduction to GPU architecture and IREE's GPU CodeGen Pipeline | recording | slides |
| 2025-02-12 | The Long Tail of AI: SPIR-V in IREE and MLIR (Vulkanised) | recording | slides |
| 2024-10-01 | Unveiling the Inner Workings of IREE: An MLIR-Based Compiler for Diverse Hardware | recording | |
| 2021-06-09 | IREE Runtime Design Tech Talk | recording | slides |
| 2020-08-20 | IREE CodeGen (MLIR Open Design Meeting) | recording | slides |
| 2020-03-18 | Interactive HAL IR Walkthrough | recording | |
| 2020-01-31 | End-to-end MLIR Workflow in IREE (MLIR Open Design Meeting) | recording | slides |
IREE is licensed under the terms of the Apache 2.0 License with LLVM Exceptions. See LICENSE for more information.