Unicode normalization and JSON parsing enhancements. (#23088)

## Summary

Adds foundational Unicode and JSON parsing infrastructure required for
HuggingFace tokenizer support:

- **Unicode normalization** - NFD decomposition and NFC-like composition
for BERT-style tokenizer preprocessing
- **Enhanced JSON parsing** - Full Unicode escape support and new
convenience APIs for parsing tokenizer.json files

## Changes

### Unicode Normalization (`runtime/src/iree/base/internal/unicode.c`)

Adds Unicode normalization capabilities for tokenizer preprocessing:

| Addition | Description |
|----------|-------------|
| CCC table | ~850 entries for canonical ordering of combining marks |
| NFC composition pairs | ~930 entries for canonical composition (base + combining → precomposed) |
| `iree_unicode_ccc()` | Returns CCC for any codepoint (0 for starters, non-zero for combining marks) |
| `iree_unicode_compose_pair()` | Looks up (base, combining) → composed codepoint |
| `iree_unicode_compose()` | Applies canonical ordering and composition to UTF-8 strings |
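
As a rough illustration of how a `(base, combining)` lookup such as `iree_unicode_compose_pair()` might work, the sketch below binary-searches a small sorted excerpt. The struct, table, and function names are hypothetical, the six rows are a hand-picked sample rather than the generated table, and returning 0 on a miss is an assumption, not the documented behavior of the IREE function.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical table row: (base, combining) -> composed codepoint.
typedef struct {
  uint32_t base;
  uint32_t combining;
  uint32_t composed;
} example_compose_entry_t;

// Hand-picked excerpt sorted by (base, combining). Illustrative only.
static const example_compose_entry_t kExampleComposeTable[] = {
    {0x0041, 0x0300, 0x00C0},  // A + combining grave -> U+00C0
    {0x0041, 0x0301, 0x00C1},  // A + combining acute -> U+00C1
    {0x0065, 0x0300, 0x00E8},  // e + combining grave -> U+00E8
    {0x0065, 0x0301, 0x00E9},  // e + combining acute -> U+00E9
    {0x006E, 0x0303, 0x00F1},  // n + combining tilde -> U+00F1
    {0x0075, 0x0308, 0x00FC},  // u + combining diaeresis -> U+00FC
};

// Binary search over the sorted table.
// Returns the composed codepoint, or 0 when the pair does not compose
// (the miss value is an assumption made for this sketch).
uint32_t example_compose_pair(uint32_t base, uint32_t combining) {
  size_t lo = 0;
  size_t hi = sizeof(kExampleComposeTable) / sizeof(kExampleComposeTable[0]);
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    const example_compose_entry_t* e = &kExampleComposeTable[mid];
    if (e->base == base && e->combining == combining) {
      return e->composed;
    }
    if (e->base < base || (e->base == base && e->combining < combining)) {
      lo = mid + 1;  // target sorts after this row
    } else {
      hi = mid;  // target sorts before this row
    }
  }
  return 0;
}
```

A sorted array with binary search keeps the table read-only and compact, which fits the sizes listed under Statistics. Whether the actual implementation uses this layout is not stated here.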

The composition function uses chunk-based processing with a small fixed
internal buffer (128 bytes), avoiding the earlier requirement of 4×
input size for scratch space.
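
The canonical ordering step can be sketched on a small codepoint buffer, which is roughly the shape of work a fixed-size chunk allows. This is a minimal sketch under stated assumptions: `example_ccc()` is a hypothetical stand-in covering only three combining marks, and the function operates on already-decoded codepoints rather than UTF-8 bytes.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical stand-in for a CCC lookup; covers three marks only.
uint8_t example_ccc(uint32_t cp) {
  switch (cp) {
    case 0x0301:
      return 230;  // COMBINING ACUTE ACCENT
    case 0x0323:
      return 220;  // COMBINING DOT BELOW
    case 0x0327:
      return 202;  // COMBINING CEDILLA
    default:
      return 0;  // everything else is treated as a starter here
  }
}

// Reorders each run of combining marks into ascending CCC order, in place.
// Marks with equal CCC keep their relative order (stable insertion sort),
// and starters (CCC 0) never move and are never crossed.
void example_canonical_order(uint32_t* cps, size_t count) {
  for (size_t i = 1; i < count; ++i) {
    uint32_t cp = cps[i];
    uint8_t ccc = example_ccc(cp);
    if (ccc == 0) continue;  // starter: leave in place
    size_t j = i;
    // Shift preceding marks with a strictly higher CCC one slot right.
    // A starter has CCC 0, so the loop stops when it reaches one.
    while (j > 0 && example_ccc(cps[j - 1]) > ccc) {
      cps[j] = cps[j - 1];
      --j;
    }
    cps[j] = cp;
  }
}
```

For example, `a` + U+0301 + U+0323 becomes `a` + U+0323 + U+0301, because the dot below (CCC 220) sorts before the acute accent (CCC 230).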

> **Design Note**: This implements composition only, not full NFC, which
is defined as `Compose(Decompose(input))`. That is sufficient for tokenizer
use cases: BERT uses NFD plus accent stripping rather than NFC, and
GPT-2/LLaMA use no normalizer. Real-world input is typically already NFC
from OS input methods.

### JSON Parsing Enhancements (`runtime/src/iree/base/internal/json.c`)

- **Locale-independent number parsing** - Removes `ctype.h` dependency
for consistent behavior
- **Full Unicode escape support** - `iree_json_unescape_string()`
handles `\uXXXX` and UTF-16 surrogate pairs for codepoints > U+FFFF
- **New lookup APIs**:
  - `iree_json_try_lookup_object_value()` - Returns empty string view for
    missing keys (not an error)
  - `iree_json_enumerate_array_typed()` - Array enumeration with value
    type inference
  - `iree_json_array_length()` and `iree_json_array_get()` - Direct
    indexed access
- **Performance optimization** - Uses single-character prefix functions
instead of `strncmp` for delimiter checking
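
To make the surrogate-pair handling concrete, here is a minimal sketch of the two steps a JSON unescaper has to perform once it has parsed the hex digits: combining a high/low pair into one codepoint, then encoding that codepoint as UTF-8. The function names are hypothetical, and this is not the code inside `iree_json_unescape_string()`. Returning 0 for invalid input is an assumption of the sketch.

```c
#include <stddef.h>
#include <stdint.h>

// Combines a UTF-16 surrogate pair into a single codepoint.
// Returns 0 if either half is outside its surrogate range.
uint32_t example_combine_surrogates(uint32_t high, uint32_t low) {
  if (high < 0xD800u || high > 0xDBFFu) return 0;
  if (low < 0xDC00u || low > 0xDFFFu) return 0;
  return 0x10000u + ((high - 0xD800u) << 10) + (low - 0xDC00u);
}

// Encodes a codepoint as UTF-8 into out, which must hold at least 4 bytes.
// Returns the number of bytes written, or 0 for lone surrogates and values
// above U+10FFFF.
size_t example_encode_utf8(uint32_t cp, uint8_t* out) {
  if (cp < 0x80u) {
    out[0] = (uint8_t)cp;
    return 1;
  }
  if (cp < 0x800u) {
    out[0] = (uint8_t)(0xC0u | (cp >> 6));
    out[1] = (uint8_t)(0x80u | (cp & 0x3Fu));
    return 2;
  }
  if (cp < 0x10000u) {
    if (cp >= 0xD800u && cp <= 0xDFFFu) return 0;  // lone surrogate
    out[0] = (uint8_t)(0xE0u | (cp >> 12));
    out[1] = (uint8_t)(0x80u | ((cp >> 6) & 0x3Fu));
    out[2] = (uint8_t)(0x80u | (cp & 0x3Fu));
    return 3;
  }
  if (cp <= 0x10FFFFu) {
    out[0] = (uint8_t)(0xF0u | (cp >> 18));
    out[1] = (uint8_t)(0x80u | ((cp >> 12) & 0x3Fu));
    out[2] = (uint8_t)(0x80u | ((cp >> 6) & 0x3Fu));
    out[3] = (uint8_t)(0x80u | (cp & 0x3Fu));
    return 4;
  }
  return 0;
}
```

Under these assumptions, the escape sequence `\uD83D\uDE00` combines to U+1F600 and encodes as the four bytes `F0 9F 98 80`, while `\u0120` encodes as `C4 A0`.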

### String View Utilities (`runtime/src/iree/base/string_view.h`)

- `iree_string_view_starts_with_char()` - Check if string starts with a
single character
- `iree_string_view_consume_prefix_char()` - Remove leading char if it
matches

These inline functions avoid `strncmp` overhead for common
single-character operations in parsers.
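
A minimal sketch of what such helpers could look like is shown below. The `example_string_view_t` type and both function names are hypothetical stand-ins, not the definitions in `string_view.h`, and the actual signatures may differ.

```c
#include <stdbool.h>
#include <stddef.h>

// Hypothetical minimal string view: a pointer plus a length, not
// NUL-terminated.
typedef struct {
  const char* data;
  size_t size;
} example_string_view_t;

// Returns true if the view is non-empty and its first character is c.
static inline bool example_starts_with_char(example_string_view_t view,
                                            char c) {
  return view.size > 0 && view.data[0] == c;
}

// If the view starts with c, advances past it and returns true.
// Otherwise leaves the view unchanged and returns false.
static inline bool example_consume_prefix_char(example_string_view_t* view,
                                               char c) {
  if (!example_starts_with_char(*view, c)) return false;
  view->data += 1;
  view->size -= 1;
  return true;
}
```

A parser checking for an opening `[` would then call `example_consume_prefix_char(&view, '[')`, which amounts to a single length test and a single byte comparison.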

### Unicode Table Generator (`build_tools/scripts/unicode_tables_gen.py`)

- Parses `DerivedNormalizationProps.txt` for composition exclusions
- Generates CCC entries and NFC composition pairs
- Reports statistics for new table types

## Statistics

```
 10 files changed, 3054 insertions(+), 54 deletions(-)
```

Key additions:
- ~850 CCC table entries (~2.5KB)
- ~930 NFC composition pairs (~11KB)
- ~600 new lines of comprehensive test coverage

## Testing

- Comprehensive unit tests for Unicode normalization including edge
cases (empty strings, ASCII-only, multiple combining marks, boundary
conditions)
- Extended JSON tests for Unicode escape handling, surrogate pairs, and
new APIs
- All existing tests continue to pass

## Motivation

This is pre-work for the tokenizer implementation. HuggingFace
tokenizer.json files require:

1. Full Unicode escape decoding (for tokens containing escapes like
`\u0120`, which decodes to `Ġ`, the marker that byte-level BPE vocabularies
use for a preceding space)
2. NFD/NFC normalization for BERT-style tokenizers that strip accents