Refreshing VM performance by separating verification and tweaking buffer access. (#12426)

This moves most of the interpreter's checks into an ahead-of-time bytecode
verifier. That lets us share the same verification with JITs and disable it
entirely for code size reasons using the
`-DIREE_VM_BYTECODE_VERIFICATION_ENABLE=0` compiler flag. Verification is
fairly exhaustive but may still need some additions; it's significantly better
than before, though, so even if this isn't the final form it's a good step.
Simpler verification (and dispatch) will come with a pending reshuffling of the
bytecode into instruction classes.
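
As a rough illustration of the split (a minimal sketch with hypothetical names,
not IREE's actual verifier or dispatch code): verify everything once when the
module is loaded, then run a dispatch loop that assumes well-formed bytecode.

```
// Minimal sketch with hypothetical names; not IREE's actual code. Shows
// checks hoisted out of the hot loop into a one-time load-time pass.
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { TOY_OP_ADD = 0, TOY_OP_RET = 1, TOY_OP_COUNT };

typedef struct {
  const uint8_t* ops;  // one opcode byte per instruction
  size_t op_count;
} toy_function_t;

// Runs once at module load: rejects malformed bytecode (unknown opcodes,
// missing terminator) so dispatch never has to re-check per instruction.
static bool toy_verify(const toy_function_t* fn) {
  if (fn->op_count == 0 || fn->ops[fn->op_count - 1] != TOY_OP_RET) return false;
  for (size_t i = 0; i < fn->op_count; ++i) {
    if (fn->ops[i] >= TOY_OP_COUNT) return false;
  }
  return true;
}

// Hot path: no per-instruction checks. Only safe because toy_verify ran at
// load time (or because verification was deliberately compiled out for size,
// as with the CMake flag above).
static int toy_dispatch(const toy_function_t* fn) {
  int acc = 0;
  for (size_t i = 0;; ++i) {
    switch (fn->ops[i]) {
      case TOY_OP_ADD: ++acc; break;
      case TOY_OP_RET: return acc;
    }
  }
}
```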

Since this required breaking binary compatibility I also made some deferred
changes that had been sitting in
https://github.com/openxla/iree/projects/32, namely adding requirement bits and
fixing vm.buffer.fill to take its length in elements instead of bytes.
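
For the vm.buffer.fill change, the length operand is now interpreted as an
element count rather than a byte count; a minimal sketch of the distinction
(illustrative only, not the actual implementation):

```
#include <stdint.h>
#include <string.h>

// Illustrative only: the fill length is an element count, so the byte span
// written is element_count * element_size rather than element_count bytes.
static void toy_buffer_fill(uint8_t* data, uint64_t element_count,
                            uint32_t element_size, const void* pattern) {
  for (uint64_t i = 0; i < element_count; ++i) {
    memcpy(data + i * element_size, pattern, element_size);
  }
}
```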

The requirement bits give us much nicer error messages: per-module and
per-function bitfields indicate the features required to execute the bytecode
they contain, e.g.:
```
D:\Dev\iree\runtime\src\iree\vm\bytecode\module.c:309: INVALID_ARGUMENT; required module features [EXT_F32] are not available in this runtime configuration; have [] while module requires [EXT_F32]; while invoking native function hal.executable.create; while calling import;
[ 1]   native hal.executable.create:0 -
[ 0] bytecode module.__init:446 D:\Dev\iree/tests/e2e/models/unidirectional_lstm.mlir:0:0
```
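
A minimal sketch of how such a requirement check might look (hypothetical names
and bit values; the real feature bits are defined by the bytecode module
format):

```
#include <stdbool.h>
#include <stdint.h>

// Hypothetical feature bits for illustration only.
typedef uint64_t toy_feature_bits_t;
#define TOY_FEATURE_EXT_F32 (1ull << 0)
#define TOY_FEATURE_EXT_F64 (1ull << 1)

// Returns true if every feature a module (or an individual function) requires
// is available in the current runtime build; the missing bits drive an error
// like the one above instead of a failure mid-execution.
static bool toy_check_requirements(toy_feature_bits_t available,
                                   toy_feature_bits_t required,
                                   toy_feature_bits_t* out_missing) {
  *out_missing = required & ~available;
  return *out_missing == 0;
}
```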

Splitting verification out of dispatch is good for code reuse/optionality but
also now lets us dispatch without verifying. Between removing the inlined
verification/register masking/etc. (sketched after the numbers below) and
streamlining buffer access we get a near-2x speedup on compute-heavy VMVX
workloads. ResNet50, for example, on my Ryzen system (with
`--iree-vm-target-index-bits=64`, which I need to make the default):

```
before:
1 core:  BM_predict/process_time/real_time     343774 ms       343766 ms            1 items_per_second=2.90888m/s
8 core:  BM_predict/process_time/real_time      48306 ms       361156 ms            1 items_per_second=0.0207012/s
32 core: BM_predict/process_time/real_time      18943 ms       408922 ms            2 items_per_second=0.0527891/s

after:
1 core:  BM_predict/process_time/real_time     147856 ms       147859 ms            1 items_per_second=6.76332m/s
8 core:  BM_predict/process_time/real_time      21569 ms       158781 ms            1 items_per_second=0.0463637/s
32 core: BM_predict/process_time/real_time       8962 ms       186276 ms            3 items_per_second=0.111579/s
```
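
To show what the removed inlined register masking refers to (a simplified
sketch under assumed names, not the actual dispatch code): previously every
operand ordinal was masked in the hot loop to keep it in range; with
ahead-of-time verification the ordinal is already known to be valid and can
index the register file directly.

```
#include <stdint.h>

#define TOY_REGISTER_COUNT 64
#define TOY_REGISTER_MASK (TOY_REGISTER_COUNT - 1)

// Simplified sketch, not IREE's actual register file or dispatch code.
typedef struct {
  int32_t i32[TOY_REGISTER_COUNT];
} toy_registers_t;

// Before: every access masked the ordinal so a corrupt operand could never
// index outside the register file.
static inline int32_t toy_load_reg_masked(const toy_registers_t* regs,
                                          uint16_t ordinal) {
  return regs->i32[ordinal & TOY_REGISTER_MASK];
}

// After: the verifier proved all ordinals are in range at load time, so the
// hot loop indexes directly with no mask.
static inline int32_t toy_load_reg_unmasked(const toy_registers_t* regs,
                                            uint16_t ordinal) {
  return regs->i32[ordinal];
}
```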

About 20-30% of the remaining runtime is spent in bytecode dispatch, which
needs a larger op table rework to improve. That will also reduce code size
quite a bit, as today we have a lot of duplicate decoding work. The most
expensive remaining ops are buffer loads/stores (sketched below), and short of
a JIT or scatter/gather ops such as #8477 there's not much to do besides
issuing less work. Today codegen is producing some phenomenally bad code and
we're executing ~100x+ more instructions than required
(https://gist.github.com/benvanik/e2b45891e02baf8318109b60189a1b12 for
example - that's _a lot_ of loop arithmetic, a useless fill that should
be removed or at least turned into a util.buffer.fill, and unfused
writebacks) - even without op table shuffling or microkernels we should
be well under 900ms instead of 9000ms. We also have to parameterize our
workgroup distribution: today we don't tend to use more than 4-16 cores, so we
don't see the latency improvement we'd expect going from 8 to 32 cores
(16 cores is nearly identical).
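
For context on why buffer loads/stores dominate (an illustrative sketch, not
IREE's implementation): each interpreted element access pays for an offset
computation and a bounds check that a native load would not, so codegen that
issues per-element loads multiplies that overhead.

```
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

// Illustrative only: a toy buffer with the kind of per-access bookkeeping an
// interpreted load pays compared to a single native load instruction.
typedef struct {
  uint8_t* data;
  uint64_t length;  // in bytes
} toy_buffer_t;

static bool toy_buffer_load_f32(const toy_buffer_t* buffer, uint64_t offset,
                                float* out_value) {
  // Bounds check on every access (written to avoid unsigned overflow).
  if (buffer->length < sizeof(float) ||
      offset > buffer->length - sizeof(float)) {
    return false;
  }
  memcpy(out_value, buffer->data + offset, sizeof(float));  // unaligned-safe
  return true;
}
```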

These issues also impact the emitc path, as the C compiler downstream of it is
dealing with all of this difficult-to-analyze output and can't do much with it.
I thought an emitc ResNet with inline VMVX would be a good approximation of
what a JIT could do, and it's not good: 448154ms vs the 147856ms of the
interpreter! It's also 2x larger on disk (380KB x86_64 vs 200KB bytecode), and
that's prior to optimization of the bytecode. We should definitely be able to
go 2-4x faster with a naïve JIT - if not 10x!

Fixes #5732.
Fixes #12373.