| commit | 08ae97f9007c8a54e619f6252dd0962ee63342a6 | |
|---|---|---|
| author | Ben Vanik <ben.vanik@gmail.com> | Fri Mar 03 20:43:47 2023 -0800 |
| committer | GitHub <noreply@github.com> | Fri Mar 03 20:43:47 2023 -0800 |
| tree | 20a7a9fcd8cca014af097e06927019159d3b8070 | |
| parent | 38abd137115ff15f6ee8fd1f5f6c5055b81c9867 | |
| parent | 784db04adc74c5bfa5a434f1e686a0ff69906e8b | |
Refreshing VM performance by separating verification and tweaking buffer access. (#12426)

This moves much of the interpreter checks to an ahead-of-time bytecode verifier. This allows us to share the same verification with JITs and to disable it entirely for code size reasons using the `-DIREE_VM_BYTECODE_VERIFICATION_ENABLE=0` compiler flag. Verification is pretty exhaustive but may still need some additions. It's significantly better than before, though, so even if not the final form it's a good step. Simpler verification (and dispatch) will come with a pending bytecode shuffling into instruction classes.

Since this required breaking binary compatibility I did some deferred changes that had been sitting in https://github.com/openxla/iree/projects/32, namely requirement bits and fixing vm.buffer.fill from bytes to elements. The requirement bits give us much nicer error messages by supporting per-module and per-function bitfields indicating the features required to execute the bytecode they contain, e.g.:

```
D:\Dev\iree\runtime\src\iree\vm\bytecode\module.c:309: INVALID_ARGUMENT; required module features [EXT_F32] are not available in this runtime configuration; have [] while module requires [EXT_F32]; while invoking native function hal.executable.create; while calling import;
[ 1] native hal.executable.create:0 -
[ 0] bytecode module.__init:446 D:\Dev\iree/tests/e2e/models/unidirectional_lstm.mlir:0:0
```

Splitting the verification out from dispatch is good for code reuse/optionality but also now lets us dispatch without verifying. Between removing the inlined verification/register masking/etc and streamlining buffer access we get a near 2x speedup on compute-heavy VMVX workloads. resnet50, for example, on my Ryzen system (with `--iree-vm-target-index-bits=64`, which I need to make the default):

```
before:
 1 core: BM_predict/process_time/real_time 343774 ms 343766 ms 1 items_per_second=2.90888m/s
 8 core: BM_predict/process_time/real_time  48306 ms 361156 ms 1 items_per_second=0.0207012/s
32 core: BM_predict/process_time/real_time  18943 ms 408922 ms 2 items_per_second=0.0527891/s

after:
 1 core: BM_predict/process_time/real_time 147856 ms 147859 ms 1 items_per_second=6.76332m/s
 8 core: BM_predict/process_time/real_time  21569 ms 158781 ms 1 items_per_second=0.0463637/s
32 core: BM_predict/process_time/real_time   8962 ms 186276 ms 3 items_per_second=0.111579/s
```

About 20-30% of the remaining runtime is spent in bytecode dispatch, which needs a larger op table rework to improve. That'll also reduce code size quite a bit as today we have a lot of duplicate decoding work. The most expensive ops remaining are buffer loads/stores, and short of JIT or scatter/gather such as #8477 there's not much to do besides less work. Today codegen is producing some phenomenally bad code and we're executing ~100x+ more instructions than required (https://gist.github.com/benvanik/e2b45891e02baf8318109b60189a1b12 for example - that's _a lot_ of loop arithmetic, a useless fill that should be removed or at least turned into a util.buffer.fill, and unfused writebacks) - even without op table shuffling or microkernels we should be well under 900ms instead of 9000ms. We've also got to parameterize our workgroup distribution - today we don't tend to use more than 4-16 cores so we don't see the latency improvement we'd expect going 8->32 cores (16 cores is nearly identical). These issues also impact the emitc paths, as the C compiler downstream of them is dealing with all of this difficult-to-analyze output and can't do much.
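To make the codegen complaint above concrete, here is a minimal C sketch of the pattern being described - a fill that is fully overwritten and a writeback that isn't fused with the compute loop - next to the single fused pass it should be. The names and shapes are my own illustration, not IREE's actual generated code or any runtime API.

```c
#include <stdlib.h>
#include <string.h>

// Illustrative only: the redundant shape called out above - a useless fill,
// compute into a scratch buffer, and a separate unfused writeback pass.
static void scale_unfused(float* out, const float* in, size_t n) {
  memset(out, 0, n * sizeof(float));       // useless fill: overwritten below
  float* tmp = malloc(n * sizeof(float));  // scratch that shouldn't exist
  if (!tmp) return;
  for (size_t i = 0; i < n; ++i) tmp[i] = in[i] * 0.5f;
  memcpy(out, tmp, n * sizeof(float));     // unfused writeback
  free(tmp);
}

// The fused equivalent: one pass over the data, no fill, no scratch, no copy.
static void scale_fused(float* out, const float* in, size_t n) {
  for (size_t i = 0; i < n; ++i) out[i] = in[i] * 0.5f;
}
```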
I thought an emitc resnet with inline VMVX would be a good approximation of what a JIT could do, and it's not good: 448154ms vs the interpreter's 147856ms! It's also 2x larger on disk (380KB x86_64 vs 200KB bytecode), and that's prior to optimization of the bytecode. We should definitely be able to go 2-4x faster with a naïve JIT - if not 10x!

Fixes #5732.
Fixes #12373.
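As a rough sketch of the load-time split described above - check requirement bits, verify the bytecode once, then dispatch without inline per-op checks - here is the shape of that flow in C. The `IREE_VM_BYTECODE_VERIFICATION_ENABLE` define and the EXT_F32 feature bit come from the commit message; every type and function name below is a hypothetical stand-in, not the real iree/vm API.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

// Mirrors the commit's flag: build with -DIREE_VM_BYTECODE_VERIFICATION_ENABLE=0
// to compile the ahead-of-time verifier out entirely for code size.
#ifndef IREE_VM_BYTECODE_VERIFICATION_ENABLE
#define IREE_VM_BYTECODE_VERIFICATION_ENABLE 1
#endif

// Hypothetical stand-ins for illustration; not the real iree/vm types.
typedef uint32_t vm_feature_bits_t;
enum { VM_FEATURE_EXT_F32 = 1u << 0 };  // e.g. the EXT_F32 requirement bit

typedef struct {
  vm_feature_bits_t required_features;  // per-module requirement bitfield
  // ... bytecode, rodata, function table, per-function requirement bits ...
} vm_module_t;

// Hypothetical verifier entry point: checks opcodes, branch targets, register
// ranges, and types once at load time so dispatch can skip inline checks.
static bool vm_verify_bytecode(const vm_module_t* module) {
  (void)module;
  return true;
}

// Load-time gate: nicer errors from the requirement bits, then optional
// exhaustive verification; after this, dispatch assumes well-formed bytecode.
static bool vm_module_load_checks(const vm_module_t* module,
                                  vm_feature_bits_t available_features) {
  vm_feature_bits_t missing = module->required_features & ~available_features;
  if (missing) {
    fprintf(stderr,
            "required module features are not available in this runtime "
            "configuration (missing bits: 0x%x)\n", (unsigned)missing);
    return false;
  }
#if IREE_VM_BYTECODE_VERIFICATION_ENABLE
  if (!vm_verify_bytecode(module)) return false;
#endif
  return true;
}

int main(void) {
  // Mirrors the error in the log above: the module requires EXT_F32 but this
  // runtime configuration provides no extensions.
  vm_module_t module = { .required_features = VM_FEATURE_EXT_F32 };
  vm_feature_bits_t have = 0;
  return vm_module_load_checks(&module, have) ? 0 : 1;
}
```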
IREE (Intermediate Representation Execution Environment, pronounced as “eerie”) is an MLIR-based end-to-end compiler and runtime that lowers Machine Learning (ML) models to a unified IR that scales up to meet the needs of the datacenter and down to satisfy the constraints and special considerations of mobile and edge deployments.
See our website for project details, user guides, and instructions on building from source.
IREE is still in its early phase. We have settled on the overarching infrastructure and are actively improving various software components as well as project logistics. It is still quite far from ready for everyday use and is made available without any support at the moment. With that said, we welcome any kind of feedback through any of our communication channels! See our website for more information.
IREE is licensed under the terms of the Apache 2.0 License with LLVM Exceptions. See LICENSE for more information.