)]}'
{
  "commit": "08ae97f9007c8a54e619f6252dd0962ee63342a6",
  "tree": "20a7a9fcd8cca014af097e06927019159d3b8070",
  "parents": [
    "38abd137115ff15f6ee8fd1f5f6c5055b81c9867",
    "784db04adc74c5bfa5a434f1e686a0ff69906e8b"
  ],
  "author": {
    "name": "Ben Vanik",
    "email": "ben.vanik@gmail.com",
    "time": "Fri Mar 03 20:43:47 2023 -0800"
  },
  "committer": {
    "name": "GitHub",
    "email": "noreply@github.com",
    "time": "Fri Mar 03 20:43:47 2023 -0800"
  },
  "message": "Refreshing VM performance by separating verification and tweaking buffer access. (#12426)\n\nThis moves much of the interpreter checks to an ahead-of-time bytecode\r\nverifier. This allows us to share the same verification with JITs and\r\ndisable it entirely for code size reasons using the\r\n`-DIREE_VM_BYTECODE_VERIFICATION_ENABLE\u003d0` compiler flag. Verification\r\nis pretty exhaustive but may still need some additions. It\u0027s\r\nsignificantly better than before, though, so even if not the final form\r\nit\u0027s a good step. Simpler verification (and dispatch) will come with a\r\npending bytecode shuffling into instruction classes.\r\n\r\nSince this required breaking binary compatibility I did some deferred\r\nchanges that had been sitting in\r\nhttps://github.com/openxla/iree/projects/32, namely requirement bits and\r\nfixing vm.buffer.fill from bytes to elements.\r\n\r\nThe requirements give us much nicer error messages by supporting\r\nper-module and per-function bitfields indicating features required to\r\nexecute the bytecode they contain, e.g.:\r\n```\r\nD:\\Dev\\iree\\runtime\\src\\iree\\vm\\bytecode\\module.c:309: INVALID_ARGUMENT; required module features [EXT_F32] are not available in this runtime configuration; have [] while module requires [EXT_F32]; while invoking native function hal.executable.create; while calling import;\r\n[ 1]   native hal.executable.create:0 -\r\n[ 0] bytecode module.__init:446 D:\\Dev\\iree/tests/e2e/models/unidirectional_lstm.mlir:0:0\r\n```\r\n\r\nSplitting out the verification from dispatch is good for code\r\nreuse/optionality but also now lets us dispatch without verifying.\r\nBetween removing the inlined verification/register masking/etc and\r\nstreamlining buffer access we get a near 2x speedup of compute-heavy\r\nVMVX workloads. 
resnet50 for example on my Ryzen system (with\r\n`--iree-vm-target-index-bits\u003d64`, which I need to make default):\r\n\r\n```\r\nbefore:\r\n1 core:  BM_predict/process_time/real_time     343774 ms       343766 ms            1 items_per_second\u003d2.90888m/s\r\n8 core:  BM_predict/process_time/real_time      48306 ms       361156 ms            1 items_per_second\u003d0.0207012/s\r\n32 core: BM_predict/process_time/real_time      18943 ms       408922 ms            2 items_per_second\u003d0.0527891/s\r\n\r\nafter:\r\n1 core:  BM_predict/process_time/real_time     147856 ms       147859 ms            1 items_per_second\u003d6.76332m/s\r\n8 core:  BM_predict/process_time/real_time      21569 ms       158781 ms            1 items_per_second\u003d0.0463637/s\r\n32 core: BM_predict/process_time/real_time       8962 ms       186276 ms            3 items_per_second\u003d0.111579/s\r\n```\r\n\r\nAbout 20-30% of the remaining runtime is spent in bytecode dispatch\r\nwhich needs a larger op table reworking to make better. That\u0027ll also\r\nreduce code size quite a bit as today we have a lot of duplicate\r\ndecoding work. The most expensive ops remaining are buffer loads/stores\r\nand short of JIT or scatter/gather such as #8477 there\u0027s not much to do\r\nbesides doing less work. Today codegen is producing some phenomenally bad code\r\nand we\u0027re executing ~100x+ more instructions than required\r\n(https://gist.github.com/benvanik/e2b45891e02baf8318109b60189a1b12 for\r\nexample - that\u0027s _a lot_ of loop arithmetic, a useless fill that should\r\nbe removed or at least turned into a util.buffer.fill, and unfused\r\nwritebacks) - even without op table shuffling or microkernels we should\r\nbe well under 900ms instead of 9000ms. 
We\u0027ve also got to parameterize\r\nour workgroup distribution - today we don\u0027t tend to use more than 4-16\r\ncores so we don\u0027t see the latency improvement we\u0027d expect going 8-\u003e32\r\ncores (16 cores is nearly identical).\r\n\r\nThese issues also impact emitc paths as the C compiler downstream of\r\nthat is dealing with all this difficult to analyze output and can\u0027t do\r\nmuch. I thought an emitc resnet with inline VMVX would be a good\r\napproximation of what a JIT could do and it\u0027s not good: 448154ms vs the\r\n147856ms of the interpreter! It\u0027s also 2x larger in size on disk (380KB\r\nx86_64 vs 200KB bytecode) and that\u0027s prior to optimization of the\r\nbytecode. We should definitely be able to do 2-4x faster with a naïve\r\nJIT - if not 10x!\r\n\r\nFixes #5732.\r\nFixes #12373.",
  "tree_diff": []
}
