2020 Q4 Objectives (OKRs)
- CPU perf burndown. Bring up the perf burndown process and turn the crank a few times on CPU codegen. The initial workload we'll be targeting is the MobileBERT encoder. In parallel we will assess potential alternative workloads for next quarter.
- GPU back-end. Continue to land critical infrastructure, take a moment to pause and evaluate where we stand on performance, and pursue any low-hanging perf fruit in anticipation of a GPU perf burndown in Q1. To the extent it makes sense, prioritize work that benefits both CPU and GPU perf.
- Infrastructure. Continue to make critical improvements to build infrastructure that improve development velocity. Support the CPU perf burndown effort by ensuring that benchmarking can be performed easily and its results are meaningful and reliable.
P1 O: [Perf burndown] Bring up and execute perf burndown process for MobileBERT workload
- P1 KR: IREE CPU codegen achieves near peak throughput on all of the MobileBERT matmul shapes
- At least the ones with > 4% weight in the above profile.
- P1 KR: Ensure that IREE CPU codegen achieves decent performance on softmax. That means matching TFLite performance on the softmax layers in the MobileBERT model.
- P1 KR: Able to benchmark and profile the whole MobileBERT workload in both TFLite and IREE and compare results.
- P1 KR: Performed at least 2 cycles of: (a) assess whole workload performance (b) identify key source of performance delta between TFLite and IREE (c) fix the issue (d) repeat.
- P1 KR: Prioritized list of key sources of performance delta between TFLite and IREE
- This list should be largely composed of issues that have non-trivial resolutions, like a re-design of a key IREE component. It should be updated on an ongoing basis.
- P2 KR: MobileBERT end-to-end benchmark matches 80% of TFLite performance.
- P2 KR: Sources of remaining perf gap in TFLite vs IREE on MobileBERT end-to-end benchmark characterized.
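One way to picture the burndown cycle and the 80% target above is the minimal sketch below. It assumes that per-op latencies in milliseconds are available from both runtimes, and that "matches 80% of TFLite performance" means the ratio of TFLite latency to IREE latency is at least 0.80; both are assumptions, not definitions taken from this plan. The function names are hypothetical and the numbers are placeholders, not measurements.

```python
from typing import Dict, List, Tuple


def rank_latency_deltas(
    tflite_ms: Dict[str, float], iree_ms: Dict[str, float]
) -> List[Tuple[str, float]]:
    """Rank ops by how much slower IREE is than TFLite (largest gap first)."""
    deltas = [(op, iree_ms[op] - tflite_ms[op]) for op in tflite_ms if op in iree_ms]
    return sorted(deltas, key=lambda pair: pair[1], reverse=True)


def relative_performance(tflite_total_ms: float, iree_total_ms: float) -> float:
    """Assumed metric: TFLite latency divided by IREE latency (1.0 means parity)."""
    return tflite_total_ms / iree_total_ms


def meets_target(
    tflite_total_ms: float, iree_total_ms: float, target: float = 0.80
) -> bool:
    """True when IREE reaches at least `target` of TFLite performance."""
    return relative_performance(tflite_total_ms, iree_total_ms) >= target


# Placeholder per-op latencies (illustrative only, not real measurements).
example_tflite = {"matmul": 10.0, "softmax": 2.0, "other": 3.0}
example_iree = {"matmul": 14.0, "softmax": 2.5, "other": 3.0}

ranked = rank_latency_deltas(example_tflite, example_iree)
on_target = meets_target(sum(example_tflite.values()), sum(example_iree.values()))
```

The op at the head of `ranked` is the candidate for step (b) of the cycle, and `on_target` is the kind of check the end-to-end KR would rely on.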
P1 O: [Perf burndown] Improve benchmarking and profiling tooling to support perf burndown
P1 KR: Able to micro-benchmark kernels using a shared, documented tool
- Improve the dump of input and output for each dispatch function.
- Dump dispatch functions to files
- Improve diffing tools
- Report performance for each dispatch function?
P1 KR: Able to perform IREE profiling using Tracy
P1 KR: Able to map time spent in execution back to source using Tracy
P1 KR: Able to track compile-time performance-related statistics
- See https://github.com/google/iree/issues/1409
- Initial stats to track: number of executables, the serialized size of constant data, the serialized size of the executables, the number of host readbacks (flow.tensor.load), backend specific stats like the number of split dispatches in the SPIR-V backend, dynamic shape info like the number of tensors with dynamic shapes that survive after shape propagation
P1 KR: Internal and external contributors able to confidently assess performance impact of a change.
- Like correctness, have tests to guard against performance regressions
- Either some kind of presubmit using a consistent environment or instructions for running something manually that we believe will offer a useful before/after signal.
- Include TFLite as a baseline.
- Build test suite and mechanism for tracking performance
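The before/after signal described above could be as simple as the sketch below. It assumes repeated latency samples (in milliseconds) collected in a consistent environment, compares medians to reduce sensitivity to outliers, and flags a regression when the candidate is slower than the baseline by more than a tolerance. The 5% default tolerance and the function names are hypothetical choices, not part of this plan.

```python
import statistics
from typing import Sequence


def is_regression(
    baseline_ms: Sequence[float],
    candidate_ms: Sequence[float],
    tolerance: float = 0.05,
) -> bool:
    """Flag a regression when the candidate median exceeds the baseline
    median by more than `tolerance` (a fraction, e.g. 0.05 for 5%)."""
    baseline = statistics.median(baseline_ms)
    candidate = statistics.median(candidate_ms)
    return candidate > baseline * (1.0 + tolerance)


def slowdown_vs_reference(
    reference_ms: Sequence[float], measured_ms: Sequence[float]
) -> float:
    """Ratio of measured median latency to a reference median latency
    (for example a TFLite run); values above 1.0 mean slower than reference."""
    return statistics.median(measured_ms) / statistics.median(reference_ms)
```

A presubmit could call `is_regression` on samples taken before and after a change, and report `slowdown_vs_reference` against a TFLite run of the same model as the baseline.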
P1 O: [Perf burndown] Identify target workload for IREE perf credibility burndown
Note: This is in preparation for a 2021Q1 objective: Establish IREE's credibility at delivering competitive production levels of performance on a realistic use case.
- P1 KR: Defined criteria for evaluating workloads for the burndown.
- P1 KR: Evaluated criteria (including performance analysis) for all candidate workloads.
- P1 KR: Selected target workload for the CPU burndown.
- P1 KR: Selected target workload for the GPU burndown in Q1.
- Considerations: how representative the workload is for GPU, what to compare against, and what we can achieve in one quarter.
P1 O: [Perf burndown] Add initial support for multi-threaded workloads
- P1 KR: Selected a target multi-threaded workload for development and initial benchmarking
- P1 KR: Able to benchmark a multi-threaded workload on CPU.
- P1 KR: Use the GPU tiling pass on CPU, and treat the CPU as another device
- P2 KR: Documentation of known issues / architectural challenges with current approach to multi-threading.
P1 O: [Infra] Improve OSS build infrastructure to support continued development
P1 KR: Minimal-effort merging process for integrating new LLVM commits with no dependency on TF
- Continuous build for OSS LLVM build files. Propose upstreaming to LLVM community.
P1 KR: IREE Core build doesn't depend on TF
- Requires integration with separate MLIR-HLO repo.
P1 KR: Extended build bot / lint coverage catches issues in OSS
- ASan, clang-tidy, Windows (iff someone volunteers a Windows machine), yapf, Bazel Android, unibeautify for formatting more generally(?), binary size.
P1 KR: General build health
- A bunch of small things that require attention: RBE warnings on diagnostics that are disabled, removing use of globs in CMake, keeping up to date with Bazel versions, seeing if we can speed up Bazel bot builds by moving the sandbox root to tmpfs, and removing common include directories.
P2 KR: Check tests are fast in dbg mode
- Currently we've got quadratic growth for a slow path in SwiftShader dbg. Potential solutions: IREE multi-module -> archive compilation support, conditional sharing of instances in tests.
P1 O: [User-facing] Prepare to support real-world use cases
Notes: Keep a pulse on deployment user journeys, continue to gather requirements from interested users, set ourselves on a path to production use on at least one platform.
P1 KR: A new sample application showing high-level IREE behavior
- Android/Java speech demo and/or desktop Vulkan image-to-image pipeline
P1 KR: A PRD that captures deployment requirements (such as platform and device support, async model download, etc.)
P2 KR: Identify and track list of potential customers, their use cases, and deployment requirements
P2 KR: Proof of concept deployment with at least one other team
- Focus on requirements gathering and prototyping, find what features we're missing (e.g. Android or Vulkan memory sharing, build configurations compatible with Stadia, etc.)
P1 O: [User-facing] Expand Java API to support a useful sample Android app
- P1 KR: Support for different input and output types
- P1 KR: Support for driver creation
- P1 KR: Support for different module types, including: HAL, bytecode, tensorlist, and string
- P2 KR: Design for supporting custom modules
P1 O: [Model support] Achieve target TensorFlow front-end fidelity
- P1 KR: Full ASR Decoder to HLO
- P1 KR: Investigate TF to HLO lowering without TF folding (e.g. linspace)
- P2 KR: Transformer ASR to HLO
- P2 KR: Documentation detailing unlowerable operations to HLO
- P3 KR: Stretch: SASP rewrite for better shape inference
P1 O: [Model support] Implement remaining VMLA ops
- P1 KR: FFT operations
- P1 KR: Sort operation
- P1 KR: Full ASR Decoder executing on VMLA
- P2 KR: Transformer ASR executing on VMLA
P1 O: [CPU Codegen] Improve CPU codegen infrastructure
P1 KR: Lowering path from MHLO ops to Linalg named ops (library calls).
- Sort, TopK, and similar indexing-style ops
P1 KR: Support Linalg fusion on buffers using stack allocations.
P1 KR: Improve AOT linking and support automatic toolchain discovery
- Link all executables in a single dylib.
- Support exporting/loading dylib to standalone binary.
P1 O: [MLIR codegen] Retargetable code generation (Vector dialect-based approach)
- P1 KR: Develop mechanisms to distribute vector operations at the workgroup level to vector operations at the subgroup level / work-item level
- P1 KR: Handle distribution of producer-consumer vector operations to implement fusion
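As a rough illustration of the two KRs above, the sketch below models a workgroup-level vector as a Python list and distributes it in contiguous blocks across lanes (standing in for subgroups or work items). The producer and consumer are applied per lane, so the intermediate stays lane-local, which is the fusion effect the second KR is after. This is a conceptual model only: the blocked distribution scheme and every name here are assumptions, not the Vector dialect's actual mechanism.

```python
from typing import Callable, List


def distribute(vector: List[float], num_lanes: int) -> List[List[float]]:
    """Split a workgroup-level vector into equal contiguous per-lane slices."""
    if num_lanes <= 0 or len(vector) % num_lanes != 0:
        raise ValueError("vector length must be a positive multiple of num_lanes")
    width = len(vector) // num_lanes
    return [vector[i * width:(i + 1) * width] for i in range(num_lanes)]


def run_fused_per_lane(
    vector: List[float],
    num_lanes: int,
    producer: Callable[[float], float],
    consumer: Callable[[float], float],
) -> List[float]:
    """Apply producer then consumer within each lane's slice.

    The producer's result is never materialized as a full workgroup-level
    vector; each lane only holds its own slice of the intermediate.
    """
    result: List[float] = []
    for lane_slice in distribute(vector, num_lanes):
        intermediate = [producer(x) for x in lane_slice]  # lane-local values
        result.extend(consumer(x) for x in intermediate)
    return result
```

For elementwise ops, the distributed result is identical to applying both ops to the whole vector, which gives a simple correctness check for any distribution scheme.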
P1 O: [GPU Codegen] Improve generic GPU codegen performance
P1 KR: Using Linalg fusion on buffers
- Fuse operations like matmul/conv with their producers/consumers, using workgroup memory as intermediate tile storage. Examples: Elementwise -> Matmul -> Elementwise, Padding -> Conv/Pool operations
P1 KR: Fast matmul for mobile GPUs that don't have tensor core units
- This applies work from Q2 to more architectures (e.g., ARM and Qualcomm GPUs)
- Match the handwritten Vulkan kernel (on Pixel 4, the current IREE matmul for a 1K matrix runs in 128ms and the handwritten kernel runs in 19ms)
P1 KR: Using subgroup operations for reduction
P1 KR: Achieve reasonable performance for one model on one mobile GPU
- Match TFLite MobileNetV2 performance (f32, ImageNet)
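To put the Pixel 4 matmul timings quoted above (128ms for IREE, 19ms for the handwritten kernel) in throughput terms, the sketch below converts them to GFLOP/s. It assumes "1K matrix" means a square 1024x1024 by 1024x1024 matmul and counts 2 floating-point operations per multiply-accumulate; both are assumptions, since the exact shape is not stated here.

```python
def matmul_flops(m: int, n: int, k: int) -> int:
    """FLOP count for an (m x k) by (k x n) matmul, 2 FLOPs per multiply-add."""
    return 2 * m * n * k


def gflops_per_second(flops: int, milliseconds: float) -> float:
    """Convert a FLOP count and a runtime in milliseconds to GFLOP/s."""
    return flops / (milliseconds * 1e-3) / 1e9


# Assumed shape: 1024 x 1024 x 1024 (the plan only says "1K matrix").
flops = matmul_flops(1024, 1024, 1024)

iree_gflops = gflops_per_second(flops, 128.0)        # roughly 16.8 GFLOP/s
handwritten_gflops = gflops_per_second(flops, 19.0)  # roughly 113 GFLOP/s
required_speedup = 128.0 / 19.0                      # roughly 6.7x
```

Under these assumptions, matching the handwritten kernel means closing a gap of roughly 6.7x on this shape.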
P1 O: [GPU Codegen] Add support for targeting different GPU hardware
P1 O: [GPU Codegen] Build infrastructure for performance improvement
P1 O: [Strategy] Define high-level strategy for quantization support in IREE
- P1 KR: Strategy document describing 2021 roadmap, resourcing, and approach
- P1 KR: Two initial quantization projects described and queued up for 2021 Q1