| # Simple IR Walkthrough |
| |
| Note that this doc is quite outdated. We expect to update it soon. |
| |
| ## Overview |
| |
This document walks through the process of lowering a TensorFlow Python program
to an IREE module, demonstrating the MLIR that exists at each stage. Many
individual intermediate transforms are skipped for clarity, but the major
dialect milestones during lowering are present.
| |
| **NOTE**: this represents the IR as it exists at the time of writing, which is |
| planned to undergo significant changes soon. Take this more as a conceptual |
| walkthrough than a reference for what the IR looks like. |
| |
| ## TensorFlow to XLA HLO |
| |
The "frontend" in this example is TensorFlow: we import the program into the
MLIR TensorFlow dialect and lower it to the mid-level XLA HLO IR. Many backends
beyond IREE (such as the TPU and CUDA backends) can consume XLA HLO, meaning
that the work required to convert TensorFlow ops to the much more restricted
XLA HLO op set is shared amongst many projects.
| |
| ### TensorFlow Python |
| |
This example uses TensorFlow 1.0 syntax to produce a GraphDef. IREE is designed
to work best with the TensorFlow 2.0 SavedModel representation; a rough TF 2.0
equivalent is sketched after the snippet below.
| |
| ```python |
| import tensorflow as tf |
| with tf.Session() as session: |
| arg0 = tf.placeholder(tf.float32, shape=[4]) |
| arg1 = tf.placeholder(tf.float32, shape=[4]) |
| result = tf.multiply(arg0, arg1) |
| print(session.graph_def) |
| ``` |
| |
| ### TensorFlow GraphDef |
| |
GraphDefs do not contain information about the feeds and fetches, nor do they
identify exported functions. TensorFlow 2.0 makes this significantly easier,
but since most readers are familiar with TF 1.0 the GraphDef is displayed here.
| |
| ```protobuf |
| node { |
| name: "Placeholder" |
| op: "Placeholder" |
| attr { |
| key: "dtype" |
| value { type: DT_FLOAT } |
| } |
| attr { |
| key: "shape" |
| value { shape { dim { size: 4 } } } |
| } |
| } |
| node { |
| name: "Placeholder_1" |
| op: "Placeholder" |
| attr { |
| key: "dtype" |
| value { type: DT_FLOAT } |
| } |
| attr { |
| key: "shape" |
| value { shape { dim { size: 4 } } } |
| } |
| } |
| node { |
| name: "Mul" |
| op: "Mul" |
| input: "Placeholder" |
| input: "Placeholder_1" |
| attr { |
| key: "T" |
| value { type: DT_FLOAT } |
| } |
| } |
| ``` |
| |
| ### XLA HLO |
| |
XLA HLO is the dialect we lower to: instead of the 1400+ ops in TensorFlow we
end up with ~30 ops that better represent the actual math being performed. The
[XLA Operation Semantics](https://www.tensorflow.org/xla/operation_semantics)
are well(ish) documented and a great starting point for lowering into other
dialects. The existing
[tf2xla](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/compiler/tf2xla)
bridge can be used to convert the ops from GraphDef to XLA HLO, while a
[new implementation](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/compiler/mlir/xla/transforms)
based on MLIR is currently being written.
| |
| ```mlir |
| func @simple_mul(%arg0: tensor<4xf32>, %arg1: tensor<4xf32>) -> tensor<4xf32> { |
  %0 = mhlo.multiply %arg0, %arg1 : tensor<4xf32>
| return %0 : tensor<4xf32> |
| } |
| ``` |
| |
| What was a graph of nodes now looks much more like a traditional program: |
| there's a function with a well-defined signature, an operation that performs |
| some math on the operands, and the result of the math is returned. |
| |
In the XLA HLO dialect it's possible to express control flow (calls, loops,
conditionals), complex multi-operation regions such as reductions, etc. All
TensorFlow graph semantics (control edges, switch/merge, etc.) are lowered to
this form, and all data edges are converted to SSA values.
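
For example, a conditional written with `tf.cond` produces Switch/Merge nodes
in the GraphDef which the bridge then lowers to a structured conditional in
XLA HLO. A TF 1.0 sketch (the `pred` placeholder is illustrative and not part
of the example above):

```python
import tensorflow as tf

with tf.Session() as session:
  pred = tf.placeholder(tf.bool, shape=[])
  arg0 = tf.placeholder(tf.float32, shape=[4])
  arg1 = tf.placeholder(tf.float32, shape=[4])
  # tf.cond records Switch/Merge control flow in the GraphDef; the TF-to-XLA
  # bridge functionalizes it into a structured conditional in the HLO dialect.
  result = tf.cond(pred, lambda: arg0 * arg1, lambda: arg0 + arg1)
```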
| |
| ## IREE Module IR |
| |
Once lowered to XLA HLO, the IREE transformations legalize and lower the IR to
a high-level sequencer dialect (`iree_hl_seq`). At this point we are still
operating on tensors with value semantics, allowing us to use MLIR's SSA
representation to do some relatively complex (yet easy to express) transforms.
| |
| ### Dispatch Region Identification |
| |
| The final IREE module is designed to have as few sequencer operations as |
| possible. This is achieved by clustering operations into regions such that data |
| dependencies and execution order are correctly observed and that the dispatch |
| workload (roughly the shape of the output) is compatible. Jumping ahead a bit, |
| the dispatch regions correspond to dispatches against the target API (such as |
| Vulkan vkCmdDispatch) modulo threadgroup sizes. When still operating with value |
| semantics it's easy to use SSA use-def chains to ensure we are preserving the |
| expected behavior of the program. |
| |
| ```mlir |
| func @simple_mul(%arg0: tensor<4xf32>, %arg1: tensor<4xf32>) -> tensor<4xf32> { |
| %cst = arith.constant dense<[4, 1, 1]> : tensor<3xi32> |
| %0 = iree.dispatch_region[%cst : tensor<3xi32>](%arg2 = %arg0 : tensor<4xf32>, %arg3 = %arg1 : tensor<4xf32>) : tensor<4xf32> { |
| %1 = arith.mulf %arg2, %arg3 : tensor<4xf32> |
| iree.return %1 : tensor<4xf32> |
| } |
| return %0 : tensor<4xf32> |
| } |
| ``` |
| |
In the above example the workload is defined by `%cst` as 4x1x1. If there were
other ops that also had a (4, 1, 1) workload we could cluster them into the
same dispatch region.
| |
| Other dispatch-like operations, such as reductions, are also identified and |
| clustered appropriately at this stage. What we end up with is a top-level IR |
| performing dispatches with nested regions containing the work to perform. When |
| all identification has completed the goal is to have no math outside of the |
| dispatch regions (though copies are permitted). |
| |
| Additional passes may run that combine, split, or otherwise transform the |
| dispatch regions based on a set of cost functions or target capability metrics. |
For example, to ensure predictable maximum latency, larger dispatch regions may
be split based on how much memory bandwidth they are likely to consume.
| |
| ### Executable Outlining and High-level Sequencer IR |
| |
The first step in lowering to the IREE sequencer IR, which describes the
runtime sequence of operations to perform, is to isolate the work being
performed from how it is dispatched. We outline dispatch regions into
`iree.executable`s and replace the original `iree.dispatch_region` ops with
`iree_hl_seq.dispatch` ops referencing those executables. At this point we
still have not specified what our exact lowering targets are; however, we know
enough to establish the basic ABI used to pass parameters.
| |
| ```mlir |
| module { |
| iree.multi_arch_executable @simple_mul_ex_dispatch_0() { |
| iree.executable(Unspecified) { |
| module { |
| func @simple_mul_rgn_dispatch_0(%arg0: memref<4xf32>, %arg1: memref<4xf32>, %arg2: memref<4xf32>) |
| attributes {iree.executable.export} { |
| %0 = iree.load_input(%arg0 : memref<4xf32>) : tensor<4xf32> |
| %1 = iree.load_input(%arg1 : memref<4xf32>) : tensor<4xf32> |
| %2 = arith.mulf %0, %1 : tensor<4xf32> |
| iree.store_output(%2 : tensor<4xf32>, %arg2 : memref<4xf32>) |
| iree.return |
| } |
| } |
| } |
| } |
| func @simple_mul(%arg0: memref<4xf32>, %arg1: memref<4xf32>) -> memref<4xf32> { |
| %0 = iree_interp.constant dense<[4, 1, 1]> : tensor<3xi32> |
| %1 = "iree_hl_seq.alloc_heap"() : () -> memref<4xf32> |
| iree_hl_seq.dispatch simple_mul_ex_dispatch_0::simple_mul_rgn_dispatch_0[%0 : memref<3xi32>](%arg0, %arg1, %1) : (memref<4xf32>, memref<4xf32>, memref<4xf32>) -> () |
| iree_hl_seq.return %1 : memref<4xf32> |
| } |
| } |
| ``` |
| |
| Here we've allocated the output argument for the dispatch region via |
| `iree_hl_seq.alloc_heap` and passed it as an argument into the dispatch. The |
| executable entry point function gains a matching output argument where the final |
| result is stored. The `iree.load_input` and `iree.store_output` pseudo-commands |
| are used by backends in following lowering steps to determine how to load and |
| store their arguments. |
| |
| ### Low-level Sequencer IR |
| |
Once we've established the signatures between the sequencer and the executable
we can lower the sequencer IR to an explicitly-allocated dialect and perform
memory allocation. Here we attempt to alias/reuse buffers, determine buffers
that can be entirely elided, and reorder dispatches so that they can more
easily be grouped based on required barriers. Thanks to MLIR's built-in folding
logic we can also do some IR optimizations, such as converting the generic
dispatch to an `iree_ll_seq.static_dispatch` since we know the workload size at
compile time.

As part of this we also propagate any static information we can determine, such
as the workload, into the executables. This helps backends lower more
efficiently when possible.
| |
| ```mlir |
| module { |
| iree.multi_arch_executable @simple_mul_ex_dispatch_0[0]() { |
| iree.executable(Unspecified) { |
| module { |
| func @simple_mul_rgn_dispatch_0(%arg0: memref<4xf32>, %arg1: memref<4xf32>, %arg2: memref<4xf32>) |
| attributes {iree.executable.export, iree.executable.workload = dense<[4, 1, 1]> : tensor<3xi32>, iree.ordinal = 0 : i32} { |
| %0 = iree.load_input(%arg0 : memref<4xf32>) : tensor<4xf32> |
| %1 = iree.load_input(%arg1 : memref<4xf32>) : tensor<4xf32> |
| %2 = arith.mulf %0, %1 : tensor<4xf32> |
| iree.store_output(%2 : tensor<4xf32>, %arg2 : memref<4xf32>) |
| iree.return |
| } |
| } |
| } |
| } |
| func @simple_mul(%arg0: memref<4xf32>, %arg1: memref<4xf32>) -> memref<4xf32> |
| attributes {iree.ordinal = 0 : i32} { |
| %0 = "iree_ll_seq.alloc_heap"() : () -> memref<4xf32> |
| iree_ll_seq.static_dispatch simple_mul_ex_dispatch_0::simple_mul_rgn_dispatch_0[dense<[4, 1, 1]> : tensor<3xi32>](%arg0, %arg1, %0) : (memref<4xf32>, memref<4xf32>, memref<4xf32>) -> () |
| iree_ll_seq.return %0 : memref<4xf32> |
| } |
| } |
| ``` |
| |
| ### Executable Lowering to SPIR-V |
| |
For each executable and target combination we invoke an MLIR translation to
some target dialect. Here we are lowering to the SPIR-V dialect using the
current IREE-specific XLA HLO-to-SPIR-V lowering passes; other lowerings can be
swapped in as they become available. Below is the `simple_mul_ex_dispatch_0`
executable fully lowered to the canonical MLIR SPIR-V dialect, which can be
trivially serialized to SPIR-V words. Note how the `iree.load_input` and
`iree.store_output` ops are lowered to storage buffer loads and stores.
| |
| ```mlir |
| module { |
| spv.module "Logical" "GLSL450" { |
| spv.globalVariable @globalInvocationID built_in("GlobalInvocationId") : !spv.ptr<vector<3xi32>, Input> |
| spv.globalVariable @simple_mul_rgn_dispatch_0_arg_0 bind(0, 0) : !spv.ptr<!spv.struct<!spv.array<4 x f32 [4]> [0]>, StorageBuffer> |
| spv.globalVariable @simple_mul_rgn_dispatch_0_arg_1 bind(0, 1) : !spv.ptr<!spv.struct<!spv.array<4 x f32 [4]> [0]>, StorageBuffer> |
| spv.globalVariable @simple_mul_rgn_dispatch_0_arg_2 bind(0, 2) : !spv.ptr<!spv.struct<!spv.array<4 x f32 [4]> [0]>, StorageBuffer> |
| func @simple_mul_rgn_dispatch_0() { |
| %0 = spv._address_of @globalInvocationID : !spv.ptr<vector<3xi32>, Input> |
| %1 = spv.Load "Input" %0 : vector<3xi32> |
| %2 = spv.CompositeExtract %1[0 : i32] : vector<3xi32> |
| %3 = spv.CompositeExtract %1[1 : i32] : vector<3xi32> |
| %4 = spv.CompositeExtract %1[2 : i32] : vector<3xi32> |
| %5 = spv._address_of @simple_mul_rgn_dispatch_0_arg_0 : !spv.ptr<!spv.struct<!spv.array<4 x f32 [4]> [0]>, StorageBuffer> |
| %6 = spv.constant 0 : i32 |
| %7 = spv.AccessChain %5[%6, %2] : !spv.ptr<!spv.struct<!spv.array<4 x f32 [4]> [0]>, StorageBuffer> |
| %8 = spv.Load "StorageBuffer" %7 : f32 |
| %9 = spv._address_of @simple_mul_rgn_dispatch_0_arg_1 : !spv.ptr<!spv.struct<!spv.array<4 x f32 [4]> [0]>, StorageBuffer> |
| %10 = spv.constant 0 : i32 |
| %11 = spv.AccessChain %9[%10, %2] : !spv.ptr<!spv.struct<!spv.array<4 x f32 [4]> [0]>, StorageBuffer> |
| %12 = spv.Load "StorageBuffer" %11 : f32 |
| %13 = spv.FMul %8, %12 : f32 |
| %14 = spv._address_of @simple_mul_rgn_dispatch_0_arg_2 : !spv.ptr<!spv.struct<!spv.array<4 x f32 [4]> [0]>, StorageBuffer> |
| %15 = spv.constant 0 : i32 |
| %16 = spv.AccessChain %14[%15, %2] : !spv.ptr<!spv.struct<!spv.array<4 x f32 [4]> [0]>, StorageBuffer> |
| spv.Store "StorageBuffer" %16, %13 : f32 |
| spv.Return |
| } |
| spv.EntryPoint "GLCompute" @simple_mul_rgn_dispatch_0, @globalInvocationID |
| spv.ExecutionMode @simple_mul_rgn_dispatch_0 "LocalSize", 1, 1, 1 |
| } attributes {capabilities = ["Shader"], extensions = ["SPV_KHR_storage_buffer_storage_class"]} |
| } |
| ``` |
| |
| ### Final Module |
| |
| Below is the final module containing executables for both the IREE reference |
| interpreter backend and the Vulkan/SPIR-V backend, as well as the sequencer IR |
| function detailing how to dispatch the workload. |
| |
| ```mlir |
| module { |
| iree.multi_arch_executable @simple_mul_ex_dispatch_0[0]() { |
| iree.executable(IreeBytecode) { |
| module { |
| func @simple_mul_rgn_dispatch_0(%arg0: memref<4xf32>, %arg1: memref<4xf32>, %arg2: memref<4xf32>) |
| attributes {iree.executable.export, iree.executable.workload = dense<[4, 1, 1]> : tensor<3xi32>, iree.ordinal = 0 : i32} { |
| %0 = "iree_ll_interp.alloc_heap"() : () -> memref<4xf32> |
| "iree_ll_interp.mul_f"(%arg0, %arg1, %0) : (memref<4xf32>, memref<4xf32>, memref<4xf32>) -> () |
| %1 = "iree_ll_interp.constant"() {value = dense<0> : tensor<1xi32>} : () -> memref<1xi32> |
| %2 = "iree_ll_interp.constant"() {value = dense<4> : tensor<1xi32>} : () -> memref<1xi32> |
| "iree_ll_interp.dynamic_copy"(%0, %1, %arg2, %1, %2) : (memref<4xf32>, memref<1xi32>, memref<4xf32>, memref<1xi32>, memref<1xi32>) -> () |
| iree.return |
| } |
| } |
| } |
| iree.executable(SPIRV) { |
| spv.module "Logical" "GLSL450" { |
| spv.globalVariable @globalInvocationID built_in("GlobalInvocationId") : !spv.ptr<vector<3xi32>, Input> |
| spv.globalVariable @simple_mul_rgn_dispatch_0_arg_0 bind(0, 0) : !spv.ptr<!spv.struct<!spv.array<4 x f32 [4]> [0]>, StorageBuffer> |
| spv.globalVariable @simple_mul_rgn_dispatch_0_arg_1 bind(0, 1) : !spv.ptr<!spv.struct<!spv.array<4 x f32 [4]> [0]>, StorageBuffer> |
| spv.globalVariable @simple_mul_rgn_dispatch_0_arg_2 bind(0, 2) : !spv.ptr<!spv.struct<!spv.array<4 x f32 [4]> [0]>, StorageBuffer> |
| func @simple_mul_rgn_dispatch_0() { |
| %0 = spv._address_of @globalInvocationID : !spv.ptr<vector<3xi32>, Input> |
| %1 = spv.Load "Input" %0 : vector<3xi32> |
| %2 = spv.CompositeExtract %1[0 : i32] : vector<3xi32> |
| %3 = spv.CompositeExtract %1[1 : i32] : vector<3xi32> |
| %4 = spv.CompositeExtract %1[2 : i32] : vector<3xi32> |
| %5 = spv._address_of @simple_mul_rgn_dispatch_0_arg_0 : !spv.ptr<!spv.struct<!spv.array<4 x f32 [4]> [0]>, StorageBuffer> |
| %6 = spv.constant 0 : i32 |
| %7 = spv.AccessChain %5[%6, %2] : !spv.ptr<!spv.struct<!spv.array<4 x f32 [4]> [0]>, StorageBuffer> |
| %8 = spv.Load "StorageBuffer" %7 : f32 |
| %9 = spv._address_of @simple_mul_rgn_dispatch_0_arg_1 : !spv.ptr<!spv.struct<!spv.array<4 x f32 [4]> [0]>, StorageBuffer> |
| %10 = spv.constant 0 : i32 |
| %11 = spv.AccessChain %9[%10, %2] : !spv.ptr<!spv.struct<!spv.array<4 x f32 [4]> [0]>, StorageBuffer> |
| %12 = spv.Load "StorageBuffer" %11 : f32 |
| %13 = spv.FMul %8, %12 : f32 |
| %14 = spv._address_of @simple_mul_rgn_dispatch_0_arg_2 : !spv.ptr<!spv.struct<!spv.array<4 x f32 [4]> [0]>, StorageBuffer> |
| %15 = spv.constant 0 : i32 |
| %16 = spv.AccessChain %14[%15, %2] : !spv.ptr<!spv.struct<!spv.array<4 x f32 [4]> [0]>, StorageBuffer> |
| spv.Store "StorageBuffer" %16, %13 : f32 |
| spv.Return |
| } |
| spv.EntryPoint "GLCompute" @simple_mul_rgn_dispatch_0, @globalInvocationID |
| spv.ExecutionMode @simple_mul_rgn_dispatch_0 "LocalSize", 1, 1, 1 |
| } attributes {capabilities = ["Shader"], extensions = ["SPV_KHR_storage_buffer_storage_class"]} |
| } |
| } |
| func @simple_mul(%arg0: memref<4xf32>, %arg1: memref<4xf32>) -> memref<4xf32> |
| attributes {iree.ordinal = 0 : i32} { |
| %0 = "iree_ll_seq.alloc_heap"() : () -> memref<4xf32> |
| iree_ll_seq.static_dispatch simple_mul_ex_dispatch_0::simple_mul_rgn_dispatch_0[dense<[4, 1, 1]> : tensor<3xi32>](%arg0, %arg1, %0) : (memref<4xf32>, memref<4xf32>, memref<4xf32>) -> () |
| iree_ll_seq.return %0 : memref<4xf32> |
| } |
| } |
| ``` |
| |
| ## Runtime |
| |
| ### IREE VM |
| |
The above IREE module (containing the sequencer function IR and the SPIR-V
executable) can be serialized to a FlatBuffer. This FlatBuffer is optimized for
minimal runtime overhead and requires zero load-time work. This is useful in
scenarios where either ease of debugging or dynamic deployment is required
(such as when downloading models to run in a store-signed app on Android or
iOS). Since the majority of the compute-intensive work happens on the GPU (or
CPU) via the generated SPIR-V, the overhead of processing the sequencer IR is
minimal, often an order of magnitude less than that of traditional ML runtimes.
| |
The VM targets the IREE HAL API, meaning that you get access to Vulkan, CPU,
and other backends available in IREE. The HAL API is just an interface, though,
and is easy to map onto abstractions an application may already have. This
means that implementing a HAL that maps to app primitives gives you access to
the VM without needing to modify the IREE compiler or VM code.
| |
The VM is really simple and effectively does the same thing as demonstrated
below in the HAL codegen example, just interpreting bytecode instead of running
C++ code. This layering allows us to optimize for the fast case (codegen) while
still being able to reuse almost the entire infrastructure for the dynamic
case.
| |
| ### IREE HAL Codegen |
| |
For models that are static at application compile time, and where app binaries
can be redeployed if the model changes, it's possible to generate C++ code that
uses the IREE HAL API directly. This avoids the need for a VM at the cost of
recompilation whenever the model changes and reduced debugger support. Since
the HAL API is still used, the heterogeneous device support IREE provides
remains available.
| |
As with the VM, the HAL API is just an interface; implementing a custom mapping
from that interface to an existing API is easy and gives the ability to switch
between the VM and codegen approaches with no code required beyond the
interface implementation.
| |
| **NOTE**: this is not yet fully implemented/open sourced, but is coming soon. |
| Here's a pseudo-codeish example of what a module would look like: |
| |
| ```c++ |
| class SimpleMulModule : public iree::vm::Module { |
| public: |
| // Creates the module and prepares it for execution in the given context. |
| // This may assign device handles, cache executables, etc. |
| static iree::StatusOr<std::unique_ptr<SimpleMulModule>> Create( |
| iree::vm::Context* context) { |
| // <prepare executable, allocate transient buffers, etc> |
| } |
| |
| // Synchronous call to @simple_mul. Simplest form of the API and may perform |
| // internal pipelining but will appear synchronous to callers. |
| // |
| // Note that this assumes that the inputs are available and visible to the |
| // target devices. If you are exclusively using the synchronous API that will |
| // be the case. |
| // |
| // Matches IR: |
| // func @simple_mul(%arg0: memref<4xf32>, |
| // %arg1: memref<4xf32>) -> memref<4xf32> |
| iree::StatusOr<iree::hal::BufferView> simple_mul( |
| iree::hal::BufferView arg0, |
| iree::hal::BufferView arg1) { |
| iree::hal::Device* device = select_device(0); |
| |
| // Buffers are allocated conservatively as we don't know what the caller |
| // will do with it. Buffers used internally or across async calls can be |
| // placed in device memory. |
| // |
| // Matches IR: |
| // %0 = "iree_ll_seq.alloc_heap"() : () -> memref<4xf32> |
| ASSIGN_OR_RETURN(auto result, device->allocator()->Allocate( |
| IREE_HAL_MEMORY_TYPE_HOST_LOCAL | IREE_HAL_MEMORY_TYPE_DEVICE_VISIBLE, |
| IREE_HAL_BUFFER_USAGE_DISPATCH | IREE_HAL_BUFFER_USAGE_MAPPING)); |
| auto result_view = iree::hal::BufferView( |
| std::move(result), {4}, sizeof(float)); |
| |
| // To show that this is just a wrapper around the real execution we just |
| // call into the async version of the function. |
| ASSIGN_OR_RETURN(auto fence, device->CreateFence(0u)); |
| auto completed_fence_value = iree::hal::FenceValue{add_ref(fence), 1u}; |
| RETURN_IF_ERROR(simple_mul( |
| device, |
| /*wait_semaphore=*/{}, |
| arg0, arg1, result_view, |
| /*signal_semaphore=*/{}, |
| completed_fence_value)); |
| |
| // Wait until results are ready. |
| RETURN_IF_ERROR(device->WaitAllFences( |
| {completed_fence_value}, absl::InfiniteDuration())); |
| |
| // The allocated buffer escapes this function. |
| // Callers can provide already-allocated buffers with the async API. |
| // |
| // Matches IR: |
| // iree_ll_seq.return %0 : memref<4xf32> |
| return result_view; |
| } |
| |
| // Asynchronous variant of the function that can (optionally) wait on existing |
| // semaphores that indicate that arguments are ready for use and |
| // (optionally) signal both semaphores and fences when the results are ready. |
| // |
| // Multiple variants of this API can be exposed such as ones returning a |
| // iree::hal::SubmissionBatch that can be submitted by the caller, however |
| // this is usually fine for most uses as any additional required submissions |
| // are handled internally as needed. |
| iree::Status simple_mul( |
| iree::hal::Device* device, |
| iree::hal::SemaphoreValue wait_semaphore, |
| iree::hal::BufferView arg0, |
| iree::hal::BufferView arg1, |
| iree::hal::BufferView out0, |
| iree::hal::SemaphoreValue signal_semaphore, |
| iree::hal::FenceValue signal_fence) { |
| // Record the command buffer with any commands we can. |
| // In more complex examples this would include barriers, events, transfers, |
| // and multiple dispatches. In many cases only one command buffer is |
| // required however more complex flow control may require multiple. |
| // |
| // Matches IR: |
| // iree_ll_seq.static_dispatch ... |
| ASSIGN_OR_RETURN(auto cmd, device->CreateCommandBuffer( |
| IREE_HAL_COMMAND_BUFFER_MODE_ONE_SHOT, |
| IREE_HAL_COMMAND_CATEGORY_DISPATCH)); |
| RETURN_IF_ERROR(cmd->Begin()); |
| iree::hal::DispatchRequest dispatch_request; |
| dispatch_request.executable = device_executable(device, 0); |
| dispatch_request.workload = {4, 1, 1}; |
| dispatch_request.bindings = { |
| {arg0.buffer, arg0.shape, arg0.element_size}, |
| {arg1.buffer, arg1.shape, arg1.element_size}, |
| {out0.buffer, out0.shape, out0.element_size}, |
| }; |
| RETURN_IF_ERROR(cmd->Dispatch(dispatch_request)); |
| RETURN_IF_ERROR(cmd->End()); |
| |
| // TBD: show resource tracking. |
| |
| // Submit for execution using the semaphores we were told to wait on. |
| // In more complex examples where we may have to submit multiple command |
| // buffers we'll use the wait/signal semaphores as the boundary |
| // synchronization primitives. |
| auto* command_queue = device->command_queues()[0]; |
| return command_queue->Submit({ |
| iree::hal::SubmissionBatch{ |
| {wait_semaphore}, |
| {cmd}, |
| {signal_semaphore}, |
| }, |
| }, signal_fence); |
| } |
| }; |
| ``` |
| |
| ### Custom Codegen |
| |
Using the final IREE module (containing executables and sequencer IR) it's
possible to generate code for any target. For example, instead of using the
IREE HAL and C++ one could generate straight C directly against the target API
or hardware (such as directly calling Vulkan or launching DSP executables). We
refer to this form as "runtimeless": beyond the code required to run the
program there is nothing more than what one would write by hand when very
carefully hand-translating the model.
| |
Because we are still changing the IR we have not yet written a backend that
does this; however, we plan to demonstrate it for small embedded systems and
DSPs in the future.