|  | # "Simple Embedding" sample | 
|  |  | 
|  | This sample shows how to run a simple pointwise array multiplication bytecode | 
|  | module on various HAL device targets with minimal runtime overhead. Some of | 
|  | these devices are compatible with bare-metal systems that lack threading or | 
|  | file IO support. | 
|  |  | 
|  | ## Background | 
|  |  | 
|  | The main bytecode testing tool | 
|  | [iree-run-module](../../tools/iree-run-module-main.cc) | 
|  | requires operating system support to set up the runtime environment that | 
|  | executes an IREE bytecode module. On embedded systems, features such as a file | 
|  | system or multithreaded asynchronous control may not be available. This sample | 
|  | demonstrates how to set up the simplest framework to load and run IREE | 
|  | bytecode with various target backends. | 
|  |  | 
|  | ## Build instructions | 
|  |  | 
|  | ### CMake (native and cross compilation) | 
|  |  | 
|  | Set up the CMake configuration with `-DIREE_BUILD_SAMPLES=ON` (on by default). | 
|  |  | 
|  | Then run: | 
|  | ```sh | 
|  | cmake --build <build dir> --target samples/simple_embedding/all | 
|  | ``` | 
|  |  | 
|  | ### Bazel (host only) | 
|  |  | 
|  | ```sh | 
|  | bazel build samples/simple_embedding:all | 
|  | ``` | 
|  |  | 
|  | The resulting executables are listed as `simple_embedding_<HAL devices>`. | 
|  |  | 
|  | ## Code structure | 
|  |  | 
|  | The sample consists of three parts: | 
|  |  | 
|  | ### simple_embedding_test.mlir | 
|  |  | 
|  | A simple pointwise array multiplication op whose entry function, `simple_mul`, | 
|  | takes two `<4xf32>` inputs and produces one `<4xf32>` output. The ML bytecode | 
|  | modules are generated automatically at build time by the host compiler | 
|  | `iree-compile`, using the target HAL device configurations. | 
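
For reference, the element-wise computation described above can be sketched in plain C. This is only an illustration of the expected result, not code from the sample; the names `simple_mul_reference` and `kElementCount` are hypothetical.

```c
#include <assert.h>

// Hypothetical reference for illustration only; the sample itself runs the
// compiled bytecode module rather than this function.
enum { kElementCount = 4 };  // matches the 4xf32 shape described above

// Multiplies lhs and rhs element by element and writes the products to out.
void simple_mul_reference(const float* lhs, const float* rhs, float* out) {
  assert(lhs && rhs && out);
  for (int i = 0; i < kElementCount; ++i) {
    out[i] = lhs[i] * rhs[i];
  }
}
```

A reference like this can be handy for checking the output buffer returned by the bytecode module against known-good values.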
|  |  | 
|  | ### simple_embedding.c | 
|  |  | 
|  | The main function of the sample has the following steps: | 
|  |  | 
|  | 1. Create a VM instance | 
|  | 2. Create a HAL module based on the target device (see the next section) | 
|  | 3. Load the bytecode module of the ML workload | 
|  | 4. Associate the HAL module with the bytecode module in the VM context | 
|  | 5. Prepare the function entry point and inputs | 
|  | 6. Invoke the function | 
|  | 7. Retrieve the function output | 
|  |  | 
|  | ### device_*.c | 
|  |  | 
|  | These files create the HAL device for the different target backends. Devices | 
|  | are created using a specific executable loader and device constructor. For | 
|  | example, [device_embedded_sync.c](./device_embedded_sync.c) creates a "sync" | 
|  | device with the embedded ELF loader: | 
|  |  | 
|  | ```c | 
|  | iree_hal_sync_device_params_t params; | 
|  | iree_hal_sync_device_params_initialize(&params); | 
|  | iree_hal_executable_loader_t* loader = NULL; | 
|  | IREE_RETURN_IF_ERROR(iree_hal_embedded_elf_loader_create( | 
|  |     /*plugin_manager=*/NULL, iree_allocator_system(), | 
|  |     &loader)); | 
|  |  | 
|  | iree_string_view_t identifier = iree_make_cstring_view("local-sync"); | 
|  |  | 
|  | iree_status_t status = | 
|  |     iree_hal_sync_device_create(identifier, &params, /*loader_count=*/1, | 
|  |                                 &loader, iree_allocator_system(), device); | 
|  | ``` | 
|  |  | 
|  | In [device_embedded.c](./device_embedded.c), by contrast, the "sync" device is | 
|  | replaced with the multithreaded "task" device, which uses a "task executor": | 
|  |  | 
|  | ```c | 
|  | ... | 
|  | iree_task_executor_t* executor = NULL; | 
|  | iree_host_size_t executor_count = 0; | 
|  | iree_status_t status = | 
|  |     iree_task_executors_create_from_flags(iree_allocator_system(), | 
|  |                                           1, &executor, &executor_count); | 
|  | // Exactly one executor is expected. | 
|  | IREE_ASSERT_EQ(executor_count, 1, "NUMA unsupported"); | 
|  |  | 
|  | iree_string_view_t identifier = iree_make_cstring_view("local-task"); | 
|  | if (iree_status_is_ok(status)) { | 
|  |   // Create the device. | 
|  |   status = iree_hal_task_device_create(identifier, &params, | 
|  |                                        /*queue_count=*/1, &executor, | 
|  |                                        /*loader_count=*/1, &loader, | 
|  |                                        iree_allocator_system(), device); | 
|  | } | 
|  | ``` | 
|  |  | 
|  | An example that uses a higher-level driver registry is in | 
|  | [device_vulkan.c](./device_vulkan.c). | 
|  |  | 
|  | #### Load device-specific bytecode module | 
|  |  | 
|  | To avoid file IO, the bytecode module is converted into a data stream | 
|  | (`module_data`) that is embedded in the executable. The same strategy can be | 
|  | applied when building applications for embedded systems that lack file IO. | 
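
The embedding approach can be sketched as follows. This is a minimal illustration, assuming the bytecode is emitted as a C array at build time; the names `embedded_blob_t`, `kModuleBytes`, `get_module_data`, and `blob_byte_at` are hypothetical placeholders, and the bytes shown are dummy values rather than a real module.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical view over bytes compiled into the executable image.
typedef struct {
  const uint8_t* data;  // start of the embedded bytes
  size_t size;          // number of bytes
} embedded_blob_t;

// Placeholder contents: a real build step would generate this array from
// the compiled bytecode module. These four bytes are dummy values.
static const uint8_t kModuleBytes[] = {0x01, 0x02, 0x03, 0x04};

// Returns the embedded bytes directly from read-only memory, so no file
// system calls are needed at runtime.
embedded_blob_t get_module_data(void) {
  embedded_blob_t blob = {kModuleBytes, sizeof(kModuleBytes)};
  return blob;
}

// Reads one byte with a bounds check; consumers only need a pointer and a
// length, never a file handle.
uint8_t blob_byte_at(embedded_blob_t blob, size_t offset) {
  assert(offset < blob.size);
  return blob.data[offset];
}
```

Because the data lives in the binary itself, a pointer-and-length pair like this would be all that the module loading step needs on a target without a file system.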
|  |  | 
|  | ## Generic platform support | 
|  |  | 
|  | Some of the devices in this sample support a generic platform (that is, | 
|  | machine mode without an operating system). For example, `device_vmvx_sync` | 
|  | should support any architecture that IREE supports, and `device_embedded_sync` | 
|  | should support any architecture that supports the `llvm-cpu` codegen target | 
|  | backend (you may need to add the bytecode module data if it is not already in | 
|  | [device_embedded_sync.c](./device_embedded_sync.c)). |