Merge pull request #9754 from iree-org/benvanik-timepoint-to-hal

Adds compiler/runtime support for lowering the asynchronous stream dialect ops into HAL ops, materializing a timeline (just one today, with multiple planned), and passing them through to the runtime HAL module. This allows the existing placeholder submit_and_wait op to be removed and enables queue-ordered allocations to be implemented in the HAL.

This is likely not the final design, but it unblocks work on coroutines, queue-ordered allocations, WebGPU, and plumbing fences through the user-facing API/native ABI. Future refinements may add overrides that use semaphores instead of fences to avoid fence heap allocations when they are not required, but for most single-function classic ML models no internal fences are required once we plumb fences through the ABI. The current timeline materialization also strictly orders all invocations, whereas we should be able to elide that ordering when there is no internal program state to protect.
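
As a rough illustration of what "strictly orders all invocations" on a single timeline means, here is a minimal conceptual sketch. It is not IREE code: `Timeline`, `reserve`, and `order_invocations` are hypothetical names, and the timeline is assumed to behave like a monotonically increasing counter where each invocation waits on the previous value and signals the next.

```python
# Conceptual model only; none of these names exist in IREE.
class Timeline:
    """A single timeline modeled as a monotonically increasing counter."""

    def __init__(self):
        self.last_value = 0  # most recently reserved timepoint

    def reserve(self):
        # Each reservation waits on the previous timepoint and signals a new one.
        wait_value = self.last_value
        self.last_value += 1
        return wait_value, self.last_value


def order_invocations(names):
    """Chain every invocation on one timeline, regardless of program state."""
    timeline = Timeline()
    schedule = []
    for name in names:
        wait_value, signal_value = timeline.reserve()
        schedule.append((name, wait_value, signal_value))
    return schedule


if __name__ == "__main__":
    for name, wait_value, signal_value in order_invocations(["a", "b", "c"]):
        print(f"{name}: wait >= {wait_value}, signal {signal_value}")
```

In this model every invocation depends on the one before it, even when the two share no state. That chain is the ordering described above as something that could be elided.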

Because the various HAL backends all need work (CUDA/ROCM in particular need massive work), nearly everything is synchronized exactly as it was before, but that synchronization now happens in the IR we emit, so we can selectively start supporting async per target.

Progress on #1285 (just need to put fences on the ABI!).
Progress on #8093 (added yieldable fence waits).
Progress on #9572 (added compiler/runtime glue for queue-ordered allocs).
diff --git a/docs/website/docs/building-from-source/riscv.md b/docs/website/docs/building-from-source/riscv.md
index fee2530..5ca560f 100644
--- a/docs/website/docs/building-from-source/riscv.md
+++ b/docs/website/docs/building-from-source/riscv.md
@@ -150,7 +150,7 @@
   --iree-llvm-target-cpu=generic-rv64 \
   --iree-llvm-target-abi=lp64d \
   --iree-llvm-target-cpu-features="+m,+a,+f,+d,+v" \
-  --riscv-v-vector-bits-min=256 --riscv-v-fixed-length-vector-lmul-max=8 \
+  --riscv-v-vector-bits-min=512 --riscv-v-fixed-length-vector-lmul-max=8 \
   iree_input.mlir -o mobilenet_cpu.vmfb
 ```
 
@@ -158,7 +158,7 @@
 
 ```shell hl_lines="2 5"
 ${QEMU_BIN} \
-  -cpu rv64,x-v=true,x-k=true,vlen=256,elen=64,vext_spec=v1.0 \
+  -cpu rv64,x-v=true,x-k=true,vlen=512,elen=64,vext_spec=v1.0 \
   -L ${RISCV_TOOLCHAIN_ROOT}/sysroot/ \
   ../iree-build-riscv/tools/iree-run-module \
   --device=local-task \
diff --git a/tests/riscv32/smoke.sh b/tests/riscv32/smoke.sh
index 147117b..3f6d6d8 100755
--- a/tests/riscv32/smoke.sh
+++ b/tests/riscv32/smoke.sh
@@ -16,17 +16,14 @@
 # Run the embedded_library module loader and simple_embedding under QEMU.
 echo "Test elf_module_test_binary"
 pushd "${BUILD_RISCV_DIR}/runtime/src/iree/hal/local/elf" > /dev/null
-"${QEMU_RV32_BIN}" -cpu rv32,x-v=true,x-k=true,vlen=256,elen=64,vext_spec=v1.0 \
-elf_module_test_binary
+"${QEMU_RV32_BIN}" -cpu rv32 elf_module_test_binary
 popd > /dev/null
 
 echo "Test simple_embedding binaries"
 pushd "${BUILD_RISCV_DIR}/samples/simple_embedding" > /dev/null
 
-"${QEMU_RV32_BIN}" -cpu rv32,x-v=true,x-k=true,vlen=256,elen=64,vext_spec=v1.0 \
-simple_embedding_embedded_sync
+"${QEMU_RV32_BIN}" -cpu rv32 simple_embedding_embedded_sync
 
-"${QEMU_RV32_BIN}" -cpu rv32,x-v=true,x-k=true,vlen=256,elen=64,vext_spec=v1.0 \
-simple_embedding_vmvx_sync
+"${QEMU_RV32_BIN}" -cpu rv32 simple_embedding_vmvx_sync
 
 popd > /dev/null
diff --git a/tests/riscv64/lit.cfg.py b/tests/riscv64/lit.cfg.py
index 2bdc624..e515eff 100644
--- a/tests/riscv64/lit.cfg.py
+++ b/tests/riscv64/lit.cfg.py
@@ -16,7 +16,7 @@
 test_cmd = [
     os.environ["QEMU_RV64_BIN"],
     "-cpu",
-    "rv64,x-v=true,x-k=true,vlen=256,elen=64,vext_spec=v1.0",
+    "rv64,x-v=true,x-k=true,vlen=512,elen=64,vext_spec=v1.0",
     "-L",
     os.path.join(os.environ["RISCV_TOOLCHAIN_ROOT"], "sysroot"),
 ]
diff --git a/tests/riscv64/smoke.sh b/tests/riscv64/smoke.sh
index 267b52d..475ae9c 100755
--- a/tests/riscv64/smoke.sh
+++ b/tests/riscv64/smoke.sh
@@ -50,7 +50,7 @@
       --iree-input-type=tosa
       --iree-llvm-target-cpu-features="+m,+a,+f,+d,+c,+v"
       --riscv-v-fixed-length-vector-lmul-max=8
-      --riscv-v-vector-bits-min=256
+      --riscv-v-vector-bits-min=512
       "${BUILD_RISCV_DIR}/tosa.mlir"
     )
   fi
@@ -61,7 +61,7 @@
   "${ROOT_DIR}/tools/test/iree-run-module.mlir" \
   -o "${BUILD_RISCV_DIR}/iree-run-module-llvm_aot.vmfb"
 
-wget -P "${BUILD_RISCV_DIR}/" https://github.com/tensorflow/tflite-micro/raw/aeac6f39e5c7475cea20c54e86d41e3a38312546/tensorflow/lite/micro/models/person_detect.tflite
+wget -P "${BUILD_RISCV_DIR}/" "https://storage.googleapis.com/iree-model-artifacts/person_detect.tflite"
 
 generate_llvm_cpu_vmfb tosa \
   "${BUILD_RISCV_DIR}/person_detect.tflite" \