)]}'
{
  "commit": "d39c3c56682e006e842b32aa6f38c272f77c8f3c",
  "tree": "99df9d9f9e7a6181a4998943cea66390dda38035",
  "parents": [
    "1ea9ee0421258452aa97513fb8687dc34bbd1c33",
    "f721fd0b612742d6dba50dd2f954321cd5d64aff"
  ],
  "author": {
    "name": "Ben Vanik",
    "email": "ben.vanik@gmail.com",
    "time": "Mon Jul 29 21:24:32 2024 -0700"
  },
  "committer": {
    "name": "GitHub",
    "email": "noreply@github.com",
    "time": "Mon Jul 29 21:24:32 2024 -0700"
  },
  "message": "Merging multi-device branch to main. (#17987)\n\n**TLDR**: nothing should break, `--iree-hal-target-backends\u003d` is\r\ndeprecated, use `--iree-hal-target-device\u003d` and appropriate\r\ntarget-specific flags instead.\r\n\r\nThis reworks the target device concept in the IREE pipeline - in some\r\ncases introducing the concept (flow and HAL) and in others replacing\r\nplaceholder mechanisms around stream affinity. This builds upon prior\r\nwork that added support for enumerating available devices via the HAL\r\nand providing multiple devices to the runtime tools by adding the\r\nability to define devices, allowing for execution and storage resources\r\nto be assigned a device, and upgrading passes to support multiple\r\ndevices. \"Multi-device\" here means several things and all are\r\naccomplished with the same mechanism: a single device that may be one of\r\nmultiple types (multiple CPU/GPU archs, CPU _or_ GPU, etc), multiple\r\nhomogeneous devices (4 of the same exact GPUs accessed through the same\r\nruntime HAL driver), multiple heterogeneous devices (a CPU and a\r\nGPU/NPU/etc), and optional devices (a CPU with some portions offloaded\r\nto a GPU/NPU if it\u0027s compatible and available at runtime). In this way\r\nwe can provide cross-compilation/targeting, multi-targeting, and\r\nmultiple devices with one set of flags, compiler analysis, passes\r\ndealing with the devices, and runtime infrastructure.\r\n\r\nEarly warning: **it\u0027s strongly discouraged to use device information\r\nprior to codegen** - any pass using such information earlier on is a red\r\nflag that will receive pushback. IREE is designed first and foremost as\r\na cross-compiler with multi-targeting at its core and radically changing\r\nprogram behavior near the frontend makes it nearly impossible to have\r\nconfiguration control over the compilation pipeline. 
Consider\r\nspecializing on device prior to codegen tantamount to using C\r\npreprocessor macros based on operating system or architecture: it means\r\nthat a problem has not been solved and a workaround has been taken.\r\nThere are exceptionally few cases that require device information early,\r\nand those that do can do so in generic ways that do not disturb the\r\ndebuggability of the program. For example, far better than preprocessor\r\nmacros in C++ are function calls and if statements (as we can do in our\r\nprograms), and even better than that are virtual interfaces (ops that\r\nare only lowered to one of multiple implementations later on). That\r\ndisclaimer out of the way: it\u0027s now possible to query device information\r\nafter the input pipeline (global opt/preprocessing/flow). Upstream will\r\npush back against doing so in nearly all cases but it is a useful\r\nmechanism for downstream projects.\r\n\r\nThe key change here is that the `--iree-hal-target-backends\u003d` compiler\r\nflag has been deprecated. It continues to work for now with the same\r\nbehavior as before but usage will shift to the replacement\r\n`--iree-hal-target-device\u003d` flag. A single instance of this flag defines\r\na single device within the program and repeated uses of it will define\r\nnew devices. 
Devices may be named (\"my_device\") or anonymous (in which\r\ncase they will be assigned an ordinal like 0 or 1), and each device may\r\nbe backed by one or more target devices (Vulkan, local host, HIP, etc).\r\nEach target device in the compiler (represented by\r\n`IREE::HAL::TargetDevice`) may have any number of backends with various\r\nconfigurations (multiple archs, different deployment formats, etc,\r\nrepresented by one or more `IREE::HAL::ExecutableTargetAttr` values).\r\n\r\nExample flags:\r\n```sh\r\n# Two devices, one the local host device and the other a Vulkan device:\r\n--iree-hal-target-device\u003dlocal --iree-hal-target-device\u003dvulkan\r\n\r\n# One device that uses Vulkan if available and otherwise falls back to the local host device:\r\n--iree-hal-target-device\u003dvulkan,local\r\n\r\n# Two CUDA devices selected by runtime ordinal; at runtime two --device\u003d\r\n# flags are required to configure both devices:\r\n--iree-hal-target-device\u003dcuda[0] --iree-hal-target-device\u003dcuda[1]\r\n\r\n# A fully-defined target specification:\r\n--iree-hal-target-device\u003d#hal.device.target\u003c\"cuda\", {...}, [#hal.executable.target\u003c...\u003e]\u003e\r\n\r\n# Named device for defining a reference by #hal.device.promise\u003c@some_name\u003e:\r\n--iree-hal-target-device\u003dsome_name\u003dvulkan\r\n```\r\n\r\nThe device metadata as specified in the compiler is used to produce\r\nenumeration code that executes at runtime and queries the available\r\ndevices to find the appropriate matches. This means that if the program\r\nis compiled to target two CUDA devices then at runtime there must be two\r\nCUDA devices specified - the indirection allows the compiled\r\nartifact to work with any two CUDA devices targeted by UUID, device\r\nordinal, etc and not just the first and second CUDA device in the\r\nsystem. E.g. 
`iree-compile --iree-hal-target-device\u003dcuda[0]\r\n--iree-hal-target-device\u003dcuda[1]` and `iree-run-module\r\n--device\u003dcuda://UUID_A --device\u003dcuda://UUID_B`. Device targets in the\r\ncompiler can now specify the ordinal of the device in order to\r\ndifferentiate between multiple devices at runtime (the `cuda[0]` and\r\n`cuda[1]` above indicate the first and second CUDA device\r\nprovided to the runtime).\r\n\r\nMajor new attributes:\r\n* `#hal.device.promise\u003c@device\u003e` is a reference to a device that will be\r\nprovided at a later stage. Frontends can use this as a placeholder for\r\ndevices that are specified on the command line without needing to say\r\nwhat those devices are when exporting.\r\n* `#hal.device.alias\u003c\"name\"\u003e` specifies an `IREE::HAL::TargetDevice` in\r\nthe compiler (`vulkan`, `local`, `hip`, etc) and expands to a full\r\n`#hal.device.target` based on target-specific flags.\r\n* `#hal.device.select\u003c[...]\u003e` controls selection by enumerating each\r\ndevice in turn and matching the first found.\r\n* `#hal.device.fallback\u003c@other_device\u003e` provides a fallback reference\r\nthat the device will match if no other device matches. Note that having\r\ntwo devices with the same target will create two copies at runtime; to\r\nreuse an existing device the fallback mechanism must be used.\r\n* `#hal.device.affinity\u003c@device\u003e` (and an optional queue mask) is used on\r\nops to indicate on which device they should execute.\r\n\r\nAll of the above flags are just syntactic sugar that add the above\r\nattributes to the program IR and it\u0027s possible for frontends to insert\r\nthese attributes or ops directly depending on use-case. 
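\r\n\r\nAs a hypothetical sketch (the device name and attribute placement below are illustrative, not taken from this change), a frontend might pin a function to a named device by pairing a device global with an affinity attribute:\r\n```mlir\r\n// Illustrative only: a device global plus a function-level affinity.\r\nutil.global private @gpu_device \u003d #hal.device.alias\u003c\"hip\"\u003e : !hal.device\r\nutil.func private @offloaded(%arg0: tensor\u003c4xf32\u003e) -\u003e tensor\u003c4xf32\u003e\r\n    attributes {stream.affinity \u003d #hal.device.affinity\u003c@gpu_device\u003e} {\r\n  %0 \u003d arith.addf %arg0, %arg0 : tensor\u003c4xf32\u003e\r\n  util.return %0 : tensor\u003c4xf32\u003e\r\n}\r\n```\r\n\r\n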
In most cases\r\nleaving placeholders in the IR such that the exact target can be\r\nspecified during compilation is ideal: this allows one output from the\r\nfrontend to be used with any number of targets and configurations.\r\nOnline compilers, though, may want to bake in their exact configuration\r\nand can do so without the need for flags that may lose information. The\r\ngeneral flow of the `buildHALDeviceAssignmentPassPipeline`/`iree-opt\r\n--iree-hal-device-assignment-pipeline` is:\r\n1. `--iree-hal-target-device\u003d` flags are parsed and a\r\n`hal.device.targets` attribute is added to the module.\r\n* `--iree-hal-target-device\u003dlocal` becomes\r\n`hal.device.targets \u003d [#hal.device.alias\u003c\"local\"\u003e : !hal.device]`\r\n* `--iree-hal-target-device\u003dcpu_device\u003dlocal\r\n--iree-hal-target-device\u003dgpu_device\u003dcuda,hip` becomes\r\n  ```mlir\r\n  hal.device.targets \u003d {\r\n    cpu_device \u003d #hal.device.alias\u003c\"local\"\u003e : !hal.device,\r\n    gpu_device \u003d #hal.device.select\u003c[#hal.device.alias\u003c\"cuda\"\u003e : !hal.device, #hal.device.alias\u003c\"hip\"\u003e : !hal.device]\u003e : !hal.device\r\n  }\r\n  ```\r\n2. The `hal.device.targets` attribute (if any) is expanded into\r\n`util.global` ops for each device. These globals are initialized with\r\none of the supported attributes, which are much later turned into\r\nenumeration/selection logic. The above multi-device example becomes:\r\n  ```mlir\r\n  builtin.module attributes {stream.affinity.default \u003d #hal.device.affinity\u003c@cpu_device\u003e} {\r\n    util.global private @cpu_device \u003d #hal.device.alias\u003c\"local\"\u003e : !hal.device\r\n    util.global private @gpu_device \u003d #hal.device.select\u003c[#hal.device.alias\u003c\"cuda\"\u003e : !hal.device, #hal.device.alias\u003c\"hip\"\u003e : !hal.device]\u003e : !hal.device\r\n  }\r\n  ```\r\n3. 
Any `#hal.device.promise` attributes will be changed to reference the\r\nglobals with the same name. This allows for retargeting of inputs by\r\nletting a frontend specify named devices prior to them having been\r\npassed on the command line (or inserted by some other pipeline).\r\n4. Any `#hal.device.alias` attributes are converted to full\r\n`#hal.device.target` attributes using the appropriate\r\n`IREE::HAL::TargetDevice` implementation.\r\n\r\nUpon completion of the pipeline there are globals initialized with\r\neither a specific device target or a selection mechanism to pick between\r\ntargets. From that point onward devices are a structural part of the\r\nprogram and can be referenced by symbol name via attributes like\r\n`#hal.device.affinity`.\r\n\r\nPrograms are expected to specify the device affinity for all operations\r\neither explicitly or implicitly. By default (as today) the first device\r\ndefined will be used but going forward we will want frontends to start\r\nspecifying devices. To that end the `flow.tensor.transfer` operation was\r\nadded to allow a tensor to have a device affinity assigned to it. A new\r\nanalysis is added that allows all tensors (or stream resources) and ops\r\ninteracting with them to be queried for which device they should be\r\nplaced on. 
For example, a frontend can specify multiple devices be used\r\nin a computation by transferring the tensors used:\r\n```mlir\r\nutil.func private @my_func(%arg0: tensor\u003c4xi32\u003e) -\u003e tensor\u003c4xi32\u003e {\r\n  %arg0_device_a \u003d flow.tensor.transfer %arg0 : tensor\u003c4xi32\u003e to #hal.device.promise\u003c@device_a\u003e\r\n  %compute_device_a \u003d arith.addi %arg0_device_a, %arg0_device_a : tensor\u003c4xi32\u003e\r\n  %transient_device_b \u003d flow.tensor.transfer %compute_device_a : tensor\u003c4xi32\u003e to #hal.device.promise\u003c@device_b\u003e\r\n  %compute_device_b \u003d arith.muli %transient_device_b, %transient_device_b : tensor\u003c4xi32\u003e\r\n  util.return %compute_device_b : tensor\u003c4xi32\u003e\r\n}\r\n```\r\n\r\nTo avoid copies there are also ways for frontends to indicate where\r\nargument and result tensors are placed. The best way (in that it\u0027s most\r\ngeneral/powerful) is for the frontends to emit `hal.tensor.import`,\r\n`hal.tensor.export`, and `hal.tensor.alias` ops directly as they all now\r\ntake affinities. When using the default ABI translation pass it\u0027s\r\npossible to add arg/result attrs to public functions, e.g. `util.func\r\npublic @my_func(%arg0: tensor\u003c2xi32\u003e {iree.abi.affinity \u003d\r\n#hal.device.promise\u003c@device_a\u003e}) -\u003e (tensor\u003c2xi32\u003e {iree.abi.affinity \u003d\r\n#hal.device.promise\u003c@device_b\u003e})`. 
Shorthand is provided to allow\r\nspecifying an `iree.abi.affinity` on functions themselves for when all\r\narguments and results are placed on the same device.\r\n\r\nOnce devices are specified, materialized in the program as\r\nglobals, and referenced either via the magic default attribute, scoped\r\nattributes, or explicit transfer operations, most of the mechanics are\r\nimplementation details of the stream and HAL dialect lowerings.\r\nPartitioning, allocation, and scheduling in the stream dialect were\r\nalways affinity-aware and required only minor tweaks as part of this\r\nwork, while the HAL TODOs for multi-device were implemented by memoizing\r\nresources per-device and adding the machinery to enumerate and select\r\ndevices.\r\n\r\nThis was reviewed in the following chunks and tested in a roll-up PR\r\n#17482:\r\n* https://github.com/iree-org/iree/pull/17915\r\n* https://github.com/iree-org/iree/pull/17917\r\n* https://github.com/iree-org/iree/pull/17916\r\n* https://github.com/iree-org/iree/pull/17918\r\n* https://github.com/iree-org/iree/pull/17919\r\n* https://github.com/iree-org/iree/pull/17920",
  "tree_diff": []
}
