tree 79e6ab11af4defc4ea61c5dde3ddaf79454eb828
parent 73ddcce258c9be653be0997b25bef1dcb3d0ddbd
author Ben Vanik <ben.vanik@gmail.com> 1692385577 -0700
committer GitHub <noreply@github.com> 1692385577 -0700
gpgsig -----BEGIN PGP SIGNATURE-----
 
 wsBcBAABCAAQBQJk38EpCRBK7hj4Ov3rIwAAn1kIACsQVgnz3ETDQZAVJmGelRrX
 OlDrSC0TMKetlyECH2HpVj+e31IOQYGa3462UAI/94EBIejixSFJB8cXp9DB+8OY
 1bLYDwZeuWwxYF2s9omJpdRzK/CC4pid3kO8ihS99qMUj7UVoPtI7wsoY1gPJl3I
 TQVh7Ull799yCo79e/lx6rLuxlBamgKihT9AqyKBavX9zPThU1EKkM+YX+xFV36B
 78tKlfZDLhgleC9zhfD4B2U/5huBmPzGH700X61FLEy/k9PqfLvkvzmUfzrUUjul
 b/WJ5T72+vBVQqtyeK0wCim9xCOGzlMbKc/w8+m8p3E7+txdV7AmeFNwonPcXHI=
 =EUWD
 -----END PGP SIGNATURE-----
 

Adding Vulkan sparse binding buffer support for native allocations. (#14536)

This creates one logical VkBuffer that is backed by as many aligned
max-size allocations as required. There's a lot we could tweak here and
a lot to optimize but the initial proof of concept here is specifically
for allowing large constant/variable buffers with long lifetimes. Most
implementations don't allow using these buffers with dispatches, though,
due to embarrassingly and arbitrarily small limits on shader storage
buffer access ranges. We'll need device pointers to actually use these
but at least we can allocate them now.
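
As a rough illustration of the "as many aligned max-size allocations as
required" split (a minimal sketch, not the code in this change): the
`sparse_chunk_plan_t` type and `plan_sparse_chunks` helper below are
hypothetical names. The sketch assumes the maximum allocation size and
the alignment come from the device (for Vulkan, something like
`maxMemoryAllocationSize` and `VkMemoryRequirements::alignment`) and
that the alignment is a power of two.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical plan describing how one logical buffer is split into
// physical allocations. Not taken from the actual implementation.
typedef struct {
  uint64_t chunk_size;       // size of every chunk except possibly the last
  uint64_t chunk_count;      // number of physical allocations required
  uint64_t last_chunk_size;  // size of the final (possibly smaller) chunk
} sparse_chunk_plan_t;

// Rounds |value| down to a multiple of |alignment| (a power of two).
static uint64_t align_down(uint64_t value, uint64_t alignment) {
  return value & ~(alignment - 1);
}

// Rounds |value| up to a multiple of |alignment| (a power of two).
static uint64_t align_up(uint64_t value, uint64_t alignment) {
  return (value + alignment - 1) & ~(alignment - 1);
}

// Splits |logical_size| bytes into aligned chunks no larger than
// |max_allocation_size|. All parameters are assumed to be queried from
// the device by the caller.
static sparse_chunk_plan_t plan_sparse_chunks(uint64_t logical_size,
                                              uint64_t max_allocation_size,
                                              uint64_t alignment) {
  assert(logical_size > 0);
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  assert(max_allocation_size >= alignment);
  sparse_chunk_plan_t plan;
  // Largest chunk that respects both the size limit and the alignment.
  plan.chunk_size = align_down(max_allocation_size, alignment);
  // Pad the logical size so that every chunk boundary stays aligned.
  uint64_t padded_size = align_up(logical_size, alignment);
  // Ceiling division: the last chunk may be partially filled.
  plan.chunk_count = (padded_size + plan.chunk_size - 1) / plan.chunk_size;
  plan.last_chunk_size =
      padded_size - (plan.chunk_count - 1) * plan.chunk_size;
  return plan;
}
```

Under these assumptions a 15GB logical buffer with a 4GB maximum
allocation size needs four chunks, the last one 3GB. Each chunk would
then be bound to its offset in the logical VkBuffer with
`vkQueueBindSparse`.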

Future changes will add asynchronous binding and sparse residency as
part of the HAL API so that targets supporting constrained virtual
memory management (CPU, CUDA, Vulkan, etc.) can have such
virtual/physical remapping exposed for use by the compiler. When that's
implemented the sparse buffer type here will be reworked as a shared
utility implementation using the binding/sparse residency APIs.

In order for this to be used for large constants, host allocation
importing was implemented so that the buffers can be transferred. This
required a change in the HAL APIs exposed to the compiler, as what was
there was a hack that approximated the proper import/mapping path but
was insufficient for doing it properly. This has been tested with
imports of up to 15GB (and should work beyond that, device memory
allowing).

On discrete systems when the module is mmapped we can't import the host
pointer and must stage in chunks:

![image](https://github.com/openxla/iree/assets/75337/951568e9-5cdb-4a2a-95c1-05a8d371066c)

If not mmapped we can import the host pointer as a staging source and
avoid the chunk allocation:

![image](https://github.com/openxla/iree/assets/75337/2d87982e-e98f-4e4c-a3d0-e226f72717a6)

On unified memory systems we can (sometimes) directly use the host
buffer and avoid all allocations:

![image](https://github.com/openxla/iree/assets/75337/3eb51285-3270-4b7a-a88c-240ca4312287)
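
The three cases above amount to a small decision, sketched here as a
minimal example rather than the logic in this change. The enum, the
`select_upload_strategy` function, and its parameters are all
hypothetical. The "(sometimes)" on unified memory is modeled as a
caller-supplied flag, and falling back to the discrete rules when direct
use is not possible is an assumption.

```c
#include <stdbool.h>

// Hypothetical names for the three upload paths described above.
typedef enum {
  UPLOAD_STAGE_IN_CHUNKS = 0,        // allocate staging chunks and copy
  UPLOAD_STAGE_FROM_IMPORTED_HOST,   // import host pointer as staging source
  UPLOAD_USE_HOST_BUFFER_DIRECTLY,   // no staging and no extra allocations
} upload_strategy_t;

// Picks an upload path for a large constant.
//   is_unified_memory: device shares memory with the host.
//   is_mmapped: the module contents are a memory-mapped file.
//   host_buffer_directly_usable: the device can use the host buffer
//     as-is (assumed to be determined by the caller).
static upload_strategy_t select_upload_strategy(
    bool is_unified_memory, bool is_mmapped,
    bool host_buffer_directly_usable) {
  if (is_unified_memory && host_buffer_directly_usable) {
    return UPLOAD_USE_HOST_BUFFER_DIRECTLY;
  }
  // Assumption: otherwise apply the rules described for discrete systems.
  if (is_mmapped) {
    // The mmapped host pointer can't be imported, so stage in chunks.
    return UPLOAD_STAGE_IN_CHUNKS;
  }
  // The host pointer can be imported and used as the staging source.
  return UPLOAD_STAGE_FROM_IMPORTED_HOST;
}
```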

Progress on #14607.
Fixes #7242.