Merge pull request #4767 from google/benvanik-hal-c

Drops 30KB, removes all allocs from the common command buffer recording path, and makes almost all operations faster as well (no extraneous ref counting). There's a lot less C++ magic too: 1k LOC of template magic replaced with 140 LOC of stupid-simple macros and autogeneratable boilerplate.

Fixes #4678.