| # Kelvin |
| |
Kelvin is a RISC-V CPU with custom SIMD instructions and microarchitectural
decisions aligned with the dataplane properties of an ML accelerator. Kelvin
starts with domain and matrix capabilities and then adds vector and scalar
capabilities for a fused design.
| |
| ## Block Diagram |
| |
| ![Kelvin block diagram](images/arch.png) |
| |
| ## Scalar Core |
| |
| A simple RISC-V scalar frontend drives the command queues of the ML+SIMD |
| backend. |
| |
Kelvin utilizes a custom RISC-V frontend (rv32im) that runs the minimal set of
instructions needed to support an executor run-to-completion model (e.g. no OS,
no interrupts), with all control tasks onloaded to the SMC. The C extension
encoding is reclaimed (as permitted by the RISC-V specification) to provide the
necessary encoding space for the SIMD registers (6-bit indices), and to allow
flexible type encodings and instruction compression (stripmining) for the SIMD
instruction set. The scalar core is an in-order machine with no speculation.
| |
The branch policy in the fetch stage is static: backward branches are predicted
taken and forward branches are predicted not-taken, incurring a one-cycle
penalty when the execute result does not match the decision made in the fetch
unit.
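This static policy can be sketched as follows (the function names are
illustrative; the one-cycle penalty is from the text above):

```python
def predict_taken(branch_offset: int) -> bool:
    """Static fetch-stage prediction: backward branches (negative offset)
    are predicted taken, forward branches predicted not-taken."""
    return branch_offset < 0

def fetch_penalty(branch_offset: int, actually_taken: bool) -> int:
    """One penalty cycle when the execute result disagrees with fetch."""
    return 0 if predict_taken(branch_offset) == actually_taken else 1
```

A backward loop branch that is usually taken therefore costs nothing in the
common case, while a mispredicted forward branch pays a single cycle.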
| |
| ## Vector Core |
| |
| ![Kelvin SIMD](images/simd.png) |
| |
We use SIMD and vector interchangeably, referring to a simple and practical SIMD
instruction definition devoid of variable-length behaviors. The scalar frontend
is decoupled from the backend by a FIFO structure that buffers vector
instructions, posting them to the relevant command queues only when their
dependencies are resolved in the vector regfile.
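A toy model of this decoupling is sketched below. The class and method names
are illustrative, not Kelvin's actual implementation; it only shows the
contract that an instruction posts from the FIFO to the command queue once its
source registers have no pending writes:

```python
from collections import deque

class VectorDispatchFifo:
    """Toy model: vector instructions wait in a FIFO and post to a
    command queue only once their source registers are no longer busy."""
    def __init__(self):
        self.fifo = deque()
        self.busy = set()          # vector registers with pending writes
        self.command_queue = []

    def push(self, instr, srcs, dst):
        self.fifo.append((instr, frozenset(srcs), dst))

    def tick(self):
        # In-order: only the head of the FIFO may post, and only once
        # none of its sources are awaiting a writeback.
        if self.fifo and not (self.fifo[0][1] & self.busy):
            instr, _, dst = self.fifo.popleft()
            self.busy.add(dst)
            self.command_queue.append(instr)

    def writeback(self, dst):
        self.busy.discard(dst)
```

For example, a `vadd` that reads the result of an in-flight `vmul` stalls in
the FIFO until the `vmul` writeback lands in the regfile.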
| |
| ### MAC |
| |
The central component of the design is a quantized outer-product
multiply-accumulate engine. An outer-product engine provides two-dimensional
broadcast structures to maximize the amount of deliverable compute with respect
to memory accesses. One axis carries a parallel broadcast (“wide”, convolution
weights); the other axis carries the transpose-shifted inputs of a number of
batches (“narrow”, e.g. MobileNet XY batching).
| |
| ![Kelvin MAC](images/mac.png) |
| |
The outer-product construction is a vertical arrangement of multiple VDOT
opcodes, each utilizing 4x 8-bit multiplies reduced into 32-bit accumulators.
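A toy model of one outer-product step is sketched below, assuming the 8x8
32-bit accumulator tile from the Registers section (`acc<8><8>`); the function
names are illustrative, and accumulator overflow wrapping is omitted for
clarity:

```python
def vdot4(acc: int, a4, b4) -> int:
    """One VDOT lane: four 8-bit x 8-bit products reduced into a
    32-bit accumulator."""
    return acc + sum(a * b for a, b in zip(a4, b4))

def outer_product_step(acc, narrow, wide):
    """One outer-product step over an 8x8 accumulator tile: 'wide'
    (weights) broadcasts across columns, 'narrow' (inputs) broadcasts
    across rows, and each cell performs a vdot4 reduction.
    narrow and wide are each 8 groups of four int8 values."""
    for i in range(8):
        for j in range(8):
            acc[i][j] = vdot4(acc[i][j], narrow[i], wide[j])
    return acc
```

The two-dimensional broadcast is what makes the structure efficient: 2x8
operand groups feed all 64 accumulators in a single step.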
| |
| ### Stripmining |
| |
Stripmining is defined as folding array-based parallelism to fit the available
hardware parallelism. To keep frontend instruction dispatch pressure from
becoming a bottleneck, and to natively support instruction-level tiling
patterns through the SIMD registers, the instruction encoding explicitly
includes a stripmine mechanism that converts a single frontend dispatch event
into the command queue into four serialized issue events into the SIMD units.
For instance, a “vadd v0” at Dispatch produces “vadd v0 : vadd v1 : vadd v2 :
vadd v3” at Issue. These are processed as four discrete events.
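The Dispatch-to-Issue expansion can be sketched in a few lines (the function
name is illustrative):

```python
def stripmine(opcode: str, base_reg: int, strips: int = 4):
    """Expand one dispatched SIMD instruction into four serialized
    issue events over consecutive vector registers, e.g. a 'vadd'
    on v0 becomes issue events on v0..v3."""
    return [f"{opcode} v{base_reg + i}" for i in range(strips)]

# stripmine("vadd", 0) -> ["vadd v0", "vadd v1", "vadd v2", "vadd v3"]
```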
| |
| ## Registers |
| |
| There are 4 distinct register types. |
| |
| Registers | Names | Width |
| ---------------- | ------------- | ----------------------- |
| Scalar (31) | zero, x1..x31 | 32 bits |
Vector (64)      | v0..v63       | 256 bits (e.g. int32 x8)
| Accumulator | acc<8><8> | 8x8x 32 bits |
| Control & Status | CSRx | Various |
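As an illustration of the vector register width, a 256-bit register can be
viewed as 8 int32 lanes (or equivalently 32 int8 lanes); the variable names
below are illustrative:

```python
import struct

# A 256-bit vector register holds 8 x int32 lanes (32 bytes).
lanes = [1, 2, 3, 4, 5, 6, 7, 8]
reg = struct.pack("<8i", *lanes)   # pack lanes into 32 bytes = 256 bits
assert len(reg) * 8 == 256
assert list(struct.unpack("<8i", reg)) == lanes
```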
| |
| ## Cache |
| |
Caches exist as a single layer between the core and the first level of shared
SRAM. The L1 cache and scalar core frontend are an overhead relative to the
rest of the backend compute pipeline and should ideally be as small as
possible.
| |
| The L1Icache is 8KB (256b blocks * 256 slots) with 4-way set associativity. |
| |
The L1Dcache sizing is driven by the scalar core requirements for loop
management and address generation. The L1Dcache is 16KB (SIMD256b) with a low
set associativity of 4-way. It is implemented with a dual-bank architecture
where each bank is 8KB (similar to the L1Icache), a property that allows for a
degree of next-line prefetch. The L1Dcache also serves as an alignment buffer
for the scalar and SIMD instructions to assist development and to simplify
software support. In an embedded setting, the L1Dcache provides half of the
memory bandwidth to the ML outer-product engine when only a single external
memory port is provided. Line and all-entry flushing is supported, with the
core stalling until completion to simplify the contract.
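From the stated L1Icache geometry (8KB, 4-way, 256-bit lines) an address
decomposition follows: 8KB / 32B lines = 256 lines, / 4 ways = 64 sets, giving
5 offset bits and 6 index bits. The exact field layout below is an assumption
derived from that geometry, not a statement of the actual implementation:

```python
def l1i_decompose(addr: int):
    """Split an address for an 8 KB, 4-way cache with 32-byte lines:
    5 offset bits (32-byte line), 6 index bits (64 sets), rest tag.
    Assumed layout, derived from the stated geometry."""
    offset = addr & 0x1F           # byte within the 32-byte line
    index = (addr >> 5) & 0x3F     # one of 64 sets
    tag = addr >> 11               # remaining upper bits
    return tag, index, offset
```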
| |
| A shared VLdSt unit exists for cached accesses. |
| |
| ## Uncached |
| |
| Note: It is not recommended to use intentional uncached accesses as |
| `mmap_uncached` has been seen to be buggy. |
| |
Memory may be accessed as uncached by setting a high address bit. This provides
simple fine-grained control over whether load/store units access memory
directly or through the L1 cache. Only aligned accesses of native register size
(e.g. scalar=32b, simd=256b) are allowed for uncached accesses direct to
memory. This simplifies the hardware, which is required to support a large
window of outstanding read operations, but does impose complications on the
software: code must assume C `__restrict__` semantics for any memory accessed
in this way.
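A sketch of the access rule is below; the specific bit position (bit 31) and
function name are assumptions for illustration, since the text only says "a
high address bit":

```python
UNCACHED_BIT = 31   # assumed position of the uncached-alias address bit

def check_uncached_access(addr: int, size_bits: int) -> bool:
    """Validate an uncached access: the high alias bit must be set, and
    the access must be naturally aligned to the native register size
    (scalar = 32 bits, SIMD = 256 bits)."""
    if not (addr >> UNCACHED_BIT) & 1:
        return False                    # not in the uncached alias
    size_bytes = size_bits // 8
    return addr % size_bytes == 0       # natural alignment required
```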
| |
| Separate VLd and VSt units exist for uncached accesses. |