|  | # Kelvin | 
|  |  | 
|  | Kelvin is a RISC-V CPU with custom SIMD instructions and microarchitectural | 
|  | decisions aligned with the dataplane properties of an ML accelerator. Kelvin | 
|  | starts with domain and matrix capabilities and then adds vector and scalar | 
|  | capabilities for a fused design. | 
|  |  | 
|  | ## Block Diagram | 
|  |  | 
|  |  | 
|  |  | 
|  | ## Scalar Core | 
|  |  | 
|  | A simple RISC-V scalar frontend drives the command queues of the ML+SIMD | 
|  | backend. | 
|  |  | 
|  | Kelvin utilizes a custom RISC-V frontend (rv32im) that runs the minimal set of | 
|  | instructions to support an executor run-to-completion model (e.g. no OS, no | 
|  | interrupts), with all control tasks onloaded to the SMC. The C extension | 
|  | encoding is reclaimed (as per the RISC-V specification) to provide the necessary | 
|  | encoding space for the SIMD registers (6b indices), and to allow flexible type | 
|  | encodings and instruction compression (stripmining) for the SIMD instruction | 
|  | set. The scalar core is an in-order machine with no speculation. | 
|  |  | 
|  | The fetch stage uses a static branch policy: backward branches are predicted | 
|  | taken and forward branches are predicted not taken. A penalty cycle is incurred | 
|  | if the execute result does not match the decision made in the fetch unit. | 
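
The policy above can be modeled in a few lines of C. This is a minimal sketch, not Kelvin's RTL: the function names are hypothetical, and treating a zero offset as a forward branch is an assumption the text does not address.

```c
#include <stdbool.h>
#include <stdint.h>

/* Static fetch-stage decision: a negative offset means a backward branch,
 * which is predicted taken. Zero or positive offsets are treated as
 * forward branches (assumption for the zero case) and predicted not taken. */
static bool predict_taken(int32_t branch_offset) {
    return branch_offset < 0;
}

/* Penalty cycles charged when the execute result disagrees with the
 * fetch-stage decision: one cycle on a mismatch, none otherwise. */
static unsigned branch_penalty(int32_t branch_offset, bool actually_taken) {
    return predict_taken(branch_offset) != actually_taken ? 1u : 0u;
}
```

Under this model a loop-closing backward branch costs nothing until the final iteration, where the fall-through is mispredicted and one penalty cycle is paid.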
|  |  | 
|  | ## Vector Core | 
|  |  | 
|  |  | 
|  |  | 
|  | We use SIMD and vector interchangeably, referring to a simple and practical SIMD | 
|  | instruction definition devoid of variable-length behaviors. The scalar frontend | 
|  | is decoupled from the backend by a FIFO structure that buffers vector | 
|  | instructions, posting only to the relevant command queues when dependencies are | 
|  | resolved in the vector regfile. | 
|  |  | 
|  | ### MAC | 
|  |  | 
|  | The central component of the design is a quantized outer-product | 
|  | multiply-accumulate engine. An outer-product engine provides two-dimensional | 
|  | broadcast structures to maximize the amount of deliverable compute with respect | 
|  | to memory accesses. On one axis is a parallel broadcast (“wide”, convolution | 
|  | weights), and on the other axis are the transpose-shifted inputs of a number of | 
|  | batches (“narrow”, e.g. MobileNet XY batching). | 
|  |  | 
|  |  | 
|  |  | 
|  | The outer-product construction is a vertical arrangement of multiple VDOT | 
|  | opcodes, each of which performs four 8-bit multiplies reduced into a 32-bit | 
|  | accumulator. | 
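
As an illustration, the arithmetic can be written as a scalar C reference model. This is a sketch under stated assumptions: the operands are taken to be signed 8-bit values, the 8x8 shape is taken from the `acc<8><8>` accumulator register, and the names `vdot4` and `outer_product_mac` are hypothetical rather than Kelvin intrinsics.

```c
#include <stdint.h>

enum { WIDE = 8, NARROW = 8, DOT = 4 };

/* One VDOT lane: four 8-bit products reduced into a 32-bit accumulator.
 * Signed operands are an assumption; the text only says "8-bit". */
static int32_t vdot4(int32_t acc, const int8_t a[DOT], const int8_t b[DOT]) {
    for (int i = 0; i < DOT; ++i) {
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return acc;
}

/* One outer-product step: every "wide" lane (e.g. weights) is combined
 * with every "narrow" lane (e.g. batched inputs), so 8 + 8 operand groups
 * feed 64 accumulators. */
static void outer_product_mac(int32_t acc[NARROW][WIDE],
                              const int8_t wide[WIDE][DOT],
                              const int8_t narrow[NARROW][DOT]) {
    for (int n = 0; n < NARROW; ++n) {
        for (int w = 0; w < WIDE; ++w) {
            acc[n][w] = vdot4(acc[n][w], wide[w], narrow[n]);
        }
    }
}
```

Each call reads two 256-bit operands (8 groups of four 8-bit values each) and performs 256 multiplies, which is the compute-per-memory-access benefit the broadcast structure is after.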
|  |  | 
|  | ### Stripmining | 
|  |  | 
|  | Stripmining is defined as folding array-based parallelism to fit the available | 
|  | hardware parallelism. To keep frontend instruction dispatch pressure from | 
|  | becoming a bottleneck, and to natively support instruction-level tiling patterns | 
|  | through the SIMD registers, the instruction encoding shall explicitly include a | 
|  | stripmine mechanism that converts a single frontend dispatch event to the | 
|  | command queue into four serialized issue events into the SIMD units. For | 
|  | instance, a “vadd v0” in Dispatch will produce “vadd v0 : vadd v1 : vadd v2 : | 
|  | vadd v3” at Issue. These will be processed as four discrete events. | 
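
The dispatch-to-issue expansion can be sketched in C as follows. The type and function names are hypothetical, only the destination register shown in the example above is modeled, and rejecting a base register above v60 is an assumption about how out-of-range groups would be handled.

```c
#include <stdint.h>

enum { NUM_VREGS = 64, STRIPMINE_FACTOR = 4 };

/* One issue event as seen by a SIMD unit (destination register only). */
typedef struct {
    uint8_t vd; /* vector register index, 0..63 */
} issue_event;

/* Expand a single dispatched stripmine op on vd_base into four serialized
 * issue events on vd_base .. vd_base+3. Returns the number of events
 * written, or -1 if the register group would run past v63 (assumption). */
static int stripmine_expand(uint8_t vd_base,
                            issue_event out[STRIPMINE_FACTOR]) {
    if (vd_base > NUM_VREGS - STRIPMINE_FACTOR) {
        return -1;
    }
    for (int i = 0; i < STRIPMINE_FACTOR; ++i) {
        out[i].vd = (uint8_t)(vd_base + i);
    }
    return STRIPMINE_FACTOR;
}
```

Calling `stripmine_expand(0, events)` fills `events` with v0, v1, v2, v3, matching the “vadd v0” example.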
|  |  | 
|  | ## Registers | 
|  |  | 
|  | There are 4 distinct register types. | 
|  |  | 
|  | Registers        | Names         | Width | 
|  | ---------------- | ------------- | ----------------------- | 
|  | Scalar (31)      | zero, x1..x31 | 32 bits | 
|  | Vector (64)      | v0..v63       | 256 bits (e.g. int32 x8) | 
|  | Accumulator      | acc<8><8>     | 8x8x 32 bits | 
|  | Control & Status | CSRx          | Various | 
|  |  | 
|  | ## Cache | 
|  |  | 
|  | Caches exist as a single layer between the core and the first level of shared | 
|  | SRAM. The L1 cache and scalar core frontend are an overhead to the rest of the | 
|  | backend compute pipeline and ideally are as small as possible. | 
|  |  | 
|  | The L1Icache is 8KB (256b blocks * 256 slots) with 4-way set associativity. | 
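
The L1Icache numbers can be checked with a little arithmetic. The sketch below assumes that "slots" means the total number of cache lines and that the set index comes from the address bits just above the line offset, which is the conventional arrangement but is not stated in this document; the helper names are hypothetical.

```c
#include <stdint.h>

enum {
    LINE_BYTES  = 256 / 8,               /* 256-bit block = 32 bytes */
    NUM_LINES   = 256,                   /* "slots" */
    NUM_WAYS    = 4,
    NUM_SETS    = NUM_LINES / NUM_WAYS,  /* 64 sets */
    CACHE_BYTES = LINE_BYTES * NUM_LINES /* 8192 bytes = 8KB */
};

/* Byte offset within a 32-byte line (low 5 address bits). */
static uint32_t line_offset(uint32_t addr) {
    return addr % LINE_BYTES;
}

/* Set selected by the address (next 6 bits, assuming conventional indexing). */
static uint32_t set_index(uint32_t addr) {
    return (addr / LINE_BYTES) % NUM_SETS;
}

/* Remaining upper bits, compared against the four ways of the set. */
static uint32_t line_tag(uint32_t addr) {
    return addr / (LINE_BYTES * NUM_SETS);
}
```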
|  |  | 
|  | The L1Dcache is sized for the scalar core's requirements to perform loop | 
|  | management and address generation. The L1Dcache is 16KB (SIMD256b) with a low | 
|  | set associativity of 4-way. The L1Dcache is implemented with a dual-bank | 
|  | architecture where each bank is 8KB (similar to the L1Icache). This property | 
|  | allows for a degree of next-line prefetch. The L1Dcache also serves as an | 
|  | alignment buffer for the scalar and SIMD instructions to assist development and | 
|  | to simplify software support. In an embedded setting, the L1Dcache provides half | 
|  | of the memory bandwidth to the ML outer-product engine when only a single | 
|  | external memory port is provided. Line flushing and all-entry flushing are | 
|  | supported; the core stalls until the flush completes to simplify the contract. | 
|  |  | 
|  | A shared VLdSt unit exists for cached accesses. | 
|  |  | 
|  | ## Uncached | 
|  |  | 
|  | Note: It is not recommended to use intentional uncached accesses as | 
|  | `mmap_uncached` has been seen to be buggy. | 
|  |  | 
|  | Memory may be accessed as uncached by setting a high address bit. This gives | 
|  | simple, fine-grained control over whether the load/store units access memory | 
|  | directly or through the L1 cache. Uncached accesses direct to memory must be | 
|  | aligned and of native register size (e.g. scalar=32b, simd=256b). This | 
|  | simplifies the hardware, which is required to support a large window of | 
|  | outstanding read operations, but does impose complications on the software. | 
|  | The code must assume C `__restrict__` attributes for any memory accessed in this | 
|  | way. | 
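
The addressing and alignment rules can be expressed as a small C sketch. The document does not say which high address bit selects uncached access, so `UNCACHED_BIT` below (bit 31) is purely a placeholder, and the helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* HYPOTHETICAL: placeholder for the unspecified "high address bit". */
#define UNCACHED_BIT (UINT32_C(1) << 31)

enum { SCALAR_BYTES = 4, SIMD_BYTES = 32 }; /* 32b scalar, 256b SIMD */

/* Form the uncached alias of an address by setting the high bit. */
static uint32_t to_uncached(uint32_t addr) {
    return addr | UNCACHED_BIT;
}

static bool is_uncached(uint32_t addr) {
    return (addr & UNCACHED_BIT) != 0;
}

/* An uncached access must be exactly one native register wide and
 * aligned to that width. */
static bool uncached_access_ok(uint32_t addr, uint32_t size_bytes) {
    if (size_bytes != SCALAR_BYTES && size_bytes != SIMD_BYTES) {
        return false;
    }
    return (addr % size_bytes) == 0;
}
```

A check like `uncached_access_ok` is the kind of assertion software might place in a debug build, since misaligned or sub-word accesses that work through the L1Dcache are not permitted on the uncached path.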
|  |  | 
|  | Separate VLd and VSt units exist for uncached accesses. |