Michael Hoang | cd23513 | 2023-11-07

# Kelvin

Kelvin is a RISC-V CPU with custom SIMD instructions and microarchitectural
decisions aligned with the dataplane properties of an ML accelerator. Kelvin
starts with domain and matrix capabilities and then adds vector and scalar
capabilities for a fused design.

## Block Diagram

![Kelvin block diagram](images/arch.png)

## Scalar Core

A simple RISC-V scalar frontend drives the command queues of the ML+SIMD
backend.

Kelvin utilizes a custom RISC-V frontend (rv32im) that runs the minimal set of
instructions to support an executor run-to-completion model (e.g. no OS, no
interrupts), with all control tasks onloaded to the SMC. The C extension
encoding is reclaimed (as per the RISC-V specification) to provide the necessary
encoding space for the SIMD registers (6b indices), and to allow flexible type
encodings and instruction compression (stripmining) for the SIMD instruction
set. The scalar core is an in-order machine with no speculation.

The branch policy in the fetch stage is static: backward branches are predicted
taken and forward branches are predicted not-taken, incurring a one-cycle
penalty when the execute result does not match the decision made in the fetch
unit.
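The static policy above can be sketched as a small model; the function names and the penalty accounting are illustrative, not the RTL:

```python
def fetch_predicts_taken(pc: int, target: int) -> bool:
    """Static fetch-stage prediction: backward branches (target below the
    branch PC) are predicted taken, forward branches not-taken."""
    return target < pc

def branch_penalty(pc: int, target: int, actually_taken: bool) -> int:
    """One penalty cycle whenever the execute-stage outcome disagrees with
    the fetch-stage prediction; zero otherwise."""
    return 0 if fetch_predicts_taken(pc, target) == actually_taken else 1
```

A backward branch that is actually taken costs nothing; a forward branch that turns out taken pays the one-cycle penalty.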

## Vector Core

![Kelvin SIMD](images/simd.png)

We use SIMD and vector interchangeably, referring to a simple and practical SIMD
instruction definition devoid of variable-length behaviors. The scalar frontend
is decoupled from the backend by a FIFO structure that buffers vector
instructions, posting them to the relevant command queues only when their
dependencies are resolved in the vector regfile.
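A toy model of this decoupling, assuming a simple readiness set; the class and method names are hypothetical, and real dependency tracking happens against the vector regfile rather than a Python set:

```python
from collections import deque

class DecoupleFifo:
    """Toy model: buffer vector instructions behind the scalar frontend and
    post the head to its command queue only once all of its source registers
    are resolved (in order, no speculation)."""

    def __init__(self, ready_regs):
        self.fifo = deque()
        self.ready = set(ready_regs)  # registers resolved in the vector regfile

    def dispatch(self, instr, srcs):
        self.fifo.append((instr, set(srcs)))

    def post(self):
        """Drain instructions whose dependencies are all resolved, in order."""
        posted = []
        while self.fifo and self.fifo[0][1] <= self.ready:
            posted.append(self.fifo.popleft()[0])
        return posted
```

Because only the FIFO head can post, a stalled instruction blocks younger ones, matching the in-order, non-speculative scalar core.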

### MAC

The central component of the design is a quantized outer-product
multiply-accumulate engine. An outer-product engine provides two-dimensional
broadcast structures to maximize the amount of deliverable compute with respect
to memory accesses. One axis carries a parallel broadcast (“wide”, convolution
weights), and the other axis carries the transpose-shifted inputs of a number of
batches (“narrow”, e.g. MobileNet XY batching).

![Kelvin MAC](images/mac.png)

The outer-product construction is a vertical arrangement of multiple VDOT
opcodes, each of which reduces 4x 8-bit multiplies into a 32-bit accumulator.
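As a behavioral sketch of the arithmetic (the tile shape here is illustrative and this models the reduction, not the broadcast wiring):

```python
def vdot4(a, b):
    """VDOT-style reduction: four int8 products summed into one int32."""
    return sum(x * y for x, y in zip(a, b))

def outer_product_step(acc, weights, inputs):
    """One accumulate step: every "wide" weight lane is broadcast against
    every "narrow" input lane, and each vdot4 result is accumulated into
    the 32-bit accumulator tile entry acc[i][j]."""
    for i, w in enumerate(weights):
        for j, x in enumerate(inputs):
            acc[i][j] += vdot4(w, x)
    return acc
```

Each weight vector is read once and reused against every input lane (and vice versa), which is the bandwidth advantage the outer-product structure buys.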

### Stripmining

Stripmining is defined as folding array-based parallelism to fit the available
hardware parallelism. To keep frontend instruction dispatch pressure from
becoming a bottleneck, and to natively support instruction-level tiling patterns
through the SIMD registers, the instruction encoding explicitly includes a
stripmine mechanism that converts a single frontend dispatch event posted to the
command queue into four serialized issue events into the SIMD units. For
instance a “vadd v0” in Dispatch will produce “vadd v0 : vadd v1 : vadd v2 :
vadd v3” at Issue. These are processed as four discrete events.
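The Dispatch-to-Issue expansion can be sketched as follows; the textual opcode form is illustrative:

```python
def stripmine(op: str, base_reg: int, factor: int = 4):
    """Expand one dispatched SIMD instruction into `factor` serialized issue
    events, stepping the vector register index (e.g. vadd v0 -> v0..v3)."""
    return [f"{op} v{base_reg + k}" for k in range(factor)]
```

One dispatch slot thus covers four registers' worth of work, which is why the register indices need the reclaimed 6-bit encoding space.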

## Registers

There are four distinct register types.

Registers        | Names         | Width
---------------- | ------------- | -------------------------
Scalar (31)      | zero, x1..x31 | 32 bits
Vector (64)      | v0..v63       | 256 bits (e.g. int32 x8)
Accumulator      | acc<8><8>     | 8x8 x 32 bits
Control & Status | CSRx          | Various

## Cache

Caches exist as a single layer between the core and the first level of shared
SRAM. The L1 cache and scalar core frontend are an overhead to the rest of the
backend compute pipeline and ideally are as small as possible.

The L1Icache is 8KB (256b blocks * 256 slots) with 4-way set associativity.

The L1Dcache is sized toward the scalar core's requirements for loop management
and address generation. The L1Dcache is 16KB (SIMD256b) with a low set
associativity of 4-way. The L1Dcache is implemented with a dual-bank
architecture where each bank is 8KB (similar to the L1Icache). This property
allows for a degree of next-line prefetch. The L1Dcache also serves as an
alignment buffer for the scalar and SIMD instructions, to assist development and
to simplify software support. In an embedded setting, the L1Dcache provides half
of the memory bandwidth to the ML outer-product engine when only a single
external memory port is provided. Line and all-entry flushing are supported,
with the core stalling until completion to simplify the contract.

A shared VLdSt unit exists for cached accesses.
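Assuming conventional modulo set indexing (an assumption; the indexing function is not stated here), the 8KB, 4-way, 256-bit-line L1Icache geometry works out to 64 sets:

```python
LINE_BYTES = 32                             # 256-bit blocks
WAYS = 4
CACHE_BYTES = 8 * 1024                      # 8KB L1Icache
SETS = CACHE_BYTES // (LINE_BYTES * WAYS)   # 256 slots / 4 ways = 64 sets

def set_index(addr: int) -> int:
    """Which set a byte address maps to, assuming modulo indexing."""
    return (addr // LINE_BYTES) % SETS

def tag_of(addr: int) -> int:
    """Remaining upper address bits, compared against the 4 ways in a set."""
    return addr // (LINE_BYTES * SETS)
```

Each 8KB L1Dcache bank has the same geometry, which is what makes the dual-bank next-line prefetch arrangement cheap.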

## Uncached

Note: It is not recommended to use intentional uncached accesses, as
`mmap_uncached` has been seen to be buggy.

Memory may be accessed as uncached through the setting of a high address bit.
This provides simple fine-grained control over whether load/store units access
memory directly or through the L1 cache. We only allow aligned accesses of
native register size (e.g. scalar=32b, SIMD=256b) via uncached accesses direct
to memory. This simplifies the hardware, which is required to support a large
window of outstanding read operations, but does impose complications on the
software. The code must assume C `__restrict__` attributes for any memory
accessed in this way.
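A sketch of the legality rule for uncached accesses; the bit position and helper name are hypothetical, since the text only says "a high address bit":

```python
UNCACHED_BIT = 1 << 31  # hypothetical position of the high "uncached" address bit

def is_legal_uncached_access(addr: int, width_bytes: int) -> bool:
    """Uncached accesses must be native register width (scalar 4B, SIMD 32B)
    and naturally aligned; everything else must go through the L1 cache."""
    if not (addr & UNCACHED_BIT):
        return False                # cached address space, rule does not apply
    offset = addr & ~UNCACHED_BIT
    return width_bytes in (4, 32) and offset % width_bytes == 0
```

Restricting uncached traffic to aligned, register-width accesses is what keeps the large outstanding-read window tractable in hardware.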

Separate VLd and VSt units exist for uncached accesses.