# Kelvin

Kelvin is a RISC-V CPU with custom SIMD instructions and microarchitectural
decisions aligned with the dataplane properties of an ML accelerator. Kelvin
starts with domain and matrix capabilities and then adds vector and scalar
capabilities for a fused design.

## Block Diagram

![Kelvin block diagram](images/arch.png)

## Scalar Core

A simple RISC-V scalar frontend drives the command queues of the ML+SIMD
backend.

Kelvin uses a custom RISC-V frontend (rv32im) that runs the minimal set of
instructions needed to support an executor run-to-completion model (e.g. no OS,
no interrupts), with all control tasks onloaded to the SMC. The C extension
encoding is reclaimed (as permitted by the RISC-V specification) to provide the
necessary encoding space for the SIMD registers (6-bit indices), and to allow
flexible type encodings and instruction compression (stripmining) for the SIMD
instruction set. The scalar core is an in-order machine with no speculation.

The branch policy in the fetch stage is static: backward branches are predicted
taken and forward branches are predicted not-taken, incurring a penalty cycle
if the execute result does not match the decision in the fetch unit.

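The static policy can be sketched as a small model; the function names and the
offset convention (negative PC-relative offset means a backward branch) are
illustrative, not part of the ISA.

```python
# Minimal sketch of the fetch-stage branch policy described above.
# Convention (an assumption for illustration): a negative PC-relative
# offset means a backward branch.

def fetch_predict_taken(branch_offset: int) -> bool:
    """Static prediction: backward branches taken, forward not-taken."""
    return branch_offset < 0

def penalty_cycles(branch_offset: int, actually_taken: bool) -> int:
    """One penalty cycle when the execute result disagrees with fetch."""
    return 0 if fetch_predict_taken(branch_offset) == actually_taken else 1
```

A loop back-edge (negative offset, usually taken) therefore costs no penalty in
the common case.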
## Vector Core

![Kelvin SIMD](images/simd.png)

We use SIMD and vector interchangeably, referring to a simple and practical SIMD
instruction definition devoid of variable-length behaviors. The scalar frontend
is decoupled from the backend by a FIFO structure that buffers vector
instructions, posting them to the relevant command queues only when dependencies
are resolved in the vector register file.

### MAC

The central component of the design is a quantized outer-product
multiply-accumulate engine. An outer-product engine provides two-dimensional
broadcast structures to maximize the amount of deliverable compute with respect
to memory accesses. One axis carries a parallel broadcast (“wide”, convolution
weights); the other axis carries the transpose-shifted inputs of a number of
batches (“narrow”, e.g. MobileNet XY batching).

![Kelvin MAC](images/mac.png)

The outer-product construction is a vertical arrangement of multiple VDOT
opcodes, each of which reduces 4x 8-bit multiplies into a 32-bit accumulator.

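As a software sketch (not the hardware), one outer-product accumulate step can
be modeled as below. The 8x8 tile mirrors the acc&lt;8&gt;&lt;8&gt; accumulator register
and the 4-wide dot mirrors the VDOT reduction, but the function itself is
illustrative.

```python
# Sketch of one quantized outer-product MAC step. Weights are broadcast
# along one axis ("wide"); a batch of transpose-shifted inputs sits on the
# other ("narrow"). Each accumulator cell reduces a group of 8-bit products
# into a 32-bit value, mirroring the VDOT reduction described above.

def outer_product_mac(acc, weights, inputs):
    """acc[i][j] += dot(weights[i], inputs[j]); int8 operands, int32 acc."""
    for i, w in enumerate(weights):        # wide axis: broadcast weights
        for j, x in enumerate(inputs):     # narrow axis: batched inputs
            acc[i][j] += sum(a * b for a, b in zip(w, x))  # VDOT reduce
    return acc
```

In the real engine the tile is 8x8 int32 accumulators and each reduction covers
4x 8-bit multiplies; Python integers stand in for the fixed-width arithmetic.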
### Stripmining

Stripmining is defined as folding array-based parallelism to fit the available
hardware parallelism. To keep frontend instruction dispatch pressure from
becoming a bottleneck, and to natively support instruction-level tiling
patterns through the SIMD registers, the instruction encoding explicitly
includes a stripmine mechanism that converts a single frontend dispatch event
to the command queue into four serialized issue events to the SIMD units. For
instance, a `vadd v0` at Dispatch will produce `vadd v0 : vadd v1 : vadd v2 :
vadd v3` at Issue. These are processed as four discrete events.

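The expansion can be sketched as below; the helper name and the assumption that
the base register index is aligned to the stripmine factor are mine, not stated
by the ISA.

```python
# Sketch of the stripmine expansion: one dispatched SIMD opcode with base
# register vN issues as four serialized ops on vN..vN+3.

def stripmine_expand(opcode: str, base_reg: int, factor: int = 4):
    """Expand a single Dispatch event into serialized Issue events."""
    # Assumption: the base register index is aligned to the stripmine factor.
    assert base_reg % factor == 0, "base register must be aligned"
    return [f"{opcode} v{base_reg + i}" for i in range(factor)]
```

For example, `stripmine_expand("vadd", 0)` yields the four issue events
`vadd v0`, `vadd v1`, `vadd v2`, `vadd v3` from the text above.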
## Registers

There are four distinct register types.

Registers        | Names         | Width
---------------- | ------------- | ------------------------
Scalar (31)      | zero, x1..x31 | 32 bits
Vector (64)      | v0..v63       | 256 bits (e.g. int32 x8)
Accumulator      | acc&lt;8&gt;&lt;8&gt;     | 8x8x 32 bits
Control & Status | CSRx          | Various

## Cache

Caches exist as a single layer between the core and the first level of shared
SRAM. The L1 cache and scalar core frontend are an overhead to the rest of the
backend compute pipeline and are ideally as small as possible.

The L1Icache is 8KB (256b blocks * 256 slots) with 4-way set associativity.

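The stated L1Icache parameters can be cross-checked with a short calculation.
Treating a "slot" as one cache block, and the derived set count and address-bit
split, are my inference rather than statements from the text.

```python
# Consistency check of the L1Icache geometry: 8KB, 256-bit blocks,
# 256 slots, 4-way set associative (derived values are inferred).
BLOCK_BITS = 256
SLOTS = 256
WAYS = 4

block_bytes = BLOCK_BITS // 8               # 32 bytes per block
size_bytes = block_bytes * SLOTS            # 8192 bytes = 8KB total
sets = SLOTS // WAYS                        # 64 sets
offset_bits = block_bytes.bit_length() - 1  # 5 block-offset address bits
index_bits = sets.bit_length() - 1          # 6 set-index address bits
```
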
The L1Dcache is sized toward the scalar core's requirements for loop management
and address generation. The L1Dcache is 16KB (SIMD 256b) with low set
associativity (4-way). The L1Dcache is implemented with a dual-bank
architecture where each bank is 8KB (similar to the L1Icache). This property
allows for a degree of next-line prefetch. The L1Dcache also serves as an
alignment buffer for the scalar and SIMD instructions, assisting development
and simplifying software support. In an embedded setting, the L1Dcache provides
half of the memory bandwidth to the ML outer-product engine when only a single
external memory port is provided. Line and all-entry flushing are supported,
with the core stalling until completion to simplify the contract.

A shared VLdSt unit exists for cached accesses.

## Uncached

Note: It is not recommended to use intentional uncached accesses, as
`mmap_uncached` has been seen to be buggy.

Memory may be accessed as uncached by setting a high address bit. This provides
simple fine-grained control over whether load/store units access memory
directly or through the L1 cache. Only aligned accesses of native register size
(e.g. scalar = 32b, SIMD = 256b) are allowed for uncached accesses direct to
memory. This simplifies the hardware, which is required to support a large
window of outstanding read operations, but does impose complications on the
software: the code must assume C `__restrict__` semantics for any memory
accessed in this way.

Separate VLd and VSt units exist for uncached accesses.
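
The addressing rule can be sketched as below; which bit is the "high" bit and
the helper names are hypothetical, since the document does not specify them.

```python
# Sketch of the uncached aliasing rule: a high address bit selects the
# uncached path, and only naturally aligned accesses of native register
# size are legal. The exact bit position is an assumption.
UNCACHED_BIT = 1 << 31  # hypothetical choice of "high address bit"

def uncached_alias(addr: int) -> int:
    """Return the uncached alias of an address."""
    return addr | UNCACHED_BIT

def is_legal_uncached(addr: int, size_bits: int) -> bool:
    """Aligned accesses of native register size only (scalar 32b, SIMD 256b)."""
    return size_bits in (32, 256) and addr % (size_bits // 8) == 0
```
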