# Kelvin
Kelvin is a RISC-V CPU with custom SIMD instructions and microarchitectural
decisions aligned with the dataplane properties of an ML accelerator. Kelvin
starts with domain and matrix capabilities and then adds vector and scalar
capabilities for a fused design.
## Block Diagram
![Kelvin block diagram](images/arch.png)
## Scalar Core
A simple RISC-V scalar frontend drives the command queues of the ML+SIMD
backend.
Kelvin utilizes a custom RISC-V frontend (rv32im) that runs the minimal set of
instructions needed to support an executor run-to-completion model (e.g. no OS,
no interrupts), with all control tasks onloaded to the SMC. The C extension
encoding is reclaimed (as permitted by the RISC-V specification) to provide the
necessary encoding space for the SIMD registers (6-bit indices), and to allow
flexible type encodings and instruction compression (stripmining) for the SIMD
instruction set. The scalar core is an in-order machine with no speculation.
The branch policy in the fetch stage is that backward branches are predicted
taken and forward branches are predicted not-taken, incurring a one-cycle
penalty when the execute result does not match the decision made in the fetch
unit.
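
As a software-level illustration (a sketch, not Kelvin-specific code), this
static policy favors loops whose back-edge is a backward branch: every
iteration except the final exit matches the fetch-stage prediction.

```c
/* Illustration only: with a static backward-taken / forward-not-taken
 * policy, a counted loop's back-edge (a backward branch) is predicted
 * correctly on every iteration; only the loop exit pays the one-cycle
 * penalty. */
int sum(const int *data, int n) {
  int acc = 0;
  for (int i = 0; i < n; ++i) {  /* back-edge: backward branch, predicted taken */
    acc += data[i];
  }
  return acc;                    /* loop exit: mismatch, one penalty cycle */
}
```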
## Vector Core
![Kelvin SIMD](images/simd.png)
We use SIMD and vector interchangeably, referring to a simple and practical
SIMD instruction definition devoid of variable-length behaviors. The scalar
frontend is decoupled from the backend by a FIFO structure that buffers vector
instructions, posting them to the relevant command queues only once their
dependencies are resolved in the vector regfile.
### MAC
The central component of the design is a quantized outer-product
multiply-accumulate engine. An outer-product engine provides two-dimensional
broadcast structures to maximize the amount of deliverable compute with respect
to memory accesses. On one axis is a parallel broadcast (“wide”, convolution
weights), and on the other axis are the transpose-shifted inputs of a number of
batches (“narrow”, e.g. MobileNet XY batching).
![Kelvin MAC](images/mac.png)
The outer-product construction is a vertical arrangement of multiple VDOT
opcodes, each utilizing 4x 8-bit multiplies reduced into 32-bit accumulators.
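
The following C model is a hedged sketch of the arithmetic only, assuming an
8x8 tile of 32-bit accumulators and four int8 products reduced per accumulator
update; it illustrates the outer-product dataflow, not the actual VDOT opcode
semantics.

```c
#include <stdint.h>

/* Sketch of the outer-product accumulation pattern: an 8x8 tile of
 * 32-bit accumulators, each updated with a 4-element int8 dot product.
 * Array shapes and names are illustrative, not the Kelvin ISA. */
#define TILE 8
#define DOT  4  /* int8 products reduced per accumulator update */

void outer_product_mac(int32_t acc[TILE][TILE],
                       const int8_t wide[TILE][DOT],    /* broadcast weights */
                       const int8_t narrow[TILE][DOT])  /* batched inputs    */
{
  for (int r = 0; r < TILE; ++r) {
    for (int c = 0; c < TILE; ++c) {
      int32_t dot = 0;
      for (int k = 0; k < DOT; ++k) {
        dot += (int32_t)wide[r][k] * (int32_t)narrow[c][k];
      }
      acc[r][c] += dot;  /* 32-bit accumulation */
    }
  }
}
```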
### Stripmining
Strip mining is defined as folding array-based parallelism to fit the available
hardware parallelism. To prevent frontend instruction dispatch from becoming a
bottleneck, and to natively support instruction-level tiling patterns through
the SIMD registers, the instruction encoding explicitly includes a stripmine
mechanism that converts a single frontend dispatch event to the command queue
into four serialized issue events into the SIMD units. For instance, a
`vadd v0` at Dispatch will produce `vadd v0 : vadd v1 : vadd v2 : vadd v3` at
Issue. These will be processed as four discrete events.
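
As a software model (a sketch, not generated code), a stripmined `vadd` behaves
as if it covered four consecutive vector registers, i.e. 4 x 8 int32 lanes with
the 256-bit registers listed in the table below.

```c
#include <stdint.h>

/* Model of stripmined issue: a single dispatched, stripmined "vadd v0"
 * behaves like four serialized issues covering v0..v3, i.e. 4 strips of
 * 8 int32 lanes for 256-bit registers. Purely illustrative. */
enum { LANES = 8, STRIPS = 4 };

void vadd_stripmined(int32_t dst[STRIPS][LANES],
                     const int32_t a[STRIPS][LANES],
                     const int32_t b[STRIPS][LANES])
{
  for (int s = 0; s < STRIPS; ++s) {    /* vadd v0, v1, v2, v3 at Issue */
    for (int l = 0; l < LANES; ++l) {
      dst[s][l] = a[s][l] + b[s][l];
    }
  }
}
```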
## Registers
There are 4 distinct register types.
Registers | Names | Width
---------------- | ------------- | -----------------------
Scalar (31) | zero, x1..x31 | 32 bits
Vector (64)      | v0..v63       | 256 bits (e.g. int32 x8)
Accumulator | acc<8><8> | 8x8x 32 bits
Control & Status | CSRx | Various
## Cache
Caches exist as a single layer between the core and the first level of shared
SRAM. The L1 cache and scalar core frontend are an overhead to the rest of the
backend compute pipeline and ideally are as small as possible.
The L1Icache is 8KB (256b blocks * 256 slots) with 4-way set associativity.
The L1Dcache is sized towards the scalar core requirements for loop management
and address generation. The L1Dcache is 16KB (SIMD256b) with a low set
associativity of 4-way. The L1Dcache is implemented with a dual-bank
architecture where each bank is 8KB (similar to the L1Icache). This property
allows for a degree of next-line prefetch. The L1Dcache also serves as an
alignment buffer for the scalar and SIMD instructions to assist development and
to simplify software support. In an embedded setting, the L1Dcache provides
half of the memory bandwidth to the ML outer-product engine when only a single
external memory port is provided. Line and all-entry flushing are supported,
with the core stalling until completion to simplify the contract.
A shared VLdSt unit exists for cached accesses.
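
The stated sizes imply the geometry checked below; this is only a worked
confirmation of the numbers above, assuming 256-bit (32-byte) lines, not a
hardware description.

```c
/* Compile-time check of the stated cache geometry, assuming 256-bit
 * (32-byte) lines as described above. */
#define LINE_BYTES   32            /* 256-bit block                */
#define ICACHE_BYTES (8 * 1024)    /* 8KB L1Icache, 4-way          */
#define DCACHE_BYTES (16 * 1024)   /* 16KB L1Dcache, two 8KB banks */

_Static_assert(ICACHE_BYTES / LINE_BYTES == 256,
               "L1Icache: 256 slots of 256-bit blocks");
_Static_assert((ICACHE_BYTES / LINE_BYTES) / 4 == 64,
               "L1Icache: 4-way => 64 sets");
_Static_assert((DCACHE_BYTES / 2) / LINE_BYTES == 256,
               "L1Dcache: each 8KB bank holds 256 lines");
```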
## Uncached
Note: It is not recommended to use intentional uncached accesses as
`mmap_uncached` has been seen to be buggy.
Memory may be accessed as uncached by setting a high address bit. This provides
simple fine-grained control over whether load/store units access memory
directly or through the L1 cache. We only allow aligned accesses of native
register size (e.g. scalar = 32b, SIMD = 256b) via uncached accesses direct to
memory. This simplifies the hardware, which is required to support a large
window of outstanding read operations, but does impose complications on the
software: the code must assume C `__restrict__` attributes for any memory
accessed in this way.
Separate VLd and VSt units exist for uncached accesses.
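
Given the note above, the sketch below is illustrative only. The actual high
address bit is not specified here, so `UNCACHED_ADDR_BIT` is a hypothetical
placeholder, and `uncached_u32`/`copy_words_uncached` are made-up helper names;
the points it demonstrates are the aligned native-size accesses and the
`__restrict__` assumption.

```c
#include <stdint.h>

/* Sketch only: UNCACHED_ADDR_BIT is a placeholder, not the real address
 * bit. Accesses use the native scalar register size (32 bits) and must
 * be aligned; the pointers must behave as if __restrict__-qualified. */
#define UNCACHED_ADDR_BIT (1u << 31)  /* hypothetical placeholder */

static inline volatile uint32_t *uncached_u32(uintptr_t cached_addr) {
  return (volatile uint32_t *)(cached_addr | UNCACHED_ADDR_BIT);
}

void copy_words_uncached(uintptr_t dst, uintptr_t src, int n_words) {
  volatile uint32_t *restrict d = uncached_u32(dst);
  const volatile uint32_t *restrict s = uncached_u32(src);
  for (int i = 0; i < n_words; ++i) {
    d[i] = s[i];  /* 32-bit aligned, native scalar register size */
  }
}
```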