Kelvin is a RISC-V CPU with custom SIMD instructions and microarchitectural decisions aligned with the dataplane properties of an ML accelerator. Kelvin starts with domain and matrix capabilities and then adds vector and scalar capabilities for a fused design.

Block Diagram

Kelvin block diagram

Scalar Core

A simple RISC-V scalar frontend drives the command queues of the ML+SIMD backend.

Kelvin utilizes a custom RISC-V frontend (rv32im) that runs the minimal set of instructions to support an executor run-to-completion model (e.g. no OS, no interrupts), with all control tasks onloaded to the SMC. The C extension encoding is reclaimed (as per the RISC-V specification) to provide the necessary encoding space for the SIMD registers (6b indices), and to allow flexible type encodings and instruction compression (stripmining) for the SIMD instruction set. The scalar core is an in-order machine with no speculation.

The fetch stage uses a static branch policy: backward branches are predicted taken and forward branches are predicted not-taken. If the execute result does not match the decision made in the fetch unit, a penalty cycle is incurred.
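A minimal sketch of this static policy, assuming a branch is classified by comparing its target address with its own address. The function names are illustrative, and treating a zero offset as a forward branch is an assumption rather than something the text specifies.

```python
def predict_taken(pc, target):
    """Fetch-stage guess: backward branches taken, forward branches not-taken."""
    # Assumption: a branch whose target equals its own pc counts as forward.
    return target < pc


def penalty_cycles(pc, target, actually_taken):
    """One penalty cycle when execute disagrees with the fetch-stage guess."""
    return 0 if predict_taken(pc, target) == actually_taken else 1


def loop_penalty(pc, target, iterations):
    """Total penalty for a loop-closing branch taken on all but the last pass."""
    outcomes = [True] * (iterations - 1) + [False]
    return sum(penalty_cycles(pc, target, taken) for taken in outcomes)
```

Under this model a backward loop branch only pays on loop exit, which is why the policy suits tight kernels that are dominated by loops.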

Vector Core

Kelvin SIMD

We use SIMD and vector interchangeably, referring to a simple and practical SIMD instruction definition devoid of variable-length behaviors. The scalar frontend is decoupled from the backend by a FIFO structure that buffers vector instructions, posting only to the relevant command queues when dependencies are resolved in the vector regfile.


The central component of the design is a quantized outer-product multiply-accumulate engine. An outer-product engine provides two-dimensional broadcast structures to maximize the amount of deliverable compute with respect to memory accesses. On one axis is a parallel broadcast (“wide”, convolution weights), and on the other axis are the transpose-shifted inputs of a number of batches (“narrow”, e.g. MobileNet XY batching).

Kelvin MAC

The outer-product construction is a vertical arrangement of multiple VDOT opcodes, each of which performs four 8-bit multiplies reduced into a 32-bit accumulator.
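The arithmetic can be modeled in a few lines. This is a behavioral sketch only: it assumes 8 "narrow" groups and 8 "wide" groups of four 8-bit values each, feeding an 8x8 grid of accumulators. The operand layout and the function names are hypothetical, and Python integers are unbounded, so 32-bit overflow behavior is not modeled.

```python
def vdot(a, b):
    """Four 8-bit multiplies reduced into a single sum."""
    assert len(a) == 4 and len(b) == 4
    return sum(x * y for x, y in zip(a, b))


def outer_product_mac(acc, narrow, wide):
    """acc[i][j] += vdot(narrow[i], wide[j]) for every (i, j) pair."""
    for i, n in enumerate(narrow):
        for j, w in enumerate(wide):
            acc[i][j] += vdot(n, w)
    return acc


def zero_acc(rows=8, cols=8):
    """Fresh accumulator grid; each row is a distinct list."""
    return [[0] * cols for _ in range(rows)]
```

Each narrow group is reused across all eight wide groups and vice versa, so 16 operand groups yield 64 dot products per step. That reuse is the broadcast benefit an outer-product arrangement provides.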


Strip mining is defined as folding array-based parallelism to fit the available hardware parallelism. To keep frontend instruction dispatch pressure from becoming a bottleneck, and to natively support instruction-level tiling patterns through the SIMD registers, the instruction encoding shall explicitly include a stripmine mechanism that converts a single frontend dispatch event to the command queue into four serialized issue events into the SIMD units. For instance, a “vadd v0” in Dispatch will produce “vadd v0 : vadd v1 : vadd v2 : vadd v3” at Issue. These will be processed as four discrete events.
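The dispatch-to-issue expansion can be sketched as follows. The helper name is hypothetical, and only the single register shown in the example is modeled; how additional source operands advance is not covered here.

```python
def stripmine(op, vreg, count=4):
    """Expand one dispatched instruction into `count` serialized issue events."""
    return ["{} v{}".format(op, vreg + k) for k in range(count)]


# One dispatch event becomes four issue events on consecutive registers.
issued = stripmine("vadd", 0)
print(" : ".join(issued))
```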


There are four distinct register types.

Registers        | Names         | Width
Scalar (31)      | zero, x1..x31 | 32 bits
Vector (64)      | v0..v63       | 256 bits (e.g. int32 x8)
Accumulator      | acc<8><8>     | 8x8x 32 bits
Control & Status | CSRx          | Various
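The register widths imply the following lane counts and storage sizes. This is plain arithmetic on the figures above; the constant and function names are illustrative.

```python
VECTOR_BITS = 256        # width of one vector register
VECTOR_INDEX_BITS = 6    # SIMD register index width in the encoding
ACC_ROWS, ACC_COLS, ACC_BITS = 8, 8, 32


def lanes(element_bits):
    """Number of elements of the given width in one vector register."""
    return VECTOR_BITS // element_bits


num_vector_regs = 1 << VECTOR_INDEX_BITS           # 6b indices address v0..v63
acc_total_bits = ACC_ROWS * ACC_COLS * ACC_BITS    # full accumulator array
acc_in_vector_regs = acc_total_bits // VECTOR_BITS # equivalent vector registers
```

The accumulator array therefore holds as much state as eight vector registers.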


A single layer of cache exists between the core and the first level of shared SRAM. The L1 cache and scalar core frontend are overhead relative to the rest of the backend compute pipeline and ideally are as small as possible.

The L1Icache is 8KB (256b blocks * 256 slots) with 4-way set associativity.
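The stated geometry can be checked with a short calculation. The sketch assumes "slots" means the total number of cache lines and that addresses use a conventional tag / set index / byte offset split; that split is an assumption, not something the text specifies.

```python
BLOCK_BITS = 256
SLOTS = 256
WAYS = 4

BLOCK_BYTES = BLOCK_BITS // 8               # 32 bytes per line
SIZE_BYTES = BLOCK_BYTES * SLOTS            # 8192 bytes = 8KB
SETS = SLOTS // WAYS                        # 64 sets of 4 ways
OFFSET_BITS = BLOCK_BYTES.bit_length() - 1  # 5 bits of byte offset
INDEX_BITS = SETS.bit_length() - 1          # 6 bits of set index


def split_address(addr):
    """Split an address into (tag, set index, byte offset)."""
    offset = addr & (BLOCK_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset
```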

The L1Dcache is sized for the scalar core's requirements to perform loop management and address generation. The L1Dcache is 16KB (SIMD256b) with a low set associativity of 4 ways. It is implemented with a dual-bank architecture where each bank is 8KB (similar to the L1Icache). This property allows for a degree of next-line prefetch. The L1Dcache also serves as an alignment buffer for the scalar and SIMD instructions to assist development and to simplify software support. In an embedded setting, the L1Dcache provides half of the memory bandwidth to the ML outer-product engine when only a single external memory port is provided. Both single-line and all-entry flushes are supported; the core stalls until the flush completes, which simplifies the contract.

A shared VLdSt unit exists for cached accesses.


Note: Intentional uncached accesses are not recommended, as mmap_uncached has been seen to be buggy.

Memory may be accessed as uncached by setting a high address bit. This gives simple, fine-grained control over whether load/store units access memory directly or through the L1 cache. Uncached accesses direct to memory must be aligned and of native register size (e.g. scalar=32b, simd=256b). This simplifies the hardware, which is required to support a large window of outstanding read operations, but it does impose complications on the software: the code must assume C __restrict__ attributes for any memory accessed in this way.
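The addressing rule can be illustrated with a small model. The text does not say which address bit selects uncached access, so bit 31 below is a placeholder assumption, and the helper names are hypothetical.

```python
UNCACHED_BIT = 31                # hypothetical: the real bit position is not given
UNCACHED_MASK = 1 << UNCACHED_BIT
NATIVE_SIZES = (4, 32)           # bytes: scalar = 32b, SIMD = 256b


def as_uncached(addr):
    """Return the alias of `addr` that bypasses the L1 cache."""
    return addr | UNCACHED_MASK


def is_uncached(addr):
    """True when the uncached-select bit is set in the address."""
    return bool(addr & UNCACHED_MASK)


def uncached_access_ok(addr, size_bytes):
    """Uncached accesses must be native register size and naturally aligned."""
    if not is_uncached(addr) or size_bytes not in NATIVE_SIZES:
        return False
    return (addr & ~UNCACHED_MASK) % size_bytes == 0
```

In this model an unaligned or odd-sized uncached access is rejected outright, reflecting that only the cached path through the L1Dcache provides alignment buffering.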

Separate VLd and VSt units exist for uncached accesses.