Michael Hoang | cd23513 | 2023-11-07

# Kelvin

Kelvin is a RISC-V CPU with custom SIMD instructions and microarchitectural
decisions aligned with the dataplane properties of an ML accelerator. Kelvin
starts with domain and matrix capabilities and then adds vector and scalar
capabilities for a fused design.

## Block Diagram

![Kelvin block diagram](images/arch.png)

## Scalar Core

A simple RISC-V scalar frontend drives the command queues of the ML+SIMD
backend.

Kelvin utilizes a custom RISC-V frontend (rv32im) that runs the minimal set of
instructions to support an executor run-to-completion model (e.g. no OS, no
interrupts), with all control tasks onloaded to the SMC. The C extension
encoding is reclaimed (as per the RISC-V specification) to provide the necessary
encoding space for the SIMD registers (6b indices), and to allow flexible type
encodings and instruction compression (stripmining) for the SIMD instruction
set. The scalar core is an in-order machine with no speculation.

The branch policy in the fetch stage is static: backward branches are predicted
taken and forward branches are predicted not-taken, incurring a one-cycle
penalty when the execute result does not match the decision made in the fetch
unit.
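The static policy above can be sketched as a small model; the function names and the penalty accounting are illustrative, not the RTL:

```python
def fetch_predicts_taken(pc: int, target: int) -> bool:
    """Static fetch-stage prediction: backward branches (target below the
    branch PC) are predicted taken, forward branches not-taken."""
    return target < pc

def branch_penalty(pc: int, target: int, actually_taken: bool) -> int:
    """One penalty cycle whenever the execute-stage outcome disagrees with
    the fetch-stage prediction; zero otherwise."""
    return 0 if fetch_predicts_taken(pc, target) == actually_taken else 1
```

A backward branch that is actually taken costs nothing; a forward branch that turns out taken pays the one-cycle penalty.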

## Vector Core

![Kelvin SIMD](images/simd.png)

We use SIMD and vector interchangeably, referring to a simple and practical SIMD
instruction definition devoid of variable-length behaviors. The scalar frontend
is decoupled from the backend by a FIFO structure that buffers vector
instructions, posting them to the relevant command queues only when their
dependencies are resolved in the vector regfile.
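A toy model of this decoupling, assuming a simple readiness set; the class and method names are hypothetical, and real dependency tracking happens against the vector regfile rather than a Python set:

```python
from collections import deque

class DecoupleFifo:
    """Toy model: buffer vector instructions behind the scalar frontend and
    post the head to its command queue only once all of its source registers
    are resolved (in order, no speculation)."""

    def __init__(self, ready_regs):
        self.fifo = deque()
        self.ready = set(ready_regs)  # registers resolved in the vector regfile

    def dispatch(self, instr, srcs):
        self.fifo.append((instr, set(srcs)))

    def post(self):
        """Drain instructions whose dependencies are all resolved, in order."""
        posted = []
        while self.fifo and self.fifo[0][1] <= self.ready:
            posted.append(self.fifo.popleft()[0])
        return posted
```

Because only the FIFO head can post, a stalled instruction blocks younger ones, matching the in-order, non-speculative scalar core.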

### MAC

The central component of the design is a quantized outer-product
multiply-accumulate engine. An outer-product engine provides two-dimensional
broadcast structures to maximize the amount of deliverable compute with respect
to memory accesses. One axis carries a parallel broadcast (“wide”, convolution
weights), and the other axis carries the transpose-shifted inputs of a number of
batches (“narrow”, e.g. MobileNet XY batching).

![Kelvin MAC](images/mac.png)

The outer-product construction is a vertical arrangement of multiple VDOT
opcodes, each of which reduces 4x 8-bit multiplies into a 32-bit accumulator.
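As a behavioral sketch of the arithmetic (the tile shape here is illustrative and this models the reduction, not the broadcast wiring):

```python
def vdot4(a, b):
    """VDOT-style reduction: four int8 products summed into one int32."""
    return sum(x * y for x, y in zip(a, b))

def outer_product_step(acc, weights, inputs):
    """One accumulate step: every "wide" weight lane is broadcast against
    every "narrow" input lane, and each vdot4 result is accumulated into
    the 32-bit accumulator tile entry acc[i][j]."""
    for i, w in enumerate(weights):
        for j, x in enumerate(inputs):
            acc[i][j] += vdot4(w, x)
    return acc
```

Each weight vector is read once and reused against every input lane (and vice versa), which is the bandwidth advantage the outer-product structure buys.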

### Stripmining

Stripmining is defined as folding array-based parallelism to fit the available
hardware parallelism. To keep frontend instruction dispatch pressure from
becoming a bottleneck, and to natively support instruction-level tiling patterns
through the SIMD registers, the instruction encoding explicitly includes a
stripmine mechanism that converts a single frontend dispatch event posted to the
command queue into four serialized issue events into the SIMD units. For
instance a “vadd v0” in Dispatch will produce “vadd v0 : vadd v1 : vadd v2 :
vadd v3” at Issue. These are processed as four discrete events.
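The Dispatch-to-Issue expansion can be sketched as follows; the textual opcode form is illustrative:

```python
def stripmine(op: str, base_reg: int, factor: int = 4):
    """Expand one dispatched SIMD instruction into `factor` serialized issue
    events, stepping the vector register index (e.g. vadd v0 -> v0..v3)."""
    return [f"{op} v{base_reg + k}" for k in range(factor)]
```

One dispatch slot thus covers four registers' worth of work, which is why the register indices need the reclaimed 6-bit encoding space.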

## Registers

There are four distinct register types.

Registers        | Names         | Width
---------------- | ------------- | -------------------------
Scalar (31)      | zero, x1..x31 | 32 bits
Vector (64)      | v0..v63       | 256 bits (e.g. int32 x8)
Accumulator      | acc<8><8>     | 8x8 x 32 bits
Control & Status | CSRx          | Various

## Cache

Caches exist as a single layer between the core and the first level of shared
SRAM. The L1 cache and scalar core frontend are an overhead to the rest of the
backend compute pipeline and ideally are as small as possible.

The L1Icache is 8KB (256b blocks * 256 slots) with 4-way set associativity.

The L1Dcache is sized toward the scalar core's requirements for loop management
and address generation. The L1Dcache is 16KB (SIMD256b) with a low set
associativity of 4-way. The L1Dcache is implemented with a dual-bank
architecture where each bank is 8KB (similar to the L1Icache). This property
allows for a degree of next-line prefetch. The L1Dcache also serves as an
alignment buffer for the scalar and SIMD instructions, to assist development and
to simplify software support. In an embedded setting, the L1Dcache provides half
of the memory bandwidth to the ML outer-product engine when only a single
external memory port is provided. Line and all-entry flushing are supported,
with the core stalling until completion to simplify the contract.

A shared VLdSt unit exists for cached accesses.
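Assuming conventional modulo set indexing (an assumption; the indexing function is not stated here), the 8KB, 4-way, 256-bit-line L1Icache geometry works out to 64 sets:

```python
LINE_BYTES = 32                             # 256-bit blocks
WAYS = 4
CACHE_BYTES = 8 * 1024                      # 8KB L1Icache
SETS = CACHE_BYTES // (LINE_BYTES * WAYS)   # 256 slots / 4 ways = 64 sets

def set_index(addr: int) -> int:
    """Which set a byte address maps to, assuming modulo indexing."""
    return (addr // LINE_BYTES) % SETS

def tag_of(addr: int) -> int:
    """Remaining upper address bits, compared against the 4 ways in a set."""
    return addr // (LINE_BYTES * SETS)
```

Each 8KB L1Dcache bank has the same geometry, which is what makes the dual-bank next-line prefetch arrangement cheap.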

## Uncached

Note: It is not recommended to use intentional uncached accesses, as
`mmap_uncached` has been seen to be buggy.

Memory may be accessed as uncached through the setting of a high address bit.
This provides simple fine-grained control over whether load/store units access
memory directly or through the L1 cache. We only allow aligned accesses of
native register size (e.g. scalar=32b, SIMD=256b) via uncached accesses direct
to memory. This simplifies the hardware, which is required to support a large
window of outstanding read operations, but does impose complications on the
software. The code must assume C `__restrict__` attributes for any memory
accessed in this way.
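A sketch of the legality rule for uncached accesses; the bit position and helper name are hypothetical, since the text only says "a high address bit":

```python
UNCACHED_BIT = 1 << 31  # hypothetical position of the high "uncached" address bit

def is_legal_uncached_access(addr: int, width_bytes: int) -> bool:
    """Uncached accesses must be native register width (scalar 4B, SIMD 32B)
    and naturally aligned; everything else must go through the L1 cache."""
    if not (addr & UNCACHED_BIT):
        return False                # cached address space, rule does not apply
    offset = addr & ~UNCACHED_BIT
    return width_bytes in (4, 32) and offset % width_bytes == 0
```

Restricting uncached traffic to aligned, register-width accesses is what keeps the large outstanding-read window tractable in hardware.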

Separate VLd and VSt units exist for uncached accesses.