Initial commit of overview.md
- Also clean up some Chisel
Change-Id: I7bb97991aa36f6cf127092a53c2f7ff661d0e9f0
diff --git a/README.md b/README.md
index b697d08..79de965 100644
--- a/README.md
+++ b/README.md
@@ -2,6 +2,11 @@
Kelvin is a RISC-V32IM core with a custom instruction set.
+![Kelvin block diagram](doc/images/arch.png)
+
+More information on the design can be found in the
+[overview](doc/overview.md).
+
## Building
 Kelvin uses [bazel](https://bazel.build/) as its build system. The Verilated
diff --git a/doc/images/arch.png b/doc/images/arch.png
new file mode 100644
index 0000000..0483cf5
--- /dev/null
+++ b/doc/images/arch.png
Binary files differ
diff --git a/doc/images/mac.png b/doc/images/mac.png
new file mode 100644
index 0000000..775b37e
--- /dev/null
+++ b/doc/images/mac.png
Binary files differ
diff --git a/doc/images/simd.png b/doc/images/simd.png
new file mode 100644
index 0000000..b3db5dd
--- /dev/null
+++ b/doc/images/simd.png
Binary files differ
diff --git a/doc/overview.md b/doc/overview.md
new file mode 100644
index 0000000..b741e3f
--- /dev/null
+++ b/doc/overview.md
@@ -0,0 +1,110 @@
+# Kelvin
+
+Kelvin is a RISC-V CPU with custom SIMD instructions and microarchitectural
+decisions aligned with the dataplane properties of an ML accelerator. Kelvin
+starts with domain and matrix capabilities and then adds vector and scalar
+capabilities for a fused design.
+
+## Block Diagram
+
+![Kelvin block diagram](images/arch.png)
+
+## Scalar Core
+
+A simple RISC-V scalar frontend drives the command queues of the ML+SIMD
+backend.
+
+Kelvin utilizes a custom RISC-V frontend (rv32im) that runs the minimal set of
+instructions needed to support an executor run-to-completion model (e.g. no OS,
+no interrupts), with all control tasks offloaded to the SMC. The C extension
+encoding is reclaimed (as per the RISC-V specification) to provide the
+necessary encoding space for the SIMD registers (6-bit indices), and to allow
+flexible type encodings and instruction compression (stripmining) for the SIMD
+instruction set. The scalar core is an in-order machine with no speculation.
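+
+As an illustration of the reclaimed space (a sketch only, not the actual Kelvin
+decode): standard 32-bit RISC-V instructions always have the two low opcode
+bits set, so the three RVC quadrants become free encoding space once the C
+extension is dropped.
+
+```scala
+import chisel3._
+
+// Sketch only: standard 32-bit RVI encodings have inst(1,0) === "b11";
+// the reclaimed RVC quadrants (00/01/10) are available for the SIMD and
+// stripmined encodings. The actual Kelvin field layout is not shown here.
+def inReclaimedSpace(inst: UInt): Bool = inst(1, 0) =/= "b11".U
+```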
+
+The branch policy in the fetch stage is static: backward branches are predicted
+taken and forward branches are predicted not-taken, incurring a penalty cycle
+when the execute result does not match the decision made in the fetch unit.
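+
+A minimal Chisel sketch of this static policy (signal and module names are
+illustrative, not the actual Fetch unit):
+
+```scala
+import chisel3._
+import chisel3.util._
+
+// Backward branches (negative B-type offset) are predicted taken,
+// forward branches are predicted not-taken.
+class StaticBranchPredict extends Module {
+  val io = IO(new Bundle {
+    val pc     = Input(UInt(32.W))
+    val inst   = Input(UInt(32.W))
+    val nextPc = Output(UInt(32.W))
+  })
+
+  val isBranch = io.inst(6, 0) === "b1100011".U  // RV32 conditional branch opcode
+  // Sign-extended B-type immediate; the sign bit inst(31) marks a backward branch.
+  val offset = Cat(Fill(20, io.inst(31)), io.inst(7), io.inst(30, 25),
+                   io.inst(11, 8), 0.U(1.W))
+  val predictTaken = isBranch && io.inst(31)
+
+  io.nextPc := Mux(predictTaken, io.pc + offset, io.pc + 4.U)
+}
+```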
+
+## Vector Core
+
+![Kelvin SIMD](images/simd.png)
+
+We use SIMD and vector interchangeably, referring to a simple and practical SIMD
+instruction definition devoid of variable-length behaviors. The scalar frontend
+is decoupled from the backend by a FIFO structure that buffers vector
+instructions, posting them to the relevant command queues only when dependencies
+are resolved in the vector regfile.
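+
+A simplified sketch of that decoupling (the real design uses the Fifo4
+structures under `hdl/chisel/src/common`; the names below are placeholders):
+
+```scala
+import chisel3._
+import chisel3.util._
+
+// Buffers dispatched vector instructions and releases them to the backend
+// only when the vector-regfile scoreboard reports dependencies as resolved.
+class VInstBuffer(depth: Int = 16) extends Module {
+  val io = IO(new Bundle {
+    val fromFrontend = Flipped(Decoupled(UInt(32.W)))
+    val toBackend    = Decoupled(UInt(32.W))
+    val depsClear    = Input(Bool())
+  })
+
+  val q = Module(new Queue(UInt(32.W), depth))
+  q.io.enq <> io.fromFrontend
+  io.toBackend.valid := q.io.deq.valid && io.depsClear
+  io.toBackend.bits  := q.io.deq.bits
+  q.io.deq.ready     := io.toBackend.ready && io.depsClear
+}
+```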
+
+### MAC
+
+The central component of the design is a quantized outer-product
+multiply-accumulate engine. An outer-product engine provides two-dimensional
+broadcast structures to maximize the amount of deliverable compute with respect
+to memory accesses. On one axis is a parallel broadcast (“wide”, convolution
+weights); on the other axis are the transpose-shifted inputs of a number of
+batches (“narrow”, e.g. MobileNet XY batching).
+
+![Kelvin MAC](images/mac.png)
+
+The outer-product construction is a vertical arrangement of multiple VDOT
+opcodes, each utilizing 4x 8-bit multiplies reduced into 32-bit accumulators.
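+
+A software model of this reduction (illustrative Scala, not the Chisel
+implementation; the 8x8 tile matches the accumulator shape listed in the
+register table below):
+
+```scala
+// One VDOT step: four int8 x int8 products summed into a 32-bit accumulator.
+def vdot4(acc: Int, a: Seq[Byte], b: Seq[Byte]): Int =
+  acc + (0 until 4).map(i => a(i).toInt * b(i).toInt).sum
+
+// Outer product: every "wide" lane meets every "narrow" lane, accumulating
+// into an 8x8 tile of 32-bit values.
+def outerProductStep(acc: Array[Array[Int]],
+                     wide: Seq[Seq[Byte]],
+                     narrow: Seq[Seq[Byte]]): Unit =
+  for (r <- 0 until 8; c <- 0 until 8)
+    acc(r)(c) = vdot4(acc(r)(c), wide(r), narrow(c))
+```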
+
+### Stripmining
+
+Stripmining is defined as folding array-based parallelism to fit the available
+hardware parallelism. To prevent frontend instruction dispatch pressure from
+becoming a bottleneck, and to natively support instruction-level tiling patterns
+through the SIMD registers, the instruction encoding explicitly includes a
+stripmine mechanism that converts a single frontend dispatch event to the
+command queue into four serialized issue events into the SIMD units. For
+instance, a “vadd v0” in Dispatch will produce “vadd v0 : vadd v1 : vadd v2 :
+vadd v3” at Issue. These are processed as four discrete events.
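+
+A small model of that expansion (illustrative only; the real mechanism lives in
+the dispatch and issue hardware):
+
+```scala
+// A dispatched vector op names a base register; stripmining issues it four
+// times against consecutive registers.
+case class VectorOp(name: String, reg: Int)
+
+def stripmine(op: VectorOp): Seq[VectorOp] =
+  (0 until 4).map(i => op.copy(reg = op.reg + i))
+
+// Example: stripmine(VectorOp("vadd", 0)) yields
+//   Seq(VectorOp("vadd",0), VectorOp("vadd",1), VectorOp("vadd",2), VectorOp("vadd",3))
+// which the SIMD units process as four discrete issue events.
+```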
+
+## Registers
+
+There are 4 distinct register types.
+
+Registers | Names | Width
+---------------- | ------------- | -----------------------
+Scalar (31) | zero, x1..x31 | 32 bits
+Vector (64) | v0..v63 | 256 bits (e.g. int32 x8)
+Accumulator | acc<8><8> | 8 x 8 x 32 bits
+Control & Status | CSRx | Various
+
+## Cache
+
+Caches exist as a single layer between the core and the first level of shared
+SRAM. The L1 cache and scalar core frontend are overhead relative to the rest of
+the backend compute pipeline and are ideally kept as small as possible.
+
+The L1Icache is 8KB (256b blocks * 256 slots) with 4-way set associativity.
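+
+The geometry works out as follows, mirroring the address split used in
+`hdl/chisel/src/kelvin/L1ICache.scala`:
+
+```scala
+import chisel3.util.log2Ceil
+
+val lineBits  = 256                       // fetch line width
+val slots     = 256                       // cache lines (l1islots)
+val assoc     = 4
+val sizeBytes = slots * lineBits / 8      // 8192 bytes = 8KB
+val sets      = slots / assoc             // 64
+val dataBits  = log2Ceil(lineBits / 8)    // 5  -> Data  [4:0]
+val indexBits = log2Ceil(sets)            // 6  -> Index [10:5]
+val tagBits   = 32 - indexBits - dataBits // 21 -> Tag   [31:11]
+```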
+
+The L1Dcache is sized towards the scalar core requirements of loop management
+and address generation. The L1Dcache is 16KB (SIMD256b) with a low 4-way set
+associativity. The L1Dcache is implemented with a dual-bank architecture where
+each bank is 8KB (similar to the L1Icache). This property allows for a degree of
+next-line prefetch. The L1Dcache also serves as an alignment buffer for the
+scalar and SIMD instructions to assist development and to simplify software
+support. In an embedded setting, the L1Dcache provides half of the memory
+bandwidth to the ML outer-product engine when only a single external memory port
+is provided. Both line and all-entry flushing are supported, with the core
+stalling until completion to simplify the contract.
+
+A shared VLdSt unit exists for cached accesses.
+
+## Uncached
+
+Note: It is not recommended to use intentional uncached accesses as
+`mmap_uncached` has been seen to be buggy.
+
+Memory may be accessed as uncached by setting a high address bit. This provides
+simple fine-grained control over whether the load/store units access memory
+directly or through the L1 cache. Only aligned accesses of native register size
+(e.g. scalar=32b, simd=256b) are allowed for uncached accesses direct to memory.
+This simplifies the hardware, which is required to support a large window of
+outstanding read operations, but does impose complications on the software: the
+code must assume C `__restrict__` attributes for any memory accessed in this
+way.
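+
+A sketch of the resulting access check (the choice of bit 31 below is an
+assumption for illustration; the actual selecting bit is defined by the memory
+map):
+
+```scala
+import chisel3._
+
+// High address bit routes the access around the L1 cache; uncached accesses
+// must be aligned and of native register size (32b scalar or 256b SIMD).
+class UncachedSelect extends Module {
+  val io = IO(new Bundle {
+    val addr      = Input(UInt(32.W))
+    val sizeBytes = Input(UInt(6.W))
+    val uncached  = Output(Bool())
+    val legal     = Output(Bool())
+  })
+
+  io.uncached := io.addr(31)
+  val scalarOk = io.sizeBytes === 4.U  && io.addr(1, 0) === 0.U
+  val simdOk   = io.sizeBytes === 32.U && io.addr(4, 0) === 0.U
+  io.legal := !io.uncached || scalarOk || simdOk
+}
+```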
+
+Separate VLd and VSt units exist for uncached accesses.
diff --git a/hdl/chisel/src/common/Fifo4.scala b/hdl/chisel/src/common/Fifo4.scala
index 0c277a1..952ff33 100644
--- a/hdl/chisel/src/common/Fifo4.scala
+++ b/hdl/chisel/src/common/Fifo4.scala
@@ -127,16 +127,6 @@
in1pos === i.U && in1valid(1),
in0pos === i.U && in0valid(0))
- // Couldn't get the following to work properly.
- //
- // val data = MuxOR(valid(0), io.in.bits(0).bits.asUInt) |
- // MuxOR(valid(1), io.in.bits(1).bits.asUInt) |
- // MuxOR(valid(2), io.in.bits(2).bits.asUInt) |
- // MuxOR(valid(3), io.in.bits(3).bits.asUInt)
- //
- // when (ivalid && valid =/= 0.U) {
- // mem(i) := data.asTypeOf(t)
- // }
when (ivalid) {
when (valid(0)) {
mem(i) := io.in.bits(0).bits
@@ -166,7 +156,6 @@
when (mcount > 0.U) {
mslice.io.in.bits := mem(outpos)
} .elsewhen (ivalid) {
- // As above, couldn't get MuxOR to work.
when (iactive(0)) {
mslice.io.in.bits := io.in.bits(0).bits
} .elsewhen (iactive(1)) {
diff --git a/hdl/chisel/src/common/Fifo4e.scala b/hdl/chisel/src/common/Fifo4e.scala
index ac2a484..a298552 100644
--- a/hdl/chisel/src/common/Fifo4e.scala
+++ b/hdl/chisel/src/common/Fifo4e.scala
@@ -100,16 +100,6 @@
in1pos === i.U && in1valid(1),
in0pos === i.U && in0valid(0))
- // Couldn't get the following to work properly.
- //
- // val data = MuxOR(valid(0), io.in.bits(0).bits.asUInt) |
- // MuxOR(valid(1), io.in.bits(1).bits.asUInt) |
- // MuxOR(valid(2), io.in.bits(2).bits.asUInt) |
- // MuxOR(valid(3), io.in.bits(3).bits.asUInt)
- //
- // when (ivalid && valid =/= 0.U) {
- // mem(i) := data.asTypeOf(t)
- // }
when (ivalid) {
when (valid(0)) {
mem(i) := io.in.bits(0).bits
diff --git a/hdl/chisel/src/kelvin/L1DCache.scala b/hdl/chisel/src/kelvin/L1DCache.scala
index 13e61a1..4c0d8d0 100644
--- a/hdl/chisel/src/kelvin/L1DCache.scala
+++ b/hdl/chisel/src/kelvin/L1DCache.scala
@@ -262,7 +262,7 @@
// 2^8 * 256 / 8 = 8KiB 4-way Tag[31,12] + Index[11,6] + Data[5,0]
val slots = p.l1dslots
val slotBits = log2Ceil(slots)
- val assoc = 4 // 2, 4, 8, 16, slots
+ val assoc = 4
val sets = slots / assoc
val setLsb = log2Ceil(p.lsuDataBits / 8)
val setMsb = log2Ceil(sets) + setLsb - 1
@@ -342,7 +342,6 @@
val valid = RegInit(VecInit(Seq.fill(slots)(false.B)))
val dirty = RegInit(VecInit(Seq.fill(slots)(false.B)))
val camaddr = Reg(Vec(slots, UInt(32.W)))
- // val mem = Mem1RWM(slots, p.lsuDataBits * 9 / 8, 9)
val mem = Module(new Sram_1rwm_256x288())
val history = Reg(Vec(slots / assoc, Vec(assoc, UInt(log2Ceil(assoc).W))))
diff --git a/hdl/chisel/src/kelvin/L1ICache.scala b/hdl/chisel/src/kelvin/L1ICache.scala
index 55cc19c..135bb6c 100644
--- a/hdl/chisel/src/kelvin/L1ICache.scala
+++ b/hdl/chisel/src/kelvin/L1ICache.scala
@@ -26,13 +26,13 @@
class L1ICache(p: Parameters) extends Module {
// A relatively simple cache block. Only one transaction may post at a time.
- // 2^8 * 256 / 8 = 8KiB 4-way Tag[31,12] + Index[11,6] + Data[5,0]
+ // 2^8 * 256 / 8 = 8KiB 4-way Tag[31,11] + Index[10,5] + Data[4,0]
assert(p.axi0IdBits == 4)
assert(p.axi0DataBits == 256)
val slots = p.l1islots
val slotBits = log2Ceil(slots)
- val assoc = 4 // 2, 4, 8, 16, slots
+ val assoc = 4
val sets = slots / assoc
val setLsb = log2Ceil(p.fetchDataBits / 8)
val setMsb = log2Ceil(sets) + setLsb - 1
@@ -71,7 +71,6 @@
// CAM state.
val valid = RegInit(VecInit(Seq.fill(slots)(false.B)))
val camaddr = Reg(Vec(slots, UInt(32.W)))
- // val mem = Mem1RW(slots, UInt(p.axi0DataBits.W))
val mem = Module(new Sram_1rw_256x256())
val history = Reg(Vec(slots / assoc, Vec(assoc, UInt(log2Ceil(assoc).W))))
diff --git a/hdl/chisel/src/kelvin/Parameters.scala b/hdl/chisel/src/kelvin/Parameters.scala
index b9e3d77..bb8a149 100644
--- a/hdl/chisel/src/kelvin/Parameters.scala
+++ b/hdl/chisel/src/kelvin/Parameters.scala
@@ -25,7 +25,7 @@
}
// Vector Length (register-file and compute).
- // 128 = faster builds, but not production(?).
+ // 128 = faster builds, but not production.
val vectorBits = sys.env.get("KELVIN_SIMD").getOrElse("256").toInt
assert(vectorBits == 512 || vectorBits == 256 || vectorBits == 128)
@@ -46,9 +46,7 @@
val vectorFifoDepth = 16
// L0ICache Fetch unit.
- // val fetchCacheBytes = 2048
val fetchCacheBytes = 1024
- // val fetchCacheBytes = 128
// Scalar Core Fetch bus.
val fetchAddrBits = 32 // do not change
diff --git a/hdl/chisel/src/kelvin/scalar/Alu.scala b/hdl/chisel/src/kelvin/scalar/Alu.scala
index 445dacc..06818c1 100644
--- a/hdl/chisel/src/kelvin/scalar/Alu.scala
+++ b/hdl/chisel/src/kelvin/scalar/Alu.scala
@@ -79,31 +79,10 @@
op := io.req.op
}
- // val rs1 = MuxOR(valid, io.rs1.data)
- // val rs2 = MuxOR(valid, io.rs2.data)
val rs1 = io.rs1.data
val rs2 = io.rs2.data
val shamt = rs2(4,0)
- // TODO: should we be masking like this for energy?
- // TODO: a single addsub for add/sub/slt/sltu
- // val add = MuxOR(op(alu.ADD), rs1) + MuxOR(op(alu.ADD), rs2)
- // val sub = MuxOR(op(alu.SUB), rs1) - MuxOR(op(alu.SUB), rs2)
- // val sll = MuxOR(op(alu.SLL), rs1) << MuxOR(op(alu.SLL), shamt)
- // val srl = MuxOR(op(alu.SRL), rs1) >> MuxOR(op(alu.SRL), shamt)
- // val sra = (MuxOR(op(alu.SRA), rs1.asSInt, 0.S) >> MuxOR(op(alu.SRA), shamt)).asUInt
- // val slt = MuxOR(op(alu.SLT), rs1.asSInt, 0.S) < MuxOR(op(alu.SLT), rs2.asSInt, 0.S)
- // val sltu = MuxOR(op(alu.SLTU), rs1) < MuxOR(op(alu.SLTU), rs2)
- // val and = MuxOR(op(alu.AND), rs1) & MuxOR(op(alu.AND), rs2)
- // val or = MuxOR(op(alu.OR), rs1) | MuxOR(op(alu.OR), rs2)
- // val xor = MuxOR(op(alu.XOR), rs1) ^ MuxOR(op(alu.XOR), rs2)
- // val lui = MuxOR(op(alu.LUI), rs2)
- // val clz = MuxOR(op(alu.CLZ), CLZ(rs1))
- // val ctz = MuxOR(op(alu.CTZ), CTZ(rs1))
- // val pcnt = MuxOR(op(alu.PCNT), PopCount(rs1))
-
- // io.rd.data := add | sub | sll | srl | sra | slt | sltu | and | or | xor | lui
-
io.rd.valid := valid
io.rd.addr := addr
io.rd.data := MuxOR(op(alu.ADD), rs1 + rs2) |
diff --git a/hdl/chisel/src/kelvin/scalar/Decode.scala b/hdl/chisel/src/kelvin/scalar/Decode.scala
index 48f3622..9c7abf4 100644
--- a/hdl/chisel/src/kelvin/scalar/Decode.scala
+++ b/hdl/chisel/src/kelvin/scalar/Decode.scala
@@ -12,6 +12,12 @@
// See the License for the specific language governing permissions and
// limitations under the License.
+
+// Decode: Contains decode logic to be forwarded to the appropriate functional
+// block. A serialization mechanism is introduced to stall a decoded instruction
+// from being presented to the functional block until the next cycle if the
+// block has already been presented with an instruction from another decoder.
+
package kelvin
import chisel3._
diff --git a/hdl/chisel/src/kelvin/scalar/Fetch.scala b/hdl/chisel/src/kelvin/scalar/Fetch.scala
index 722c428..fb951e1 100644
--- a/hdl/chisel/src/kelvin/scalar/Fetch.scala
+++ b/hdl/chisel/src/kelvin/scalar/Fetch.scala
@@ -12,6 +12,11 @@
// See the License for the specific language governing permissions and
// limitations under the License.
+
+// Fetch Unit: 4-way fetcher that directly feeds the 4 decoders.
+// The fetcher has a partial decoder to identify branches, where backward
+// branches are assumed taken and forward branches assumed not-taken.
+
package kelvin
import chisel3._
diff --git a/hdl/chisel/src/kelvin/scalar/Regfile.scala b/hdl/chisel/src/kelvin/scalar/Regfile.scala
index c594fcb..a3e9d84 100644
--- a/hdl/chisel/src/kelvin/scalar/Regfile.scala
+++ b/hdl/chisel/src/kelvin/scalar/Regfile.scala
@@ -12,6 +12,10 @@
// See the License for the specific language governing permissions and
// limitations under the License.
+// Regfile: 32-entry scalar register file with 8 read ports and 6
+// write ports. Houses a global scoreboard that informs the decoders of
+// interlock dependencies.
+
package kelvin
import chisel3._
@@ -92,11 +96,8 @@
}
})
- // 8R6W
- // 8 read ports
- // 6 write ports
- // The scalar registers, integer (and float todo).
+ // The scalar registers.
val regfile = Reg(Vec(32, UInt(32.W)))
// ***************************************************************************
diff --git a/hdl/chisel/src/kelvin/scalar/SCore.scala b/hdl/chisel/src/kelvin/scalar/SCore.scala
index 07ef6ad..b380584 100644
--- a/hdl/chisel/src/kelvin/scalar/SCore.scala
+++ b/hdl/chisel/src/kelvin/scalar/SCore.scala
@@ -12,6 +12,8 @@
// See the License for the specific language governing permissions and
// limitations under the License.
+
+// Scalar Core Frontend
package kelvin
import chisel3._
diff --git a/hdl/chisel/src/kelvin/vector/VCore.scala b/hdl/chisel/src/kelvin/vector/VCore.scala
index 7e4cb00..029900a 100644
--- a/hdl/chisel/src/kelvin/vector/VCore.scala
+++ b/hdl/chisel/src/kelvin/vector/VCore.scala
@@ -26,12 +26,6 @@
}
}
-// object VCore {
-// def apply(p: Parameters): VCoreEmpty = {
-// return Module(new VCoreEmpty(p))
-// }
-// }
-
class VCoreIO(p: Parameters) extends Bundle {
// Decode cycle.
val vinst = Vec(4, new VInstIO)
@@ -310,51 +304,3 @@
vld.io.nempty || vst.io.nempty
}
-class VCoreEmpty(p: Parameters) extends Module {
- val io = IO(new Bundle {
- // Score <> VCore
- val score = new VCoreIO(p)
-
- // Data bus interface.
- val dbus = new DBusIO(p)
- val last = Output(Bool())
-
- // AXI interface.
- val ld = new AxiMasterReadIO(p.axi2AddrBits, p.axi2DataBits, p.axi2IdBits)
- val st = new AxiMasterWriteIO(p.axi2AddrBits, p.axi2DataBits, p.axi2IdBits)
- })
-
- io.score.undef := io.score.vinst(0).valid || io.score.vinst(1).valid ||
- io.score.vinst(2).valid || io.score.vinst(3).valid
-
- io.score.mactive := false.B
-
- io.dbus.valid := false.B
- io.dbus.write := false.B
- io.dbus.size := 0.U
- io.dbus.addr := 0.U
- io.dbus.adrx := 0.U
- io.dbus.wdata := 0.U
- io.dbus.wmask := 0.U
- io.last := false.B
-
- for (i <- 0 until 4) {
- io.score.vinst(i).ready := true.B
- io.score.rd(i).valid := false.B
- io.score.rd(i).addr := 0.U
- io.score.rd(i).data := 0.U
- }
-
- io.ld.addr.valid := false.B
- io.ld.addr.bits.addr := 0.U
- io.ld.addr.bits.id := 0.U
- io.ld.data.ready := false.B
-
- io.st.addr.valid := false.B
- io.st.addr.bits.addr := 0.U
- io.st.addr.bits.id := 0.U
- io.st.data.valid := false.B
- io.st.data.bits.data := 0.U
- io.st.data.bits.strb := 0.U
- io.st.resp.ready := false.B
-}