| # Kelvin Instruction Reference |
| |
| An ML+SIMD+Scalar instruction set for ML accelerator cores. |
| |
| [TOC] |
| |
| ## SIMD register configuration |
| |
Kelvin has 64 vector registers, `v0` to `v63`, each 256 bits wide. A register
can hold lanes of 8b, 16b, or 32b data, as encoded in the instructions (see the
next section for details).
| |
Kelvin also supports stripmine behavior, which presents 16 vector registers,
each 4x the size of a typical register (see the next section for details).
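
For illustration, a minimal C model of this register file, with hypothetical
names (not part of the ISA):

```
#include <stdint.h>

#define KELVIN_VLEN_BITS 256
#define KELVIN_VLEN_BYTES (KELVIN_VLEN_BITS / 8)
#define KELVIN_NUM_VREGS 64

/* One 256-bit vector register, viewable as 8b, 16b, or 32b lanes. */
typedef union {
  uint8_t b[KELVIN_VLEN_BYTES];       /* 32 x 8b lanes */
  uint16_t h[KELVIN_VLEN_BYTES / 2];  /* 16 x 16b lanes */
  uint32_t w[KELVIN_VLEN_BYTES / 4];  /*  8 x 32b lanes */
} vreg_t;

static vreg_t vrf[KELVIN_NUM_VREGS];  /* v0 .. v63 */
```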
| |
| ## SIMD Instructions |
| |
The SIMD instructions utilize a register file with 64 entries which serves both
standard arithmetic and logical operations and the domain-specific compute.
SIMD lane size, scalar broadcast, arithmetic operation sign, and stripmine
behaviors are encoded explicitly in the opcodes.
| |
The SIMD instructions occupy the encoding space of the compressed instruction
set extension (those with 2-bit prefixes 00, 01, and 10), quadrupling the
encoding space available within the 32-bit format. See [The RISC-V
Instruction Set Manual v2.2 "Available 30-bit instruction encoding
spaces"](https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf).
| |
| ### Instruction Encodings |
| |
| 31..26 | 25..20 | 19..14 | 13..12 | 11..6 | 5 | 4..2 | 1..0 | form |
| :----: | :----: | :----: | :----: | :---: | :-: | :---: | :--: | :--: |
| func2 | vs2 | vs1 | sz | vd | m | func1 | 00 | .vv |
| func2 | [0]xs2 | vs1 | sz | vd | m | func1 | 10 | .vx |
| func2 | 000000 | vs1 | sz | vd | m | func1 | 10 | .v |
| func2 | [0]xs2 | xs1[0] | sz | vd | m | 111 | 11 | .xx |
| func2 | 000000 | xs1[0] | sz | vd | m | 111 | 11 | .x |
| |
| <br> |
| |
| 31..26 | 25..20 | 19..14 | 13..12 | 11..6 | 5 | 4..3 | 2..0 | form |
| :----: | :----: | :----: | :--------: | :---: | :-: | :--------: | :--: | :--: |
| vs3 | vs2 | vs1 | func3[3:2] | vd | m | func3[1:0] | 001 | .vvv |
| vs3 | [0]xs2 | vs1 | func3[3:2] | vd | m | func3[1:0] | 101 | .vxv |
| |
| ### Types ".b" ".h" ".w" |
| |
The SIMD lane size is encoded in the opcode definition and indicates the
destination type. For most opcodes the source and destination sizes are the
same; they differ for widening and narrowing operations.
| |
| op[13:12] | sz | type |
| :-------: | :--: | :--: |
| 00 | ".b" | 8b |
| 01 | ".h" | 16b |
| 10 | ".w" | 32b |
| |
| ### Scalar ".vx" |
| |
| Instructions may use a scalar register to perform a value broadcast (8b, 16b, |
| 32b) to all SIMD lanes of one operand. |
| |
| op[2:0] | form |
| :-----: | :------------: |
| x00 | ".vv" |
| x10 | ".vx" |
| x10 | ".v" (xs2==x0) |
| x11 | ".xx" |
| x11 | ".x" (xs2==x0) |
| 001 | ".vvv" |
| 101 | ".vxv" |
| |
| ### Signed/Unsigned ".u" |
| |
| Instructions which may be marked with ".u" have signed and unsigned variants. |
| See comparisons, arithmetic operations and saturation for usage, the side |
| effects being typical behaviors unless otherwise noted. |
| |
| ### Stripmine ".m" |
| |
The stripmine functionality is an instruction compression mechanism. Frontend
dispatch captures a single instruction, while backend issue expands it to four
operations. Conceptually the register file is reduced from 64 locations to 16,
where a stripmine register must use a mod4-aligned base register (e.g. v0, v4,
v8, ...). Normal and stripmine instruction variants may be mixed together.
| |
| When stripmining is used in conjunction with instructions which use a register |
| index as a base to several registers, the offset of +4 (instead of +1) shall be |
| used. e.g., {vm0,vm1} becomes {{v0,v1,v2,v3},{v4,v5,v6,v7}}. |
| |
| A machine may elect to distribute a stripmined instruction across multiple ALUs. |
| |
| op[5] | m |
| :---: | :--: |
| 0 | "" |
| 1 | ".m" |
| |
| ### 2-arg .xx (Load / Store) |
| |
| Instruction | func2 | Notes |
| :---------: | :-------: | :--------: |
| vld | 00 xx0PSL | 1-arg |
| vld.l | 01 xx0PSL | |
| vld.s | 02 xx0PSL | |
| vld.p | 04 xx0PSL | 1 or 2-arg |
| vld.lp | 05 xx0PSL | |
| vld.sp | 06 xx0PSL | |
| vld.tp | 07 xx0PSL | |
| vst | 08 xx1PSL | 1-arg |
| vst.l | 09 xx1PSL | |
| vst.s | 10 xx1PSL | |
| vst.p | 12 xx1PSL | 1 or 2-arg |
| vst.lp | 13 xx1PSL | |
| vst.sp | 14 xx1PSL | |
| vst.tp | 15 xx1PSL | |
| vdup.x | 16 x10000 | |
| vcget | 20 x10100 | 0-arg |
| vstq.s | 26 x11PSL | |
| vstq.sp | 30 x11PSL | |
| |
To save encoding space, compile-time knowledge is used: if a vld.p.xx or
vst.p.xx would post-increment by a zero amount, x0 is not encoded; the
post-increment operation is disabled instead, allowing the xs2==x0 encoding to
be reused for vld.p.x and vst.p.x, which have different base-update behavior.
If the post-increment amount is programmatic (only known at runtime), a
register where xs2!=x0 must be used.
| |
**NOTE**: Scalar register `xs1` uses the same encoding bitfield as the vector
register `vs1`, but **HAS ONE BIT PADDED AT LSB**. That is, `xs1` has the same
encoding as in regular RISC-V instructions (bit[19:15]). On the other hand,
`xs2` shares the same encoding bitfield as `vs2`, but **HAS ONE BIT PADDED AT
MSB**, so it is consistent with regular RISC-V instructions (bit[24:20]).
| |
| ### 1-arg .x (Load / Store) |
| |
| Instructions of the format "op.xx vd, xs1, x0" (xs2=x0, the scalar zero |
| register) are reduced to the shortened form "op.x vd, xs1". |
| |
**NOTE**: Scalar register `xs1` uses the same encoding bitfield as the vector
register `vs1`, but **HAS ONE BIT PADDED AT LSB**. That is, `xs1` has the same
encoding as in regular RISC-V instructions (bit[19:15]).
| |
| ### 0-arg |
| |
| Instructions of the format "op.xx vd, x0, x0" (xs1=x0, xs2=x0, the scalar zero |
| register) are reduced to the shortened form "op vd". |
| |
| ### 1-arg .v |
| |
| Single argument vector operations ".v" use xs2 scalar encoding "x0|zero". |
| |
| ### 2-arg .vv|.vx |
| |
| **Instruction** | func2 | **func1** / Notes |
| :-------------: | :-------: | :-----------------------: |
| **Arithmetic** | ... | **000** |
| vadd | 00 xxxxxx | |
| vsub | 01 xxxxxx | |
| vrsub | 02 xxxxxx | |
| veq | 06 xxxxxx | |
| vne | 07 xxxxxx | |
| vlt.{u} | 08 xxxxxU | |
| vle.{u} | 10 xxxxxU | |
| vgt.{u} | 12 xxxxxU | |
| vge.{u} | 14 xxxxxU | |
| vabsd.{u} | 16 xxxxxU | |
| vmax.{u} | 18 xxxxxU | |
| vmin.{u} | 20 xxxxxU | |
| vadd3 | 24 xxxxxx | |
| **Arithmetic2** | ... | **100** |
| vadds.{u} | 00 xxxxxU | |
| vsubs.{u} | 02 xxxxxU | |
| vaddw.{u} | 04 xxxxxU | |
| vsubw.{u} | 06 xxxxxU | |
| vacc.{u} | 10 xxxxxU | |
| vpadd.{u} | 12 xxxxxU | .v |
| vpsub.{u} | 14 xxxxxU | .v |
| vhadd.{ur} | 16 xxxxRU | |
| vhsub.{ur} | 20 xxxxRU | |
| **Logical** | ... | **001** |
| vand | 00 xxxxxx | |
| vor | 01 xxxxxx | |
| vxor | 02 xxxxxx | |
| vnot | 03 xxxxxx | .v |
| vrev | 04 xxxxxx | |
| vror | 05 xxxxxx | |
| vclb | 08 xxxxxx | .v |
| vclz | 09 xxxxxx | .v |
| vcpop | 10 xxxxxx | .v |
| vmv | 12 xxxxxx | .v |
| vmvp | 13 xxxxxx | |
| acset | 16 xxxxxx | |
| actr | 17 xxxxxx | .v |
| adwinit | 18 xxxxxx | |
| **Shift** | ... | **010** |
| vsll | 01 xxxxxx | |
| vsra | 02 xxxxx0 | |
| vsrl | 03 xxxxx1 | |
| vsha.{r} | 08 xxxxR0 | +/- shamt |
| vshl.{r} | 09 xxxxR1 | +/- shamt |
| vsrans{u}.{r} | 16 xxxxRU | narrowing saturating (x2) |
| vsraqs{u}.{r} | 24 xxxxRU | narrowing saturating (x4) |
| **Mul/Div** | **...** | **011** |
| vmul | 00 xxxxxx | |
| vmuls | 02 xxxxxU | |
| vmulw | 04 xxxxxU | |
| vmulh.{ur} | 08 xxxxRU | |
| vdmulh.{rn} | 16 xxxxRN | |
| vmacc | 20 xxxxxx | |
| vmadd | 21 xxxxxx | |
| **Float** | ... | **101** |
| --reserved-- | xx xxxxxx | |
| **Shuffle** | ... | **110** |
| vslidevn | 00 xxxxNN | |
| vslidehn | 04 xxxxNN | |
| vslidevp | 08 xxxxNN | |
| vslidehp | 12 xxxxNN | |
| vsel | 16 xxxxxx | |
| vevn | 24 xxxxxx | |
| vodd | 25 xxxxxx | |
| vevnodd | 26 xxxxxx | |
| vzip | 28 xxxxxx | |
| **Reserved7** | ... | **111** |
| --reserved-- | xx xxxxxx | |
| |
| ### 3-arg .vvv|.vxv |
| |
| Instruction | func3 | Notes |
| :---------: | :---: | :-----------------------: |
| aconv | 8 | scalar: sign |
| vdwconv | 10 | scalar: sign/type/swizzle |
| |
| ### Typeless |
| |
| Operations that do not have a {.b,.h,.w} type have the same behavior regardless |
| of the size field (bitwise: vand, vnot, vor, vxor; move: vmv, vmvp). The tooling |
| convention is to use size=0b00 ".b" encoding. |
| |
| ### Vertical Modes |
| |
The ".tp" mode of vld or vst uses the four registers of ".m" in a vertical
structure, compared to the horizontal usage of the other modes. In ".tp" mode
the ".m" base update is a single register width, vs. 4x the width in the other
modes. The usage model is four "lines" being processed at the same time, vs. a
single line chained together in the other ".m" modes.
| |
| ``` |
| Horizontal |
| ... AAAA BBBB CCCC DDDD ... |
| |
| vs. |
| |
| Vertical (".tp") |
| ... AAAA ... |
| ... BBBB ... |
| ... CCCC ... |
| ... DDDD ... |
| ``` |
| |
| ### Aliases |
| |
vneg.v ← vrsub.vx vd, vs1, zero \
| vabs.v ← vabsd.vx vd, vs1, zero \ |
| vwiden.v ← vaddw.vx vd, vs1, zero |
| |
| ## System Instructions |
| |
| The execution model is designed towards OS-less and interrupt-less operation. A |
| machine will typically operate as run-to-completion of small restartable |
| workloads. A user/machine mode split is provided as a runtime convenience, |
| though there is no difference in access permissions between the modes. |
| |
| 31..28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19..15 | 14..12 | 11..7 | 6..2 | 1 | 0 | OP |
| :----: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :----: | :----: | :---: | :---: | :-: | :-: | :-: |
| 0000 | PI | PO | PR | PW | SI | SO | SR | SW | 00000 | 000 | 00000 | 00011 | 1 | 1 | FENCE |
| |
| <br> |
| |
| 31..28 | 27..24 | 23..20 | 19..15 | 14..12 | 11..7 | 6..2 | 1 | 0 | OP |
| :----: | :----: | :----: | :----: | :----: | :---: | :---: | :-: | :-: | :-----: |
| 0000 | 0000 | 0000 | 00000 | 001 | 00000 | 00011 | 1 | 1 | FENCE.I |
| |
| <br> |
| |
| 31..27 | 26..25 | 24..20 | 19..15 | 14..12 | 11..7 | 6..2 | 1 | 0 | OP |
| :----: | :----: | :----: | :----: | :----: | :---: | :---: | :-: | :-: | :-: |
| 00100 | 11 | 00000 | xs1 | 000 | 00000 | 11101 | 1 | 1 | FLUSH |
| 0001M | sz | xs2 | xs1 | 000 | xd | 11101 | 1 | 1 | GET{MAX}VL |
| 01111 | 00 | 00000 | xs1 | mode | 00000 | 11101 | 1 | 1 | \[F,S,K,C\]LOG |
| |
| <br> |
| |
| 31..20 | 19..15 | 14..12 | 11..7 | 6..2 | 1 | 0 | OP |
| :----------: | :----: | :----: | :---: | :---: | :-: | :-: | :----: |
| 000000000001 | 00000 | 000 | 00000 | 11100 | 1 | 1 | EBREAK |
| 001100000010 | 00000 | 000 | 00000 | 11100 | 1 | 1 | MRET |
| 000010000000 | 00000 | 000 | 00000 | 11100 | 1 | 1 | MPAUSE |
| 000001100000 | 00000 | 000 | 00000 | 11100 | 1 | 1 | ECTXSW |
| 000001000000 | 00000 | 000 | 00000 | 11100 | 1 | 1 | EYIELD |
| 000000100000 | 00000 | 000 | 00000 | 11100 | 1 | 1 | EEXIT |
| 000000000000 | 00000 | 000 | 00000 | 11100 | 1 | 1 | ECALL |
| |
| ### Exit Cause |
| |
| * `enum_IDLE = 0` |
| * `enum_EBREAK = 1` |
| * `enum_ECALL = 2` |
| * `enum_EEXIT = 3` |
| * `enum_EYIELD = 4` |
| * `enum_ECTXSW = 5` |
| * `enum_UNDEF_INST = (1u<<31) | 2` |
| * `enum_USAGE_FAULT = (1u<<31) | 16` |
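
For reference, the same values as C constants (a sketch; the top bit marks
fault causes):

```
#include <stdint.h>

static const uint32_t enum_IDLE        = 0;
static const uint32_t enum_EBREAK      = 1;
static const uint32_t enum_ECALL       = 2;
static const uint32_t enum_EEXIT       = 3;
static const uint32_t enum_EYIELD      = 4;
static const uint32_t enum_ECTXSW      = 5;
static const uint32_t enum_UNDEF_INST  = (1u << 31) | 2;
static const uint32_t enum_USAGE_FAULT = (1u << 31) | 16;
```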
| |
| ## Instruction Definitions |
| |
| -------------------------------------------------------------------------------- |
| |
| ### FLUSH |
| |
Cache clean and invalidate operations at the private level.
| |
| **Encodings** |
| |
| flushat xs1 \ |
| flushall |
| |
| **Operation** |
| |
| ``` |
| Start = End = xs1 |
| Line = xs1 |
| ``` |

The instruction is a standard way of describing cache maintenance operations.
| |
| Type | Visibility | System1 | System2 |
| ------- | ---------- | ----------------- | --------------------- |
| Private | Core | Core L1 | Core L1 + Coherent L2 |
| |
| <br> |
| |
| -------------------------------------------------------------------------------- |
| |
| ### FENCE |
| |
| Enforce memory ordering of loads and stores for external visibility. |
| |
| **Encodings** |
| |
| fence \[i|o|r|w\], \[i|o|r|w\] \ |
| fence |
| |
| **Operation** |
| |
| ``` |
| PI predecessor I/O input |
| PO predecessor I/O output |
| PR predecessor memory read |
| PW predecessor memory write |
| <ordering between marked predecessors and successors> |
| SI successor I/O input |
| SO successor I/O output |
| SR successor memory read |
| SW successor memory write |
| ``` |
| |
| Note: a simplified implementation may have the frontend stall until all |
| preceding operations are completed before permitting any trailing instruction to |
| be dispatched. |
| |
| -------------------------------------------------------------------------------- |
| |
| ### FENCE.I |
| |
| Ensure subsequent instruction fetches observe prior data operations. |
| |
| **Encodings** |
| |
| fence.i |
| |
| **Operation** |
| |
| ``` |
| InvalidateInstructionCaches() |
| InvalidateInstructionPrefetchBuffers() |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### GETVL |
| |
| Calculate the vector length. |
| |
| **Encodings** |
| |
| getvl.[b,h,w].x xd, xs1 \ |
| getvl.[b,h,w].xx xd, xs1, xs2 \ |
| getvl.[b,h,w].x.m xd, xs1 \ |
| getvl.[b,h,w].xx.m xd, xs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| xd = min(vl.type.size, unsigned(xs1), xs2 ? unsigned(xs2) : ignore) |
| ``` |
| |
Find the minimum of the maximum vector length for the type and the two input
values. If xs2 is zero (either x0 or a register containing zero) then it is
ignored (treated as MaxInt); otherwise it acts as an additional clamp below
maxvl.
| |
| Type | Instruction | Description |
| ---- | ----------- | ---------------- |
| 00 | getvl.b | 8bit lane count |
| 01 | getvl.h | 16bit lane count |
| 10 | getvl.w | 32bit lane count |
| |
| -------------------------------------------------------------------------------- |
| |
| ### GETMAXVL |
| |
| Obtain the maximum vector length. |
| |
| **Encodings** |
| |
| getmaxvl.[b,h,w].{m} xd |
| |
| **Operation** |
| |
| ``` |
| xd = vl.type.size |
| ``` |
| |
| Type | Instruction | Description |
| ---- | ----------- | ---------------- |
| 00 | getmaxvl.b | 8bit lane count |
| 01 | getmaxvl.h | 16bit lane count |
| 10 | getmaxvl.w | 32bit lane count |
| |
| For a machine with 256bit SIMD registers: |
| |
| * getmaxvl.w = 8 lanes |
| * getmaxvl.h = 16 lanes |
| * getmaxvl.b = 32 lanes |
* getmaxvl.w.m = 32 lanes // multiply by 4 with stripmine.
| * getmaxvl.h.m = 64 lanes |
| * getmaxvl.b.m = 128 lanes |
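
For illustration, a typical stripmined loop shape in C terms, reusing the
hypothetical getvl() sketch above (lane counts assume the 256-bit machine):

```
#include <stdint.h>

uint32_t getvl(uint32_t maxvl, uint32_t xs1, uint32_t xs2);

/* Process 'count' 8b elements in chunks of at most 128 lanes
   (getmaxvl.b.m); the final iteration naturally shrinks. */
void process(uint32_t count) {
  const uint32_t maxvl = 128;                  /* getmaxvl.b.m */
  for (uint32_t i = 0; i < count; ) {
    uint32_t vl = getvl(maxvl, count - i, 0);  /* getvl.b.x.m */
    /* ... vld / compute / vst over vl lanes ... */
    i += vl;
  }
}
```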
| |
| -------------------------------------------------------------------------------- |
| |
| ### ECALL |
| |
| Execution call to supervisor OS. |
| |
| **Encodings** |
| |
| ecall |
| |
| **Operation** |
| |
| ``` |
| if (mode == User) |
| mcause = enum_ECALL |
| mepc = pc |
| pc = mtvec |
| mode = Machine |
| else |
| mcause = enum_USAGE_FAULT |
| mfault = pc |
| EndExecution |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### EEXIT |
| |
| Execution exit to supervisor OS. |
| |
| **Encodings** |
| |
| eexit |
| |
| **Operation** |
| |
| ``` |
| if (mode == User) |
| mcause = enum_EEXIT |
| mepc = pc |
| pc = mtvec |
| mode = Machine |
| else |
| mcause = enum_USAGE_FAULT |
| mfault = pc |
| EndExecution |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### EYIELD |
| |
| Synchronous execution switch to supervisor OS. |
| |
| **Encodings** |
| |
| eyield |
| |
| **Operation** |
| |
| ``` |
| if (mode == User) |
| if (YIELD_REQUEST == 1) |
| mcause = enum_EYIELD |
| mepc = pc + 4 # advance to next instruction |
| pc = mtvec |
| mode = Machine |
| else |
| NOP # pc = pc + 4 |
| else |
| mcause = enum_USAGE_FAULT |
| mfault = pc |
| EndExecution |
| ``` |
| |
| YIELD_REQUEST refers to a signal the supervisor core sets to request a context |
| switch. |
| |
Note: used when MIE=0; eyield is inserted at synchronization points for
cooperative context switching.
| |
| -------------------------------------------------------------------------------- |
| |
| ### ECTXSW |
| |
| Asynchronous execution switch to supervisor OS. |
| |
| **Encodings** |
| |
| ectxsw |
| |
| **Operation** |
| |
| ``` |
| if (mode == User) |
| mcause = enum_ECTXSW |
| mepc = pc |
| pc = mtvec |
| mode = Machine |
| else |
| mcause = enum_USAGE_FAULT |
| mfault = pc |
| EndExecution |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### EBREAK |
| |
| Execution breakpoint to supervisor OS. |
| |
| **Encodings** |
| |
| ebreak |
| |
| **Operation** |
| |
| ``` |
| if (mode == User) |
| mcause = enum_EBREAK |
| mepc = pc |
| pc = mtvec |
| mode = Machine |
| else |
| mcause = enum_UNDEF_INST |
| mfault = pc |
| EndExecution |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### MRET |
| |
| Return from machine mode to user mode. |
| |
| **Encodings** |
| |
| mret |
| |
| **Operation** |
| |
| ``` |
| if (mode == Machine) |
| pc = mepc |
| mode = User |
| else |
| mcause = enum_UNDEF_INST |
| mepc = pc |
| pc = mtvec |
| mode = Machine |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### MPAUSE |
| |
| Machine pause and release for next execution context. |
| |
| **Encodings** |
| |
| mpause |
| |
| **Operation** |
| |
| ``` |
| if (mode == Machine) |
| EndExecution |
| else |
| mcause = enum_UNDEF_INST |
| mepc = pc |
| pc = mtvec |
| mode = Machine |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VABSD |
| |
| Absolute difference with unsigned result. |
| |
| **Encodings** |
| |
| vabsd.[b,h,w].{u}.vv.{m} vd, vs1, vs2 \ |
| vabsd.[b,h,w].{u}.vx.{m} vd, vs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| for L in Op.typelen |
| vd[L] = vs1[L] > vs2[L] ? vs1[L] - vs2[L] : vs2[L] - vs1[L] |
| ``` |
| |
| Note: for signed(INTx_MAX - INTx_MIN) the result will be UINTx_MAX. |
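
A C sketch of one signed .w lane, showing why the unsigned result cannot
overflow:

```
#include <stdint.h>

/* |a - b| in unsigned arithmetic; for a = INT32_MAX, b = INT32_MIN
   the result is UINT32_MAX, matching the note above. */
uint32_t vabsd_w_lane(int32_t a, int32_t b) {
  return a > b ? (uint32_t)a - (uint32_t)b
               : (uint32_t)b - (uint32_t)a;
}
```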
| |
| -------------------------------------------------------------------------------- |
| |
| ### VACC |
| |
| Accumulates a value into a wider register. |
| |
| **Encodings** |
| |
| vacc.[h,w].{u}.vv.{m} vd, vs1, vs2 \ |
| vacc.[h,w].{u}.vx.{m} vd, vs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| for L in Op.typelen |
  {vd+0}[L] = {vs1+0}[L] + vs2.asHalfType[2*L+0]
  {vd+1}[L] = {vs1+1}[L] + vs2.asHalfType[2*L+1]
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VADD |
| |
| Add operands. |
| |
| **Encodings** |
| |
| vadd.[b,h,w].vv.{m} vd, vs1, vs2 \ |
| vadd.[b,h,w].vx.{m} vd, vs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| for L in Op.typelen |
| vd[L] = vs1[L] + vs2[L] |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VADDS |
| |
| Add operands with saturation. |
| |
| **Encodings** |
| |
| vadds.[b,h,w].{u}.vv.{m} vd, vs1, vs2 \ |
| vadds.[b,h,w].{u}.vx.{m} vd, vs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| for L in Op.typelen |
| vd[L] = Saturate(vs1[L] + vs2[L]) |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VADDW |
| |
| Add operands with widening. |
| |
| **Encodings** |
| |
| vaddw.[h,w].{u}.vv.{m} vd, vs1, vs2 \ |
| vaddw.[h,w].{u}.vx.{m} vd, vs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| for L in Op.typelen |
| {vd+0}[L] = vs1.asHalfType[2*L+0] + vs2.asHalfType[2*L+0] |
| {vd+1}[L] = vs1.asHalfType[2*L+1] + vs2.asHalfType[2*L+1] |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VADD3 |
| |
| Add three operands. |
| |
| **Encodings** |
| |
| vadd3.[w].vv.{m} vd, vs1, vs2 \ |
| vadd3.[w].vx.{m} vd, vs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| for L in i32.typelen |
| vd[L] = vd[L] + vs1[L] + vs2[L] |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VAND |
| |
| AND operands. |
| |
| **Encodings** |
| |
| vand.vv.{m} vd, vs1, vs2 \ |
| vand.[b,h,w].vx.{m} vd, vs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| for L in Op.typelen |
| vd[L] = vs1[L] & vs2[L] |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### ACONV |
| |
| Convolution ALU operation. |
| |
| **Encodings** |
| |
| aconv.vxv vd, vs1, xs2, vs3 |
| |
Encoding 'aconv' uses a '1' in the otherwise-unused pad bit (b25) of the xs2
field.
| |
| **Operation** |
| |
| ``` |
| # 8b: 0123456789abcdef |
| # 32b: 048c 26ae 159d 37bf |
| assert(vd == 48) |
| N = is_simd512 ? 16 : is_simd256 ? 8 : assert(0) |
| |
| func Interleave(Y,L): |
| m = L % 4 |
| if (m == 0) (Y & ~3) + 0 |
| if (m == 1) (Y & ~3) + 2 |
| if (m == 2) (Y & ~3) + 1 |
| if (m == 3) (Y & ~3) + 3 |
| |
| # i32 += i8 x i8 (u*u, u*s, s*u, s*s) |
| for Y in [0..N-1] |
| for X in [Start..Stop] |
| for L in i8.typelen |
| Data1 = {vs1+Y}.i8[4*X + L&3] # 'transpose and broadcast' |
| Data2 = {vs3+X-Start}.u8[L] |
| {Accum+Interleave(Y,L)}[L / 4] += |
| ((signed(SData1,Data1{7:0}) + signed(Bias1{8:0})){9:0} * |
| (signed(SData2,Data2{7:0}) + signed(Bias2{8:0})){9:0}){18:0} |
| ``` |
| |
| Length (stop - start + 1) is in 32bit accumulator lane count, as all inputs will |
| horizontally reduce to this size. |
| |
| The Start and Stop definition allows for a partial window of input values to be |
| transpose broadcast into the convolution unit. |
| |
Mode   | Encoding | Usage
:----: | :------: | :-----------------------------------------------:
Common | xs2      | Mode[1:0] Start[6:2] Stop[11:7]
s8     | 0        | SData2[31] SBias2[30:22] SData1[21] SBias1[20:12]
| |
| ``` |
| # SIMD256 |
| acc.out = {v48..55} |
| narrow0 = {v0..7} |
| narrow1 = {v16..23} |
| narrow2 = {v32..39} |
| narrow3 = {v48..55} |
| wide0 = {v8..15} |
| wide1 = {v24..31} |
| wide2 = {v40..47} |
| wide3 = {v56..63} |
| ``` |
| |
| ### VCGET |
| |
| Copy convolution accumulators into general registers. |
| |
| **Encodings** |
| |
| vcget vd |
| |
| **Operation** |
| |
| ``` |
| assert(vd == 48) |
| N = is_simd512 ? 15 : is_simd256 ? 7 : assert(0) |
| for Y in [0..N] |
| vd{Y} = Accum{Y} |
  Accum{Y} = 0
```
| |
| ### ACSET |
| |
| Copy general registers into convolution accumulators. |
| |
| **Encodings** |
| |
| acset.v vd, vs1 |
| |
| **Operation** |
| |
| ``` |
| assert(vd == 48) |
| N = is_simd512 ? 15 : is_simd256 ? 7 : assert(0) |
| for Y in [0..N] |
  Accum{Y} = vs1{Y}
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### ACTR |
| |
| Transpose a register group into the convolution accumulators. |
| |
| **Encodings** |
| |
| actr.[w].v.{m} vd, vs1 |
| |
| **Operation** |
| |
| ``` |
| assert(vd in {v48}) |
assert(vs1 in {v0, v16, v32, v48})
| for I in i32.typelen |
| for J in i32.typelen |
| ACCUM[J][I] = vs1[I][J] |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VCLB |
| |
| Count the leading bits. |
| |
| **Encodings** |
| |
| vclb.[b,h,w].v.{m} vd, vs1 |
| |
| **Operation** |
| |
| ``` |
MSB = 1 << (vtype.bits - 1)
| for L in Op.typelen |
| vd[L] = vs1[L] & MSB ? CLZ(~vs1[L]) : CLZ(vs1[L]) |
| ``` |
| |
| Note: (clb - 1) is equivalent to `__builtin_clrsb`. |
| |
| **clb examples** |
| |
| ``` |
| clb.w(0xffffffff) = 32 |
| clb.w(0xcfffffff) = 2 |
| clb.w(0x80001000) = 1 |
| clb.w(0x00007fff) = 17 |
| clb.w(0x00000000) = 32 |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VCLZ |
| |
| Count the leading zeros. |
| |
| **Encodings** |
| |
| vclz.[b,h,w].v.{m} vd, vs1 |
| |
| **Operation** |
| |
| ``` |
| for L in Op.typelen |
| vd[L] = CLZ(vs1[L]) |
| ``` |
| |
| Note: clz.[b,h,w](0) returns [8,16,32]. |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VDWCONV |
| |
| Depthwise convolution 3-way multiply accumulate. |
| |
| **Encodings** |
| |
vdwconv.vxv vd, vs1, xs2, vs3 \
adwconv.vxv vd, vs1, xs2, vs3
| |
Encoding 'adwconv' uses a '1' in the otherwise-unused pad bit (b25) of the xs2
field.
| |
| **Operation** |
| |
| The vertical axis is typically tiled which requires preserving registers for |
| this functionality. The sparse formats require shuffles so that additional |
| registers of intermediate state are not required. |
| |
| ``` |
| # quant8 |
| {vs1+0,vs1+1,vs1+2} = Rebase({vs1}, Mode::RegBase) |
| {b0} = {vs3+0}.asByteType |
| {b1} = {vs3+1}.asByteType |
| {b2} = {vs3+2}.asByteType |
| if IsDenseFormat |
| a0 = {vs1+0}.asByteType |
| a1 = {vs1+1}.asByteType |
| a2 = {vs1+2}.asByteType |
| if IsSparseFormat1 # [n-1,n,n+1] |
| a0 = vslide_p({vs1+1}, {vs1+0}, 1).asByteType |
| a1 = {vs1+0}.asByteType |
| a2 = vslide_n({vs1+1}, {vs1+2}, 1).asByteType |
| if IsSparseFormat2 # [n,n+1,n+2] |
| a0 = {vs1+0}.asByteType |
| a1 = vslide_n({vs1+0}, {vs1+1}, 1).asByteType |
| a2 = vslide_n({vs1+0}, {vs1+1}, 2).asByteType |
| |
| # 8b: 0123456789abcdef |
| # 32b: 048c 26ae 159d 37bf |
| func Interleave(L): |
| i = L % 4 |
| if (i == 0) 0 |
| if (i == 1) 2 |
| if (i == 2) 1 |
| if (i == 3) 3 |
| |
| for L in Op.typelen |
| B = 4*L # 8b --> 32b |
| for i in [0..3] |
| # int19_t multiply results |
| # int23_t addition results |
| # int32_t storage |
| {dwacc+i}[L/4] += |
| (SData1(a0[B+i]) + bias1) * (SData2(b0[B+i]) + bias2) + |
| (SData1(a1[B+i]) + bias1) * (SData2(b1[B+i]) + bias2) + |
| (SData1(a2[B+i]) + bias1) * (SData2(b2[B+i]) + bias2) |
| if is_vdwconv // !adwconv |
| for i in [0..3] |
| {vd+i} = {dwacc+i} |
| ``` |
| |
| Mode | Encoding | Usage |
| :----: | :------: | :-----------------------------------------------: |
| Common | xs2 | Mode[1:0] Sparsity[3:2] RegBase[7:4] |
| q8 | 0 | SData2[31] SBias2[30:22] SData1[21] SBias1[20:12] |
| |
The Mode::Sparsity field selects the swizzling pattern.
| |
| Sparsity | Format | Swizzle |
| :------: | :-----: | :---------: |
| b00 | Dense | none |
| b01 | Sparse1 | [n-1,n,n+1] |
| b10 | Sparse2 | [n,n+1,n+2] |
| |
The Mode::RegBase field selects the start point of the 3-register group,
allowing the [prev,curr,next] values to be cycled.
| |
| RegBase | Prev | Curr | Next |
| :-----: | :-----: | :-----: | :-----: |
| b0000 | {vs1+0} | {vs1+1} | {vs1+2} |
| b0001 | {vs1+1} | {vs1+2} | {vs1+3} |
| b0010 | {vs1+2} | {vs1+3} | {vs1+4} |
| b0011 | {vs1+3} | {vs1+4} | {vs1+5} |
| b0100 | {vs1+4} | {vs1+5} | {vs1+6} |
| b0101 | {vs1+5} | {vs1+6} | {vs1+7} |
| b0110 | {vs1+6} | {vs1+7} | {vs1+8} |
| b0111 | {vs1+1} | {vs1+0} | {vs1+2} |
| b1000 | {vs1+1} | {vs1+2} | {vs1+0} |
| b1001 | {vs1+3} | {vs1+4} | {vs1+0} |
| b1010 | {vs1+5} | {vs1+6} | {vs1+0} |
| b1011 | {vs1+7} | {vs1+8} | {vs1+0} |
| b1100 | {vs1+2} | {vs1+0} | {vs1+1} |
| b1101 | {vs1+4} | {vs1+0} | {vs1+1} |
| b1110 | {vs1+6} | {vs1+0} | {vs1+1} |
| b1111 | {vs1+8} | {vs1+0} | {vs1+1} |
| |
RegBase supports up to 3x3, 5x5, 7x7, and 9x9 windows, or the extra horizontal
range can be used for input latency hiding.
| |
The vdwconv instruction includes a non-architectural accumulator to increase
register file bandwidth. The adwinit instruction must be used to prepare the
depthwise accumulator for a sequence of vdwconv instructions, and the sequence
must be dispatched without other instructions interleaved, otherwise the
results will be unpredictable. Should other operations be required, an adwinit
must be inserted to resume the sequence.

For a context switch save, where the accumulator must be saved alongside the
architectural SIMD registers, v0..63 are saved to the thread stack or TCB, then
a vdwconv with vdup-prepared zero inputs can be used to write the accumulator
values to SIMD registers, which are then saved to memory. For a context switch
restore, the values can be loaded from memory and set in the accumulator
registers using the adwinit instruction.
| |
| ### ADWINIT |
| |
| Load the depthwise convolution accumulator state. |
| |
| **Encodings** |
| |
| adwinit.v vd, vs1 |
| |
| **Operation** |
| |
| ``` |
| for L in Op.typelen |
  {dwacc+0}[L] = {vs1+0}[L]
  {dwacc+1}[L] = {vs1+1}[L]
  {dwacc+2}[L] = {vs1+2}[L]
  {dwacc+3}[L] = {vs1+3}[L]
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VDMULH |
| |
| Saturating signed doubling multiply returning high half with optional rounding. |
| |
| **Encodings** |
| |
| vdmulh.[b,h,w].{r,rn}.vv.{m} vd, vs1, vs2 \ |
| vdmulh.[b,h,w].{r,rn}.vx.{m} vd, vs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| SZ = vtype.size * 8 |
| for L in Op.typelen |
| LHS = SignExtend(vs1[L], 2*SZ) |
| RHS = SignExtend(vs2[L], 2*SZ) |
| MUL = LHS * RHS |
| RND = R ? (N && MUL < 0 ? -(1<<(SZ-1)) : (1<<(SZ-1))) : 0 |
| vd[L] = SignedSaturation(2 * MUL + RND)[2*SZ-1:SZ] |
| ``` |
| |
| Note: saturation is only needed for MaxNeg inputs (eg. 0x80000000). |
| |
| Note: vdmulh.w.r.vx.m is used in ML activations so may be optimized by |
| implementations. |
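
A C sketch of one .h lane with rounding enabled (R set, N clear); this is the
classic fixed-point doubling multiply:

```
#include <stdint.h>

int16_t vdmulh_h_r_lane(int16_t a, int16_t b) {
  int64_t acc = 2 * ((int64_t)a * b) + (1 << 15);  /* double and round */
  if (acc > INT32_MAX) acc = INT32_MAX;  /* only INT16_MIN * INT16_MIN */
  return (int16_t)(acc >> 16);           /* return the high half */
}
```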
| |
| -------------------------------------------------------------------------------- |
| |
| ### VDUP |
| |
| Duplicate a scalar value into a vector register. |
| |
| **Encodings** |
| |
| vdup.[b,h,w].x.{m} vd, xs2 |
| |
| **Operation** |
| |
| ``` |
| for L in Op.typelen |
  vd[L] = xs2
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VEQ |
| |
| Integer equal comparison. |
| |
| **Encodings** |
| |
| veq.[b,h,w].vv.{m} vd, vs1, vs2 \ |
| veq.[b,h,w].vx.{m} vd, vs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| for L in Op.typelen |
| vd[L] = vs1[L] == vs2[L] ? 1 : 0 |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VEVN, VODD, VEVNODD |
| |
| Even/odd of concatenated registers. |
| |
| **Encodings** |
| |
| vevn.[b,h,w].vv.{m} vd, vs1, vs2 \ |
| vevn.[b,h,w].vx.{m} vd, vs1, xs2 \ |
| vodd.[b,h,w].vv.{m} vd, vs1, vs2 \ |
| vodd.[b,h,w].vx.{m} vd, vs1, xs2 \ |
| vevnodd.[b,h,w].vv.{m} vd, vs1, vs2 \ |
| vevnodd.[b,h,w].vx.{m} vd, vs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| M = Op.typelen / 2 |
| |
| if vevn || vevnodd |
| {dst0} = {vd+0} |
| {dst1} = {vd+1} |
| if vodd |
| {dst1} = {vd+0} |
| |
| if vevn || vevnodd |
| for L in Op.typelen |
| dst0[L] = L < M ? vs1[2 * L + 0] : vs2[2 * (L - M) + 0] # even |
| |
if vodd || vevnodd
| for L in Op.typelen |
| dst1[L] = L < M ? vs1[2 * L + 1] : vs2[2 * (L - M) + 1] # odd |
| |
| where: |
| vs1 = 0x33221100 |
| vs2 = 0x77665544 |
| {vd+0} = 0x66442200 |
| {vd+1} = 0x77553311 |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
### VGE
| |
| Integer greater-than-or-equal comparison. |
| |
| **Encodings** |
| |
| vge.[b,h,w].{u}.vv.{m} vd, vs1, vs2 \ |
| vge.[b,h,w].{u}.vx.{m} vd, vs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| for L in Op.typelen |
| vd[L] = vs1[L] >= vs2[L] ? 1 : 0 |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
### VGT
| |
| Integer greater-than comparison. |
| |
| **Encodings** |
| |
| vgt.[b,h,w].{u}.vv.{m} vd, vs1, vs2 \ |
| vgt.[b,h,w].{u}.vx.{m} vd, vs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| for L in Op.typelen |
| vd[L] = vs1[L] > vs2[L] ? 1 : 0 |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VHADD |
| |
| Halving addition with optional rounding bit. |
| |
| **Encodings** |
| |
| vhadd.[b,h,w].{r,u,ur}.vv.{m} vd, vs1, vs2 \ |
| vhadd.[b,h,w].{r,u,ur}.vx.{m} vd, vs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| for L in Op.typelen |
| if IsSigned() |
| vd[L] = (signed(vs1[L]) + signed(vs2[L]) + R) >> 1 |
| else |
| vd[L] = (unsigned(vs1[L]) + unsigned(vs2[L]) + R) >> 1 |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VHSUB |
| |
| Halving subtraction with optional rounding bit. |
| |
| **Encodings** |
| |
| vhsub.[b,h,w].{r,u,ur}.vv.{m} vd, vs1, vs2 \ |
| vhsub.[b,h,w].{r,u,ur}.vx.{m} vd, vs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| for L in Op.typelen |
| if IsSigned() |
| vd[L] = (signed(vs1[L]) - signed(vs2[L]) + R) >> 1 |
| else |
| vd[L] = (unsigned(vs1[L]) - unsigned(vs2[L]) + R) >> 1 |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VLD |
| |
| Vector load from memory with optional post-increment by scalar. |
| |
| **Encodings** |
| |
| vld.[b,h,w].{p}.x.{m} vd, xs1 \ |
| vld.[b,h,w].[l,p,s,lp,sp,tp].xx.{m} vd, xs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| addr = xs1 |
| sm = Op.m ? 4 : 1 |
| len = min(Op.typelen * sm, unsigned(xs2)) |
for M in sm
  for L in Op.typelen
    if !Op.bit.l || (L + M * Op.typelen) < len
      {vd+M}[L] = mem[addr + L].type
    else
      {vd+M}[L] = 0
| if (Op.bit.s) |
| addr += xs2 * sizeof(type) |
| else |
| addr += Reg.bytes |
| if Op.bit.p |
| if Op.bit.l && Op.bit.s # .tp |
| xs1 += Reg.bytes |
| elif !Op.bit.l && !Op.bit.s && !{xs2} # .p.x |
| xs1 += Reg.bytes * sm |
| elif Op.bit.l # .lp |
| xs1 += len * sizeof(type) |
| elif Op.bit.s # .sp |
| xs1 += xs2 * sizeof(type) * sm |
| else # .p.xx |
| xs1 += xs2 * sizeof(type) |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VLE |
| |
| Integer less-than-or-equal comparison. |
| |
| **Encodings** |
| |
| vle.[b,h,w].{u}.vv.{m} vd, vs1, vs2 \ |
| vle.[b,h,w].{u}.vx.{m} vd, vs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| for L in Op.typelen |
| vd[L] = vs1[L] <= vs2[L] ? 1 : 0 |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VLT |
| |
| Integer less-than comparison. |
| |
| **Encodings** |
| |
| vlt.[b,h,w].{u}.vv.{m} vd, vs1, vs2 \ |
| vlt.[b,h,w].{u}.vx.{m} vd, vs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| for L in Op.typelen |
| vd[L] = vs1[L] < vs2[L] ? 1 : 0 |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VMACC |
| |
| Multiply accumulate. |
| |
| **Encodings** |
| |
| vmacc.[b,h,w].vv.{m} vd, vs1, vs2 \ |
| vmacc.[b,h,w].vx.{m} vd, vs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| for L in Op.typelen |
  vd[L] += vs1[L] * vs2[L]
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VMADD |
| |
| Multiply add. |
| |
| **Encodings** |
| |
| vmadd.[b,h,w].vv.{m} vd, vs1, vs2 \ |
| vmadd.[b,h,w].vx.{m} vd, vs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| for L in Op.typelen |
  vd[L] = vd[L] * vs2[L] + vs1[L]
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VMAX |
| |
| Find the unsigned or signed maximum of two registers. |
| |
| **Encodings** |
| |
| vmax.[b,h,w].{u}.vv.{m} vd, vs1, vs2 \ |
| vmax.[b,h,w].{u}.vx.{m} vd, vs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| for L in Op.typelen |
| vd[L] = vs1[L] > vs2[L] ? vs1[L] : vs2[L] |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VMIN |
| |
Find the unsigned or signed minimum of two registers.
| |
| **Encodings** |
| |
| vmin.[b,h,w].{u}.vv.{m} vd, vs1, vs2 \ |
| vmin.[b,h,w].{u}.vx.{m} vd, vs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| for L in Op.typelen |
| vd[L] = vs1[L] < vs2[L] ? vs1[L] : vs2[L] |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VMUL |
| |
| Multiply two registers. |
| |
| **Encodings** |
| |
| vmul.[b,h,w].vv.{m} vd, vs1, vs2 \ |
| vmul.[b,h,w].vx.{m} vd, vs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| for L in Op.typelen |
| vd[L] = vs1[L] * vs2[L] |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VMULS |
| |
Multiply two registers with saturation.
| |
| **Encodings** |
| |
| vmuls.[b,h,w].{u}.vv.{m} vd, vs1, vs2 \ |
| vmuls.[b,h,w].{u}.vx.{m} vd, vs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| for L in Op.typelen |
  vd[L] = Saturate(vs1[L] * vs2[L])
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VMULW |
| |
Multiply two registers with widening.
| |
| **Encodings** |
| |
| vmulw.[h,w].{u}.vv.{m} vd, vs1, vs2 \ |
| vmulw.[h,w].{u}.vx.{m} vd, vs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| for L in Op.typelen |
| {vd+0}[L] = vs1.asHalfType[2*L+0] * vs2.asHalfType[2*L+0] |
| {vd+1}[L] = vs1.asHalfType[2*L+1] * vs2.asHalfType[2*L+1] |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VMULH |
| |
Multiply two registers with widening, returning the high half.
| |
| **Encodings** |
| |
| vmulh.[b,h,w].{u}.{r}.vv.{m} vd, vs1, vs2 \ |
| vmulh.[b,h,w].{u}.{r}.vx.{m} vd, vs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| SZ = vtype.size * 8 |
| RND = IsRounded ? 1<<(SZ-1) : 0 |
| for L in Op.typelen |
| if IsU() |
    vd[L] = (unsigned(vs1[L]) * unsigned(vs2[L]) + RND)[2*SZ-1:SZ]
  else if IsSU()
    vd[L] = ( signed(vs1[L]) * unsigned(vs2[L]) + RND)[2*SZ-1:SZ]
  else
    vd[L] = ( signed(vs1[L]) * signed(vs2[L]) + RND)[2*SZ-1:SZ]
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VMV |
| |
| Move a register. |
| |
| **Encodings** |
| |
| vmv.v.{m} vd, vs1 |
| |
| **Operation** |
| |
| ``` |
| for L in Op.typelen |
| vd[L] = vs1[L] |
| ``` |
| |
Note: in the stripmined case an implementation may deliver more than one write
per cycle.
| |
| -------------------------------------------------------------------------------- |
| |
| ### VMVP |
| |
| Move a pair of registers. |
| |
| **Encodings** |
| |
| vmvp.vv.{m} vd, vs1, vs2 \ |
| vmvp.[b,h,w].vx.{m} vd, vs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| for L in Op.typelen |
| {vd+0}[L] = vs1[L] |
| {vd+1}[L] = vs2[L] |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VNE |
| |
| Integer not-equal comparison. |
| |
| **Encodings** |
| |
| vne.[b,h,w].vv.{m} vd, vs1, vs2 \ |
| vne.[b,h,w].vx.{m} vd, vs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| for L in Op.typelen |
| vd[L] = vs1[L] != vs2[L] ? 1 : 0 |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VNOT |
| |
| Bitwise NOT a register. |
| |
| **Encodings** |
| |
| vnot.v.{m} vd, vs1 |
| |
| **Operation** |
| |
| ``` |
| for L in Op.typelen |
| vd[L] = ~vs1[L] |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VOR |
| |
| OR two operands. |
| |
| **Encodings** |
| |
| vor.vv.{m} vd, vs1, vs2 \ |
| vor.[b,h,w].vx.{m} vd, vs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| for L in Op.typelen |
| vd[L] = vs1[L] | vs2[L] |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VPADD |
| |
| Adds the lane pairs. |
| |
| **Encodings** |
| |
| vpadd.[h,w].{u}.v.{m} vd, vs1 |
| |
| **Operation** |
| |
| ``` |
| if .v |
| for L in Op.typelen |
| vd[L] = (vs1.asHalfType[2 * L] + vs1.asHalfType[2 * L + 1]) |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VPSUB |
| |
| Subtracts the lane pairs. |
| |
| **Encodings** |
| |
| vpsub.[h,w].{u}.v.{m} vd, vs1 |
| |
| **Operation** |
| |
| ``` |
| if .v |
| for L in Op.typelen |
| vd[L] = (vs1.asHalfType[2 * L] - vs1.asHalfType[2 * L + 1]) |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VCPOP |
| |
| Count the set bits. |
| |
| **Encodings** |
| |
| vcpop.[b,h,w].v.{m} vd, vs1 |
| |
| **Operation** |
| |
| ``` |
| for L in Op.typelen |
| vd[L] = CountPopulation(vs1[L]) |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VREV |
| |
| Generalized reverse using bit ladder. |
| |
The size of the flip is based on `log2` of the data type width.
| |
| **Encodings** |
| |
| vrev.[b,h,w].vv.{m} vd, vs1, vs2 \ |
| vrev.[b,h,w].vx.{m} vd, vs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| N = vtype.bits - 1 # 7, 15, 31 |
| shamt = xs2[4:0] & N |
| for L in Op.typelen |
| r = vs1[L] |
| if (shamt & 1) r = ((r & 0x55..) << 1) | ((r & 0xAA..) >> 1) |
| if (shamt & 2) r = ((r & 0x33..) << 2) | ((r & 0xCC..) >> 2) |
| if (shamt & 4) r = ((r & 0x0F..) << 4) | ((r & 0xF0..) >> 4) |
| if (sz == 0) vd[L] = r; continue; |
| if (shamt & 8) r = ((r & 0x00..) << 8) | ((r & 0xFF..) >> 8) |
| if (sz == 1) vd[L] = r; continue; |
| if (shamt & 16) r = ((r & 0x00..) << 16) | ((r & 0xFF..) >> 16) |
| vd[L] = r |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VROR |
| |
| Logical rotate right. |
| |
| **Encodings** |
| |
| vror.[b,h,w].vv.{m} vd, vs1, vs2 \ |
| vror.[b,h,w].vx.{m} vd, vs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| N = vtype.bits - 1 # 7, 15, 31 |
| shamt = xs2[4:0] & N |
| for L in Op.typelen |
| r = vs1[L] |
  if (shamt & 1)  for (B in vtype.bits) r[B] = r[(B+1) % vtype.bits]
  if (shamt & 2)  for (B in vtype.bits) r[B] = r[(B+2) % vtype.bits]
  if (shamt & 4)  for (B in vtype.bits) r[B] = r[(B+4) % vtype.bits]
  if (shamt & 8)  for (B in vtype.bits) r[B] = r[(B+8) % vtype.bits]
  if (shamt & 16) for (B in vtype.bits) r[B] = r[(B+16) % vtype.bits]
| vd[L] = r |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VSHA, VSHL |
| |
| Arithmetic and logical left/right shift with saturating shift amount and result. |
| |
| **Encodings** |
| |
| vsha.[b,h,w].{r}.vv.{m} vd, vs1, vs2 |
| |
| vshl.[b,h,w].{r}.vv.{m} vd, vs1, vs2 |
| |
| **Operation** |
| |
| ``` |
| M = Op.size # 8, 16, 32 |
| N = [8->3, 16->4, 32->5][Op.size] |
| SHSAT[L] = vs2[L][M-1:N] != 0 |
| SHAMT[L] = vs2[L][N-1:0] |
| RND = R && SHAMT ? 1 << (SHAMT-1) : 0 |
| RND -= N && (vs1[L] < 0) ? 1 : 0 |
| SZ = sizeof(src.type) * 8 * (W ? 2 : 1) |
| RESULT_NEG = (vs1[L] <<[<] SHAMT[L])[SZ-1:0] // !A "<<<" logical shift |
RESULT_NEG = S ? Saturate(RESULT_NEG, SHSAT[L]) : RESULT_NEG
RESULT_POS = ((vs1[L] + RND) >>[>] SHAMT[L]) // !A ">>>" logical shift
RESULT_POS = S ? Saturate(RESULT_POS, SHSAT[L]) : RESULT_POS
vd[L] = SHAMT[L] >= 0 ? RESULT_POS : RESULT_NEG
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VSEL |
| |
| Select lanes from two operands with vector selection boolean. |
| |
| **Encodings** |
| |
| vsel.[b,h,w].vv.{m} vd, vs1, vs2 \ |
| vsel.[b,h,w].vx.{m} vd, vs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| for L in Op.typelen |
| vd[L] = vs1[L].bit(0) ? vd[L] : vs2[L] |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VSLL |
| |
| Logical left shift. |
| |
| **Encodings** |
| |
| vsll.[b,h,w].vv.{m} vd, vs1, vs2 \ |
| vsll.[b,h,w].vx.{m} vd, vs1, xs2 |
| |
| **Operation** |
| |
| ``` |
N = [8->3, 16->4, 32->5][Op.size]
for L in Op.typelen
  vd[L] = vs1[L] <<< vs2[L][N-1:0]
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VSLIDEN |
| |
| Slide next register by index. |
| |
For the horizontal mode, the stripmine `vm` register group based on `vs1` is
treated as a contiguous block, and only the first `index` elements from `vs2`
are used.
For the vertical mode, each stripmine vector register `op_index` is mapped
separately. It mimics the image tiling shift of
| |
| ``` |
| |--------|--------| |
| | 4xVLEN | 4xVLEN | |
| | (vs1) | (vs2) | |
| |--------|--------| |
| ``` |
| |
The vertical mode also supports the non-stripmine version, to handle the last
columns of the image.
| |
| **Encodings** |
| |
| Horizontal slide: |
| |
| vslidehn.[b,h,w].[1,2,3,4].vv.m vd, vs1, vs2 \ |
| vslidehn.[b,h,w].[1,2,3,4].vx.m vd, vs1, xs2 |
| |
| Vertical slide: |
| |
| vsliden.[b,h,w].[1,2,3,4].vv vd, vs1, vs2 \ |
| vslidevn.[b,h,w].[1,2,3,4].vv.m vd, vs1, vs2 \ |
| vslidevn.[b,h,w].[1,2,3,4].vx.m vd, vs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| assert vd != vs1 && vd != vs2 |
| if Op.h // A contiguous horizontal slide based on vs1 |
| va = {{vs1},{vs1+1},{vs1+2},{vs1+3}} |
| vb = {{vs1+1},{vs1+2},{vs1+3},{vs2}} |
| if Op.v // vs1/vs2 vertical slide |
| va = {{vs1},{vs1+1},{vs1+2},{vs1+3}} |
| vb = {{vs2},{vs2+1},{vs2+2},{vs2+3}} |
| |
| sm = Op.m ? 4 : 1 |
| |
| for M in sm |
| for L in Op.typelen |
| if (L + index < Op.typelen) |
| vd[L] = va[M][L + index] |
| else |
| vd[L] = is_vx ? xs2 : vb[M][L + index - Op.typelen] |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VSLIDEP |
| |
| Slide previous register by index. |
| |
For the horizontal mode, the stripmine `vm` register group based on **`vs2`**
is treated as a contiguous block, and only the **last** `index` elements from
the stripmine `vm` register group based on `vs1` are used, at the beginning.
For the vertical mode, each stripmine vector register `op_index` is mapped
separately. It mimics the image tiling shift of
| |
| ``` |
| |--------|--------| |
| | 4xVLEN | 4xVLEN | |
| | (vs1) | (vs2) | |
| |--------|--------| |
| ``` |
| |
The vertical mode also supports the non-stripmine version, to handle the last
columns of the image.
| |
| **Encodings** |
| |
| Horizontal slide: |
| |
| vslidehp.[b,h,w].[1,2,3,4].vv.m vd, vs1, vs2 \ |
| vslidehp.[b,h,w].[1,2,3,4].vx.m vd, vs1, xs2 |
| |
| Vertical slide: |
| |
| vslidep.[b,h,w].[1,2,3,4].vv vd, vs1, vs2 \ |
| vslidevp.[b,h,w].[1,2,3,4].vv.m vd, vs1, vs2 \ |
vslidevp.[b,h,w].[1,2,3,4].vx.m vd, vs1, xs2
| |
| **Operation** |
| |
| ``` |
| assert vd != vs1 && vd != vs2 |
| |
if Op.h // A contiguous horizontal slide based on vs2
| va = {{vs1+3},{vs2},{vs2+1},{vs2+2}} |
| vb = {{vs2},{vs2+1},{vs2+2},{vs2+3}} |
| if Op.v // vs1/vs2 vertical slide |
| va = {{vs1},{vs1+1},{vs1+2},{vs1+3}} |
| vb = {{vs2},{vs2+1},{vs2+2},{vs2+3}} |
| |
| sm = Op.m ? 4 : 1 |
| |
| for M in sm |
| for L in Op.typelen |
| if (L < index) |
| vd[L] = va[M][Op.typelen + L - index] |
| else |
| vd[L] = is_vx ? xs2 : vb[M][L - index] |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VSRA, VSRL |
| |
| Arithmetic and logical right shift. |
| |
| **Encodings** |
| |
| vsra.[b,h,w].vv.{m} vd, vs1, vs2 \ |
| vsra.[b,h,w].vx.{m} vd, vs1, xs2 |
| |
| vsrl.[b,h,w].vv.{m} vd, vs1, vs2 \ |
| vsrl.[b,h,w].vx.{m} vd, vs1, xs2 |
| |
| **Operation** |
| |
| ``` |
N = [8->3, 16->4, 32->5][Op.size]
for L in Op.typelen
  vd[L] = vs1[L] >>[>] vs2[L][N-1:0]
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VSRANS, VSRANSU |
| |
| Arithmetic right shift with rounding and signed/unsigned saturation. |
| |
| **Encodings** |
| |
| vsrans{u}.[b,h].{r}.vv.{m} vd, vs1, vs2 \ |
| vsrans{u}.[b,h].{r}.vx.{m} vd, vs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| for L in Op.typelen |
| N = [8->3, 16->4, 32->5][Op.size] |
| SHAMT[L] = vs2[L][2*N-1:0] # source size index |
| RND = R && SHAMT ? 1 << (SHAMT-1) : 0 |
| RND -= N && (vs1[L] < 0) ? 1 : 0 |
  vd[L+0] = Saturate(({vs1+0}[L/2] + RND) >>[>] SHAMT, u)
  vd[L+1] = Saturate(({vs1+1}[L/2] + RND) >>[>] SHAMT, u)
| ``` |
| |
| Note: vsrans.[b,h].vx.m are used in ML activations so may be optimized by |
| implementations. |
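
A C sketch of one 16b-to-8b lane of the signed variant (assumed order: round,
shift, then saturate to the destination width):

```
#include <stdint.h>

int8_t vsrans_b_lane(int16_t v, uint32_t shamt, int round) {
  int32_t rnd = (round && shamt) ? 1 << (shamt - 1) : 0;
  int32_t r = ((int32_t)v + rnd) >> shamt;
  if (r > INT8_MAX) return INT8_MAX;
  if (r < INT8_MIN) return INT8_MIN;
  return (int8_t)r;
}
```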
| |
| -------------------------------------------------------------------------------- |
| |
| ### VSRAQS |
| |
| Arithmetic quarter narrowing right shift with rounding and signed/unsigned |
| saturation. |
| |
| **Encodings** |
| |
| vsraqs{u}.b.{r}.vv.{m} vd, vs1, vs2 \ |
| vsraqs{u}.b.{r}.vx.{m} vd, vs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| for L in i32.typelen |
| SHAMT[L] = vs2[L][4:0] |
| RND = R && SHAMT ? 1 << (SHAMT-1) : 0 |
| RND -= N && (vs1[L] < 0) ? 1 : 0 |
  vd[L+0] = Saturate(({vs1+0}[L/4] + RND) >>[>] SHAMT, u)
  vd[L+1] = Saturate(({vs1+2}[L/4] + RND) >>[>] SHAMT, u)
  vd[L+2] = Saturate(({vs1+1}[L/4] + RND) >>[>] SHAMT, u)
  vd[L+3] = Saturate(({vs1+3}[L/4] + RND) >>[>] SHAMT, u)
| ``` |
| |
| Note: The register interleaving is [0,2,1,3] and not [0,1,2,3] as this matches |
| vconv/vdwconv requirements, and one vsraqs is the same as two chained vsrans. |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VRSUB |
| |
| Reverse subtract two operands. |
| |
| **Encodings** |
| |
| vrsub.[b,h,w].vx.{m} vd, vs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| for L in Op.typelen |
| vd[L] = xs2[L] - vs1[L] |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VSUB |
| |
| Subtract two operands. |
| |
| **Encodings** |
| |
| vsub.[b,h,w].vv.{m} vd, vs1, vs2 \ |
| vsub.[b,h,w].vx.{m} vd, vs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| for L in Op.typelen |
| vd[L] = vs1[L] - vs2[L] |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VSUBS |
| |
| Subtract two operands with saturation. |
| |
| **Encodings** |
| |
| vsubs.[b,h,w].{u}.vv.{m} vd, vs1, vs2 \ |
| vsubs.[b,h,w].{u}.vx.{m} vd, vs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| for L in Op.typelen |
| vd[L] = Saturate(vs1[L] - vs2[L]) |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VSUBW |
| |
| Subtract two operands with widening. |
| |
| **Encodings** |
| |
| vsubw.[h,w].{u}.vv.{m} vd, vs1, vs2 \ |
| vsubw.[h,w].{u}.vx.{m} vd, vs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| for L in Op.typelen |
| {vd+0}[L] = vs1.asHalfType[2*L+0] - vs2.asHalfType[2*L+0] |
| {vd+1}[L] = vs1.asHalfType[2*L+1] - vs2.asHalfType[2*L+1] |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VST |
| |
| Vector store to memory with optional post-increment by scalar. |
| |
| **Encodings** |
| |
| vst.[b,h,w].{p}.x.{m} vd, xs1 \ |
| vst.[b,h,w].[l,p,s,lp,sp,tp].xx.{m} vd, xs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| addr = xs1 |
| sm = Op.m ? 4 : 1 |
| len = min(Op.typelen * sm, unsigned(xs2)) |
for M in sm
  for L in Op.typelen
    if !Op.bit.l || (L + M * Op.typelen) < len
      mem[addr + L].type = {vd+M}[L]
| if (Op.bit.s) |
| addr += xs2 * sizeof(type) |
| else |
| addr += Reg.bytes |
| if Op.bit.p |
| if Op.bit.l && Op.bit.s # .tp |
| xs1 += Reg.bytes |
| elif !Op.bit.l && !Op.bit.s && !{xs2} # .p.x |
| xs1 += Reg.bytes * sm |
| elif Op.bit.l # .lp |
| xs1 += len * sizeof(type) |
| elif Op.bit.s # .sp |
| xs1 += xs2 * sizeof(type) * sm |
| else # .p.xx |
| xs1 += xs2 * sizeof(type) |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VSTQ |
| |
| Vector store quads to memory with optional post-increment by scalar. |
| |
| **Encodings** |
| |
| vstq.[b,h,w].[s,sp].xx.{m} vd, xs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| addr = xs1 |
| sm = Op.m ? 4 : 1 |
for M in sm
| for Q in 0 to 3 |
| for L in Op.typelen / 4 |
| mem[addr + L].type = vd[L + Q * Op.typelen / 4] |
| addr += xs2 * sizeof(type) |
| if Op.bit.p |
| xs1 += xs2 * sizeof(type) * sm |
| ``` |
| |
| Note: This is principally for storing the results of vconv after 32b to 8b |
| reduction. |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VXOR |
| |
| XOR two operands. |
| |
| **Encodings** |
| |
| vxor.vv.{m} vd, vs1, vs2 \ |
| vxor.[b,h,w].vx.{m} vd, vs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| for L in Op.typelen |
| vd[L] = vs1[L] ^ vs2[L] |
| ``` |
| |
| -------------------------------------------------------------------------------- |
| |
| ### VZIP |
| |
| Interleave even/odd lanes of two operands. |
| |
| **Encodings** |
| |
| vzip.[b,h,w].vv.{m} vd, vs1, vs2 \ |
| vzip.[b,h,w].vx.{m} vd, vs1, xs2 |
| |
| **Operation** |
| |
| ``` |
| index = Is(a=>0, b=>1) |
| for L in Op.typelen |
| M = L / 2 |
| N = L / 2 + Op.typelen / 2 |
| {vd+0}[L] = L & 1 ? vs2[M] : vs1[M] |
| {vd+1}[L] = L & 1 ? vs2[N] : vs1[N] |
| |
| where: |
| vs1 = 0x66442200 |
| vs2 = 0x77553311 |
| {vd+0} = 0x33221100 |
| {vd+1} = 0x77665544 |
| ``` |
| |
| Note: vd must not be in the range of vs1 or vs2. |
| |
| -------------------------------------------------------------------------------- |
| |
| ### FLOG, SLOG, CLOG, KLOG |
| |
| Log a register in a printf contract. |
| |
| **Encodings** |
| |
flog rs1   // mode=0, "printf" formatted command, rs1=(context) \
slog rs1   // mode=1, scalar log \
clog rs1   // mode=2, character log \
klog rs1   // mode=3, const string log
| |
| **Operation** |
| |
| A number of arguments are sent with SLOG or CLOG, and then a FLOG operation |
| closes the packet and may emit a timestamp and context data like ASID. A |
| receiving tool can construct messages, e.g. XML records per printf stream, by |
| collecting the arguments as they arrive in a variable length buffer, and closing |
| the record when the FLOG instruction arrives. |
| |
A transport layer may choose to encode, in the flog format footer, the
preceding count of arguments or bytes sent, so that payload errors or hot
connections can be detected.
| |
| The SLOG instruction will send a payload packet represented by the starting |
| memory location. |
| |
The CLOG instruction will send a character stream as a message of multiple
32-bit packets. The packet message closes when a zero character is detected. A
single character may be sent in a 32-bit packet.
| |
| **Pseudo code** |
| |
| ``` |
| const uint8_t p[] = "text message"; |
printf("Test %s\n", p);
| KLOG p |
| FLOG &fmt |
| ``` |
| |
| ``` |
printf("Test");
| FLOG &fmt |
| ``` |
| |
| ``` |
printf("Test %d\n", result_int);
| SLOG result_int |
| FLOG &fmt |
| ``` |
| |
| ``` |
printf("Test %d %s %s %s\n", 123, "abc", "1234", "789AB");
SLOG 123
CLOG 'abc\0'
CLOG '1234' CLOG '\0'
CLOG '789A' CLOG 'B\0'
| FLOG &fmt |
| ``` |