# Kelvin Instruction Reference
An ML+SIMD+Scalar instruction set for ML accelerator cores.
[TOC]
## SIMD register configuration
Kelvin has 64 vector registers, `v0` to `v63`, each 256 bits wide. A register
can hold 8b, 16b, or 32b lanes, as encoded in the instruction (see the next
section for details).
Kelvin also supports stripmining, which treats the register file as 16 vector
registers, each 4x the size of a normal register (also detailed in the next
section).
## SIMD Instructions
The SIMD instructions utilize a register file with 64 entries which serves both
standard arithmetic and logical operations and the domain compute. SIMD lane
size, scalar broadcast, arithmetic operation sign, and stripmine behaviors are
encoded explicitly in the opcodes.
The SIMD instructions replace the encoding space of the compressed instruction
set extension (those with 2-bit prefixes 00, 01, and 10). See [The RISC-V
Instruction Set Manual v2.2 "Available 30-bit instruction encoding
spaces"](https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf) for
quadrupling the available encoding space within the 32-bit format.
### Instruction Encodings
31..26 | 25..20 | 19..14 | 13..12 | 11..6 | 5 | 4..2 | 1..0 | form
:----: | :----: | :----: | :----: | :---: | :-: | :---: | :--: | :--:
func2 | vs2 | vs1 | sz | vd | m | func1 | 00 | .vv
func2 | [0]xs2 | vs1 | sz | vd | m | func1 | 10 | .vx
func2 | 000000 | vs1 | sz | vd | m | func1 | 10 | .v
func2 | [0]xs2 | xs1[0] | sz | vd | m | 111 | 11 | .xx
func2 | 000000 | xs1[0] | sz | vd | m | 111 | 11 | .x
<br>
31..26 | 25..20 | 19..14 | 13..12 | 11..6 | 5 | 4..3 | 2..0 | form
:----: | :----: | :----: | :--------: | :---: | :-: | :--------: | :--: | :--:
vs3 | vs2 | vs1 | func3[3:2] | vd | m | func3[1:0] | 001 | .vvv
vs3 | [0]xs2 | vs1 | func3[3:2] | vd | m | func3[1:0] | 101 | .vxv
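The field layout of the first table can be sketched as a small packing routine. This is only an illustration: the helper name `encode_vv` is hypothetical, `func2` and `func1` are passed as raw integers, and the example reads the `vadd` table entry as `func2 = 0` with `func1 = 0b000`.

```python
def encode_vv(func2, vs2, vs1, sz, vd, m, func1):
    """Pack the .vv fields into a 32-bit word (hypothetical helper)."""
    # (value, width in bits, position of the least significant bit)
    fields = [
        (func2, 6, 26),  # bits 31..26
        (vs2, 6, 20),    # bits 25..20
        (vs1, 6, 14),    # bits 19..14
        (sz, 2, 12),     # bits 13..12
        (vd, 6, 6),      # bits 11..6
        (m, 1, 5),       # bit 5, stripmine
        (func1, 3, 2),   # bits 4..2
        (0b00, 2, 0),    # bits 1..0, .vv form marker
    ]
    word = 0
    for value, width, lsb in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
        word |= value << lsb
    return word


# vadd.b.vv v1, v2, v3 (vd=v1, vs1=v2, vs2=v3), assuming func2 = 0 for vadd
word = encode_vv(func2=0, vs2=3, vs1=2, sz=0b00, vd=1, m=0, func1=0b000)
print(hex(word))
```

The other 2-arg forms differ only in the low two bits and in how the scalar register indices are padded into the 6-bit fields.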
### Types ".b" ".h" ".w"
The SIMD lane size is encoded in the opcode definition indicating the
destination type. For many opcodes source and destination sizes are the same,
differing for widening and narrowing operations.
op[13:12] | sz | type
:-------: | :--: | :--:
00 | ".b" | 8b
01 | ".h" | 16b
10 | ".w" | 32b
### Scalar ".vx"
Instructions may use a scalar register to perform a value broadcast (8b, 16b,
32b) to all SIMD lanes of one operand.
op[2:0] | form
:-----: | :------------:
x00 | ".vv"
x10 | ".vx"
x10 | ".v" (xs2==x0)
x11 | ".xx"
x11 | ".x" (xs2==x0)
001 | ".vvv"
101 | ".vxv"
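The form table can be read as a small classifier over the low bits of an instruction word. The sketch below is a simplification under stated assumptions: the function name `decode_form` is hypothetical, and it tells `.v` from `.vx` (and `.x` from `.xx`) purely by checking whether the `xs2` field is zero, whereas a real decoder would also consult the opcode.

```python
def decode_form(word):
    """Classify a 32-bit word into a SIMD form, or None if it is not one."""
    low2 = word & 0b11
    xs2_is_zero = ((word >> 20) & 0x3F) == 0  # bits 25..20
    if low2 == 0b00:
        return ".vv"
    if low2 == 0b01:
        # op[2:0] is 001 for .vvv and 101 for .vxv
        return ".vxv" if (word >> 2) & 1 else ".vvv"
    if low2 == 0b10:
        return ".v" if xs2_is_zero else ".vx"
    # low2 == 0b11: only func1 == 0b111 is used by the .xx/.x forms
    if (word >> 2) & 0b111 != 0b111:
        return None
    return ".x" if xs2_is_zero else ".xx"
```

Words with low bits `11` and `func1 != 111` are returned as `None` because that space belongs to the regular 32-bit RISC-V instructions.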
### Signed/Unsigned ".u"
Instructions which may be marked with ".u" have signed and unsigned variants.
See the comparison, arithmetic, and saturating instructions for usage; the
effects of signedness follow typical behavior unless otherwise noted.
### Stripmine ".m"
The stripmine functionality is an instruction compression mechanism. Frontend
dispatch captures a single instruction, while the backend issue expands to four
operations. Conceptually the register file is reduced from 64 locations to 16,
where a stripmine register must use a mod4 base aligned register (e.g. v0, v4,
v8, ...). Normal instruction and stripmine variants may be mixed together.
When stripmining is used in conjunction with instructions which use a register
index as a base to several registers, an offset of +4 (instead of +1) shall be
used. For example, {vm0,vm1} becomes {{v0,v1,v2,v3},{v4,v5,v6,v7}}.
A machine may elect to distribute a stripmined instruction across multiple ALUs.
op[5] | m
:---: | :--:
0 | ""
1 | ".m"
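The register expansion described in this section can be sketched as follows. The helper names are hypothetical and registers are represented by their integer indices (`v8` is `8`).

```python
def stripmine_registers(base):
    """Return the four registers covered by one stripmine register."""
    if not 0 <= base < 64 or base % 4 != 0:
        raise ValueError("stripmine base must be one of v0, v4, ..., v60")
    return [base + i for i in range(4)]


def stripmine_group(base, count):
    """Registers used when an op names `count` consecutive stripmine registers."""
    # Consecutive stripmine registers are +4 apart instead of +1.
    return [stripmine_registers(base + 4 * i) for i in range(count)]


print(stripmine_group(0, 2))
```

The call at the end reproduces the {vm0,vm1} example from the text.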
### 2-arg .xx (Load / Store)
Instruction | func2 | Notes
:---------: | :-------: | :--------:
vld | 00 xx0PSL | 1-arg
vld.l | 01 xx0PSL |
vld.s | 02 xx0PSL |
vld.p | 04 xx0PSL | 1 or 2-arg
vld.lp | 05 xx0PSL |
vld.sp | 06 xx0PSL |
vld.tp | 07 xx0PSL |
vst | 08 xx1PSL | 1-arg
vst.l | 09 xx1PSL |
vst.s | 10 xx1PSL |
vst.p | 12 xx1PSL | 1 or 2-arg
vst.lp | 13 xx1PSL |
vst.sp | 14 xx1PSL |
vst.tp | 15 xx1PSL |
vdup.x | 16 x10000 |
vcget | 20 x10100 | 0-arg
vstq.s | 26 x11PSL |
vstq.sp | 30 x11PSL |
To save encoding space, compile-time knowledge is used: if a vld.p.xx or
vst.p.xx post-increments by a zero amount, x0 is not encoded and the
post-increment operation is disabled instead. This frees the xs2==x0 encoding
for vld.p.x and vst.p.x, which have different base update behavior. If the
post-increment amount is determined at run time, a register with xs2!=x0 is
used.
**NOTE**: Scalar register `xs1` uses the same encoding bitfield as the vector
register `vs1`, but **HAS ONE BIT PADDED AT LSB**. That is `xs1` has the same
encoding as the regular RISC-V instructions (bit[19:15]). On the other hand,
`xs2` shares the same encoding bitfield `vs2`, but **HAS ONE BIT PADDED AT MSB**,
so it is consistent with the regular RISC-V instructions (bit[24:20]).
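The padding rule in the note can be shown with a few lines of bit arithmetic. The function names are hypothetical; the pad bits are taken to be zero, matching the `xs1[0]` and `[0]xs2` notation in the encoding table.

```python
def xs1_field(xs1):
    """6-bit field at bits 19..14: xs1 in bits 19..15, zero pad at bit 14."""
    assert 0 <= xs1 < 32
    return xs1 << 1


def xs2_field(xs2):
    """6-bit field at bits 25..20: zero pad at bit 25, xs2 in bits 24..20."""
    assert 0 <= xs2 < 32
    return xs2


def place_scalars(xs1, xs2):
    """Place both scalar indices into their 6-bit fields of the word."""
    return (xs2_field(xs2) << 20) | (xs1_field(xs1) << 14)
```

With this packing, `xs1` and `xs2` land on the same bit positions as `rs1` and `rs2` in a regular RISC-V instruction.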
### 1-arg .x (Load / Store)
Instructions of the format "op.xx vd, xs1, x0" (xs2=x0, the scalar zero
register) are reduced to the shortened form "op.x vd, xs1".
**NOTE**: Scalar register `xs1` uses the same encoding bitfield as the vector
register `vs1`, but **HAS ONE BIT PADDED AT LSB**. That is `xs1` has the same
encoding as the regular RISC-V instructions (bit[19:15]).
### 0-arg
Instructions of the format "op.xx vd, x0, x0" (xs1=x0, xs2=x0, the scalar zero
register) are reduced to the shortened form "op vd".
### 1-arg .v
Single argument vector operations ".v" use xs2 scalar encoding "x0|zero".
### 2-arg .vv|.vx
**Instruction** | func2 | **func1** / Notes
:-------------: | :-------: | :-----------------------:
**Arithmetic** | ... | **000**
vadd | 00 xxxxxx |
vsub | 01 xxxxxx |
vrsub | 02 xxxxxx |
veq | 06 xxxxxx |
vne | 07 xxxxxx |
vlt.{u} | 08 xxxxxU |
vle.{u} | 10 xxxxxU |
vgt.{u} | 12 xxxxxU |
vge.{u} | 14 xxxxxU |
vabsd.{u} | 16 xxxxxU |
vmax.{u} | 18 xxxxxU |
vmin.{u} | 20 xxxxxU |
vadd3 | 24 xxxxxx |
**Arithmetic2** | ... | **100**
vadds.{u} | 00 xxxxxU |
vsubs.{u} | 02 xxxxxU |
vaddw.{u} | 04 xxxxxU |
vsubw.{u} | 06 xxxxxU |
vacc.{u} | 10 xxxxxU |
vpadd.{u} | 12 xxxxxU | .v
vpsub.{u} | 14 xxxxxU | .v
vhadd.{ur} | 16 xxxxRU |
vhsub.{ur} | 20 xxxxRU |
**Logical** | ... | **001**
vand | 00 xxxxxx |
vor | 01 xxxxxx |
vxor | 02 xxxxxx |
vnot | 03 xxxxxx | .v
vrev | 04 xxxxxx |
vror | 05 xxxxxx |
vclb | 08 xxxxxx | .v
vclz | 09 xxxxxx | .v
vcpop | 10 xxxxxx | .v
vmv | 12 xxxxxx | .v
vmvp | 13 xxxxxx |
acset | 16 xxxxxx |
actr | 17 xxxxxx | .v
adwinit | 18 xxxxxx |
**Shift** | ... | **010**
vsll | 01 xxxxxx |
vsra | 02 xxxxx0 |
vsrl | 03 xxxxx1 |
vsha.{r} | 08 xxxxR0 | +/- shamt
vshl.{r} | 09 xxxxR1 | +/- shamt
vsrans{u}.{r} | 16 xxxxRU | narrowing saturating (x2)
vsraqs{u}.{r} | 24 xxxxRU | narrowing saturating (x4)
**Mul/Div** | **...** | **011**
vmul | 00 xxxxxx |
vmuls | 02 xxxxxU |
vmulw | 04 xxxxxU |
vmulh.{ur} | 08 xxxxRU |
vdmulh.{rn} | 16 xxxxRN |
vmacc | 20 xxxxxx |
vmadd | 21 xxxxxx |
**Float** | ... | **101**
--reserved-- | xx xxxxxx |
**Shuffle** | ... | **110**
vslidevn | 00 xxxxNN |
vslidehn | 04 xxxxNN |
vslidevp | 08 xxxxNN |
vslidehp | 12 xxxxNN |
vsel | 16 xxxxxx |
vevn | 24 xxxxxx |
vodd | 25 xxxxxx |
vevnodd | 26 xxxxxx |
vzip | 28 xxxxxx |
**Reserved7** | ... | **111**
--reserved-- | xx xxxxxx |
### 3-arg .vvv|.vxv
Instruction | func3 | Notes
:---------: | :---: | :-----------------------:
aconv | 8 | scalar: sign
vdwconv | 10 | scalar: sign/type/swizzle
### Typeless
Operations that do not have a {.b,.h,.w} type have the same behavior regardless
of the size field (bitwise: vand, vnot, vor, vxor; move: vmv, vmvp). The tooling
convention is to use size=0b00 ".b" encoding.
### Vertical Modes
The ".tp" mode of vld or vst uses the four registers of ".m" in a vertical
structure, compared to the horizontal usage of other modes. The ".m" base update is a
single register width, vs 4x width for other modes. The usage model is four
"lines" being processed at the same time, vs a single line chained together in
other ".m" modes.
```
Horizontal
... AAAA BBBB CCCC DDDD ...
vs.
Vertical (".tp")
... AAAA ...
... BBBB ...
... CCCC ...
... DDDD ...
```
### Aliases
vneg.v vrsub.xv vd, vs1, zero \
vabs.v vabsd.vx vd, vs1, zero \
vwiden.v vaddw.vx vd, vs1, zero
## System Instructions
The execution model is designed towards OS-less and interrupt-less operation. A
machine will typically operate as run-to-completion of small restartable
workloads. A user/machine mode split is provided as a runtime convenience,
though there is no difference in access permissions between the modes.
31..28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19..15 | 14..12 | 11..7 | 6..2 | 1 | 0 | OP
:----: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :----: | :----: | :---: | :---: | :-: | :-: | :-:
0000 | PI | PO | PR | PW | SI | SO | SR | SW | 00000 | 000 | 00000 | 00011 | 1 | 1 | FENCE
<br>
31..28 | 27..24 | 23..20 | 19..15 | 14..12 | 11..7 | 6..2 | 1 | 0 | OP
:----: | :----: | :----: | :----: | :----: | :---: | :---: | :-: | :-: | :-----:
0000 | 0000 | 0000 | 00000 | 001 | 00000 | 00011 | 1 | 1 | FENCE.I
<br>
31..27 | 26..25 | 24..20 | 19..15 | 14..12 | 11..7 | 6..2 | 1 | 0 | OP
:----: | :----: | :----: | :----: | :----: | :---: | :---: | :-: | :-: | :-:
00100 | 11 | 00000 | xs1 | 000 | 00000 | 11101 | 1 | 1 | FLUSH
0001M | sz | xs2 | xs1 | 000 | xd | 11101 | 1 | 1 | GET{MAX}VL
01111 | 00 | 00000 | xs1 | mode | 00000 | 11101 | 1 | 1 | \[F,S,K,C\]LOG
<br>
31..20 | 19..15 | 14..12 | 11..7 | 6..2 | 1 | 0 | OP
:----------: | :----: | :----: | :---: | :---: | :-: | :-: | :----:
000000000001 | 00000 | 000 | 00000 | 11100 | 1 | 1 | EBREAK
001100000010 | 00000 | 000 | 00000 | 11100 | 1 | 1 | MRET
000010000000 | 00000 | 000 | 00000 | 11100 | 1 | 1 | MPAUSE
000001100000 | 00000 | 000 | 00000 | 11100 | 1 | 1 | ECTXSW
000001000000 | 00000 | 000 | 00000 | 11100 | 1 | 1 | EYIELD
000000100000 | 00000 | 000 | 00000 | 11100 | 1 | 1 | EEXIT
000000000000 | 00000 | 000 | 00000 | 11100 | 1 | 1 | ECALL
### Exit Cause
* `enum_IDLE = 0`
* `enum_EBREAK = 1`
* `enum_ECALL = 2`
* `enum_EEXIT = 3`
* `enum_EYIELD = 4`
* `enum_ECTXSW = 5`
* `enum_UNDEF_INST = (1u<<31) | 2`
* `enum_USAGE_FAULT = (1u<<31) | 16`
## Instruction Definitions
--------------------------------------------------------------------------------
### FLUSH
Cache clean and invalidate operations at the private level
**Encodings**
flushat xs1 \
flushall
**Operation**
```
Start = End = xs1
Line = xs1
```
The instruction is a standard way of describing cache maintenance operations.
Type | Visibility | System1 | System2
------- | ---------- | ----------------- | ---------------------
Private | Core | Core L1 | Core L1 + Coherent L2
<br>
--------------------------------------------------------------------------------
### FENCE
Enforce memory ordering of loads and stores for external visibility.
**Encodings**
fence \[i|o|r|w\], \[i|o|r|w\] \
fence
**Operation**
```
PI predecessor I/O input
PO predecessor I/O output
PR predecessor memory read
PW predecessor memory write
<ordering between marked predecessors and successors>
SI successor I/O input
SO successor I/O output
SR successor memory read
SW successor memory write
```
Note: a simplified implementation may have the frontend stall until all
preceding operations are completed before permitting any trailing instruction to
be dispatched.
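For reference, the FENCE bit layout from the System Instructions table can be packed as shown below. The function name `encode_fence` is hypothetical, and treating a bare `fence` as `iorw, iorw` is an assumption borrowed from common RISC-V assembler behavior rather than something this document states.

```python
def encode_fence(pred="iorw", succ="iorw"):
    """Pack predecessor/successor sets into a FENCE instruction word."""
    order = "iorw"  # within each nibble: I is the top bit, W the bottom bit

    def nibble(flags):
        value = 0
        for ch in flags:
            value |= 1 << (3 - order.index(ch))  # ValueError on a bad flag
        return value

    # bits 27..24 = PI PO PR PW, bits 23..20 = SI SO SR SW, opcode = 0001111
    return (nibble(pred) << 24) | (nibble(succ) << 20) | 0b0001111


print(hex(encode_fence("rw", "rw")))
```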
--------------------------------------------------------------------------------
### FENCE.I
Ensure subsequent instruction fetches observe prior data operations.
**Encodings**
fence.i
**Operation**
```
InvalidateInstructionCaches()
InvalidateInstructionPrefetchBuffers()
```
--------------------------------------------------------------------------------
### GETVL
Calculate the vector length.
**Encodings**
getvl.[b,h,w].x xd, xs1 \
getvl.[b,h,w].xx xd, xs1, xs2 \
getvl.[b,h,w].x.m xd, xs1 \
getvl.[b,h,w].xx.m xd, xs1, xs2
**Operation**
```
xd = min(vl.type.size, unsigned(xs1), xs2 ? unsigned(xs2) : ignore)
```
Find the minimum of the maximum vector length by type and the two input values.
If xs2 is zero (either x0 or register contents) then it is ignored (or
considered MaxInt), acting as a clamp less than maxvl.
Type | Instruction | Description
---- | ----------- | ----------------
00 | getvl.b | 8bit lane count
01 | getvl.h | 16bit lane count
10 | getvl.w | 32bit lane count
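The operation above can be modeled in a few lines. This is a sketch under assumptions: scalar registers are treated as unsigned 32-bit values, the SIMD register width defaults to 256 bits, and the function names are hypothetical.

```python
LANE_BITS = {"b": 8, "h": 16, "w": 32}


def max_lanes(lane, stripmine=False, vlen_bits=256):
    """Maximum lane count for a lane type, times 4 when stripmined."""
    lanes = vlen_bits // LANE_BITS[lane]
    return lanes * 4 if stripmine else lanes


def getvl(lane, xs1, xs2=0, stripmine=False, vlen_bits=256):
    """Model of getvl: minimum of the max length, xs1, and a nonzero xs2."""
    mask = 0xFFFFFFFF  # interpret scalar inputs as unsigned 32-bit values
    candidates = [max_lanes(lane, stripmine, vlen_bits), xs1 & mask]
    if xs2 & mask:  # a zero xs2 is ignored
        candidates.append(xs2 & mask)
    return min(candidates)
```

A typical use is a loop that calls `getvl` with the remaining element count each iteration, so the final partial iteration is clamped automatically.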
--------------------------------------------------------------------------------
### GETMAXVL
Obtain the maximum vector length.
**Encodings**
getmaxvl.[b,h,w].{m} xd
**Operation**
```
xd = vl.type.size
```
Type | Instruction | Description
---- | ----------- | ----------------
00 | getmaxvl.b | 8bit lane count
01 | getmaxvl.h | 16bit lane count
10 | getmaxvl.w | 32bit lane count
For a machine with 256bit SIMD registers:
* getmaxvl.w = 8 lanes
* getmaxvl.h = 16 lanes
* getmaxvl.b = 32 lanes
* getmaxvl.w.m = 32 lanes &ensp; // multiply by 4 with strip mine.
* getmaxvl.h.m = 64 lanes
* getmaxvl.b.m = 128 lanes
--------------------------------------------------------------------------------
### ECALL
Execution call to supervisor OS.
**Encodings**
ecall
**Operation**
```
if (mode == User)
  mcause = enum_ECALL
  mepc = pc
  pc = mtvec
  mode = Machine
else
  mcause = enum_USAGE_FAULT
  mfault = pc
  EndExecution
```
--------------------------------------------------------------------------------
### EEXIT
Execution exit to supervisor OS.
**Encodings**
eexit
**Operation**
```
if (mode == User)
  mcause = enum_EEXIT
  mepc = pc
  pc = mtvec
  mode = Machine
else
  mcause = enum_USAGE_FAULT
  mfault = pc
  EndExecution
```
--------------------------------------------------------------------------------
### EYIELD
Synchronous execution switch to supervisor OS.
**Encodings**
eyield
**Operation**
```
if (mode == User)
  if (YIELD_REQUEST == 1)
    mcause = enum_EYIELD
    mepc = pc + 4  # advance to next instruction
    pc = mtvec
    mode = Machine
  else
    NOP  # pc = pc + 4
else
  mcause = enum_USAGE_FAULT
  mfault = pc
  EndExecution
```
YIELD_REQUEST refers to a signal the supervisor core sets to request a context
switch.
Note: eyield is intended for use when MIE=0, where it is inserted at
synchronization points for cooperative context switching.
--------------------------------------------------------------------------------
### ECTXSW
Asynchronous execution switch to supervisor OS.
**Encodings**
ectxsw
**Operation**
```
if (mode == User)
  mcause = enum_ECTXSW
  mepc = pc
  pc = mtvec
  mode = Machine
else
  mcause = enum_USAGE_FAULT
  mfault = pc
  EndExecution
```
--------------------------------------------------------------------------------
### EBREAK
Execution breakpoint to supervisor OS.
**Encodings**
ebreak
**Operation**
```
if (mode == User)
  mcause = enum_EBREAK
  mepc = pc
  pc = mtvec
  mode = Machine
else
  mcause = enum_UNDEF_INST
  mfault = pc
  EndExecution
```
--------------------------------------------------------------------------------
### MRET
Return from machine mode to user mode.
**Encodings**
mret
**Operation**
```
if (mode == Machine)
  pc = mepc
  mode = User
else
  mcause = enum_UNDEF_INST
  mepc = pc
  pc = mtvec
  mode = Machine
```
--------------------------------------------------------------------------------
### MPAUSE
Machine pause and release for next execution context.
**Encodings**
mpause
**Operation**
```
if (mode == Machine)
  EndExecution
else
  mcause = enum_UNDEF_INST
  mepc = pc
  pc = mtvec
  mode = Machine
```
--------------------------------------------------------------------------------
### VABSD
Absolute difference with unsigned result.
**Encodings**
vabsd.[b,h,w].{u}.vv.{m} vd, vs1, vs2 \
vabsd.[b,h,w].{u}.vx.{m} vd, vs1, xs2
**Operation**
```
for L in Op.typelen
  vd[L] = vs1[L] > vs2[L] ? vs1[L] - vs2[L] : vs2[L] - vs1[L]
```
Note: for signed(INTx_MAX - INTx_MIN) the result will be UINTx_MAX.
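The note about the unsigned result can be checked with a small lane-wise model. The helper names are hypothetical; lanes are plain Python integers and `bits` selects the lane width.

```python
def to_signed(value, bits):
    """Reinterpret the low `bits` of value as a two's complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def vabsd(vs1, vs2, bits=8, unsigned=False):
    """Lane-wise absolute difference, returned as unsigned lane values."""
    mask = (1 << bits) - 1
    out = []
    for a, b in zip(vs1, vs2):
        if unsigned:
            a, b = a & mask, b & mask
        else:
            a, b = to_signed(a, bits), to_signed(b, bits)
        out.append((a - b if a > b else b - a) & mask)
    return out


print(vabsd([127], [-128]))  # signed INT8_MAX vs INT8_MIN
```

The difference of the signed extremes does not fit in a signed lane, which is why the result is defined as unsigned.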
--------------------------------------------------------------------------------
### VACC
Accumulates a value into a wider register.
**Encodings**
vacc.[h,w].{u}.vv.{m} vd, vs1, vs2 \
vacc.[h,w].{u}.vx.{m} vd, vs1, xs2
**Operation**
```
for L in Op.typelen
  {vd+0}[L] = {vs1+0}[L] + vs2.asHalfType[2*L+0]
  {vd+1}[L] = {vs1+1}[L] + vs2.asHalfType[2*L+1]
```
--------------------------------------------------------------------------------
### VADD
Add operands.
**Encodings**
vadd.[b,h,w].vv.{m} vd, vs1, vs2 \
vadd.[b,h,w].vx.{m} vd, vs1, xs2
**Operation**
```
for L in Op.typelen
  vd[L] = vs1[L] + vs2[L]
```
--------------------------------------------------------------------------------
### VADDS
Add operands with saturation.
**Encodings**
vadds.[b,h,w].{u}.vv.{m} vd, vs1, vs2 \
vadds.[b,h,w].{u}.vx.{m} vd, vs1, xs2
**Operation**
```
for L in Op.typelen
  vd[L] = Saturate(vs1[L] + vs2[L])
```
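A minimal model of the saturating add, assuming `Saturate` clamps to the range of the lane type (signed or unsigned according to `.u`). The function names and the list-of-integers lane representation are illustrative only.

```python
def saturate(value, bits, unsigned):
    """Clamp value to the representable range of a lane."""
    if unsigned:
        lo, hi = 0, (1 << bits) - 1
    else:
        lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def vadds(vs1, vs2, bits=8, unsigned=False):
    """Lane-wise add that clamps instead of wrapping."""
    return [saturate(a + b, bits, unsigned) for a, b in zip(vs1, vs2)]


print(vadds([100, -100, 1], [100, -100, 2]))
```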
--------------------------------------------------------------------------------
### VADDW
Add operands with widening.
**Encodings**
vaddw.[h,w].{u}.vv.{m} vd, vs1, vs2 \
vaddw.[h,w].{u}.vx.{m} vd, vs1, xs2
**Operation**
```
for L in Op.typelen
  {vd+0}[L] = vs1.asHalfType[2*L+0] + vs2.asHalfType[2*L+0]
  {vd+1}[L] = vs1.asHalfType[2*L+1] + vs2.asHalfType[2*L+1]
```
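The even/odd split of the widening add is easier to see on concrete lanes. In this sketch the function name is hypothetical, the inputs are lists of narrow lanes, and the two returned lists stand for the wide lanes of `{vd+0}` and `{vd+1}`.

```python
def vaddw(vs1, vs2):
    """Widening add: even narrow lanes go to vd+0, odd narrow lanes to vd+1."""
    wide_lanes = len(vs1) // 2  # Op.typelen counts destination lanes
    vd0 = [vs1[2 * L] + vs2[2 * L] for L in range(wide_lanes)]
    vd1 = [vs1[2 * L + 1] + vs2[2 * L + 1] for L in range(wide_lanes)]
    return vd0, vd1


vd0, vd1 = vaddw([100, 101, 102, 103], [100, 100, 100, 100])
print(vd0, vd1)
```

Sums such as 200 exceed the signed 8b range but fit the 16b destination lanes, which is the point of the widening form.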
--------------------------------------------------------------------------------
### VADD3
Add three operands.
**Encodings**
vadd3.[w].vv.{m} vd, vs1, vs2 \
vadd3.[w].vx.{m} vd, vs1, xs2
**Operation**
```
for L in i32.typelen
  vd[L] = vd[L] + vs1[L] + vs2[L]
```
--------------------------------------------------------------------------------
### VAND
AND operands.
**Encodings**
vand.vv.{m} vd, vs1, vs2 \
vand.[b,h,w].vx.{m} vd, vs1, xs2
**Operation**
```
for L in Op.typelen
  vd[L] = vs1[L] & vs2[L]
```
--------------------------------------------------------------------------------
### ACONV
Convolution ALU operation.
**Encodings**
aconv.vxv vd, vs1, xs2, vs3
Encoding 'aconv' uses a '1' in the unused 5th bit (b25) of vs2.
**Operation**
```
# 8b:  0123456789abcdef
# 32b: 048c 26ae 159d 37bf
assert(vd == 48)
N = is_simd512 ? 16 : is_simd256 ? 8 : assert(0)

func Interleave(Y,L):
  m = L % 4
  if (m == 0) (Y & ~3) + 0
  if (m == 1) (Y & ~3) + 2
  if (m == 2) (Y & ~3) + 1
  if (m == 3) (Y & ~3) + 3

# i32 += i8 x i8 (u*u, u*s, s*u, s*s)
for Y in [0..N-1]
  for X in [Start..Stop]
    for L in i8.typelen
      Data1 = {vs1+Y}.i8[4*X + L&3]  # 'transpose and broadcast'
      Data2 = {vs3+X-Start}.u8[L]
      {Accum+Interleave(Y,L)}[L / 4] +=
        ((signed(SData1,Data1{7:0}) + signed(Bias1{8:0})){9:0} *
         (signed(SData2,Data2{7:0}) + signed(Bias2{8:0})){9:0}){18:0}
```
Length (stop - start + 1) is in 32bit accumulator lane count, as all inputs will
horizontally reduce to this size.
The Start and Stop definition allows for a partial window of input values to be
transpose broadcast into the convolution unit.
Mode | Mode | Usage
:----: | :--: | :-----------------------------------------------:
Common | | Mode[1:0] Start[6:2] Stop[11:7]
s8 | 0 | SData2[31] SBias2[30:22] SData1[21] SBias1[20:12]
```
# SIMD256
acc.out = {v48..55}
narrow0 = {v0..7}
narrow1 = {v16..23}
narrow2 = {v32..39}
narrow3 = {v48..55}
wide0 = {v8..15}
wide1 = {v24..31}
wide2 = {v40..47}
wide3 = {v56..63}
```
### VCGET
Copy convolution accumulators into general registers.
**Encodings**
vcget vd
**Operation**
```
assert(vd == 48)
N = is_simd512 ? 15 : is_simd256 ? 7 : assert(0)
for Y in [0..N]
  vd{Y} = Accum{Y}
  Accum{Y} = 0
```
### ACSET
Copy general registers into convolution accumulators.
**Encodings**
acset.v vd, vs1
**Operation**
```
assert(vd == 48)
N = is_simd512 ? 15 : is_simd256 ? 7 : assert(0)
for Y in [0..N]
  Accum{Y} = vd{Y}
```
--------------------------------------------------------------------------------
### ACTR
Transpose a register group into the convolution accumulators.
**Encodings**
actr.[w].v.{m} vd, vs1
**Operation**
```
assert(vd in {v48})
assert(vs1 in {v0, v16, v32, v48})
for I in i32.typelen
  for J in i32.typelen
    ACCUM[J][I] = vs1[I][J]
```
--------------------------------------------------------------------------------
### VCLB
Count the leading bits.
**Encodings**
vclb.[b,h,w].v.{m} vd, vs1
**Operation**
```
MSB = 1 << (vtype.bits - 1)
for L in Op.typelen
  vd[L] = vs1[L] & MSB ? CLZ(~vs1[L]) : CLZ(vs1[L])
```
Note: (clb - 1) is equivalent to `__builtin_clrsb`.
**clb examples**
```
clb.w(0xffffffff) = 32
clb.w(0xcfffffff) = 2
clb.w(0x80001000) = 1
clb.w(0x00007fff) = 17
clb.w(0x00000000) = 32
```
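The examples above can be reproduced with a short model of the operation. The function names are hypothetical, and `clz` is written with Python's `int.bit_length` rather than any Kelvin primitive.

```python
def clz(value, bits):
    """Count leading zeros of value within a lane of `bits` bits."""
    value &= (1 << bits) - 1
    return bits - value.bit_length()


def clb(value, bits=32):
    """Count leading bits that match the most significant bit."""
    mask = (1 << bits) - 1
    value &= mask
    msb = 1 << (bits - 1)
    return clz(~value & mask, bits) if value & msb else clz(value, bits)


for x in (0xFFFFFFFF, 0xCFFFFFFF, 0x80001000, 0x00007FFF, 0x00000000):
    print(hex(x), clb(x))
```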
--------------------------------------------------------------------------------
### VCLZ
Count the leading zeros.
**Encodings**
vclz.[b,h,w].v.{m} vd, vs1
**Operation**
```
for L in Op.typelen
  vd[L] = CLZ(vs1[L])
```
Note: clz.[b,h,w](0) returns [8,16,32].
--------------------------------------------------------------------------------
### VDWCONV
Depthwise convolution 3-way multiply accumulate.
**Encodings**
vdwconv.vxv vd, vs1, xs2, vs3 \
adwconv.vxv vd, vs1, xs2, vs3
Encoding 'adwconv' uses a '1' in the unused 5th bit (b25) of vs2.
**Operation**
The vertical axis is typically tiled which requires preserving registers for
this functionality. The sparse formats require shuffles so that additional
registers of intermediate state are not required.
```
# quant8
{vs1+0,vs1+1,vs1+2} = Rebase({vs1}, Mode::RegBase)
{b0} = {vs3+0}.asByteType
{b1} = {vs3+1}.asByteType
{b2} = {vs3+2}.asByteType
if IsDenseFormat
  a0 = {vs1+0}.asByteType
  a1 = {vs1+1}.asByteType
  a2 = {vs1+2}.asByteType
if IsSparseFormat1  # [n-1,n,n+1]
  a0 = vslide_p({vs1+1}, {vs1+0}, 1).asByteType
  a1 = {vs1+0}.asByteType
  a2 = vslide_n({vs1+1}, {vs1+2}, 1).asByteType
if IsSparseFormat2  # [n,n+1,n+2]
  a0 = {vs1+0}.asByteType
  a1 = vslide_n({vs1+0}, {vs1+1}, 1).asByteType
  a2 = vslide_n({vs1+0}, {vs1+1}, 2).asByteType

# 8b:  0123456789abcdef
# 32b: 048c 26ae 159d 37bf
func Interleave(L):
  i = L % 4
  if (i == 0) 0
  if (i == 1) 2
  if (i == 2) 1
  if (i == 3) 3

for L in Op.typelen
  B = 4*L  # 8b --> 32b
  for i in [0..3]
    # int19_t multiply results
    # int23_t addition results
    # int32_t storage
    {dwacc+i}[L/4] +=
      (SData1(a0[B+i]) + bias1) * (SData2(b0[B+i]) + bias2) +
      (SData1(a1[B+i]) + bias1) * (SData2(b1[B+i]) + bias2) +
      (SData1(a2[B+i]) + bias1) * (SData2(b2[B+i]) + bias2)
if is_vdwconv  // !adwconv
  for i in [0..3]
    {vd+i} = {dwacc+i}
```
Mode | Encoding | Usage
:----: | :------: | :-----------------------------------------------:
Common | xs2 | Mode[1:0] Sparsity[3:2] RegBase[7:4]
q8 | 0 | SData2[31] SBias2[30:22] SData1[21] SBias1[20:12]
The Mode::Sparsity sets the swizzling patterns.
Sparsity | Format | Swizzle
:------: | :-----: | :---------:
b00 | Dense | none
b01 | Sparse1 | [n-1,n,n+1]
b10 | Sparse2 | [n,n+1,n+2]
The Mode::RegBase selects the starting point within the 3-register group, which
allows cycling of the [prev,curr,next] values.
RegBase | Prev | Curr | Next
:-----: | :-----: | :-----: | :-----:
b0000 | {vs1+0} | {vs1+1} | {vs1+2}
b0001 | {vs1+1} | {vs1+2} | {vs1+3}
b0010 | {vs1+2} | {vs1+3} | {vs1+4}
b0011 | {vs1+3} | {vs1+4} | {vs1+5}
b0100 | {vs1+4} | {vs1+5} | {vs1+6}
b0101 | {vs1+5} | {vs1+6} | {vs1+7}
b0110 | {vs1+6} | {vs1+7} | {vs1+8}
b0111 | {vs1+1} | {vs1+0} | {vs1+2}
b1000 | {vs1+1} | {vs1+2} | {vs1+0}
b1001 | {vs1+3} | {vs1+4} | {vs1+0}
b1010 | {vs1+5} | {vs1+6} | {vs1+0}
b1011 | {vs1+7} | {vs1+8} | {vs1+0}
b1100 | {vs1+2} | {vs1+0} | {vs1+1}
b1101 | {vs1+4} | {vs1+0} | {vs1+1}
b1110 | {vs1+6} | {vs1+0} | {vs1+1}
b1111 | {vs1+8} | {vs1+0} | {vs1+1}
RegBase supports up to 3x3, 5x5, 7x7, and 9x9, or the extra horizontal range can
be used for input latency hiding.
The vdwconv instruction includes a non-architectural state accumulator to
increase register file bandwidth. The dwinit instruction must be used to prepare
the depthwise accumulator for a sequence of dwconv instructions, and the
sequence must be dispatched without other instructions interleaved, otherwise
the results will be unpredictable. Should other operations be required, a dwinit
must be inserted to resume the sequence.
In a context switch save where the accumulator must be saved alongside the
architectural simd registers, v0..63 are saved to thread stack or tcb and then a
vdwconv with vdup prepared zero inputs can be used to write the values to simd
registers and then saved to memory. In a context switch restore the values can
be loaded from memory and set in the accumulator registers using the dwinit
instruction.
### ADWINIT
Load the depthwise convolution accumulator state.
**Encodings**
adwinit.v vd, vs1
**Operation**
```
for L in Op.typelen
  {dwacc+0}[L] = {vs1+0}[L]
  {dwacc+1}[L] = {vs1+1}[L]
  {dwacc+2}[L] = {vs1+2}[L]
  {dwacc+3}[L] = {vs1+3}[L]
```
--------------------------------------------------------------------------------
### VDMULH
Saturating signed doubling multiply returning high half with optional rounding.
**Encodings**
vdmulh.[b,h,w].{r,rn}.vv.{m} vd, vs1, vs2 \
vdmulh.[b,h,w].{r,rn}.vx.{m} vd, vs1, xs2
**Operation**
```
SZ = vtype.size * 8
for L in Op.typelen
  LHS = SignExtend(vs1[L], 2*SZ)
  RHS = SignExtend(vs2[L], 2*SZ)
  MUL = LHS * RHS
  RND = R ? (N && MUL < 0 ? -(1<<(SZ-1)) : (1<<(SZ-1))) : 0
  vd[L] = SignedSaturation(2 * MUL + RND)[2*SZ-1:SZ]
```
Note: saturation is only needed for MaxNeg inputs (eg. 0x80000000).
Note: vdmulh.w.r.vx.m is used in ML activations so may be optimized by
implementations.
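A single-lane model of the operation may help when checking the rounding variants. It is a sketch, not a reference implementation: the function name is hypothetical, inputs are assumed to be signed integers already within the lane range, and `SignedSaturation` is taken to clamp at `2*SZ` bits before the high half is extracted.

```python
def vdmulh_lane(a, b, bits, r=False, n=False):
    """Saturating doubling multiply returning the high half of one lane."""
    mul = a * b
    rnd = 0
    if r:
        # .rn rounds negative products with a negative constant
        rnd = -(1 << (bits - 1)) if (n and mul < 0) else (1 << (bits - 1))
    wide = 2 * mul + rnd
    lo, hi = -(1 << (2 * bits - 1)), (1 << (2 * bits - 1)) - 1
    wide = max(lo, min(hi, wide))  # SignedSaturation at 2*SZ bits
    return wide >> bits            # keep bits [2*SZ-1:SZ]


print(vdmulh_lane(-128, -128, 8))  # the MaxNeg * MaxNeg case saturates
```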
--------------------------------------------------------------------------------
### VDUP
Duplicate a scalar value into a vector register.
**Encodings**
vdup.[b,h,w].x.{m} vd, xs2
**Operation**
```
for L in Op.typelen
  vd[L] = [xs2]
```
--------------------------------------------------------------------------------
### VEQ
Integer equal comparison.
**Encodings**
veq.[b,h,w].vv.{m} vd, vs1, vs2 \
veq.[b,h,w].vx.{m} vd, vs1, xs2
**Operation**
```
for L in Op.typelen
  vd[L] = vs1[L] == vs2[L] ? 1 : 0
```
--------------------------------------------------------------------------------
### VEVN, VODD, VEVNODD
Even/odd of concatenated registers.
**Encodings**
vevn.[b,h,w].vv.{m} vd, vs1, vs2 \
vevn.[b,h,w].vx.{m} vd, vs1, xs2 \
vodd.[b,h,w].vv.{m} vd, vs1, vs2 \
vodd.[b,h,w].vx.{m} vd, vs1, xs2 \
vevnodd.[b,h,w].vv.{m} vd, vs1, vs2 \
vevnodd.[b,h,w].vx.{m} vd, vs1, xs2
**Operation**
```
M = Op.typelen / 2

if vevn || vevnodd
  {dst0} = {vd+0}
  {dst1} = {vd+1}
if vodd
  {dst1} = {vd+0}

if vevn || vevnodd
  for L in Op.typelen
    dst0[L] = L < M ? vs1[2 * L + 0] : vs2[2 * (L - M) + 0]  # even

if vodd || vevnodd
  for L in Op.typelen
    dst1[L] = L < M ? vs1[2 * L + 1] : vs2[2 * (L - M) + 1]  # odd

where:
  vs1 = 0x33221100
  vs2 = 0x77665544
  {vd+0} = 0x66442200
  {vd+1} = 0x77553311
```
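The worked values at the end of the operation block can be reproduced with a short model. The names `vevnodd` and `pack_bytes` are hypothetical, and the 4-byte registers are the same toy width used in the example above, not the real 256-bit width.

```python
def vevnodd(vs1, vs2):
    """Return (even lanes, odd lanes) of the concatenation vs1:vs2."""
    n = len(vs1)
    m = n // 2
    even = [vs1[2 * L] if L < m else vs2[2 * (L - m)] for L in range(n)]
    odd = [vs1[2 * L + 1] if L < m else vs2[2 * (L - m) + 1] for L in range(n)]
    return even, odd


def pack_bytes(lanes):
    """Pack byte lanes (lane 0 least significant) into one integer."""
    return int.from_bytes(bytes(lanes), "little")


vs1 = list((0x33221100).to_bytes(4, "little"))
vs2 = list((0x77665544).to_bytes(4, "little"))
even, odd = vevnodd(vs1, vs2)
print(hex(pack_bytes(even)), hex(pack_bytes(odd)))
```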
--------------------------------------------------------------------------------
### VGE
Integer greater-than-or-equal comparison.
**Encodings**
vge.[b,h,w].{u}.vv.{m} vd, vs1, vs2 \
vge.[b,h,w].{u}.vx.{m} vd, vs1, xs2
**Operation**
```
for L in Op.typelen
  vd[L] = vs1[L] >= vs2[L] ? 1 : 0
```
--------------------------------------------------------------------------------
### VGT
Integer greater-than comparison.
**Encodings**
vgt.[b,h,w].{u}.vv.{m} vd, vs1, vs2 \
vgt.[b,h,w].{u}.vx.{m} vd, vs1, xs2
**Operation**
```
for L in Op.typelen
  vd[L] = vs1[L] > vs2[L] ? 1 : 0
```
--------------------------------------------------------------------------------
### VHADD
Halving addition with optional rounding bit.
**Encodings**
vhadd.[b,h,w].{r,u,ur}.vv.{m} vd, vs1, vs2 \
vhadd.[b,h,w].{r,u,ur}.vx.{m} vd, vs1, xs2
**Operation**
```
for L in Op.typelen
  if IsSigned()
    vd[L] = (signed(vs1[L]) + signed(vs2[L]) + R) >> 1
  else
    vd[L] = (unsigned(vs1[L]) + unsigned(vs2[L]) + R) >> 1
```
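A lane-wise sketch of the halving add, assuming the intermediate sum keeps full precision before the shift so that it cannot overflow. The function name is hypothetical, and the caller passes lanes already interpreted as signed or unsigned integers.

```python
def vhadd(vs1, vs2, r=False):
    """Halving add: (a + b + R) >> 1 computed without intermediate overflow."""
    rnd = 1 if r else 0
    return [(a + b + rnd) >> 1 for a, b in zip(vs1, vs2)]


print(vhadd([127, 1, -3], [127, 2, 0]))
print(vhadd([127, 1, -3], [127, 2, 0], r=True))
```

Without `.r` the result rounds toward negative infinity; with `.r` the added bit rounds halves upward.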
--------------------------------------------------------------------------------
### VHSUB
Halving subtraction with optional rounding bit.
**Encodings**
vhsub.[b,h,w].{r,u,ur}.vv.{m} vd, vs1, vs2 \
vhsub.[b,h,w].{r,u,ur}.vx.{m} vd, vs1, xs2
**Operation**
```
for L in Op.typelen
  if IsSigned()
    vd[L] = (signed(vs1[L]) - signed(vs2[L]) + R) >> 1
  else
    vd[L] = (unsigned(vs1[L]) - unsigned(vs2[L]) + R) >> 1
```
--------------------------------------------------------------------------------
### VLD
Vector load from memory with optional post-increment by scalar.
**Encodings**
vld.[b,h,w].{p}.x.{m} vd, xs1 \
vld.[b,h,w].[l,p,s,lp,sp,tp].xx.{m} vd, xs1, xs2
**Operation**
```
addr = xs1
sm   = Op.m ? 4 : 1
len  = min(Op.typelen * sm, unsigned(xs2))
for M in Op.m
  for L in Op.typelen
    if !Op.bit.l || (L + M * Op.typelen) < len
      vd[L] = mem[addr + L].type
    else
      vd[L] = 0
  if (Op.bit.s)
    addr += xs2 * sizeof(type)
  else
    addr += Reg.bytes
if Op.bit.p
  if Op.bit.l && Op.bit.s  # .tp
    xs1 += Reg.bytes
  elif !Op.bit.l && !Op.bit.s && !{xs2}  # .p.x
    xs1 += Reg.bytes * sm
  elif Op.bit.l  # .lp
    xs1 += len * sizeof(type)
  elif Op.bit.s  # .sp
    xs1 += xs2 * sizeof(type) * sm
  else  # .p.xx
    xs1 += xs2 * sizeof(type)
```
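The post-increment rules at the end of the operation block can be isolated into one function that returns how many bytes are added to `xs1`. This is a sketch that follows the branch order of the pseudocode; the function name and argument names are hypothetical, `xs2` is taken as already unsigned, and `reg_bytes=32` assumes 256-bit registers.

```python
def vld_post_increment(l, s, xs2, type_bytes, stripmine, reg_bytes=32):
    """Bytes added to xs1 by a vld with the .p bit set."""
    sm = 4 if stripmine else 1
    typelen = reg_bytes // type_bytes
    length = min(typelen * sm, xs2)
    if l and s:                       # .tp
        return reg_bytes
    if not l and not s and xs2 == 0:  # .p.x
        return reg_bytes * sm
    if l:                             # .lp
        return length * type_bytes
    if s:                             # .sp
        return xs2 * type_bytes * sm
    return xs2 * type_bytes           # .p.xx


# vld.w.p.x.m advances by four full registers
print(vld_post_increment(l=False, s=False, xs2=0, type_bytes=4, stripmine=True))
```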
--------------------------------------------------------------------------------
### VLE
Integer less-than-or-equal comparison.
**Encodings**
vle.[b,h,w].{u}.vv.{m} vd, vs1, vs2 \
vle.[b,h,w].{u}.vx.{m} vd, vs1, xs2
**Operation**
```
for L in Op.typelen
  vd[L] = vs1[L] <= vs2[L] ? 1 : 0
```
--------------------------------------------------------------------------------
### VLT
Integer less-than comparison.
**Encodings**
vlt.[b,h,w].{u}.vv.{m} vd, vs1, vs2 \
vlt.[b,h,w].{u}.vx.{m} vd, vs1, xs2
**Operation**
```
for L in Op.typelen
  vd[L] = vs1[L] < vs2[L] ? 1 : 0
```
--------------------------------------------------------------------------------
### VMACC
Multiply accumulate.
**Encodings**
vmacc.[b,h,w].vv.{m} vd, vs1, vs2 \
vmacc.[b,h,w].vx.{m} vd, vs1, xs2
**Operation**
```
for L in Op.typelen
  vd[L] += vs1[L] * vs2[L]
```
--------------------------------------------------------------------------------
### VMADD
Multiply add.
**Encodings**
vmadd.[b,h,w].vv.{m} vd, vs1, vs2 \
vmadd.[b,h,w].vx.{m} vd, vs1, xs2
**Operation**
```
for L in Op.typelen
  vd[L] = vd[L] * vs2[L] + vs1[L]
```
--------------------------------------------------------------------------------
### VMAX
Find the unsigned or signed maximum of two registers.
**Encodings**
vmax.[b,h,w].{u}.vv.{m} vd, vs1, vs2 \
vmax.[b,h,w].{u}.vx.{m} vd, vs1, xs2
**Operation**
```
for L in Op.typelen
  vd[L] = vs1[L] > vs2[L] ? vs1[L] : vs2[L]
```
--------------------------------------------------------------------------------
### VMIN
Find the minimum of two registers.
**Encodings**
vmin.[b,h,w].{u}.vv.{m} vd, vs1, vs2 \
vmin.[b,h,w].{u}.vx.{m} vd, vs1, xs2
**Operation**
```
for L in Op.typelen
  vd[L] = vs1[L] < vs2[L] ? vs1[L] : vs2[L]
```
--------------------------------------------------------------------------------
### VMUL
Multiply two registers.
**Encodings**
vmul.[b,h,w].vv.{m} vd, vs1, vs2 \
vmul.[b,h,w].vx.{m} vd, vs1, xs2
**Operation**
```
for L in Op.typelen
  vd[L] = vs1[L] * vs2[L]
```
--------------------------------------------------------------------------------
### VMULS
Multiply two registers with saturation.
**Encodings**
vmuls.[b,h,w].{u}.vv.{m} vd, vs1, vs2 \
vmuls.[b,h,w].{u}.vx.{m} vd, vs1, xs2
**Operation**
```
for L in Op.typelen
  vd[L] = Saturation(vs1[L] * vs2[L])
```
--------------------------------------------------------------------------------
### VMULW
Multiply two registers with widening.
**Encodings**
vmulw.[h,w].{u}.vv.{m} vd, vs1, vs2 \
vmulw.[h,w].{u}.vx.{m} vd, vs1, xs2
**Operation**
```
for L in Op.typelen
  {vd+0}[L] = vs1.asHalfType[2*L+0] * vs2.asHalfType[2*L+0]
  {vd+1}[L] = vs1.asHalfType[2*L+1] * vs2.asHalfType[2*L+1]
```
--------------------------------------------------------------------------------
### VMULH
Multiply two registers with widening, returning the high half.
**Encodings**
vmulh.[b,h,w].{u}.{r}.vv.{m} vd, vs1, vs2 \
vmulh.[b,h,w].{u}.{r}.vx.{m} vd, vs1, xs2
**Operation**
```
SZ = vtype.size * 8
RND = IsRounded ? 1<<(SZ-1) : 0
for L in Op.typelen
  if IsU()
    vd[L] = (unsigned(vs1[L]) * unsigned(vs2[L]) + RND)[2*SZ-1:SZ]
  else if IsSU()
    vd[L] = (  signed(vs1[L]) * unsigned(vs2[L]) + RND)[2*SZ-1:SZ]
  else
    vd[L] = (  signed(vs1[L]) *   signed(vs2[L]) + RND)[2*SZ-1:SZ]
```
--------------------------------------------------------------------------------
### VMV
Move a register.
**Encodings**
vmv.v.{m} vd, vs1
**Operation**
```
for L in Op.typelen
  vd[L] = vs1[L]
```
Note: in the stripmined case an implementation may deliver more than one write per
cycle.
--------------------------------------------------------------------------------
### VMVP
Move a pair of registers.
**Encodings**
vmvp.vv.{m} vd, vs1, vs2 \
vmvp.[b,h,w].vx.{m} vd, vs1, xs2
**Operation**
```
for L in Op.typelen
  {vd+0}[L] = vs1[L]
  {vd+1}[L] = vs2[L]
```
--------------------------------------------------------------------------------
### VNE
Integer not-equal comparison.
**Encodings**
vne.[b,h,w].vv.{m} vd, vs1, vs2 \
vne.[b,h,w].vx.{m} vd, vs1, xs2
**Operation**
```
for L in Op.typelen
vd[L] = vs1[L] != vs2[L] ? 1 : 0
```
--------------------------------------------------------------------------------
### VNOT
Bitwise NOT a register.
**Encodings**
vnot.v.{m} vd, vs1
**Operation**
```
for L in Op.typelen
vd[L] = ~vs1[L]
```
--------------------------------------------------------------------------------
### VOR
OR two operands.
**Encodings**
vor.vv.{m} vd, vs1, vs2 \
vor.[b,h,w].vx.{m} vd, vs1, xs2
**Operation**
```
for L in Op.typelen
vd[L] = vs1[L] | vs2[L]
```
--------------------------------------------------------------------------------
### VPADD
Adds the lane pairs.
**Encodings**
vpadd.[h,w].{u}.v.{m} vd, vs1
**Operation**
```
if .v
for L in Op.typelen
vd[L] = (vs1.asHalfType[2 * L] + vs1.asHalfType[2 * L + 1])
```
--------------------------------------------------------------------------------
### VPSUB
Subtracts the lane pairs.
**Encodings**
vpsub.[h,w].{u}.v.{m} vd, vs1
**Operation**
```
if .v
for L in Op.typelen
vd[L] = (vs1.asHalfType[2 * L] - vs1.asHalfType[2 * L + 1])
```
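The pairwise VPADD and VPSUB operations can be sketched together in Python. The names `vpadd` and `vpsub` and the list-of-lanes representation are assumptions for this example; the input is the source register viewed as half-width lanes.

```python
def vpadd(vs1_half):
    """Hypothetical model: add each adjacent pair of half-width lanes."""
    return [vs1_half[2 * L] + vs1_half[2 * L + 1]
            for L in range(len(vs1_half) // 2)]


def vpsub(vs1_half):
    """Hypothetical model: subtract the odd lane from the even lane of each pair."""
    return [vs1_half[2 * L] - vs1_half[2 * L + 1]
            for L in range(len(vs1_half) // 2)]
```

For example, `vpadd([1, 2, 3, 4, 5, 6])` returns `[3, 7, 11]` and `vpsub([1, 2, 3, 4, 5, 6])` returns `[-1, -1, -1]`.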
--------------------------------------------------------------------------------
### VCPOP
Count the set bits.
**Encodings**
vcpop.[b,h,w].v.{m} vd, vs1
**Operation**
```
for L in Op.typelen
vd[L] = CountPopulation(vs1[L])
```
--------------------------------------------------------------------------------
### VREV
Generalized reverse using bit ladder.
The size of the flip is based on `log_2(data type)`.
**Encodings**
vrev.[b,h,w].vv.{m} vd, vs1, vs2 \
vrev.[b,h,w].vx.{m} vd, vs1, xs2
**Operation**
```
N = vtype.bits - 1 # 7, 15, 31
shamt = xs2[4:0] & N
for L in Op.typelen
r = vs1[L]
if (shamt & 1) r = ((r & 0x55..) << 1) | ((r & 0xAA..) >> 1)
if (shamt & 2) r = ((r & 0x33..) << 2) | ((r & 0xCC..) >> 2)
if (shamt & 4) r = ((r & 0x0F..) << 4) | ((r & 0xF0..) >> 4)
if (sz == 0) vd[L] = r; continue;
if (shamt & 8) r = ((r & 0x00FF..) << 8) | ((r & 0xFF00..) >> 8)
if (sz == 1) vd[L] = r; continue;
if (shamt & 16) r = ((r & 0x0000FFFF) << 16) | ((r & 0xFFFF0000) >> 16)
vd[L] = r
```
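The bit ladder above can be modelled for a single lane in Python. This is a sketch under the assumption that each ladder stage swaps adjacent groups of 1, 2, 4, 8, or 16 bits when the matching bit of the shift amount is set; the name `vrev` is hypothetical.

```python
def vrev(value, shamt, bits=32):
    """Hypothetical single-lane model of the generalized reverse."""
    mask_all = (1 << bits) - 1
    shamt &= bits - 1
    r = value & mask_all
    step = 1
    while step < bits:
        if shamt & step:
            # Build a mask selecting the low `step` bits of every
            # 2*step-bit group (0x55.., 0x33.., 0x0F.., and so on).
            low = 0
            for pos in range(0, bits, 2 * step):
                low |= ((1 << step) - 1) << pos
            # Swap the low and high halves of each group.
            r = ((r & low) << step) | ((r >> step) & low)
        step <<= 1
    return r & mask_all
```

With this model, a shift amount of 7 on an 8-bit lane reverses all bits (`vrev(0x01, 7, bits=8)` is `0x80`), and a shift amount of 24 on a 32-bit lane swaps the byte order (`vrev(0x12345678, 24)` is `0x78563412`).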
--------------------------------------------------------------------------------
### VROR
Logical rotate right.
**Encodings**
vror.[b,h,w].vv.{m} vd, vs1, vs2 \
vror.[b,h,w].vx.{m} vd, vs1, xs2
**Operation**
```
N = vtype.bits - 1 # 7, 15, 31
shamt = xs2[4:0] & N
for L in Op.typelen
r = vs1[L]
if (shamt & 1) for (B in vtype.bits) r[B] = r[(B+1) % (N+1)]
if (shamt & 2) for (B in vtype.bits) r[B] = r[(B+2) % (N+1)]
if (shamt & 4) for (B in vtype.bits) r[B] = r[(B+4) % (N+1)]
if (shamt & 8) for (B in vtype.bits) r[B] = r[(B+8) % (N+1)]
if (shamt & 16) for (B in vtype.bits) r[B] = r[(B+16) % (N+1)]
vd[L] = r
```
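For reference, a rotate right of one lane can be written compactly in Python. This sketch assumes the staged rotates compose to a single rotate by the masked shift amount; the name `vror` is hypothetical.

```python
def vror(value, shamt, bits=32):
    """Hypothetical single-lane model of a logical rotate right."""
    mask = (1 << bits) - 1
    shamt &= bits - 1          # shift amount wraps at the lane width
    value &= mask
    # Bits shifted out on the right re-enter on the left.
    return ((value >> shamt) | (value << (bits - shamt))) & mask
```

For example, `vror(0x01, 1, bits=8)` is `0x80` and `vror(0x12345678, 8)` is `0x78123456`.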
--------------------------------------------------------------------------------
### VSHA, VSHL
Arithmetic and logical left/right shift with saturating shift amount and result.
**Encodings**
vsha.[b,h,w].{r}.vv.{m} vd, vs1, vs2 \
vshl.[b,h,w].{r}.vv.{m} vd, vs1, vs2
**Operation**
```
M = Op.size # 8, 16, 32
N = [8->3, 16->4, 32->5][Op.size]
SHSAT[L] = vs2[L][M-1:N] != 0
SHAMT[L] = vs2[L][N-1:0]
RND = R && SHAMT ? 1 << (SHAMT-1) : 0
RND -= N && (vs1[L] < 0) ? 1 : 0
SZ = sizeof(src.type) * 8 * (W ? 2 : 1)
RESULT_NEG = (vs1[L] <<[<] SHAMT[L])[SZ-1:0] // !A "<<<" logical shift
RESULT_NEG = S ? Saturate(RESULT_NEG, SHSAT[L]) : RESULT_NEG
RESULT_POS = ((vs1[L] + RND) >>[>] SHAMT[L]) // !A ">>>" logical shift
RESULT_POS = S ? Saturate(RESULT_POS, SHSAT[L]) : RESULT_POS
vd[L] = SHAMT[L] >= 0 ? RESULT_POS : RESULT_NEG
```
--------------------------------------------------------------------------------
### VSEL
Select lanes from two operands with vector selection boolean.
**Encodings**
vsel.[b,h,w].vv.{m} vd, vs1, vs2 \
vsel.[b,h,w].vx.{m} vd, vs1, xs2
**Operation**
```
for L in Op.typelen
vd[L] = vs1[L].bit(0) ? vd[L] : vs2[L]
```
--------------------------------------------------------------------------------
### VSLL
Logical left shift.
**Encodings**
vsll.[b,h,w].vv.{m} vd, vs1, vs2 \
vsll.[b,h,w].vx.{m} vd, vs1, xs2
**Operation**
```
N = [8->3, 16->4, 32->5][Op.size]
for L in Op.typelen
vd[L] = vs1[L] <<< vs2[L][N-1:0]
```
--------------------------------------------------------------------------------
### VSLIDEN
Slide next register by index.
For the horizontal mode, it treats the stripmine `vm` register based on
`vs1` as a contiguous block, and only the first `index` elements from `vs2`
will be used.
For the vertical mode, each stripmine vector register `op_index` is mapped
separately. It mimics the image tiling process shift of
```
|--------|--------|
| 4xVLEN | 4xVLEN |
| (vs1) | (vs2) |
|--------|--------|
```
The vertical mode can also support the non-stripmine version to handle
the last columns of the image.
**Encodings**
Horizontal slide:
vslidehn.[b,h,w].[1,2,3,4].vv.m vd, vs1, vs2 \
vslidehn.[b,h,w].[1,2,3,4].vx.m vd, vs1, xs2
Vertical slide:
vsliden.[b,h,w].[1,2,3,4].vv vd, vs1, vs2 \
vslidevn.[b,h,w].[1,2,3,4].vv.m vd, vs1, vs2 \
vslidevn.[b,h,w].[1,2,3,4].vx.m vd, vs1, xs2
**Operation**
```
assert vd != vs1 && vd != vs2
if Op.h // A contiguous horizontal slide based on vs1
va = {{vs1},{vs1+1},{vs1+2},{vs1+3}}
vb = {{vs1+1},{vs1+2},{vs1+3},{vs2}}
if Op.v // vs1/vs2 vertical slide
va = {{vs1},{vs1+1},{vs1+2},{vs1+3}}
vb = {{vs2},{vs2+1},{vs2+2},{vs2+3}}
sm = Op.m ? 4 : 1
for M in sm
for L in Op.typelen
if (L + index < Op.typelen)
vd[L] = va[M][L + index]
else
vd[L] = is_vx ? xs2 : vb[M][L + index - Op.typelen]
```
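The slide-next behavior can be sketched in Python for the `.vv` forms. The names `vsliden` and `vslidehn` and the representation of a register as a list of lanes are assumptions for illustration; the `.vx` fill with `xs2` is not modelled.

```python
def vsliden(va, vb, index):
    """Hypothetical single-register model: drop the first `index` lanes of
    va and fill the tail with the first `index` lanes of vb."""
    n = len(va)
    return [va[L + index] if L + index < n else vb[L + index - n]
            for L in range(n)]


def vslidehn(vs1_group, vs2_reg, index):
    """Hypothetical horizontal stripmine model: the four registers based on
    vs1 form one contiguous block, followed by the single register vs2."""
    va = vs1_group
    vb = vs1_group[1:] + [vs2_reg]
    return [vsliden(a, b, index) for a, b in zip(va, vb)]
```

For example, `vsliden([0, 1, 2, 3], [4, 5, 6, 7], 1)` returns `[1, 2, 3, 4]`, and `vslidehn([[0, 1], [2, 3], [4, 5], [6, 7]], [8, 9], 1)` returns `[[1, 2], [3, 4], [5, 6], [7, 8]]`.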
--------------------------------------------------------------------------------
### VSLIDEP
Slide previous register by index.
For the horizontal mode, it treats the stripmine `vm` register based on
**`vs2`** as a contiguous block, and only the _LAST_ `index` elements from
stripmine vm register based on `vs1` will be used AT THE BEGINNING.
For the vertical mode, each stripmine vector register `op_index` is mapped
separately. It mimics the image tiling process shift of
```
|--------|--------|
| 4xVLEN | 4xVLEN |
| (vs1) | (vs2) |
|--------|--------|
```
The vertical mode can also support the non-stripmine version to handle
the last columns of the image.
**Encodings**
Horizontal slide:
vslidehp.[b,h,w].[1,2,3,4].vv.m vd, vs1, vs2 \
vslidehp.[b,h,w].[1,2,3,4].vx.m vd, vs1, xs2
Vertical slide:
vslidep.[b,h,w].[1,2,3,4].vv vd, vs1, vs2 \
vslidevp.[b,h,w].[1,2,3,4].vv.m vd, vs1, vs2 \
vslidevp.[b,h,w].[1,2,3,4].vx.m vd, vs1, xs2
**Operation**
```
assert vd != vs1 && vd != vs2
if Op.h // A contiguous horizontal slide based on vs2
va = {{vs1+3},{vs2},{vs2+1},{vs2+2}}
vb = {{vs2},{vs2+1},{vs2+2},{vs2+3}}
if Op.v // vs1/vs2 vertical slide
va = {{vs1},{vs1+1},{vs1+2},{vs1+3}}
vb = {{vs2},{vs2+1},{vs2+2},{vs2+3}}
sm = Op.m ? 4 : 1
for M in sm
for L in Op.typelen
if (L < index)
vd[L] = va[M][Op.typelen + L - index]
else
vd[L] = is_vx ? xs2 : vb[M][L - index]
```
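The slide-previous behavior for one register pair can be sketched in Python as follows. The name `vslidep` and the list-of-lanes representation are assumptions for this example; only the `.vv` form is modelled.

```python
def vslidep(va, vb, index):
    """Hypothetical single-register model: the last `index` lanes of va come
    first, followed by the leading lanes of vb."""
    n = len(va)
    return [va[n + L - index] if L < index else vb[L - index]
            for L in range(n)]
```

For example, `vslidep([0, 1, 2, 3], [4, 5, 6, 7], 1)` returns `[3, 4, 5, 6]`.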
--------------------------------------------------------------------------------
### VSRA, VSRL
Arithmetic and logical right shift.
**Encodings**
vsra.[b,h,w].vv.{m} vd, vs1, vs2 \
vsra.[b,h,w].vx.{m} vd, vs1, xs2 \
vsrl.[b,h,w].vv.{m} vd, vs1, vs2 \
vsrl.[b,h,w].vx.{m} vd, vs1, xs2
**Operation**
```
N = [8->3, 16->4, 32->5][Op.size]
for L in Op.typelen
vd[L] = vs1[L] >>[>] vs2[L][N-1:0]
```
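The difference between the arithmetic and logical forms can be illustrated for a single lane in Python. This sketch assumes lanes are passed and returned as unsigned bit patterns and that the shift amount uses only the low `log2(bits)` bits; the names `vsrl` and `vsra` are used here as hypothetical helper functions.

```python
def vsrl(value, shamt, bits=8):
    """Hypothetical single-lane logical right shift: zeros enter from the left."""
    mask = (1 << bits) - 1
    return (value & mask) >> (shamt & (bits - 1))


def vsra(value, shamt, bits=8):
    """Hypothetical single-lane arithmetic right shift: the sign bit is replicated."""
    mask = (1 << bits) - 1
    value &= mask
    # Reinterpret the bit pattern as a signed integer before shifting.
    if value & (1 << (bits - 1)):
        value -= 1 << bits
    return (value >> (shamt & (bits - 1))) & mask
```

For an 8-bit lane holding `0x80`, `vsrl(0x80, 1)` gives `0x40` while `vsra(0x80, 1)` gives `0xC0`.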
--------------------------------------------------------------------------------
### VSRANS, VSRANSU
Arithmetic right shift with rounding and signed/unsigned saturation.
**Encodings**
vsrans{u}.[b,h].{r}.vv.{m} vd, vs1, vs2 \
vsrans{u}.[b,h].{r}.vx.{m} vd, vs1, xs2
**Operation**
```
for L in Op.typelen
N = [8->3, 16->4, 32->5][Op.size]
SHAMT[L] = vs2[L][2*N-1:0] # source size index
RND = R && SHAMT ? 1 << (SHAMT-1) : 0
RND -= N && (vs1[L] < 0) ? 1 : 0
vd[L+0] = Saturate({vs1+0}[L/2] + RND, u) >>[>] SHAMT
vd[L+1] = Saturate({vs1+1}[L/2] + RND, u) >>[>] SHAMT
```
Note: vsrans.[b,h].vx.m are used in ML activations so may be optimized by
implementations.
--------------------------------------------------------------------------------
### VSRAQS
Arithmetic quarter narrowing right shift with rounding and signed/unsigned
saturation.
**Encodings**
vsraqs{u}.b.{r}.vv.{m} vd, vs1, vs2 \
vsraqs{u}.b.{r}.vx.{m} vd, vs1, xs2
**Operation**
```
for L in i32.typelen
SHAMT[L] = vs2[L][4:0]
RND = R && SHAMT ? 1 << (SHAMT-1) : 0
RND -= N && (vs1[L] < 0) ? 1 : 0
vd[L+0] = Saturate({vs1+0}[L/4] + RND, u) >>[>] SHAMT
vd[L+1] = Saturate({vs1+2}[L/4] + RND, u) >>[>] SHAMT
vd[L+2] = Saturate({vs1+1}[L/4] + RND, u) >>[>] SHAMT
vd[L+3] = Saturate({vs1+3}[L/4] + RND, u) >>[>] SHAMT
```
Note: The register interleaving is [0,2,1,3] and not [0,1,2,3] as this matches
vconv/vdwconv requirements, and one vsraqs is the same as two chained vsrans.
--------------------------------------------------------------------------------
### VRSUB
Reverse subtract two operands.
**Encodings**
vrsub.[b,h,w].vx.{m} vd, vs1, xs2
**Operation**
```
for L in Op.typelen
vd[L] = xs2 - vs1[L]
```
--------------------------------------------------------------------------------
### VSUB
Subtract two operands.
**Encodings**
vsub.[b,h,w].vv.{m} vd, vs1, vs2 \
vsub.[b,h,w].vx.{m} vd, vs1, xs2
**Operation**
```
for L in Op.typelen
vd[L] = vs1[L] - vs2[L]
```
--------------------------------------------------------------------------------
### VSUBS
Subtract two operands with saturation.
**Encodings**
vsubs.[b,h,w].{u}.vv.{m} vd, vs1, vs2 \
vsubs.[b,h,w].{u}.vx.{m} vd, vs1, xs2
**Operation**
```
for L in Op.typelen
vd[L] = Saturate(vs1[L] - vs2[L])
```
--------------------------------------------------------------------------------
### VSUBW
Subtract two operands with widening.
**Encodings**
vsubw.[h,w].{u}.vv.{m} vd, vs1, vs2 \
vsubw.[h,w].{u}.vx.{m} vd, vs1, xs2
**Operation**
```
for L in Op.typelen
{vd+0}[L] = vs1.asHalfType[2*L+0] - vs2.asHalfType[2*L+0]
{vd+1}[L] = vs1.asHalfType[2*L+1] - vs2.asHalfType[2*L+1]
```
--------------------------------------------------------------------------------
### VST
Vector store to memory with optional post-increment by scalar.
**Encodings**
vst.[b,h,w].{p}.x.{m} vd, xs1 \
vst.[b,h,w].[l,p,s,lp,sp,tp].xx.{m} vd, xs1, xs2
**Operation**
```
addr = xs1
sm = Op.m ? 4 : 1
len = min(Op.typelen * sm, unsigned(xs2))
for M in Op.m
for L in Op.typelen
if !Op.bit.l || (L + M * Op.typelen) < len
mem[addr + L].type = vd[L]
if (Op.bit.s)
addr += xs2 * sizeof(type)
else
addr += Reg.bytes
if Op.bit.p
if Op.bit.l && Op.bit.s # .tp
xs1 += Reg.bytes
elif !Op.bit.l && !Op.bit.s && !{xs2} # .p.x
xs1 += Reg.bytes * sm
elif Op.bit.l # .lp
xs1 += len * sizeof(type)
elif Op.bit.s # .sp
xs1 += xs2 * sizeof(type) * sm
else # .p.xx
xs1 += xs2 * sizeof(type)
```
--------------------------------------------------------------------------------
### VSTQ
Vector store quads to memory with optional post-increment by scalar.
**Encodings**
vstq.[b,h,w].[s,sp].xx.{m} vd, xs1, xs2
**Operation**
```
addr = xs1
sm = Op.m ? 4 : 1
for M in Op.m
for Q in 0 to 3
for L in Op.typelen / 4
mem[addr + L].type = vd[L + Q * Op.typelen / 4]
addr += xs2 * sizeof(type)
if Op.bit.p
xs1 += xs2 * sizeof(type) * sm
```
Note: This is principally for storing the results of vconv after 32b to 8b
reduction.
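The quad store pattern can be sketched in Python for a single register of byte lanes. This is an illustration only: `vstq` here is a hypothetical helper, memory is modelled as a flat list of bytes, and the stride is taken in elements, which for byte lanes equals bytes.

```python
def vstq(mem, addr, vd, stride):
    """Hypothetical model: store one register as four quarter rows.

    Each quarter of the register is written contiguously, and the address
    advances by `stride` elements between quarters. Returns the address
    after the last quarter.
    """
    quarter = len(vd) // 4
    for q in range(4):
        for L in range(quarter):
            mem[addr + L] = vd[L + q * quarter]
        addr += stride
    return addr
```

For example, storing the eight lanes `[1, 2, 3, 4, 5, 6, 7, 8]` at address 0 with a stride of 4 into sixteen zeroed bytes leaves `[1, 2, 0, 0, 3, 4, 0, 0, 5, 6, 0, 0, 7, 8, 0, 0]`.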
--------------------------------------------------------------------------------
### VXOR
XOR two operands.
**Encodings**
vxor.vv.{m} vd, vs1, vs2 \
vxor.[b,h,w].vx.{m} vd, vs1, xs2
**Operation**
```
for L in Op.typelen
vd[L] = vs1[L] ^ vs2[L]
```
--------------------------------------------------------------------------------
### VZIP
Interleave even/odd lanes of two operands.
**Encodings**
vzip.[b,h,w].vv.{m} vd, vs1, vs2 \
vzip.[b,h,w].vx.{m} vd, vs1, xs2
**Operation**
```
index = Is(a=>0, b=>1)
for L in Op.typelen
M = L / 2
N = L / 2 + Op.typelen / 2
{vd+0}[L] = L & 1 ? vs2[M] : vs1[M]
{vd+1}[L] = L & 1 ? vs2[N] : vs1[N]
where:
vs1 = 0x66442200
vs2 = 0x77553311
{vd+0} = 0x33221100
{vd+1} = 0x77665544
```
Note: vd must not be in the range of vs1 or vs2.
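The interleave can be checked against the worked values above with a short Python sketch. The name `vzip` and the list-of-lanes representation are assumptions for this example; lane 0 is taken to be the least significant byte of each value.

```python
def vzip(vs1, vs2):
    """Hypothetical model: interleave lanes of vs1 (even positions) and vs2
    (odd positions) across two destination registers."""
    n = len(vs1)
    vd0 = [(vs2 if L & 1 else vs1)[L // 2] for L in range(n)]
    vd1 = [(vs2 if L & 1 else vs1)[L // 2 + n // 2] for L in range(n)]
    return vd0, vd1
```

With `vs1 = [0x00, 0x22, 0x44, 0x66]` and `vs2 = [0x11, 0x33, 0x55, 0x77]`, the function returns `([0x00, 0x11, 0x22, 0x33], [0x44, 0x55, 0x66, 0x77])`, matching `{vd+0} = 0x33221100` and `{vd+1} = 0x77665544`.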
--------------------------------------------------------------------------------
### FLOG, SLOG, CLOG, KLOG
Log a register in a printf contract.
**Encodings**
flog rs1 &ensp; // mode=0, “printf” formatted command, rs1=(context) \
slog rs1 &ensp; // mode=1, scalar log \
clog rs1 &ensp; // mode=2, character log \
klog rs1 &ensp; // mode=3, const string log
**Operation**
A number of arguments are sent with SLOG or CLOG, and then a FLOG operation
closes the packet and may emit a timestamp and context data like ASID. A
receiving tool can construct messages, e.g. XML records per printf stream, by
collecting the arguments as they arrive in a variable length buffer, and closing
the record when the FLOG instruction arrives.
A transport layer may choose to encode, in the flog format footer, the preceding
count of arguments or bytes sent, so that payload errors or hot connections can
be detected.
The SLOG instruction will send a payload packet represented by the starting
memory location.
The CLOG instruction will send a character stream as a message of multiple
32-bit packets. The message closes when a zero character is detected. A
single character may be sent in a 32-bit packet.
**Pseudo code**
```
const uint8_t p[] = "text message";
printf("Test %s\n", p);
KLOG p
FLOG &fmt
```
```
printf("Test");
FLOG &fmt
```
```
printf("Test %d\n", result_int);
SLOG result_int
FLOG &fmt
```
```
printf("Test %d %s %s %s\n", 123, "abc", "1234", "789AB");
SLOG 123
CLOG 'abc\0'
CLOG '1234' CLOG '\0'
CLOG '789A' CLOG 'B\0'
FLOG &fmt
```
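On the receiving side, a tool might group the incoming packets into records as described under Operation. The Python sketch below is a minimal illustration, not a defined transport format: the `(mode, payload)` tuple layout and the name `collect_records` are assumptions, with mode 0 standing for FLOG as in the encoding list.

```python
FLOG_MODE = 0  # mode=0 closes a record, per the encodings above


def collect_records(packets):
    """Hypothetical receiver: buffer arguments until a FLOG packet arrives.

    `packets` is an iterable of (mode, payload) tuples, an assumed layout
    chosen for this example. Each FLOG payload is kept as the record's
    format reference.
    """
    records, args = [], []
    for mode, payload in packets:
        if mode == FLOG_MODE:
            records.append({"fmt": payload, "args": args})
            args = []  # start a fresh variable-length buffer
        else:
            args.append(payload)
    return records
```

Feeding it `[(1, 123), (0, "fmt0"), (0, "fmt1")]` yields one record with the argument `123` and a second record with no arguments.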