An ML+SIMD+Scalar instruction set for ML accelerator cores.
Kelvin has 64 vector registers, v0 to v63, each 256 bits wide. A register can store data in 8b, 16b, or 32b lanes, as encoded in the instruction (see the next section for details).
Kelvin also supports stripmine behavior, which uses 16 vector registers, each 4x the size of a typical register (see the next section for details).
The SIMD instructions use a register file with 64 entries that serves both standard arithmetic and logical operations and the domain compute. SIMD lane size, scalar broadcast, arithmetic operation sign, and stripmine behavior are encoded explicitly in the opcodes.
The SIMD instructions replace the encoding space of the compressed instruction set extension (those with 2-bit prefixes 00, 01, and 10), which quadruples the available encoding space within the 32-bit format. See The RISC-V Instruction Set Manual v2.2, “Available 30-bit instruction encoding spaces”.
31..26 | 25..20 | 19..14 | 13..12 | 11..6 | 5 | 4..2 | 1..0 | form |
---|---|---|---|---|---|---|---|---|
func2 | vs2 | vs1 | sz | vd | m | func1 | 00 | .vv |
func2 | [0]xs2 | vs1 | sz | vd | m | func1 | 10 | .vx |
func2 | 000000 | vs1 | sz | vd | m | func1 | 10 | .v |
func2 | [0]xs2 | xs1[0] | sz | vd | m | 111 | 11 | .xx |
func2 | 000000 | xs1[0] | sz | vd | m | 111 | 11 | .x |
31..26 | 25..20 | 19..14 | 13..12 | 11..6 | 5 | 4..3 | 2..0 | form |
---|---|---|---|---|---|---|---|---|
vs3 | vs2 | vs1 | func3[3:2] | vd | m | func3[1:0] | 001 | .vvv |
vs3 | [0]xs2 | vs1 | func3[3:2] | vd | m | func3[1:0] | 101 | .vxv |
The SIMD lane size is encoded in the opcode and indicates the destination type. For most opcodes the source and destination sizes are the same; they differ for widening and narrowing operations.
op[13:12] | sz | type |
---|---|---|
00 | “.b” | 8b |
01 | “.h” | 16b |
10 | “.w” | 32b |
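For example, with 256-bit registers the same operation encodes three lane widths (a sketch using vadd from the arithmetic opcode table below):

    vadd.b.vv v0, v1, v2   # 32 lanes of 8b
    vadd.h.vv v0, v1, v2   # 16 lanes of 16b
    vadd.w.vv v0, v1, v2   #  8 lanes of 32b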
Instructions may use a scalar register to perform a value broadcast (8b, 16b, 32b) to all SIMD lanes of one operand.
op[2:0] | form |
---|---|
x00 | “.vv” |
x10 | “.vx” |
x10 | “.v” (xs2==x0) |
x11 | “.xx” |
x11 | “.x” (xs2==x0) |
001 | “.vvv” |
101 | “.vxv” |
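A sketch of the register forms, using vadd and vdup with an arbitrary scalar register t0:

    vadd.w.vv v0, v1, v2   # .vv: lane-wise vector + vector
    vadd.w.vx v0, v1, t0   # .vx: t0 broadcast to all 32b lanes of operand two
    vdup.w.x  v0, t0       # .x:  single scalar source operand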
Instructions marked with “.u” have signed and unsigned variants. See comparisons, arithmetic operations, and saturation for usage; side effects follow the typical behavior unless otherwise noted.
The stripmine functionality is an instruction compression mechanism: frontend dispatch captures a single instruction, while backend issue expands it to four operations. Conceptually the register file is reduced from 64 locations to 16, and a stripmine register must use a mod-4 aligned base register (e.g. v0, v4, v8, ...). Normal and stripmine instruction variants may be freely mixed.
Currently, neither the assembler nor kelvin_sim checks for invalid stripmine base registers. Code using an invalid base (such as v1) will assemble and simulate, but will hang on the FPGA.
When stripmining is used in conjunction with instructions that use a register index as a base for several registers, an offset of +4 (instead of +1) is used, e.g. {vm0,vm1} becomes {{v0,v1,v2,v3},{v4,v5,v6,v7}}.
A machine may elect to distribute a stripmined instruction across multiple ALUs.
op[5] | m |
---|---|
0 | "" |
1 | “.m” |
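For example, a single stripmined add dispatches once and issues as four operations over the register groups (a conceptual sketch of the expansion described above):

    vadd.w.vv.m v0, v4, v8
    # issues as:
    #   vadd.w.vv v0, v4, v8
    #   vadd.w.vv v1, v5, v9
    #   vadd.w.vv v2, v6, v10
    #   vadd.w.vv v3, v7, v11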
Instruction | func2 | Notes |
---|---|---|
vld | 00 xx0PSL | 1-arg |
vld.l | 01 xx0PSL | |
vld.s | 02 xx0PSL | |
vld.p | 04 xx0PSL | 1 or 2-arg |
vld.lp | 05 xx0PSL | |
vld.sp | 06 xx0PSL | |
vld.tp | 07 xx0PSL | |
vst | 08 xx1PSL | 1-arg |
vst.l | 09 xx1PSL | |
vst.s | 10 xx1PSL | |
vst.p | 12 xx1PSL | 1 or 2-arg |
vst.lp | 13 xx1PSL | |
vst.sp | 14 xx1PSL | |
vst.tp | 15 xx1PSL | |
vdup.x | 16 x10000 | |
vcget | 20 x10100 | 0-arg |
vstq.s | 26 x11PSL | |
vstq.sp | 30 x11PSL |
To save encoding space, the toolchain uses compile-time knowledge: if vld.p.xx or vst.p.xx would post-increment by a zero amount, x0 is not encoded; instead the post-increment is disabled, reusing the xs2==x0 encoding for vld.p.x or vst.p.x, which have different base-update behavior. If the post-increment amount is programmatic, a register other than x0 must be used.
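A sketch of the two resulting base updates (a0 and t0 are illustrative registers; the updates follow the vld operation pseudocode later in this section):

    vld.w.p.xx v0, a0, t0   # t0 != x0: a0 += t0 * sizeof(i32)
    vld.w.p.x  v0, a0       # xs2==x0 encoding reused: a0 += one register width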
NOTE: Scalar register xs1 uses the same encoding bitfield as the vector register vs1, but HAS ONE BIT PADDED AT THE LSB; that is, xs1 has the same encoding as regular RISC-V instructions (bits [19:15]). On the other hand, xs2 shares the same encoding bitfield as vs2, but HAS ONE BIT PADDED AT THE MSB, so it is consistent with regular RISC-V instructions (bits [24:20]).
Instructions of the format “op.xx vd, xs1, x0” (xs2=x0, the scalar zero register) are reduced to the shortened form “op.x vd, xs1”.
Instructions of the format “op.xx vd, x0, x0” (xs1=x0, xs2=x0, the scalar zero register) are reduced to the shortened form “op vd”.
Single argument vector operations “.v” use xs2 scalar encoding “x0|zero”.
Instruction | func2 | func1 / Notes |
---|---|---|
Arithmetic | ... | 000 |
vadd | 00 xxxxxx | |
vsub | 01 xxxxxx | |
vrsub | 02 xxxxxx | |
veq | 06 xxxxxx | |
vne | 07 xxxxxx | |
vlt.{u} | 08 xxxxxU | |
vle.{u} | 10 xxxxxU | |
vgt.{u} | 12 xxxxxU | |
vge.{u} | 14 xxxxxU | |
vabsd.{u} | 16 xxxxxU | |
vmax.{u} | 18 xxxxxU | |
vmin.{u} | 20 xxxxxU | |
vadd3 | 24 xxxxxx | |
Arithmetic2 | ... | 100 |
vadds.{u} | 00 xxxxxU | |
vsubs.{u} | 02 xxxxxU | |
vaddw.{u} | 04 xxxxxU | |
vsubw.{u} | 06 xxxxxU | |
vacc.{u} | 10 xxxxxU | |
vpadd.{u} | 12 xxxxxU | .v |
vpsub.{u} | 14 xxxxxU | .v |
vhadd.{ur} | 16 xxxxRU | |
vhsub.{ur} | 20 xxxxRU | |
Logical | ... | 001 |
vand | 00 xxxxxx | |
vor | 01 xxxxxx | |
vxor | 02 xxxxxx | |
vnot | 03 xxxxxx | .v |
vrev | 04 xxxxxx | |
vror | 05 xxxxxx | |
vclb | 08 xxxxxx | .v |
vclz | 09 xxxxxx | .v |
vcpop | 10 xxxxxx | .v |
vmv | 12 xxxxxx | .v |
vmvp | 13 xxxxxx | |
acset | 16 xxxxxx | |
actr | 17 xxxxxx | .v |
adwinit | 18 xxxxxx | |
Shift | ... | 010 |
vsll | 01 xxxxxx | |
vsra | 02 xxxxx0 | |
vsrl | 03 xxxxx1 | |
vsha.{r} | 08 xxxxR0 | +/- shamt |
vshl.{r} | 09 xxxxR1 | +/- shamt |
vsrans{u}.{r} | 16 xxxxRU | narrowing saturating (x2) |
vsraqs{u}.{r} | 24 xxxxRU | narrowing saturating (x4) |
Mul/Div | ... | 011 |
vmul | 00 xxxxxx | |
vmuls | 02 xxxxxU | |
vmulw | 04 xxxxxU | |
vmulh.{ur} | 08 xxxxRU | |
vdmulh.{rn} | 16 xxxxRN | |
vmacc | 20 xxxxxx | |
vmadd | 21 xxxxxx | |
Float | ... | 101 |
--reserved-- | xx xxxxxx | |
Shuffle | ... | 110 |
vslidevn | 00 xxxxNN | |
vslidehn | 04 xxxxNN | |
vslidevp | 08 xxxxNN | |
vslidehp | 12 xxxxNN | |
vsel | 16 xxxxxx | |
vevn | 24 xxxxxx | |
vodd | 25 xxxxxx | |
vevnodd | 26 xxxxxx | |
vzip | 28 xxxxxx | |
Reserved7 | ... | 111 |
--reserved-- | xx xxxxxx |
Instruction | func3 | Notes |
---|---|---|
aconv | 8 | scalar: sign |
vdwconv | 10 | scalar: sign/type/swizzle |
Operations that do not have a {.b,.h,.w} type have the same behavior regardless of the size field (bitwise: vand, vnot, vor, vxor; move: vmv, vmvp). The tooling convention is to use size=0b00 “.b” encoding.
The “.tp” mode of vld or vst uses the four registers of “.m” in a vertical arrangement, compared to the horizontal usage of the other modes. The “.m” base update is a single register width, vs. 4x the width in the other modes. The usage model is four “lines” processed at the same time, vs. a single line chained together in the other “.m” modes.
Horizontal
    ... AAAA BBBB CCCC DDDD ...

Vertical (".tp")
    ... AAAA ...
    ... BBBB ...
    ... CCCC ...
    ... DDDD ...
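A sketch of a transpose-mode load (t0 is an illustrative line pitch; per the vld operation pseudocode, each of v0..v3 receives one line and the base advances by a single register width):

    vld.b.tp.xx.m v0, a0, t0   # v0..v3 = four lines, successive lines t0 elements apart
                               # a0 += one register width (not 4x)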
vneg.v ← vrsub.vx vd, vs1, zero
vabs.v ← vabsd.vx vd, vs1, zero
vwiden.v ← vaddw.vx vd, vs1, zero
The execution model is designed for OS-less and interrupt-less operation. A machine typically operates as run-to-completion of small restartable workloads. A user/machine mode split is provided as a runtime convenience, though there is no difference in access permissions between the modes.
31..28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19..15 | 14..12 | 11..7 | 6..2 | 1 | 0 | OP |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0000 | PI | PO | PR | PW | SI | SO | SR | SW | 00000 | 000 | 00000 | 00011 | 1 | 1 | FENCE |
31..28 | 27..24 | 23..20 | 19..15 | 14..12 | 11..7 | 6..2 | 1 | 0 | OP |
---|---|---|---|---|---|---|---|---|---|
0000 | 0000 | 0000 | 00000 | 001 | 00000 | 00011 | 1 | 1 | FENCE.I |
31..27 | 26..25 | 24..20 | 19..15 | 14..12 | 11..7 | 6..2 | 1 | 0 | OP |
---|---|---|---|---|---|---|---|---|---|
00100 | 11 | 00000 | xs1 | 000 | 00000 | 11101 | 1 | 1 | FLUSH |
0001M | sz | xs2 | xs1 | 000 | xd | 11101 | 1 | 1 | GET{MAX}VL |
01111 | 00 | 00000 | xs1 | mode | 00000 | 11101 | 1 | 1 | [F,S,K,C]LOG |
31..20 | 19..15 | 14..12 | 11..7 | 6..2 | 1 | 0 | OP |
---|---|---|---|---|---|---|---|
000000000001 | 00000 | 000 | 00000 | 11100 | 1 | 1 | EBREAK |
001100000010 | 00000 | 000 | 00000 | 11100 | 1 | 1 | MRET |
000010000000 | 00000 | 000 | 00000 | 11100 | 1 | 1 | MPAUSE |
000001100000 | 00000 | 000 | 00000 | 11100 | 1 | 1 | ECTXSW |
000001000000 | 00000 | 000 | 00000 | 11100 | 1 | 1 | EYIELD |
000000100000 | 00000 | 000 | 00000 | 11100 | 1 | 1 | EEXIT |
000000000000 | 00000 | 000 | 00000 | 11100 | 1 | 1 | ECALL |
enum_IDLE = 0
enum_EBREAK = 1
enum_ECALL = 2
enum_EEXIT = 3
enum_EYIELD = 4
enum_ECTXSW = 5
enum_UNDEF_INST = (1u<<31) | 2
enum_USAGE_FAULT = (1u<<31) | 16
Cache clean and invalidate operations at the private level
Encodings
flushat xs1
flushall
Operation
Start = End = xs1
Line = xs1
The instruction is a standard way of describing cache maintenance operations.
Type | Visibility | System1 | System2 |
---|---|---|---|
Private | Core | Core L1 | Core L1 + Coherent L2 |
Enforce memory ordering of loads and stores for external visibility.
Encodings
fence [i|o|r|w], [i|o|r|w]
fence
Operation
PI  predecessor I/O input
PO  predecessor I/O output
PR  predecessor memory read
PW  predecessor memory write
    <ordering between marked predecessors and successors>
SI  successor I/O input
SO  successor I/O output
SR  successor memory read
SW  successor memory write
Note: a simplified implementation may have the frontend stall until all preceding operations are completed before permitting any trailing instruction to be dispatched.
Ensure subsequent instruction fetches observe prior data operations.
Encodings
fence.i
Operation
InvalidateInstructionCaches()
InvalidateInstructionPrefetchBuffers()
Calculate the vector length.
Encodings
getvl.[b,h,w].x xd, xs1
getvl.[b,h,w].xx xd, xs1, xs2
getvl.[b,h,w].x.m xd, xs1
getvl.[b,h,w].xx.m xd, xs1, xs2
Operation
xd = min(vl.type.size, unsigned(xs1), xs2 ? unsigned(xs2) : ignore)
Find the minimum of the maximum vector length for the type and the two input values. If xs2 is zero (either x0 or a register containing zero) it is ignored (treated as MaxInt); a nonzero xs2 acts as an additional clamp below maxvl.
Type | Instruction | Description |
---|---|---|
00 | getvl.b | 8bit lane count |
01 | getvl.h | 16bit lane count |
10 | getvl.w | 32bit lane count |
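A sketch of typical uses (register names are illustrative; the “.m” forms are assumed to scale the per-type maximum by 4x, matching the stripmine expansion):

    getvl.b.x   t1, a2       # t1 = min(maxvl.b, a2)
    getvl.w.xx  t2, a3, a4   # t2 = min(maxvl.w, a3, a4)
    getvl.b.x.m t3, a2       # t3 = min(4 * maxvl.b, a2)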
Obtain the maximum vector length.
Encodings
getmaxvl.[b,h,w].{m} xd
Operation
xd = vl.type.size
Type | Instruction | Description |
---|---|---|
00 | getmaxvl.b | 8bit lane count |
01 | getmaxvl.h | 16bit lane count |
10 | getmaxvl.w | 32bit lane count |
For a machine with 256-bit SIMD registers:

Instruction | Result | With “.m” |
---|---|---|
getmaxvl.b | 32 | 128 |
getmaxvl.h | 16 | 64 |
getmaxvl.w | 8 | 32 |
Execution call to supervisor OS.
Encodings
ecall
Operation
if (mode == User)
  mcause = enum_ECALL
  mepc = pc
  pc = mtvec
  mode = Machine
else
  mcause = enum_USAGE_FAULT
  mfault = pc
  EndExecution
Execution exit to supervisor OS.
Encodings
eexit
Operation
if (mode == User)
  mcause = enum_EEXIT
  mepc = pc
  pc = mtvec
  mode = Machine
else
  mcause = enum_USAGE_FAULT
  mfault = pc
  EndExecution
Synchronous execution switch to supervisor OS.
Encodings
eyield
Operation
if (mode == User)
  if (YIELD_REQUEST == 1)
    mcause = enum_EYIELD
    mepc = pc + 4          # advance to next instruction
    pc = mtvec
    mode = Machine
  else
    NOP                    # pc = pc + 4
else
  mcause = enum_USAGE_FAULT
  mfault = pc
  EndExecution
YIELD_REQUEST refers to a signal the supervisor core sets to request a context switch.
Note: use when MIE=0; eyield is inserted at synchronization points for cooperative context switching.
Asynchronous execution switch to supervisor OS.
Encodings
ectxsw
Operation
if (mode == User)
  mcause = enum_ECTXSW
  mepc = pc
  pc = mtvec
  mode = Machine
else
  mcause = enum_USAGE_FAULT
  mfault = pc
  EndExecution
Execution breakpoint to supervisor OS.
Encodings
ebreak
Operation
if (mode == User)
  mcause = enum_EBREAK
  mepc = pc
  pc = mtvec
  mode = Machine
else
  mcause = enum_UNDEF_INST
  mfault = pc
  EndExecution
Return from machine mode to user mode.
Encodings
mret
Operation
if (mode == Machine)
  pc = mepc
  mode = User
else
  mcause = enum_UNDEF_INST
  mepc = pc
  pc = mtvec
  mode = Machine
Machine pause and release for next execution context.
Encodings
mpause
Operation
if (mode == Machine)
  EndExecution
else
  mcause = enum_UNDEF_INST
  mepc = pc
  pc = mtvec
  mode = Machine
Absolute difference with unsigned result.
Encodings
vabsd.[b,h,w].{u}.vv.{m} vd, vs1, vs2
vabsd.[b,h,w].{u}.vx.{m} vd, vs1, xs2
Operation
for L in Op.typelen
  vd[L] = vs1[L] > vs2[L] ? vs1[L] - vs2[L] : vs2[L] - vs1[L]
Note: for signed(INTx_MAX - INTx_MIN) the result will be UINTx_MAX.
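A worked example of the note for 8b lanes:

    vabsd.b(0x7f, 0x80) = 0xff   # |127 - (-128)| = 255 = UINT8_MAX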
Accumulates a value into a wider register.
Encodings
vacc.[h,w].{u}.vv.{m} vd, vs1, vs2
vacc.[h,w].{u}.vx.{m} vd, vs1, xs2
Operation
for L in Op.typelen
  {vd+0}[L] = {vs1+0}[L] + vs2.asHalfType[2*L+0]
  {vd+1}[L] = {vs1+1}[L] + vs2.asHalfType[2*L+1]
Add operands.
Encodings
vadd.[b,h,w].vv.{m} vd, vs1, vs2
vadd.[b,h,w].vx.{m} vd, vs1, xs2
Operation
for L in Op.typelen
  vd[L] = vs1[L] + vs2[L]
Add operands with saturation.
Encodings
vadds.[b,h,w].{u}.vv.{m} vd, vs1, vs2
vadds.[b,h,w].{u}.vx.{m} vd, vs1, xs2
Operation
for L in Op.typelen
  vd[L] = Saturate(vs1[L] + vs2[L])
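Worked examples for 8b lanes:

    vadds.b(0x7f, 0x01)   = 0x7f   # 127 + 1   saturates to INT8_MAX
    vadds.b(0x80, 0xff)   = 0x80   # -128 - 1  saturates to INT8_MIN
    vadds.b.u(0xff, 0x01) = 0xff   # 255 + 1   saturates to UINT8_MAX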
Add operands with widening.
Encodings
vaddw.[h,w].{u}.vv.{m} vd, vs1, vs2
vaddw.[h,w].{u}.vx.{m} vd, vs1, xs2
Operation
for L in Op.typelen
  {vd+0}[L] = vs1.asHalfType[2*L+0] + vs2.asHalfType[2*L+0]
  {vd+1}[L] = vs1.asHalfType[2*L+1] + vs2.asHalfType[2*L+1]
Add three operands.
Encodings
vadd3.[w].vv.{m} vd, vs1, vs2
vadd3.[w].vx.{m} vd, vs1, xs2
Operation
for L in i32.typelen
  vd[L] = vd[L] + vs1[L] + vs2[L]
AND operands.
Encodings
vand.vv.{m} vd, vs1, vs2
vand.[b,h,w].vx.{m} vd, vs1, xs2
Operation
for L in Op.typelen
  vd[L] = vs1[L] & vs2[L]
Performs matmul vs1*vs3, accumulating into the accumulator.
Encodings
aconv.vxv vd, vs1, xs2, vs3
Encoding ‘aconv’ uses a ‘1’ in the unused 5th bit (b25) of vs2.
Operation
# 8b:  0123456789abcdef
# 32b: 048c 26ae 159d 37bf
assert(vd == 48)
N = is_simd512 ? 16 : is_simd256 ? 8 : assert(0)

func Interleave(Y, L):
  m = L % 4
  if (m == 0) (Y & ~3) + 0
  if (m == 1) (Y & ~3) + 2
  if (m == 2) (Y & ~3) + 1
  if (m == 3) (Y & ~3) + 3

# i32 += i8 x i8 (u*u, u*s, s*u, s*s)
for Y in [0..N-1]
  for X in [Start..Stop]
    for L in i8.typelen
      Data1 = {vs1+Y}.i8[4*X + L&3]   # 'transpose and broadcast'
      Data2 = {vs3+X-Start}.u8[L]
      {Accum+Interleave(Y,L)}[L / 4] +=
          ((signed(SData1, Data1{7:0}) + signed(Bias1{8:0})){9:0} *
           (signed(SData2, Data2{7:0}) + signed(Bias2{8:0})){9:0}){18:0}
vs1 goes to the narrow port of the matmul. 8 vectors are always used.
vs3 goes to the wide port of the matmul, up to 8 vectors are used.
xs2 specifies control parameters used in the operation and has the following format:

Mode | Encoding | Usage |
---|---|---|
Common | xs2 | Mode[1:0] Start[6:2] Stop[11:7] |
s8 | 0 | SData2[31] SBias2[30:22] SData1[21] SBias1[20:12] |
Start and Stop control the window of input values that participate in the matmul.
When using SIMD256, the following operands are valid:
Notes:
Copy convolution accumulators into general registers.
Encodings
vcget vd
Operation
assert(vd == 48)
N = is_simd512 ? 15 : is_simd256 ? 7 : assert(0)
for Y in [0..N]
  vd{Y} = Accum{Y}
  Accum{Y} = 0
v48 is the only valid vd in this instruction.
Copy general registers into convolution accumulators.
Encodings
acset.v vd, vs1
Operation
assert(vd == 48)
N = is_simd512 ? 15 : is_simd256 ? 7 : assert(0)
for Y in [0..N]
  Accum{Y} = vs1{Y}
Note that v48 is used as vd but never written to.
Transpose a register group into the convolution accumulators.
Encodings
actr.[w].v.{m} vd, vs1
Operation
assert(vd == 48)
assert(vs1 in {v0, v16, v32, v48})
for I in i32.typelen
  for J in i32.typelen
    ACCUM[J][I] = vs1[I][J]
Note that v48 is used as vd but never written to.
Count the leading bits.
Encodings
vclb.[b,h,w].v.{m} vd, vs1
Operation
MSB = 1 << (vtype.bits - 1)
for L in Op.typelen
  vd[L] = vs1[L] & MSB ? CLZ(~vs1[L]) : CLZ(vs1[L])
Note: (clb - 1) is equivalent to __builtin_clrsb.
clb examples
clb.w(0xffffffff) = 32
clb.w(0xcfffffff) = 2
clb.w(0x80001000) = 1
clb.w(0x00007fff) = 17
clb.w(0x00000000) = 32
Count the leading zeros.
Encodings
vclz.[b,h,w].v.{m} vd, vs1
Operation
for L in Op.typelen
  vd[L] = CLZ(vs1[L])
Note: vclz.[b,h,w] of zero returns 8, 16, and 32 respectively.
Depthwise convolution 3-way multiply accumulate.
Encodings
vdwconv.vxv vd, vs1, xs2, vs3
adwconv.vxv vd, vs1, xs2, vs3
Encoding ‘adwconv’ uses a ‘1’ in the unused 5th bit (b25) of vs2.
Operation
The vertical axis is typically tiled, which requires preserving registers for this functionality. The sparse formats use shuffles so that additional registers of intermediate state are not needed.
# quant8
{vs1+0,vs1+1,vs1+2} = Rebase({vs1}, Mode::RegBase)

{b0} = {vs3+0}.asByteType
{b1} = {vs3+1}.asByteType
{b2} = {vs3+2}.asByteType

if IsDenseFormat
  a0 = {vs1+0}.asByteType
  a1 = {vs1+1}.asByteType
  a2 = {vs1+2}.asByteType
if IsSparseFormat1   # [n-1,n,n+1]
  a0 = vslide_p({vs1+1}, {vs1+0}, 1).asByteType
  a1 = {vs1+1}.asByteType
  a2 = vslide_n({vs1+1}, {vs1+2}, 1).asByteType
if IsSparseFormat2   # [n,n+1,n+2]
  a0 = {vs1+0}.asByteType
  a1 = vslide_n({vs1+0}, {vs1+1}, 1).asByteType
  a2 = vslide_n({vs1+0}, {vs1+1}, 2).asByteType

# 8b:  0123456789abcdef
# 32b: 048c 26ae 159d 37bf
func Interleave(L):
  i = L % 4
  if (i == 0) 0
  if (i == 1) 2
  if (i == 2) 1
  if (i == 3) 3

for L in Op.typelen
  B = 4*L   # 8b --> 32b
  for i in [0..3]
    # int19_t multiply results
    # int23_t addition results
    # int32_t storage
    {dwacc+i}[L/4] += (SData1(a0[B+i]) + bias1) * (SData2(b0[B+i]) + bias2) +
                      (SData1(a1[B+i]) + bias1) * (SData2(b1[B+i]) + bias2) +
                      (SData1(a2[B+i]) + bias1) * (SData2(b2[B+i]) + bias2)

if is_vdwconv   // !adwconv
  for i in [0..3]
    {vd+i} = {dwacc+i}
Mode | Encoding | Usage |
---|---|---|
Common | xs2 | Mode[1:0] Sparsity[3:2] RegBase[7:4] |
q8 | 0 | SData2[31] SBias2[30:22] SData1[21] SBias1[20:12] |
Mode::Sparsity selects the swizzle pattern.
Sparsity | Format | Swizzle |
---|---|---|
b00 | Dense | none |
b01 | Sparse1 | [n-1,n,n+1] |
b10 | Sparse2 | [n,n+1,n+2] |
The Mode::RegBase allows for the start point of the 3 register group to allow for cycling of [prev,curr,next] values.
RegBase | Prev | Curr | Next |
---|---|---|---|
b0000 | {vs1+0} | {vs1+1} | {vs1+2} |
b0001 | {vs1+1} | {vs1+2} | {vs1+3} |
b0010 | {vs1+2} | {vs1+3} | {vs1+4} |
b0011 | {vs1+3} | {vs1+4} | {vs1+5} |
b0100 | {vs1+4} | {vs1+5} | {vs1+6} |
b0101 | {vs1+5} | {vs1+6} | {vs1+7} |
b0110 | {vs1+6} | {vs1+7} | {vs1+8} |
b0111 | {vs1+1} | {vs1+0} | {vs1+2} |
b1000 | {vs1+1} | {vs1+2} | {vs1+0} |
b1001 | {vs1+3} | {vs1+4} | {vs1+0} |
b1010 | {vs1+5} | {vs1+6} | {vs1+0} |
b1011 | {vs1+7} | {vs1+8} | {vs1+0} |
b1100 | {vs1+2} | {vs1+0} | {vs1+1} |
b1101 | {vs1+4} | {vs1+0} | {vs1+1} |
b1110 | {vs1+6} | {vs1+0} | {vs1+1} |
b1111 | {vs1+8} | {vs1+0} | {vs1+1} |
RegBase supports up to 3x3, 5x5, 7x7, and 9x9 kernels; alternatively, the extra horizontal range can be used for input latency hiding.
The vdwconv instruction includes non-architectural accumulator state to increase register-file bandwidth. The adwinit instruction must be used to prepare the depthwise accumulator for a sequence of vdwconv instructions, and the sequence must be dispatched without other instructions interleaved, otherwise the results are unpredictable. Should other operations be required, an adwinit must be inserted to resume the sequence.
On a context-switch save, where the accumulator must be saved alongside the architectural SIMD registers, v0..v63 are saved to the thread stack or TCB, and then a vdwconv with vdup-prepared zero inputs can be used to write the accumulator values to SIMD registers, which are then saved to memory. On a context-switch restore, the values can be loaded from memory and set in the accumulator registers using the adwinit instruction.
Load the depthwise convolution accumulator state.
Encodings
adwinit.v vd, vs1
Operation
for L in Op.typelen
  {dwacc+0}[L] = {vs1+0}[L]
  {dwacc+1}[L] = {vs1+1}[L]
  {dwacc+2}[L] = {vs1+2}[L]
  {dwacc+3}[L] = {vs1+3}[L]
Saturating signed doubling multiply returning high half with optional rounding.
Encodings
vdmulh.[b,h,w].{r,rn}.vv.{m} vd, vs1, vs2
vdmulh.[b,h,w].{r,rn}.vx.{m} vd, vs1, xs2
Operation
SZ = vtype.size * 8
for L in Op.typelen
  LHS = SignExtend(vs1[L], 2*SZ)
  RHS = SignExtend(vs2[L], 2*SZ)
  MUL = LHS * RHS
  RND = R ? (N && MUL < 0 ? -(1<<(SZ-1)) : (1<<(SZ-1))) : 0
  vd[L] = SignedSaturation(2 * MUL + RND)[2*SZ-1:SZ]
Note: saturation is only needed for MaxNeg inputs (eg. 0x80000000).
Note: vdmulh.w.r.vx.m is used in ML activations so may be optimized by implementations.
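Worked examples of the rounding and saturation for 32b lanes:

    vdmulh.w.r(0x40000000, 0x40000000) = 0x20000000   # (2*(2^60) + 2^31)[63:32]
    vdmulh.w.r(0x80000000, 0x80000000) = 0x7fffffff   # MaxNeg * MaxNeg saturates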
Duplicate a scalar value into a vector register.
Encodings
vdup.[b,h,w].x.{m} vd, xs2
Operation
for L in Op.typelen
  vd[L] = xs2
Integer equal comparison.
Encodings
veq.[b,h,w].vv.{m} vd, vs1, vs2
veq.[b,h,w].vx.{m} vd, vs1, xs2
Operation
for L in Op.typelen
  vd[L] = vs1[L] == vs2[L] ? 1 : 0
Even/odd of concatenated registers.
Encodings
vevn.[b,h,w].vv.{m} vd, vs1, vs2
vevn.[b,h,w].vx.{m} vd, vs1, xs2
vodd.[b,h,w].vv.{m} vd, vs1, vs2
vodd.[b,h,w].vx.{m} vd, vs1, xs2
vevnodd.[b,h,w].vv.{m} vd, vs1, vs2
vevnodd.[b,h,w].vx.{m} vd, vs1, xs2
Operation
M = Op.typelen / 2
if vevn || vevnodd
  {dst0} = {vd+0}
  {dst1} = {vd+1}
if vodd
  {dst1} = {vd+0}

if vevn || vevnodd
  for L in Op.typelen
    dst0[L] = L < M ? vs1[2 * L + 0] : vs2[2 * (L - M) + 0]   # even
if vodd || vevnodd
  for L in Op.typelen
    dst1[L] = L < M ? vs1[2 * L + 1] : vs2[2 * (L - M) + 1]   # odd

where:
  vs1 = 0x33221100
  vs2 = 0x77665544
  {vd+0} = 0x66442200
  {vd+1} = 0x77553311
Integer greater-than-or-equal comparison.
Encodings
vge.[b,h,w].{u}.vv.{m} vd, vs1, vs2
vge.[b,h,w].{u}.vx.{m} vd, vs1, xs2
Operation
for L in Op.typelen
  vd[L] = vs1[L] >= vs2[L] ? 1 : 0
Integer greater-than comparison.
Encodings
vgt.[b,h,w].{u}.vv.{m} vd, vs1, vs2
vgt.[b,h,w].{u}.vx.{m} vd, vs1, xs2
Operation
for L in Op.typelen
  vd[L] = vs1[L] > vs2[L] ? 1 : 0
Halving addition with optional rounding bit.
Encodings
vhadd.[b,h,w].{r,u,ur}.vv.{m} vd, vs1, vs2
vhadd.[b,h,w].{r,u,ur}.vx.{m} vd, vs1, xs2
Operation
for L in Op.typelen
  if IsSigned()
    vd[L] = (signed(vs1[L]) + signed(vs2[L]) + R) >> 1
  else
    vd[L] = (unsigned(vs1[L]) + unsigned(vs2[L]) + R) >> 1
Halving subtraction with optional rounding bit.
Encodings
vhsub.[b,h,w].{r,u,ur}.vv.{m} vd, vs1, vs2
vhsub.[b,h,w].{r,u,ur}.vx.{m} vd, vs1, xs2
Operation
for L in Op.typelen
  if IsSigned()
    vd[L] = (signed(vs1[L]) - signed(vs2[L]) + R) >> 1
  else
    vd[L] = (unsigned(vs1[L]) - unsigned(vs2[L]) + R) >> 1
Vector load from memory with optional post-increment by scalar.
Encodings
vld.[b,h,w].{p}.x.{m} vd, xs1
vld.[b,h,w].[l,p,s,lp,sp,tp].xx.{m} vd, xs1, xs2
Operation
addr = xs1
sm = Op.m ? 4 : 1
len = min(Op.typelen * sm, unsigned(xs2))
for M in Op.m
  for L in Op.typelen
    if !Op.bit.l || (L + M * Op.typelen) < len
      vd[L] = mem[addr + L].type
    else
      vd[L] = 0
  if (Op.bit.s)
    addr += xs2 * sizeof(type)
  else
    addr += Reg.bytes
if Op.bit.p
  if Op.bit.l && Op.bit.s                 # .tp
    xs1 += Reg.bytes
  elif !Op.bit.l && !Op.bit.s && !{xs2}   # .p.x
    xs1 += Reg.bytes * sm
  elif Op.bit.l                           # .lp
    xs1 += len * sizeof(type)
  elif Op.bit.s                           # .sp
    xs1 += xs2 * sizeof(type) * sm
  else                                    # .p.xx
    xs1 += xs2 * sizeof(type)
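A sketch of a length-bounded copy loop built from these modes (register roles are illustrative: a0 = src, a1 = dst, a2 = bytes remaining; 256-bit registers assumed, so the stripmine step is 128 bytes):

    loop:
      getvl.b.x.m   t1, a2       # t1 = min(128, a2)
      vld.b.lp.xx.m v0, a0, a2   # load t1 bytes, zero-fill the tail, a0 += t1
      vst.b.lp.xx.m v0, a1, a2   # store t1 bytes, a1 += t1
      sub           a2, a2, t1
      bnez          a2, loop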
Integer less-than-or-equal comparison.
Encodings
vle.[b,h,w].{u}.vv.{m} vd, vs1, vs2
vle.[b,h,w].{u}.vx.{m} vd, vs1, xs2
Operation
for L in Op.typelen
  vd[L] = vs1[L] <= vs2[L] ? 1 : 0
Integer less-than comparison.
Encodings
vlt.[b,h,w].{u}.vv.{m} vd, vs1, vs2
vlt.[b,h,w].{u}.vx.{m} vd, vs1, xs2
Operation
for L in Op.typelen
  vd[L] = vs1[L] < vs2[L] ? 1 : 0
Multiply accumulate.
Encodings
vmacc.[b,h,w].vv.{m} vd, vs1, vs2
vmacc.[b,h,w].vx.{m} vd, vs1, xs2
Operation
for L in Op.typelen
  vd[L] += vs1[L] * vs2[L]
Multiply add.
Encodings
vmadd.[b,h,w].vv.{m} vd, vs1, vs2
vmadd.[b,h,w].vx.{m} vd, vs1, xs2
Operation
for L in Op.typelen
  vd[L] = vd[L] * vs2[L] + vs1[L]
Find the unsigned or signed maximum of two registers.
Encodings
vmax.[b,h,w].{u}.vv.{m} vd, vs1, vs2
vmax.[b,h,w].{u}.vx.{m} vd, vs1, xs2
Operation
for L in Op.typelen
  vd[L] = vs1[L] > vs2[L] ? vs1[L] : vs2[L]
Find the minimum of two registers.
Encodings
vmin.[b,h,w].{u}.vv.{m} vd, vs1, vs2
vmin.[b,h,w].{u}.vx.{m} vd, vs1, xs2
Operation
for L in Op.typelen
  vd[L] = vs1[L] < vs2[L] ? vs1[L] : vs2[L]
Multiply two registers.
Encodings
vmul.[b,h,w].vv.{m} vd, vs1, vs2
vmul.[b,h,w].vx.{m} vd, vs1, xs2
Operation
for L in Op.typelen
  vd[L] = vs1[L] * vs2[L]
Multiply two registers with saturation.
Encodings
vmuls.[b,h,w].{u}.vv.{m} vd, vs1, vs2
vmuls.[b,h,w].{u}.vx.{m} vd, vs1, xs2
Operation
for L in Op.typelen
  vd[L] = Saturation(vs1[L] * vs2[L])
Multiply two registers with widening.
Encodings
vmulw.[h,w].{u}.vv.{m} vd, vs1, vs2
vmulw.[h,w].{u}.vx.{m} vd, vs1, xs2
Operation
for L in Op.typelen
  {vd+0}[L] = vs1.asHalfType[2*L+0] * vs2.asHalfType[2*L+0]
  {vd+1}[L] = vs1.asHalfType[2*L+1] * vs2.asHalfType[2*L+1]
Multiply two registers with widening, returning the high half.
Encodings
vmulh.[b,h,w].{u}.{r}.vv.{m} vd, vs1, vs2
vmulh.[b,h,w].{u}.{r}.vx.{m} vd, vs1, xs2
Operation
SZ = vtype.size * 8
RND = IsRounded ? 1<<(SZ-1) : 0
for L in Op.typelen
  if IsU()
    vd[L] = (unsigned(vs1[L]) * unsigned(vs2[L]) + RND)[2*SZ-1:SZ]
  else if IsSU()
    vd[L] = (  signed(vs1[L]) * unsigned(vs2[L]) + RND)[2*SZ-1:SZ]
  else
    vd[L] = (  signed(vs1[L]) *   signed(vs2[L]) + RND)[2*SZ-1:SZ]
Move a register.
Encodings
vmv.v.{m} vd, vs1
Operation
for L in Op.typelen
  vd[L] = vs1[L]
Note: in the stripmined case an implementation may deliver more than one write per cycle.
Move a pair of registers.
Encodings
vmvp.vv.{m} vd, vs1, vs2
vmvp.[b,h,w].vx.{m} vd, vs1, xs2
Operation
for L in Op.typelen
  {vd+0}[L] = vs1[L]
  {vd+1}[L] = vs2[L]
Integer not-equal comparison.
Encodings
vne.[b,h,w].vv.{m} vd, vs1, vs2
vne.[b,h,w].vx.{m} vd, vs1, xs2
Operation
for L in Op.typelen
  vd[L] = vs1[L] != vs2[L] ? 1 : 0
Bitwise NOT a register.
Encodings
vnot.v.{m} vd, vs1
Operation
for L in Op.typelen
  vd[L] = ~vs1[L]
OR two operands.
Encodings
vor.vv.{m} vd, vs1, vs2
vor.[b,h,w].vx.{m} vd, vs1, xs2
Operation
for L in Op.typelen
  vd[L] = vs1[L] | vs2[L]
Adds the lane pairs.
Encodings
vpadd.[h,w].{u}.v.{m} vd, vs1
Operation
if .v
  for L in Op.typelen
    vd[L] = vs1.asHalfType[2 * L] + vs1.asHalfType[2 * L + 1]
Subtracts the lane pairs.
Encodings
vpsub.[h,w].{u}.v.{m} vd, vs1
Operation
if .v
  for L in Op.typelen
    vd[L] = vs1.asHalfType[2 * L] - vs1.asHalfType[2 * L + 1]
Count the set bits.
Encodings
vcpop.[b,h,w].v.{m} vd, vs1
Operation
for L in Op.typelen
  vd[L] = CountPopulation(vs1[L])
Generalized reverse using bit ladder.
The size of the flip is based on log2 of the data-type width.
Encodings
vrev.[b,h,w].vv.{m} vd, vs1, vs2
vrev.[b,h,w].vx.{m} vd, vs1, xs2
Operation
N = vtype.bits - 1   # 7, 15, 31
shamt = xs2[4:0] & N
for L in Op.typelen
  r = vs1[L]
  if (shamt & 1)  r = ((r & 0x55..) << 1)  | ((r & 0xAA..) >> 1)
  if (shamt & 2)  r = ((r & 0x33..) << 2)  | ((r & 0xCC..) >> 2)
  if (shamt & 4)  r = ((r & 0x0F..) << 4)  | ((r & 0xF0..) >> 4)
  if (sz == 0) vd[L] = r; continue;
  if (shamt & 8)  r = ((r & 0x00..) << 8)  | ((r & 0xFF..) >> 8)
  if (sz == 1) vd[L] = r; continue;
  if (shamt & 16) r = ((r & 0x00..) << 16) | ((r & 0xFF..) >> 16)
  vd[L] = r
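Worked examples (shamt = 31 applies all five swap stages; shamt = 24 swaps at 8 and 16 only):

    vrev.w(0x01234567, 31) = 0xe6a2c480   # full 32b bit reversal
    vrev.w(0x01234567, 24) = 0x67452301   # byte swap
    vrev.b(0x01, 7)        = 0x80         # per-lane 8b bit reversal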
Logical rotate right.
Encodings
vror.[b,h,w].vv.{m} vd, vs1, vs2
vror.[b,h,w].vx.{m} vd, vs1, xs2
Operation
N = vtype.bits - 1   # 7, 15, 31
shamt = xs2[4:0] & N
for L in Op.typelen
  r = vs1[L]
  if (shamt & 1)  for (B in vtype.bits) r[B] = r[(B+1) % vtype.bits]
  if (shamt & 2)  for (B in vtype.bits) r[B] = r[(B+2) % vtype.bits]
  if (shamt & 4)  for (B in vtype.bits) r[B] = r[(B+4) % vtype.bits]
  if (shamt & 8)  for (B in vtype.bits) r[B] = r[(B+8) % vtype.bits]
  if (shamt & 16) for (B in vtype.bits) r[B] = r[(B+16) % vtype.bits]
  vd[L] = r
Arithmetic and logical left/right shift with saturating shift amount and result.
Encodings
vsha.[b,h,w].{r}.vv.{m} vd, vs1, vs2
vshl.[b,h,w].{r}.vv.{m} vd, vs1, vs2
Operation
M = Op.size                        # 8, 16, 32
N = [8->3, 16->4, 32->5][Op.size]
for L in Op.typelen
  SHSAT[L] = vs2[L][M-1:N] != 0
  SHAMT[L] = vs2[L][N-1:0]
  RND = R && SHAMT ? 1 << (SHAMT-1) : 0
  RND -= N && (vs1[L] < 0) ? 1 : 0
  SZ = sizeof(src.type) * 8 * (W ? 2 : 1)
  RESULT_NEG = (vs1[L] <<[<] SHAMT[L])[SZ-1:0]   // !A "<<<" logical shift
  RESULT_NEG = S ? Saturate(RESULT_NEG, SHSAT[L]) : RESULT_NEG
  RESULT_POS = ((vs1[L] + RND) >>[>] SHAMT[L])   // !A ">>>" logical shift
  RESULT_POS = S ? Saturate(RESULT_POS, SHSAT[L]) : RESULT_POS
  vd[L] = SHAMT[L] >= 0 ? RESULT_POS : RESULT_NEG
Select lanes from two operands with vector selection boolean.
Encodings
vsel.[b,h,w].vv.{m} vd, vs1, vs2
vsel.[b,h,w].vx.{m} vd, vs1, xs2
Operation
for L in Op.typelen
  vd[L] = vs1[L].bit(0) ? vd[L] : vs2[L]
Logical left shift.
Encodings
vsll.[b,h,w].vv.{m} vd, vs1, vs2
vsll.[b,h,w].vx.{m} vd, vs1, xs2
Operation
N = [8->3, 16->4, 32->5][Op.size]
for L in Op.typelen
  vd[L] = vs1[L] <<< vs2[L][N-1:0]
Slide next register by index.
For the horizontal mode, it treats the stripmine vm register group based on vs1 as a contiguous block, and only the first index elements from vs2 are used. For the vertical mode, each stripmine vector register is mapped separately by op index; it mimics the image tiling shift of

|--------|--------|
| 4xVLEN | 4xVLEN |
| (vs1)  | (vs2)  |
|--------|--------|
The vertical mode can also support the non-stripmine version to handle the last columns of the image.
Encodings
Horizontal slide:
vslidehn.[b,h,w].[1,2,3,4].vv.m vd, vs1, vs2
vslidehn.[b,h,w].[1,2,3,4].vx.m vd, vs1, xs2
Vertical slide:
vsliden.[b,h,w].[1,2,3,4].vv vd, vs1, vs2
vslidevn.[b,h,w].[1,2,3,4].vv.m vd, vs1, vs2
vslidevn.[b,h,w].[1,2,3,4].vx.m vd, vs1, xs2
Operation
assert vd != vs1 && vd != vs2
if Op.h   // a contiguous horizontal slide based on vs1
  va = {{vs1},{vs1+1},{vs1+2},{vs1+3}}
  vb = {{vs1+1},{vs1+2},{vs1+3},{vs2}}
if Op.v   // vs1/vs2 vertical slide
  va = {{vs1},{vs1+1},{vs1+2},{vs1+3}}
  vb = {{vs2},{vs2+1},{vs2+2},{vs2+3}}
sm = Op.m ? 4 : 1
for M in sm
  for L in Op.typelen
    if (L + index < Op.typelen)
      vd[L] = va[M][L + index]
    else
      vd[L] = is_vx ? xs2 : vb[M][L + index - Op.typelen]
Slide previous register by index.
For the horizontal mode, it treats the stripmine vm register group based on vs2 as a contiguous block, and only the LAST index elements from the stripmine vm register group based on vs1 are used, placed at the beginning. For the vertical mode, each stripmine vector register is mapped separately by op index; it mimics the image tiling shift of

|--------|--------|
| 4xVLEN | 4xVLEN |
| (vs1)  | (vs2)  |
|--------|--------|
The vertical mode can also support the non-stripmine version to handle the last columns of the image.
Encodings
Horizontal slide:
vslidehp.[b,h,w].[1,2,3,4].vv.m vd, vs1, vs2
vslidehp.[b,h,w].[1,2,3,4].vx.m vd, vs1, xs2
Vertical slide:
vslidep.[b,h,w].[1,2,3,4].vv vd, vs1, vs2
vslidevp.[b,h,w].[1,2,3,4].vv.m vd, vs1, vs2
vslidevp.[b,h,w].[1,2,3,4].vx.m vd, vs1, xs2
Operation
assert vd != vs1 && vd != vs2
if Op.h   // a contiguous horizontal slide based on vs2
  va = {{vs1+3},{vs2},{vs2+1},{vs2+2}}
  vb = {{vs2},{vs2+1},{vs2+2},{vs2+3}}
if Op.v   // vs1/vs2 vertical slide
  va = {{vs1},{vs1+1},{vs1+2},{vs1+3}}
  vb = {{vs2},{vs2+1},{vs2+2},{vs2+3}}
sm = Op.m ? 4 : 1
for M in sm
  for L in Op.typelen
    if (L < index)
      vd[L] = va[M][Op.typelen + L - index]
    else
      vd[L] = is_vx ? xs2 : vb[M][L - index]
Arithmetic and logical right shift.
Encodings
vsra.[b,h,w].vv.{m} vd, vs1, vs2
vsra.[b,h,w].vx.{m} vd, vs1, xs2
vsrl.[b,h,w].vv.{m} vd, vs1, vs2
vsrl.[b,h,w].vx.{m} vd, vs1, xs2
Operation
N = [8->3, 16->4, 32->5][Op.size]
for L in Op.typelen
  vd[L] = vs1[L] >>[>] vs2[L][N-1:0]
Arithmetic right shift with rounding and signed/unsigned saturation.
Encodings
vsrans{u}.[b,h].{r}.vv.{m} vd, vs1, vs2
vsrans{u}.[b,h].{r}.vx.{m} vd, vs1, xs2
Operation
N = [8->3, 16->4, 32->5][Op.size]
for L in Op.typelen
  SHAMT[L] = vs2[L][2*N-1:0]   # source-size shift index
  RND = R && SHAMT ? 1 << (SHAMT-1) : 0
  RND -= N && (vs1[L] < 0) ? 1 : 0
  vd[L+0] = Saturate(({vs1+0}[L/2] + RND) >>[>] SHAMT, u)
  vd[L+1] = Saturate(({vs1+1}[L/2] + RND) >>[>] SHAMT, u)
Note: vsrans.[b,h].vx.m are used in ML activations so may be optimized by implementations.
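Worked examples for one lane of vsrans.b.r (16b source to 8b result), reading the operation as round, shift, then saturate to the destination width:

    vsrans.b.r(0x0123, 4) = 0x12   # (0x123 + 8) >> 4 = 0x12, in range
    vsrans.b.r(0x7fff, 4) = 0x7f   # (0x7fff + 8) >> 4 = 0x800, saturates to INT8_MAX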
Arithmetic quarter narrowing right shift with rounding and signed/unsigned saturation.
Encodings
vsraqs{u}.b.{r}.vv.{m} vd, vs1, vs2
vsraqs{u}.b.{r}.vx.{m} vd, vs1, xs2
Operation
for L in i32.typelen
  SHAMT[L] = vs2[L][4:0]
  RND = R && SHAMT ? 1 << (SHAMT-1) : 0
  RND -= N && (vs1[L] < 0) ? 1 : 0
  vd[L+0] = Saturate(({vs1+0}[L/4] + RND) >>[>] SHAMT, u)
  vd[L+1] = Saturate(({vs1+2}[L/4] + RND) >>[>] SHAMT, u)
  vd[L+2] = Saturate(({vs1+1}[L/4] + RND) >>[>] SHAMT, u)
  vd[L+3] = Saturate(({vs1+3}[L/4] + RND) >>[>] SHAMT, u)
Note: The register interleaving is [0,2,1,3] and not [0,1,2,3] as this matches vconv/vdwconv requirements, and one vsraqs is the same as two chained vsrans.
Reverse subtract two operands.
Encodings
vrsub.[b,h,w].vx.{m} vd, vs1, xs2
Operation
for L in Op.typelen
  vd[L] = xs2[L] - vs1[L]
Subtract two operands.
Encodings
vsub.[b,h,w].vv.{m} vd, vs1, vs2
vsub.[b,h,w].vx.{m} vd, vs1, xs2
Operation
for L in Op.typelen
  vd[L] = vs1[L] - vs2[L]
Subtract two operands with saturation.
Encodings
vsubs.[b,h,w].{u}.vv.{m} vd, vs1, vs2
vsubs.[b,h,w].{u}.vx.{m} vd, vs1, xs2
Operation
for L in Op.typelen
  vd[L] = Saturate(vs1[L] - vs2[L])
Subtract two operands with widening.
Encodings
vsubw.[h,w].{u}.vv.{m} vd, vs1, vs2
vsubw.[h,w].{u}.vx.{m} vd, vs1, xs2
Operation
for L in Op.typelen
  {vd+0}[L] = vs1.asHalfType[2*L+0] - vs2.asHalfType[2*L+0]
  {vd+1}[L] = vs1.asHalfType[2*L+1] - vs2.asHalfType[2*L+1]
Vector store to memory with optional post-increment by scalar.
Encodings
vst.[b,h,w].{p}.x.{m} vd, xs1
vst.[b,h,w].[l,p,s,lp,sp,tp].xx.{m} vd, xs1, xs2
Operation
addr = xs1
sm = Op.m ? 4 : 1
len = min(Op.typelen * sm, unsigned(xs2))
for M in Op.m
  for L in Op.typelen
    if !Op.bit.l || (L + M * Op.typelen) < len
      mem[addr + L].type = vd[L]
  if (Op.bit.s)
    addr += xs2 * sizeof(type)
  else
    addr += Reg.bytes
if Op.bit.p
  if Op.bit.l && Op.bit.s                 # .tp
    xs1 += Reg.bytes
  elif !Op.bit.l && !Op.bit.s && !{xs2}   # .p.x
    xs1 += Reg.bytes * sm
  elif Op.bit.l                           # .lp
    xs1 += len * sizeof(type)
  elif Op.bit.s                           # .sp
    xs1 += xs2 * sizeof(type) * sm
  else                                    # .p.xx
    xs1 += xs2 * sizeof(type)
Vector store quads to memory with optional post-increment by scalar.
Encodings
vstq.[b,h,w].[s,sp].xx.{m} vd, xs1, xs2
Operation
addr = xs1
sm = Op.m ? 4 : 1
for M in Op.m
  for Q in 0 to 3
    for L in Op.typelen / 4
      mem[addr + L].type = vd[L + Q * Op.typelen / 4]
    addr += xs2 * sizeof(type)
if Op.bit.p
  xs1 += xs2 * sizeof(type) * sm
Note: This is principally for storing the results of vconv after 32b to 8b reduction.
XOR two operands.
Encodings
vxor.vv.{m} vd, vs1, vs2
vxor.[b,h,w].vx.{m} vd, vs1, xs2
Operation
for L in Op.typelen
  vd[L] = vs1[L] ^ vs2[L]
Interleave even/odd lanes of two operands.
Encodings
vzip.[b,h,w].vv.{m} vd, vs1, vs2
vzip.[b,h,w].vx.{m} vd, vs1, xs2
Operation
index = Is(a=>0, b=>1)
for L in Op.typelen
  M = L / 2
  N = L / 2 + Op.typelen / 2
  {vd+0}[L] = L & 1 ? vs2[M] : vs1[M]
  {vd+1}[L] = L & 1 ? vs2[N] : vs1[N]

where:
  vs1 = 0x66442200
  vs2 = 0x77553311
  {vd+0} = 0x33221100
  {vd+1} = 0x77665544
Note: vd must not be in the range of vs1 or vs2.
Log a register in a printf contract.
Encodings
flog rs1 // mode=0, “printf” formatted command, rs1=(context)
slog rs1 // mode=1, scalar log
clog rs1 // mode=2, character log
klog rs1 // mode=3, const string log
Operation
A number of arguments are sent with SLOG or CLOG, and then a FLOG operation closes the packet and may emit a timestamp and context data such as the ASID. A receiving tool can construct messages (e.g. XML records per printf stream) by collecting the arguments as they arrive in a variable-length buffer, closing the record when the FLOG instruction arrives.
A transport layer may choose to encode, in the FLOG format footer, the count of preceding arguments or bytes sent, so that payload errors or hot connections can be detected.
The SLOG instruction will send a payload packet represented by the starting memory location.
The CLOG instruction will send a character-stream message as multiple 32-bit packets. The message closes when a zero character is detected. A single character may be sent in a 32-bit packet.
Pseudo code
const uint8_t p[] = "text message";
printf("Test %s\n", p);
  KLOG p
  FLOG &fmt
printf("Test");
  FLOG &fmt
printf("Test %d\n", result_int);
  SLOG result_int
  FLOG &fmt
printf("Test %d %s %s %s\n", 123, "abc", "1234", "789AB");
  SLOG 123
  CLOG 'abc\0'
  CLOG '1234'
  CLOG '\0'
  CLOG '789A'
  CLOG 'B\0'
  FLOG &fmt