# Kelvin Instruction Reference
An ML+SIMD+Scalar instruction set for ML accelerator cores.
## SIMD register configuration
Kelvin has 64 vector registers, `v0` to `v63`, with the vector length of 256-bit
for each of the registers. The register can store data in the format of 8b, 16b,
and 32b, as encoded in the instructions (See the next section for detail).
Kelvin also supports the stripmine behaviors, which utilizes 16 vector registers
with each one 4x the size of the typical register (Also see the details in the
next section).
## SIMD Instructions
The SIMD instructions utilize a register file with 64 entries which serves both
standard arithmetic and logical operations and the domain compute. SIMD lane
size, scalar broadcast, arithmetic operation sign, and stripmine behaviors are
encoded explicitly in the opcodes.
The SIMD instructions replace the encoding space of the compressed instruction
set extension (those with 2-bit prefixes 00, 01, and 10). See The RISC-V
Instruction Set Manual v2.2, "Available 30-bit instruction encoding spaces", for
quadrupling the available encoding space within the 32-bit format.
### Instruction Encodings
31..26 | 25..20 | 19..14 | 13..12 | 11..6 | 5 | 4..2 | 1..0 | form
:----: | :----: | :----: | :----: | :---: | :-: | :---: | :--: | :--:
func2 | vs2 | vs1 | sz | vd | m | func1 | 00 | .vv
func2 | [0]xs2 | vs1 | sz | vd | m | func1 | 10 | .vx
func2 | 000000 | vs1 | sz | vd | m | func1 | 10 | .v
func2 | [0]xs2 | xs1[0] | sz | vd | m | 111 | 11 | .xx
func2 | 000000 | xs1[0] | sz | vd | m | 111 | 11 | .x
31..26 | 25..20 | 19..14 | 13..12 | 11..6 | 5 | 4..3 | 2..0 | form
:----: | :----: | :----: | :--------: | :---: | :-: | :--------: | :--: | :--:
vs3 | vs2 | vs1 | func3[3:2] | vd | m | func3[1:0] | 001 | .vvv
vs3 | [0]xs2 | vs1 | func3[3:2] | vd | m | func3[1:0] | 101 | .vxv
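
As an illustration of the `.vv` row above, the following Python sketch packs and unpacks the fields at the listed bit positions. The helper names `encode_vv` and `decode_vv` are hypothetical and not part of any Kelvin tooling, and the sketch assumes that the number shown in the func2 column of the opcode tables is the 6-bit func2 value.

```python
# Field positions follow the .vv row of the encoding table:
# func2[31:26] vs2[25:20] vs1[19:14] sz[13:12] vd[11:6] m[5] func1[4:2] form[1:0]
SZ = {"b": 0b00, "h": 0b01, "w": 0b10}

def encode_vv(func2, func1, vd, vs1, vs2, sz="b", m=False):
    """Pack the fields of a .vv instruction into a 32-bit word (hypothetical helper)."""
    for name, reg in (("vd", vd), ("vs1", vs1), ("vs2", vs2)):
        if not 0 <= reg < 64:
            raise ValueError(f"{name} must be in v0..v63")
    if not 0 <= func2 < 64 or not 0 <= func1 < 8:
        raise ValueError("func2 is 6 bits, func1 is 3 bits")
    return (
        (func2 << 26) | (vs2 << 20) | (vs1 << 14) | (SZ[sz] << 12)
        | (vd << 6) | (int(m) << 5) | (func1 << 2) | 0b00
    )

def decode_vv(word):
    """Inverse of encode_vv, returning the raw fields as a dict."""
    return {
        "func2": (word >> 26) & 0x3F,
        "vs2": (word >> 20) & 0x3F,
        "vs1": (word >> 14) & 0x3F,
        "sz": (word >> 12) & 0x3,
        "vd": (word >> 6) & 0x3F,
        "m": (word >> 5) & 0x1,
        "func1": (word >> 2) & 0x7,
        "form": word & 0x3,
    }

# vadd.b.vv v0, v1, v2, assuming func2=0 and func1=0b000 as listed for vadd.
word = encode_vv(func2=0, func1=0b000, vd=0, vs1=1, vs2=2)
```

Round-tripping a word through `decode_vv` is a quick way to check that a hand-assembled encoding places each register in the intended field.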
### Types ".b" ".h" ".w"
The SIMD lane size is encoded in the opcode definition indicating the
destination type. For many opcodes source and destination sizes are the same,
differing for widening and narrowing operations.
op[13:12] | sz | type
:-------: | :--: | :--:
00 | ".b" | 8b
01 | ".h" | 16b
10 | ".w" | 32b
### Scalar ".vx"
Instructions may use a scalar register to perform a value broadcast (8b, 16b,
32b) to all SIMD lanes of one operand.
op[2:0] | form
:-----: | :------------:
x00 | ".vv"
x10 | ".vx"
x10 | ".v" (xs2==x0)
x11 | ".xx"
x11 | ".x" (xs2==x0)
001 | ".vvv"
101 | ".vxv"
### Signed/Unsigned ".u"
Instructions which may be marked with ".u" have signed and unsigned variants.
See comparisons, arithmetic operations and saturation for usage, the side
effects being typical behaviors unless otherwise noted.
### Stripmine ".m"
The stripmine functionality is an instruction compression mechanism. Frontend
dispatch captures a single instruction, while the backend issue expands to four
operations. Conceptually the register file is reduced from 64 locations to 16,
where a stripmine register must use a mod4 base aligned register (e.g. v0, v4,
v8, ...). Normal instruction and stripmine variants may be mixed together.
Currently, neither the assembler nor kelvin_sim checks for invalid stripmine
registers. Code using invalid registers (like v1) will compile and simulate, but
will cause the FPGA to hang.
When stripmining is used in conjunction with instructions which use a register
index as a base to several registers, the offset of +4 (instead of +1) shall be
used. e.g., {vm0,vm1} becomes {{v0,v1,v2,v3},{v4,v5,v6,v7}}.
A machine may elect to distribute a stripmined instruction across multiple ALUs.
op[5] | m
:---: | :--:
0 | ""
1 | ".m"
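
Because the tools do not currently reject misaligned stripmine registers, a small check in a code generator can catch them early. The sketch below is only an illustration of the alignment and +4 offset rules described above; `stripmine_group` and `stripmine_pair` are hypothetical helper names.

```python
def stripmine_group(base):
    """Return the four register indices covered by a stripmined operand."""
    if not 0 <= base < 64:
        raise ValueError("register index must be in 0..63")
    if base % 4 != 0:
        raise ValueError(f"v{base} is not a mod4-aligned stripmine base")
    return [base + i for i in range(4)]

def stripmine_pair(base):
    """Register groups for an op that uses {base+0, base+1}; the stripmine step is +4."""
    return [stripmine_group(base), stripmine_group(base + 4)]
```

For example, `stripmine_pair(0)` yields the groups `{v0,v1,v2,v3}` and `{v4,v5,v6,v7}`, matching the `{vm0,vm1}` expansion in the text.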
### 2-arg .xx (Load / Store)
Instruction | func2 | Notes
:---------: | :-------: | :--------:
vld | 00 xx0PSL | 1-arg
vld.l | 01 xx0PSL |
vld.s | 02 xx0PSL |
vld.p | 04 xx0PSL | 1 or 2-arg
vld.lp | 05 xx0PSL |
vld.sp | 06 xx0PSL |
vld.tp | 07 xx0PSL |
vst | 08 xx1PSL | 1-arg
vst.l | 09 xx1PSL |
vst.s | 10 xx1PSL |
vst.p | 12 xx1PSL | 1 or 2-arg
vst.lp | 13 xx1PSL |
vst.sp | 14 xx1PSL |
vst.tp | 15 xx1PSL |
vdup.x | 16 x10000 |
vcget | 20 x10100 | 0-arg
vstq.s | 26 x11PSL |
vstq.sp | 30 x11PSL |
To save encoding space, use the compile-time knowledge that if vld.p.xx or
vst.p.xx would post-increment by a zero amount, x0 is not encoded; instead the
post-increment operation is disabled so that the encoding where xs2==x0 can be
reused for vld.p.x or vst.p.x, which have different base update behavior. If the
post-increment were programmatic behavior then a register where xs2!=x0 would be
used.
**NOTE**: Scalar register `xs1` uses the same encoding bitfield as the vector
register `vs1`, but **HAS ONE BIT PADDED AT LSB**. That is `xs1` has the same
encoding as the regular RISC-V instructions (bit[19:15]). On the other hand,
`xs2` shares the same encoding bitfield `vs2`, but **HAS ONE BIT PADDED AT MSB**,
so it is consistent with the regular RISC-V instructions (bit[24:20]).
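
The padding rule in the note can be made concrete with a short Python sketch. It only models where the 5-bit scalar register numbers land inside the 6-bit vector fields; the function names are hypothetical.

```python
def xs1_field(xs1):
    """6-bit value for bits [19:14]: xs1 in [19:15], zero padding at bit 14 (LSB)."""
    assert 0 <= xs1 < 32
    return xs1 << 1

def xs2_field(xs2):
    """6-bit value for bits [25:20]: zero padding at bit 25 (MSB), xs2 in [24:20]."""
    assert 0 <= xs2 < 32
    return xs2

def place_scalars(xs1, xs2):
    """Return only the xs1/xs2 bits of a .xx instruction word."""
    return (xs2_field(xs2) << 20) | (xs1_field(xs1) << 14)

# x10 as xs1 and x11 as xs2 end up at the usual RISC-V rs1/rs2 positions.
word = place_scalars(xs1=10, xs2=11)
```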
### 1-arg .x (Load / Store)
Instructions of the format "op.xx vd, xs1, x0" (xs2=x0, the scalar zero
register) are reduced to the shortened form "op.x vd, xs1".
**NOTE**: Scalar register `xs1` uses the same encoding bitfield as the vector
register `vs1`, but **HAS ONE BIT PADDED AT LSB**. That is `xs1` has the same
encoding as the regular RISC-V instructions (bit[19:15]).
### 0-arg
Instructions of the format "op.xx vd, x0, x0" (xs1=x0, xs2=x0, the scalar zero
register) are reduced to the shortened form "op vd".
### 1-arg .v
Single argument vector operations ".v" use xs2 scalar encoding "x0|zero".
### 2-arg .vv|.vx
**Instruction** | func2 | **func1** / Notes
:-------------: | :-------: | :-----------------------:
**Arithmetic** | ... | **000**
vadd | 00 xxxxxx |
vsub | 01 xxxxxx |
vrsub | 02 xxxxxx |
veq | 06 xxxxxx |
vne | 07 xxxxxx |
vlt.{u} | 08 xxxxxU |
vle.{u} | 10 xxxxxU |
vgt.{u} | 12 xxxxxU |
vge.{u} | 14 xxxxxU |
vabsd.{u} | 16 xxxxxU |
vmax.{u} | 18 xxxxxU |
vmin.{u} | 20 xxxxxU |
vadd3 | 24 xxxxxx |
**Arithmetic2** | ... | **100**
vadds.{u} | 00 xxxxxU |
vsubs.{u} | 02 xxxxxU |
vaddw.{u} | 04 xxxxxU |
vsubw.{u} | 06 xxxxxU |
vacc.{u} | 10 xxxxxU |
vpadd.{u} | 12 xxxxxU | .v
vpsub.{u} | 14 xxxxxU | .v
vhadd.{ur} | 16 xxxxRU |
vhsub.{ur} | 20 xxxxRU |
**Logical** | ... | **001**
vand | 00 xxxxxx |
vor | 01 xxxxxx |
vxor | 02 xxxxxx |
vnot | 03 xxxxxx | .v
vrev | 04 xxxxxx |
vror | 05 xxxxxx |
vclb | 08 xxxxxx | .v
vclz | 09 xxxxxx | .v
vcpop | 10 xxxxxx | .v
vmv | 12 xxxxxx | .v
vmvp | 13 xxxxxx |
acset | 16 xxxxxx |
actr | 17 xxxxxx | .v
adwinit | 18 xxxxxx |
**Shift** | ... | **010**
vsll | 01 xxxxxx |
vsra | 02 xxxxx0 |
vsrl | 03 xxxxx1 |
vsha.{r} | 08 xxxxR0 | +/- shamt
vshl.{r} | 09 xxxxR1 | +/- shamt
vsrans{u}.{r} | 16 xxxxRU | narrowing saturating (x2)
vsraqs{u}.{r} | 24 xxxxRU | narrowing saturating (x4)
**Mul/Div** | **...** | **011**
vmul | 00 xxxxxx |
vmuls | 02 xxxxxU |
vmulw | 04 xxxxxU |
vmulh.{ur} | 08 xxxxRU |
vdmulh.{rn} | 16 xxxxRN |
vmacc | 20 xxxxxx |
vmadd | 21 xxxxxx |
**Float** | ... | **101**
--reserved-- | xx xxxxxx |
**Shuffle** | ... | **110**
vslidevn | 00 xxxxNN |
vslidehn | 04 xxxxNN |
vslidevp | 08 xxxxNN |
vslidehp | 12 xxxxNN |
vsel | 16 xxxxxx |
vevn | 24 xxxxxx |
vodd | 25 xxxxxx |
vevnodd | 26 xxxxxx |
vzip | 28 xxxxxx |
**Reserved7** | ... | **111**
--reserved-- | xx xxxxxx |
### 3-arg .vvv|.vxv
Instruction | func3 | Notes
:---------: | :---: | :-----------------------:
aconv | 8 | scalar: sign
vdwconv | 10 | scalar: sign/type/swizzle
### Typeless
Operations that do not have a {.b,.h,.w} type have the same behavior regardless
of the size field (bitwise: vand, vnot, vor, vxor; move: vmv, vmvp). The tooling
convention is to use size=0b00 ".b" encoding.
### Vertical Modes
The ".tp" mode of vld or vst uses the four registers of ".m" in a vertical
structure, compared to the horizontal usage of other modes. The ".m" base update is a
single register width, vs 4x width for other modes. The usage model is four
"lines" being processed at the same time, vs a single line chained together in
other ".m" modes.
Vertical (".tp")
... AAAA ...
... BBBB ...
... CCCC ...
... DDDD ...
### Aliases
vneg.v vrsub.vx vd, vs1, zero \
vabs.v vabsd.vx vd, vs1, zero \
vwiden.v vaddw.vx vd, vs1, zero
## System Instructions
The execution model is designed towards OS-less and interrupt-less operation. A
machine will typically operate as run-to-completion of small restartable
workloads. A user/machine mode split is provided as a runtime convenience,
though there is no difference in access permissions between the modes.
31..28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19..15 | 14..12 | 11..7 | 6..2 | 1 | 0 | OP
:----: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :----: | :----: | :---: | :---: | :-: | :-: | :-:
0000 | PI | PO | PR | PW | SI | SO | SR | SW | 00000 | 000 | 00000 | 00011 | 1 | 1 | FENCE
31..28 | 27..24 | 23..20 | 19..15 | 14..12 | 11..7 | 6..2 | 1 | 0 | OP
:----: | :----: | :----: | :----: | :----: | :---: | :---: | :-: | :-: | :-----:
0000 | 0000 | 0000 | 00000 | 001 | 00000 | 00011 | 1 | 1 | FENCE.I
31..27 | 26..25 | 24..20 | 19..15 | 14..12 | 11..7 | 6..2 | 1 | 0 | OP
:----: | :----: | :----: | :----: | :----: | :---: | :---: | :-: | :-: | :-:
00100 | 11 | 00000 | xs1 | 000 | 00000 | 11101 | 1 | 1 | FLUSH
0001M | sz | xs2 | xs1 | 000 | xd | 11101 | 1 | 1 | GET{MAX}VL
01111 | 00 | 00000 | xs1 | mode | 00000 | 11101 | 1 | 1 | \[F,S,K,C\]LOG
31..20 | 19..15 | 14..12 | 11..7 | 6..2 | 1 | 0 | OP
:----------: | :----: | :----: | :---: | :---: | :-: | :-: | :----:
000000000001 | 00000 | 000 | 00000 | 11100 | 1 | 1 | EBREAK
001100000010 | 00000 | 000 | 00000 | 11100 | 1 | 1 | MRET
000010000000 | 00000 | 000 | 00000 | 11100 | 1 | 1 | MPAUSE
000001100000 | 00000 | 000 | 00000 | 11100 | 1 | 1 | ECTXSW
000001000000 | 00000 | 000 | 00000 | 11100 | 1 | 1 | EYIELD
000000100000 | 00000 | 000 | 00000 | 11100 | 1 | 1 | EEXIT
000000000000 | 00000 | 000 | 00000 | 11100 | 1 | 1 | ECALL
### Exit Cause
* `enum_IDLE = 0`
* `enum_EBREAK = 1`
* `enum_ECALL = 2`
* `enum_EEXIT = 3`
* `enum_EYIELD = 4`
* `enum_ECTXSW = 5`
* `enum_UNDEF_INST = (1u<<31) | 2`
* `enum_USAGE_FAULT = (1u<<31) | 16`
## Instruction Definitions
### FLUSH
Cache clean and invalidate operations at the private level.
flushat xs1 \
Start = End = xs1
Line = xs1
The instruction is a standard way of describing cache maintenance operations.
Type | Visibility | System1 | System2
------- | ---------- | ----------------- | ---------------------
Private | Core | Core L1 | Core L1 + Coherent L2
### FENCE
Enforce memory ordering of loads and stores for external visibility.
fence \[i|o|r|w\], \[i|o|r|w\] \
PI predecessor I/O input
PO predecessor I/O output
PR predecessor memory read
PW predecessor memory write
<ordering between marked predecessors and successors>
SI successor I/O input
SO successor I/O output
SR successor memory read
SW successor memory write
Note: a simplified implementation may have the frontend stall until all
preceding operations are completed before permitting any trailing instruction to
be dispatched.
### FENCE.I
Ensure subsequent instruction fetches observe prior data operations.
### GETVL
Calculate the vector length.
getvl.[b,h,w].x xd, xs1 \
getvl.[b,h,w].xx xd, xs1, xs2 \
getvl.[b,h,w].x.m xd, xs1 \
getvl.[b,h,w].xx.m xd, xs1, xs2
xd = min(vl.type.size, unsigned(xs1), xs2 ? unsigned(xs2) : ignore)
Find the minimum of the maximum vector length by type and the two input values.
If xs2 is zero (either x0 or register contents) then it is ignored (or
considered MaxInt), acting as a clamp less than maxvl.
Type | Instruction | Description
---- | ----------- | ----------------
00 | getvl.b | 8bit lane count
01 | getvl.h | 16bit lane count
10 | getvl.w | 32bit lane count
### GETMAXVL
Obtain the maximum vector length.
getmaxvl.[b,h,w].{m} xd
xd = vl.type.size
Type | Instruction | Description
---- | ----------- | ----------------
00 | getmaxvl.b | 8bit lane count
01 | getmaxvl.h | 16bit lane count
10 | getmaxvl.w | 32bit lane count
For a machine with 256bit SIMD registers:
* getmaxvl.w = 8 lanes
* getmaxvl.h = 16 lanes
* getmaxvl.b = 32 lanes
* getmaxvl.w.m = 32 lanes &ensp; // multiply by 4 with strip mine.
* getmaxvl.h.m = 64 lanes
* getmaxvl.b.m = 128 lanes
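
The lane counts above, and the clamp performed by getvl, can be modelled in a few lines of Python. This is a behavioral sketch under the stated 256-bit assumption, not the simulator's implementation; `strip_lengths` is a hypothetical helper showing how a loop over `n` elements would consume the returned lengths.

```python
LANE_BITS = {"b": 8, "h": 16, "w": 32}

def getmaxvl(ty, m=False, simd_bits=256):
    """Lane count for one register, or four registers when stripmined."""
    return (simd_bits // LANE_BITS[ty]) * (4 if m else 1)

def getvl(ty, xs1, xs2=0, m=False, simd_bits=256):
    """min(maxvl, unsigned(xs1), unsigned(xs2)); a zero xs2 is ignored."""
    vl = min(getmaxvl(ty, m, simd_bits), xs1 & 0xFFFFFFFF)
    if xs2 & 0xFFFFFFFF:
        vl = min(vl, xs2 & 0xFFFFFFFF)
    return vl

def strip_lengths(n, ty="w", m=True):
    """Per-iteration lane counts when looping over n elements."""
    out = []
    while n > 0:
        vl = getvl(ty, n, m=m)
        out.append(vl)
        n -= vl
    return out
```

With these definitions, 70 words processed with `getvl.w.x.m` are split into two full 32-lane iterations and a 6-lane tail.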
### ECALL
Execution call to supervisor OS.
if (mode == User)
  mcause = enum_ECALL
  mepc = pc
  pc = mtvec
  mode = Machine
else
  mcause = enum_USAGE_FAULT
  mfault = pc
### EEXIT
Execution exit to supervisor OS.
if (mode == User)
  mcause = enum_EEXIT
  mepc = pc
  pc = mtvec
  mode = Machine
else
  mcause = enum_USAGE_FAULT
  mfault = pc
### EYIELD
Synchronous execution switch to supervisor OS.
if (mode == User)
  if (YIELD_REQUEST == 1)
    mcause = enum_EYIELD
    mepc = pc + 4 # advance to next instruction
    pc = mtvec
    mode = Machine
  else
    NOP # pc = pc + 4
else
  mcause = enum_USAGE_FAULT
  mfault = pc
YIELD_REQUEST refers to a signal the supervisor core sets to request a context
switch.
Note: when MIE=0, eyield is inserted at synchronization points for
cooperative context switching.
### ECTXSW
Asynchronous execution switch to supervisor OS.
if (mode == User)
  mcause = enum_ECTXSW
  mepc = pc
  pc = mtvec
  mode = Machine
else
  mcause = enum_USAGE_FAULT
  mfault = pc
### EBREAK
Execution breakpoint to supervisor OS.
if (mode == User)
  mcause = enum_EBREAK
  mepc = pc
  pc = mtvec
  mode = Machine
else
  mcause = enum_UNDEF_INST
  mfault = pc
### MRET
Return from machine mode to user mode.
if (mode == Machine)
  pc = mepc
  mode = User
else
  mcause = enum_UNDEF_INST
  mepc = pc
  pc = mtvec
  mode = Machine
### MPAUSE
Machine pause and release for next execution context.
if (mode == Machine)
mcause = enum_UNDEF_INST
mepc = pc
pc = mtvec
mode = Machine
### VABSD
Absolute difference with unsigned result.
vabsd.[b,h,w].{u}.vv.{m} vd, vs1, vs2 \
vabsd.[b,h,w].{u}.vx.{m} vd, vs1, xs2
for L in Op.typelen
vd[L] = vs1[L] > vs2[L] ? vs1[L] - vs2[L] : vs2[L] - vs1[L]
Note: for signed(INTx_MAX - INTx_MIN) the result will be UINTx_MAX.
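
The note about the extreme signed case can be checked with a small lane-level model. The Python below is a sketch that takes lanes as raw bit patterns and assumes two's-complement interpretation for the signed variant; the function name mirrors the mnemonic but is otherwise hypothetical.

```python
def vabsd(vs1, vs2, bits=8, unsigned=False):
    """Per-lane absolute difference; lanes are given as raw bit patterns."""
    mask = (1 << bits) - 1

    def value(x):
        x &= mask
        # Reinterpret as two's complement for the signed variant.
        if not unsigned and x >> (bits - 1):
            x -= 1 << bits
        return x

    out = []
    for a, b in zip(vs1, vs2):
        a, b = value(a), value(b)
        diff = a - b if a > b else b - a
        out.append(diff & mask)  # the result is stored as an unsigned lane
    return out
```

For 8-bit lanes, `vabsd([0x7F], [0x80])` compares 127 with -128 and produces `0xFF`, while the unsigned variant on the same bit patterns produces 1.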
### VACC
Accumulates a value into a wider register.
vacc.[h,w].{u}.vv.{m} vd, vs1, vs2 \
vacc.[h,w].{u}.vx.{m} vd, vs1, xs2
for L in Op.typelen
{vd+0}[L] = {vs1+0} + vs2.asHalfType[2*L+0]
{vd+1}[L] = {vs1+1} + vs2.asHalfType[2*L+1]
### VADD
Add operands.
vadd.[b,h,w].vv.{m} vd, vs1, vs2 \
vadd.[b,h,w].vx.{m} vd, vs1, xs2
for L in Op.typelen
vd[L] = vs1[L] + vs2[L]
### VADDS
Add operands with saturation.
vadds.[b,h,w].{u}.vv.{m} vd, vs1, vs2 \
vadds.[b,h,w].{u}.vx.{m} vd, vs1, xs2
for L in Op.typelen
vd[L] = Saturate(vs1[L] + vs2[L])
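
The `Saturate` step clamps the exact sum to the range of the lane type. A minimal Python sketch, assuming lanes are passed as already-interpreted integer values and that saturation clamps to the signed or unsigned range of the lane width:

```python
def saturate(x, bits, unsigned):
    """Clamp x to the representable range of a lane of the given width."""
    if unsigned:
        lo, hi = 0, (1 << bits) - 1
    else:
        lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, x))

def vadds(vs1, vs2, bits=8, unsigned=False):
    """Saturating add over lanes given as Python ints."""
    return [saturate(a + b, bits, unsigned) for a, b in zip(vs1, vs2)]
```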
### VADDW
Add operands with widening.
vaddw.[h,w].{u}.vv.{m} vd, vs1, vs2 \
vaddw.[h,w].{u}.vx.{m} vd, vs1, xs2
for L in Op.typelen
{vd+0}[L] = vs1.asHalfType[2*L+0] + vs2.asHalfType[2*L+0]
{vd+1}[L] = vs1.asHalfType[2*L+1] + vs2.asHalfType[2*L+1]
### VADD3
Add three operands.
vadd3.[w].vv.{m} vd, vs1, vs2 \
vadd3.[w].vx.{m} vd, vs1, xs2
for L in i32.typelen
vd[L] = vd[L] + vs1[L] + vs2[L]
### VAND
AND operands.
vand.vv.{m} vd, vs1, vs2 \
vand.[b,h,w].vx.{m} vd, vs1, xs2
for L in Op.typelen
vd[L] = vs1[L] & vs2[L]
### ACONV
Performs matmul vs1*vs3, accumulating into the accumulator.
aconv.vxv vd, vs1, xs2, vs3
Encoding 'aconv' uses a '1' in the unused 5th bit (b25) of vs2.
# 8b: 0123456789abcdef
# 32b: 048c 26ae 159d 37bf
assert(vd == 48)
N = is_simd512 ? 16 : is_simd256 ? 8 : assert(0)
func Interleave(Y,L):
m = L % 4
if (m == 0) (Y & ~3) + 0
if (m == 1) (Y & ~3) + 2
if (m == 2) (Y & ~3) + 1
if (m == 3) (Y & ~3) + 3
# i32 += i8 x i8 (u*u, u*s, s*u, s*s)
for Y in [0..N-1]
for X in [Start..Stop]
for L in i8.typelen
Data1 = {vs1+Y}.i8[4*X + L&3] # 'transpose and broadcast'
Data2 = {vs3+X-Start}.u8[L]
{Accum+Interleave(Y,L)}[L / 4] +=
((signed(SData1,Data1{7:0}) + signed(Bias1{8:0})){9:0} *
(signed(SData2,Data2{7:0}) + signed(Bias2{8:0})){9:0}){18:0}
vs1 goes to the *narrow* port of the matmul. 8 vectors are always used.
vs3 goes to the *wide* port of the matmul, up to 8 vectors are used.
xs2 specifies control params used in the operation and has the following
format:
Mode | Mode | Usage
:----: | :--: | :-----------------------------------------------:
Common | | Mode[1:0] Start[6:2] Stop[11:7]
s8 | 0 | SData2[31] SBias2[30:22] SData1[21] SBias1[20:12]
Start and Stop control the window of input values to participate in the
matmul:
- On vs1 this is in 4-byte words on all 8 vectors at the same time.
- On vs3 this is the register number to use (vs3+0 to vs3+7).
- The operation takes (stop - start + 1) ticks to complete.
When using SIMD256, the following operands are valid:
- vd: v48
- vs1: v0, v16, v32, v48
- vs3: v8, v24, v40, v56
- v48 is used as vd but never written to.
- v48-v55 will always be overwritten upon VCGET.
### VCGET
Copy convolution accumulators into general registers.
vcget vd
assert(vd == 48)
N = is_simd512 ? 15 : is_simd256 ? 7 : assert(0)
for Y in [0..N]
vd{Y} = Accum{Y}
Accum{Y} = 0
v48 is the only valid vd in this instruction.
### ACSET
Copy general registers into convolution accumulators.
acset.v vd, vs1
assert(vd == 48)
N = is_simd512 ? 15 : is_simd256 ? 7 : assert(0)
for Y in [0..N]
Accum{Y} = vd{Y}
Note that v48 is used as vd but never written to.
### ACTR
Transpose a register group into the convolution accumulators.
actr.[w].v.{m} vd, vs1
assert(vd == 48)
assert(vs1 in {v0, v16, v32, v48})
for I in i32.typelen
for J in i32.typelen
ACCUM[J][I] = vs1[I][J]
Note that v48 is used as vd but never written to.
### VCLB
Count the leading bits.
vclb.[b,h,w].v.{m} vd, vs1
MSB = 1 << (vtype.size - 1)
for L in Op.typelen
vd[L] = vs1[L] & MSB ? CLZ(~vs1[L]) : CLZ(vs1[L])
Note: (clb - 1) is equivalent to `__builtin_clrsb`.
**clb examples**
clb.w(0xffffffff) = 32
clb.w(0xcfffffff) = 2
clb.w(0x80001000) = 1
clb.w(0x00007fff) = 17
clb.w(0x00000000) = 32
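
The examples above can be reproduced with a direct transcription of the pseudocode into Python. This is a sketch for checking values by hand; `clz` and `clb` here are plain helper functions, not simulator APIs.

```python
def clz(x, bits):
    """Count leading zeros of an unsigned value in a lane of `bits` width."""
    x &= (1 << bits) - 1
    return bits - x.bit_length()

def clb(x, bits=32):
    """Count the leading bits that match the most significant bit."""
    mask = (1 << bits) - 1
    x &= mask
    msb = 1 << (bits - 1)
    # A set MSB means counting leading ones, i.e. leading zeros of the inverse.
    return clz(~x & mask, bits) if x & msb else clz(x, bits)
```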
### VCLZ
Count the leading zeros.
vclz.[b,h,w].v.{m} vd, vs1
for L in Op.typelen
vd[L] = CLZ(vs1[L])
Note: clz.[b,h,w](0) returns [8,16,32].
### VDWCONV
Depthwise convolution 3-way multiply accumulate.
vdwconv.vxv vd, vs1, xs2, vs3 \
adwconv.vxv vd, vs1, xs2, vs3
Encoding 'adwconv' uses a '1' in the unused 5th bit (b25) of vs2.
The vertical axis is typically tiled which requires preserving registers for
this functionality. The sparse formats require shuffles so that additional
registers of intermediate state are not required.
# quant8
{vs1+0,vs1+1,vs1+2} = Rebase({vs1}, Mode::RegBase)
{b0} = {vs3+0}.asByteType
{b1} = {vs3+1}.asByteType
{b2} = {vs3+2}.asByteType
if IsDenseFormat
a0 = {vs1+0}.asByteType
a1 = {vs1+1}.asByteType
a2 = {vs1+2}.asByteType
if IsSparseFormat1 # [n-1,n,n+1]
a0 = vslide_p({vs1+1}, {vs1+0}, 1).asByteType
a1 = {vs1+1}.asByteType
a2 = vslide_n({vs1+1}, {vs1+2}, 1).asByteType
if IsSparseFormat2 # [n,n+1,n+2]
a0 = {vs1+0}.asByteType
a1 = vslide_n({vs1+0}, {vs1+1}, 1).asByteType
a2 = vslide_n({vs1+0}, {vs1+1}, 2).asByteType
# 8b: 0123456789abcdef
# 32b: 048c 26ae 159d 37bf
func Interleave(L):
i = L % 4
if (i == 0) 0
if (i == 1) 2
if (i == 2) 1
if (i == 3) 3
for L in Op.typelen
B = 4*L # 8b --> 32b
for i in [0..3]
# int19_t multiply results
# int23_t addition results
# int32_t storage
{dwacc+i}[L/4] +=
(SData1(a0[B+i]) + bias1) * (SData2(b0[B+i]) + bias2) +
(SData1(a1[B+i]) + bias1) * (SData2(b1[B+i]) + bias2) +
(SData1(a2[B+i]) + bias1) * (SData2(b2[B+i]) + bias2)
if is_vdwconv // !adwconv
for i in [0..3]
{vd+i} = {dwacc+i}
Mode | Encoding | Usage
:----: | :------: | :-----------------------------------------------:
Common | xs2 | Mode[1:0] Sparsity[3:2] RegBase[7:4]
q8 | 0 | SData2[31] SBias2[30:22] SData1[21] SBias1[20:12]
The Mode::Sparsity field sets the swizzling patterns.
Sparsity | Format | Swizzle
:------: | :-----: | :---------:
b00 | Dense | none
b01 | Sparse1 | [n-1,n,n+1]
b10 | Sparse2 | [n,n+1,n+2]
The Mode::RegBase field selects the start point of the 3-register group, which
allows the [prev,curr,next] values to be cycled.
RegBase | Prev | Curr | Next
:-----: | :-----: | :-----: | :-----:
b0000 | {vs1+0} | {vs1+1} | {vs1+2}
b0001 | {vs1+1} | {vs1+2} | {vs1+3}
b0010 | {vs1+2} | {vs1+3} | {vs1+4}
b0011 | {vs1+3} | {vs1+4} | {vs1+5}
b0100 | {vs1+4} | {vs1+5} | {vs1+6}
b0101 | {vs1+5} | {vs1+6} | {vs1+7}
b0110 | {vs1+6} | {vs1+7} | {vs1+8}
b0111 | {vs1+1} | {vs1+0} | {vs1+2}
b1000 | {vs1+1} | {vs1+2} | {vs1+0}
b1001 | {vs1+3} | {vs1+4} | {vs1+0}
b1010 | {vs1+5} | {vs1+6} | {vs1+0}
b1011 | {vs1+7} | {vs1+8} | {vs1+0}
b1100 | {vs1+2} | {vs1+0} | {vs1+1}
b1101 | {vs1+4} | {vs1+0} | {vs1+1}
b1110 | {vs1+6} | {vs1+0} | {vs1+1}
b1111 | {vs1+8} | {vs1+0} | {vs1+1}
RegBase supports up to 3x3, 5x5, 7x7, and 9x9, or the extra horizontal range can
be used for input latency hiding.
The vdwconv instruction includes a non-architectural state accumulator to
increase register file bandwidth. The dwinit instruction must be used to prepare
the depthwise accumulator for a sequence of dwconv instructions, and the
sequence must be dispatched without other instructions interleaved; otherwise the
results will be unpredictable. Should other operations be required, a dwinit
must be inserted to resume the sequence.
In a context switch save, where the accumulator must be saved alongside the
architectural SIMD registers, v0..63 are saved to the thread stack or TCB, and
then a vdwconv with vdup-prepared zero inputs can be used to write the
accumulator values to SIMD registers, which are then saved to memory. In a
context switch restore, the values can be loaded from memory and set in the
accumulator registers using the dwinit instruction.
### ADWINIT
Load the depthwise convolution accumulator state.
adwinit.v vd, vs1
for L in Op.typelen
{dwacc+0} = {vs1+0}[L]
{dwacc+1} = {vs1+1}[L]
{dwacc+2} = {vs1+2}[L]
{dwacc+3} = {vs1+3}[L]
### VDMULH
Saturating signed doubling multiply returning high half with optional rounding.
vdmulh.[b,h,w].{r,rn}.vv.{m} vd, vs1, vs2 \
vdmulh.[b,h,w].{r,rn}.vx.{m} vd, vs1, xs2
SZ = vtype.size * 8
for L in Op.typelen
LHS = SignExtend(vs1[L], 2*SZ)
RHS = SignExtend(vs2[L], 2*SZ)
MUL = LHS * RHS
RND = R ? (N && MUL < 0 ? -(1<<(SZ-1)) : (1<<(SZ-1))) : 0
vd[L] = SignedSaturation(2 * MUL + RND)[2*SZ-1:SZ]
Note: saturation is only needed for MaxNeg inputs (e.g. 0x80000000).
Note: vdmulh.w.r.vx.m is used in ML activations so may be optimized by
implementations.
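
To make the saturation note concrete, here is a single-lane Python model of the doubling multiply. It is a sketch that assumes the product is `vs1[L] * vs2[L]` on sign-extended inputs and that the saturation clamps to the signed double-width range; `vdmulh_lane` is a hypothetical name.

```python
def vdmulh_lane(a, b, bits=8, r=False, n=False):
    """Saturating doubling multiply high for one pair of signed lane values."""
    mul = a * b
    rnd = 0
    if r:
        # The optional N flag rounds negative products in the other direction.
        rnd = -(1 << (bits - 1)) if (n and mul < 0) else (1 << (bits - 1))
    wide = 2 * mul + rnd
    # Saturate to the signed double-width range before taking the high half.
    hi = (1 << (2 * bits - 1)) - 1
    lo = -(1 << (2 * bits - 1))
    wide = max(lo, min(hi, wide))
    return wide >> bits  # arithmetic shift keeps the sign
```

Only the MaxNeg by MaxNeg case overflows: for 8-bit lanes, `vdmulh_lane(-128, -128)` would need 32768 in the double-width result and is clamped so that the high half is 127.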
### VDUP
Duplicate a scalar value into a vector register.
vdup.[b,h,w].x.{m} vd, xs2
for L in Op.typelen
vd[L] = [xs2]
### VEQ
Integer equal comparison.
veq.[b,h,w].vv.{m} vd, vs1, vs2 \
veq.[b,h,w].vx.{m} vd, vs1, xs2
for L in Op.typelen
vd[L] = vs1[L] == vs2[L] ? 1 : 0
### VEVN, VODD, VEVNODD
Even/odd of concatenated registers.
vevn.[b,h,w].vv.{m} vd, vs1, vs2 \
vevn.[b,h,w].vx.{m} vd, vs1, xs2 \
vodd.[b,h,w].vv.{m} vd, vs1, vs2 \
vodd.[b,h,w].vx.{m} vd, vs1, xs2 \
vevnodd.[b,h,w].vv.{m} vd, vs1, vs2 \
vevnodd.[b,h,w].vx.{m} vd, vs1, xs2
M = Op.typelen / 2
if vevn || vevnodd
  {dst0} = {vd+0}
  {dst1} = {vd+1}
if vodd
  {dst1} = {vd+0}
if vevn || vevnodd
  for L in Op.typelen
    dst0[L] = L < M ? vs1[2 * L + 0] : vs2[2 * (L - M) + 0] # even
if vodd || vevnodd
  for L in Op.typelen
    dst1[L] = L < M ? vs1[2 * L + 1] : vs2[2 * (L - M) + 1] # odd
vs1 = 0x33221100
vs2 = 0x77665544
{vd+0} = 0x66442200
{vd+1} = 0x77553311
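
The worked values above can be reproduced with a short Python model. The sketch treats each register as a list of byte lanes with lane 0 in the least significant byte, which is an assumption about how the example words are written; `lanes` and `pack` are hypothetical helpers for converting to and from that form.

```python
def vevnodd(vs1, vs2):
    """Return (even, odd) lanes of the concatenation vs1:vs2."""
    n = len(vs1)
    m = n // 2
    even = [vs1[2 * l] if l < m else vs2[2 * (l - m)] for l in range(n)]
    odd = [vs1[2 * l + 1] if l < m else vs2[2 * (l - m) + 1] for l in range(n)]
    return even, odd

def lanes(word, n=4):
    """Split a word into n byte lanes (lane 0 = low byte)."""
    return [(word >> (8 * i)) & 0xFF for i in range(n)]

def pack(ls):
    """Inverse of lanes()."""
    return sum(v << (8 * i) for i, v in enumerate(ls))

even, odd = vevnodd(lanes(0x33221100), lanes(0x77665544))
```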
### VGE
Integer greater-than-or-equal comparison.
vge.[b,h,w].{u}.vv.{m} vd, vs1, vs2 \
vge.[b,h,w].{u}.vx.{m} vd, vs1, xs2
for L in Op.typelen
vd[L] = vs1[L] >= vs2[L] ? 1 : 0
### VGT
Integer greater-than comparison.
vgt.[b,h,w].{u}.vv.{m} vd, vs1, vs2 \
vgt.[b,h,w].{u}.vx.{m} vd, vs1, xs2
for L in Op.typelen
vd[L] = vs1[L] > vs2[L] ? 1 : 0
### VHADD
Halving addition with optional rounding bit.
vhadd.[b,h,w].{r,u,ur}.vv.{m} vd, vs1, vs2 \
vhadd.[b,h,w].{r,u,ur}.vx.{m} vd, vs1, xs2
for L in Op.typelen
  if IsSigned()
    vd[L] = (signed(vs1[L]) + signed(vs2[L]) + R) >> 1
  else
    vd[L] = (unsigned(vs1[L]) + unsigned(vs2[L]) + R) >> 1
### VHSUB
Halving subtraction with optional rounding bit.
vhsub.[b,h,w].{r,u,ur}.vv.{m} vd, vs1, vs2 \
vhsub.[b,h,w].{r,u,ur}.vx.{m} vd, vs1, xs2
for L in Op.typelen
  if IsSigned()
    vd[L] = (signed(vs1[L]) - signed(vs2[L]) + R) >> 1
  else
    vd[L] = (unsigned(vs1[L]) - unsigned(vs2[L]) + R) >> 1
### VLD
Vector load from memory with optional post-increment by scalar.
vld.[b,h,w].{p}.x.{m} vd, xs1 \
vld.[b,h,w].[l,p,s,lp,sp,tp].xx.{m} vd, xs1, xs2
addr = xs1
sm = Op.m ? 4 : 1
len = min(Op.typelen * sm, unsigned(xs2))
for M in Op.m
  for L in Op.typelen
    if !Op.bit.l || (L + M * Op.typelen) < len
      vd[L] = mem[addr + L].type
    else
      vd[L] = 0
  if (Op.bit.s)
    addr += xs2 * sizeof(type)
  else
    addr += Reg.bytes
if Op.bit.p
  if Op.bit.l && Op.bit.s # .tp
    xs1 += Reg.bytes
  elif !Op.bit.l && !Op.bit.s && !{xs2} # .p.x
    xs1 += Reg.bytes * sm
  elif Op.bit.l # .lp
    xs1 += len * sizeof(type)
  elif Op.bit.s # .sp
    xs1 += xs2 * sizeof(type) * sm
  else # .p.xx
    xs1 += xs2 * sizeof(type)
### VLE
Integer less-than-or-equal comparison.
vle.[b,h,w].{u}.vv.{m} vd, vs1, vs2 \
vle.[b,h,w].{u}.vx.{m} vd, vs1, xs2
for L in Op.typelen
vd[L] = vs1[L] <= vs2[L] ? 1 : 0
### VLT
Integer less-than comparison.
vlt.[b,h,w].{u}.vv.{m} vd, vs1, vs2 \
vlt.[b,h,w].{u}.vx.{m} vd, vs1, xs2
for L in Op.typelen
vd[L] = vs1[L] < vs2[L] ? 1 : 0
### VMACC
Multiply accumulate.
vmacc.[b,h,w].vv.{m} vd, vs1, vs2 \
vmacc.[b,h,w].vx.{m} vd, vs1, xs2
for L in Op.typelen
  vd[L] += vs1[L] * vs2[L]
### VMADD
Multiply add.
vmadd.[b,h,w].vv.{m} vd, vs1, vs2 \
vmadd.[b,h,w].vx.{m} vd, vs1, xs2
for L in Op.typelen
  vd[L] = vd[L] * vs2[L] + vs1[L]
### VMAX
Find the unsigned or signed maximum of two registers.
vmax.[b,h,w].{u}.vv.{m} vd, vs1, vs2 \
vmax.[b,h,w].{u}.vx.{m} vd, vs1, xs2
for L in Op.typelen
vd[L] = vs1[L] > vs2[L] ? vs1[L] : vs2[L]
### VMIN
Find the minimum of two registers.
vmin.[b,h,w].{u}.vv.{m} vd, vs1, vs2 \
vmin.[b,h,w].{u}.vx.{m} vd, vs1, xs2
for L in Op.typelen
vd[L] = vs1[L] < vs2[L] ? vs1[L] : vs2[L]
### VMUL
Multiply two registers.
vmul.[b,h,w].vv.{m} vd, vs1, vs2 \
vmul.[b,h,w].vx.{m} vd, vs1, xs2
for L in Op.typelen
vd[L] = vs1[L] * vs2[L]
### VMULS
Multiply two registers with saturation.
vmuls.[b,h,w].{u}.vv.{m} vd, vs1, vs2 \
vmuls.[b,h,w].{u}.vx.{m} vd, vs1, xs2
for L in Op.typelen
vd[L] = Saturation(vs1[L] * vs2[L])
### VMULW
Multiply two registers with widening.
vmulw.[h,w].{u}.vv.{m} vd, vs1, vs2 \
vmulw.[h,w].{u}.vx.{m} vd, vs1, xs2
for L in Op.typelen
{vd+0}[L] = vs1.asHalfType[2*L+0] * vs2.asHalfType[2*L+0]
{vd+1}[L] = vs1.asHalfType[2*L+1] * vs2.asHalfType[2*L+1]
### VMULH
Multiply two registers with widening, returning the high half.
vmulh.[b,h,w].{u}.{r}.vv.{m} vd, vs1, vs2 \
vmulh.[b,h,w].{u}.{r}.vx.{m} vd, vs1, xs2
SZ = vtype.size * 8
RND = IsRounded ? 1<<(SZ-1) : 0
for L in Op.typelen
  if IsU()
    vd[L] = (unsigned(vs1[L]) * unsigned(vs2[L]) + RND)[2*SZ-1:SZ]
  else if IsSU()
    vd[L] = ( signed(vs1[L]) * unsigned(vs2[L]) + RND)[2*SZ-1:SZ]
  else
    vd[L] = ( signed(vs1[L]) * signed(vs2[L]) + RND)[2*SZ-1:SZ]
### VMV
Move a register.
vmv.v.{m} vd, vs1
for L in Op.typelen
vd[L] = vs1[L]
Note: in the stripmined case an implementation may deliver more than one write per
cycle.
### VMVP
Move a pair of registers.
vmvp.vv.{m} vd, vs1, vs2 \
vmvp.[b,h,w].vx.{m} vd, vs1, xs2
for L in Op.typelen
{vd+0}[L] = vs1[L]
{vd+1}[L] = vs2[L]
### VNE
Integer not-equal comparison.
vne.[b,h,w].vv.{m} vd, vs1, vs2 \
vne.[b,h,w].vx.{m} vd, vs1, xs2
for L in Op.typelen
vd[L] = vs1[L] != vs2[L] ? 1 : 0
### VNOT
Bitwise NOT a register.
vnot.v.{m} vd, vs1
for L in Op.typelen
vd[L] = ~vs1[L]
### VOR
OR two operands.
vor.vv.{m} vd, vs1, vs2 \
vor.[b,h,w].vx.{m} vd, vs1, xs2
for L in Op.typelen
vd[L] = vs1[L] | vs2[L]
### VPADD
Adds the lane pairs.
vpadd.[h,w].{u}.v.{m} vd, vs1
if .v
for L in Op.typelen
vd[L] = (vs1.asHalfType[2 * L] + vs1.asHalfType[2 * L + 1])
### VPSUB
Subtracts the lane pairs.
vpsub.[h,w].{u}.v.{m} vd, vs1
if .v
for L in Op.typelen
vd[L] = (vs1.asHalfType[2 * L] - vs1.asHalfType[2 * L + 1])
### VCPOP
Count the set bits.
vcpop.[b,h,w].v.{m} vd, vs1
for L in Op.typelen
vd[L] = CountPopulation(vs1[L])
### VREV
Generalized reverse using bit ladder.
The size of the flip is based on the `log_2(data type)`
vrev.[b,h,w].vv.{m} vd, vs1, vs2 \
vrev.[b,h,w].vx.{m} vd, vs1, xs2
N = vtype.bits - 1 # 7, 15, 31
shamt = xs2[4:0] & N
for L in Op.typelen
r = vs1[L]
if (shamt & 1) r = ((r & 0x55..) << 1) | ((r & 0xAA..) >> 1)
if (shamt & 2) r = ((r & 0x33..) << 2) | ((r & 0xCC..) >> 2)
if (shamt & 4) r = ((r & 0x0F..) << 4) | ((r & 0xF0..) >> 4)
if (sz == 0) vd[L] = r; continue;
if (shamt & 8) r = ((r & 0x00..) << 8) | ((r & 0xFF..) >> 8)
if (sz == 1) vd[L] = r; continue;
if (shamt & 16) r = ((r & 0x00..) << 16) | ((r & 0xFF..) >> 16)
vd[L] = r
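
The bit ladder above generalizes to any lane width: each set bit of `shamt` enables one swap stage. The Python sketch below builds the alternating masks (0x55.., 0x33.., 0x0F.., and so on) in a loop rather than spelling them out; it models a single lane and the name `vrev_lane` is hypothetical.

```python
def vrev_lane(x, shamt, bits=32):
    """Generalized bit reverse of one lane; each shamt bit enables one swap stage."""
    mask = (1 << bits) - 1
    x &= mask
    shamt &= bits - 1
    stage = 1
    while stage < bits:
        if shamt & stage:
            # Select the low `stage` bits of every 2*stage group,
            # e.g. 0x55.. for stage 1, 0x33.. for stage 2, 0x0F.. for stage 4.
            low = 0
            for pos in range(0, bits, 2 * stage):
                low |= ((1 << stage) - 1) << pos
            x = ((x & low) << stage) | ((x >> stage) & low)
        stage <<= 1
    return x & mask
```

A shift amount of 7 on a byte lane performs a full bit reversal, while 24 on a word lane enables only the 8 and 16 stages and therefore swaps byte order.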
### VROR
Logical rotate right.
vror.[b,h,w].vv.{m} vd, vs1, vs2 \
vror.[b,h,w].vx.{m} vd, vs1, xs2
N = vtype.bits - 1 # 7, 15, 31
shamt = xs2[4:0] & N
for L in Op.typelen
r = vs1[L]
if (shamt & 1) for (B in vtype.bits) r[B] = r[(B+1) % (N+1)]
if (shamt & 2) for (B in vtype.bits) r[B] = r[(B+2) % (N+1)]
if (shamt & 4) for (B in vtype.bits) r[B] = r[(B+4) % (N+1)]
if (shamt & 8) for (B in vtype.bits) r[B] = r[(B+8) % (N+1)]
if (shamt & 16) for (B in vtype.bits) r[B] = r[(B+16) % (N+1)]
vd[L] = r
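
The staged rotate is equivalent to a single rotate right by the masked shift amount. A minimal Python model of one lane, assuming bit B of the result takes its value from bit (B + shamt) modulo the lane width:

```python
def vror_lane(x, shamt, bits=32):
    """Rotate one lane right; only the low log2(bits) bits of shamt are used."""
    mask = (1 << bits) - 1
    x &= mask
    shamt &= bits - 1
    return ((x >> shamt) | (x << (bits - shamt))) & mask
```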
### VSHA, VSHL
Arithmetic and logical left/right shift with saturating shift amount and result.
vsha.[b,h,w].{r}.vv.{m} vd, vs1, vs2
vshl.[b,h,w].{r}.vv.{m} vd, vs1, vs2
M = Op.size # 8, 16, 32
N = [8->3, 16->4, 32->5][Op.size]
SHSAT[L] = vs2[L][M-1:N] != 0
SHAMT[L] = vs2[L][N-1:0]
RND = R && SHAMT ? 1 << (SHAMT-1) : 0
RND -= N && (vs1[L] < 0) ? 1 : 0
SZ = sizeof(src.type) * 8 * (W ? 2 : 1)
RESULT_NEG = (vs1[L] <<[<] SHAMT[L])[SZ-1:0] // !A "<<<" logical shift
RESULT_POS = ((vs1[L] + RND) >>[>] SHAMT[L]) // !A ">>>" logical shift
### VSEL
Select lanes from two operands with vector selection boolean.
vsel.[b,h,w].vv.{m} vd, vs1, vs2 \
vsel.[b,h,w].vx.{m} vd, vs1, xs2
for L in Op.typelen
vd[L] = vs1[L].bit(0) ? vd[L] : vs2[L]
### VSLL
Logical left shift.
vsll.[b,h,w].vv.{m} vd, vs1, vs2 \
vsll.[b,h,w].vx.{m} vd, vs1, xs2
N = [8->3, 16->4, 32->5][Op.size]
for L in Op.typelen
  vd[L] = vs1[L] <<< vs2[L][N-1:0]
### VSLIDEN
Slide next register by index.
For the horizontal mode, it treats the stripmine `vm` register based on
`vs1` as a contiguous block, and only the first `index` elements from `vs2`
will be used.
For the vertical mode, each stripmine vector register `op_index` is mapped
separately. It mimics the image tiling process shift of
| 4xVLEN | 4xVLEN |
| (vs1) | (vs2) |
The vertical mode can also support the non-stripmine version to handle
the last columns of the image.
Horizontal slide:
vslidehn.[b,h,w].[1,2,3,4].vv.m vd, vs1, vs2 \
vslidehn.[b,h,w].[1,2,3,4].vx.m vd, vs1, xs2
Vertical slide:
vsliden.[b,h,w].[1,2,3,4].vv vd, vs1, vs2 \
vslidevn.[b,h,w].[1,2,3,4].vv.m vd, vs1, vs2 \
vslidevn.[b,h,w].[1,2,3,4].vx.m vd, vs1, xs2
assert vd != vs1 && vd != vs2
if Op.h // A contiguous horizontal slide based on vs1
  va = {{vs1},{vs1+1},{vs1+2},{vs1+3}}
  vb = {{vs1+1},{vs1+2},{vs1+3},{vs2}}
if Op.v // vs1/vs2 vertical slide
  va = {{vs1},{vs1+1},{vs1+2},{vs1+3}}
  vb = {{vs2},{vs2+1},{vs2+2},{vs2+3}}
sm = Op.m ? 4 : 1
for M in sm
  for L in Op.typelen
    if (L + index < Op.typelen)
      vd[L] = va[M][L + index]
    else
      vd[L] = is_vx ? xs2 : vb[M][L + index - Op.typelen]
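
For a single register pair, the inner loop of the slide-next pseudocode reduces to the following Python sketch. It models the non-stripmine vertical case only, with `va` and `vb` as lane lists; `slide_next` is a hypothetical name.

```python
def slide_next(va, vb, index):
    """Shift lanes toward lane 0 by `index`, filling the tail from the start of vb."""
    n = len(va)
    assert len(vb) == n and 1 <= index <= 4 and index <= n
    return [va[l + index] if l + index < n else vb[l + index - n] for l in range(n)]
```

With eight lanes and `index=2`, the result holds the last six lanes of `va` followed by the first two lanes of `vb`.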
### VSLIDEP
Slide previous register by index.
For the horizontal mode, it treats the stripmine `vm` register based on
**`vs2`** as a contiguous block, and only the _LAST_ `index` elements from
stripmine vm register based on `vs1` will be used AT THE BEGINNING.
For the vertical mode, each stripmine vector register `op_index` is mapped
separately. It mimics the image tiling process shift of
| 4xVLEN | 4xVLEN |
| (vs1) | (vs2) |
The vertical mode can also support the non-stripmine version to handle
the last columns of the image.
Horizontal slide:
vslidehp.[b,h,w].[1,2,3,4].vv.m vd, vs1, vs2 \
vslidehp.[b,h,w].[1,2,3,4].vx.m vd, vs1, xs2
Vertical slide:
vslidep.[b,h,w].[1,2,3,4].vv vd, vs1, vs2 \
vslidevp.[b,h,w].[1,2,3,4].vv.m vd, vs1, vs2 \
vslidevp.[b,h,w].[1,2,3,4].vx.m vd, vs1, xs2
assert vd != vs1 && vd != vs2
if Op.h // A contiguous horizontal slide based on vs2
  va = {{vs1+3},{vs2},{vs2+1},{vs2+2}}
  vb = {{vs2},{vs2+1},{vs2+2},{vs2+3}}
if Op.v // vs1/vs2 vertical slide
  va = {{vs1},{vs1+1},{vs1+2},{vs1+3}}
  vb = {{vs2},{vs2+1},{vs2+2},{vs2+3}}
sm = Op.m ? 4 : 1
for M in sm
  for L in Op.typelen
    if (L < index)
      vd[L] = va[M][Op.typelen + L - index]
    else
      vd[L] = is_vx ? xs2 : vb[M][L - index]
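
The slide-previous inner loop can be modelled the same way for one register pair. This Python sketch covers only the non-stripmine vertical case, with `va` and `vb` as lane lists; `slide_prev` is a hypothetical name.

```python
def slide_prev(va, vb, index):
    """Shift lanes away from lane 0 by `index`, filling the head from the end of va."""
    n = len(va)
    assert len(vb) == n and 1 <= index <= 4 and index <= n
    return [va[n + l - index] if l < index else vb[l - index] for l in range(n)]
```

With eight lanes and `index=2`, the result holds the last two lanes of `va` followed by the first six lanes of `vb`.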
### VSRA, VSRL
Arithmetic and logical right shift.
vsra.[b,h,w].vv.{m} vd, vs1, vs2 \
vsra.[b,h,w].vx.{m} vd, vs1, xs2
vsrl.[b,h,w].vv.{m} vd, vs1, vs2 \
vsrl.[b,h,w].vx.{m} vd, vs1, xs2
N = [8->3, 16->4, 32->5][Op.size]
for L in Op.typelen
  vd[L] = vs1[L] >>[>] vs2[L][N-1:0]
### VSRANS
Arithmetic right shift with rounding and signed/unsigned saturation.
vsrans{u}.[b,h].{r}.vv.{m} vd, vs1, vs2 \
vsrans{u}.[b,h].{r}.vx.{m} vd, vs1, xs2
for L in Op.typelen
N = [8->3, 16->4, 32->5][Op.size]
SHAMT[L] = vs2[L][2*N-1:0] # source size index
RND = R && SHAMT ? 1 << (SHAMT-1) : 0
RND -= N && (vs1[L] < 0) ? 1 : 0
vd[L+0] = Saturate({vs1+0}[L/2] + RND, u) >>[>] SHAMT
vd[L+1] = Saturate({vs1+1}[L/2] + RND, u) >>[>] SHAMT
Note: vsrans.[b,h].vx.m are used in ML activations, so implementations may
choose to optimize them.
### VSRAQS
Arithmetic quarter narrowing right shift with rounding and signed/unsigned
saturation.
vsraqs{u}.b.{r}.vv.{m} vd, vs1, vs2 \
vsraqs{u}.b.{r}.vx.{m} vd, vs1, xs2
for L in i32.typelen
SHAMT[L] = vs2[L][4:0]
RND = R && SHAMT ? 1 << (SHAMT-1) : 0
RND -= N && (vs1[L] < 0) ? 1 : 0
vd[L+0] = Saturate({vs1+0}[L/4] + RND, u) >>[>] SHAMT
vd[L+1] = Saturate({vs1+2}[L/4] + RND, u) >>[>] SHAMT
vd[L+2] = Saturate({vs1+1}[L/4] + RND, u) >>[>] SHAMT
vd[L+3] = Saturate({vs1+3}[L/4] + RND, u) >>[>] SHAMT
Note: The register interleaving is [0,2,1,3] and not [0,1,2,3] as this matches
vconv/vdwconv requirements, and one vsraqs is the same as two chained vsrans.
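The [0,2,1,3] placement can be illustrated with a small Python sketch. It is a simplified reading of the pseudocode, not a definitive model: it rounds, shifts, and then saturates to 8 bits, and it leaves out the negative-value rounding adjustment (`RND -= ...`). The names `narrow` and `vsraqs`, and the use of Python lists for registers, are assumptions for illustration.

```python
ORDER = (0, 2, 1, 3)  # register interleaving described in the note above


def narrow(x, shamt, rounding=True, unsigned=False):
    """Round, shift right, then saturate a signed 32-bit value to 8 bits."""
    rnd = (1 << (shamt - 1)) if (rounding and shamt) else 0
    y = (x + rnd) >> shamt
    lo, hi = (0, 255) if unsigned else (-128, 127)
    return max(lo, min(hi, y))


def vsraqs(regs, shamt, rounding=True, unsigned=False):
    """regs: four lists of 32-bit lane values, modeling {vs1+0}..{vs1+3}."""
    n = len(regs[0])
    out = [0] * (4 * n)
    for i in range(n):
        for k, r in enumerate(ORDER):
            out[4 * i + k] = narrow(regs[r][i], shamt, rounding, unsigned)
    return out


regs = [[0, 1], [10, 11], [20, 21], [30, 31]]
print(vsraqs(regs, 0))  # [0, 20, 10, 30, 1, 21, 11, 31]
```

With a shift of zero the values pass through unchanged, which makes the register order visible in the output.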
### VRSUB
Reverse subtract two operands.
vrsub.[b,h,w].vx.{m} vd, vs1, xs2
for L in Op.typelen
vd[L] = xs2[L] - vs1[L]
### VSUB
Subtract two operands.
vsub.[b,h,w].vv.{m} vd, vs1, vs2 \
vsub.[b,h,w].vx.{m} vd, vs1, xs2
for L in Op.typelen
vd[L] = vs1[L] - vs2[L]
### VSUBS
Subtract two operands with saturation.
vsubs.[b,h,w].{u}.vv.{m} vd, vs1, vs2 \
vsubs.[b,h,w].{u}.vx.{m} vd, vs1, xs2
for L in Op.typelen
vd[L] = Saturate(vs1[L] - vs2[L])
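For reference, a minimal Python sketch of the per-lane saturation, assuming the usual clamp to the representable range of the lane type. `vsubs_lane` is a hypothetical helper name, and operands are passed as already-decoded Python integers.

```python
def vsubs_lane(a, b, bits, unsigned=False):
    """Subtract b from a and clamp the result to the lane's value range."""
    if unsigned:
        lo, hi = 0, (1 << bits) - 1
    else:
        lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, a - b))


print(vsubs_lane(-100, 100, 8))               # -128 (clamped at the minimum)
print(vsubs_lane(100, -100, 8))               # 127  (clamped at the maximum)
print(vsubs_lane(10, 20, 8, unsigned=True))   # 0    (unsigned cannot go negative)
```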
### VSUBW
Subtract two operands with widening.
vsubw.[h,w].{u}.vv.{m} vd, vs1, vs2 \
vsubw.[h,w].{u}.vx.{m} vd, vs1, xs2
for L in Op.typelen
{vd+0}[L] = vs1.asHalfType[2*L+0] - vs2.asHalfType[2*L+0]
{vd+1}[L] = vs1.asHalfType[2*L+1] - vs2.asHalfType[2*L+1]
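The even/odd split across the two destination registers can be sketched in Python as below. The inputs are lists of half-width lane values that are assumed to be already sign- or zero-extended into Python integers; `vsubw` here is an illustrative function, not a Kelvin API.

```python
def vsubw(vs1_half, vs2_half):
    """Widening subtract: even half-lanes go to vd+0, odd half-lanes to vd+1."""
    vd0 = [a - b for a, b in zip(vs1_half[0::2], vs2_half[0::2])]
    vd1 = [a - b for a, b in zip(vs1_half[1::2], vs2_half[1::2])]
    return vd0, vd1


vd0, vd1 = vsubw([100, -128, 127, 0], [-100, 127, -128, 1])
print(vd0)  # [200, 255]
print(vd1)  # [-255, -1]
```

The results 200, 255, and -255 do not fit in a signed 8-bit lane, which is why the destination lanes are twice as wide as the sources.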
### VST
Vector store to memory with optional post-increment by scalar.
vst.[b,h,w].{p}.x.{m} vd, xs1 \
vst.[b,h,w].[l,p,s,lp,sp,tp].xx.{m} vd, xs1, xs2
addr = xs1
sm = Op.m ? 4 : 1
len = min(Op.typelen * sm, unsigned(xs2))
for M in sm
  for L in Op.typelen
    if !Op.bit.l || (L + M * Op.typelen) < len
      mem[addr + L].type = vd[L]
  if (Op.bit.s)
    addr += xs2 * sizeof(type)
  else
    addr += Reg.bytes
if Op.bit.p
  if Op.bit.l && Op.bit.s  # .tp
    xs1 += Reg.bytes
  elif !Op.bit.l && !Op.bit.s && !{xs2}  # .p.x
    xs1 += Reg.bytes * sm
  elif Op.bit.l  # .lp
    xs1 += len * sizeof(type)
  elif Op.bit.s  # .sp
    xs1 += xs2 * sizeof(type) * sm
  else  # .p.xx
    xs1 += xs2 * sizeof(type)
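The post-increment rules are easier to compare side by side in executable form. The Python sketch below transliterates only the `Op.bit.p` branch of the pseudocode. It assumes `Reg.bytes` is 32 (a 256-bit register) and reads the `!{xs2}` test as "no `xs2` operand is encoded", modeled by the hypothetical `has_xs2` flag; both the function name and its parameters are illustrative.

```python
def vst_post_increment(xs1, xs2, type_bytes, typelen, m=False,
                       bit_l=False, bit_s=False, bit_p=False,
                       has_xs2=True, reg_bytes=32):
    """Return the value of xs1 after the store, following the pseudocode above."""
    sm = 4 if m else 1
    # Without an xs2 operand there is no length limit to apply.
    length = min(typelen * sm, xs2) if has_xs2 else typelen * sm
    if not bit_p:
        return xs1                               # no post-increment
    if bit_l and bit_s:                          # .tp
        return xs1 + reg_bytes
    if not bit_l and not bit_s and not has_xs2:  # .p.x
        return xs1 + reg_bytes * sm
    if bit_l:                                    # .lp
        return xs1 + length * type_bytes
    if bit_s:                                    # .sp
        return xs1 + xs2 * type_bytes * sm
    return xs1 + xs2 * type_bytes                # .p.xx


# vst.b.p.x.m: advances by four full registers.
print(hex(vst_post_increment(0x1000, 0, 1, 32, m=True, bit_p=True, has_xs2=False)))  # 0x1080
# vst.h.lp.xx with xs2 = 5: advances by the 5 halfwords actually stored.
print(vst_post_increment(0, 5, 2, 16, bit_l=True, bit_p=True))  # 10
```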
### VSTQ
Vector store quads to memory with optional post-increment by scalar.
vstq.[b,h,w].[s,sp].xx.{m} vd, xs1, xs2
addr = xs1
sm = Op.m ? 4 : 1
for M in sm
  for Q in 0 to 3
    for L in Op.typelen / 4
      mem[addr + L].type = vd[L + Q * Op.typelen / 4]
    addr += xs2 * sizeof(type)
if Op.bit.p
  xs1 += xs2 * sizeof(type) * sm
Note: This is principally for storing the results of vconv after 32b to 8b
conversion.
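A small Python sketch of the quad store follows, under the assumption that the address advances by the stride `xs2 * sizeof(type)` after each quarter-register. Memory is modeled as a dictionary keyed by byte address, and `vstq` with its parameters is a hypothetical helper for illustration.

```python
def vstq(mem, regs, xs1, xs2, type_bytes, post_inc=False):
    """Store each register as four quads, one stride apart (illustrative model).

    mem:  dict mapping byte address -> stored lane value.
    regs: list of 1 register (no .m) or 4 registers (.m), each a list of lanes.
    Returns the value of xs1 after the instruction.
    """
    addr = xs1
    for reg in regs:
        quad = len(reg) // 4
        for q in range(4):
            for lane in range(quad):
                mem[addr + lane * type_bytes] = reg[lane + q * quad]
            addr += xs2 * type_bytes
    if post_inc:
        xs1 += xs2 * type_bytes * len(regs)
    return xs1


mem = {}
new_xs1 = vstq(mem, [list(range(32))], xs1=0, xs2=16, type_bytes=1, post_inc=True)
print(mem[0], mem[16], mem[32], mem[48])  # 0 8 16 24
print(new_xs1)                            # 16
```

In this example each group of eight byte lanes lands at the start of a 16-byte row, leaving the second half of each row untouched.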
### VXOR
XOR two operands.
vxor.vv.{m} vd, vs1, vs2 \
vxor.[b,h,w].vx.{m} vd, vs1, xs2
for L in Op.typelen
vd[L] = vs1[L] ^ vs2[L]
### VZIP
Interleave even/odd lanes of two operands.
vzip.[b,h,w].vv.{m} vd, vs1, vs2 \
vzip.[b,h,w].vx.{m} vd, vs1, xs2
index = Is(a=>0, b=>1)
for L in Op.typelen
M = L / 2
N = L / 2 + Op.typelen / 2
{vd+0}[L] = L & 1 ? vs2[M] : vs1[M]
{vd+1}[L] = L & 1 ? vs2[N] : vs1[N]
vs1 = 0x66442200
vs2 = 0x77553311
{vd+0} = 0x33221100
{vd+1} = 0x77665544
Note: vd must not be in the range of vs1 or vs2.
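The interleave can be checked against the example values above with a short Python sketch. It treats each 32-bit word as four byte lanes, least significant byte first; the helpers `vzip`, `lanes`, and `pack` are names invented for this illustration.

```python
def vzip(vs1, vs2):
    """Interleave lanes of vs1 and vs2 into two destination registers."""
    typelen = len(vs1)
    vd0, vd1 = [], []
    for lane in range(typelen):
        m = lane // 2
        n = lane // 2 + typelen // 2
        vd0.append(vs2[m] if lane & 1 else vs1[m])
        vd1.append(vs2[n] if lane & 1 else vs1[n])
    return vd0, vd1


def lanes(word):
    """Split a 32-bit word into four byte lanes, lane 0 = least significant."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


def pack(byte_lanes):
    """Inverse of lanes(): rebuild the word from its byte lanes."""
    return sum(b << (8 * i) for i, b in enumerate(byte_lanes))


vd0, vd1 = vzip(lanes(0x66442200), lanes(0x77553311))
print(hex(pack(vd0)), hex(pack(vd1)))  # 0x33221100 0x77665544
```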
### FLOG, SLOG, CLOG, KLOG
Log a register in a printf contract.
flog rs1 &ensp; // mode=0, “printf” formatted command, rs1=(context) \
slog rs1 &ensp; // mode=1, scalar log \
clog rs1 &ensp; // mode=2, character log \
klog rs1 &ensp; // mode=3, const string log
A number of arguments are sent with SLOG or CLOG, and then a FLOG operation
closes the packet and may emit a timestamp and context data like ASID. A
receiving tool can construct messages, e.g. XML records per printf stream, by
collecting the arguments as they arrive in a variable length buffer, and closing
the record when the FLOG instruction arrives.
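The receiving-tool behavior described above can be sketched roughly as follows. This is only one possible shape for such a tool: the event tuples, the record dictionary, and the name `collect_records` are all assumptions, and timestamps or context data are left out.

```python
def collect_records(events):
    """Group logged arguments into records, closing each one at FLOG.

    events: iterable of (op, payload) tuples, where op is one of
            "SLOG", "CLOG", "KLOG", or "FLOG".
    """
    records = []
    args = []  # variable-length buffer for the record being built
    for op, payload in events:
        if op == "FLOG":
            records.append({"fmt": payload, "args": args})
            args = []
        else:
            args.append((op, payload))
    return records


events = [("SLOG", 42), ("FLOG", "fmt0"), ("FLOG", "fmt1")]
for record in collect_records(events):
    print(record)
# {'fmt': 'fmt0', 'args': [('SLOG', 42)]}
# {'fmt': 'fmt1', 'args': []}
```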
A transport layer may choose to encode in the flog format footer the preceding
count of arguments or bytes sent. This makes it possible to detect payload
errors or hot connections.
The SLOG instruction will send a payload packet represented by the starting
memory location.
The CLOG instruction sends a character stream as a message of one or more
32-bit packets. The message closes when a zero character is detected. A
single character may be sent in a 32-bit packet.
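As an illustration of how a string might be split into CLOG packets, the Python sketch below appends the terminating zero character and cuts the result into chunks of at most four bytes. The function name `clog_packets` and the choice to leave the final packet unpadded are assumptions; the actual packing is up to the implementation.

```python
def clog_packets(text):
    """Split a string plus its zero terminator into packets of up to 4 bytes."""
    data = text.encode("ascii") + b"\0"
    return [data[i:i + 4] for i in range(0, len(data), 4)]


print(clog_packets("abc"))    # [b'abc\x00']
print(clog_packets("1234"))   # [b'1234', b'\x00']
print(clog_packets("789AB"))  # [b'789A', b'B\x00']
```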
**Pseudo code**
const uint8_t p[] = "text message";
printf("Test %s\n", p);
  KLOG p
  FLOG &fmt
printf("Test %d\n", result_int);
  SLOG result_int
  FLOG &fmt
printf("Test %d %s %s %s\n", 123, "abc", "1234", "789AB");
  SLOG 123
  CLOG 'abc\0'
  CLOG '1234' CLOG '\0'
  CLOG '789A' CLOG 'B\0'
  FLOG &fmt