[sram_ctrl/doc] Clean up documentation and add blockdiagram
Signed-off-by: Michael Schaffner <msf@opentitan.org>
diff --git a/hw/ip/sram_ctrl/doc/_index.md b/hw/ip/sram_ctrl/doc/_index.md
index 5cf4c5f..febdbd6 100644
--- a/hw/ip/sram_ctrl/doc/_index.md
+++ b/hw/ip/sram_ctrl/doc/_index.md
@@ -19,12 +19,33 @@
# Theory of Operations
-### Block Diagram
+## Block Diagram
-**TODO: draw block diagram and add description**
+As shown in the block diagram below (for `Width = 32`), the SRAM controller contains a CSR node, a key request interface, a TL-UL SRAM adapter, and an instance of `prim_ram_1p_scr` that implements the actual scrambling algorithm.
+Scrambling is always enabled, but the scrambling device uses an all-zero scrambling key and nonce when it comes out of reset.
+It is the task of SW to request a new scrambling key and nonce via the CSRs as described in the [Programmer's Guide]({{< relref "#programmers-guide" >}}) below.
![SRAM Controller Block Diagram](sram_ctrl_blockdiag.svg)
+The scrambling device employs a reduced-round (5 instead of 11) PRINCE block cipher in CTR mode to scramble the data.
+The PRINCE lightweight block cipher has been selected for its low latency and low area; see [prim_prince]({{< relref "hw/ip/prim/doc/prim_prince" >}}) for more information on PRINCE.
+The number of rounds is reduced to 5 in order to ease timing pressure and ensure single cycle operation (the number of rounds can always be increased if it turns out that there is enough timing slack).
+
+In [CTR mode](https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#Counter_(CTR)), the block cipher encrypts a 64-bit IV with the scrambling key to create a 64-bit keystream block, which is bitwise XOR'ed with the data to transform plaintext into ciphertext and vice versa.
+The IV is assembled by concatenating a nonce with the word address.
+
+If the input data word is narrower than 64 bits, the keystream block is truncated to fit the input data width.
+If the input data word is wider than 64 bits, the SRAM controller by default instantiates multiple PRINCE primitives in order to create a unique keystream for the full data width.
+For area-constrained settings, the parameter `ReplicateKeyStream` can be set to 1 in order to replicate the keystream block generated by a single primitive instead of using multiple parallel PRINCE instances (note that this lowers the level of security).
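The CTR construction described above can be sketched in software as follows. This is only an illustration under stated assumptions: `toy_block_cipher` is a hypothetical stand-in mixing function (it is not PRINCE and uses a 64-bit placeholder key), the 9-bit address width (`Depth = 512`), the placement of the nonce in the upper IV bits, and the truncation to the lower 32 keystream bits are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>

// Assumed address width (Depth = 512 words); the nonce fills the rest of the IV.
constexpr int kAddrWidth = 9;

// Hypothetical stand-in for the reduced-round PRINCE primitive. This is a
// generic 64-bit mixing function, NOT PRINCE, and offers no security.
uint64_t toy_block_cipher(uint64_t block, uint64_t key) {
  uint64_t z = block ^ key;
  z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
  z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
  return z ^ (z >> 31);
}

// IV = nonce concatenated with the word address (nonce in the upper bits,
// which is an assumption about the bit ordering).
uint64_t make_iv(uint64_t nonce, uint32_t addr) {
  assert(addr < (1u << kAddrWidth));
  return (nonce << kAddrWidth) | addr;
}

// Scrambles or descrambles one 32-bit word; the operation is its own inverse
// because the same keystream is XOR'ed in both directions.
uint32_t ctr_scramble(uint32_t data, uint32_t addr, uint64_t key,
                      uint64_t nonce) {
  uint64_t keystream = toy_block_cipher(make_iv(nonce, addr), key);
  // The data word is narrower than 64 bits, so the keystream block is
  // truncated (here to its lower 32 bits, which is an assumption).
  return data ^ static_cast<uint32_t>(keystream);
}
```

Calling `ctr_scramble` twice with the same address, key, and nonce returns the original word, which mirrors the "plaintext into ciphertext and vice versa" property of the XOR.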
+
+Since plain CTR mode does not diffuse the data bits due to the bitwise XOR, the scheme is augmented by passing each individual byte through a two-layer substitution-permutation (S&P) network implemented with the `prim_subst_perm` primitive.
+This is applied byte-wise in order to maintain byte-write-ability without having to perform a read-modify-write operation.
+The S&P network employed is similar to the one employed in PRESENT and will be explained in more detail [further below]({{< relref "#custom-substitution-permutation-network" >}}).
+
+Another CTR mode augmentation that is aimed at breaking the linear address space is SRAM address scrambling.
+The same two-layer S&P network that is used for byte diffusion is leveraged to non-linearly remap the SRAM address as shown in the block diagram above.
+As opposed to the byte diffusion S&P networks, this particular address scrambling network additionally XORs in a nonce that has the same width as the address.
+
## Hardware Interfaces
### Parameters
@@ -34,11 +55,12 @@
Parameter | Default (Max) | Top Earlgrey | Description
----------------------------|-----------------------|--------------|---------------
`Depth` | 512 | multiple | SRAM depth, needs to be a power of 2 if `NumAddrScrRounds` > 0.
-`Width` | 32 (64) | 32 | Effective SRAM width without redundancy.
+`Width` | 32 | 32 | Effective SRAM width without redundancy.
`CfgWidth` | 8 | 8 | Width of SRAM attributes field.
`NumPrinceRoundsHalf` | 2 (5) | 2 | Number of PRINCE half-rounds.
`NumByteScrRounds` | 2 | 2 | Number of intra-byte diffusion rounds, set to 0 to disable.
`NumAddrScrRounds` | 2 | 2 | Number of address scrambling rounds, set to 0 to disable.
+`ReplicateKeyStream` | 0 (1) | 0 | If set to 1, the same 64-bit keystream is replicated if the data port is wider than 64 bits. Otherwise, multiple PRINCE primitives are employed to generate a unique keystream for the full data width.
### Signals
@@ -51,7 +73,7 @@
`sram_tl_i` | `input` | `tlul_pkg::tl_h2d_t` | Second TL-UL interface for the SRAM macro (independent from the CSR TL-UL port).
`sram_tl_o` | `input` | `tlul_pkg::tl_d2h_t` | Second TL-UL interface for the SRAM macro (independent from the CSR TL-UL port).
`lc_escalate_en_i` | `input` | `lc_ctrl_pkg::lc_tx_t` | Multibit life cycle escalation enable signal coming from life cycle controller.
-`sram_otp_key_o` | `output` | `otp_ctrl_pkg::sram_otp_key_req_t` | Key derivation request going to the key derivation inferface of the OTP controller.
+`sram_otp_key_o` | `output` | `otp_ctrl_pkg::sram_otp_key_req_t` | Key derivation request going to the key derivation interface of the OTP controller.
`sram_otp_key_i` | `input` | `otp_ctrl_pkg::sram_otp_key_rsp_t` | Ephemeral scrambling key coming back from the key derivation inferface of the OTP controller.
#### Lifecycle Escalation Input
@@ -89,9 +111,69 @@
Hence, if the SRAM controller clock `clk_i` is faster or in the same order of magnitude as `clk_otp_i`, the data can be directly sampled upon assertion of `src_ack_o`.
If the SRAM controller runs on a significantly slower clock than OTP, an additional register (as indicated with dashed grey lines in the figure) has to be added.
-## Design Details
+## Custom Substitution Permutation Network
-** TODO: add detailed description of scrambling mechanism **
+In addition to the PRINCE primitive, the SRAM controller employs a custom S&P network for byte diffusion and address scrambling.
+The structure of that S&P network is similar to the one used in PRESENT, but it uses a modified permutation function that makes it possible to parameterize the network to arbitrary data widths, as shown in the pseudocode below.
+
+```c++
+
+NUM_ROUNDS = 2;
+DATA_WIDTH = 8; // bitwidth of the data
+
+// Apply PRESENT Sbox4 on all nibbles, leave uppermost bits unchanged
+// if the width is not divisible by 4.
+state_t sbox4_layer(state) {
+ for (int i = 0; i < DATA_WIDTH/4; i ++) {
+ nibble_t nibble = get_nibble(state, i);
+ nibble = present_sbox4(nibble);
+ set_nibble(state, i, nibble);
+ }
+ return state;
+}
+
+// Reverses the bit order.
+state_t flip_vector(state) {
+ state_t state_flipped;
+ for (int i = 0; i < DATA_WIDTH; i ++) {
+ state_flipped[i] = state[DATA_WIDTH-1-i];
+ }
+ return state_flipped;
+}
+
+// Gather all even bits and put them into the lower half.
+// Gather all odd bits and put them into the upper half.
+state_t perm_layer(state) {
+ // Initialize with input state.
+ // If the number of bits is odd, the uppermost bit
+ // will stay in position, as intended.
+ state_t state_perm = state;
+ for (int i = 0; i < DATA_WIDTH/2; i++) {
+ state_perm[i] = state[i * 2];
+ state_perm[i + DATA_WIDTH/2] = state[i * 2 + 1];
+ }
+ return state_perm;
+}
+
+state_t prim_subst_perm(data_i, key_i) {
+
+ state_t state = data_i;
+ for (int i = 0; i < NUM_ROUNDS; i++) {
+ state ^= key_i;
+ state = sbox4_layer(state);
+ // The vector flip and permutation operations have the
+ // combined effect that all bits will be passed through an
+ // Sbox4 eventually, even if the number of bits in data_i
+ // is not aligned with 4.
+ state = flip_vector(state);
+ state = perm_layer(state);
+ }
+
+ return state ^ key_i;
+}
+
+```
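For experimentation, the pseudocode above can be turned into a compilable software model along the following lines. This is a sketch rather than a reference model of `prim_subst_perm`: the state is held in a `uint32_t`, the width is limited to 1 to 31 bits, and the helper `is_bijective` is a hypothetical addition that checks that the network maps every input to a distinct output.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// PRESENT 4-bit S-box.
static const uint8_t kPresentSbox4[16] = {0xC, 0x5, 0x6, 0xB, 0x9, 0x0,
                                          0xA, 0xD, 0x3, 0xE, 0xF, 0x8,
                                          0x4, 0x7, 0x1, 0x2};

// Substitutes every complete nibble; leftover upper bits pass through.
uint32_t sbox4_layer(uint32_t state, int width) {
  for (int i = 0; i < width / 4; i++) {
    uint32_t nibble = (state >> (4 * i)) & 0xFu;
    state &= ~(0xFu << (4 * i));
    state |= static_cast<uint32_t>(kPresentSbox4[nibble]) << (4 * i);
  }
  return state;
}

// Reverses the bit order within the lower `width` bits.
uint32_t flip_vector(uint32_t state, int width) {
  uint32_t out = 0;
  for (int i = 0; i < width; i++) {
    out |= ((state >> (width - 1 - i)) & 1u) << i;
  }
  return out;
}

// Gathers even bits into the lower half and odd bits into the upper half.
// For odd widths, the uppermost bit stays in place.
uint32_t perm_layer(uint32_t state, int width) {
  const int half = width / 2;
  uint32_t out = 0;
  for (int i = 0; i < half; i++) {
    out |= ((state >> (2 * i)) & 1u) << i;
    out |= ((state >> (2 * i + 1)) & 1u) << (i + half);
  }
  if (width % 2 != 0) {
    out |= state & (1u << (width - 1));
  }
  return out;
}

uint32_t subst_perm(uint32_t data, uint32_t key, int width, int rounds) {
  assert(width >= 1 && width < 32);
  assert((data >> width) == 0 && (key >> width) == 0);
  uint32_t state = data;
  for (int i = 0; i < rounds; i++) {
    state ^= key;
    state = sbox4_layer(state, width);
    state = flip_vector(state, width);
    state = perm_layer(state, width);
  }
  return state ^ key;
}

// Hypothetical helper: true if every input maps to a distinct in-range output.
bool is_bijective(int width, uint32_t key, int rounds) {
  std::vector<bool> seen(1u << width, false);
  for (uint32_t x = 0; x < (1u << width); x++) {
    uint32_t y = subst_perm(x, key, width, rounds);
    if (y >= seen.size() || seen[y]) return false;
    seen[y] = true;
  }
  return true;
}
```

With `width = 8` and `key = 0`, this models the byte diffusion case. Passing an address-wide nonce as `key` (for example `width = 9` for `Depth = 512`) is one way to model the address scrambling case; treating the nonce as the `key_i` input is an assumption of this sketch. Bijectivity matters for address scrambling because two addresses must never be remapped to the same SRAM location.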
+
# Programmer's Guide