[sram_ctrl/doc] Clean up documentation and add block diagram

Signed-off-by: Michael Schaffner <msf@opentitan.org>
diff --git a/hw/ip/sram_ctrl/doc/_index.md b/hw/ip/sram_ctrl/doc/_index.md
index 5cf4c5f..febdbd6 100644
--- a/hw/ip/sram_ctrl/doc/_index.md
+++ b/hw/ip/sram_ctrl/doc/_index.md
@@ -19,12 +19,33 @@
 
 # Theory of Operations
 
-### Block Diagram
+## Block Diagram
 
-**TODO: draw block diagram and add description**
+As shown in the block diagram below (for `Width = 32`), the SRAM controller contains a CSR node, a key request interface, a TL-UL SRAM adapter, and an instance of `prim_ram_1p_scr` that implements the actual scrambling algorithm.
+Scrambling is always enabled, but the scrambling device uses an all-zero scrambling key and nonce when it comes out of reset.
+It is the task of SW to request a new scrambling key and nonce via the CSRs as described in the [Programmer's Guide]({{< relref "#programmers-guide" >}}) below.
 
 ![SRAM Controller Block Diagram](sram_ctrl_blockdiag.svg)
 
+The scrambling device employs a reduced-round (5 instead of 11) PRINCE block cipher in CTR mode to scramble the data.
+The PRINCE lightweight block cipher was selected for its low latency and low area; see [prim_prince]({{< relref "hw/ip/prim/doc/prim_prince" >}}) for more information on PRINCE.
+The number of rounds is reduced to 5 in order to ease timing pressure and ensure single-cycle operation (the number of rounds can be increased if there turns out to be enough timing slack).
+
+In [CTR mode](https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#Counter_(CTR)), the block cipher encrypts a 64-bit IV with the scrambling key to create a 64-bit keystream block, which is bitwise XORed with the data to transform plaintext into ciphertext and vice versa.
+The IV is assembled by concatenating a nonce with the word address.
+
+If the input data word is narrower than 64 bits, the keystream block is truncated to fit the input data width.
+If the input data word is wider than 64 bits, the SRAM controller by default instantiates multiple PRINCE primitives in order to create a unique keystream for the full data width.
+For area-constrained settings, the parameter `ReplicateKeyStream` can be set to 1 in order to replicate the keystream block generated by a single primitive instead of using multiple parallel PRINCE instances (note that this lowers the level of security).
+
+Since plain CTR mode does not diffuse the data bits due to the bitwise XOR, the scheme is augmented by passing each individual byte through a two-layer substitution-permutation (S&P) network implemented with the `prim_subst_perm` primitive.
+This is applied byte-wise in order to maintain byte-write-ability without having to perform a read-modify-write operation.
+The S&P network is similar to the one used in PRESENT and is explained in more detail [further below]({{< relref "#custom-substitution-permutation-network" >}}).
+
+Another CTR mode augmentation that is aimed at breaking the linear address space is SRAM address scrambling.
+The same two-layer S&P network that is used for byte diffusion is leveraged to non-linearly remap the SRAM address as shown in the block diagram above.
+Unlike the byte diffusion S&P networks, the address scrambling network additionally XORs in a nonce that has the same width as the address.
+
 ## Hardware Interfaces
 
 ### Parameters
@@ -34,11 +55,12 @@
 Parameter                   | Default (Max)         | Top Earlgrey | Description
 ----------------------------|-----------------------|--------------|---------------
 `Depth`                     | 512                   | multiple     | SRAM depth, needs to be a power of 2 if `NumAddrScrRounds` > 0.
-`Width`                     | 32 (64)               | 32           | Effective SRAM width without redundancy.
+`Width`                     | 32                    | 32           | Effective SRAM width without redundancy.
 `CfgWidth`                  | 8                     | 8            | Width of SRAM attributes field.
 `NumPrinceRoundsHalf`       | 2 (5)                 | 2            | Number of PRINCE half-rounds.
 `NumByteScrRounds`          | 2                     | 2            | Number of intra-byte diffusion rounds, set to 0 to disable.
 `NumAddrScrRounds`          | 2                     | 2            | Number of address scrambling rounds, set to 0 to disable.
+`ReplicateKeyStream`        | 0 (1)                 | 0            | If set to 1, the same 64-bit keystream block is replicated if the data port is wider than 64 bits. Otherwise, multiple PRINCE primitives are employed to generate a unique keystream for the full data width.
 
 ### Signals
 
@@ -51,7 +73,7 @@
 `sram_tl_i`                | `input`          | `tlul_pkg::tl_h2d_t`               | Second TL-UL interface for the SRAM macro (independent from the CSR TL-UL port).
 `sram_tl_o`                | `input`          | `tlul_pkg::tl_d2h_t`               | Second TL-UL interface for the SRAM macro (independent from the CSR TL-UL port).
 `lc_escalate_en_i`         | `input`          | `lc_ctrl_pkg::lc_tx_t`             | Multibit life cycle escalation enable signal coming from life cycle controller.
-`sram_otp_key_o`           | `output`         | `otp_ctrl_pkg::sram_otp_key_req_t` | Key derivation request going to the key derivation inferface of the OTP controller.
+`sram_otp_key_o`           | `output`         | `otp_ctrl_pkg::sram_otp_key_req_t` | Key derivation request going to the key derivation interface of the OTP controller.
 `sram_otp_key_i`           | `input`          | `otp_ctrl_pkg::sram_otp_key_rsp_t` | Ephemeral scrambling key coming back from the key derivation inferface of the OTP controller.
 
 #### Lifecycle Escalation Input
@@ -89,9 +111,69 @@
 Hence, if the SRAM controller clock `clk_i` is faster or in the same order of magnitude as `clk_otp_i`, the data can be directly sampled upon assertion of `src_ack_o`.
 If the SRAM controller runs on a significantly slower clock than OTP, an additional register (as indicated with dashed grey lines in the figure) has to be added.
 
-## Design Details
+## Custom Substitution Permutation Network
 
-** TODO: add detailed description of scrambling mechanism **
+In addition to the PRINCE primitive, the SRAM controller employs a custom S&P network for byte diffusion and address scrambling.
+The structure of that S&P network is similar to the one used in PRESENT, but it uses a modified permutation function that makes it possible to parameterize the network to arbitrary data widths, as shown in the pseudocode below.
+
+```c++
+
+NUM_ROUNDS = 2;
+DATA_WIDTH = 8; // bitwidth of the data
+
+// Apply PRESENT Sbox4 on all nibbles, leave uppermost bits unchanged
+// if the width is not divisible by 4.
+state_t sbox4_layer(state) {
+    for (int i = 0; i < DATA_WIDTH/4; i++) {
+        nibble_t nibble = get_nibble(state, i);
+        nibble = present_sbox4(nibble);
+        set_nibble(state, i, nibble);
+    }
+    return state;
+}
+
+// Reverses the bit order.
+state_t flip_vector(state) {
+    state_t state_flipped;
+    for (int i = 0; i < DATA_WIDTH; i++) {
+        state_flipped[i] = state[DATA_WIDTH-1-i];
+    }
+    return state_flipped;
+}
+
+// Gather all even bits and put them into the lower half.
+// Gather all odd bits and put them into the upper half.
+state_t perm_layer(state) {
+    // Initialize with input state.
+    // If the number of bits is odd, the uppermost bit
+    // will stay in position, as intended.
+    state_t state_perm = state;
+    for (int i = 0; i < DATA_WIDTH/2; i++) {
+        state_perm[i]                = state[i * 2];
+        state_perm[i + DATA_WIDTH/2] = state[i * 2 + 1];
+    }
+    return state_perm;
+}
+
+state_t prim_subst_perm(data_i, key_i) {
+
+    state_t state = data_i;
+    for (int i = 0; i < NUM_ROUNDS; i++) {
+        state ^= key_i;
+        state = sbox4_layer(state);
+        // The vector flip and permutation operations have the
+        // combined effect that all bits will be passed through an
+        // Sbox4 eventually, even if the number of bits in data_i
+        // is not aligned with 4.
+        state = flip_vector(state);
+        state = perm_layer(state);
+    }
+
+    return state ^ key_i;
+}
+
+```
+
 
 # Programmer's Guide