feat(debug): Transition debug module to CSR-based interface

This commit refactors the debug module's external interface, moving from direct pin-level connections to a memory-mapped CSR (Control and Status Register) interface accessible via the AXI bus. This change provides a more structured and extensible mechanism for controlling and interacting with the debug module.

Key changes include:

- **Hardware:**
    - The `CoreAxi` top-level module is updated to connect the debug module to the new CSR interface, removing the direct `dm` port.
    - `CoreAxiCSR` is enhanced with a set of registers for debug requests and responses, including registers for address, data, operation, and status.
    - The debug module's request and response signals are now driven by these CSRs, enabling control via AXI writes and reads.

- **Testbench:**
    - The `CoreMiniAxiInterface` now drives the debug module through the new CSR-based protocol, using new `read_csr` and `write_csr` helpers layered on the existing AXI word accessors.
    - The `dm_req_agent` and `dm_rsp_agent`, which previously managed the pin-level protocol, have been removed.
    - The `dm_read` and `dm_write` functions are rewritten to perform the multi-step process of writing to the request CSRs, polling the status CSR, and reading the response CSRs.
    - All debug-related tests are updated to use the new `dm_read` and `dm_write` functions.

- **Documentation:**
    - The debug module documentation is updated to describe the new CSR-based command protocol.
    - A table of the new AXI CSRs is added, including their addresses and descriptions.
    - The command examples are updated to reflect the new multi-step process for reading and writing GPRs via the CSR interface.

This refactoring removes the dedicated `dm` port from the top-level interface, routes testbench debug traffic over AXI (the same path the documentation describes for external debuggers), and documents the resulting programming model.
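
The multi-step sequence performed by `dm_write` and `dm_read` can be sketched in plain Python against a toy register model. This is an illustration under stated assumptions, not the testbench code: `FakeDebugCsrs`, `poll_status`, `dm_access`, and the `RSP_SUCCESS = 0` encoding are hypothetical names and values introduced here. Only the CSR offsets, the `0x30000` base, the op codes, and the status bits follow the documentation updated in this change.

```python
CSR_BASE = 0x30000
REQ_ADDR, REQ_DATA, REQ_OP = 0x1000, 0x1004, 0x1008
RSP_DATA, RSP_OP, STATUS = 0x100C, 0x1010, 0x1014
OP_READ, OP_WRITE = 1, 2
RSP_SUCCESS = 0  # Assumption: the real DmRspOp encoding is not shown here.


class FakeDebugCsrs:
    """Hypothetical toy model of the debug CSRs; not the real hardware."""

    def __init__(self):
        self.dm_regs = {}  # internal debug module registers
        self.req_addr = 0
        self.req_data = 0
        self.rsp = None    # (data, op) while a response is pending

    def write(self, addr, value):
        offset = addr - CSR_BASE
        if offset == REQ_ADDR:
            self.req_addr = value
        elif offset == REQ_DATA:
            self.req_data = value
        elif offset == REQ_OP:
            # Writing the op code is what launches the command.
            if value == OP_WRITE:
                self.dm_regs[self.req_addr] = self.req_data
                self.rsp = (0, RSP_SUCCESS)
            elif value == OP_READ:
                self.rsp = (self.dm_regs.get(self.req_addr, 0), RSP_SUCCESS)
        elif offset == STATUS:
            self.rsp = None  # Any write acknowledges the response.

    def read(self, addr):
        offset = addr - CSR_BASE
        pending = self.rsp is not None
        if offset == STATUS:
            # Simplification: "ready" here just means no response is pending.
            ready = 0 if pending else 1
            valid = 1 if pending else 0
            return (valid << 1) | ready
        if offset == RSP_DATA:
            return self.rsp[0] if pending else 0
        if offset == RSP_OP:
            return self.rsp[1] if pending else 0
        return 0


def poll_status(bus, bit, attempts=1000):
    """Re-read the status CSR until the given bit is 1."""
    for _ in range(attempts):
        if bus.read(CSR_BASE + STATUS) & (1 << bit):
            return
    raise TimeoutError(f"status bit {bit} never set")


def dm_access(bus, op, reg, data=0):
    """Run one debug module command and return the response data."""
    poll_status(bus, 0)                   # bit 0: ready for a request
    bus.write(CSR_BASE + REQ_ADDR, reg)
    if op == OP_WRITE:
        bus.write(CSR_BASE + REQ_DATA, data)
    bus.write(CSR_BASE + REQ_OP, op)      # launches the command
    poll_status(bus, 1)                   # bit 1: response available
    rsp_op = bus.read(CSR_BASE + RSP_OP)
    rsp_data = bus.read(CSR_BASE + RSP_DATA)
    bus.write(CSR_BASE + STATUS, 0)       # acknowledge the response
    if rsp_op != RSP_SUCCESS:
        raise RuntimeError(f"debug command failed with rsp_op={rsp_op}")
    return rsp_data


bus = FakeDebugCsrs()
dm_access(bus, OP_WRITE, 0x10, 0x1)        # e.g. dmcontrol
print(hex(dm_access(bus, OP_READ, 0x10)))  # prints 0x1
```

Reading both response CSRs before acknowledging matters in this sketch because the acknowledge write discards the pending response, which is also how the single-entry response queue in `CoreCSR` behaves.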

Change-Id: I39be61d805d0e1236550a33a40dbb20f91da0d67
diff --git a/doc/microarch/debug.md b/doc/microarch/debug.md
index d70f86f..80d92e3 100644
--- a/doc/microarch/debug.md
+++ b/doc/microarch/debug.md
@@ -14,7 +14,7 @@
 
 ## Interfaces
 
-The following table describes the inputs and outputs of the debug module:
+The following table describes the internal hardware interfaces of the debug module:
 
 | Name           | Direction | Type                  | Width | Description                               |
 |----------------|-----------|-----------------------|-------|-------------------------------------------|
@@ -61,19 +61,48 @@
 
 ## Command Protocol
 
-An external debugger communicates with the debug module by reading and writing its internal registers. The `ext.req` and `ext.rsp` interfaces are used for this purpose.
+An external debugger communicates with the debug module by reading and writing a set of memory-mapped Control and Status Registers (CSRs) over the AXI interface. These CSRs provide a communication channel to the debug module's internal registers.
 
-To issue a command, the debugger sends a request on the `ext.req` interface. The `address` field specifies the register to access, and the `op` field specifies the operation (read or write). For write operations, the `data` field contains the value to write.
+### Write Operation
 
-The debug module responds on the `ext.rsp` interface. The `op` field indicates the status of the operation, and for read operations, the `data` field contains the value read from the register.
+To write to an internal debug module register (e.g., writing `0x1` to `dmcontrol` at address `0x10`):
 
-All debug operations, including halting the core, resuming the core, and executing abstract commands, are performed by reading and writing the debug module's registers via this protocol.
+1.  **Poll for readiness:** Read the `status` CSR at `0x31014` and wait for bit 0 to be `1`.
+2.  **Set address:** Write the target internal register address (`0x10`) to the `req_addr` CSR at `0x31000`.
+3.  **Set data:** Write the data (`0x1`) to the `req_data` CSR at `0x31004`.
+4.  **Initiate write:** Write the `WRITE` operation code (`2`) to the `req_op` CSR at `0x31008`.
+5.  **Poll for response:** Read the `status` CSR at `0x31014` and wait for bit 1 to be `1`.
+6.  **Check status:** Read the `rsp_op` CSR at `0x31010` to confirm the operation was successful.
+7.  **Acknowledge response:** Write any value (e.g., `0`) to the `status` CSR at `0x31014` to clear the response.
 
-## Registers
+### Read Operation
 
-The debug module implements a set of registers that are accessible to an external debugger. These registers are used to control and monitor the core.
+To read from an internal debug module register (e.g., reading from `dmstatus` at address `0x11`):
 
-The following table lists the debug module registers:
+1.  **Poll for readiness:** Read the `status` CSR at `0x31014` and wait for bit 0 to be `1`.
+2.  **Set address:** Write the target internal register address (`0x11`) to the `req_addr` CSR at `0x31000`.
+3.  **Initiate read:** Write the `READ` operation code (`1`) to the `req_op` CSR at `0x31008`.
+4.  **Poll for response:** Read the `status` CSR at `0x31014` and wait for bit 1 to be `1`.
+5.  **Check status:** Read the `rsp_op` CSR at `0x31010` to confirm the operation was successful.
+6.  **Read data:** Read the result from the `rsp_data` CSR at `0x3100c`.
+7.  **Acknowledge response:** Write any value (e.g., `0`) to the `status` CSR at `0x31014` to clear the response.
+
+## AXI CSR Interface
+
+These registers are mapped into the Kelvin CSR address space and are used to communicate with the debug module.
+
+| Address    | Name       | Description                                                                                                |
+|------------|------------|------------------------------------------------------------------------------------------------------------|
+| 0x31000    | req_addr   | Write the target debug module register address here.                                                       |
+| 0x31004    | req_data   | Write data for the debug module operation here.                                                            |
+| 0x31008    | req_op     | Write the operation type (e.g., READ, WRITE) to this register to initiate a debug module command.          |
+| 0x3100c    | rsp_data   | After a command completes, the data result is available here.                                              |
+| 0x31010    | rsp_op     | After a command completes, the status result (e.g., SUCCESS, FAILED) is available here.                    |
+| 0x31014    | status     | Reports the status of the debug module. Bit 0 indicates if the module is ready for a new request. Bit 1 indicates if a response is available. Writing any value to this register acknowledges (clears) the pending response. |
+
+## Internal Debug Module Registers
+
+The debug module implements a set of internal registers that are accessible to an external debugger via the AXI CSR interface. These registers are used to control and monitor the core.
 
 | Address | Name         | Description                               |
 |---------|--------------|-------------------------------------------|
@@ -172,7 +201,4 @@
     *   `write` (bit 16): `1` (write)
     *   `regno` (bits 15:0): `0x100A` (for `a0`)
 3.  **Wait for completion:** Poll the `abstractcs` register (address `0x16`) until the `busy` bit (bit 12) is cleared.
-4.  **Check for errors:** Read the `abstractcs` register again and check that the `cmderr` field (bits 10:8) is `0`.
-
-
-
+4.  **Check for errors:** Read the `abstractcs` register again and check that the `cmderr` field (bits 10:8) is `0`.
\ No newline at end of file
diff --git a/hdl/chisel/src/kelvin/CoreAxi.scala b/hdl/chisel/src/kelvin/CoreAxi.scala
index 8bac00a..9de9466 100644
--- a/hdl/chisel/src/kelvin/CoreAxi.scala
+++ b/hdl/chisel/src/kelvin/CoreAxi.scala
@@ -40,9 +40,6 @@
     // String logging interface
     val slog = new SLogIO(p)
     val te = Input(Bool())
-
-    // DM-IF
-    val dm = Option.when(p.useDebugModule)(new DebugModuleIO(p))
   })
   dontTouch(io)
 
@@ -68,8 +65,8 @@
       dontTouch(dm.get.io)
       val dmEnable = RegInit(false.B)
       dmEnable := true.B
-      dm.get.io.ext.req <> GateDecoupled(io.dm.get.req, dmEnable)
-      io.dm.get.rsp <> GateDecoupled(dm.get.io.ext.rsp, dmEnable)
+      dm.get.io.ext.req <> GateDecoupled(csr.io.debug.get.req, dmEnable)
+      csr.io.debug.get.rsp <> GateDecoupled(dm.get.io.ext.rsp, dmEnable)
     }
 
     val core_reset = Mux(io.te, (!io.aresetn.asBool).asAsyncReset, (csr.io.reset || dm.map(_.io.ndmreset).getOrElse(false.B)).asAsyncReset)
diff --git a/hdl/chisel/src/kelvin/CoreAxiCSR.scala b/hdl/chisel/src/kelvin/CoreAxiCSR.scala
index 7c2447c..3aac4a8 100644
--- a/hdl/chisel/src/kelvin/CoreAxiCSR.scala
+++ b/hdl/chisel/src/kelvin/CoreAxiCSR.scala
@@ -19,6 +19,15 @@
 
 import bus.AxiMasterIO
 
+object CoreCsrAddrs {
+  val DbgReqAddr = 0x1000.U
+  val DbgReqData = 0x1004.U
+  val DbgReqOp   = 0x1008.U
+  val DbgRspData = 0x100c.U
+  val DbgRspOp   = 0x1010.U
+  val DbgStatus  = 0x1014.U
+}
+
 class CoreCSR(p: Parameters) extends Module {
   val io = IO(new Bundle {
     val fabric = Flipped(new FabricIO(p))
@@ -31,6 +40,7 @@
     val halted = Input(Bool())
     val fault = Input(Bool())
     val kelvin_csr = Input(new CsrOutIO(p))
+    val debug = Option.when(p.useDebugModule)(Flipped(new DebugModuleIO(p)))
   })
 
   // Bit 0 - Reset (Active High)
@@ -39,21 +49,79 @@
   val resetReg = RegInit(3.U(p.fetchAddrBits.W))
   val pcStartReg = RegInit(0.U(p.fetchAddrBits.W))
   val statusReg = RegInit(0.U(p.fetchAddrBits.W))
+  val debugReqAddrReg = Option.when(p.useDebugModule)(RegInit(0.U(32.W)))
+  val debugReqDataReg = Option.when(p.useDebugModule)(RegInit(0.U(32.W)))
+  val debugReqOpReg = Option.when(p.useDebugModule)(RegInit(DmReqOp.NOP.asUInt))
+
+  val writeEn = io.fabric.writeDataAddr.valid && !io.internal
+  val writeAddr = io.fabric.writeDataAddr.bits
+  val writeData = io.fabric.writeDataBits
+
+  val rsp_queue = if (p.useDebugModule) {
+    val queue = Module(new Queue(new DebugModuleRspIO(p), 1))
+    queue.io.enq <> io.debug.get.rsp
+
+    val req_valid_pulse = RegInit(false.B)
+    val write_to_op_reg = writeEn && writeAddr === CoreCsrAddrs.DbgReqOp
+    req_valid_pulse := write_to_op_reg && io.debug.get.req.ready
+    io.debug.get.req.valid := req_valid_pulse
+
+    io.debug.get.req.bits.address := debugReqAddrReg.get
+    io.debug.get.req.bits.data := debugReqDataReg.get
+    val (req_op, req_op_valid) = DmReqOp.safe(debugReqOpReg.get)
+    io.debug.get.req.bits.op := Mux(req_op_valid, req_op, DmReqOp.NOP)
+
+    val write_to_status_reg = writeEn && writeAddr === CoreCsrAddrs.DbgStatus
+    queue.io.deq.ready := write_to_status_reg
+    Some(queue)
+  } else {
+    None
+  }
+
+  val debugReadMap = if (p.useDebugModule) {
+    val debugStatusReg = Cat(rsp_queue.get.io.deq.valid, io.debug.get.req.ready)
+    Seq(
+      CoreCsrAddrs.DbgReqAddr -> Cat(0.U(96.W), debugReqAddrReg.get),
+      CoreCsrAddrs.DbgReqData -> Cat(0.U(64.W), debugReqDataReg.get, 0.U(32.W)),
+      CoreCsrAddrs.DbgReqOp   -> Cat(0.U(32.W), debugReqOpReg.get, 0.U(64.W)),
+      CoreCsrAddrs.DbgRspData -> Cat(rsp_queue.get.io.deq.bits.data, 0.U(96.W)),
+      CoreCsrAddrs.DbgRspOp   -> Cat(0.U(96.W), rsp_queue.get.io.deq.bits.op.asUInt),
+      CoreCsrAddrs.DbgStatus  -> Cat(0.U(64.W), debugStatusReg, 0.U(32.W)),
+    )
+  } else {
+    Seq()
+  }
 
   val readData =
     MuxLookup(io.fabric.readDataAddr.bits, 0.U)(Seq(
       0x0.U -> Cat(0.U(96.W), resetReg),
       0x4.U -> Cat(0.U(64.W), pcStartReg, 0.U(32.W)),
       0x8.U -> Cat(0.U(32.W), statusReg, 0.U(64.W)),
-    ) ++ ((0 until p.csrOutCount).map(
+    ) ++ debugReadMap
+      ++ ((0 until p.csrOutCount).map(
       x => ((0x100 + 4*x).U -> (io.kelvin_csr.value(x) << (32 * (x % 4)).U))
     )))
+
+  val debugReadValidMap = if (p.useDebugModule) {
+    Seq(
+      CoreCsrAddrs.DbgReqAddr -> true.B,
+      CoreCsrAddrs.DbgReqData -> true.B,
+      CoreCsrAddrs.DbgReqOp   -> true.B,
+      CoreCsrAddrs.DbgRspData -> true.B,
+      CoreCsrAddrs.DbgRspOp   -> true.B,
+      CoreCsrAddrs.DbgStatus  -> true.B,
+    )
+  } else {
+    Seq()
+  }
+
   val readDataValid =
     MuxLookup(io.fabric.readDataAddr.bits, false.B)(Seq(
       0x0.U -> true.B,
       0x4.U -> true.B,
       0x8.U -> true.B,
-    ) ++ ((0 until p.csrOutCount).map(x => ((0x100 + 4*x).U -> true.B))))
+    ) ++ debugReadValidMap
+      ++ ((0 until p.csrOutCount).map(x => ((0x100 + 4*x).U -> true.B))))
 
   // Delay reads by one cycle
   val readDataNext = Pipe(readDataValid, readData, 1)
@@ -64,13 +132,30 @@
   io.pcStart := pcStartReg
   statusReg := Cat(io.fault, io.halted)
 
-  // TODO(atv): What bits are allowed to change in these? Add a mask or something.
-  resetReg := Mux(io.fabric.writeDataAddr.valid && io.fabric.writeDataAddr.bits === 0x0.U && !io.internal, io.fabric.writeDataBits(31,0), resetReg)
-  pcStartReg := Mux(io.fabric.writeDataAddr.valid && io.fabric.writeDataAddr.bits === 0x4.U && !io.internal, io.fabric.writeDataBits(63,32), pcStartReg)
-  io.fabric.writeResp := io.fabric.writeDataAddr.valid && MuxLookup(io.fabric.writeDataAddr.bits, false.B)(Seq(
+  // Register writes
+  resetReg := Mux(writeEn && writeAddr === 0x0.U, writeData(31,0), resetReg)
+  pcStartReg := Mux(writeEn && writeAddr === 0x4.U, writeData(63,32), pcStartReg)
+  if (p.useDebugModule) {
+    debugReqAddrReg.get := Mux(writeEn && writeAddr === CoreCsrAddrs.DbgReqAddr, writeData(31,0), debugReqAddrReg.get)
+    debugReqDataReg.get := Mux(writeEn && writeAddr === CoreCsrAddrs.DbgReqData, writeData(63,32), debugReqDataReg.get)
+    debugReqOpReg.get := Mux(writeEn && writeAddr === CoreCsrAddrs.DbgReqOp, writeData(95,64), debugReqOpReg.get)
+  }
+
+  val debugWriteValidMap = if (p.useDebugModule) {
+    Seq(
+      CoreCsrAddrs.DbgReqAddr -> true.B,
+      CoreCsrAddrs.DbgReqData -> true.B,
+      CoreCsrAddrs.DbgReqOp   -> true.B,
+      CoreCsrAddrs.DbgStatus  -> true.B,
+    )
+  } else {
+    Seq()
+  }
+
+  io.fabric.writeResp := writeEn && MuxLookup(writeAddr, false.B)(Seq(
     0x0.U -> true.B,
     0x4.U -> true.B,
-  ))
+  ) ++ debugWriteValidMap)
 }
 
 class CoreAxiCSR(p: Parameters,
@@ -87,6 +172,7 @@
     val halted = Input(Bool())
     val fault = Input(Bool())
     val kelvin_csr = Input(new CsrOutIO(p))
+    val debug = Option.when(p.useDebugModule)(Flipped(new DebugModuleIO(p)))
   })
 
   val axi = Module(new AxiSlave(p))
@@ -107,4 +193,7 @@
   csr.io.halted := io.halted
   csr.io.fault := io.fault
   csr.io.kelvin_csr := io.kelvin_csr
+  if (p.useDebugModule) {
+    io.debug.get <> csr.io.debug.get
+  }
 }
diff --git a/kelvin_test_utils/core_mini_axi_interface.py b/kelvin_test_utils/core_mini_axi_interface.py
index ab184d5..2c9ebdc 100644
--- a/kelvin_test_utils/core_mini_axi_interface.py
+++ b/kelvin_test_utils/core_mini_axi_interface.py
@@ -122,13 +122,6 @@
     self.slave_wfifo = Queue()
     self.slave_bfifo = Queue()
 
-    try:
-      self.debug_available = (self.dut.io_dm_req_valid != None)
-      self.dm_req_fifo = Queue()
-      self.dm_rsp_fifo = Queue()
-    except AttributeError as e:
-      self.debug_available = False
-
   async def init(self):
     cocotb.start_soon(self.master_awagent())
     cocotb.start_soon(self.master_wagent())
@@ -142,9 +135,13 @@
     cocotb.start_soon(self.slave_ragent())
     cocotb.start_soon(self.memory_write_agent())
     cocotb.start_soon(self.memory_read_agent())
-    if self.debug_available:
-      cocotb.start_soon(self.dm_req_agent())
-      cocotb.start_soon(self.dm_rsp_agent())
+
+  async def read_csr(self, addr):
+    val = await self.read_word(0x30000 + addr)
+    return val
+
+  async def write_csr(self, addr, data):
+    await self.write_word(0x30000 + addr, data)
 
   async def slave_awagent(self, timeout=4096):
     self.dut.io_axi_slave_write_addr_valid.value = 0
@@ -318,43 +315,6 @@
         if timeout_count >= timeout:
           assert False, "timeout waiting for rready"
 
-  async def dm_req_agent(self, timeout=4096):
-    self.dut.io_dm_req_valid.value = 0
-    self.dut.io_dm_req_bits_address.value = 0
-    self.dut.io_dm_req_bits_data.value = 0
-    self.dut.io_dm_req_bits_op.value = 0
-    while True:
-      while True:
-        await RisingEdge(self.dut.io_aclk)
-        self.dut.io_dm_req_valid.value = 0
-        if self.dm_req_fifo.qsize():
-          break
-      req_data = await self.dm_req_fifo.get()
-      self.dut.io_dm_req_valid.value = 1
-      self.dut.io_dm_req_bits_address.value = req_data["address"]
-      self.dut.io_dm_req_bits_data.value = req_data["data"]
-      self.dut.io_dm_req_bits_op.value = req_data["op"]
-      await FallingEdge(self.dut.io_aclk)
-      timeout_count = 0
-      while self.dut.io_dm_req_ready.value == 0:
-        await FallingEdge(self.dut.io_aclk)
-        timeout_count += 1
-        if timeout_count >= timeout:
-          assert False, "timeout waiting for dm_req_ready"
-
-  async def dm_rsp_agent(self):
-    self.dut.io_dm_rsp_ready.value = 1
-    while True:
-      await RisingEdge(self.dut.io_aclk)
-      try:
-        if self.dut.io_dm_rsp_valid.value:
-          rsp = dict()
-          rsp["data"] = self.dut.io_dm_rsp_bits_data.value.to_unsigned()
-          rsp["op"] = self.dut.io_dm_rsp_bits_op.value.to_unsigned()
-          await self.dm_rsp_fifo.put(rsp)
-      except Exception as e:
-        print('X seen in dm_rsp_agent: ' + str(e))
-
   async def memory_write_agent(self):
     while True:
       while True:
@@ -445,23 +405,43 @@
     kelvin_reset_csr_addr = 0x30000
     await self.write_word(kelvin_reset_csr_addr, 3)
 
+  async def _poll_dm_status(self, bit, value):
+    while True:
+      status = await self.read_csr(0x1014)
+      if (status[0] & (1 << bit)) == value:  # value is the masked result: 0 or (1 << bit)
+        break
+      await ClockCycles(self.dut.io_aclk, 10)
+
   async def dm_read(self, addr):
-    req = dict()
-    req["address"] = addr
-    req["data"] = 0
-    req["op"] = DmReqOp.READ
-    await self.dm_req_fifo.put(req)
-    rsp = await self.dm_rsp_fifo.get()
+    await self._poll_dm_status(0, 1)
+
+    await self.write_csr(0x1000, addr)
+    await self.write_csr(0x1004, 0)
+    await self.write_csr(0x1008, DmReqOp.READ)
+
+    await self._poll_dm_status(1, 2)
+
+    rsp = dict()
+    rsp["data"] = int((await self.read_csr(0x100c)).view(np.uint32)[0])
+    rsp["op"] = (await self.read_csr(0x1010)).view(np.uint32)[0]
+    await self.write_csr(0x1014, 0)  # Acknowledge response.
+
     assert rsp["op"] == DmRspOp.SUCCESS
     return rsp["data"]
 
   async def dm_write(self, addr, data):
-    req = dict()
-    req["address"] = addr
-    req["data"] = convert_to_binary_value(np.array([data], dtype=np.uint32).view(np.uint8))
-    req["op"] = DmReqOp.WRITE
-    await self.dm_req_fifo.put(req)
-    rsp = await self.dm_rsp_fifo.get()
+    await self._poll_dm_status(0, 1)
+
+    await self.write_csr(0x1000, addr)
+    await self.write_csr(0x1004, data)
+    await self.write_csr(0x1008, DmReqOp.WRITE)
+
+    await self._poll_dm_status(1, 2)
+
+    rsp = dict()
+    rsp["data"] = int((await self.read_csr(0x100c)).view(np.uint32)[0])
+    rsp["op"] = (await self.read_csr(0x1010)).view(np.uint32)[0]
+    await self.write_csr(0x1014, 0)  # Acknowledge response.
     return rsp
 
   async def dm_read_reg(self, addr, expected_op=DmRspOp.SUCCESS):