RTL Debugging Methodology
Systematic approach for debugging RTL from verification results and test scenarios.
When to Use This Skill
- •Analyzing UVM test failures to identify RTL bugs
- •Investigating assertion violations in simulation
- •Debugging scoreboard mismatches between expected and actual behavior
- •Triaging multiple test failures to find common root causes
- •Understanding why specific test scenarios fail while others pass
- •Analyzing coverage holes related to bugs
Debugging Workflow (推論プロセス)
1. Analyze Test Failure Pattern
Objective: Understand which tests fail and why
Questions to answer:
- •Which tests pass and which fail? (Pattern analysis)
- •Do failures occur in specific test scenarios only?
- •Is the failure deterministic or random? (Check with different seeds)
- •At what phase does the test fail? (Build, run, scoreboard check)
Evidence sources:
- •Test execution logs (sim/logs/)
- •Regression test results (sim/reports/)
- •UVM report summary (UVM_ERROR, UVM_FATAL locations)
- •DSIM Collect Verification Evidence
Objective: Gather all available evidence from verification components
Verification evidence sources:
From Assertions:
UVM_ERROR @ 1250ns: Assertion 'a_axi_wdata_stable' failed Location: sim/assertions/axi4_protocol_checker.sv:45 Property: wdata must remain stable when wvalid=1 and wready=0
→ RTL violates AXI4 protocol specification
From Scoreboard:
UVM_ERROR: [SCOREBOARD] Data mismatch detected Expected: 0xDEADBEEF Actual: 0xDEADBEE0 Address: 0x1000 Time: 1250ns
→ LSB nibble corrupted, check datapath width or masking logic
From Monitor:
UVM_WARNING: [MONITOR] Unexpected transaction observed Type: WRITE Address: 0x1004 (expected: 0x1000) PossiMap Evidence to RTL Problem Domain **Objective**: Translate verification failures to RTL problem categories **Evidence-to-Problem mapping**: | Verification Evidence | RTL Problem Domain | Investigation Focus | |----------------------|-------------------|---------------------| | **Assertion: Protocol violation** | Interface logic | Check handshake FSM, signal timing | | **Scoreboard: Data mismatch** | Datapath logic | Check ALU, mux select, forwarding | | **Scoreboard: Missing transaction** | Control logic | Check enable signals, FSM transitions | | **Scoreboard: Extra transaction** | Control logic | Check termination conditions, counters | | **Monitor: Wrong address** | Address generation | Check increment/decrement logic, offset calculation | | **Monitor: Wrong timing** | Pipeline control | Check stall logic, valid/ready propagation | | **Assertion: X-propagation** | Reset/initialization | Check reset assignments, case completeness | **Test scenario analysis**:
Failing scenario: Back-to-back writes with no idle cycles Passing scenario: Writes with 2-cycle gaps
Hypothesis generation:
- •Pipeline hazard when no bubble between transactions
- •Backpressure handling assumes idle cycles
- •State machine doesn't handle consecutive valid inputs
- •Register forwarding path missing for zero-latency case
**Objective**: Create minimal test to isolate root cause **Experiment design strategies**: **Modify existing failing test**: ```systemverilog // Original failing test: Back-to-back writes sequence.add_transaction(WRITE, addr=0x1000, data=0xAA); sequence.add_transaction(WRITE, addr=0x1004, data=0xBB); // ← FAILS // Experiment 1: Add gap between transactions sequence.add_transaction(WRITE, addr=0x1000, data=0xAA); sequence.add_idle_cycles(2); sequence.add_transaction(WRITE, addr=0x1004, data=0xBB); // ← PASS? // If passes: Confirms pipeline hazard hypothesis // Experiment 2: Same address back-to-back sequence.add_transaction(WRITE, addr=0x1000, data=0xAA); sequence.add_transaction(WRITE, addr=0x1000, data=0xBB); // ← PASS/FAIL? // If passes: Problem is address-generation specific
Create minimal directed test:
// Hypothesis: Burst counter overflows at length=16
class minimal_burst_test extends base_test;
virtual task run_phase(uvm_phase phase);
phase.raise_objection(this);
// Test exactly at boundary
send_burst(addr=0x0, length=15); // Should work
send_burst(addr=0x0, length=16); // Should fail
send_burst(addr=0x0, length=17); // Should fail
phase.drop_objection(this);
endtask
endclass
Add debug assertions:
// Insert temporary assertion at suspected problem point
bind axi_slave_fsm debug_assertions (
.clk(clk),
.state(current_state),
.wvalid(wvalid),
.wready(wready)
);
Trace from Verification to RTL Root Cause
**Objective**: Navigate from high-level test failure to specific RTL bug
**Top-down tracing workflow**:
- •
Test Failure └─ axiuart_burst_test fails with scoreboard mismatch
- •
Scoreboard Analysis └─ Expected data: 0xBB, Actual: 0xAA └─ Second write returned first write's data
- •
Monitor Analysis (check transactions observed) └─ WRITE(addr=0x1000, data=0xAA) @ 1000ns - acknowledged └─ WRITE(addr=0x1004, data=0xBB) @ 1002ns - acknowledged └─ READ(addr=0x1004) @ 1010ns - returned 0xAA (wrong!)
- •
Waveform Analysis at 1002ns (second write) └─ axi_wdata = 0xBB ✓ └─ axi_waddr = 0x1004 ✓ └─ write_enable = 1'b1 ✓ └─ But: register_select still points to 0x1000 ✗
- •
RTL Module Analwith Test Suite
Objective: Confirm fix resolves issue without breaking other tests
Verification workflow:
Step 1: Re-run failing test
# Run specific test that previously failed run_uvm_simulation --test axiuart_burst_test --seed 12345 # Expected: PASS
Step 2: Run related tests (test suite partitioning)
# Run all tests that exercise same RTL module run_uvm_simulation --regression smoke_suite # Focus: Tests with write transactions, address decoding
Step 3: Full regression
### By Test Failure Type | Failure Type | Root Cause Category | Investigation Focus | |-------------|---------------------|---------------------| | **Scoreboard mismatch: wrong data** | Datapath error | Trace data from source to sink, check mux selects, forwarding | | **Scoreboard mismatch: missing transaction** | Control flow error | Check FSM transitions, enable signals, counter termination | | **Scoreboard mismatch: extra transaction** | Control flow error | Check counter overflow, FSM looping, duplicate strobes | | **Assertion: Protocol violation** | Interface timing | Check handshake sequences, stability requirements, backpressure | | **Assertion: Stability violation** | Combinational logic | Check for unintended signal changes, glitches, race conditions | | **Assertion: X-propagation** | Initialization error | Check reset coverage, case statement completeness, undriven signals | | **Timeout: No response** | Deadlock or FSM stuck | Check FSM for unreachable transitions, missing conditions | | **UVM_FATAL: Null object** | Verification code bug | Not RTL issue - check testbench configuration | ### By Test Pass/Fail Pattern **Pattern: Only random tests fail, directed tests pass** - **Hypothesis**: Corner case not covered by directed tests - **Action**: Analyze failing random test stimulus for common characteristics - **Example**: Random test hits burst length=256, directed tests only ≤16 **Pattern: All tests with feature X fail, others pass** - **Hypothesis**: Feature X has RTL bug - **Action**: Focus debug on RTL module implementing feature X - **Example**: All interrupt tests fail → debug interrupt controller **Pattern: Intermittent failures with different seeds** - **Hypothesis**: Race condition or initialization dependency - *From Verification Evidence to RTL Root Cause ### Scoreboard-Driven Investigation **Scoreboard reports data mismatch**:
Step 1: Identify transaction with mismatch Monitor: WRITE(addr=0x1000, data=0xDEADBEEF) @ 1000ns Scoreboard: Expected 0xDEADBEEF at 0x1000 Monitor: READ(addr=0x1000) → 0xDEADBEE0 @ 1100ns Mismatch: LSB nibble changed 0xF → 0x0
Step 2: Hypothesize based on bit pattern
- •All bits except LSB nibble correct → byte masking issue
- •LSB nibble zeroed → possible width/alignment problem
Step 3: Check waveform at write cycle (1000ns) axi_wdata[31:0] = 0xDEADBEEF ✓ write_strobe[3:0] = 4'b1111 ✓ register_wdata[31:0] = 0xDEADBEE0 ✗ ← BUG IS HERE
Step 4: Trace write path axi_wdata → data_align_unit → register_wdata Check data_align_unit for LSB nibble handling
Step 5: Find root cause in RTL // Bug found in data_align_unit assign register_wdata = {axi_wdata[31:4], 4'b0000}; // ← Hardcoded zero!
### Assertion-Driven Investigation **Assertion reports protocol violation**:
Assertion 'a_axi_wdata_stable' failed @ 1250ns Property: (wvalid && !wready) |=> $stable(wdata)
Step 1: Understand assertion semantics
- •wdata must not change when wvalid=1 and wready=0
- •This is AXI4 protocol requirement
Step 2: Check waveform at violation timestamp @1249ns: wvalid=1, wready=0, wdata=0xAAAA @1250ns: wvalid=1, wready=0, wdata=0xBBBB ← Changed illegally
Step 3: Find source of wdata in RTL assign wdata = write_fifo_dout;
Step 4: Check FIFO read logic assign fifo_read_en = wvalid && wready; ✓ Correct condition
Step 5: Check for other paths affecting wdata // Found: Debug logic bypassing FIFO! assign wdata = debug_mode ? debug_data : write_fifo_dout; // debug_mode changed during backpressure → violation
### Test Suite Differential Analysis **Multiple tests analysis**: | Test Name | Scenario | Result | Common Attribute | |-----------|----------|--------|------------------| | basic_write | Single write | ✓ PASS | Burst length = 1 | | burst4_write | 4-beat burst | ✓ PASS | Burst length = 4 | | bDebugging Techniques from Test Results ### Regression Test Triage **Analyze multiple test results to find common root cause**:
Regression suite: 42 tests total
- •38 PASS
- •4 FAIL: axiuart_burst16, axiuart_burst32, axiuart_wrap16, axiuart_wrap32
Pattern recognition:
- •All failures involve burst length ≥ 16
- •Both INCR and WRAP burst types affected
- •Burst length ≤ 8 always passes
Common root cause hypothesis:
- •Burst counter width insufficient for length ≥ 16
- •Not specific to burst type (INCR vs WRAP)
- •Not data-pattern dependent
Single fix expected to resolve all 4 failures.
### Minimal Reproducing Test
**Create simplest test that triggers bug**:
```systemverilog
// Original failing test: 200 lines, 10 minutes runtime
class axiuart_burst16_test extends base_test;
// Complex randomization, multiple sequences, ...
endclass
// Minimal reproducer: 15 lines, 10 seconds runtime
class minimal_burst16_test extends base_test;
task run_phase(uvm_phase phase);
axi_seq seq = axi_seq::type_id::create("seq");
phase.raise_objection(this);
// Single burst-16 transaction
seq.addr = 32'h1000;
seq.burst_length = 16; // Minimal case that fails
seq.start(env.agent.sequencer);
phase.drop_objection(this);
endtask
endclass
// Run: Still fails with same root cause
// Benefit: Faster debug iteration (10s vs 10min)
Test Modification Experiments
Systematically modify test to isolate variable: Debugging Pitfalls
Don't Debug Without Test Evidence
❌ Wrong: "I think the problem is in module X, let me check the code" ✅ Right: "Test Y failed with scoreboard mismatch at time T, let me analyze the evidence"
Don't Ignore Test Pass/Fail Patterns
❌ Wrong: Debug first failure in isolation, ignore other tests ✅ Right: Analyze which tests pass/fail to identify common characteristics
Don't Trust Single Test Result
❌ Wrong: Test passed once → bug is fixed ✅ Right: Run regression suite (multiple seeds, scenarios) to confirm fix
Don't Modify RTL Without Evidence
❌ Wrong: Change RTL based on intuition, hope test passes ✅ Right: Trace from test failure → scoreboard → monitor → waveform → RTL
Don't Create Tests Without Purpose
❌ Wrong: Write random tests hoping to find bugs ✅ Right: Analyze coverage holes, create tests targeting untested scenarios
Don't Skip Regression After Fix
❌ Wrong: Failing test now passes → Done ✅ Right: Run full regression to ensure fix doesn't break other tests // Final conclusion: Pure burst length issue, check counter width
### Coverage-Guided Root Cause Analysis
**Use coverage to identify untested paths related to bug**:
```systemverilog
// Coverage report after test failures
covergroup cg_burst_length;
cp_length: coverpoint burst_length {
bins short[] = {[1:8]}; // 100% hit
bins boundary = {15, 16}; // 16 causes failures
bins long[] = {[17:256]}; // 0% hit ← Never tested!
}
endgroup
// Analysis:
// - Tests never tried burst_length > 16
// - Bug might affect all values ≥ 16, not just 16
// - After fix, add test for burst_length=256 to verify
from Test Failures
### From Scoreboard Timestamp to Waveform
**Workflow**:
- •
Test log shows scoreboard error at simulation time 1250ns UVM_ERROR: [SCOREBOARD] Data mismatch at addr=0x1000
- •
Set waveform viewer to time 1250ns
- •
Identify relevant signals from monitor transaction:
- •axi_awaddr (write address channel)
- •axi_wdata (write data channel)
- •Internal register_file signals
- •
Check transaction timing: @1240ns: awvalid=1, awaddr=0x1000, awready=1 (address accepted) @1242ns: wvalid=1, wdata=0xBEEF, wready=1 (data accepted) @1250ns: register_file[0] = 0xBEE0 ← Should be 0xBEEF
- •
Trace internal path: axi_wdata (0xBEEF) → write_data_reg (0xBEEF) → data_align (0xBEE0) ← BUG HERE
### Backward Tracing from Assertion **Assertion fires, trace backward to root cause**:
Assertion violation @ 1250ns: a_valid_stable: (valid && !ready) |=> $stable(data)
Waveform analysis: @1249ns: valid=1, ready=0, data=0xAAAA @1250ns: valid=1, ready=0, data=0xBBBB ← Violated $stable()
Trace data signal backward: data ← output_mux output_mux ← select between fifo_out and bypass_data mux_select changed at 1250ns ← WHY?
Trace mux_selefrom verification results is evidence-driven investigation:
- •Analyze test failure patterns - Which tests fail? What do they have in common?
- •Collect verification evidence - Scoreboard, assertions, monitors, logs
- •Map evidence to RTL problem domain - Translate test failure to RTL category
- •Design targeted experiments - Create minimal tests to isolate root cause
- •Trace from verification to RTL - Navigate from test → scoreboard → waveform → RTL
- •Verify with test suite - Confirm fix with regression, add prevention tests
Key principle: Test results guide investigation. Start from verification evidence (test failures, assertion violations, scoreboard mismatches), not RTL code reading
By Affected Component
Datapath issues:
- •Check operand widths, sign extension, overflow handling
- •Verify bypass/forwarding conditions
- •Trace data flow from source to destination
Control logic issues:
- •Draw state transition diagram from code
- •Verify all states are reachable
- •Check for conflicting control signals
Interface issues:
- •Review protocol timing diagrams
- •Check handshake signal relationships (valid before ready, stable until accepted)
- •Verify backpressure handling
Hypothesis Generation Strategies
Backwards Tracing
Start at the failure point and work backwards:
- •Identify the first wrong signal at failure timestamp
- •Find all signals that directly drive it (combinational or registered)
- •Check if those signals are correct one cycle earlier
- •Repeat until you find where correct values become incorrect
Dependency Analysis
Map signal dependencies:
output_wrong [time=1250ns]
├─ driven by: alu_result (combinational)
│ ├─ operand_a (registered at 1249ns) ✓ correct
│ ├─ operand_b (registered at 1249ns) ✗ INCORRECT
│ └─ operation (registered at 1249ns) ✓ correct
└─ operand_b driven by: bypass_mux
├─ mem_result (registered at 1248ns) ✓ correct
├─ ex_result (registered at 1249ns) ✗ INCORRECT
└─ bypass_select ✗ WRONG MUX SELECT ← ROOT CAUSE
Differential Diagnosis
Compare working vs failing cases:
| Aspect | Working Case | Failing Case | Insight |
|---|---|---|---|
| Input pattern | 0x00000001 | 0x80000000 | MSB triggers bug |
| Execution path | State A→B→C | State A→B→D | Transition B→D buggy |
| Timing | No stalls | Pipeline stall | Stall logic incorrect |
Verification Techniques
Assertion-Based Isolation
Insert temporary assertions to partition the design:
// Check: Does problem occur before or after this pipeline stage?
property p_debug_stage2_input;
@(posedge clk) stage2_valid |-> stage2_input inside {[0:1000]};
endproperty
assert property (p_debug_stage2_input)
else $error("Problem exists at stage2 input");
Minimal Reproducer
Reduce test case to absolute minimum:
- •Start with failing test
- •Remove stimulus that doesn't affect failure
- •Shorten simulation time to just before failure
- •Remove unrelated RTL modules
- •Result: ~20 line testbench, ~50 line RTL
Benefits: Faster iteration, easier to share, clearer root cause
Force/Release Experiments
Test hypotheses by overriding signals:
// Hypothesis: Bug disappears if bypass is disabled
initial begin
#100ns;
force top.cpu.bypass_enable = 1'b0;
// Observe if problem still occurs
end
Caution: Only for debugging, never in production code
Coverage-Guided Debugging
Use coverage holes to identify untested scenarios:
covergroup cg_state_transitions @(posedge clk);
cp_current: coverpoint state;
cp_next: coverpoint state_next;
cross cp_current, cp_next; // Are all transitions covered?
endgroup
If bug occurs: Check if failing scenario corresponds to coverage hole
Common Pitfalls
Don't Trust Assumptions
❌ Wrong: "Signal X is always stable, so I won't check it" ✅ Right: Add assertion to verify assumption, then proceed
Don't Skip Symptom Observation
❌ Wrong: Jump straight to suspected module and start modifying ✅ Right: Observe exact failure in waveform, then form hypothesis
Don't Fix Symptoms
❌ Wrong: Add logic to mask the symptom without understanding root cause ✅ Right: Trace to root cause, fix it, verify symptom disappears
Don't Test Multiple Changes
❌ Wrong: Make 3 changes simultaneously, rerun simulation ✅ Right: Change one thing at a time, verify effect
Waveform Analysis Patterns
Cause → Effect Tracing
- •Find the symptom signal at failure timestamp
- •Look 1-2 cycles back for potential causes
- •Check if cause signals deviated from expected
- •Repeat backwards until finding the origin
Critical Path Analysis
Identify longest combinational path:
// Use $time in always_comb to detect long paths
always_comb begin
logic [31:0] temp1, temp2, temp3;
temp1 = input_a & input_b; // 1 gate delay
temp2 = temp1 | input_c; // 1 gate delay
temp3 = temp2 ^ input_d; // 1 gate delay
output_z = temp3 + input_e; // 1 gate delay
// Total: 4 gate delays - may violate timing
end
Clock Domain Crossing Detection
Look for signals crossing without proper synchronization:
Clock A domain: signal_a toggles at time 1250ns
Clock B domain: signal_b samples signal_a at 1251ns
↑ METASTABILITY RISK if clocks unrelated
Integration with Other Skills
- •dsim-debugging: Use when DSIM tool itself has issues (environment, waves, logs)
- •rtl-coding-standards: Apply when fixing identified bugs to maintain code quality
- •assertion-design: Create permanent assertions for bugs found during debugging
- •mcp-workflow: Use MCP commands to compile/run debug experiments quickly
Summary
RTL debugging is systematic reasoning:
- •Reproduce the problem reliably
- •Observe symptoms without assumptions
- •Generate hypotheses based on evidence
- •Test each hypothesis independently
- •Narrow down to single root cause
- •Verify fix and prevent regression
Key principle: Evidence over intuition. Always trace from observed symptoms to root cause using waveforms and assertions.