Systematic Debugging for Breenix
Document-driven debugging workflow for kernel issues.
Purpose
Complex kernel bugs require systematic investigation and documentation. This skill provides the pattern used in Breenix debugging docs like TIMER_INTERRUPT_INVESTIGATION.md, DIRECT_EXECUTION_FIX.md, and PAGE_TABLE_FIX.md.
The Four-Phase Pattern
All debugging documents follow this structure:
- •Problem: What's broken? Observable symptoms
- •Root Cause: Why is it broken? Deep analysis
- •Solution: What fixes it? Implementation details
- •Evidence: How do you know it's fixed? Before/after proof
When to Use
- •Complex kernel issues: Not simple typos or obvious bugs
- •Architectural problems: Issues requiring design changes
- •Recurring failures: Problems that reappear or are hard to reproduce
- •Learning opportunities: Bugs that teach important lessons
- •CI investigations: Failed tests requiring deep analysis
Debugging Workflow
Phase 1: Problem Definition
Document observable symptoms:
# Problem Summary
[Brief description of what's failing]

## Symptoms
- What fails? (test, boot, specific operation)
- When does it fail? (always, intermittently, specific conditions)
- Error messages or behavior observed
- What works vs what doesn't
Example from DIRECT_EXECUTION_FIX.md:
# Problem Summary
Direct userspace execution was failing with a double fault at `int 0x80` instruction (`0x10000019`).

## Symptoms
- Userspace processes boot successfully
- Calling int 0x80 triggers double fault
- Error occurs during Ring 3 → Ring 0 transition
Phase 2: Root Cause Analysis
Investigate systematically:
- •Reproduce consistently

```bash
# Use kernel-debug-loop for fast iteration
kernel-debug-loop/scripts/quick_debug.py --signal "FAILURE_POINT" --timeout 10
```

- •Add diagnostic logging

```rust
log::debug!("About to perform operation X");
log::debug!("Variable state: {:?}", state);
log::debug!("After operation X");
```

- •Narrow down location
- •Binary search: Add checkpoint in middle of suspect code
- •If reached: problem is after
- •If not reached: problem is before
- •Repeat until isolated
- •Analyze state
- •What values do the variables hold?
- •What should they be?
- •What assumptions are violated?
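The "narrow down location" step above can be sketched as a bisection. This is a minimal illustration, not Breenix code: `bisect_failure` and the `reached` probe are hypothetical names, and the sketch assumes a single failure point, so every checkpoint before it is reached and none at or after it is. In practice each call to `reached` stands for one manual cycle: place a checkpoint, rerun, and look for it in the log.

```rust
/// Narrow a failure to a single step by bisection.
///
/// `steps` is the number of candidate locations (for example, statements in
/// the suspect function). `reached(i)` is a hypothetical probe: place a
/// checkpoint just after step `i`, rerun, and report whether the checkpoint
/// appeared in the log.
///
/// Returns the index of the first step whose checkpoint is not reached, or
/// `steps` if every checkpoint is reached.
fn bisect_failure(steps: usize, mut reached: impl FnMut(usize) -> bool) -> usize {
    // Invariant: the first unreached step lies in the range [lo, hi].
    let (mut lo, mut hi) = (0, steps);
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if reached(mid) {
            lo = mid + 1; // checkpoint hit: the problem is after `mid`
        } else {
            hi = mid; // checkpoint missed: the problem is at or before `mid`
        }
    }
    lo
}
```

Each probe halves the suspect range, so the number of reruns grows with the logarithm of the number of candidate locations rather than linearly.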
Document findings:
## Root Cause Analysis

1. **Sequence of Events**:
   - Step 1 happens
   - Step 2 happens
   - Step 3 fails because X

2. **Technical Details**:
   - Specific memory addresses, registers, flags
   - Code paths taken
   - Assumptions violated

3. **Why It Happens**:
   - Fundamental reason for the failure
   - What design assumption was wrong
Phase 3: Solution Implementation
Document the fix:
## Solution

### 1. [Component] Fix
**File**: `path/to/file.rs`
**Lines**: X-Y

[Explanation of what changed and why]

```rust
// Code snippet showing the fix
```

### 2. [Another Component] Fix
**File**: `path/to/another/file.rs`
**Lines**: X-Y

[Explanation]
**Example structure:**
- Identify all files that need changes
- For each change:
  - File path
  - Line numbers
  - Explanation of change
  - Code snippet
  - Rationale

### Phase 4: Evidence Collection

**Prove it works:**

```markdown
## Evidence

### Before Fix:
[Log output or error messages showing failure]

### After Fix:
[Log output showing success]

### Test Results:
- Test X: PASS
- Test Y: PASS
- Feature Z: Working as expected
```
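The before/after comparison can also be checked mechanically. The following is a minimal sketch under stated assumptions: `Evidence` and `compare_logs` are hypothetical names, the logs are plain strings already captured from a run, and the patterns are simple substrings rather than whatever query syntax the project's log tools accept.

```rust
/// Outcome of comparing a pre-fix log against a post-fix log.
#[derive(Debug, PartialEq)]
struct Evidence {
    /// The error pattern appears in the "before" log.
    failed_before: bool,
    /// The error pattern no longer appears in the "after" log.
    error_gone: bool,
    /// The success pattern appears in the "after" log.
    succeeds_after: bool,
}

impl Evidence {
    /// The evidence is convincing only when all three observations hold.
    fn is_conclusive(&self) -> bool {
        self.failed_before && self.error_gone && self.succeeds_after
    }
}

/// Compare two captured logs using plain substring matching.
fn compare_logs(
    before: &str,
    after: &str,
    error_pattern: &str,
    success_pattern: &str,
) -> Evidence {
    Evidence {
        failed_before: before.contains(error_pattern),
        error_gone: !after.contains(error_pattern),
        succeeds_after: after.contains(success_pattern),
    }
}
```

Checking `failed_before` matters as much as the other two fields: if the "before" log never showed the error, the "after" log proves nothing about the fix.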
Integration with Tools
With kernel-debug-loop
Fast iteration during investigation:
```bash
# Test hypothesis quickly
kernel-debug-loop/scripts/quick_debug.py \
  --signal "CHECKPOINT_AFTER_FIX" \
  --timeout 15
```
With log-analysis
Extract evidence from logs:
```bash
# Find before/after comparison
echo '"Error pattern"' > /tmp/log-query.txt
./scripts/find-in-logs
echo '"Success pattern"' > /tmp/log-query.txt
./scripts/find-in-logs
```
With ci-failure-analysis
Analyze CI test failures:
```bash
ci-failure-analysis/scripts/analyze_ci_failure.py \
  --context target/xtask_*_output.txt
```
Debug Document Template
# [Issue Name] Fix

Date: [YYYY-MM-DD]

## Problem Summary
[What's broken - one paragraph]

## Symptoms
- Symptom 1
- Symptom 2
- Error messages or behavior

## Root Cause Analysis

### Sequence of Events
1. [Step by step what happens]
2. [Leading to failure]

### Technical Details
- Memory addresses, registers, etc.
- Code paths taken
- State at time of failure

### Why It Happens
[Fundamental explanation]

## Solution

### 1. [First Change]
**File**: `path/to/file.rs`
**Lines**: X-Y

[Explanation]

```rust
// Code change
```

### 2. [Second Change]
**File**: `path/to/file2.rs`
**Lines**: X-Y

[Explanation]
## Evidence

### Before Fix:
[Error output]

### After Fix:
[Success output]

## Lessons Learned
- [Key insight 1]
- [Key insight 2]
- [Patterns to apply in future]

## Related Issues
- [Link to similar past bugs]
- [Related design decisions]
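A document written from the template above can be checked for the four phases before it is committed. This is a minimal sketch, not an existing Breenix tool: `REQUIRED_SECTIONS` and `missing_sections` are hypothetical names, and the check assumes each phase appears as a level-two Markdown heading with exactly the titles used in the template.

```rust
/// Headings the four-phase pattern expects (assumed to be level-two
/// Markdown headings, as in the template).
const REQUIRED_SECTIONS: [&str; 4] = [
    "## Problem Summary",
    "## Root Cause Analysis",
    "## Solution",
    "## Evidence",
];

/// Returns the required headings that do not appear as a line of `doc`,
/// in the order the pattern lists them.
fn missing_sections(doc: &str) -> Vec<&'static str> {
    REQUIRED_SECTIONS
        .iter()
        .copied()
        .filter(|heading| !doc.lines().any(|line| line.trim_end() == *heading))
        .collect()
}
```

An empty result only means the headings exist; it says nothing about whether the root cause analysis or the evidence under them is adequate.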
## Example: Real Debugging Session

Based on TIMER_INTERRUPT_INVESTIGATION.md:

**Problem**: Kernel hanging after enabling interrupts

**Investigation**:
1. Compare with other OS implementations (blog_os, xv6, Linux)
2. Identify what they do (minimal timer handlers)
3. Identify what Breenix does (complex timer handler with locks)
4. Hypothesis: Timer handler too complex

**Solution**: Create simple_timer.rs with minimal handler

**Evidence**:
- Before: Kernel hangs immediately
- After: Kernel boots and reaches testing menu

**Lesson**: Interrupt handlers must be TRULY minimal

## Best Practices

1. **Document as you debug**: Don't wait until after
2. **Include evidence**: Logs, test results, screenshots
3. **Explain reasoning**: Why you investigated X, not Y
4. **Note dead ends**: What you tried that didn't work
5. **Extract lessons**: What to remember for next time
6. **Update related docs**: If this reveals design issues
7. **Create regression tests**: Prevent this bug from returning

## When to Create a Debug Document

Create a document when:
- Bug took >2 hours to solve
- Solution required design changes
- Bug could reoccur without understanding
- Lessons applicable to future development
- Multiple components involved
- Fix not immediately obvious from code change

## Summary

Systematic debugging follows:
1. Problem - Clear symptom description
2. Root Cause - Deep technical analysis
3. Solution - Implementation with rationale
4. Evidence - Before/after proof

This pattern ensures:
- Thorough understanding
- Proper fixes (not workarounds)
- Knowledge preservation
- Prevention of similar bugs
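To make the lesson from the timer example session concrete, here is a minimal sketch of what a "truly minimal" handler body might look like. It is not the contents of simple_timer.rs: `TICKS`, `on_timer_tick`, and `ticks_elapsed` are hypothetical names, and a real handler also has to acknowledge the interrupt controller, which is omitted here.

```rust
use core::sync::atomic::{AtomicU64, Ordering};

/// Tick counter shared between the interrupt path and normal kernel code.
static TICKS: AtomicU64 = AtomicU64::new(0);

/// Hypothetical handler body: a single atomic increment.
/// No locks, no allocation, no logging.
fn on_timer_tick() {
    TICKS.fetch_add(1, Ordering::Relaxed);
}

/// Read the counter from normal (non-interrupt) context, where taking locks
/// and doing heavier bookkeeping based on the value is safe.
fn ticks_elapsed() -> u64 {
    TICKS.load(Ordering::Relaxed)
}
```

The design choice is to defer everything except the increment: code that needs locks reads the counter later from ordinary context, so the handler can never block on a lock held by the code it interrupted.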