Debug: Root Cause Analysis
Purpose
Systematically identify and fix the root cause of defects, errors, or unexpected behavior using lean problem-solving methods. Stop fixing symptoms - find the true cause.
Mastery Levels (ShuHaRi)
Shu (守): Follow the 5 Whys method strictly; document each step.
Ha (破): Combine methods (Ishikawa + 5 Whys); adapt depth to problem complexity.
Ri (離): Develop domain-specific diagnostic patterns; teach others the methods.
Context
When to use:
- •Unexpected behavior or errors
- •Test failures with unclear cause
- •Integration issues
- •Performance problems
- •"It works on my machine" situations
When NOT to use:
- •Obvious typos or simple syntax errors
- •Well-documented known issues
- •Feature requests (use /rai-story-plan instead)
Inputs required:
- •Clear problem statement
- •Steps to reproduce (if available)
- •Error messages or symptoms
Output:
- •Root cause identified
- •Fix implemented or documented
- •Prevention measures (optional)
Methods Overview
| Method | Use When | Depth |
|---|---|---|
| 5 Whys | Single causal chain | Quick (5-10 min) |
| Ishikawa | Multiple possible causes | Medium (15-30 min) |
| Gemba | Need to observe actual behavior | Variable |
| A3 | Complex problems requiring documentation | Deep (30+ min) |
Steps
Step 1: Define the Problem (Genchi Genbutsu)
Go see the actual problem. Don't rely on descriptions.
- •Reproduce the issue - Can you make it happen?
- •Capture evidence - Error messages, logs, screenshots
- •Write problem statement - One sentence, specific, observable
Problem Statement Template:
WHAT is happening: [specific behavior]
WHEN it happens: [conditions/triggers]
WHERE it occurs: [location in code/system]
EXPECTED behavior: [what should happen]
Verification: Problem statement is specific and reproducible.
If you can't continue: Cannot reproduce → Gather more information, check logs.
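The template above can also be captured as a small data structure so that an incomplete statement is caught before the analysis starts. This is a minimal sketch, not part of the skill itself: the `ProblemStatement` class and its `missing_fields` check are hypothetical helpers introduced only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ProblemStatement:
    """Hypothetical container for the four template fields."""

    what: str      # specific behavior observed
    when: str      # conditions or triggers
    where: str     # location in code or system
    expected: str  # what should happen instead

    def missing_fields(self) -> list[str]:
        """Names of fields that are empty or still hold a [placeholder]."""
        missing = []
        for f in fields(self):
            value = getattr(self, f.name).strip()
            if not value or (value.startswith("[") and value.endswith("]")):
                missing.append(f.name)
        return missing

    def render(self) -> str:
        """Format the statement using the template's labels."""
        return (
            f"WHAT is happening: {self.what}\n"
            f"WHEN it happens: {self.when}\n"
            f"WHERE it occurs: {self.where}\n"
            f"EXPECTED behavior: {self.expected}"
        )
```

A statement whose `missing_fields()` list is non-empty is not yet specific enough to pass the verification for this step.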
Step 2: Apply 5 Whys
Ask "Why?" repeatedly, typically five times, to drill down to the root cause.
Rules:
- •Each answer must be factual, not speculative
- •Stay on one causal chain (don't branch)
- •Stop when you reach something you can change
- •"Human error" is never a root cause - ask why the error was possible
Template:
## 5 Whys Analysis

**Problem:** [Problem statement]

1. **Why?** [First-level cause]
   → Because: [Evidence/observation]
2. **Why?** [Second-level cause]
   → Because: [Evidence/observation]
3. **Why?** [Third-level cause]
   → Because: [Evidence/observation]
4. **Why?** [Fourth-level cause]
   → Because: [Evidence/observation]
5. **Why?** [Root cause]
   → Because: [Evidence/observation]

**Root Cause:** [Summary]
**Countermeasure:** [Fix]
Verification: Root cause is actionable and explains all symptoms.
If you can't continue: Chain branches → Use Ishikawa for multiple causes.
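The rules for this step lend themselves to light enforcement in code. The sketch below is one possible way to record a single causal chain, assuming hypothetical `Why` and `FiveWhys` classes that do not exist in the skill. It rejects answers without evidence and, more strictly than the rule requires, rejects "human error" at every level rather than only as the final cause.

```python
from dataclasses import dataclass, field


@dataclass
class Why:
    cause: str
    evidence: str  # the observation that backs the cause


@dataclass
class FiveWhys:
    """Hypothetical recorder for one causal chain (no branching)."""

    problem: str
    chain: list = field(default_factory=list)

    def ask(self, cause, evidence):
        """Append the next level of the chain, enforcing the step's rules."""
        if not evidence.strip():
            raise ValueError("Each answer must be factual: evidence is required")
        if "human error" in cause.lower():
            raise ValueError("Ask why the error was possible instead")
        self.chain.append(Why(cause, evidence))
        return self

    @property
    def root_cause(self):
        """The deepest cause recorded so far, or None for an empty chain."""
        return self.chain[-1].cause if self.chain else None

    def render(self):
        """Render the chain in the layout of the template above."""
        lines = ["## 5 Whys Analysis", f"**Problem:** {self.problem}"]
        for i, why in enumerate(self.chain, start=1):
            lines.append(f"{i}. **Why?** {why.cause}")
            lines.append(f"   → Because: {why.evidence}")
        lines.append(f"**Root Cause:** {self.root_cause}")
        return "\n".join(lines)
```

The chain is deliberately a flat list: if an answer needs two competing causes, that is the signal to switch to Ishikawa rather than to extend this structure.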
Step 3: Ishikawa Diagram (If Needed)
For problems with multiple potential causes, use the fishbone diagram.
Categories (6 M's for software):
                    ┌─── Method (process, algorithm)
                    │
                    ├─── Machine (hardware, infrastructure)
                    │
                    ├─── Material (data, inputs, dependencies)
PROBLEM ◄───────────┤
                    ├─── Measurement (metrics, monitoring)
                    │
                    ├─── Manpower (skills, knowledge gaps)
                    │
                    └─── Milieu (environment, configuration)
For each category, list potential causes:
## Ishikawa Analysis

**Problem:** [Problem statement]

### Method
- [ ] Algorithm logic error
- [ ] Missing edge case handling
- [ ] Incorrect sequence of operations

### Machine
- [ ] Resource constraints (memory, CPU)
- [ ] Platform-specific behavior
- [ ] Network issues

### Material
- [ ] Invalid input data
- [ ] Dependency version mismatch
- [ ] Missing configuration

### Measurement
- [ ] Inadequate error messages
- [ ] Missing logs at failure point
- [ ] Incorrect assertions in tests

### Manpower
- [ ] Documentation gap
- [ ] Unclear requirements
- [ ] Knowledge not shared

### Milieu
- [ ] Environment variable missing
- [ ] Different dev vs prod config
- [ ] File path differences

**Most Likely Causes:** [Top 2-3]
**Investigation Order:** [Priority]
Verification: At least 3 categories explored; most likely causes identified.
If you can't continue: All causes eliminated → Broaden investigation scope.
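The verification for this step (at least three categories explored, most likely causes ranked) can be expressed as a short check. The following is a sketch under stated assumptions: the function names are hypothetical, and the 1 to 5 likelihood score is an illustrative scale that the skill does not define.

```python
CATEGORIES = ("Method", "Machine", "Material", "Measurement", "Manpower", "Milieu")


def explored_categories(causes):
    """Categories that have at least one candidate cause listed."""
    return [c for c in CATEGORIES if causes.get(c)]


def is_sufficient(causes, minimum=3):
    """True when at least `minimum` of the six categories were explored."""
    unknown = set(causes) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"Unknown categories: {sorted(unknown)}")
    return len(explored_categories(causes)) >= minimum


def investigation_order(causes, top=3):
    """Rank candidate causes by an assumed likelihood score, highest first.

    `causes` maps a category to a list of (description, likelihood) pairs.
    Ties keep their listed order because Python's sort is stable.
    """
    flat = [
        (score, category, description)
        for category, items in causes.items()
        for description, score in items
    ]
    ranked = sorted(flat, key=lambda entry: -entry[0])
    return [(category, description) for _, category, description in ranked[:top]]
```

The ranked list maps directly onto the "Most Likely Causes" and "Investigation Order" lines of the template.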
Step 4: Investigate and Verify
Test hypotheses systematically.
- •Start with most likely cause
- •Design minimal test - How to confirm/refute?
- •Execute test - Gather evidence
- •Document result - Confirmed or eliminated?
Investigation Log:
## Investigation Log

| Hypothesis | Test | Result | Conclusion |
|------------|------|--------|------------|
| [Cause 1] | [How tested] | [What happened] | Confirmed/Eliminated |
| [Cause 2] | [How tested] | [What happened] | Confirmed/Eliminated |
Verification: Root cause confirmed with evidence.
If you can't continue: No hypothesis confirmed → Return to Step 3, add categories.
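The investigation loop above can be sketched as a function that runs one minimal check per hypothesis and records the outcome in the log format. The names `run_investigation` and `render_log` are hypothetical, and each check is assumed to be a zero-argument callable that returns `True` when the evidence confirms the hypothesis.

```python
def run_investigation(hypotheses):
    """Test hypotheses in the given order and return the log rows.

    `hypotheses` is a list of (cause, how_tested, check) tuples, ordered
    from most to least likely. Stopping at the first confirmed cause is a
    choice made for this sketch, not a rule of the method.
    """
    rows = []
    for cause, how_tested, check in hypotheses:
        confirmed = bool(check())
        conclusion = "Confirmed" if confirmed else "Eliminated"
        rows.append((cause, how_tested, f"check returned {confirmed}", conclusion))
        if confirmed:
            break
    return rows


def render_log(rows):
    """Render the rows as the Investigation Log table."""
    lines = [
        "| Hypothesis | Test | Result | Conclusion |",
        "|------------|------|--------|------------|",
    ]
    lines += [f"| {c} | {t} | {r} | {k} |" for c, t, r, k in rows]
    return "\n".join(lines)
```

If every row ends in "Eliminated", the log itself is the evidence to carry back into Step 3.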
Step 5: Implement Fix
Fix the root cause, not symptoms.
Fix Checklist:
- [ ] Fix addresses root cause (from Step 2/4)
- [ ] Fix doesn't introduce new issues
- [ ] Original problem no longer reproduces
- [ ] Related edge cases considered
Verification: Problem no longer occurs; tests pass.
If you can't continue: Fix incomplete → Document partial fix, create follow-up task.
Step 6: Prevent Recurrence (Optional)
For significant issues, add prevention measures.
Prevention Options:
- •Test: Add regression test covering this case
- •Validation: Add input validation or type check
- •Documentation: Document gotcha or common mistake
- •Tooling: Add lint rule or pre-commit hook
- •Process: Update checklist or review criteria
Verification: Prevention measure in place.
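As an illustration of the "Test" and "Validation" options, the sketch below pairs a fixed function with a regression test that pins the original failure. Everything here is hypothetical: `chunk` stands in for whatever code was repaired, and the bug it illustrates (silently dropping the final partial chunk) is an invented example.

```python
import unittest


def chunk(items, size):
    """Split `items` into lists of at most `size` elements.

    Hypothetical function under repair: the imagined defect dropped the
    last chunk when len(items) was not a multiple of `size`.
    """
    if size <= 0:
        # Validation added as a prevention measure.
        raise ValueError("size must be positive")
    return [items[i:i + size] for i in range(0, len(items), size)]


class ChunkRegressionTest(unittest.TestCase):
    def test_partial_last_chunk_is_kept(self):
        # Pins the exact input that reproduced the original problem.
        self.assertEqual(chunk([1, 2, 3, 4, 5], 2), [[1, 2], [3, 4], [5]])

    def test_non_positive_size_is_rejected(self):
        with self.assertRaises(ValueError):
            chunk([1, 2, 3], 0)
```

Saved as a test module, this can be run with `python -m unittest`. The test should fail against the pre-fix code, which confirms that it actually guards against the defect.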
Output
- •Artifact: work/debug/{issue-name}/analysis.md (optional, for complex issues)
- •Fix: Implemented code changes
- •Prevention: Test, validation, or documentation (optional)
Quick Reference
5 Whys Shortcuts
| Symptom Pattern | Common Root Causes |
|---|---|
| "Works locally, fails in CI" | Environment config, path differences, missing deps |
| "Intermittent failure" | Race condition, external dependency, resource limit |
| "Broke after update" | Dependency change, API breaking change, config drift |
| "Only fails with certain data" | Edge case, encoding issue, type coercion |
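For the "Works locally, fails in CI" pattern, comparing environment snapshots is often a quick first test. The sketch below assumes hypothetical `snapshot` and `diff_snapshots` helpers. It uses only the standard library, and the environment variables and package names to inspect are supplied by the caller.

```python
import os
import platform
from importlib import metadata


def snapshot(env_keys=("PATH",), packages=()):
    """Collect a flat dict describing the current environment."""
    snap = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for key in env_keys:
        snap[f"env:{key}"] = os.environ.get(key, "<unset>")
    for name in packages:
        try:
            snap[f"pkg:{name}"] = metadata.version(name)
        except metadata.PackageNotFoundError:
            snap[f"pkg:{name}"] = "<not installed>"
    return snap


def diff_snapshots(local, remote):
    """Return {key: (local_value, remote_value)} for every differing key."""
    keys = sorted(set(local) | set(remote))
    return {
        key: (local.get(key, "<missing>"), remote.get(key, "<missing>"))
        for key in keys
        if local.get(key) != remote.get(key)
    }
```

One way to use it is to print `snapshot(...)` as JSON on both machines and feed the two dicts to `diff_snapshots`. Each differing key is then a candidate hypothesis for the investigation log.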
Red Flags (Stop and Think)
- •"Let me just try this..." → Stop. Formulate hypothesis first.
- •"It's probably X" → Test it. Don't assume.
- •"I'll fix it later" → Jidoka. Fix now or document why not.
- •"Nobody knows why it works" → Gemba. Understand before changing.
Notes
Jidoka Integration
This skill implements the Jidoka principle: stop and fix quality issues immediately.
When you detect a defect during any skill:
- •Stop the current work
- •Invoke /rai-debug to find root cause
- •Fix the root cause
- •Resume original work
Time Boxing
| Problem Complexity | Max Investigation Time |
|---|---|
| Simple (single component) | 15 minutes |
| Medium (multiple components) | 30 minutes |
| Complex (system-wide) | 60 minutes |
If exceeding time box, document findings and escalate.
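The limits in the table can be tracked with a small timer. This is a minimal sketch with a hypothetical `TimeBox` class. The clock is injectable so the behavior can be checked without waiting, and the complexity labels are shortened forms of the table rows.

```python
import time

# Limits from the table above, in minutes.
TIME_BOXES_MIN = {"simple": 15, "medium": 30, "complex": 60}


class TimeBox:
    """Hypothetical timer that reports when an investigation runs over."""

    def __init__(self, complexity, clock=time.monotonic):
        self.limit_s = TIME_BOXES_MIN[complexity] * 60
        self._clock = clock
        self._start = clock()

    def remaining_s(self):
        """Seconds left in the box; negative once the limit has passed."""
        return self.limit_s - (self._clock() - self._start)

    def exceeded(self):
        """True when it is time to document findings and escalate."""
        return self.remaining_s() <= 0
```

Checking `exceeded()` between hypotheses in Step 4 is one natural place to apply it.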
References
- •5 Whys: Taiichi Ohno, Toyota Production System
- •Ishikawa Diagram: Kaoru Ishikawa, "Guide to Quality Control"
- •Gemba: "The actual place" - go and observe where the work actually happens
- •A3 Thinking: Toyota problem-solving methodology
- •Jidoka: Automation with human intelligence - stop on defects