Root Cause Analysis
Systematic approaches for identifying the true source of problems, not just symptoms.
RCA Methods Overview
| Method | Best For | Complexity | Time |
|---|---|---|---|
| 5 Whys | Simple, linear problems | Low | 15-30 min |
| Fishbone | Multi-factor problems | Medium | 30-60 min |
| Fault Tree | Critical systems, safety | High | 1-4 hours |
| Timeline Analysis | Incident investigation | Medium | 30-90 min |
5 Whys Method
Iteratively ask "why" to drill down from symptom to root cause.
Process
code
Problem Statement: [Clear description of the issue]
│
▼
Why #1: [First level cause]
│
▼
Why #2: [Deeper cause]
│
▼
Why #3: [Even deeper]
│
▼
Why #4: [Getting to root]
│
▼
Why #5: [Root cause identified]
│
▼
Action: [Fix that addresses root cause]
Example: Production Outage
markdown
**Problem:** Website was down for 2 hours **Why 1:** Why was the website down? → The application server ran out of memory and crashed. **Why 2:** Why did the server run out of memory? → A memory leak in the image processing service accumulated over time. **Why 3:** Why was there a memory leak? → The service wasn't releasing image buffers after processing. **Why 4:** Why weren't buffers being released? → The cleanup code had a bug introduced in last week's release. **Why 5:** Why wasn't the bug caught before release? → We don't have automated memory leak detection in our test suite. **Root Cause:** Missing automated memory leak testing **Action:** Add memory profiling to CI pipeline, add cleanup tests
5 Whys Best Practices
| Do | Don't |
|---|---|
| Base answers on evidence | Guess or assume |
| Stay focused on one causal chain | Branch too early |
| Keep asking until actionable | Stop at symptoms |
| Involve people closest to issue | Assign blame |
| Document your reasoning | Skip steps |
When 5 Whys Falls Short
- •Multiple contributing factors (use Fishbone)
- •Complex system interactions (use Fault Tree)
- •Organizational/process issues (need broader analysis)
Fishbone Diagram (Ishikawa)
Visualize multiple potential causes organized by category.
Standard Categories (6 M's)
code
┌─────────────┐
Methods ────┤ │
│ │
Machines ─────┤ │
│ ├──── PROBLEM
Materials ─────┤ │
│ │
Measurement ────┤ │
│ │
Environment ────┤ │
│ │
People ──────┤ │
└─────────────┘
Software-Specific Categories
code
┌─────────────┐
Code ─────┤ │
│ │
Infrastructure ────┤ │
│ ├──── BUG/INCIDENT
Dependencies ────┤ │
│ │
Configuration ───┤ │
│ │
Process ────┤ │
│ │
People ─────┤ │
└─────────────┘
Fishbone Example: API Latency Spike
code
┌─────────────────┐
│ │
Code ─────────────────┤ │
│ │ │
├─ N+1 query issue │ │
├─ Missing index │ API LATENCY │
└─ Sync blocking call│ SPIKE │
│ │
Infrastructure ─────────────┤ │
│ │ │
├─ DB connection pool│ │
├─ Network saturation│ │
└─ Insufficient RAM │ │
│ │
Dependencies ───────────────┤ │
│ │ │
├─ External API slow │ │
├─ Redis timeout │ │
└─ CDN cache miss │ │
└─────────────────┘
Fishbone Process
- •Define the problem clearly (the fish head)
- •Identify major categories (the bones)
- •Brainstorm causes for each category
- •Analyze relationships between causes
- •Prioritize most likely root causes
- •Verify with data/testing
- •Take action on confirmed causes
Fault Tree Analysis (FTA)
Top-down, deductive analysis for critical systems.
FTA Symbols
code
┌─────┐ │ TOP │ Top Event (the failure being analyzed) └──┬──┘ │ ┌──┴──┐ │ AND │ All inputs must occur for output └─────┘ ┌──┴──┐ │ OR │ Any input causes output └─────┘ ┌─────┐ │ ○ │ Basic Event (root cause) └─────┘ ┌─────┐ │ ◇ │ Undeveloped Event (needs more analysis) └─────┘
FTA Example: Authentication Failure
code
┌────────────────────┐
│ USER CANNOT │
│ AUTHENTICATE │
└─────────┬──────────┘
│
┌───┴───┐
│ OR │
└───┬───┘
┌──────────────────┼──────────────────┐
│ │ │
┌──────┴──────┐ ┌──────┴──────┐ ┌──────┴──────┐
│ Invalid │ │ Auth │ │ Account │
│ Credentials│ │ Service │ │ Locked │
│ │ │ Down │ │ │
└──────┬──────┘ └──────┬──────┘ └─────────────┘
│ │
┌───┴───┐ ┌───┴───┐
│ OR │ │ OR │
└───┬───┘ └───┬───┘
┌──────┼──────┐ ┌──────┼──────┐
│ │ │ │ │ │
○ ○ ○ ○ ○ ◇
Wrong Expired Token DB Redis External
Password Token Invalid Down Down Auth
When to Use FTA
- •Safety-critical systems
- •Complex failure modes
- •Need to identify all paths to failure
- •Regulatory compliance requirements
- •Post-incident analysis for serious outages
Timeline Analysis
Reconstruct sequence of events to identify causation.
Timeline Template
markdown
## Incident Timeline: [Incident Name] ### Summary - **Incident Start:** [Timestamp] - **Incident Detected:** [Timestamp] - **Incident Resolved:** [Timestamp] - **Total Duration:** [X hours Y minutes] - **Time to Detect:** [X minutes] - **Time to Resolve:** [X hours Y minutes] ### Detailed Timeline | Time (UTC) | Event | Source | Actor | |------------|-------|--------|-------| | 14:00 | Deployment started | CI/CD | automated | | 14:05 | Deployment completed | CI/CD | automated | | 14:15 | Error rate increased 10x | Monitoring | - | | 14:22 | Alert fired | PagerDuty | - | | 14:25 | On-call acknowledged | PagerDuty | @alice | | 14:30 | Root cause identified | Investigation | @alice | | 14:35 | Rollback initiated | Manual | @alice | | 14:40 | Services recovered | Monitoring | - | | 14:45 | Incident resolved | Manual | @alice | ### Analysis **Contributing Factors:** 1. [Factor 1] 2. [Factor 2] **What Went Well:** 1. [Positive observation] **What Could Improve:** 1. [Improvement area] ### Action Items | Action | Owner | Due Date | Status | |--------|-------|----------|--------| | | | | |
Debugging Decision Tree
code
Problem Reported
│
▼
Can you reproduce it?
│ │
Yes No
│ │
▼ ▼
Isolate the Gather more
conditions information
│ │
▼ ▼
Recent changes? Check logs,
│ monitoring
Yes │
│ │
▼ ▼
Review diffs Correlation
& deploys analysis
│ │
└─────┬─────┘
│
▼
Form hypothesis
│
▼
Test hypothesis
│
┌─────┴─────┐
│ │
Confirmed Rejected
│ │
▼ ▼
Fix and Next hypothesis
verify
RCA Documentation Template
markdown
## Root Cause Analysis: [Issue Title] ### Issue Summary **Reported:** [Date] **Severity:** P0 / P1 / P2 / P3 **Impact:** [Description of impact] ### Problem Statement [Clear, specific description of what went wrong] ### Investigation #### Timeline [Key events in sequence] #### Analysis Method Used [ ] 5 Whys [ ] Fishbone [ ] Fault Tree [ ] Timeline Analysis #### Findings [Detailed analysis results] ### Root Cause(s) 1. **Primary:** [Main root cause] 2. **Contributing:** [Secondary factors] ### Immediate Fix [What was done to resolve the immediate issue] ### Preventive Actions | Action | Owner | Due | Status | |--------|-------|-----|--------| | | | | | ### Lessons Learned 1. [Key takeaway] 2. [Process improvement] ### Appendix - [Links to logs, graphs, related tickets]
2026 Best Practices
- •Blameless postmortems: Focus on systems, not individuals
- •Automated correlation: Use AI to correlate signals across systems
- •Proactive RCA: Analyze near-misses, not just incidents
- •Knowledge sharing: Document and share RCA findings
- •Metrics-driven: Track time-to-detect, time-to-resolve trends
Related Skills
- •
observability-monitoring- Gathering data for RCA - •
errors- Error pattern analysis - •
resilience-patterns- Preventing future incidents
References
Version: 1.0.0 (January 2026)