Root Cause Tracing
When to Use This Skill
- Investigating production errors
- Debugging complex multi-step failures
- Analyzing error chains and cascading failures
- Understanding why a specific state occurred
- Post-mortem analysis of incidents
Tracing Methodology
1. Start from the Symptom
markdown
## Error Chain Template

SYMPTOM: [What the user/system reported]
    ↓
IMMEDIATE CAUSE: [Direct technical cause]
    ↓
CONTRIBUTING FACTOR: [What enabled the immediate cause]
    ↓
ROOT CAUSE: [The fundamental issue to fix]
Example Trace
markdown
SYMPTOM: User sees "500 Internal Server Error"
    ↓
IMMEDIATE CAUSE: Unhandled null pointer exception in UserService.getProfile()
    ↓
CONTRIBUTING FACTOR: Database returned null for user that should exist
    ↓
ROOT CAUSE: Race condition during user registration - DB write not committed before redirect
Trace Techniques
Stack Trace Analysis
python
# Given this stack trace:
# Traceback (most recent call last):
#   File "api/handlers.py", line 45, in get_user
#     profile = user_service.get_profile(user_id)
#   File "services/user.py", line 23, in get_profile
#     return self.repo.find(user_id).to_dict()
#   File "models/user.py", line 67, in to_dict
#     'email': self.email.lower()
# AttributeError: 'NoneType' object has no attribute 'lower'

# Trace backwards:
# 1. self.email is None (immediate cause)
# 2. User model was created without email validation
# 3. API endpoint doesn't validate email before save
# 4. ROOT CAUSE: Missing input validation
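The backwards walk can also be done programmatically. The following is a minimal sketch using the standard `traceback` module; the `User` class and `get_profile` function are hypothetical stand-ins for the code in the trace above, not the real application.

```python
import traceback


class User:
    """Hypothetical stand-in for the model in the trace above."""

    def __init__(self, email=None):
        self.email = email

    def to_dict(self):
        return {"email": self.email.lower()}


def get_profile(user):
    return user.to_dict()


def frames_innermost_first(exc):
    """Return (function name, source line) pairs, starting at the failing frame."""
    frames = traceback.extract_tb(exc.__traceback__)
    return [(frame.name, frame.line) for frame in reversed(frames)]


try:
    get_profile(User(email=None))
except AttributeError as exc:
    trace_steps = frames_innermost_first(exc)
    for depth, (function, line) in enumerate(trace_steps, start=1):
        print(f"{depth}. {function}: {line}")
```

Reading the frames innermost first mirrors the manual technique: start where the exception was raised, then ask at each caller why the bad value was allowed through.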
Log Correlation
bash
# Find related logs by request ID
grep "req_abc123" /var/log/app/*.log | sort -t: -k2

# Timeline reconstruction
grep -h "2024-01-15T10:3" error.log access.log | sort

# Find first occurrence of error pattern
grep -n "NullPointerException" app.log | head -1
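When the logs are already loaded into a script, the same correlation can be sketched in Python. The log lines and their "timestamp level request_id message" layout below are assumptions made for illustration; real log formats will differ.

```python
from datetime import datetime

# Hypothetical log lines; the field layout is an assumption.
LOG_LINES = [
    "2024-01-15T10:30:02 ERROR req_abc123 NullPointerException in get_profile",
    "2024-01-15T10:30:00 INFO req_abc123 POST /api/register",
    "2024-01-15T10:30:01 INFO req_zzz999 GET /health",
    "2024-01-15T10:30:01 INFO req_abc123 redirect to /profile",
]


def timeline_for(request_id, lines):
    """Return the lines mentioning request_id, ordered by their leading timestamp."""
    matching = [line for line in lines if request_id in line.split()]
    return sorted(matching, key=lambda line: datetime.fromisoformat(line.split()[0]))


timeline = timeline_for("req_abc123", LOG_LINES)
for entry in timeline:
    print(entry)
```

Sorting on the parsed timestamp rather than the raw string keeps the ordering correct if the logs ever mix timestamp precisions.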
State Inspection
python
# Add trace points to understand state flow
import logging

logger = logging.getLogger(__name__)


def process_order(order):
    logger.debug(f"[TRACE] Input state: {order.__dict__}")

    validated = validate_order(order)
    logger.debug(f"[TRACE] After validation: {validated.__dict__}")

    calculated = calculate_totals(validated)
    logger.debug(f"[TRACE] After calculation: {calculated.__dict__}")

    saved = save_order(calculated)
    logger.debug(f"[TRACE] After save: {saved.__dict__}")

    return saved
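To avoid hand-writing a trace line around every step, the trace points can be factored into a decorator. This is a minimal sketch, assuming each step takes one state argument and returns the new state; `validate_order` and `calculate_totals` here are hypothetical dictionary-based versions of the steps above.

```python
import functools
import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("trace")


def traced(func):
    """Log the argument and return value of a single-argument pipeline step."""

    @functools.wraps(func)
    def wrapper(state):
        logger.debug("[TRACE] %s input: %r", func.__name__, state)
        result = func(state)
        logger.debug("[TRACE] %s output: %r", func.__name__, result)
        return result

    return wrapper


@traced
def validate_order(order):  # hypothetical step
    return {**order, "valid": order["quantity"] > 0}


@traced
def calculate_totals(order):  # hypothetical step
    return {**order, "total": order["quantity"] * order["unit_price"]}


order = calculate_totals(validate_order({"quantity": 2, "unit_price": 5}))
```

Removing the tracing later is then a matter of deleting the `@traced` lines rather than editing each function body.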
Debugging Patterns
Binary Search Debugging
python
# When you have a long process that fails somewhere
def long_process(data):
    # Add checkpoint
    step1_result = step1(data)
    print(f"CHECKPOINT 1: {step1_result is not None}")  # Pass

    step2_result = step2(step1_result)
    print(f"CHECKPOINT 2: {step2_result is not None}")  # Pass

    step3_result = step3(step2_result)
    print(f"CHECKPOINT 3: {step3_result is not None}")  # FAIL - narrow down here

    step4_result = step4(step3_result)
    # ...
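The same halving idea applies to input data: when a batch fails, split it until the offending record is isolated. The sketch below assumes exactly one bad record and that any batch containing it fails; `batch_fails` is a hypothetical check standing in for the real process.

```python
def find_failing_item(items, batch_fails):
    """Binary-search for the single item that makes batch_fails(items) true.

    Assumes exactly one bad item and that any batch containing it fails.
    """
    low, high = 0, len(items)
    while high - low > 1:
        mid = (low + high) // 2
        if batch_fails(items[low:mid]):
            high = mid  # the bad item is in the first half
        else:
            low = mid  # the bad item is in the second half
    return items[low]


def batch_fails(batch):
    # Hypothetical failure check: a record without an email breaks the batch.
    return any(record.get("email") is None for record in batch)


records = [{"id": i, "email": f"user{i}@example.com"} for i in range(16)]
records[11]["email"] = None

culprit = find_failing_item(records, batch_fails)
print(culprit)
```

With 16 records this takes four checks instead of up to sixteen, which matters when each check is a slow end-to-end run.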
Delta Debugging
bash
# Find which commit introduced a bug
git bisect start
git bisect bad HEAD
git bisect good v1.0.0

# Git will binary search through commits
# Mark each as good/bad until root cause commit is found
# Afterwards, run `git bisect reset` to return to the original HEAD
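Marking commits by hand can be automated with `git bisect run`, which decides good or bad from a command's exit code. The script below is a sketch under assumptions: the file name `bisect_check.py`, the `compileall` build check, and the test command passed on the command line are all hypothetical and should be replaced with the project's own.

```python
import subprocess
import sys

# Exit codes understood by `git bisect run`: 0 marks the commit good,
# 125 asks git to skip it, and other codes from 1 to 127 mark it bad.
GOOD, BAD, SKIP = 0, 1, 125


def bisect_exit_code(build_ok, tests_ok):
    """Map the outcome of checking one commit to a `git bisect run` exit code."""
    if not build_ok:
        return SKIP  # a commit that does not build cannot be judged
    return GOOD if tests_ok else BAD


def command_succeeds(command):
    """Run a command and report whether it exited with status 0."""
    return subprocess.run(command).returncode == 0


def main(test_command):
    # Hypothetical build check: byte-compile every Python file in the tree.
    build_ok = command_succeeds([sys.executable, "-m", "compileall", "-q", "."])
    tests_ok = build_ok and command_succeeds(test_command)
    return bisect_exit_code(build_ok, tests_ok)


# Only runs when a test command is supplied, for example:
#   git bisect run python bisect_check.py python -m pytest tests/test_orders.py
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1:]))
```

Returning 125 for unbuildable commits keeps a broken intermediate commit from being blamed for the bug under investigation.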
Rubber Duck Tracing
markdown
## Explain the flow out loud:

1. User clicks "Submit Order"
2. Frontend sends POST to /api/orders
3. Backend validates the payload...
   WAIT - Does it validate the discount code?
        - What if discount code is empty string vs null?
4. Found it: Empty string "" passes validation but fails lookup
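One possible fix for the bug found in this walk-through is to normalize the code before it is validated or looked up. This is a sketch only; the `DISCOUNTS` table and both function names are hypothetical.

```python
DISCOUNTS = {"SAVE10": 0.10}  # hypothetical lookup table


def normalize_code(raw):
    """Treat None, "" and whitespace-only input as 'no discount code'."""
    if raw is None:
        return None
    stripped = raw.strip()
    return stripped or None


def discount_rate(raw_code):
    code = normalize_code(raw_code)
    if code is None:
        return 0.0  # no code supplied, so skip the lookup entirely
    if code not in DISCOUNTS:
        raise ValueError(f"unknown discount code: {code!r}")
    return DISCOUNTS[code]


for candidate in (None, "", "  ", "SAVE10"):
    print(repr(candidate), discount_rate(candidate))
```

Collapsing the "empty" variants into a single representation at the boundary means later code only has to handle one case.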
Error Pattern Recognition
Null/Undefined Errors
markdown
SYMPTOM: Cannot read property 'X' of null/undefined

TRACE QUESTIONS:
1. What variable is null?
2. Where was it supposed to be set?
3. What condition would leave it unset?
4. Is there a race condition?
5. Is there a missing await/callback?

COMMON ROOT CAUSES:
- Async operation not awaited
- Conditional initialization with edge case
- Object destructuring with missing keys
- Database query returning no results
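For the "database query returning no results" case, one way to shorten the trace is to fail at the point where the missing value originates rather than where it is first dereferenced. A minimal Python sketch follows, with a hypothetical in-memory `USERS` dictionary standing in for the database.

```python
from typing import Optional


class UserNotFound(LookupError):
    """Raised where the missing value originates, not where it is first used."""


USERS = {1: {"email": "Ada@Example.com"}}  # hypothetical stand-in for a database


def find_user(user_id: int) -> Optional[dict]:
    return USERS.get(user_id)  # None when the query matches nothing


def get_profile(user_id: int) -> dict:
    user = find_user(user_id)
    if user is None:
        # Names the missing id directly instead of a later AttributeError/TypeError.
        raise UserNotFound(f"no user with id {user_id}")
    return {"email": user["email"].lower()}
```

The error message then answers trace questions 1 and 2 directly: which value was missing and where it should have come from.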
Race Conditions
markdown
SYMPTOM: Intermittent failures, works on retry

TRACE QUESTIONS:
1. Are there multiple async operations?
2. Is there shared state?
3. Are there assumptions about order of execution?
4. Are database transactions being used?

COMMON ROOT CAUSES:
- Missing database transaction
- Read-after-write without waiting
- Multiple requests modifying same resource
- Cache invalidation timing
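The "multiple requests modifying same resource" cause can be illustrated with threads sharing a counter. The sketch below shows the guarded version, in which a lock makes the read-modify-write step atomic; the `Counter` class and worker counts are illustrative assumptions.

```python
import threading


class Counter:
    """Shared state guarded by a lock so read-modify-write is atomic."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:
            current = self.value      # read
            self.value = current + 1  # write; no other thread can interleave here


def run_workers(counter, workers=8, increments=10_000):
    def work():
        for _ in range(increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.value


total = run_workers(Counter())
print(total)
```

Without the lock, two threads may read the same `current` value and one update is lost, which is the kind of intermittent, retry-sensitive failure described above.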
Resource Exhaustion
markdown
SYMPTOM: System slows/crashes under load

TRACE QUESTIONS:
1. What resources are being consumed?
2. Are connections being closed?
3. Are there memory leaks?
4. Is there unbounded growth?

COMMON ROOT CAUSES:
- Database connection pool exhaustion
- Memory leaks in long-running processes
- Unbounded queues or caches
- Missing cleanup in error paths
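"Missing cleanup in error paths" is often the mechanism behind pool exhaustion. The sketch below uses a toy pool (the `ConnectionPool` class and its placeholder string connections are hypothetical) to show a context manager returning the connection even when the request fails.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Toy fixed-size pool; the connections are placeholder strings."""

    def __init__(self, size):
        self._free = queue.Queue(maxsize=size)
        for index in range(size):
            self._free.put(f"conn-{index}")

    @contextmanager
    def connection(self, timeout=1.0):
        # Raises queue.Empty instead of blocking forever when the pool is exhausted.
        conn = self._free.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._free.put(conn)  # runs on success and on error

    def available(self):
        return self._free.qsize()


pool = ConnectionPool(size=2)
try:
    with pool.connection() as conn:
        raise RuntimeError("query failed")  # simulated failure mid-request
except RuntimeError:
    pass

print(pool.available())
```

Logging `available()` over time is a cheap way to answer trace question 2: a value that only ever decreases points to a leak.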
Systematic Trace Template
markdown
## Root Cause Analysis: [Issue Title]

### 1. Incident Summary
- **Date/Time**:
- **Duration**:
- **Impact**:
- **Detected by**:

### 2. Timeline
| Time | Event |
|------|-------|
| 10:00 | First error logged |
| 10:05 | Alert triggered |
| 10:10 | Investigation started |
| 10:30 | Root cause identified |
| 10:45 | Fix deployed |

### 3. Error Chain
    [Symptom]
        ↓
    [Immediate Cause]
        ↓
    [Contributing Factor]
        ↓
    [Root Cause]

### 4. Evidence
- Log snippets
- Stack traces
- Metrics/graphs
- Reproduction steps

### 5. Root Cause
[Clear statement of the fundamental issue]

### 6. Fix
[What was done to resolve]

### 7. Prevention
- [ ] Add validation for X
- [ ] Add monitoring for Y
- [ ] Update documentation for Z
Tools for Tracing
Distributed Tracing
python
# Using OpenTelemetry
from opentelemetry import trace

tracer = trace.get_tracer(__name__)


def process_request(request):
    with tracer.start_as_current_span("process_request") as span:
        span.set_attribute("user_id", request.user_id)

        with tracer.start_as_current_span("validate"):
            validate(request)

        with tracer.start_as_current_span("process"):
            result = process(request)
            span.set_attribute("result_count", len(result))

        return result
Error Aggregation Query
sql
-- Find error patterns
SELECT
    error_type,
    error_message,
    COUNT(*) AS occurrences,
    MIN(timestamp) AS first_seen,
    MAX(timestamp) AS last_seen
FROM error_logs
WHERE timestamp > NOW() - INTERVAL '24 hours'
GROUP BY error_type, error_message
ORDER BY occurrences DESC
LIMIT 20;
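To try the aggregation without a production database, the query can be run against an in-memory SQLite table. This is a sketch under assumptions: the sample rows are invented, and because SQLite has no `NOW() - INTERVAL` syntax the cutoff is passed as a parameter instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE error_logs (error_type TEXT, error_message TEXT, timestamp TEXT)")
conn.executemany(
    "INSERT INTO error_logs VALUES (?, ?, ?)",
    [
        ("AttributeError", "'NoneType' object has no attribute 'lower'", "2024-01-15T10:00:00"),
        ("AttributeError", "'NoneType' object has no attribute 'lower'", "2024-01-15T10:05:00"),
        ("TimeoutError", "pool exhausted", "2024-01-15T10:02:00"),
        ("TimeoutError", "pool exhausted", "2024-01-13T09:00:00"),  # outside the window
    ],
)

# SQLite has no INTERVAL literal, so the cutoff is passed in as a parameter.
rows = conn.execute(
    """
    SELECT error_type, error_message, COUNT(*) AS occurrences,
           MIN(timestamp) AS first_seen, MAX(timestamp) AS last_seen
    FROM error_logs
    WHERE timestamp > ?
    GROUP BY error_type, error_message
    ORDER BY occurrences DESC
    LIMIT 20
    """,
    ("2024-01-14T10:30:00",),
).fetchall()

for row in rows:
    print(row)
```

The `first_seen` column is usually the most useful one for tracing, since it can be lined up against deploys and configuration changes.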
Checklist
- Capture exact error message and stack trace
- Identify timestamp and affected users/requests
- Gather relevant logs around the timeframe
- Reproduce in isolation if possible
- Trace backwards from symptom to root
- Document the error chain
- Identify fix AND prevention
- Create regression test
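For the last checklist item, a regression test pins the root cause so it cannot silently return. The sketch below uses `unittest` and a hypothetical fixed `User` model that validates the email at construction time; it is one possible shape for such a test, not the only one.

```python
import unittest


class User:
    """Hypothetical fixed model: email is validated at construction time."""

    def __init__(self, email):
        if not email:
            raise ValueError("email is required")
        self.email = email

    def to_dict(self):
        return {"email": self.email.lower()}


class UserEmailRegressionTest(unittest.TestCase):
    def test_missing_email_is_rejected_at_creation(self):
        with self.assertRaises(ValueError):
            User(email=None)

    def test_to_dict_lowercases_email(self):
        self.assertEqual(User("Ada@Example.com").to_dict(), {"email": "ada@example.com"})


suite = unittest.defaultTestLoader.loadTestsFromTestCase(UserEmailRegressionTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The first test targets the root cause (missing validation) and the second guards the original symptom path, so both ends of the error chain stay covered.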