Root Cause Tracing

When to Use This Skill

• Investigating production errors
• Debugging complex multi-step failures
• Analyzing error chains and cascading failures
• Understanding why a specific state occurred
• Post-mortem analysis of incidents

Tracing Methodology

1. Start from the Symptom

markdown

## Error Chain Template

SYMPTOM: [What the user/system reported]
↓
IMMEDIATE CAUSE: [Direct technical cause]
↓
CONTRIBUTING FACTOR: [What enabled the immediate cause]
↓
ROOT CAUSE: [The fundamental issue to fix]

Example Trace

markdown

SYMPTOM: User sees "500 Internal Server Error"
↓
IMMEDIATE CAUSE: Unhandled null pointer exception in UserService.getProfile()
↓
CONTRIBUTING FACTOR: Database returned null for user that should exist
↓
ROOT CAUSE: Race condition during user registration - DB write not committed before redirect
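
If error chains are collected across several incidents, it can help to hold them as data rather than free text. The sketch below is one possible shape, not a prescribed format: the `ErrorChain` class and its `render` method are hypothetical names introduced here, and the fields simply mirror the four levels of the template.

```python
from dataclasses import dataclass


@dataclass
class ErrorChain:
    """One traced failure, ordered from what was observed to what must be fixed."""
    symptom: str
    immediate_cause: str
    contributing_factor: str
    root_cause: str

    def render(self) -> str:
        """Format the chain in the same top-down layout as the template."""
        levels = [
            ("SYMPTOM", self.symptom),
            ("IMMEDIATE CAUSE", self.immediate_cause),
            ("CONTRIBUTING FACTOR", self.contributing_factor),
            ("ROOT CAUSE", self.root_cause),
        ]
        return "\n↓\n".join(f"{label}: {text}" for label, text in levels)


chain = ErrorChain(
    symptom='User sees "500 Internal Server Error"',
    immediate_cause="Unhandled null pointer exception in UserService.getProfile()",
    contributing_factor="Database returned null for user that should exist",
    root_cause="Race condition during user registration",
)
print(chain.render())
```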

Trace Techniques

Stack Trace Analysis

python

# Given this stack trace:
# Traceback (most recent call last):
#   File "api/handlers.py", line 45, in get_user
#     profile = user_service.get_profile(user_id)
#   File "services/user.py", line 23, in get_profile
#     return self.repo.find(user_id).to_dict()
#   File "models/user.py", line 67, in to_dict
#     'email': self.email.lower()
# AttributeError: 'NoneType' object has no attribute 'lower'

# Trace backwards:
# 1. self.email is None (immediate cause)
# 2. User model was created without email validation
# 3. API endpoint doesn't validate email before save
# 4. ROOT CAUSE: Missing input validation
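
Reading a trace backwards can also be done programmatically. The following is a minimal sketch using the standard `traceback` module; the `User`, `get_profile` and `get_user` definitions are simplified stand-ins for the three frames in the trace above, and `frames_innermost_first` is a hypothetical helper name.

```python
import traceback


def frames_innermost_first(exc: BaseException) -> list[str]:
    """Return 'file:line in function' strings, starting where the error was raised."""
    frames = traceback.extract_tb(exc.__traceback__)
    return [f"{f.filename}:{f.lineno} in {f.name}" for f in reversed(frames)]


# Hypothetical stand-ins for the frames in the stack trace above.
class User:
    def __init__(self, email):
        self.email = email

    def to_dict(self):
        return {"email": self.email.lower()}  # fails when email is None


def get_profile(user):
    return user.to_dict()


def get_user(user):
    return get_profile(user)


try:
    get_user(User(email=None))
except AttributeError as exc:
    captured = frames_innermost_first(exc)
    for step, frame in enumerate(captured, start=1):
        print(step, frame)
```

The first printed frame is the immediate cause; each later frame is one step closer to the entry point, which is the order in which the trace questions are usually asked.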

Log Correlation

bash

# Find related logs by request ID
grep "req_abc123" /var/log/app/*.log | sort -t: -k2

# Timeline reconstruction
grep -h "2024-01-15T10:3" error.log access.log | sort

# Find first occurrence of error pattern
grep -n "NullPointerException" app.log | head -1
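
When `grep` and `sort` are not enough, for example when timestamps need real parsing, the same correlation can be sketched in Python. The line format below is an assumption (ISO-8601 timestamp, level, request ID in square brackets, message), and the sample lines are made up for illustration.

```python
import re
from datetime import datetime

# Assumed line format: "<ISO-8601 timestamp> <LEVEL> [<request id>] <message>"
LINE_RE = re.compile(r"^(?P<ts>\S+) (?P<level>\w+) \[(?P<req>[^\]]+)\] (?P<msg>.*)$")


def timeline_for_request(lines, request_id):
    """Return (timestamp, level, message) tuples for one request, oldest first."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match["req"] == request_id:
            ts = datetime.fromisoformat(match["ts"])
            events.append((ts, match["level"], match["msg"]))
    return sorted(events, key=lambda event: event[0])


# Made-up lines, as if merged from two log files, deliberately out of order.
sample = [
    "2024-01-15T10:30:02 ERROR [req_abc123] NullPointerException in getProfile",
    "2024-01-15T10:30:01 INFO [req_zzz999] GET /health",
    "2024-01-15T10:30:00 INFO [req_abc123] GET /api/users/42",
    "2024-01-15T10:30:01 WARN [req_abc123] user 42 not found",
]

for ts, level, msg in timeline_for_request(sample, "req_abc123"):
    print(ts.isoformat(), level, msg)
```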

State Inspection

python

# Add trace points to understand state flow
import logging

logger = logging.getLogger(__name__)

def process_order(order):
    logger.debug(f"[TRACE] Input state: {order.__dict__}")

    validated = validate_order(order)
    logger.debug(f"[TRACE] After validation: {validated.__dict__}")

    calculated = calculate_totals(validated)
    logger.debug(f"[TRACE] After calculation: {calculated.__dict__}")

    saved = save_order(calculated)
    logger.debug(f"[TRACE] After save: {saved.__dict__}")

    return saved
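
Hand-written trace lines get repetitive once there are many steps. One alternative, sketched here under the assumption that each step is a plain function, is a decorator that logs inputs and outputs. `trace_state` is a hypothetical helper, and a dict stands in for the order object.

```python
import functools
import logging

logger = logging.getLogger(__name__)


def trace_state(func):
    """Log the arguments and the return value of each call at DEBUG level."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.debug("[TRACE] %s input: args=%r kwargs=%r", func.__name__, args, kwargs)
        result = func(*args, **kwargs)
        logger.debug("[TRACE] %s output: %r", func.__name__, result)
        return result
    return wrapper


# Hypothetical pipeline step; a dict stands in for the order object.
@trace_state
def calculate_totals(order):
    total = sum(item["price"] * item["qty"] for item in order["items"])
    return {**order, "total": total}


logging.basicConfig(level=logging.DEBUG)
order = calculate_totals({"items": [{"price": 5, "qty": 2}, {"price": 3, "qty": 1}]})
```

Because the messages use lazy `%` formatting, the state is only rendered to a string when DEBUG logging is enabled.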

Debugging Patterns

Binary Search Debugging

python

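# When a long process fails somewhere, checkpoint each stage; the first
# failing checkpoint bounds the search, and you narrow down from there.

def long_process(data):
    # Checkpoint after each step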
    step1_result = step1(data)
    print(f"CHECKPOINT 1: {step1_result is not None}")  # Pass

    step2_result = step2(step1_result)
    print(f"CHECKPOINT 2: {step2_result is not None}")  # Pass

    step3_result = step3(step2_result)
    print(f"CHECKPOINT 3: {step3_result is not None}")  # FAIL - narrow down here

    step4_result = step4(step3_result)
    # ...
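
The checkpoints above are inspected one after another. When the steps are deterministic and can be replayed from the original input, the search itself can be halved at each probe. The sketch below assumes exactly that, plus that a bad state stays bad in every later step; `run_prefix` and `first_bad_step` are hypothetical helper names.

```python
def run_prefix(steps, data, count):
    """Apply the first `count` steps to `data` and return the intermediate result."""
    for step in steps[:count]:
        data = step(data)
    return data


def first_bad_step(steps, data, is_valid):
    """Binary-search for the index of the first step whose output fails is_valid.

    Assumes the input itself is valid, the full pipeline's output is not, and
    that once the state goes bad it stays bad in every later step.
    """
    lo, hi = 0, len(steps)  # state is valid after `lo` steps, invalid after `hi` steps
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_valid(run_prefix(steps, data, mid)):
            lo = mid
        else:
            hi = mid
    return hi - 1  # index of the step that turned a valid state into an invalid one


# Hypothetical four-step pipeline in which the third step loses the value.
steps = [
    lambda x: x + 1,
    lambda x: x * 2,
    lambda x: None,
    lambda x: x,
]
print(first_bad_step(steps, 1, lambda result: result is not None))  # prints 2
```

With four steps this saves little, but the number of probes grows with the logarithm of the pipeline length rather than linearly.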

Delta Debugging

bash

# Find which commit introduced a bug
git bisect start
git bisect bad HEAD
git bisect good v1.0.0
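# Git checks out a commit midway between good and bad. Test it, then run
# `git bisect good` or `git bisect bad`; repeat until Git names the first bad commit.
git bisect reset  # return to the original HEAD when finished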
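
Marking commits by hand can be automated with `git bisect run <command>`, which reads the command's exit code: 0 means good, 125 means skip this commit, and any other code from 1 to 127 means bad. The sketch below maps a build step and a test step onto those codes. The `classify` helper and the demo commands are assumptions for illustration; a real script would call the project's own build and test commands.

```python
import subprocess
import sys

GOOD, BAD, SKIP = 0, 1, 125  # exit codes that `git bisect run` interprets


def classify(build_cmd, test_cmd):
    """Map build and test results onto the exit code git bisect expects."""
    if subprocess.run(build_cmd).returncode != 0:
        return SKIP  # commit cannot be built, so it cannot be judged
    if subprocess.run(test_cmd).returncode != 0:
        return BAD
    return GOOD


# Demo commands that simply exit with a fixed status (stand-ins for real steps).
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(1)"]
print(classify(passing, failing))  # prints 1: builds, but the test fails
```

In a real bisect script the last line would be `sys.exit(classify(build, test))`, and the script would be started with `git bisect run python <script>` after marking one good and one bad commit.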

Rubber Duck Tracing

markdown

## Explain the flow out loud:

1. User clicks "Submit Order"
2. Frontend sends POST to /api/orders
3. Backend validates the payload... WAIT
   - Does it validate the discount code?
   - What if discount code is empty string vs null?
4. Found it: Empty string "" passes validation but fails lookup
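
The bug found in step 4 can be reproduced in a few lines. This is a sketch with invented names (`DISCOUNTS`, `validate_loose`, `validate_strict`, `discount_rate`); it only illustrates how a check for "missing" can let an empty string through to a lookup that then fails.

```python
# Hypothetical discount table and validators, for illustration only.
DISCOUNTS = {"SAVE10": 0.10}


def validate_loose(payload):
    """Buggy check: only rejects a missing code, so "" slips through."""
    return payload.get("discount_code") is not None


def validate_strict(payload):
    """Treat None, "" and whitespace-only codes the same way: no code supplied."""
    code = payload.get("discount_code")
    return bool(code and code.strip())


def discount_rate(payload, validate):
    if not validate(payload):
        return 0.0  # no usable code supplied
    return DISCOUNTS[payload["discount_code"]]  # raises KeyError for unknown codes


payload = {"discount_code": ""}
try:
    discount_rate(payload, validate_loose)
except KeyError:
    print('loose validation: "" passed the check but failed the lookup')
print("strict validation:", discount_rate(payload, validate_strict))
```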

Error Pattern Recognition

Null/Undefined Errors

markdown

SYMPTOM: Cannot read property 'X' of null/undefined

TRACE QUESTIONS:
1. What variable is null?
2. Where was it supposed to be set?
3. What condition would leave it unset?
4. Is there a race condition?
5. Is there a missing await/callback?

COMMON ROOT CAUSES:
- Async operation not awaited
- Conditional initialization with edge case
- Object destructuring with missing keys
- Database query returning no results
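
For the "query returning no results" case, one common mitigation is to fail at the point where the missing value is first seen, so the trace points at the cause rather than at a later attribute access. A minimal sketch, assuming an in-memory dict in place of a database; `UserNotFound`, `find_user` and `get_profile` are illustrative names.

```python
from typing import Optional


class UserNotFound(LookupError):
    """Raised at the point where the missing value is first observed."""


# Hypothetical in-memory stand-in for a database table.
USERS = {1: {"email": "Ada@Example.com"}}


def find_user(user_id: int) -> Optional[dict]:
    return USERS.get(user_id)  # None when the query matches no row


def get_profile(user_id: int) -> dict:
    user = find_user(user_id)
    if user is None:
        # Fail here, next to the cause, instead of several frames later.
        raise UserNotFound(f"no user with id {user_id}")
    return {"email": user["email"].lower()}


print(get_profile(1))
try:
    get_profile(2)
except UserNotFound as exc:
    print("caught:", exc)
```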

Race Conditions

markdown

SYMPTOM: Intermittent failures, works on retry

TRACE QUESTIONS:
1. Are there multiple async operations?
2. Is there shared state?
3. Are there assumptions about order of execution?
4. Are database transactions being used?

COMMON ROOT CAUSES:
- Missing database transaction
- Read-after-write without waiting for the write to commit
- Multiple requests modifying same resource
- Cache invalidation timing
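
The "multiple requests modifying same resource" cause can be sketched with threads and a shared counter. This is an illustration under simplifying assumptions, not a model of any particular system: threads stand in for concurrent requests, and the `Counter` and `hammer` names are invented for the example.

```python
import threading


class Counter:
    """Shared state updated with a read-modify-write sequence."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment_unsafe(self):
        current = self.value      # read
        self.value = current + 1  # write; another thread may have written in between

    def increment_safe(self):
        with self._lock:          # makes the read and the write one atomic unit
            self.value += 1


def hammer(method, threads=8, iterations=10_000):
    """Call `method` from several threads at once and wait for all of them."""
    def worker():
        for _ in range(iterations):
            method()
    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()


safe = Counter()
hammer(safe.increment_safe)
print("with lock:", safe.value)

unsafe = Counter()
hammer(unsafe.increment_unsafe)
print("without lock:", unsafe.value, "(can be lower, and varies between runs)")
```

The unlocked version may happen to produce the right total on a given run, which is exactly the "intermittent, works on retry" symptom described above.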

Resource Exhaustion

markdown

SYMPTOM: System slows/crashes under load

TRACE QUESTIONS:
1. What resources are being consumed?
2. Are connections being closed?
3. Are there memory leaks?
4. Is there unbounded growth?

COMMON ROOT CAUSES:
- Database connection pool exhaustion
- Memory leaks in long-running processes
- Unbounded queues or caches
- Missing cleanup in error paths
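
"Missing cleanup in error paths" is often what turns an ordinary failure into pool exhaustion. The sketch below shows one way to guarantee release with a context manager; the `Pool` class is a toy written for this example, and strings stand in for real connections.

```python
import queue
from contextlib import contextmanager


class Pool:
    """Tiny fixed-size pool; strings stand in for real connections."""

    def __init__(self, size):
        self._free = queue.Queue()
        for i in range(size):
            self._free.put(f"conn-{i}")

    @contextmanager
    def connection(self, timeout=0.1):
        conn = self._free.get(timeout=timeout)  # raises queue.Empty when exhausted
        try:
            yield conn
        finally:
            self._free.put(conn)  # runs on success and on exception alike

    def available(self):
        return self._free.qsize()


pool = Pool(size=2)
for _ in range(5):
    try:
        with pool.connection() as conn:
            raise RuntimeError(f"query failed on {conn}")
    except RuntimeError:
        pass  # the error path still returned the connection

print("free connections:", pool.available())  # prints: free connections: 2
```

Without the `finally` clause, the five failing calls would drain a pool of two and the third call would block until its timeout.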

Systematic Trace Template

markdown

## Root Cause Analysis: [Issue Title]

### 1. Incident Summary
- **Date/Time**:
- **Duration**:
- **Impact**:
- **Detected by**:

### 2. Timeline
| Time | Event |
|------|-------|
| 10:00 | First error logged |
| 10:05 | Alert triggered |
| 10:10 | Investigation started |
| 10:30 | Root cause identified |
| 10:45 | Fix deployed |

### 3. Error Chain

[Symptom]
↓
[Immediate Cause]
↓
[Contributing Factor]
↓
[Root Cause]

### 4. Evidence
- Log snippets
- Stack traces
- Metrics/graphs
- Reproduction steps

### 5. Root Cause
[Clear statement of the fundamental issue]

### 6. Fix
[What was done to resolve]

### 7. Prevention
- [ ] Add validation for X
- [ ] Add monitoring for Y
- [ ] Update documentation for Z

Tools for Tracing

Distributed Tracing

python

# Using OpenTelemetry
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def process_request(request):
    with tracer.start_as_current_span("process_request") as span:
        span.set_attribute("user_id", request.user_id)

        with tracer.start_as_current_span("validate"):
            validate(request)

        with tracer.start_as_current_span("process"):
            result = process(request)
            span.set_attribute("result_count", len(result))

        return result
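
Where a tracing backend is not available, the core idea of carrying one identifier through every step of a request can be approximated with the standard library alone. The sketch below is that approximation, not a replacement for OpenTelemetry: `request_id`, `RequestIdFilter` and `handle_request` are names invented for the example, and an in-memory stream stands in for a log file.

```python
import contextvars
import io
import logging
import uuid

request_id = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current request ID onto every record so the formatter can print it."""

    def filter(self, record):
        record.request_id = request_id.get()
        return True


stream = io.StringIO()  # in-memory sink so the sketch has no file-system side effects
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s [%(request_id)s] %(message)s"))
handler.addFilter(RequestIdFilter())

logger = logging.getLogger("trace_sketch")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False


def handle_request(payload):
    token = request_id.set(f"req_{uuid.uuid4().hex[:6]}")
    try:
        logger.info("validate %s", payload)
        logger.info("process %s", payload)
    finally:
        request_id.reset(token)  # do not leak the ID into the next request


handle_request({"order": 1})
print(stream.getvalue())
```

Every line written during one call carries the same ID, so the request can later be pulled out of the logs with a single search.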

Error Aggregation Query

sql

-- Find error patterns
SELECT
  error_type,
  error_message,
  COUNT(*) as occurrences,
  MIN(timestamp) as first_seen,
  MAX(timestamp) as last_seen
FROM error_logs
WHERE timestamp > NOW() - INTERVAL '24 hours'
GROUP BY error_type, error_message
ORDER BY occurrences DESC
LIMIT 20;
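
To try the aggregation without a production database, the same grouping can be run against an in-memory SQLite table. This is a sketch with made-up rows; the 24-hour filter is dropped because the sample timestamps are fixed, and the interval syntax in the query above is not SQLite's.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE error_logs (error_type TEXT, error_message TEXT, timestamp TEXT)"
)
# Made-up sample rows; ISO-8601 strings sort chronologically as plain text.
conn.executemany(
    "INSERT INTO error_logs VALUES (?, ?, ?)",
    [
        ("AttributeError", "'NoneType' object has no attribute 'lower'", "2024-01-15T10:00:00"),
        ("TimeoutError", "connection pool exhausted", "2024-01-15T10:02:00"),
        ("AttributeError", "'NoneType' object has no attribute 'lower'", "2024-01-15T10:05:00"),
    ],
)

rows = conn.execute(
    """
    SELECT error_type, error_message, COUNT(*) AS occurrences,
           MIN(timestamp) AS first_seen, MAX(timestamp) AS last_seen
    FROM error_logs
    GROUP BY error_type, error_message
    ORDER BY occurrences DESC
    LIMIT 20
    """
).fetchall()

for error_type, message, occurrences, first_seen, last_seen in rows:
    print(occurrences, error_type, first_seen, last_seen)
```

The `first_seen` column is often the most useful one when tracing: it gives the earliest point from which to start reading logs.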

Checklist

• Capture exact error message and stack trace
• Identify timestamp and affected users/requests
• Gather relevant logs around the timeframe
• Reproduce in isolation if possible
• Trace backwards from symptom to root
• Document the error chain
• Identify fix AND prevention
• Create regression test
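
For the last checklist item, a regression test pins the traced failure so it cannot silently return. The sketch below uses `unittest` and revisits the `None` email case from the stack trace example. The `to_dict` function here is a hypothetical fixed version written for illustration; it is only one of several reasonable fixes.

```python
import unittest


def to_dict(email):
    """Hypothetical fixed serializer: tolerates a missing email instead of crashing."""
    return {"email": email.lower() if email is not None else None}


class ToDictRegressionTest(unittest.TestCase):
    def test_missing_email_does_not_raise(self):
        # This input reproduced the original AttributeError.
        self.assertEqual(to_dict(None), {"email": None})

    def test_email_is_still_lowercased(self):
        self.assertEqual(to_dict("Ada@Example.com"), {"email": "ada@example.com"})


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ToDictRegressionTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print("regression suite passed:", result.wasSuccessful())
```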