Systematic Debugging Skill
Use when debugging errors, investigating failures, or troubleshooting data collection issues.
Based on obra/superpowers systematic-debugging (MIT License).
Core Principle
NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST.
Symptom-level patches create technical debt and mask underlying problems.
Four Mandatory Phases
Phase 1: Root Cause Investigation
- •
Read error messages thoroughly
- •Full stack traces, not just the last line
- •Error codes and their documentation
- •Timestamps and request context
- •
Reproduce consistently
- •Document exact reproduction steps
- •Identify conditions (specific app IDs, languages, etc.)
- •Check if issue is intermittent or consistent
- •
Check recent changes
- •Git log for recent commits
- •Configuration changes
- •External API changes (rate limits, response format)
- •
Gather evidence at component boundaries
python# Add diagnostic logging at boundaries logging.debug(f"API request: {url}, headers: {headers}") logging.debug(f"API response: status={resp.status_code}, body={resp.text[:500]}") - •
Trace data flow backward
- •Start from the error location
- •Follow the call stack up
- •Identify where data becomes invalid
Phase 2: Pattern Analysis
- •
Find working examples
- •Compare failing vs successful requests
- •Identify what's different (app ID, locale, timing)
- •
Compare against references
- •Check API documentation
- •Compare with known-good implementations
- •Review similar code in the codebase
- •
Identify all differences
- •Headers, parameters, encoding
- •Timing, ordering, state
- •
Understand dependencies
- •External services state
- •Database state
- •Network conditions
Phase 3: Hypothesis Testing
- •
Form a single clear hypothesis
- •"The error occurs because X"
- •Be specific, not vague
- •
Test with minimal changes
- •Change ONE variable at a time
- •Log before and after
- •
Verify results
- •Did the change fix the issue?
- •Did it introduce new issues?
- •Is it reproducible?
Phase 4: Implementation
- •
Create a failing test case first
pythondef test_handles_malformed_date(): # Reproduces the bug result = parse_date("Invalid Date Format") assert result is None # Should not crash - •
Implement single fix addressing root cause
- •Not a workaround
- •Not multiple changes at once
- •
Verify the fix
- •Original test passes
- •No regression in other tests
- •Manual verification if needed
Critical Guardrails
Red Flags - Stop and Restart
- •Proposing solutions before investigation
- •Attempting multiple simultaneous fixes
- •Making changes without understanding why
- •Ignoring error messages or logs
Three-Fix Rule
If three or more fix attempts fail:
STOP. This is NOT a failed hypothesis - this is wrong architecture.
- •Step back and reassess the problem
- •Consider if the approach is fundamentally flawed
- •Discuss with team before continuing
Debugging Data Collection Issues
Common Root Causes
- •
Rate limiting
- •Check response headers for rate limit info
- •Review request timing and patterns
- •
Response format changes
- •API may have changed without notice
- •Compare current vs expected response structure
- •
Network issues
- •Timeouts, connection resets
- •DNS resolution failures
- •Check network binding configuration
- •
Data parsing errors
- •Unexpected null values
- •Encoding issues (UTF-8, special characters)
- •Date format variations by locale
- •
State management
- •Database connection pool exhaustion
- •Cursor position issues
- •Transaction isolation problems
Diagnostic Checklist
# 1. Log the full request
logging.debug(f"Request: {method} {url}")
logging.debug(f"Headers: {headers}")
logging.debug(f"Body: {body}")
# 2. Log the full response
logging.debug(f"Response status: {resp.status_code}")
logging.debug(f"Response headers: {resp.headers}")
logging.debug(f"Response body: {resp.text[:1000]}")
# 3. Check database state
logging.debug(f"DB pool: active={pool.get_stats()}")
# 4. Track timing
start = time.time()
# ... operation ...
logging.debug(f"Operation took {time.time() - start:.2f}s")
Time Investment
Systematic debugging: 15-30 minutes, 95% first-time success Guess-and-check: 2-3 hours, 40% success rate
The systematic approach is faster even under time pressure.