Troubleshooting & Debugging Skill
Guide users through systematic problem diagnosis and resolution using structured methodologies applicable to any domain.
Quick Start Workflow
When a troubleshooting request arrives, follow this systematic approach:
- •Understand - What's expected vs actual, when started, what changed
- •Gather - Collect diagnostic information (logs, errors, environment, steps to reproduce)
- •Hypothesize - List possible causes ranked by probability
- •Design - Create tests that isolate variables
- •Test - Execute systematically, document results
- •Identify - Distinguish symptoms from root cause
- •Verify - Confirm fix resolves the issue
- •Document - Record solution for future reference
When to Use This Skill
Activate for requests involving:
- •"Debug this..." / "Troubleshoot..."
- •"Why isn't this working..." / "Getting an error..."
- •"Something's wrong with..." / "How do I fix..."
- •Any problem description or malfunction report
- •Unexpected behavior or failures
- •Performance issues or degradation
- •Intermittent problems
Problem Understanding Phase
Essential First Questions
Before diving into solutions, gather critical context:
Problem Description:
- •What are you trying to do? (expected behavior)
- •What's actually happening? (actual behavior)
- •What error messages or symptoms do you see?
- •Can you reproduce it consistently?
Timeline:
- •When did this start?
- •Did it ever work correctly?
- •What changed recently? (code, config, environment, data, dependencies)
Scope:
- •Does this happen every time or intermittently?
- •Does it happen for everyone or just some users?
- •Does it happen in all environments or just specific ones?
Environment:
- •What system/platform/version?
- •What's the configuration?
- •What are the dependencies and their versions?
Reproduction:
- •What are the exact steps to reproduce?
- •Can you provide a minimal example that demonstrates the issue?
- •What's the simplest case that fails?
Problem Type Classification
Categorize to guide troubleshooting approach:
By Consistency:
- •Consistent - Happens every time (easier to debug)
- •Intermittent - Happens sometimes (harder, often timing/race conditions)
- •Progressive - Gets worse over time (resource leaks, degradation)
By History:
- •Regression - Worked before, broke recently (what changed?)
- •Never worked - New feature/setup (design or setup issue)
By Scope:
- •Universal - Happens everywhere (likely code/logic issue)
- •Environmental - Happens in specific environment (config, dependencies, resources)
- •User-specific - Happens for specific users (permissions, data, state)
By Domain:
- •Code/Logic - Software bugs, algorithmic errors
- •Configuration - Settings, parameters, environment variables
- •Data - Input validation, corruption, format issues
- •Infrastructure - Network, storage, resources, services
- •Integration - API failures, service dependencies
- •Human process - Workflow, communication, missing steps
Classification helps prioritize hypotheses and diagnostic approaches.
Diagnostic Information Gathering
Log and Error Analysis
Error Messages:
- •Copy exact error text (don't paraphrase)
- •Note error code/type if provided
- •Identify stack trace or error location
- •Look for nested/chained errors (root cause often deeper)
Log Examination:
- •Check timestamps (when did issue occur)
- •Look for ERROR, WARN, EXCEPTION entries near failure time
- •Trace request/operation ID through logs
- •Check for patterns (repeating errors, sequences)
- •Review logs before failure (what was system doing)
System State:
- •Resource usage (CPU, memory, disk, network)
- •Process/service status
- •Recent restarts or crashes
- •System event logs
Environment Documentation
Capture complete environment picture:
Versions:
- •Application/service version
- •Runtime/interpreter version (Node, Python, Java, etc.)
- •OS version
- •Dependency versions (libraries, packages)
- •Database/cache versions
Configuration:
- •Config files and their values
- •Environment variables
- •Feature flags or toggles
- •Settings that differ from defaults
Infrastructure:
- •Network topology
- •Service endpoints and connectivity
- •Firewall/security rules
- •Load balancers, proxies
- •DNS configuration
Reproduction Case Development
Goal: Find minimal reproducible example (MRE)
Process:
- •Document exact steps from clean state
- •Strip away non-essential steps
- •Minimize data/input to simplest case
- •Remove dependencies where possible
- •Verify it still reproduces
Good reproduction case:
- •Anyone can follow steps and see same issue
- •Minimal (fewest steps, smallest data)
- •Self-contained (no external dependencies if possible)
- •Deterministic (reproduces consistently)
See references/information-gathering.md for detailed techniques.
Hypothesis Generation
Forming Testable Hypotheses
Generate possibilities:
- •Brainstorm what could cause observed symptoms
- •Consider each problem type (code, config, data, infrastructure)
- •Review known failure patterns for this domain
- •Don't self-censor - list everything plausible
Rank by probability:
- •Most likely causes first
- •Consider base rates (common vs rare failures)
- •Factor in recent changes
- •Weight by evidence strength
Make hypotheses testable:
- •"If X is the cause, then Y should be true"
- •Design specific test that would confirm/refute
- •Ensure test is practical to execute
Example hypotheses (ranked):
- •High probability: Recent config change introduced typo → Check config against previous version
- •Medium probability: Dependency version mismatch → Verify all dependencies match working environment
- •Low probability: Hardware failure → Run diagnostics, check system logs for hardware errors
Common Failure Patterns
Code/Logic: Off-by-one, null/undefined, type mismatch, logic errors, unhandled edge cases, race conditions
Configuration: Typos, wrong environment variables, permissions, path issues, missing settings
Data: Invalid format, null/empty values, corruption, encoding issues, schema mismatch
Infrastructure: Network connectivity, port conflicts, insufficient resources, permissions, services, firewall
Integration: API changes, expired credentials, rate limiting, timeouts, version incompatibility
See references/hypothesis-generation.md and references/domain-specific-patterns.md for comprehensive patterns.
Systematic Testing
Scientific Method Principles
Change one variable at a time:
- •Modify single factor per test
- •Know exactly what you changed
- •Document change and result
- •Reset to baseline between tests
Verify assumptions:
- •Don't assume anything works
- •Test each layer (can you reach service? is it responding? is auth working?)
- •Question "obvious" things
- •Build confidence from known-good baseline
Isolate variables:
- •Remove unnecessary components
- •Test in controlled environment
- •Disable features to narrow scope
- •Use minimal configuration
Document everything:
- •What you tested
- •What you changed
- •What happened (exact result)
- •Conclusions drawn
- •Next steps
Use scripts/hypothesis_tracker.py to maintain investigation log.
Binary Search Troubleshooting
Concept: Divide problem space in half, determine which half contains issue, repeat.
Applications:
- •Version bisect: Which commit introduced bug? (git bisect)
- •Config narrowing: Which of 50 config settings causes issue?
- •Data range: Which records in dataset cause failure?
- •Code sections: Which function/module has the bug?
- •Time range: When did problem start occurring?
Process:
- •Identify search space boundaries (working ↔ broken)
- •Test midpoint
- •Determine which half contains issue
- •Repeat with smaller range
- •Continue until isolated to specific element
Example - Config debugging:
- •Start: All 50 config settings (broken)
- •Disable half (25 settings) → Still broken
- •Issue is in disabled half
- •Disable half of those (12-13) → Works!
- •Issue is in recently disabled set
- •Continue narrowing...
- •Result: Isolated to single problematic setting
Use scripts/binary_search_helper.py for interactive guidance.
Differential Diagnosis
Compare working vs non-working cases to identify differences:
Compare:
- •Configurations
- •Environments
- •Data inputs
- •Dependencies
- •Timing/sequence
- •System state
Look for:
- •What's present in failing case but not working case
- •What's different between the two
- •What versions differ
- •What settings differ
Technique:
- •Start with known-good case
- •Progressively make it more like failing case
- •Stop when you introduce the failure
- •That change is the cause
Root Cause Analysis
5 Whys Technique
Dig deeper by asking "why" repeatedly (typically 5 times):
Example:
- •Problem: Website is down
- •Why? Database connection failed
- •Why? Connection pool exhausted
- •Why? Connections not being released
- •Why? Code missing connection.close() in error path
- •Why? Developer didn't know about finally blocks
- •Root cause: Missing error-handling training, missing code review checklist
Guidelines:
- •Each "why" should be factual (not speculative)
- •Stop when you reach actionable root cause
- •May need more or fewer than 5 whys
- •Focus on process/system causes, not blaming people
Distinguishing Symptoms from Root Cause
Symptom: Observable manifestation of problem Root Cause: Underlying reason problem exists
Example:
- •Symptom: Application crashes
- •Symptom: Out of memory error
- •Symptom: Memory leak in caching module
- •Root Cause: Cache eviction policy not implemented
Test: Would fixing this prevent all instances of the problem?
- •If no → It's a symptom, keep digging
- •If yes → Likely root cause
Verify:
- •Fix the root cause
- •Test that all symptoms disappear
- •Verify issue doesn't recur
See references/diagnostic-frameworks.md for Fishbone diagrams and FMEA.
Troubleshooting Principles
Core Principles:
- •
Reproduce before fixing - If you can't reproduce it, you can't verify the fix
- •
Change one thing at a time - Otherwise you won't know what fixed it
- •
Verify assumptions - "It should work" ≠ "It does work"
- •
Start with simple explanations - Occam's Razor (common problems are common)
- •
Follow the data - Believe logs/evidence over intuition
- •
Work from known-good baseline - Establish what works, then build up
- •
Document as you go - Your future self will thank you
- •
Know when to stop - Time-box investigation, escalate if stuck
- •
Avoid confirmation bias - Look for evidence that disproves your hypothesis too
- •
Treat the disease, not symptoms - Fix root cause, not just visible symptoms
Domain-Specific Guidance
Software/Code: Syntax errors, logic bugs, runtime errors, concurrency issues, memory problems
- •Use debugger, add logging, rubber duck debugging, review changes, check typos
System/Infrastructure: Network, permissions, services, ports, resources, configuration
- •Check service status, test connectivity, review logs, verify config, check resources
Process/Workflow: Missing steps, unmet dependencies, communication gaps, timing issues
- •Map actual vs expected process, identify handoffs, check prerequisites
See references/domain-specific-patterns.md for comprehensive diagnostic approaches by domain.
Common Anti-Patterns to Avoid
❌ Change multiple things at once - You won't know what fixed it ❌ Assume without verifying - "It worked last week" doesn't mean it works now ❌ Skip reproduction - Can't verify fix without reproducing first ❌ Pursue only favorite hypothesis - Confirmation bias blinds you ❌ Give up on intermittent issues - Often most critical ❌ Forget to document - Others will duplicate your work ❌ Treat symptoms not root cause - Suppressing errors doesn't fix problems ❌ Blame the user - Stops investigation prematurely ❌ Rush to solution - Premature fixes waste time
Solution Verification
After implementing fix:
- •
Test reproduction case - Does it now work?
- •
Test edge cases - Does fix work in all scenarios?
- •
Regression check - Did fix break anything else?
- •
Performance check - Did fix introduce performance issues?
- •
Monitor in production - Watch for recurrence or side effects
- •
Verify with users - Confirm issue is resolved from their perspective
Documentation
Document for future:
Troubleshooting Log:
- •Problem description
- •Investigation steps
- •Hypotheses tested
- •Results of each test
- •Root cause identified
- •Solution implemented
Use assets/templates/troubleshooting_log_template.md
Root Cause Analysis Report:
- •Executive summary
- •Problem statement
- •Timeline of events
- •Root cause analysis (5 Whys, Fishbone)
- •Corrective actions
- •Preventive measures
Use assets/templates/rca_report_template.md
Reproduction Steps:
- •Minimal reproduction example
- •Prerequisites
- •Step-by-step instructions
- •Expected vs actual results
Use assets/templates/reproduction_steps_template.md
When to Escalate
Know when to ask for help:
- •You've exhausted reasonable hypotheses
- •Issue is outside your domain expertise
- •Problem is time-critical and investigation is taking too long
- •You need access/permissions you don't have
- •Problem appears to be in external system
- •You've been going in circles (revisiting same hypotheses)
Before escalating:
- •Document what you've tried
- •Share reproduction case
- •Provide all diagnostic information
- •Explain your hypothesis and reasoning
- •Suggest what help you need
Using Supporting Resources
Additional resources in this skill:
- •references/diagnostic-frameworks.md: 5 Whys, Fishbone, FMEA, Binary Search, Comparative Analysis
- •references/information-gathering.md: Log analysis, error interpretation, environment documentation
- •references/hypothesis-generation.md: Failure patterns, hypothesis ranking, test design
- •references/domain-specific-patterns.md: Common issues by domain (software, systems, hardware, process, data)
- •scripts/hypothesis_tracker.py: Track hypotheses and test results
- •scripts/binary_search_helper.py: Interactive binary search guidance
- •assets/templates/: Documentation templates for investigation logs, RCA reports, reproduction cases
Reference these for deeper guidance on specific techniques.
Remember: The best debuggers are methodical, patient, and systematic. Resist the urge to jump to solutions. Follow the data. Document your process. The answer will reveal itself through systematic elimination.