Debugging Skill

Reusable workflow extracted from dario-debugger expertise.

Purpose

Systematically investigate and resolve bugs through scientific methodology, root cause analysis, and evidence-based diagnosis across all technology stacks.

When to Use

•Production incidents and outages
•Intermittent or hard-to-reproduce bugs
•Performance degradation investigation
•Memory leaks and resource exhaustion
•Concurrency issues (race conditions, deadlocks)
•Crash analysis and stack trace interpretation
•Test failures and CI/CD pipeline issues

Workflow Steps

1. Reproduce
   •Confirm issue can be consistently reproduced
   •Document exact reproduction steps
   •Identify required environment/conditions
   •Create minimal reproduction case
2. Isolate
   •Narrow down problem space (component, input, timing)
   •Use binary search to eliminate possibilities
   •Identify affected versions (git bisect)
   •Determine scope of impact
3. Gather Evidence
   •Collect logs from all relevant systems
   •Capture stack traces and error messages
   •Record metrics and performance data
   •Preserve system state before changes
   •Use distributed tracing for microservices
4. Hypothesize
   •Form testable hypotheses about root cause
   •List potential causes ranked by probability
   •Consider symptoms vs actual cause
   •Apply 5 Whys technique
5. Test Hypotheses
   •Design experiments to prove/disprove each hypothesis
   •Use debuggers and profilers to validate
   •Check logs for evidence supporting/refuting
   •Eliminate possibilities systematically
6. Identify Root Cause
   •Determine fundamental issue (not just symptom)
   •Verify with >95% confidence
   •Document evidence trail
   •Distinguish correlation from causation
7. Fix & Verify
   •Implement targeted fix for root cause
   •Verify fix resolves issue
   •Test for regressions
   •Measure impact of fix
8. Prevent Recurrence
   •Add regression tests
   •Implement monitoring/alerting
   •Document findings for team
   •Update runbooks if applicable
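
For the Reproduce step, an intermittent failure is easier to reason about once its failure rate has been measured. The following is a minimal Python sketch, not a prescribed harness. `flaky_operation` is a hypothetical stand-in for the code under investigation, and the 20% failure probability and fixed seed are assumptions chosen only to make the run repeatable.

```python
import random
from collections import Counter


def flaky_operation(rng: random.Random) -> None:
    """Hypothetical stand-in for the code under investigation."""
    if rng.random() < 0.2:  # assumed failure probability, for illustration only
        raise TimeoutError("simulated intermittent failure")


def measure_failure_rate(operation, runs: int, seed: int) -> dict:
    """Run `operation` repeatedly and summarize how often and how it fails."""
    rng = random.Random(seed)  # fixed seed so the measurement is repeatable
    failures = Counter()
    for _ in range(runs):
        try:
            operation(rng)
        except Exception as exc:  # record every failure type and keep going
            failures[type(exc).__name__] += 1
    failed = sum(failures.values())
    return {
        "runs": runs,
        "failed": failed,
        "rate": failed / runs,
        "by_type": dict(failures),
    }


if __name__ == "__main__":
    print(measure_failure_rate(flaky_operation, runs=1000, seed=42))
```

Recording the rate before and after a change gives the Fix & Verify step a number to compare against, rather than relying on a single passing run.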

Inputs Required

•Bug description: Expected vs actual behavior
•Environment: OS, versions, configurations, recent changes
•Reproduction: Steps to reproduce (if known)
•Evidence: Logs, error messages, screenshots, metrics
•Scope: When did it start? How many affected?

Outputs Produced

•Root Cause Report: Detailed analysis with evidence
•Reproduction Steps: Minimal, reliable reproduction case
•Fix Recommendations: Prioritized solutions with trade-offs
•Prevention Strategy: How to prevent similar issues
•Regression Tests: Tests to verify fix and prevent recurrence

Bug Classification

Priority Levels

•🔴 P0 - Critical: System down, data loss, security breach - immediate response
•🟠 P1 - High: Major feature broken, significant user impact
•🟡 P2 - Medium: Feature degraded, workaround exists
•🟢 P3 - Low: Minor issue, cosmetic, edge case

Debugging Techniques

Scientific Method

•Observe the problem
•Form hypothesis about cause
•Design experiment to test hypothesis
•Execute test and collect data
•Analyze results
•Refine hypothesis or conclude

Binary Search Debugging

•Divide problem space in half repeatedly
•Test midpoint to eliminate half of possibilities
•Efficient for narrowing down the cause: n candidates need only about log2(n) tests
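
The halving procedure above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: the candidate list is ordered, the first entry is known good, the last is known bad, and once a candidate is bad every later one is bad too (the same assumption git bisect relies on). The names `first_bad`, `is_bad`, and the `v0`..`v15` version list are hypothetical.

```python
from typing import Callable, Sequence


def first_bad(candidates: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first candidate for which is_bad is True.

    Assumes candidates[0] is good, candidates[-1] is bad, and that every
    candidate after the first bad one is also bad.
    """
    lo, hi = 0, len(candidates) - 1  # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(candidates[mid]):
            hi = mid  # the bug was introduced at or before mid
        else:
            lo = mid  # the bug was introduced after mid
    return candidates[hi]


if __name__ == "__main__":
    versions = [f"v{i}" for i in range(16)]  # hypothetical build list
    tested = []

    def is_bad(version: str) -> bool:
        tested.append(version)
        return int(version[1:]) >= 11  # pretend the bug landed in v11

    print(first_bad(versions, is_bad), "found after", len(tested), "tests")
```

The same loop applies to inputs, configuration flags, or time windows, as long as the candidates can be ordered and tested independently.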

5 Whys Technique

Problem: API endpoint returns 500 error
Why? Database connection failed
Why? Connection pool exhausted
Why? Connections not being released
Why? Missing finally block in error path
Why? Error handling added without proper resource cleanup
Root Cause: Incomplete error handling refactor
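
The fourth "Why" in this chain (a missing finally block in the error path) can be illustrated with a small Python sketch. Everything here is hypothetical: `ConnectionPool` is a toy pool that only counts free slots, and `run_query` simulates a failing query. It is meant to show the shape of the leak, not any particular driver's API.

```python
class ConnectionPool:
    """Hypothetical fixed-size pool that only tracks how many slots are free."""

    def __init__(self, size: int) -> None:
        self.available = size

    def acquire(self) -> object:
        if self.available == 0:
            raise RuntimeError("connection pool exhausted")
        self.available -= 1
        return object()  # placeholder for a real connection

    def release(self, conn: object) -> None:
        self.available += 1


def run_query(conn: object, fail: bool) -> str:
    """Simulated query that raises when `fail` is True."""
    if fail:
        raise ValueError("simulated query error")
    return "ok"


def handler_leaky(pool: ConnectionPool, fail: bool) -> str:
    conn = pool.acquire()
    result = run_query(conn, fail)  # an exception here skips the release below
    pool.release(conn)
    return result


def handler_fixed(pool: ConnectionPool, fail: bool) -> str:
    conn = pool.acquire()
    try:
        return run_query(conn, fail)
    finally:
        pool.release(conn)  # runs on both the success and the error path


def count_leaks(handler, size: int, failing_calls: int) -> int:
    """Return how many connections are missing after `failing_calls` errors."""
    pool = ConnectionPool(size)
    for _ in range(failing_calls):
        try:
            handler(pool, fail=True)
        except (ValueError, RuntimeError):
            pass
    return size - pool.available


if __name__ == "__main__":
    print("leaky:", count_leaks(handler_leaky, size=5, failing_calls=3))
    print("fixed:", count_leaks(handler_fixed, size=5, failing_calls=3))
```

In real code a context manager (`with`) usually expresses the same guarantee more compactly than an explicit try/finally.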

Time-Travel Debugging

•Use tools such as rr or UndoDB to record and replay execution
•Step backwards through execution
•Examine state at any point in time

Example Usage

Input: Production API returning 500 errors intermittently

Workflow Execution:
1. Reproduce: 500 errors occur under load (>100 req/sec)
2. Isolate: Only affects /api/users endpoint, started after v2.3 deploy
3. Evidence: Connection pool at max, slow query log shows 30s timeouts
4. Hypothesis: Query performance degraded with new schema
5. Test: EXPLAIN ANALYZE shows missing index after migration
6. Root Cause: Migration script failed to create user_email_idx index
7. Fix: CREATE INDEX user_email_idx; query time drops to 50ms
8. Prevent: Add index existence check to health endpoint

Output:
ROOT CAUSE: Missing database index after incomplete migration
EVIDENCE: Query plan shows seq scan, migration log shows index creation failed
FIX: Manual index creation, update migration with IF NOT EXISTS
PREVENTION: Added database index monitoring, migration dry-run validation
CONFIDENCE: 99%
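
Step 8 of this example (an index existence check in the health endpoint) might look roughly like the sketch below. It uses Python's built-in sqlite3 module purely so the example is self-contained. The scenario above does not name its database, and on PostgreSQL the equivalent lookup would go through the `pg_indexes` view instead. The `users` table, the `email` column, and the `health_check` function are assumptions for illustration; only the index name `user_email_idx` comes from the example.

```python
import sqlite3


def index_exists(conn: sqlite3.Connection, index_name: str) -> bool:
    """Return True if an index with this name exists in the SQLite database."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'index' AND name = ?",
        (index_name,),
    ).fetchone()
    return row is not None


def health_check(conn: sqlite3.Connection, required_indexes: list) -> dict:
    """Report 'degraded' when any required index is missing."""
    missing = [name for name in required_indexes if not index_exists(conn, name)]
    return {"status": "degraded" if missing else "ok", "missing_indexes": missing}


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: table and column names are assumptions.
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    print(health_check(conn, ["user_email_idx"]))  # index not created yet

    # IF NOT EXISTS makes the statement safe to re-run in a migration.
    conn.execute("CREATE INDEX IF NOT EXISTS user_email_idx ON users (email)")
    print(health_check(conn, ["user_email_idx"]))
```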

Debugging Tools by Platform

Language-Specific

•Python: pdb, ipdb, py-spy, memory_profiler
•JavaScript/Node: Chrome DevTools, node --inspect, ndb
•C/C++/Objective-C: LLDB, Instruments, AddressSanitizer, Valgrind
•Java/Kotlin: JDB, VisualVM, async-profiler
•Go: Delve, pprof, race detector

System-Level

•Linux: strace, ltrace, perf, eBPF/bpftrace
•macOS: dtrace, Instruments, sample, spindump
•Network: Wireshark, tcpdump, mtr, curl -v
•Container: docker logs, kubectl logs, container-diff

Observability

•Logging: ELK Stack, Splunk, Datadog
•Tracing: Jaeger, Zipkin, OpenTelemetry
•Metrics: Prometheus, Grafana, New Relic
•APM: Datadog APM, New Relic, Dynatrace

Log Analysis Patterns

Error Pattern Recognition

•Stack trace analysis and grouping
•Error rate anomaly detection
•Correlation of errors across services
•Timeline reconstruction
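
Grouping errors usually starts by stripping the variable parts of each message so that equivalent failures collapse into one signature. The sketch below is one simple way to do that in Python. The placeholder tokens, the three patterns, and the sample messages are assumptions, and real logs typically need patterns tuned to their own formats.

```python
import re
from collections import Counter

# Variable parts of a message, replaced in this order:
# hex addresses, UUID-like ids, then any remaining digit runs.
_VARIABLE_PARTS = [
    (re.compile(r"0x[0-9a-fA-F]+"), "<hex>"),
    (re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b"),
     "<uuid>"),
    (re.compile(r"\d+"), "<n>"),
]


def signature(message: str) -> str:
    """Normalize a message so equivalent errors share one signature."""
    for pattern, placeholder in _VARIABLE_PARTS:
        message = pattern.sub(placeholder, message)
    return message


def group_errors(messages):
    """Return (signature, count) pairs, most frequent first."""
    return Counter(signature(m) for m in messages).most_common()


if __name__ == "__main__":
    sample = [  # hypothetical log messages
        "Timeout after 30s on user 1042",
        "Timeout after 31s on user 77",
        "Null pointer at 0x7ffe12ab",
    ]
    for sig, count in group_errors(sample):
        print(count, sig)
```

Counting signatures per time window is a simple basis for the error rate anomaly detection mentioned above.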

Distributed Tracing

•Follow request ID across microservices
•Identify latency contributors
•Find error propagation paths
•Visualize service dependencies
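
To make the first three points concrete, here is a small Python sketch that follows one request ID through a list of spans, attributes self time to each service, and lists the services that reported an error. The `Span` record is a simplified, hypothetical shape rather than the data model of Jaeger, Zipkin, or OpenTelemetry, and the self-time calculation assumes that a span's child calls run one after another without overlapping.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """Hypothetical, simplified span record."""
    request_id: str
    span_id: str
    parent_id: Optional[str]
    service: str
    start_ms: int
    end_ms: int
    error: bool = False


def summarize_request(spans, request_id: str) -> dict:
    """Summarize path, per-service self time, and error path for one request."""
    own = sorted((s for s in spans if s.request_id == request_id),
                 key=lambda s: s.start_ms)

    # Total duration of each span's direct children (assumed non-overlapping).
    child_time = defaultdict(int)
    for s in own:
        if s.parent_id is not None:
            child_time[s.parent_id] += s.end_ms - s.start_ms

    # Self time = span duration minus time spent waiting on child spans.
    self_time = defaultdict(int)
    for s in own:
        self_time[s.service] += (s.end_ms - s.start_ms) - child_time[s.span_id]

    return {
        "path": [s.service for s in own],
        "self_time_ms": dict(self_time),
        "slowest": max(self_time, key=self_time.get) if own else None,
        # Ordered by start time; the last entry is the most downstream failure.
        "error_path": [s.service for s in own if s.error],
    }


if __name__ == "__main__":
    spans = [  # hypothetical trace data
        Span("r1", "a", None, "gateway", 0, 120, error=True),
        Span("r1", "b", "a", "users", 10, 110, error=True),
        Span("r1", "c", "b", "db", 20, 100, error=True),
        Span("r2", "d", None, "gateway", 5, 15),
    ]
    print(summarize_request(spans, "r1"))
```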

Related Agents

•dario-debugger - Full agent with reasoning and tool expertise
•rex-code-reviewer - Identifies bug-prone patterns
•otto-performance-optimizer - Performance-related debugging
•thor-quality-assurance-guardian - Test gap identification
•luca-security-expert - Security vulnerability investigation

ISE Engineering Fundamentals Alignment

•Build applications to be test-ready, with comprehensive logging
•Use correlation IDs for distributed tracing
•Include contextual metadata in all logs
•Log to external systems for analysis
•Hold blameless post-mortems to drive systemic improvements
•Code without tests is incomplete - add regression tests
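
The logging points in this list (correlation IDs and contextual metadata on every log line) can be sketched with Python's standard logging and contextvars modules. This is one possible arrangement, not a required one. The names `correlation_id`, `CorrelationFilter`, `JsonFormatter`, `build_logger`, and the sample ID `req-123` are all hypothetical.

```python
import contextvars
import io
import json
import logging

# Holds the ID of the request currently being handled; "-" means "none set".
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Attach the current correlation ID to every record that passes through."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True  # never drop records, only annotate them


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so external systems can parse the logs."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "correlation_id": getattr(record, "correlation_id", "-"),
            "message": record.getMessage(),
        })


def build_logger(name: str, stream) -> logging.Logger:
    """Create a logger that writes annotated JSON lines to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output through the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(CorrelationFilter())
    logger.handlers = [handler]
    return logger


if __name__ == "__main__":
    stream = io.StringIO()
    log = build_logger("orders", stream)
    token = correlation_id.set("req-123")  # normally taken from a request header
    log.info("payment failed")
    correlation_id.reset(token)
    print(stream.getvalue(), end="")
```

Because the ID lives in a context variable, it follows the request through ordinary function calls and asyncio tasks without being passed as an argument.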