Debugging Skill
Provides comprehensive debugging capabilities with integrated extended thinking for complex scenarios.
When to Use This Skill
Activate this skill when working with:
- Error troubleshooting
- Log analysis
- Performance debugging
- Distributed system debugging
- Memory and resource issues
- Complex, multi-layered bugs requiring deep reasoning
Extended Thinking for Complex Debugging
When to Enable Extended Thinking
Use extended thinking (Claude's deeper reasoning mode) for debugging when:
- Root Cause Unknown: Multiple possible causes, unclear failure patterns
- Intermittent Issues: Race conditions, timing issues, non-deterministic failures
- Multi-System Failures: Distributed system bugs spanning multiple services
- Performance Mysteries: Unexpected slowdowns without obvious bottlenecks
- Complex State Issues: Bugs involving intricate state transitions or side effects
- Security Vulnerabilities: Subtle security issues requiring careful analysis
How to Activate Extended Thinking
    # In your debugging prompt
    Claude, please use extended thinking to help debug this issue:
    [Describe the problem with symptoms, context, and what you've tried]
Extended thinking will provide:
- Systematic hypothesis generation
- Multi-path investigation strategies
- Deeper pattern recognition
- Cross-domain insights (e.g., network + application + infrastructure)
Hypothesis-Driven Debugging Framework
Use this structured approach for complex bugs:
1. Observation Phase
    What happened?
    - Error message/stack trace
    - Frequency (always/intermittent)
    - When it started
    - Environmental context
    - Recent changes
2. Hypothesis Generation
    Generate 3-5 plausible hypotheses:

    H1: [Most likely cause based on symptoms]
      Evidence for: [...]
      Evidence against: [...]
      Test: [How to validate/invalidate]

    H2: [Alternative explanation]
      Evidence for: [...]
      Evidence against: [...]
      Test: [How to validate/invalidate]

    H3: [Edge case or rare scenario]
      Evidence for: [...]
      Evidence against: [...]
      Test: [How to validate/invalidate]
3. Systematic Testing
    Priority order (high to low confidence):
    1. Test H1 → Result: [Pass/Fail/Inconclusive]
    2. Test H2 → Result: [Pass/Fail/Inconclusive]
    3. Test H3 → Result: [Pass/Fail/Inconclusive]

    New evidence discovered:
    - [Finding 1]
    - [Finding 2]

    Revised hypotheses if needed:
    - [...]
4. Root Cause Identification
    Confirmed root cause: [...]
    Contributing factors: [...]
    Why it wasn't caught earlier: [...]
5. Fix + Validation
    Fix implemented: [...]
    Tests added: [...]
    Validation: [...]
    Prevention: [...]
Structured Debugging Templates
Template 1: MECE Bug Analysis (Mutually Exclusive, Collectively Exhaustive)
    ## Bug: [Title]

    ### Problem Statement
    - **What**: [Precise description]
    - **Where**: [System/component]
    - **When**: [Conditions/triggers]
    - **Impact**: [Severity/scope]

    ### MECE Hypothesis Tree

    **Layer 1: System Boundaries**
    - [ ] Frontend issue
    - [ ] Backend API issue
    - [ ] Database issue
    - [ ] Infrastructure/network issue
    - [ ] External dependency issue

    **Layer 2: Component-Specific** (based on Layer 1 finding)
    - [ ] [Sub-component A]
    - [ ] [Sub-component B]
    - [ ] [Sub-component C]

    **Layer 3: Code-Level** (based on Layer 2 finding)
    - [ ] Logic error
    - [ ] State management
    - [ ] Resource handling
    - [ ] Configuration

    ### Investigation Log
    | Time | Action | Result | Next Step |
    |------|--------|--------|-----------|
    | [HH:MM] | [What you tested] | [Finding] | [Decision] |

    ### Root Cause
    [Final determination with evidence]

    ### Fix
    [Solution with rationale]
Template 2: 5 Whys Analysis
    ## Issue: [Brief description]

    **Symptom**: [Observable problem]

    **Why 1**: Why did this happen?
    → [Answer]

    **Why 2**: Why did [answer from Why 1] occur?
    → [Answer]

    **Why 3**: Why did [answer from Why 2] occur?
    → [Answer]

    **Why 4**: Why did [answer from Why 3] occur?
    → [Answer]

    **Why 5**: Why did [answer from Why 4] occur?
    → [Root cause]

    **Fix**: [Addresses root cause]
    **Prevention**: [Process/check to prevent recurrence]
Template 3: Timeline Reconstruction
    ## Incident Timeline: [Event]

    **Goal**: Reconstruct exact sequence leading to failure

    | Time | Event | System State | Evidence |
    |------|-------|--------------|----------|
    | T-5min | [Normal operation] | [State] | [Logs] |
    | T-2min | [Trigger event] | [State change] | [Logs/metrics] |
    | T-30s | [Cascade starts] | [Degraded] | [Alerts] |
    | T-0 | [Failure] | [Failed state] | [Error logs] |
    | T+5min | [Recovery action] | [Recovering] | [Actions taken] |

    **Critical Path**: [Sequence of events that led to failure]
    **Alternative Scenarios**: [What could have prevented it at each step]
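When the evidence for a timeline lives in several services' logs, merging them into one time-ordered list is usually the first step. The sketch below assumes each log line starts with an ISO 8601 timestamp followed by a space; `parse_line`, `build_timeline`, and the sample log lines are hypothetical and only illustrate the idea.

```python
from datetime import datetime


def parse_line(source, line):
    """Split '<ISO timestamp> <message>' into (datetime, source, message)."""
    stamp, _, message = line.partition(" ")
    return datetime.fromisoformat(stamp), source, message


def build_timeline(logs_by_source):
    """Merge per-service log lines into a single time-ordered list."""
    events = [
        parse_line(source, line)
        for source, lines in logs_by_source.items()
        for line in lines
    ]
    return sorted(events)


# Illustrative log lines only
logs = {
    "api": [
        "2024-01-15T10:00:05 request timeout",
        "2024-01-15T10:00:01 request received",
    ],
    "db": ["2024-01-15T10:00:03 connection pool exhausted"],
}

timeline = build_timeline(logs)
failure_time = timeline[-1][0]  # treat the last event as T-0

for stamp, source, message in timeline:
    offset = (stamp - failure_time).total_seconds()
    print(f"T{offset:+.0f}s  {source:<4} {message}")
```

Printing offsets relative to the failure maps directly onto the `T-5min` / `T-0` rows of the template. Clock skew between hosts can reorder events, so treat close timestamps from different machines with caution.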
Python Debugging Patterns
Hypothesis-Driven Python Debugging Example
```python
"""
Bug: API endpoint returns 500 error intermittently
Symptoms: 1 in 10 requests fail, always with same user IDs
Hypothesis: Race condition in user data caching
"""

# H1: Cache key collision between users
# Test: Add detailed logging around cache operations

import logging

logging.basicConfig(level=logging.DEBUG)


# cache, db, and User are assumed to come from the application under test
def get_user(user_id):
    cache_key = f"user:{user_id}"
    logging.debug(f"Fetching cache key: {cache_key} for user {user_id}")

    cached = cache.get(cache_key)
    if cached:
        logging.debug(f"Cache hit: {cache_key} -> {cached}")
        return cached

    user = db.query(User).filter_by(id=user_id).first()
    logging.debug(f"DB fetch for user {user_id}: {user}")

    cache.set(cache_key, user, timeout=300)
    logging.debug(f"Cache set: {cache_key} -> {user}")
    return user

# Result: Discovered cache_key had different format in different code paths
# Root cause: String formatting inconsistency (f"user:{id}" vs f"user_{id}")
```
Advanced Debugging with Context Managers
```python
import logging
import time
from contextlib import contextmanager


@contextmanager
def debug_timer(operation_name):
    """Time operations and log if slow"""
    start = time.perf_counter()
    try:
        yield
    finally:
        duration = time.perf_counter() - start
        if duration > 1.0:  # Slow operation threshold
            logging.warning(
                f"{operation_name} took {duration:.2f}s",
                extra={'operation': operation_name, 'duration': duration}
            )


# Usage
with debug_timer("database_query"):
    results = db.query(User).filter(...).all()


@contextmanager
def hypothesis_test(hypothesis_name, expected_outcome):
    """Test and validate debugging hypotheses"""
    # capture_state() and compare_states() are application-specific helpers
    # that you supply; they are not defined here.
    print(f"\n=== Testing: {hypothesis_name} ===")
    print(f"Expected: {expected_outcome}")
    start_state = capture_state()
    try:
        yield
    finally:
        end_state = capture_state()
        outcome = compare_states(start_state, end_state)
        print(f"Actual: {outcome}")
        print(f"Hypothesis {'CONFIRMED' if outcome == expected_outcome else 'REJECTED'}")


# Usage
with hypothesis_test(
    "H1: Database connection pool exhaustion",
    expected_outcome="pool_size increases during load"
):
    # Run load test
    for i in range(100):
        api_call()
```
pdb Debugger with Advanced Techniques
```python
# Basic breakpoint
import pdb; pdb.set_trace()

# Python 3.7+
breakpoint()

# Conditional breakpoint
if user_id == 12345:
    breakpoint()

# Post-mortem debugging (debug after crash)
import pdb
try:
    risky_function()
except Exception:
    pdb.post_mortem()

# Common pdb commands
# n(ext)     - Execute next line
# s(tep)     - Step into function
# c(ontinue) - Continue execution
# p expr     - Print expression
# pp expr    - Pretty print
# l(ist)     - Show source code
# w(here)    - Show stack trace
# u(p)       - Move up stack frame
# d(own)     - Move down stack frame
# b(reak)    - Set breakpoint
# cl(ear)    - Clear breakpoint
# q(uit)     - Quit debugger

# Advanced: Programmatic debugging
import pdb
pdb.run('my_function()', globals(), locals())
```
Logging
```python
import logging

logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('debug.log'),
        logging.StreamHandler()
    ]
)

logger = logging.getLogger(__name__)

logger.debug("Debug message")
logger.info("Info message")
logger.warning("Warning message")
logger.error("Error message", exc_info=True)  # exc_info=True appends the active traceback
```
Exception Handling
```python
import traceback

try:
    result = risky_operation()
except Exception as e:
    # Log full traceback
    logger.error(f"Operation failed: {e}")
    logger.error(traceback.format_exc())

    # Or get traceback as string
    tb = traceback.format_exception(type(e), e, e.__traceback__)
    error_details = ''.join(tb)
```
JavaScript/Node.js Debugging
Hypothesis-Driven JavaScript Debugging Example
```javascript
/**
 * Bug: Memory leak in websocket connections
 * Symptoms: Memory grows over time, eventually crashes
 * Hypothesis: Event listeners not cleaned up on disconnect
 */

// H1: Event listeners accumulating
// Test: Track listener counts
class WebSocketManager {
  constructor() {
    this.connections = new Map();
    this.debugListenerCounts = true;
  }

  addConnection(userId, socket) {
    console.debug(`[H1 Test] Adding connection for user ${userId}`);

    if (this.debugListenerCounts) {
      console.debug(`[H1] Listener count before: ${socket.listenerCount('message')}`);
    }

    socket.on('message', (data) => this.handleMessage(userId, data));
    socket.on('close', () => this.removeConnection(userId));

    if (this.debugListenerCounts) {
      console.debug(`[H1] Listener count after: ${socket.listenerCount('message')}`);
    }

    this.connections.set(userId, socket);
  }

  removeConnection(userId) {
    console.debug(`[H1 Test] Removing connection for user ${userId}`);

    const socket = this.connections.get(userId);
    if (socket) {
      const messageListenerCount = socket.listenerCount('message');
      console.debug(`[H1] Listeners still attached: ${messageListenerCount}`);

      // Result: Found 3+ listeners on same event!
      // Root cause: Not removing listeners on reconnect
      socket.removeAllListeners();
      this.connections.delete(userId);
    }
  }
}
```
Advanced Console Debugging
```javascript
// Basic logging
console.log('Basic log');
console.error('Error message');
console.warn('Warning');

// Object inspection with depth
console.dir(object, { depth: null, colors: true });
console.table(array);

// Performance timing
console.time('operation');
// ... code ...
console.timeEnd('operation');

// Memory usage
console.memory; // Chrome only (non-standard); in Node.js use process.memoryUsage()

// Stack trace
console.trace('Trace point');

// Grouping for organized logs
console.group('User Authentication Flow');
console.log('Step 1: Validate credentials');
console.log('Step 2: Generate token');
console.groupEnd();

// Conditional logging
const debug = (label, data) => {
  if (process.env.DEBUG) {
    console.log(`[DEBUG] ${label}:`, JSON.stringify(data, null, 2));
  }
};

// Hypothesis testing helper
function testHypothesis(name, test, expected) {
  console.group(`Testing: ${name}`);
  console.log(`Expected: ${expected}`);
  const actual = test();
  console.log(`Actual: ${actual}`);
  console.log(`Result: ${actual === expected ? 'PASS' : 'FAIL'}`);
  console.groupEnd();
  return actual === expected;
}

// Usage
// Comparing a stored timestamp to Date.now() with === would almost never match,
// so test an age threshold instead (60 s here is an assumed staleness limit).
testHypothesis(
  'H1: Cache returns stale data',
  () => Date.now() - cache.get('key').timestamp > 60000,
  true
);
```
Debugging Async/Promise Issues
```javascript
// Track promise chains
const debugPromise = (label, promise) => {
  console.log(`[${label}] Started`);
  return promise
    .then(result => {
      console.log(`[${label}] Resolved:`, result);
      return result;
    })
    .catch(error => {
      console.error(`[${label}] Rejected:`, error);
      throw error;
    });
};

// Usage
await debugPromise('DB Query', db.users.findOne({ id: 123 }));

// Helper used below: resolve after ms milliseconds
const delay = (ms) => new Promise(resolve => setTimeout(resolve, ms));

// Debugging race conditions
async function debugRaceCondition() {
  const operations = [
    { name: 'Op1', fn: async () => { await delay(100); return 'A'; } },
    { name: 'Op2', fn: async () => { await delay(50); return 'B'; } },
    { name: 'Op3', fn: async () => { await delay(150); return 'C'; } }
  ];

  const results = await Promise.allSettled(
    operations.map(async op => {
      const start = Date.now();
      const result = await op.fn();
      const duration = Date.now() - start;
      console.log(`${op.name} completed in ${duration}ms: ${result}`);
      return { op: op.name, result, duration };
    })
  );

  console.table(results.map(r => r.value));
}

// Debugging memory leaks with weak references
class DebugMemoryLeaks {
  constructor() {
    this.weakMap = new WeakMap();
    this.strongRefs = new Map();
  }

  trackObject(id, obj) {
    // Weak reference - will be GC'd if no other references
    this.weakMap.set(obj, { id, created: Date.now() });

    // Strong reference - prevents GC (potential leak source)
    this.strongRefs.set(id, obj);

    console.log(`Tracking ${id}: Strong refs=${this.strongRefs.size}`);
  }

  release(id) {
    this.strongRefs.delete(id);
    console.log(`Released ${id}: Strong refs=${this.strongRefs.size}`);
  }

  checkLeaks() {
    console.log(`Potential leaks: ${this.strongRefs.size} strong references`);
    return Array.from(this.strongRefs.keys());
  }
}
```
Node.js Inspector
```bash
# Start with inspector
node --inspect app.js
node --inspect-brk app.js  # Break on first line

# Debug with Chrome DevTools
# Open chrome://inspect
```
VS Code Debug Configuration
```json
{
  "version": "0.2.0",
  "configurations": [
    {
      "type": "node",
      "request": "launch",
      "name": "Debug Agent",
      "program": "${workspaceFolder}/src/index.js",
      "env": {
        "NODE_ENV": "development"
      }
    }
  ]
}
```
Container Debugging
Docker
```bash
# View logs
docker logs <container> --tail=100 -f

# Execute shell
docker exec -it <container> /bin/sh

# Inspect container
docker inspect <container>

# Resource usage
docker stats <container>

# Debug running container (attach a tooling container to its network namespace)
docker run -it --rm \
  --network=container:<target> \
  nicolaka/netshoot
```
Kubernetes
```bash
# Pod logs
kubectl logs <pod> -n agents -f
kubectl logs <pod> -n agents --previous  # Previous crash

# Execute in pod
kubectl exec -it <pod> -n agents -- /bin/sh

# Debug with ephemeral container
kubectl debug <pod> -n agents -it --image=busybox

# Port forward for local debugging
kubectl port-forward <pod> 8080:8080 -n agents

# Events
kubectl get events -n agents --sort-by='.lastTimestamp'

# Resource usage
kubectl top pods -n agents
```
Log Analysis
Pattern Matching
```bash
# Search logs for errors (-E enables alternation with |)
grep -iE "error|exception|failed" app.log

# Count occurrences
grep -c "ERROR" app.log

# Context around matches
grep -B 5 -A 5 "OutOfMemory" app.log

# Filter by time range (both patterns must actually appear in the log)
awk '/2024-01-15 10:00/,/2024-01-15 11:00/' app.log
```
JSON Logs
```bash
# Parse JSON logs with jq
cat app.log | jq 'select(.level == "error")'
cat app.log | jq 'select(.timestamp > "2024-01-15T10:00:00")'

# Extract specific fields
cat app.log | jq -r '[.timestamp, .level, .message] | @tsv'
```
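When `jq` is not available, or the filter needs more logic than a one-liner, the same selection can be sketched in Python. This assumes one JSON object per line with `level` and `timestamp` fields, as in the `jq` examples; `iter_json_logs`, `errors_since`, and the sample lines are hypothetical.

```python
import json


def iter_json_logs(lines):
    """Yield parsed records, skipping blank lines and lines that are not JSON objects."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # plain-text lines are common in mixed logs
        if isinstance(record, dict):
            yield record


def errors_since(lines, since):
    """Return error records whose ISO timestamp string sorts after `since`."""
    return [
        record
        for record in iter_json_logs(lines)
        if record.get("level") == "error" and record.get("timestamp", "") > since
    ]


# Illustrative log lines only
sample = [
    '{"timestamp": "2024-01-15T09:59:00", "level": "error", "message": "old failure"}',
    '{"timestamp": "2024-01-15T10:05:00", "level": "info", "message": "healthy"}',
    "not json at all",
    '{"timestamp": "2024-01-15T10:07:00", "level": "error", "message": "db timeout"}',
]

for record in errors_since(sample, "2024-01-15T10:00:00"):
    print(record["timestamp"], record["level"], record["message"], sep="\t")
```

As with the `jq` filter, the timestamp comparison is a plain string comparison, which is only reliable when every timestamp uses the same ISO 8601 format and time zone.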
Performance Debugging
Python Profiling
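```python
# cProfile
import cProfile
cProfile.run('main()', 'output.prof')

# Line profiler
# Run with: kernprof -l -v script.py (kernprof injects @profile at runtime)
@profile
def slow_function():
    pass

# Memory profiler (use in a separate script from the line profiler example,
# since both rely on a decorator named profile)
from memory_profiler import profile

@profile
def memory_heavy():
    pass
```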
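Collecting a profile is only half the job; the standard-library `pstats` module reads the results. The sketch below profiles in memory rather than writing `output.prof`, and `slow_sum` and `main` are placeholder workloads invented for the example.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Placeholder workload: sum of squares below n."""
    return sum(i * i for i in range(n))


def main():
    return [slow_sum(50_000) for _ in range(5)]


# Profile only the code of interest
profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

# Sort by cumulative time and keep the five most expensive entries
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)

report = buffer.getvalue()
print(report)
```

To inspect a file written by `cProfile.run('main()', 'output.prof')`, pass the filename instead: `pstats.Stats('output.prof')`.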
Network Debugging
```bash
# Check connectivity
ping <host>
telnet <host> <port>
nc -zv <host> <port>

# DNS resolution
nslookup <host>
dig <host>

# HTTP debugging
curl -v http://localhost:8080/health
curl -X POST -d '{"test": true}' -H "Content-Type: application/json" http://localhost:8080/api
```
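Minimal container images often lack `nc` and `telnet`. If Python is installed, a TCP reachability check roughly equivalent to `nc -zv` can be sketched with the standard library. `port_open` is a hypothetical helper, and the demo uses a throwaway local listener rather than a real service.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures
        return False


# Demo against a temporary local listener (port chosen by the OS)
with socket.socket() as server:
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    port = server.getsockname()[1]
    listening = port_open("127.0.0.1", port)

# The listener is closed now, so the same port should refuse connections
not_listening = port_open("127.0.0.1", port)

print(f"while listening: {listening}, after close: {not_listening}")
```

A successful connect only shows that something accepts TCP on that port; it says nothing about whether the application behind it is healthy.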
Common Debug Checklist
- Check Logs: Application, system, container logs
- Verify Configuration: Environment variables, config files
- Test Connectivity: Network, database, external services
- Check Resources: CPU, memory, disk space
- Review Recent Changes: Git log, deployment history
- Reproduce Locally: Same environment, same data
- Binary Search: Isolate the problem scope
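The binary search item in the checklist can be made concrete: given an ordered list of candidates (commits, config versions, input sizes) where everything after the first bad one is also bad, halving the range finds the culprit in about log2(n) checks. The sketch below is illustrative; `first_bad`, the commit names, and `broken_from` are assumptions, and `git bisect` applies the same idea to a real repository.

```python
def first_bad(candidates, is_bad):
    """Return the index of the first candidate for which is_bad() is True.

    Assumes that once a candidate is bad, every later candidate is bad too.
    Returns len(candidates) if none is bad.
    """
    lo, hi = 0, len(candidates)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(candidates[mid]):
            hi = mid          # first bad candidate is at mid or earlier
        else:
            lo = mid + 1      # first bad candidate is after mid
    return lo


# Illustrative setup: eight commits, bug introduced in commit-6
commits = [f"commit-{i}" for i in range(1, 9)]
broken_from = 6
checked = []


def is_bad(commit):
    """Stand-in for 'check out this commit and run the failing test'."""
    checked.append(commit)
    return int(commit.split("-")[1]) >= broken_from


index = first_bad(commits, is_bad)
print(f"First bad: {commits[index]} (checked {len(checked)} of {len(commits)})")
```

The approach breaks down for intermittent failures, where a single passing run does not prove a candidate is good; in that case repeat the check several times per candidate.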
Debugging Decision Tree
Use this decision tree to determine the right debugging approach:
    START: What kind of bug?
    │
    ├─ Known error message/stack trace
    │  └─ Use: Direct log analysis + Stack trace walkthrough
    │
    ├─ Intermittent/Race condition
    │  └─ Use: Extended thinking + Timeline reconstruction + Hypothesis-driven
    │
    ├─ Performance degradation
    │  └─ Use: Profiling + Hypothesis-driven + MECE analysis
    │
    ├─ Distributed system failure
    │  └─ Use: Extended thinking + Timeline reconstruction + Multi-system tracing
    │
    ├─ Complex state bug
    │  └─ Use: Extended thinking + Hypothesis-driven + pdb/debugger
    │
    ├─ Memory leak
    │  └─ Use: Memory profiling + Hypothesis-driven + Weak reference analysis
    │
    └─ Unknown root cause
       └─ Use: Extended thinking + MECE analysis + 5 Whys
Best Practices for Complex Debugging
1. Document Your Investigation
Always maintain a debugging log:
    ## Bug Investigation: [Title]

    **Start Time**: 2024-01-15 10:00
    **Investigator**: [Name]

    ### Timeline
    - 10:00 - Started investigation, checked logs
    - 10:15 - Found error pattern in auth service
    - 10:30 - Hypothesis: Cache expiration race condition
    - 10:45 - Added debug logging, confirmed hypothesis
    - 11:00 - Implemented fix, testing

    ### Hypotheses Tested
    - [x] H1: Cache race condition (CONFIRMED)
    - [ ] H2: Database connection pool (REJECTED)
    - [ ] H3: Network timeout (NOT TESTED)

    ### Root Cause
    [Final determination]

    ### Fix Applied
    [Solution details]

    ### Prevention
    [How to prevent recurrence]
2. Use the Scientific Method
- Observe: Gather symptoms, error messages, logs
- Hypothesize: Generate 3-5 plausible explanations
- Predict: What would you see if the hypothesis is true?
- Test: Design experiments to validate/invalidate
- Analyze: Compare predictions vs actual results
- Conclude: Confirm root cause with evidence
3. Leverage Extended Thinking
When to activate extended thinking:
- Complexity threshold: More than 3 interacting systems
- Uncertainty high: Multiple equally plausible causes
- Stakes high: Production outage, security issue, data loss
- Pattern unclear: No obvious error messages or logs
- Time-sensitive: Need systematic approach under pressure
4. Avoid Common Pitfalls
AVOID:
- ❌ Changing multiple things at once (can't isolate cause)
- ❌ Assuming first hypothesis is correct (confirmation bias)
- ❌ Debugging without logs/evidence (guessing)
- ❌ Not documenting what you tried (repeating failed attempts)
- ❌ Skipping reproduction step (fix might not work)

DO:
- ✅ Change one variable at a time
- ✅ Test multiple hypotheses systematically
- ✅ Add instrumentation before debugging
- ✅ Keep investigation log
- ✅ Write regression test after fix
5. Debugging Instrumentation Patterns
    # Python: Comprehensive debugging decorator
    import functools
    import time
    import logging

    logger = logging.getLogger(__name__)

    def debug_trace(func):
        """Decorator to trace function execution with timing and state"""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            func_name = func.__qualname__
            logger.debug(f"→ Entering {func_name}")
            logger.debug(f"  Args: {args}")
            logger.debug(f"  Kwargs: {kwargs}")
            start = time.perf_counter()
            try:
                result = func(*args, **kwargs)
                duration = time.perf_counter() - start
                logger.debug(f"← Exiting {func_name} ({duration:.3f}s)")
                logger.debug(f"  Result: {result}")
                return result
            except Exception as e:
                duration = time.perf_counter() - start
                logger.error(f"✗ Exception in {func_name} ({duration:.3f}s): {e}")
                raise
        return wrapper

    # Usage
    @debug_trace
    def complex_operation(user_id, data):
        # Your code here
        pass
    // JavaScript: Comprehensive debugging wrapper
    // Note: the @decorator syntax with a (target, propertyKey, descriptor) signature
    // needs a transpiler, e.g. TypeScript with experimentalDecorators or Babel's
    // legacy decorators plugin.
    function debugTrace(label) {
      return function(target, propertyKey, descriptor) {
        const originalMethod = descriptor.value;
        descriptor.value = async function(...args) {
          console.log(`→ Entering ${label || propertyKey}`);
          console.log(`  Args:`, args);
          const start = performance.now();
          try {
            const result = await originalMethod.apply(this, args);
            const duration = performance.now() - start;
            console.log(`← Exiting ${label || propertyKey} (${duration.toFixed(2)}ms)`);
            console.log(`  Result:`, result);
            return result;
          } catch (error) {
            const duration = performance.now() - start;
            console.error(`✗ Exception in ${label || propertyKey} (${duration.toFixed(2)}ms):`, error);
            throw error;
          }
        };
        return descriptor;
      };
    }

    // Usage
    class UserService {
      @debugTrace('UserService.getUser')
      async getUser(userId) {
        // Your code here
      }
    }
Cross-References and Related Skills
Related Skills
This debugging skill integrates with:
- extended-thinking (.claude/skills/extended-thinking/SKILL.md)
  - Use for: Complex bugs with unknown root causes
  - Activation: Add "use extended thinking" to your debugging prompt
  - Benefit: Deeper pattern recognition, systematic hypothesis generation
- complex-reasoning (.claude/skills/complex-reasoning/SKILL.md)
  - Use for: Multi-step debugging requiring logical chains
  - Patterns: Chain-of-thought, tree-of-thought for bug investigation
  - Benefit: Structured reasoning through complex bug scenarios
- deep-analysis (.claude/skills/deep-analysis/SKILL.md)
  - Use for: Post-mortem analysis, root cause investigation
  - Patterns: Comprehensive code review, architectural analysis
  - Benefit: Identifies systemic issues beyond surface bugs
- testing (.claude/skills/testing/SKILL.md)
  - Use for: Writing regression tests after bug fix
  - Integration: Bug → Debug → Fix → Test → Validate
  - Benefit: Ensures bug doesn't recur
- kubernetes (.claude/skills/kubernetes/SKILL.md)
  - Use for: Distributed system debugging in K8s
  - Tools: kubectl logs, exec, debug, events
  - Integration: Container debugging patterns
When to Combine Skills
| Scenario | Skills to Combine | Reasoning |
|---|---|---|
| Production outage | debugging + extended-thinking + kubernetes | Complex distributed system requires deep reasoning |
| Intermittent test failure | debugging + testing + complex-reasoning | Need systematic hypothesis testing |
| Performance regression | debugging + deep-analysis | Root cause may be architectural |
| Security vulnerability | debugging + extended-thinking + deep-analysis | Requires careful, thorough analysis |
| Memory leak | debugging + complex-reasoning | Multi-step investigation needed |
Integration Examples
Example 1: Complex Production Bug
    # Prompt combining skills
    Claude, I have a complex production bug affecting multiple services.
    Please use extended thinking and the debugging skill to help investigate.

    Symptoms:
    - API requests timeout intermittently (1 in 50 requests)
    - Only affects authenticated users
    - Started after recent deployment
    - No obvious errors in logs

    Please use:
    1. MECE analysis to categorize possible causes
    2. Hypothesis-driven debugging framework
    3. Timeline reconstruction of recent changes
Example 2: Memory Leak Investigation
    # Prompt combining skills
    Claude, use complex reasoning and debugging skills to investigate a memory leak.

    Context:
    - Node.js service memory grows from 200MB to 2GB over 6 hours
    - No errors logged
    - Happens only in production, not staging

    Apply:
    1. Hypothesis-driven framework (generate 5 hypotheses)
    2. Memory leak detection patterns (weak references)
    3. Extended thinking for pattern recognition across codebase
Quick Reference Card
Debugging Workflow Summary
1. OBSERVE
   - Collect error messages, logs, metrics
   - Identify patterns (frequency, conditions, scope)
   - Document symptoms
2. HYPOTHESIZE (use extended thinking if complex)
   - Generate 3-5 plausible hypotheses
   - Rank by likelihood
   - Design tests for each
3. TEST
   - Change one variable at a time
   - Add instrumentation (logging, tracing)
   - Collect evidence
4. ANALYZE
   - Compare predictions vs results
   - Eliminate invalidated hypotheses
   - Refine remaining hypotheses
5. FIX
   - Implement solution
   - Add regression test
   - Document root cause
6. VALIDATE
   - Verify fix in affected environment
   - Monitor metrics
   - Update documentation
Tool Selection Guide
| Problem Type | Primary Tool | Secondary Tools |
|---|---|---|
| Logic error | pdb/debugger | Logging, unit tests |
| Performance | Profiler | Hypothesis testing, metrics |
| Memory leak | Memory profiler | Weak references, heap dumps |
| Async/timing | Timeline reconstruction | Extended thinking, logging |
| Distributed | Tracing (logs) | Kubernetes tools, MECE analysis |
| Unknown cause | Extended thinking | MECE, 5 Whys, hypothesis-driven |
Skill version: 2.0 (Enhanced with extended thinking integration)
Last updated: 2024-01-15
Maintained by: Golden Armada AI Agent Fleet