Debugging Best Practices

Systematic approaches to debugging from the Team Argonauts' FAANG experience.

The Debugging Mindset

"A debugger is not a debugging strategy. Understanding the system is."

Core Principles

•Reproduce First - Can't fix what you can't see
•Hypothesize, Don't Guess - Form theories based on evidence
•Change One Thing - Multiple changes obscure the fix
•Verify the Fix - Ensure you fixed the root cause, not a symptom
•Document - Future you will forget why this broke

Systematic Debugging Process

1. Gather Evidence

What to collect:

•Exact error messages (full stack traces)
•Reproduction steps
•When it started happening
•Environment details (OS, browser, versions)
•Logs around the time of failure
•User reports (what they were trying to do)

Questions to ask:

•Did this ever work?
•What changed recently? (code, config, dependencies, traffic)
•Can you reproduce it?
•Does it happen every time or intermittently?

2. Form Hypotheses

Common culprits by symptom:

Symptom	Likely Causes
Works locally, fails in prod	Environment differences, secrets, scale, async timing
Intermittent failures	Race conditions, timeouts, external dependencies
Slow performance	N+1 queries, missing indexes, network latency, memory
Memory leaks	Unclosed connections, event listener leaks, circular refs
Crashes	Null pointer, out of bounds, stack overflow, OOM

3. Test Hypotheses

Isolation techniques:

•Binary search (disable half the code, which half fails?)
•Minimal reproduction (simplest code that triggers bug)
•Change one variable at a time
•Add logging/tracing at boundaries
•Use debugger breakpoints strategically

Don't just:

•Add random console.logs everywhere
•Try things until something works
•Guess at the problem

4. Verify and Document

Before closing:

• Root cause identified (not just symptoms)
• Fix verified in environment where it failed
• Regression test added
• Monitoring/alerting added if needed
• Postmortem written for production issues

Debugging by Domain

Backend Debugging

Timeouts:

code

Symptoms: Requests hanging, 504 errors
Check: Connection pool exhaustion, slow queries, downstream timeouts
Tools: Query explain plans, connection metrics, distributed traces

Data corruption:

code

Symptoms: Wrong data in DB, inconsistent state
Check: Transaction boundaries, race conditions, concurrent updates
Tools: DB logs, transaction isolation levels, locks

Memory leaks:

code

Symptoms: Gradual performance degradation, OOM crashes
Check: Connection leaks, cache unbounded growth, event listeners
Tools: Heap dumps, memory profilers, connection pool metrics

Frontend Debugging

Render issues:

code

Symptoms: UI not updating, wrong data displayed
Check: State mutation, stale closures, missing dependencies
Tools: React DevTools, Redux DevTools, component profiler

Performance:

code

Symptoms: Janky UI, slow interactions, high CPU
Check: Unnecessary re-renders, large lists, layout thrashing
Tools: Performance tab, React Profiler, Lighthouse

Memory leaks:

code

Symptoms: Page slower over time, tabs crash
Check: Event listeners not removed, subscriptions not cleaned up
Tools: Memory profiler, heap snapshots, retention analysis

Security Debugging

Auth failures:

code

Symptoms: Logged out unexpectedly, can't access resources
Check: Token expiry, CORS, cookie settings, session storage
Tools: Browser DevTools, network tab, storage inspector

Data leaks:

code

Symptoms: Users seeing wrong data
Check: Missing authz checks, cache key collisions, IDOR
Tools: Request inspection, auth middleware logs

Debugging Production Issues

Incident Response Flow

•Mitigate - Stop the bleeding (rollback, feature flag, scale up)
•Triage - How many users? What severity?
•Diagnose - Logs, metrics, traces
•Fix - Deploy fix or workaround
•Verify - Confirm resolution
•Postmortem - Learn and prevent recurrence

Essential Observability

Logs:

•Structured (JSON) with correlation IDs
•Appropriate log levels
•No PII or secrets
•Searchable and aggregated

Metrics:

•RED: Rate, Errors, Duration
•Resource: CPU, memory, connections
•Business: User actions, conversions

Traces:

•Distributed tracing for multi-service
•Critical path instrumentation
•Span context propagation

Common Anti-Patterns

❌ Print Debugging Gone Wrong

javascript

// Don't
console.log("here 1")
console.log("here 2")
console.log("data", data) // data is 400KB object

✅ Strategic Logging

javascript

// Do
logger.info("Processing order", {
  orderId: order.id,
  userId: user.id,
  stage: "payment"
})

❌ Changing Multiple Things

javascript

// Don't: Changed timeout AND retry logic AND added cache
// Which one fixed it? Who knows!

✅ Isolate Variables

javascript

// Do: Test timeout change alone first
// Then test retry logic alone
// Then test cache alone

❌ Assuming Without Evidence

code

"It's probably the database"
"Must be a network issue"
"The user did something wrong"

✅ Verify Assumptions

code

Check DB query times → Normal
Check network latency → Normal
Check user input → Found invalid data!

Tools by Problem Type

Performance Issues

•Backend: APM tools (DataDog, New Relic), query profilers
•Frontend: Lighthouse, Chrome DevTools Performance tab
•Database: EXPLAIN ANALYZE, slow query log

Memory Issues

•Backend: Heap dumps, memory profilers
•Frontend: Chrome Memory profiler, heap snapshots

Network Issues

•Browser: Network tab, HAR files
•Server: tcpdump, Wireshark, request logs

Concurrency Issues

•Backend: Thread dumps, distributed tracing
•Frontend: React concurrent mode profiler

The "Can't Reproduce" Problem

Strategies:

•Environment Parity - Make local match production
•Synthetic Data - Use production data shapes (not values)
•Traffic Replay - Replay production requests locally
•Feature Flags - Enable debugging in production safely
•Observability - If can't repro, observe it happening

Data-Dependent Bugs:

•Get sanitized production data
•Check edge cases (null, empty, max values)
•Look for specific user IDs in logs

Timing-Dependent Bugs:

•Add artificial delays
•Increase load/concurrency
•Use race condition detectors

Postmortem Template

markdown

# Incident Postmortem: [Title]

**Date:** [Date]
**Duration:** [Start - End]
**Impact:** [Users affected, revenue lost]
**Severity:** [P0-P4]

## Summary
[What happened in 2-3 sentences]

## Timeline
- HH:MM - Issue detected
- HH:MM - Team alerted
- HH:MM - Root cause identified
- HH:MM - Fix deployed
- HH:MM - Verified resolved

## Root Cause
[Technical explanation of what broke and why]

## Resolution
[What fixed it]

## Action Items
- [ ] Fix 1 (Owner, Due date)
- [ ] Fix 2 (Owner, Due date)

## What Went Well
[Positives to reinforce]

## What Could Be Better
[Improvements without blame]

When to Escalate

Escalate when:

•User impact is severe (data loss, security, outage)
•Issue has been open >2 hours without progress
•Need domain expertise you don't have
•Workaround exists but root cause unclear

References

For domain-specific debugging techniques, see references/ directory.