Debugging Production Incidents (Java) — SRE-first + JVM-safe tooling
Intent
During incidents, speed matters, but unstructured debugging compounds the damage. This skill provides:
- An SRE-style incident workflow (roles, comms, timeline)
- A logs/metrics/traces-first diagnosis approach
- Safe JVM diagnostics: thread dumps, JFR snippets, jcmd snapshots
- A rollback / mitigate decision tree
- A blameless postmortem template and “next guardrails” checklist
Scope
In scope
- Incident triage and mitigation loop
- Observability-first debugging
- JVM diagnostics:
  - thread dumps
  - JFR capture
  - GC/heap snapshots
- Hypothesis-driven investigation
- Rollback / feature-flag mitigation strategy
- Postmortem and action items (prevention)
Out of scope
- Full infra incident response for Kubernetes/network (separate ops skill)
- Pen-test level forensic analysis (separate security response playbook)
When to use
Triggers:
- production outage
- SLO burn / severe latency spike
- error rate spike
- suspected memory leak
- runaway queue lag
- deadlocks / thread pool starvation
- “works in staging, fails in prod” mystery
Required inputs (context to attach in Cursor)
- Links or snapshots (not raw secrets):
  - dashboard panels (latency, CPU, GC, error rate)
  - recent deploys/config changes
  - key logs around start time
  - traces for representative failing requests
- Service metadata:
  - version / commit
  - runtime (container/VM), JDK version
  - traffic shape changes (if any)
Roles and workflow (SRE-style)
Step 1 — Declare incident and assign roles
At minimum:
- Incident Commander (IC): owns decisions and comms
- Tech Lead: drives technical investigation
- Comms: updates stakeholders/users
- Scribe: writes timeline and captures actions
Deliverable: incident channel + timeline doc started.
Step 2 — Stabilize first (stop the bleeding)
Prioritize mitigations:
- rollback to last known good (see the sketch after this step)
- disable the feature via flag
- reduce load (rate limit, shed traffic)
- scale out only if it is safe and actually helps (it often does not)
- isolate the failing dependency (circuit breaker)
Rule: prefer reversible mitigations over risky live fixes.
Deliverable: mitigation chosen + tracked in timeline.
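A minimal, reversible-mitigation sketch, assuming the service happens to run as a Kubernetes deployment named `checkout` (both the platform and the name are illustrative; use whatever deploy or feature-flag tooling you actually have):

```bash
# Revert to the previous known-good revision and wait for the rollback to converge.
kubectl rollout undo deployment/checkout
kubectl rollout status deployment/checkout

# Record the action and timestamp in the incident timeline.
echo "$(date -u +%FT%TZ) rolled back checkout to previous revision" >> incident-timeline.md
```

Flag-off mitigations follow the same pattern: one reversible action at a time, each logged with a timestamp.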
Observability-first diagnosis
Step 3 — Define the impact and the symptom precisely
- Who is impacted? Which endpoints/tenants/regions?
- What changed? (deploy/config/traffic/dependency)
- Which SLO is burning? (latency, availability)
Deliverable: a one-paragraph symptom statement.
Step 4 — Use the “3 signals” triage order
1. Metrics (what is broken and when)
2. Logs (why requests fail, errors, timeouts)
3. Traces (where time is spent across dependencies)
Common patterns:
- latency increases + CPU flat: likely I/O waits, downstream slowness, locks
- CPU spikes: hot loop, serialization, logging overhead, contention
- GC spikes: allocation storms, memory leak, heap too small
- errors spike: upstream/downstream change, auth expiry, config drift
Deliverable: top 2 hypotheses with supporting signals.
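A quick log-triage sketch for the Logs signal, assuming structured JSON logs with `status`, `path`, `level`, and `message` fields (the field names and the `app.log` path are assumptions; adapt to your logging pipeline):

```bash
# Which endpoints are failing? Count 5xx responses per path.
jq -r 'select(.status >= 500) | .path' app.log | sort | uniq -c | sort -rn | head

# What are the dominant error messages? This often points straight at the failing dependency or timeout.
jq -r 'select(.level == "ERROR") | .message' app.log | sort | uniq -c | sort -rn | head
```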
JVM diagnostics (safe playbook)
Use these only if:
- observability is insufficient, OR
- you need thread/heap evidence, OR
- the service is “alive but stuck”.
Step 5 — Thread dump (fast, low risk)
Use a safe method (depends on permissions):
- `jcmd <pid> Thread.print` is often preferred over legacy tools.
What to look for:
- deadlocks
- thread pool starvation
- many threads blocked on the same lock
- many threads waiting for DB connections
- runaway retries/backoff loops
Deliverable: thread dump snippet + interpretation.
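A minimal thread-dump sketch using `jcmd` (the `my-service` process filter and the `/tmp` paths are placeholders; adjust to your host layout and permissions):

```bash
# Find the JVM pid (assumes the process command line contains "my-service").
PID=$(pgrep -f my-service)

# Take three dumps ~10 seconds apart: genuinely stuck threads keep the same stacks across dumps.
for i in 1 2 3; do
  jcmd "$PID" Thread.print > "/tmp/threads_${i}_$(date +%s).txt"
  sleep 10
done

# Quick contention scan: how many threads are BLOCKED in each dump?
grep -c 'java.lang.Thread.State: BLOCKED' /tmp/threads_*.txt
```

Comparing consecutive dumps is what distinguishes a stuck pool from one that is merely busy.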
Step 6 — JFR snippet (bounded capture)
Capture 30–120s around peak symptoms:
- CPU + allocation + locks + thread states
This often answers “what is actually happening” quickly.
Deliverable: JFR file + short summary.
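A bounded JFR capture sketch via `jcmd` (the recording name, duration, and file paths are placeholders):

```bash
# PID: the JVM process id (same placeholder as in Step 5).
PID=$(pgrep -f my-service)

# Start a 60-second recording with the higher-detail "profile" settings (still low overhead).
jcmd "$PID" JFR.start name=incident duration=60s settings=profile filename=/tmp/incident.jfr

# Check progress; the recording stops and flushes to disk on its own when the duration elapses.
jcmd "$PID" JFR.check

# First pass with the `jfr` CLI shipped in recent JDKs; otherwise open the file in JDK Mission Control.
jfr summary /tmp/incident.jfr
jfr print --events jdk.JavaMonitorEnter,jdk.ThreadPark /tmp/incident.jfr | head -n 50
```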
Step 7 — Heap/GC snapshots (only if needed)
If memory/GC suspicion:
- capture GC log window
- capture class histogram snapshot via jcmd
- only capture heap dump if you have storage/privacy plan
Deliverable: evidence bundle for memory hypothesis.
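A sketch of the memory evidence bundle via `jcmd` (paths are placeholders; the heap dump command is last on purpose and assumes the storage/privacy plan above is agreed):

```bash
# PID: the JVM process id (same placeholder as in Step 5).
PID=$(pgrep -f my-service)

# Heap usage and GC summary: cheap and instant.
jcmd "$PID" GC.heap_info

# Class histogram: which classes dominate the live heap right now.
jcmd "$PID" GC.class_histogram > "/tmp/histo_$(date +%s).txt"

# If GC logging was not enabled at startup, JDK 9+ can usually switch it on at runtime.
jcmd "$PID" VM.log output=/tmp/gc.log what=gc

# Full heap dump: pauses the JVM and may contain user data; run only with explicit approval.
jcmd "$PID" GC.heap_dump /tmp/heap-$(date +%s).hprof
```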
Hypothesis-driven loop (fast iterations)
Step 8 — Rank hypotheses and test the cheapest first
For each hypothesis:
- Expected observation if true
- Cheap test (canary, toggle, single node restart, config revert)
- Risk assessment
Avoid:
- making 10 changes at once
- “SSH and tweak random flags”
- “fixing” without evidence
Deliverable: hypothesis table (in timeline).
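An illustrative shape for the hypothesis table; the two rows below are made-up examples, not findings:

| Hypothesis | Expected observation if true | Cheap test | Risk |
| --- | --- | --- | --- |
| DB connection pool exhausted since last deploy | Thread dump shows many threads waiting on the pool; pool-usage metric pegged at max | Check pool gauge; restart one node and watch it refill | Low |
| Downstream dependency degraded | Traces show time concentrated in calls to that dependency; its own latency is elevated | Compare its dashboards; open the circuit breaker on one canary node | Low |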
Rollback / mitigation decision tree
Step 9 — Decide: mitigate now vs fix forward
Prefer rollback/flag-off if:
- change is recent and correlated
- fix is uncertain
- impact is high
Fix-forward only if:
- rollback is impossible or too risky
- you have a high-confidence minimal patch
- you can canary safely
Deliverable: decision + rationale + next checkpoint time.
After recovery: verification and monitoring
Step 10 — Verify recovery
- confirm error rate normal
- confirm latency p95/p99 stable
- confirm downstream health
- confirm no hidden queue lag or retry storms
Deliverable: “recovery confirmation” entry in timeline.
Postmortem (blameless) + next guardrails
Step 11 — Write a blameless postmortem
Use a standard structure:
- Summary + customer impact
- Timeline (UTC + local time if needed)
- Root cause and contributing factors
- Detection and response analysis
- What went well / what went poorly
- Action items with owners and deadlines
Step 12 — “Next guardrails” checklist (make incidents less likely)
Examples:
- add missing timeouts and retry limits
- add bulkheads / rate limits
- add better alerts (SLO-based)
- add regression tests
- add runbooks for the failure mode
- add feature flags for risky paths
- enforce safer deploy practices (canary, bake time)
Deliverable: postmortem doc + action item tracker.
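One guardrail worth making concrete, because it speeds up Steps 5-7 next time: bake bounded, always-on diagnostics into the start command. A sketch assuming a plain `java -jar` launch; paths, sizes, and the jar name are placeholders:

```bash
# Rotated GC log, continuous bounded flight recording, and a heap dump on OutOfMemoryError.
java \
  -Xlog:gc*:file=/var/log/app/gc.log:time,uptime:filecount=5,filesize=20m \
  -XX:StartFlightRecording=disk=true,maxsize=256m,maxage=12h,filename=/var/log/app/continuous.jfr \
  -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/app/ \
  -jar service.jar
```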
Outputs / Artifacts
- Incident timeline doc (scribe notes)
- Mitigation decision log
- Evidence bundle (dashboards/logs/traces + optional JVM artifacts)
- Postmortem document (blameless)
- “Next guardrails” action list
Definition of Done (DoD)
- Service recovered and verified
- Timeline is complete enough to reconstruct actions
- Root cause identified (or bounded) with evidence
- Postmortem written and shared
- Action items created with owners and deadlines
- Runbooks and alerts updated
Common failure modes & fixes
- Symptom: incident drags on with random changes
  - Cause: no hypotheses, no IC role, no timeline
  - Fix: establish IC + scribe, hypothesis loop, safe mitigations
- Symptom: recovery, but the problem recurs
  - Cause: no guardrails added; missing timeouts/backpressure
  - Fix: convert root cause into specific engineering controls
- Symptom: debugging actions cause more outage
  - Cause: high-risk production changes and poor rollback
  - Fix: prefer reversible mitigations and canary; keep changes minimal
Guardrails (What NOT to do)
- Do NOT paste secrets/tokens in incident channels.
- Do NOT take heap dumps without privacy review and storage plan.
- Do NOT run destructive commands on production hosts without explicit approval.
- Do NOT “restart everything” without understanding cascading effects.
References (primary)
- Google SRE — Incident Response / Incident Management (online book): https://sre.google/sre-book/managing-incidents/
- Google Cloud — Postmortems / learning culture: https://cloud.google.com/blog/products/management-tools/incident-management-for-real-life
- Oracle Java Diagnostic Tools (jcmd/JFR guidance): https://docs.oracle.com/en/java/javase/21/troubleshoot/diagnostic-tools.html