Postmortem Writing
Comprehensive guide to writing effective, blameless postmortems that drive organizational learning and prevent incident recurrence.
Do not use this skill when
- •The task is unrelated to postmortem writing
- •You need a different domain or tool outside this scope
Instructions
- •Clarify goals, constraints, and required inputs.
- •Apply relevant best practices and validate outcomes.
- •Provide actionable steps and verification.
- •If detailed examples are required, open
resources/implementation-playbook.md.
Use this skill when
- •Conducting post-incident reviews
- •Writing postmortem documents
- •Facilitating blameless postmortem meetings
- •Identifying root causes and contributing factors
- •Creating actionable follow-up items
- •Building organizational learning culture
Core Concepts
1. Blameless Culture
| Blame-Focused | Blameless |
|---|---|
| "Who caused this?" | "What conditions allowed this?" |
| "Someone made a mistake" | "The system allowed this mistake" |
| Punish individuals | Improve systems |
| Hide information | Share learnings |
| Fear of speaking up | Psychological safety |
2. Postmortem Triggers
- •SEV1 or SEV2 incidents
- •Customer-facing outages > 15 minutes
- •Data loss or security incidents
- •Near-misses that could have been severe
- •Novel failure modes
- •Incidents requiring unusual intervention
Quick Start
Postmortem Timeline
code
Day 0: Incident occurs Day 1-2: Draft postmortem document Day 3-5: Postmortem meeting Day 5-7: Finalize document, create tickets Week 2+: Action item completion Quarterly: Review patterns across incidents
Templates
Template 1: Standard Postmortem
markdown
# Postmortem: [Incident Title] **Date**: 2024-01-15 **Authors**: @alice, @bob **Status**: Draft | In Review | Final **Incident Severity**: SEV2 **Incident Duration**: 47 minutes ## Executive Summary On January 15, 2024, the payment processing service experienced a 47-minute outage affecting approximately 12,000 customers. The root cause was a database connection pool exhaustion triggered by a configuration change in deployment v2.3.4. The incident was resolved by rolling back to v2.3.3 and increasing connection pool limits. **Impact**: - 12,000 customers unable to complete purchases - Estimated revenue loss: $45,000 - 847 support tickets created - No data loss or security implications ## Timeline (All times UTC) | Time | Event | |------|-------| | 14:23 | Deployment v2.3.4 completed to production | | 14:31 | First alert: `payment_error_rate > 5%` | | 14:33 | On-call engineer @alice acknowledges alert | | 14:35 | Initial investigation begins, error rate at 23% | | 14:41 | Incident declared SEV2, @bob joins | | 14:45 | Database connection exhaustion identified | | 14:52 | Decision to rollback deployment | | 14:58 | Rollback to v2.3.3 initiated | | 15:10 | Rollback complete, error rate dropping | | 15:18 | Service fully recovered, incident resolved | ## Root Cause Analysis ### What Happened The v2.3.4 deployment included a change to the database query pattern that inadvertently removed connection pooling for a frequently-called endpoint. Each request opened a new database connection instead of reusing pooled connections. ### Why It Happened 1. **Proximate Cause**: Code change in `PaymentRepository.java` replaced pooled `DataSource` with direct `DriverManager.getConnection()` calls. 2. **Contributing Factors**: - Code review did not catch the connection handling change - No integration tests specifically for connection pool behavior - Staging environment has lower traffic, masking the issue - Database connection metrics alert threshold was too high (90%) 3. **5 Whys Analysis**: - Why did the service fail? → Database connections exhausted - Why were connections exhausted? → Each request opened new connection - Why did each request open new connection? → Code bypassed connection pool - Why did code bypass connection pool? → Developer unfamiliar with codebase patterns - Why was developer unfamiliar? → No documentation on connection management patterns ### System Diagram
[Client] → [Load Balancer] → [Payment Service] → [Database] ↓ Connection Pool (broken) ↓ Direct connections (cause)
code
## Detection ### What Worked - Error rate alert fired within 8 minutes of deployment - Grafana dashboard clearly showed connection spike - On-call response was swift (2 minute acknowledgment) ### What Didn't Work - Database connection metric alert threshold too high - No deployment-correlated alerting - Canary deployment would have caught this earlier ### Detection Gap The deployment completed at 14:23, but the first alert didn't fire until 14:31 (8 minutes). A deployment-aware alert could have detected the issue faster. ## Response ### What Worked - On-call engineer quick