Incident Management

Severity Framework

Severity	Impact	Response Time	Example
SEV1	Complete outage, data loss	15 min	Production down
SEV2	Major degradation	30 min	Critical feature broken
SEV3	Minor impact	2 hours	Non-critical bug
SEV4	Minimal impact	Next business day	Cosmetic issue
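
The response times in the table can be turned into concrete deadlines. The following is a minimal sketch, not a prescribed implementation: the names `RESPONSE_TARGETS` and `response_deadline` are hypothetical, and SEV4 is mapped to `None` because "next business day" depends on a business calendar that is not defined here.

```python
from datetime import datetime, timedelta, timezone

# Response-time targets copied from the severity table.
# SEV4 ("next business day") has no fixed duration, so it maps to None.
RESPONSE_TARGETS = {
    "SEV1": timedelta(minutes=15),
    "SEV2": timedelta(minutes=30),
    "SEV3": timedelta(hours=2),
    "SEV4": None,
}


def response_deadline(severity, detected_at):
    """Return the latest acceptable first-response time, or None for SEV4."""
    target = RESPONSE_TARGETS[severity]
    return None if target is None else detected_at + target


detected = datetime(2024, 1, 15, 9, 0, tzinfo=timezone.utc)
print(response_deadline("SEV1", detected))  # 2024-01-15 09:15:00+00:00
```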

Runbook Structure

Every runbook needs: Overview, Detection, Triage, Mitigation, Root Cause, Resolution, Verification, Escalation.

Triage Decision Table

Symptom	Likely Cause	Action
All requests failing	Service down	Rollback
High latency	Database/dependency	Check connections
Partial failures	Code bug	Feature flag disable
Spike in errors	Traffic surge	Scale up
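
The triage table is effectively a lookup from symptom to first action. A minimal sketch of that lookup follows; `TRIAGE_TABLE` and `first_action` are hypothetical names, and the fallback for an unlisted symptom (consider escalating) is an assumption drawn from the escalation guidance, not a rule stated in the table.

```python
# Symptom -> (likely cause, first action), copied from the triage table.
TRIAGE_TABLE = {
    "all requests failing": ("Service down", "Rollback"),
    "high latency": ("Database/dependency", "Check connections"),
    "partial failures": ("Code bug", "Feature flag disable"),
    "spike in errors": ("Traffic surge", "Scale up"),
}

# Assumed fallback for symptoms the table does not cover.
UNKNOWN = ("Unknown", "Consider escalating")


def first_action(symptom):
    """Look up the likely cause and first action for a symptom."""
    return TRIAGE_TABLE.get(symptom.strip().lower(), UNKNOWN)


cause, action = first_action("High latency")
print(f"{cause}: {action}")  # Database/dependency: Check connections
```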

Escalation Triggers

Immediate: SEV1, data breach, unable to diagnose within 30 min
Consider: spans multiple teams, requires expertise you lack, uncertain about next steps

Condition	Escalate To
SEV1 unresolved after 15 min	Engineering Manager
Data breach suspected	Security Team
Customer communication needed	Support Lead
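
The escalation table can be read as three independent checks. The sketch below illustrates that reading; the `IncidentState` fields and the `escalation_targets` function are hypothetical, and real routing would normally live in the paging tool rather than in ad hoc code.

```python
from dataclasses import dataclass


@dataclass
class IncidentState:
    # Hypothetical fields, chosen to mirror the conditions in the table.
    severity: str
    minutes_unresolved: int
    data_breach_suspected: bool = False
    customer_comms_needed: bool = False


def escalation_targets(state):
    """Return everyone the escalation table says to bring in."""
    targets = []
    if state.severity == "SEV1" and state.minutes_unresolved > 15:
        targets.append("Engineering Manager")
    if state.data_breach_suspected:
        targets.append("Security Team")
    if state.customer_comms_needed:
        targets.append("Support Lead")
    return targets


state = IncidentState("SEV1", minutes_unresolved=20, customer_comms_needed=True)
print(escalation_targets(state))  # ['Engineering Manager', 'Support Lead']
```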

On-Call Handoff

Required Components

Component	Purpose
Active Incidents	What's currently broken
Ongoing Investigations	Issues being debugged
Recent Changes	Deployments, configs
Known Issues	Workarounds in place
Upcoming Events	Maintenance, releases
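
One way to keep handoff notes complete is to render them from a structure that refuses to skip a component. This is a sketch under assumptions: `HANDOFF_SECTIONS` mirrors the table, while `render_handoff` and the sample notes are hypothetical.

```python
# The five required components, in the order given in the table.
HANDOFF_SECTIONS = [
    "Active Incidents",
    "Ongoing Investigations",
    "Recent Changes",
    "Known Issues",
    "Upcoming Events",
]


def render_handoff(notes):
    """Render a plain-text handoff note; every section must be present."""
    missing = [s for s in HANDOFF_SECTIONS if s not in notes]
    if missing:
        raise ValueError(f"handoff is missing sections: {missing}")
    lines = []
    for section in HANDOFF_SECTIONS:
        lines.append(f"{section}:")
        # An empty section is written out explicitly so it is not mistaken
        # for a forgotten one.
        for item in notes[section] or ["None"]:
            lines.append(f"  - {item}")
    return "\n".join(lines)


# Hypothetical sample data for illustration only.
example_notes = {
    "Active Incidents": [],
    "Ongoing Investigations": ["Intermittent timeouts on checkout"],
    "Recent Changes": ["Config change to cache TTL"],
    "Known Issues": [],
    "Upcoming Events": ["Database maintenance window"],
}
print(render_handoff(example_notes))
```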

Handoff Timing

30 min overlap: the outgoing engineer writes the handoff note (15 min), then both engineers join a sync call (15 min). The incoming engineer reviews the note and verifies that alerting works.

Pre-Shift Checklist

• VPN, kubectl, database, and log aggregator access confirmed
• PagerDuty shows you as primary
• Phone notifications enabled
• Test alert received
• Recent incidents reviewed (past 2 weeks)

Mid-Incident Handoff (Critical)

Must transfer: current state + metrics, what's been tried, root cause theories, next steps with escalation triggers, key people involved.

Postmortem Writing

Blameless Culture

Blame-Focused	Blameless
"Who caused this?"	"What conditions allowed this?"
Punish individuals	Improve systems
Hide information	Share learnings

Timeline

Day 0: Incident occurs
Day 1-2: Draft postmortem
Day 3-5: Postmortem meeting (60 min)
Day 5-7: Finalize, create tickets
Week 2+: Action item completion
Quarterly: Review patterns
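
The timeline can be converted into due dates for a specific incident. The sketch below takes the upper bound of each range and counts calendar days; both choices are assumptions, since the timeline does not say whether days are calendar or business days. The function name `postmortem_schedule` is hypothetical.

```python
from datetime import date, timedelta


def postmortem_schedule(incident_day):
    """Return latest dates for each postmortem step (calendar days assumed)."""
    return {
        "draft_due": incident_day + timedelta(days=2),    # Day 1-2
        "meeting_by": incident_day + timedelta(days=5),   # Day 3-5
        "finalize_by": incident_day + timedelta(days=7),  # Day 5-7
    }


schedule = postmortem_schedule(date(2024, 3, 4))
for step, due in schedule.items():
    print(f"{step}: {due.isoformat()}")
```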

Required Sections

• Executive Summary -- 1-2 sentences: what, impact, resolution
• Timeline (UTC) -- timestamped events
• Root Cause (5 Whys) -- keep asking "why" until you hit a systemic issue
• Detection -- what worked, what didn't
• Response -- what worked, what could improve
• Lessons Learned -- went well, went wrong, got lucky
• Action Items -- priority, owner, due date, ticket (always concrete)
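
A draft can be checked against the required sections before the meeting. This is a rough sketch that assumes each section appears as a line starting with its name; `missing_sections` and the sample draft are hypothetical, and a body line that happens to start with a section name would be counted as a heading.

```python
# Section names from the required-sections list, without the qualifiers.
REQUIRED_SECTIONS = [
    "Executive Summary",
    "Timeline",
    "Root Cause",
    "Detection",
    "Response",
    "Lessons Learned",
    "Action Items",
]


def missing_sections(text):
    """Return required sections that no line in the draft starts with."""
    lines = [line.strip().lower() for line in text.splitlines()]
    return [
        section
        for section in REQUIRED_SECTIONS
        if not any(line.startswith(section.lower()) for line in lines)
    ]


# Hypothetical partial draft for illustration only.
draft = """Executive Summary
Checkout API returned errors for 47 minutes.
Timeline (UTC)
14:02 alert fired
Root Cause (5 Whys)
Connection pool exhausted after a config change.
"""
print(missing_sections(draft))
# ['Detection', 'Response', 'Lessons Learned', 'Action Items']
```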

Meeting Structure (60 min)

• Opening (5 min) -- remind attendees of the blameless culture
• Timeline review (15 min)
• Analysis (20 min) -- what failed, why, prevention
• Action items (15 min) -- prioritize, assign owners
• Closing (5 min) -- confirm owners
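
The agenda slots add up to the full 60 minutes, which leaves no slack if one item overruns. A small sketch that computes start and end offsets for each item follows; `AGENDA` copies the durations from the list and `agenda_offsets` is a hypothetical helper.

```python
# (item, minutes) pairs copied from the meeting structure.
AGENDA = [
    ("Opening", 5),
    ("Timeline review", 15),
    ("Analysis", 20),
    ("Action items", 15),
    ("Closing", 5),
]


def agenda_offsets(agenda):
    """Return (item, start_minute, end_minute) for each agenda item."""
    offsets = []
    elapsed = 0
    for name, minutes in agenda:
        offsets.append((name, elapsed, elapsed + minutes))
        elapsed += minutes
    return offsets


for name, start, end in agenda_offsets(AGENDA):
    print(f"{start:02d}-{end:02d} min  {name}")
```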

Postmortem Anti-Patterns

• Blame game instead of systems focus
• Shallow analysis -- fix: ask "why" 5 times
• No action items, or unrealistic ones
• No follow-up -- fix: track action items in the ticketing system

Communication Templates

Initial:

INCIDENT: [Service] Degradation
Severity: [SEV] | Status: Investigating | Impact: [description]
Start Time: [TIME] | Incident Commander: [NAME]
Updates in [channel]
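
Filling the initial template by hand under pressure invites omissions, so it can help to generate it. The following is a minimal sketch using `str.format`; the function name, parameter names, and sample values are hypothetical, and only the wording of the template comes from the text above.

```python
INITIAL_TEMPLATE = (
    "INCIDENT: {service} Degradation\n"
    "Severity: {severity} | Status: Investigating | Impact: {impact}\n"
    "Start Time: {start_time} | Incident Commander: {commander}\n"
    "Updates in {channel}"
)


def initial_message(service, severity, impact, start_time, commander, channel):
    """Fill in the initial incident announcement; all fields are required."""
    return INITIAL_TEMPLATE.format(
        service=service,
        severity=severity,
        impact=impact,
        start_time=start_time,
        commander=commander,
        channel=channel,
    )


# Hypothetical sample values for illustration only.
message = initial_message(
    service="Checkout API",
    severity="SEV2",
    impact="Elevated error rate on payments",
    start_time="14:02 UTC",
    commander="A. Example",
    channel="#incident-checkout",
)
print(message)
```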

Resolution:

RESOLVED: [Service] Incident
Duration: [X] minutes | Impact: [affected users/transactions]
Root Cause: [brief] | Resolution: [what was done]
Follow-up: Postmortem scheduled [DATE]
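
The duration field in the resolution message can be computed from the start and resolution timestamps instead of estimated. This sketch assumes both timestamps are timezone-aware `datetime` values; the function name, parameters, and sample values are hypothetical.

```python
from datetime import datetime, timezone

RESOLUTION_TEMPLATE = (
    "RESOLVED: {service} Incident\n"
    "Duration: {minutes} minutes | Impact: {impact}\n"
    "Root Cause: {root_cause} | Resolution: {resolution}\n"
    "Follow-up: Postmortem scheduled {postmortem_date}"
)


def resolution_message(service, started, resolved, impact,
                       root_cause, resolution, postmortem_date):
    """Fill in the resolution announcement, deriving duration in whole minutes."""
    if resolved < started:
        raise ValueError("resolved must not be earlier than started")
    minutes = int((resolved - started).total_seconds() // 60)
    return RESOLUTION_TEMPLATE.format(
        service=service,
        minutes=minutes,
        impact=impact,
        root_cause=root_cause,
        resolution=resolution,
        postmortem_date=postmortem_date,
    )


# Hypothetical sample values for illustration only.
resolved_text = resolution_message(
    service="Checkout API",
    started=datetime(2024, 3, 4, 14, 2, tzinfo=timezone.utc),
    resolved=datetime(2024, 3, 4, 14, 49, tzinfo=timezone.utc),
    impact="Payment attempts failed for some users",
    root_cause="Connection pool exhausted",
    resolution="Rolled back config change",
    postmortem_date="2024-03-07",
)
print(resolved_text)
```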