Incident Response
Severity Levels
| Level | Description | Response Time |
|---|---|---|
| P1 | Service down | 15 min |
| P2 | Major degradation | 30 min |
| P3 | Minor impact | 4 hours |
| P4 | No impact | Next business day |
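The response times in the table can be turned into concrete acknowledgement deadlines. The following is a minimal sketch, not a prescribed implementation: the names `RESPONSE_TARGETS` and `response_deadline` are hypothetical, and mapping P4 to `None` is an assumption made here because "next business day" depends on a working calendar the table does not define.

```python
from datetime import datetime, timedelta
from typing import Optional

# Response-time targets copied from the severity table.
# P4 ("next business day") has no fixed duration, so it maps to None (assumption).
RESPONSE_TARGETS = {
    "P1": timedelta(minutes=15),
    "P2": timedelta(minutes=30),
    "P3": timedelta(hours=4),
    "P4": None,
}


def response_deadline(severity: str, alerted_at: datetime) -> Optional[datetime]:
    """Return the latest acknowledgement time, or None when no fixed target exists."""
    target = RESPONSE_TARGETS[severity.upper()]
    if target is None:
        return None
    return alerted_at + target
```

An unknown severity raises `KeyError`, which is arguably preferable to silently picking a default response time.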
Incident Flow
Alert → Acknowledge → Assess → Mitigate → Resolve → Postmortem
        │             │        │
        └── Page ─────┴── Communicate
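The main line of the flow can be modeled as an ordered sequence of stages so that tooling can refuse to skip a step. This is only a sketch under assumptions: the `Stage` enum and `next_stage` helper are hypothetical names, and the Page and Communicate side branches are left out.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    ALERT = "Alert"
    ACKNOWLEDGE = "Acknowledge"
    ASSESS = "Assess"
    MITIGATE = "Mitigate"
    RESOLVE = "Resolve"
    POSTMORTEM = "Postmortem"


# Enum members iterate in definition order, which matches the flow diagram.
FLOW = list(Stage)


def next_stage(current: Stage) -> Optional[Stage]:
    """Return the stage that follows `current`, or None after the postmortem."""
    index = FLOW.index(current)
    return FLOW[index + 1] if index + 1 < len(FLOW) else None
```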
On-Call Checklist
- Acknowledge alert within SLA
- Assess impact and severity
- Communicate status to stakeholders
- Mitigate: stop the bleeding
- Investigate root cause
- Resolve underlying issue
- Document in postmortem
Communication Template
🔴 INCIDENT: [Brief description]
Impact: [Who/what is affected]
Status: [Investigating/Mitigating/Resolved]
ETA: [Expected resolution time]
Updates: [Channel/page]
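Filling in the template by hand under pressure invites typos, so it may help to render it from structured fields. The sketch below assumes a hypothetical `IncidentUpdate` dataclass; the field names are illustrative, and only the three status values listed in the template are accepted.

```python
from dataclasses import dataclass

# Status values taken from the template's Status field.
VALID_STATUSES = ("Investigating", "Mitigating", "Resolved")


@dataclass
class IncidentUpdate:
    description: str
    impact: str
    status: str
    eta: str
    updates: str

    def render(self) -> str:
        """Return the status message, one template field per line."""
        if self.status not in VALID_STATUSES:
            raise ValueError(f"status must be one of {VALID_STATUSES}")
        return "\n".join([
            f"🔴 INCIDENT: {self.description}",
            f"Impact: {self.impact}",
            f"Status: {self.status}",
            f"ETA: {self.eta}",
            f"Updates: {self.updates}",
        ])
```

Rejecting unknown statuses keeps every update in the same vocabulary, which makes a channel history easier to scan.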
Common Runbooks
High CPU
- Identify process: `top -c`
- Check for runaway processes
- Scale horizontally if needed
- Investigate root cause
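When an interactive `top` session is not practical (for example, when capturing evidence for the postmortem), the first two steps can be approximated with a script. This sketch is an assumption-laden alternative, not a replacement for `top`: it parses the output of `ps -eo pid,pcpu,comm`, the helper names are hypothetical, and on Linux the `ps` CPU figure is typically averaged over the process lifetime rather than sampled live.

```python
import subprocess
from typing import List, Tuple


def busiest_processes(ps_output: str, limit: int = 5) -> List[Tuple[int, float, str]]:
    """Parse `ps -eo pid,pcpu,comm` text into (pid, cpu_percent, command), busiest first."""
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # ignore malformed lines
        pid, pcpu, command = parts
        rows.append((int(pid), float(pcpu), command))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:limit]


def snapshot(limit: int = 5) -> List[Tuple[int, float, str]]:
    """Run ps on the local host and return the busiest processes."""
    result = subprocess.run(
        ["ps", "-eo", "pid,pcpu,comm"],
        capture_output=True, text=True, check=True,
    )
    return busiest_processes(result.stdout, limit)
```

Keeping the parser separate from the `subprocess` call lets the parsing be exercised against saved output from another host.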
Out of Disk
- Check usage: `df -h`
- Find large files: `du -sh /* | sort -h`
- Clear logs/temp files
- Add storage or archive
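The usage check and the search for large files can also be scripted with the standard library, which is handy on hosts where the output needs to feed an alert. This is a rough sketch: `percent_used` and `largest_files` are hypothetical helpers, the percentage is computed as used over total and so can differ slightly from the figure `df` reports, and files are listed individually rather than summed per directory as `du -sh` does.

```python
import os
import shutil
from typing import List, Tuple


def percent_used(path: str = "/") -> float:
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def largest_files(root: str, limit: int = 10) -> List[Tuple[int, str]]:
    """Return up to `limit` (size_in_bytes, path) pairs under `root`, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            try:
                if os.path.islink(full_path):
                    continue  # skip symlinks so targets are not double counted
                sizes.append((os.path.getsize(full_path), full_path))
            except OSError:
                continue  # file vanished or is unreadable; keep scanning
    sizes.sort(reverse=True)
    return sizes[:limit]
```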
Database Slow
- Check connections: `SHOW PROCESSLIST`
- Identify slow queries
- Kill blocking queries if needed
- Scale or optimize
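Once the `SHOW PROCESSLIST` rows are in hand, picking out the long-running queries is a simple filter. The sketch below assumes MySQL, with rows already fetched as dictionaries keyed by the standard processlist column names (`Id`, `Command`, `Time`, `Info`). The 30-second threshold and both function names are hypothetical. It only builds the `KILL QUERY` statement text and leaves the decision to run it with the operator.

```python
from typing import Dict, List


def slow_queries(rows: List[Dict], threshold_seconds: int = 30) -> List[Dict]:
    """Return rows that are actively running a query for at least the threshold."""
    candidates = [
        row for row in rows
        if row["Command"] == "Query" and int(row["Time"] or 0) >= threshold_seconds
    ]
    # Longest-running first, since those are the likeliest blockers.
    return sorted(candidates, key=lambda row: int(row["Time"]), reverse=True)


def kill_statement(row: Dict) -> str:
    """Build the statement that would terminate the query but keep the connection."""
    return f"KILL QUERY {int(row['Id'])}"
```

Idle connections (`Command` of `Sleep`) are excluded on purpose: a large `Time` there means an idle client, not a slow query.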
Escalation Path
On-Call Engineer (15 min)
↓
Team Lead (30 min)
↓
Engineering Manager (1 hour)
↓
VP Engineering (2 hours)
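The escalation path can be encoded as data so a pager integration can decide whom to notify. This sketch rests on one reading of the diagram, which is an assumption: each figure is treated as the number of minutes since the alert after which the incident escalates past that role. `ESCALATION_PATH` and `roles_to_page` are hypothetical names.

```python
from typing import List

# (role, minutes since the alert after which the incident moves to the next role)
ESCALATION_PATH = [
    ("On-Call Engineer", 15),
    ("Team Lead", 30),
    ("Engineering Manager", 60),
    ("VP Engineering", 120),
]


def roles_to_page(minutes_unresolved: float) -> List[str]:
    """Return every role that should have been paged by this point."""
    paged = []
    window_opens_at = 0  # the on-call engineer is paged immediately
    for role, deadline in ESCALATION_PATH:
        if minutes_unresolved >= window_opens_at:
            paged.append(role)
        window_opens_at = deadline  # the next role is paged once this deadline passes
    return paged
```

For example, an incident still open after 20 minutes would have paged the on-call engineer and the team lead under this reading.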