Incident Response

Name: incident-response
Rating: 78
Author: timequity

Severity Levels

Level	Description	Response Time
P1	Service down	15 min
P2	Major degradation	30 min
P3	Minor impact	4 hours
P4	No impact	Next business day
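
The response times above can be encoded as a lookup for alert routing or SLA checks. The following is a minimal sketch rather than a definitive implementation: the names `RESPONSE_SLA`, `ack_deadline`, and `is_breached` are hypothetical, and "Next business day" for P4 is approximated as a flat 24 hours instead of a calendar-aware calculation.

```python
from datetime import datetime, timedelta

# Response-time targets taken from the severity table.
# Assumption: P4 "Next business day" is simplified to 24 hours.
RESPONSE_SLA = {
    "P1": timedelta(minutes=15),
    "P2": timedelta(minutes=30),
    "P3": timedelta(hours=4),
    "P4": timedelta(hours=24),
}


def ack_deadline(severity: str, alerted_at: datetime) -> datetime:
    """Return the latest time by which the alert should get a response."""
    return alerted_at + RESPONSE_SLA[severity]


def is_breached(severity: str, alerted_at: datetime, now: datetime) -> bool:
    """Return True if the response-time target has already passed."""
    return now > ack_deadline(severity, alerted_at)
```

An unknown severity raises `KeyError`, which is deliberate here: an unclassified incident should be assessed, not silently given a default.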

Incident Flow

Alert → Acknowledge → Assess → Mitigate → Resolve → Postmortem
          │             │         │
          └── Page ─────┴── Communicate

On-Call Checklist

•Acknowledge alert within SLA
•Assess impact and severity
•Communicate status to stakeholders
•Mitigate: stop the bleeding before looking for the cause
•Investigate root cause
•Resolve underlying issue
•Document in postmortem

Communication Template

🔴 INCIDENT: [Brief description]
Impact: [Who/what is affected]
Status: [Investigating/Mitigating/Resolved]
ETA: [Expected resolution time]
Updates: [Channel/page]
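
If status updates are posted by a script or bot, the template can be filled in programmatically so every update has the same shape. This is a sketch under assumptions: the `IncidentUpdate` class and its field names are hypothetical, and the only validation is that the status is one of the three values listed in the template.

```python
from dataclasses import dataclass

# Status values allowed by the template.
VALID_STATUSES = ("Investigating", "Mitigating", "Resolved")


@dataclass
class IncidentUpdate:
    description: str  # brief description of the incident
    impact: str       # who or what is affected
    status: str       # one of VALID_STATUSES
    eta: str          # expected resolution time, as free text
    updates: str      # channel or page where updates are posted

    def render(self) -> str:
        """Return the update as text in the template's five-line layout."""
        if self.status not in VALID_STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")
        return "\n".join([
            f"🔴 INCIDENT: {self.description}",
            f"Impact: {self.impact}",
            f"Status: {self.status}",
            f"ETA: {self.eta}",
            f"Updates: {self.updates}",
        ])
```

Rejecting unknown statuses keeps updates consistent across responders, at the cost of needing a code change if a new status is introduced.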

Common Runbooks

High CPU

•Identify process: top -c
•Check for runaway processes
•Scale horizontally if needed
•Investigate root cause

Out of Disk

•Check usage: df -h
•Find large files: du -sh /* | sort -h
•Clear logs/temp files
•Add storage or archive
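
The first step of this runbook can also be automated as a threshold check. The sketch below uses the standard-library `shutil.disk_usage`; the function names and the 90% threshold are assumptions chosen for illustration, not values from this runbook. The percentage is computed as used/total, so it can differ slightly from the `Use%` column of `df -h`, which excludes blocks reserved for root.

```python
import shutil


def disk_percent_used(path: str = "/") -> float:
    """Return the percentage of the filesystem containing `path` that is used."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        # Some pseudo-filesystems report zero size; treat them as empty.
        return 0.0
    return 100.0 * usage.used / usage.total


def needs_cleanup(path: str = "/", threshold: float = 90.0) -> bool:
    """Return True when usage is at or above the (assumed) threshold."""
    return disk_percent_used(path) >= threshold
```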

Database Slow

•Check connections: SHOW PROCESSLIST (MySQL/MariaDB)
•Identify slow queries
•Kill blocking queries if needed
•Scale or optimize

Escalation Path

On-Call Engineer (15 min)
    ↓
Team Lead (30 min)
    ↓
Engineering Manager (1 hour)
    ↓
VP Engineering (2 hours)
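
The path above can be expressed as a lookup that tells a paging script who should currently hold an unacknowledged incident. This is one possible reading, stated as an assumption: each time is treated as a cumulative limit, in minutes since the alert fired, after which the incident moves to the next level. The names `ESCALATION_PATH` and `current_escalation_level` are hypothetical.

```python
# (role, minutes since the alert after which the next level is engaged)
# Assumption: the times in the escalation path are cumulative limits.
ESCALATION_PATH = [
    ("On-Call Engineer", 15),
    ("Team Lead", 30),
    ("Engineering Manager", 60),
    ("VP Engineering", 120),
]


def current_escalation_level(minutes_unacknowledged: float) -> str:
    """Return the role that should hold the incident at this point."""
    for role, limit in ESCALATION_PATH:
        if minutes_unacknowledged < limit:
            return role
    # Past the last limit the incident stays with the top of the path.
    return ESCALATION_PATH[-1][0]
```

If the times are instead meant as per-level windows rather than cumulative limits, the table values would need to be summed before use.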