Incident Management
Incident Severity
| Level | Impact | Response Time |
|---|---|---|
| SEV1 | Complete outage | Immediate |
| SEV2 | Major degradation | < 15 min |
| SEV3 | Minor degradation | < 1 hour |
| SEV4 | Low impact | Next business day |
Incident Response
1. Detect
- •Monitoring alerts
- •Customer reports
- •Error logs
2. Triage
- •Assess severity
- •Assign incident commander
- •Create communication channel
3. Investigate
- •Check recent changes
- •Review logs and metrics
- •Identify root cause
4. Mitigate
- •Apply quick fix
- •Rollback if needed
- •Communicate status
5. Resolve
- •Confirm fix
- •Monitor for recurrence
- •Close incident
6. Learn
- •Post-mortem meeting
- •Document findings
- •Create action items
Post-Mortem Template
markdown
# Post-Mortem: [Incident Title] ## Summary [Brief description of what happened] ## Timeline - HH:MM - [Event] - HH:MM - [Event] - HH:MM - [Resolution] ## Impact - Duration: [X hours] - Users affected: [X] - Revenue impact: [if applicable] ## Root Cause [What caused this incident] ## Contributing Factors - [Factor 1] - [Factor 2] ## What Went Well - [Positive 1] - [Positive 2] ## What Could Be Improved - [Improvement 1] - [Improvement 2] ## Action Items - [ ] [Action 1] - Owner: [Name] - [ ] [Action 2] - Owner: [Name]
Blameless Culture
- •Focus on systems, not people
- •"What failed?" not "Who failed?"
- •Share learnings openly
- •Celebrate near-misses