Incident Management

Name: incident-management
Rating: 76
Author: nguyenhuuca

Incident Severity

Level	Impact	Response Time
SEV1	Complete outage	Immediate
SEV2	Major degradation	< 15 min
SEV3	Minor degradation	< 1 hour
SEV4	Low impact	Next business day
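
The table above can be encoded so that tooling computes a response deadline from the severity level. This is a minimal sketch, not a prescribed implementation. The names `Severity`, `RESPONSE_TARGET`, `next_business_day`, and `response_deadline` are hypothetical. It assumes "Immediate" means zero delay, that "Next business day" means any time before the end of the next weekday, and that public holidays can be ignored.

```python
from datetime import datetime, timedelta
from enum import Enum


class Severity(Enum):
    SEV1 = "Complete outage"
    SEV2 = "Major degradation"
    SEV3 = "Minor degradation"
    SEV4 = "Low impact"


# Response targets from the table. SEV1 ("Immediate") is modeled as a zero
# delay. SEV4 ("Next business day") has no fixed duration, so it is handled
# separately in response_deadline().
RESPONSE_TARGET = {
    Severity.SEV1: timedelta(0),
    Severity.SEV2: timedelta(minutes=15),
    Severity.SEV3: timedelta(hours=1),
}


def next_business_day(moment: datetime) -> datetime:
    """Return midnight at the start of the next weekday (Mon-Fri) after `moment`."""
    day = moment.replace(hour=0, minute=0, second=0, microsecond=0) + timedelta(days=1)
    while day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        day += timedelta(days=1)
    return day


def response_deadline(severity: Severity, detected_at: datetime) -> datetime:
    """Latest time by which a responder should have picked up the incident."""
    if severity is Severity.SEV4:
        # Assumption: the deadline is the end of the next business day.
        return next_business_day(detected_at) + timedelta(days=1)
    return detected_at + RESPONSE_TARGET[severity]
```

For example, a SEV2 detected at 17:00 would be due by 17:15 the same day, while a SEV4 detected on a Friday would be due before the following Monday ends.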

Incident Response

1. Detect

•Monitoring alerts
•Customer reports
•Error logs

2. Triage

•Assess severity
•Assign incident commander
•Create communication channel

3. Investigate

•Check recent changes
•Review logs and metrics
•Identify root cause

4. Mitigate

•Apply quick fix
•Roll back if needed
•Communicate status

5. Resolve

•Confirm fix
•Monitor for recurrence
•Close incident

6. Learn

•Post-mortem meeting
•Document findings
•Create action items
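
The six phases above can be tracked as a simple state machine. The sketch below is one possible way to model them, assuming the phases are strictly sequential (in practice, investigation and mitigation often overlap). `Phase`, `Incident`, and `advance` are hypothetical names, not part of any existing tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Phase(Enum):
    DETECT = 1
    TRIAGE = 2
    INVESTIGATE = 3
    MITIGATE = 4
    RESOLVE = 5
    LEARN = 6


@dataclass
class Incident:
    title: str
    phase: Phase = Phase.DETECT
    # (UTC timestamp, phase) pairs, recorded on creation and on every transition.
    history: list = field(default_factory=list)

    def __post_init__(self):
        self._record()

    def _record(self):
        self.history.append((datetime.now(timezone.utc), self.phase))

    def advance(self) -> Phase:
        """Move to the next phase; refuse to go past LEARN."""
        if self.phase is Phase.LEARN:
            raise ValueError("incident has already reached the final phase")
        self.phase = Phase(self.phase.value + 1)
        self._record()
        return self.phase
```

The recorded `history` gives a timestamped trail of phase changes that could later feed the Timeline section of a post-mortem.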

Post-Mortem Template

```markdown
# Post-Mortem: [Incident Title]

## Summary
[Brief description of what happened]

## Timeline
- HH:MM - [Event]
- HH:MM - [Event]
- HH:MM - [Resolution]

## Impact
- Duration: [X hours]
- Users affected: [X]
- Revenue impact: [if applicable]

## Root Cause
[What caused this incident]

## Contributing Factors
- [Factor 1]
- [Factor 2]

## What Went Well
- [Positive 1]
- [Positive 2]

## What Could Be Improved
- [Improvement 1]
- [Improvement 2]

## Action Items
- [ ] [Action 1] - Owner: [Name]
- [ ] [Action 2] - Owner: [Name]
```
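
If incident data is already captured in structured form, the template can be filled in programmatically. The following is a rough sketch that renders only a subset of the sections (Summary, Timeline, Impact, Root Cause, Action Items). The `PostMortem` and `ActionItem` classes and their field names are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class ActionItem:
    description: str
    owner: str


@dataclass
class PostMortem:
    title: str
    summary: str
    timeline: list  # (HH:MM string, event description) pairs
    duration: str
    users_affected: str
    root_cause: str
    action_items: list = field(default_factory=list)

    def render(self) -> str:
        """Return the post-mortem as markdown text following the template layout."""
        lines = [f"# Post-Mortem: {self.title}", "", "## Summary", self.summary, ""]
        lines += ["## Timeline"]
        lines += [f"- {time} - {event}" for time, event in self.timeline]
        lines += [
            "",
            "## Impact",
            f"- Duration: {self.duration}",
            f"- Users affected: {self.users_affected}",
            "",
        ]
        lines += ["## Root Cause", self.root_cause, "", "## Action Items"]
        # Each action item becomes an unchecked task with a named owner.
        lines += [f"- [ ] {item.description} - Owner: {item.owner}" for item in self.action_items]
        return "\n".join(lines) + "\n"
```

Calling `render()` on a populated `PostMortem` returns a string that can be saved as a `.md` file. The remaining sections (Contributing Factors, What Went Well, What Could Be Improved) could be added in the same way.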

Blameless Culture

•Focus on systems, not people
•Ask "What failed?", not "Who failed?"
•Share learnings openly
•Celebrate near-misses that were caught and reported