Incident Response - Production Issue Management
When to use this skill
- •Responding to production outages
- •Triaging critical incidents
- •Investigating high-severity bugs
- •Coordinating incident response teams
- •Implementing emergency hotfixes
- •Conducting post-mortem analyses
- •Establishing incident response procedures
- •Communicating status during incidents
- •Creating runbooks for common issues
- •Implementing rollback strategies
- •Documenting incident timelines
- •Preventing incident recurrence
When to use this skill
- •Responding to outages, managing incidents, conducting postmortems.
- •When working on related tasks or features
- •During development that requires this expertise
Use when: Responding to outages, managing incidents, conducting postmortems.
Incident Response Process
1. Detect
- •Monitoring alerts
- •User reports
- •Automated checks
2. Triage
- •Assess severity (P0-P4)
- •Page on-call engineer
- •Create incident channel
3. Mitigate
- •Rollback to last known good
- •Scale resources
- •Apply hotfix
- •Communicate status
4. Resolve
- •Verify fix
- •Monitor metrics
- •Update status page
- •Close incident
5. Postmortem
- •Timeline of events
- •Root cause analysis
- •Action items
- •Follow-up tasks
Severity Levels
- •P0 (Critical): Complete outage, data loss
- •P1 (High): Major feature broken, revenue impact
- •P2 (Medium): Degraded performance, workaround exists
- •P3 (Low): Minor bug, cosmetic issue
- •P4 (Informational): Enhancement request
Example Runbook
```markdown
High CPU Usage Runbook
Symptoms
- •Server CPU > 90%
- •Slow response times
- •Request timeouts
Investigation
- •Check top processes: `top`
- •Check memory: `free -h`
- •Check logs: `tail -f app.log`
Mitigation
- •Scale horizontally: Add servers
- •Restart service: `systemctl restart app`
- •Rate limit: Enable aggressive rate limiting
Resolution
- •Identify root cause (N+1 query, memory leak, etc.)
- •Deploy fix
- •Monitor for 1 hour ```
Communication Template
``` [INCIDENT] Service X degraded
Status: Investigating Impact: 20% of users seeing slow load times ETA: 30 minutes
Updates:
- •10:00 AM: Issue detected
- •10:05 AM: On-call paged, investigation started
- •10:15 AM: Root cause identified (database bottleneck)
- •10:30 AM: Fix deployed, monitoring
Next update: 11:00 AM ```