Incident Response - Production Issue Management

When to use this skill

•Responding to production outages
•Triaging critical incidents
•Investigating high-severity bugs
•Coordinating incident response teams
•Implementing emergency hotfixes
•Conducting post-mortem analyses
•Establishing incident response procedures
•Communicating status during incidents
•Creating runbooks for common issues
•Implementing rollback strategies
•Documenting incident timelines
•Preventing incident recurrence

Incident Response Process

1. Detect

•Monitoring alerts
•User reports
•Automated checks

2. Triage

•Assess severity (P0-P4)
•Page on-call engineer
•Create incident channel

3. Mitigate

•Roll back to the last known good version
•Scale resources
•Apply hotfix
•Communicate status

4. Resolve

•Verify fix
•Monitor metrics
•Update status page
•Close incident

5. Postmortem

•Timeline of events
•Root cause analysis
•Action items
•Follow-up tasks
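The timeline that a postmortem starts from can be captured as structured data while the incident is still in progress. The sketch below is only an illustration: the `Incident` class, its methods, and the example date are hypothetical and not part of any particular incident tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical record of one incident's timeline."""
    title: str
    events: list[tuple[datetime, str]] = field(default_factory=list)

    def log(self, when: datetime, note: str) -> None:
        """Record a timestamped event; entries may arrive out of order."""
        self.events.append((when, note))

    def timeline(self) -> list[str]:
        """Return events as 'HH:MM note' lines, sorted by timestamp."""
        return [f"{t:%H:%M} {note}" for t, note in sorted(self.events)]

    def duration(self) -> timedelta:
        """Time between the first and last recorded event."""
        if not self.events:
            return timedelta(0)
        times = [t for t, _ in self.events]
        return max(times) - min(times)


# Example usage with made-up timestamps.
incident = Incident("Service X degraded")
incident.log(datetime(2024, 1, 1, 10, 5), "On-call paged")
incident.log(datetime(2024, 1, 1, 10, 0), "Issue detected")
incident.log(datetime(2024, 1, 1, 10, 30), "Fix deployed")
print("\n".join(incident.timeline()))
print("Duration:", incident.duration())
```

Sorting at read time rather than at write time lets responders add entries late (for example, from chat history) without corrupting the order.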

Severity Levels

•P0 (Critical): Complete outage, data loss
•P1 (High): Major feature broken, revenue impact
•P2 (Medium): Degraded performance, workaround exists
•P3 (Low): Minor bug, cosmetic issue
•P4 (Informational): Enhancement request
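One way to make triage repeatable is to encode the levels above as a small decision function. This is a minimal sketch: the `Severity` enum, the `classify` function, and its flag names are assumptions chosen to mirror the list, and a real team would tune the criteria.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels from the list above; lower number = more severe."""
    P0 = 0  # Critical: complete outage, data loss
    P1 = 1  # High: major feature broken, revenue impact
    P2 = 2  # Medium: degraded performance, workaround exists
    P3 = 3  # Low: minor bug, cosmetic issue
    P4 = 4  # Informational: enhancement request


def classify(
    *,
    complete_outage: bool = False,
    data_loss: bool = False,
    major_feature_broken: bool = False,
    revenue_impact: bool = False,
    degraded_performance: bool = False,
    minor_bug: bool = False,
) -> Severity:
    """Return the most severe level whose criteria match (hypothetical flags)."""
    if complete_outage or data_loss:
        return Severity.P0
    if major_feature_broken or revenue_impact:
        return Severity.P1
    if degraded_performance:
        return Severity.P2
    if minor_bug:
        return Severity.P3
    return Severity.P4


print(classify(degraded_performance=True, revenue_impact=True).name)
```

Checking the most severe criteria first means that an incident matching several levels is always assigned the highest applicable one.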

Example Runbook

```markdown

High CPU Usage Runbook

Symptoms

•Server CPU > 90%
•Slow response times
•Request timeouts

Investigation

•Check top processes: `top`
•Check memory: `free -h`
•Check logs: `tail -f app.log`

Mitigation

•Scale horizontally: Add servers
•Restart service: `systemctl restart app`
•Rate limit: Enable aggressive rate limiting

Resolution

•Identify root cause (N+1 query, memory leak, etc.)
•Deploy fix
•Monitor for 1 hour
```
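The "Server CPU > 90%" symptom is easier to act on when it is distinguished from a momentary spike. The sketch below assumes CPU percentages have already been collected by some monitoring agent. The function name and the `min_consecutive` parameter are hypothetical, and the default of three samples is an arbitrary illustration, not a recommendation from the runbook.

```python
def cpu_sustained_high(
    samples: list[float],
    threshold: float = 90.0,
    min_consecutive: int = 3,
) -> bool:
    """Return True if `min_consecutive` samples in a row exceed `threshold` percent."""
    run = 0
    for value in samples:
        # Extend the current run on a breach, otherwise reset it.
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# A single spike does not trigger, but a sustained run does.
print(cpu_sustained_high([50.0, 95.0, 60.0, 97.0]))
print(cpu_sustained_high([50.0, 95.0, 96.0, 97.0]))
```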

Communication Template

```
[INCIDENT] Service X degraded

Status: Investigating
Impact: 20% of users seeing slow load times
ETA: 30 minutes

Updates:

•10:00 AM: Issue detected
•10:05 AM: On-call paged, investigation started
•10:15 AM: Root cause identified (database bottleneck)
•10:30 AM: Fix deployed, monitoring

Next update: 11:00 AM
```
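Filling in the template by hand under pressure invites omissions, so a small helper that renders it from structured fields may be useful. The following is a sketch only: `render_update` and its parameters are hypothetical names, and the output layout simply follows the template above.

```python
def render_update(
    headline: str,
    status: str,
    impact: str,
    eta: str,
    updates: list[tuple[str, str]],
    next_update: str,
) -> str:
    """Render a status message in the layout of the communication template."""
    lines = [
        f"[INCIDENT] {headline}",
        "",
        f"Status: {status}",
        f"Impact: {impact}",
        f"ETA: {eta}",
        "",
        "Updates:",
        "",
    ]
    # One bullet per (time, note) pair, in the order given.
    lines += [f"•{time}: {note}" for time, note in updates]
    lines += ["", f"Next update: {next_update}"]
    return "\n".join(lines)


message = render_update(
    headline="Service X degraded",
    status="Investigating",
    impact="20% of users seeing slow load times",
    eta="30 minutes",
    updates=[
        ("10:00 AM", "Issue detected"),
        ("10:05 AM", "On-call paged, investigation started"),
    ],
    next_update="11:00 AM",
)
print(message)
```

Because every field is a required argument, a missing impact statement or next-update time fails loudly instead of producing an incomplete message.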
