Incident Response
When to Use
Activate this skill when:
- Production service is down or returning errors to users
- Error rate has spiked beyond normal thresholds
- Performance has degraded significantly (latency increase, timeouts)
- An alert has fired from the monitoring system
- Users are reporting issues that indicate a systemic problem
- A failed deployment needs investigation and remediation
- Conducting a post-mortem or root cause analysis after an incident
Do NOT use this skill for:
- Setting up monitoring or alerting rules (use monitoring-setup)
- Performing routine deployments (use deployment-pipeline)
- Docker image or infrastructure issues (use docker-best-practices)
- Feature development or code changes (use python-backend-expert or react-frontend-expert)
Instructions
Severity Classification
Classify every incident immediately. Severity determines response urgency, communication cadence, and escalation path.
| Severity | Impact | Examples | Response Time | Update Cadence |
|---|---|---|---|---|
| SEV1 (P1) | Complete outage, all users affected | Service down, data loss, security breach | Immediate (< 5 min) | Every 15 min |
| SEV2 (P2) | Major degradation, most users affected | Core feature broken, severe latency | < 15 min | Every 30 min |
| SEV3 (P3) | Partial degradation, some users affected | Non-critical feature broken, intermittent errors | < 1 hour | Every 2 hours |
| SEV4 (P4) | Minor issue, few users affected | Cosmetic bug, edge case error | < 4 hours | Daily |
Escalation rules:
- SEV1: Page on-call engineer + engineering manager immediately
- SEV2: Page on-call engineer, notify engineering manager
- SEV3: Notify on-call engineer via Slack
- SEV4: Create ticket, address during normal working hours
See references/escalation-contacts.md for the contact matrix.
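The severity table and escalation rules above can be encoded as a small lookup so that tooling applies them consistently. The following is a minimal sketch, not part of the skill's scripts: `SeverityPolicy`, `POLICIES`, and `next_update_due` are hypothetical names, "Immediate (< 5 min)" is encoded as 5 minutes, and "Daily" is assumed to mean 24 hours.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class SeverityPolicy:
    response_within: timedelta  # maximum time to first response
    update_every: timedelta     # stakeholder update cadence
    escalation: str             # who to page or notify


# Values copied from the severity table and escalation rules.
# Assumption: "Daily" means every 24 hours.
POLICIES = {
    "SEV1": SeverityPolicy(timedelta(minutes=5), timedelta(minutes=15),
                           "Page on-call engineer + engineering manager"),
    "SEV2": SeverityPolicy(timedelta(minutes=15), timedelta(minutes=30),
                           "Page on-call engineer, notify engineering manager"),
    "SEV3": SeverityPolicy(timedelta(hours=1), timedelta(hours=2),
                           "Notify on-call engineer via Slack"),
    "SEV4": SeverityPolicy(timedelta(hours=4), timedelta(hours=24),
                           "Create ticket"),
}


def next_update_due(severity: str, last_update: datetime) -> datetime:
    """Return when the next status update is owed for this severity."""
    return last_update + POLICIES[severity].update_every
```

An unknown severity raises `KeyError`, which is deliberate in this sketch: an incident without a valid classification should fail loudly rather than fall back to a default cadence.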
5-Minute Triage Workflow
When an incident is detected, follow this triage workflow within the first 5 minutes.
┌─────────────────────────────────────────────────────────┐
│ MINUTE 0-1: Acknowledge and Classify                    │
│  • Acknowledge the alert or report                      │
│  • Assign severity (SEV1-SEV4)                          │
│  • Designate incident commander                         │
├─────────────────────────────────────────────────────────┤
│ MINUTE 1-2: Assess Scope                                │
│  • Check health endpoints for all services              │
│  • Check error rate and latency dashboards              │
│  • Determine: which services are affected?              │
├─────────────────────────────────────────────────────────┤
│ MINUTE 2-3: Identify Recent Changes                     │
│  • Check: was there a recent deployment?                │
│  • Check: any infrastructure changes?                   │
│  • Check: any external dependency issues?               │
├─────────────────────────────────────────────────────────┤
│ MINUTE 3-4: Initial Communication                       │
│  • Post in #incidents channel                           │
│  • Update status page if SEV1/SEV2                      │
│  • Page additional responders if needed                 │
├─────────────────────────────────────────────────────────┤
│ MINUTE 4-5: Begin Investigation or Mitigate             │
│  • If recent deploy: consider immediate rollback        │
│  • If not deploy-related: begin diagnostic commands     │
│  • Start incident timeline log                          │
└─────────────────────────────────────────────────────────┘
Quick health check command:
./skills/incident-response/scripts/health-check-all-services.sh \
  --output-dir ./incident-triage/
Incident Commander Role
The incident commander (IC) coordinates the response. They do NOT investigate directly.
IC responsibilities:
- Coordinate -- Assign tasks to responders, prevent duplicate work
- Communicate -- Post regular updates to stakeholders
- Decide -- Make go/no-go decisions on rollback, escalation, communication
- Track -- Maintain the incident timeline
- Close -- Declare the incident resolved and schedule the post-mortem
IC communication template (initial):
INCIDENT DECLARED: [Title]
Severity: [SEV1/SEV2/SEV3/SEV4]
Commander: [Name]
Start time: [UTC timestamp]
Impact: [What users are experiencing]
Status: Investigating
Next update: [Time]
IC communication template (update):
INCIDENT UPDATE: [Title]
Severity: [SEV level]
Duration: [Time since start]
Status: [Investigating/Identified/Mitigating/Resolved]
Current findings: [What we know]
Actions in progress: [What we are doing]
Next update: [Time]
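Filling in the update template by hand under pressure invites mistakes in the duration and next-update fields. As an illustration only, the sketch below renders the update message and computes both fields; `render_update` and `format_duration` are hypothetical helpers, and all datetimes are assumed to already be in UTC.

```python
from datetime import datetime, timedelta

VALID_STATUSES = ("Investigating", "Identified", "Mitigating", "Resolved")


def format_duration(elapsed: timedelta) -> str:
    """Format an elapsed time as whole hours and minutes."""
    total_minutes = int(elapsed.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}m"


def render_update(*, title: str, severity: str, started_at: datetime,
                  now: datetime, status: str, findings: str,
                  actions: str, cadence: timedelta) -> str:
    """Render the IC update template; datetimes are assumed to be UTC."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    next_update = (now + cadence).strftime("%Y-%m-%d %H:%M UTC")
    return "\n".join([
        f"INCIDENT UPDATE: {title}",
        f"Severity: {severity}",
        f"Duration: {format_duration(now - started_at)}",
        f"Status: {status}",
        f"Current findings: {findings}",
        f"Actions in progress: {actions}",
        f"Next update: {next_update}",
    ])
```

Passing `now` explicitly rather than reading the clock inside the function keeps the output reproducible, which also makes the helper easy to test.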
Investigation Steps
Follow these diagnostic steps based on the type of issue.
Application Errors (FastAPI)
# 1. Check application logs for errors
./skills/incident-response/scripts/fetch-logs.sh \
  --service backend \
  --since "15 minutes ago" \
  --output-dir ./incident-logs/
# 2. Check error rate from logs
docker logs app-backend --since 15m 2>&1 | grep -c "ERROR"
# 3. Check active connections and request patterns
curl -s http://localhost:8000/health/ready | jq .
# 4. Check if the issue is in a specific endpoint
docker logs app-backend --since 15m 2>&1 | \
  grep "ERROR" | \
  grep -oP '"path":"[^"]*"' | sort | uniq -c | sort -rn
# 5. Check Python process status
docker exec app-backend ps aux
docker exec app-backend python -c "import sys; print(sys.version)"
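Step 4 above ranks endpoints by error count with a `grep | sort | uniq -c` pipeline. The same ranking can be sketched in Python for cases where the logs have already been saved to a file. This assumes, as the grep pattern does, that each log line carries a `"path":"..."` field with no space after the colon; `errors_by_path` is a hypothetical helper name.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Same pattern as the grep -oP expression, with the path value captured.
PATH_RE = re.compile(r'"path":"([^"]*)"')


def errors_by_path(log_lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Count request paths on lines containing ERROR, most frequent first."""
    counts: Counter = Counter()
    for line in log_lines:
        if "ERROR" not in line:
            continue  # mirrors: grep "ERROR"
        counts.update(PATH_RE.findall(line))  # mirrors: grep -oP ... | uniq -c
    return counts.most_common()  # mirrors: sort -rn
```

For example, `errors_by_path(open("backend.log"))` would rank the paths in a saved log file, where `backend.log` is a placeholder filename.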
Database Issues (PostgreSQL)
# 1. Check database connectivity
docker exec app-db pg_isready -U postgres
# 2. Check active connections (connection pool exhaustion?)
docker exec app-db psql -U postgres -d app_prod -c "
SELECT count(*), state FROM pg_stat_activity
GROUP BY state ORDER BY count DESC;
"
# 3. Check for long-running queries (locks, deadlocks?)
docker exec app-db psql -U postgres -d app_prod -c "
SELECT pid, now() - pg_stat_activity.query_start AS duration,
query, state
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '30 seconds'
AND state != 'idle'
ORDER BY duration DESC;
"
# 4. Check for lock contention
docker exec app-db psql -U postgres -d app_prod -c "
SELECT blocked_locks.pid AS blocked_pid,
blocking_locks.pid AS blocking_pid,
blocked_activity.query AS blocked_query
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity
ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.relation = blocked_locks.relation
AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity
ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
"
# 5. Check disk space
docker exec app-db df -h /var/lib/postgresql/data
Redis Issues
# 1. Check Redis connectivity
docker exec app-redis redis-cli ping
# 2. Check memory usage
docker exec app-redis redis-cli info memory | grep used_memory_human
# 3. Check connected clients
docker exec app-redis redis-cli info clients | grep connected_clients
# 4. Check slow log
docker exec app-redis redis-cli slowlog get 10
# 5. Check keyspace
docker exec app-redis redis-cli info keyspace
Network and Infrastructure
# 1. Check DNS resolution
nslookup api.example.com
# 2. Check SSL certificate expiry
echo | openssl s_client -servername api.example.com -connect api.example.com:443 2>/dev/null | \
openssl x509 -noout -dates
# 3. Check container resource usage
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}"
# 4. Check disk space on host
df -h /
# 5. Check if dependent services are reachable
curl -sf https://external-api.example.com/health || echo "External API unreachable"
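The certificate check in step 2 prints `notBefore=` and `notAfter=` lines, which still have to be compared against the current date by eye. A minimal sketch of that comparison follows. It assumes the usual `openssl x509 -noout -dates` output format (for example `notAfter=Mar 31 23:59:59 2024 GMT`) and an English locale for month names; `parse_not_after` and `days_until_expiry` are hypothetical helpers.

```python
from datetime import datetime, timezone


def parse_not_after(openssl_output: str) -> datetime:
    """Extract the notAfter timestamp from `openssl x509 -noout -dates` output."""
    for line in openssl_output.splitlines():
        if line.startswith("notAfter="):
            value = line.split("=", 1)[1].strip()
            if not value.endswith(" GMT"):
                raise ValueError(f"unexpected timestamp format: {value!r}")
            # Drop the trailing " GMT" and attach UTC explicitly.
            parsed = datetime.strptime(value[:-4], "%b %d %H:%M:%S %Y")
            return parsed.replace(tzinfo=timezone.utc)
    raise ValueError("no notAfter line found")


def days_until_expiry(openssl_output: str, now: datetime) -> int:
    """Whole days until the certificate expires; negative once it has expired."""
    return (parse_not_after(openssl_output) - now).days
```

A negative result means the certificate has already expired, which corresponds to the "SSL expired" row in the mitigations table.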
Remediation Actions
Immediate Mitigations (apply within minutes)
| Issue | Mitigation | Command |
|---|---|---|
| Bad deployment | Rollback | ./scripts/deploy.sh --rollback --env production --version $PREV_SHA --output-dir ./results/ |
| Connection pool exhausted | Restart backend | docker restart app-backend |
| Long-running query | Kill query | SELECT pg_terminate_backend(<pid>); |
| Memory leak | Restart service | docker restart app-backend |
| Redis full | Flush non-critical keys | redis-cli --scan --pattern "cache:*" \| xargs redis-cli del |
| SSL expired | Apply new cert | Update cert in load balancer |
| Disk full | Remove unused Docker data (stopped containers, dangling images, build cache) | docker system prune -f |
Longer-Term Fixes (apply after stabilization)
- Fix the root cause in code -- Create a branch, fix, test, deploy through normal pipeline
- Add monitoring -- If the issue was not caught by existing alerts, add new alert rules
- Add tests -- Write regression tests for the failure scenario
- Update runbooks -- Document the new failure mode and remediation steps
Communication Protocol
Internal Communication
Channels:
- #incidents -- Active incident coordination (SEV1/SEV2)
- #incidents-low -- SEV3/SEV4 tracking
- #engineering -- Post-incident summaries
Rules:
- All communication happens in the designated incident channel
- Use threads for investigation details, keep main channel for status updates
- IC posts updates at the defined cadence (see severity table)
- Tag relevant people explicitly, do not assume they are watching
- Timestamp all significant findings and actions
External Communication (SEV1/SEV2)
Status page update template:
[Investigating] We are investigating reports of [issue description]. Users may experience [user-visible impact]. We will provide an update within [time].
[Identified] The issue has been identified as [brief description]. We are working on a fix. Estimated resolution: [time estimate].
[Resolved] The issue affecting [service] has been resolved. The root cause was [brief description]. We apologize for the disruption and will publish a detailed post-mortem.
Post-Mortem / RCA Framework
Conduct a blameless post-mortem within 48 hours of every SEV1/SEV2 incident. SEV3 incidents receive a lightweight review.
See references/post-mortem-template.md for the full template.
Post-mortem principles:
- Blameless -- Focus on systems and processes, not individuals
- Thorough -- Identify all contributing factors, not just the trigger
- Actionable -- Every finding must produce a concrete action item with an owner
- Timely -- Conduct within 48 hours while details are fresh
- Shared -- Publish to the entire engineering team
Post-mortem structure:
- Summary -- What happened, when, and what was the impact
- Timeline -- Minute-by-minute account of detection, investigation, mitigation
- Root cause -- The fundamental reason the incident occurred
- Contributing factors -- Other conditions that made the incident worse
- What went well -- Effective parts of the response
- What could be improved -- Gaps in detection, response, or tooling
- Action items -- Specific tasks with owners and due dates
Five Whys technique for root cause analysis:
Why did users see 500 errors?
  -> Because the backend service returned errors to the load balancer.
Why did the backend service return errors?
  -> Because database connections timed out.
Why did database connections time out?
  -> Because the connection pool was exhausted.
Why was the connection pool exhausted?
  -> Because a new endpoint opened connections without releasing them.
Why were connections not released?
  -> Because the endpoint was missing the async context manager for sessions.
Root cause: Missing async context manager for database sessions in new endpoint.
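The root cause in this example, a session opened without an async context manager, can be reproduced in a few lines. The sketch below uses a stand-in `FakePool` rather than a real database driver, so every name in it is hypothetical; it only illustrates why the context manager prevents the leak.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import Tuple


class FakePool:
    """Stand-in for a database connection pool (illustration only)."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.in_use = 0

    def acquire(self) -> int:
        if self.in_use >= self.size:
            raise RuntimeError("connection pool exhausted")
        self.in_use += 1
        return self.in_use

    def release(self) -> None:
        self.in_use -= 1


@asynccontextmanager
async def session(pool: FakePool):
    """Yield a connection and always return it to the pool, even on error."""
    conn = pool.acquire()
    try:
        yield conn
    finally:
        pool.release()


async def leaky_endpoint(pool: FakePool) -> None:
    pool.acquire()          # bug: the connection is never released
    await asyncio.sleep(0)  # stands in for query work


async def safe_endpoint(pool: FakePool) -> None:
    async with session(pool):   # released when the block exits
        await asyncio.sleep(0)


async def demo(requests: int = 3) -> Tuple[int, int]:
    """Return connections still held after `requests` calls to each endpoint."""
    leaky, safe = FakePool(size=5), FakePool(size=5)
    for _ in range(requests):
        await leaky_endpoint(leaky)
        await safe_endpoint(safe)
    return leaky.in_use, safe.in_use
```

Running `asyncio.run(demo())` shows the leaky pool holding one connection per request while the safe pool returns to zero. Once the number of requests exceeds the pool size, the leaky endpoint raises, which is the exhaustion described in the third "why".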
Generate a structured incident report:
python skills/incident-response/scripts/generate-incident-report.py \
  --title "Database connection pool exhaustion" \
  --severity SEV2 \
  --start-time "2024-01-15T14:30:00Z" \
  --end-time "2024-01-15T15:15:00Z" \
  --output-dir ./post-mortems/
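The start and end timestamps passed to the report script also determine two figures a post-mortem needs: the incident duration and the 48-hour review deadline. The sketch below computes both. It is not a description of what `generate-incident-report.py` does internally; `incident_window` is a hypothetical helper, and counting the 48 hours from the end time is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

POST_MORTEM_WINDOW = timedelta(hours=48)


def parse_utc(timestamp: str) -> datetime:
    """Parse a timestamp such as 2024-01-15T14:30:00Z as an aware UTC datetime."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def incident_window(start: str, end: str) -> Tuple[timedelta, datetime]:
    """Return (incident duration, post-mortem deadline).

    Assumption: the 48-hour window is counted from the incident end time.
    """
    started, ended = parse_utc(start), parse_utc(end)
    if ended < started:
        raise ValueError("end time is before start time")
    return ended - started, ended + POST_MORTEM_WINDOW
```

With the timestamps from the command above, the duration is 45 minutes and the post-mortem would be due by 2024-01-17T15:15:00Z.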
Incident Response Scripts
| Script | Purpose | Usage |
|---|---|---|
| scripts/fetch-logs.sh | Fetch recent logs from services | ./scripts/fetch-logs.sh --service backend --since "30m" --output-dir ./logs/ |
| scripts/health-check-all-services.sh | Check health of all services | ./scripts/health-check-all-services.sh --output-dir ./health/ |
| scripts/generate-incident-report.py | Generate structured incident report | python scripts/generate-incident-report.py --title "..." --severity SEV1 --output-dir ./reports/ |
Quick Reference: Common Incident Patterns
| Pattern | Symptom | Likely Cause | First Action |
|---|---|---|---|
| 502/503 errors | Users see error page | Backend crashed or overloaded | Check docker ps, restart if needed |
| Slow responses | High latency, timeouts | DB queries, external API | Check slow query log, DB connections |
| Partial failures | Some endpoints fail | Single dependency down | Check individual service health |
| Memory growth | OOM kills, restarts | Memory leak | Check docker stats, restart |
| Error spike after deploy | Errors start exactly at deploy time | Bug in new code | Rollback immediately |
| Gradual degradation | Slowly worsening metrics | Resource exhaustion, connection leak | Check resource usage trends |
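The "error spike after deploy" pattern relies on comparing error volume immediately before and after the deployment. One rough way to make that comparison explicit is sketched below. The 10-minute window and the 3x ratio are illustrative assumptions, not thresholds defined by this skill, and `error_spike_after_deploy` is a hypothetical helper.

```python
from datetime import datetime, timedelta
from typing import Iterable


def error_spike_after_deploy(error_times: Iterable[datetime],
                             deploy_time: datetime,
                             window: timedelta = timedelta(minutes=10),
                             ratio: float = 3.0) -> bool:
    """Return True if errors in the window after the deploy are at least
    `ratio` times the errors in the same-length window before it.

    The default window and ratio are illustrative assumptions.
    """
    times = list(error_times)
    before = sum(1 for t in times if deploy_time - window <= t < deploy_time)
    after = sum(1 for t in times if deploy_time <= t < deploy_time + window)
    # Treat a quiet baseline as 1 error so a single post-deploy error
    # does not count as a spike.
    return after >= ratio * max(before, 1)
```

A `True` result supports the "Rollback immediately" first action; a `False` result suggests looking at the other patterns in the table before rolling back.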