Incident Response Skill

When production breaks, every minute costs money and trust. This skill defines how to respond systematically to minimize damage.

Severity Levels

| Level | Definition | Response Time | Examples |
|-------|------------|---------------|----------|
| SEV-1 | Complete outage, data breach, security incident | Immediate (<15 min) | Site down, payment processing broken, user data exposed |
| SEV-2 | Major feature broken, significant user impact | <1 hour | Auth failing, core workflow broken, payment errors >5% |
| SEV-3 | Minor feature broken, workaround exists | <4 hours | Non-critical page broken, cosmetic issues, edge case bugs |
| SEV-4 | Low impact, can wait | Next business day | Typos, minor UI issues, performance slightly degraded |

Incident Response Procedure

1. DETECT (0-5 minutes)

How incidents are discovered:

  • Monitoring alerts (Sentry, Vercel, custom)
  • User reports
  • Health check failures
  • Error rate spikes

First responder actions:

bash
# Check health immediately (time out instead of hanging on an unresponsive server)
curl -sS --max-time 10 https://your-app.vercel.app/api/health | jq .

# Check recent deployments
vercel ls --prod

# Check error tracking
# Open Sentry dashboard
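
The health check above prints the response body, but during triage the HTTP status alone is often enough to tell "app is erroring" from "app is unreachable". A minimal sketch, assuming `curl` is installed; `health_verdict` and `check_health` are hypothetical helper names, and the 10-second timeout is an arbitrary choice:

```shell
# Map an HTTP status code to a coarse health verdict.
health_verdict() {
  case "$1" in
    2??) echo "healthy" ;;
    5??) echo "failing" ;;
    000) echo "unreachable" ;;   # curl reports 000 when no response arrived
    *)   echo "unexpected" ;;
  esac
}

# Probe a URL with a hard timeout so a hung server cannot stall triage.
check_health() {
  code=$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$1") || true
  echo "$1 -> HTTP $code ($(health_verdict "$code"))"
}

# Usage: check_health https://your-app.vercel.app/api/health
```

An "unreachable" verdict usually points at DNS, the platform, or the deployment itself rather than application code, which helps narrow the next step.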

2. ASSESS (5-15 minutes)

Determine severity (work down the list; the first "yes" sets the level):

| Question | If Yes → |
|----------|----------|
| Is the site completely down? | SEV-1 |
| Is user data at risk? | SEV-1 |
| Is payment processing broken? | SEV-1 |
| Can users not log in? | SEV-2 |
| Is a major feature broken? | SEV-2 |
| Is there a workaround? | SEV-3 |
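
The questions above can be read as a first-match rule. A minimal sketch of that reading; `classify_severity` is a hypothetical helper, and falling back to SEV-4 when no question matches is an assumption based on the severity definitions, not something the table states:

```shell
# Hypothetical helper: pass six y/n answers in the same order as the table.
# $1 site down, $2 data at risk, $3 payments broken,
# $4 login broken, $5 major feature broken, $6 workaround exists
classify_severity() {
  if [ "$1" = y ] || [ "$2" = y ] || [ "$3" = y ]; then
    echo "SEV-1"
  elif [ "$4" = y ] || [ "$5" = y ]; then
    echo "SEV-2"
  elif [ "$6" = y ]; then
    echo "SEV-3"
  else
    echo "SEV-4"   # assumption: nothing matched, so treat it as low impact
  fi
}

# Usage: login is broken and a workaround exists
classify_severity n n n y n y
```

Because the first match wins, a broken login with a workaround still classifies as SEV-2, not SEV-3.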

Gather context:

bash
# Recent commits
git log --oneline -10

# Recent deploys
vercel ls --prod

# Error logs
# Check Vercel → Logs → Filter by error

# Database status
# Check Supabase/provider dashboard

3. COMMUNICATE (Immediately after assessment)

Internal notification template:

code
🚨 INCIDENT: [Brief description]
Severity: SEV-[1/2/3]
Impact: [What's broken]
Status: Investigating
ETA: [If known]
Point person: [Name]
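
Filling the template by hand under pressure invites omissions. A minimal sketch that renders it from arguments; `incident_notice` is a hypothetical helper, and defaulting a blank ETA to "unknown" is an assumption:

```shell
# Hypothetical helper: render the internal notification.
# $1 description, $2 severity number, $3 impact, $4 ETA (may be empty), $5 point person
incident_notice() {
  printf '🚨 INCIDENT: %s\n' "$1"
  printf 'Severity: SEV-%s\n' "$2"
  printf 'Impact: %s\n' "$3"
  printf 'Status: Investigating\n'
  printf 'ETA: %s\n' "${4:-unknown}"
  printf 'Point person: %s\n' "$5"
}

# Usage: incident_notice "Checkout returns 500" 1 "All payments failing" "" "Alex"
```

The output can be pasted into whatever channel the team uses for incidents.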

For SEV-1/SEV-2:

  • Update the status page (if one exists)
  • Notify stakeholders immediately
  • Consider social media acknowledgment

4. MITIGATE (15-60 minutes)

Decision tree:

code
Is it a recent deploy?
├── YES → Can we rollback?
│   ├── YES → ROLLBACK (fastest)
│   └── NO → Why not?
│       ├── Database migration not reversible → Hotfix forward
│       └── Other reason → Document and hotfix
└── NO → What changed?
    ├── External dependency → Check provider status, implement fallback
    ├── Traffic spike → Scale up, enable rate limiting
    ├── Database issue → Check connections, indexes, query performance
    └── Unknown → Deep investigation
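
The same tree can be written as a small function, which makes the branch order explicit. This is only a sketch: `mitigation_step` and its argument keywords (`dependency`, `traffic`, `database`) are hypothetical names chosen for illustration.

```shell
# Hypothetical helper mirroring the decision tree.
# $1: recent deploy? (y/n)   $2: rollback possible? (y/n)
# $3: what changed if no deploy is involved: dependency | traffic | database | unknown
mitigation_step() {
  if [ "$1" = y ]; then
    if [ "$2" = y ]; then
      echo "rollback"
    else
      echo "hotfix forward and document why rollback was not possible"
    fi
    return
  fi
  case "$3" in
    dependency) echo "check provider status, implement fallback" ;;
    traffic)    echo "scale up, enable rate limiting" ;;
    database)   echo "check connections, indexes, query performance" ;;
    *)          echo "deep investigation" ;;
  esac
}

# Usage: no recent deploy, traffic spike suspected
mitigation_step n n traffic
```

The deploy check comes first because rollback is the fastest mitigation when it applies.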

Rollback procedure (Vercel):

bash
# List recent deployments
vercel ls --prod

# Identify last known good deployment
# In Vercel Dashboard: Deployments → Find stable version → Promote to Production

# OR via CLI
vercel rollback [deployment-url]

Database rollback (if needed):

bash
# Check migration history
bunx prisma migrate status

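# If a migration failed partway: revert its schema changes manually
# (for example with a down SQL script), then mark it as rolled back.
# Note: this command only updates Prisma's migration history; it does not undo schema changes.
bunx prisma migrate resolve --rolled-back [migration-name]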

# If not reversible, restore from backup
# Contact database provider or use point-in-time recovery

5. RESOLVE (Until fixed)

Hotfix process (if rollback is not possible):

bash
# Create hotfix branch
git checkout -b hotfix/incident-[date]-[description]

# Make minimal fix
# ONLY fix the immediate problem
# No refactoring, no "while we're here"

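# Test locally
bun test
bun run build

# Commit only the files touched by the fix
git add [changed-files]
git commit -m "hotfix: [description]"

# Push the branch (SEV-1 uses an expedited review instead of the normal PR process)
git push origin hotfix/incident-[date]-[description]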

# Create expedited PR
gh pr create --title "🚨 HOTFIX: [description]" --body "SEV-1 incident fix. Details: [link to incident doc]"

# Get emergency approval and merge

6. VERIFY (After fix deployed)

bash
# Confirm fix is live
curl -sS --max-time 10 https://your-app.vercel.app/api/health | jq .

# Test the specific broken functionality
# [Run relevant manual tests]

# Monitor error rates for 15 minutes
# Check Sentry/Vercel for new errors
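
Watching for 15 minutes is easy to abandon halfway, so the health probe can be repeated automatically. A minimal sketch: `watch_probe` is a hypothetical helper that runs any command a fixed number of times and stops at the first failure. The 30 checks at 30-second intervals in the usage line are an assumption chosen to cover roughly the 15-minute window, and the probe only complements the Sentry/Vercel error dashboards.

```shell
# Run a probe command repeatedly; stop at the first failure.
# $1: number of checks, $2: seconds between checks, remaining args: probe command
watch_probe() {
  checks=$1
  interval=$2
  shift 2
  i=1
  while [ "$i" -le "$checks" ]; do
    if ! "$@"; then
      echo "check $i/$checks failed"
      return 1
    fi
    i=$((i + 1))
    # Sleep between checks, but not after the last one.
    if [ "$i" -le "$checks" ]; then
      sleep "$interval"
    fi
  done
  echo "all $checks checks passed"
}

# Usage (assumes curl is installed):
# watch_probe 30 30 curl -sf --max-time 10 -o /dev/null https://your-app.vercel.app/api/health
```

Because the function exits non-zero on the first failed check, it can be chained with a notification command of your choice.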

7. CLOSE & DOCUMENT (Within 24 hours)

Update status:

code
✅ RESOLVED: [Brief description]
Duration: [X hours Y minutes]
Impact: [What was affected]
Root cause: [Brief explanation]
Fix: [What was done]
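
The Duration field is simple arithmetic on the detection and resolution times. A minimal sketch, assuming both times are recorded as Unix timestamps (for example with `date +%s`); `incident_duration` is a hypothetical helper:

```shell
# Format the gap between two Unix timestamps as "X hours Y minutes".
# $1: start (seconds since epoch), $2: end (seconds since epoch)
incident_duration() {
  elapsed=$(( $2 - $1 ))
  echo "$(( elapsed / 3600 )) hours $(( (elapsed % 3600) / 60 )) minutes"
}

# Usage: record start=$(date +%s) at detection, then after resolution:
# incident_duration "$start" "$(date +%s)"
```

Seconds are truncated, not rounded, so an incident lasting 59 seconds reports as 0 hours 0 minutes.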

Post-incident document (for SEV-1/SEV-2):

markdown
# Incident Report: [Date] - [Title]

## Summary

- **Duration:** [Start time] - [End time] ([X hours Y minutes])
- **Severity:** SEV-[N]
- **Impact:** [Number of users affected, revenue impact if known]
- **Root cause:** [One sentence]

## Timeline

- [HH:MM] Incident detected via [source]
- [HH:MM] First responder acknowledged
- [HH:MM] Severity assessed as SEV-[N]
- [HH:MM] [Action taken]
- [HH:MM] Fix deployed
- [HH:MM] Verified resolved

## Root Cause Analysis

[Detailed explanation of what went wrong and why]

## What Went Well

- [List positive aspects of response]

## What Went Poorly

- [List areas for improvement]

## Action Items

- [ ] [Preventive measure 1] - Owner: [Name] - Due: [Date]
- [ ] [Preventive measure 2] - Owner: [Name] - Due: [Date]

## Lessons Learned

[What the team should remember for next time]

Quick Reference Card

SEV-1 Checklist

code
[ ] Check health status
[ ] Identify whether a recent deploy is the cause
[ ] Attempt rollback
[ ] Notify stakeholders
[ ] Update status page
[ ] Monitor continuously until resolved
[ ] Document within 24 hours

Common Issues & Quick Fixes

| Symptom | Likely Cause | Quick Fix |
|---------|--------------|-----------|
| 500 errors everywhere | Bad deploy | Rollback |
| Database connection errors | Pool exhausted | Restart, increase pool |
| Slow responses (>5s) | N+1 queries, missing index | Add index, optimize query |
| Auth not working | Session/JWT issue | Check env vars, restart |
| Payment failing | Stripe webhook issue | Check Stripe dashboard |
| Site partially down | Single route broken | Check specific route code |

Emergency Contacts

| Service | Dashboard | Support |
|---------|-----------|---------|
| Vercel | vercel.com/dashboard | support@vercel.com |
| Supabase | app.supabase.com | support@supabase.com |
| Stripe | dashboard.stripe.com | support@stripe.com |
| Sentry | sentry.io | support@sentry.io |

Prevention

After every SEV-1/SEV-2, ask:

  1. Could we have detected this sooner? → Add monitoring/alerting
  2. Could we have prevented this? → Add tests, validation
  3. Could we have recovered faster? → Improve runbooks, automation
  4. Did our process work? → Update this document

Never Do During an Incident

  • ❌ Make unrelated changes ("while we're here...")
  • ❌ Deploy untested code
  • ❌ Blame individuals
  • ❌ Hide or minimize the incident
  • ❌ Skip documentation
  • ❌ Forget to update status page
  • ❌ Go silent during long incidents