Incident Response Skill
When production breaks, every minute costs money and trust. This skill defines how to respond systematically to minimize damage.
Severity Levels
| Level | Definition | Response Time | Examples |
|---|---|---|---|
| SEV-1 | Complete outage, data breach, security incident | Immediate (<15 min) | Site down, payment processing broken, user data exposed |
| SEV-2 | Major feature broken, significant user impact | <1 hour | Auth failing, core workflow broken, payment errors >5% |
| SEV-3 | Minor feature broken, workaround exists | <4 hours | Non-critical page broken, cosmetic issues, edge case bugs |
| SEV-4 | Low impact, can wait | Next business day | Typos, minor UI issues, performance slightly degraded |
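The response-time column can be encoded as a lookup so that paging or alert-routing scripts apply the same targets as the table. This is a minimal sketch in POSIX shell; the function name `response_target_minutes` is hypothetical, and returning the label `next-business-day` for SEV-4 is an illustrative choice, not part of this skill.

```bash
# Hypothetical helper: map a severity level to its response target in minutes,
# mirroring the severity table. SEV-4 has no minute target, so a label is returned.
response_target_minutes() {
  case "$1" in
    SEV-1) echo 15 ;;
    SEV-2) echo 60 ;;
    SEV-3) echo 240 ;;
    SEV-4) echo "next-business-day" ;;
    *)     echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

For example, `response_target_minutes SEV-2` prints the target for a major-feature outage, and an unrecognized level returns a non-zero exit code so callers can fail loudly.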
Incident Response Procedure
1. DETECT (0-5 minutes)
How incidents are discovered:
- Monitoring alerts (Sentry, Vercel, custom)
- User reports
- Health check failures
- Error rate spikes
First responder actions:
```bash
# Check health immediately
curl -s https://your-app.vercel.app/api/health | jq

# Check recent deployments
vercel ls --prod

# Check error tracking
# Open Sentry dashboard
```
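The health check can also be wrapped so that it times out and returns a non-zero exit code on failure, which makes it usable from cron or CI. This is a sketch under assumptions: it treats HTTP 200 from `/api/health` as healthy, and the function names `health_state` and `check_health` are hypothetical.

```bash
# Hypothetical helper: classify the HTTP status code returned by the health endpoint.
# Assumption: 200 means healthy. curl reports 000 when no response was received.
health_state() {
  case "$1" in
    200) echo "healthy" ;;
    000) echo "unreachable" ;;
    *)   echo "unhealthy" ;;
  esac
}

# Probe a URL with a 10-second cap so a hung server cannot stall the responder.
# Prints the state and returns 0 only when the endpoint is healthy.
check_health() {
  url="$1"
  code=$(curl -s -o /dev/null -m 10 -w '%{http_code}' "$url") || code=000
  state=$(health_state "$code")
  echo "$state"
  [ "$state" = "healthy" ]
}
```

Usage: `check_health https://your-app.vercel.app/api/health || echo "start incident response"`.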
2. ASSESS (5-15 minutes)
Determine severity:
| Question | If Yes → |
|---|---|
| Is the site completely down? | SEV-1 |
| Is user data at risk? | SEV-1 |
| Is payment processing broken? | SEV-1 |
| Are users unable to log in? | SEV-2 |
| Is a major feature broken? | SEV-2 |
| Is there a workaround? | SEV-3 |
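The question table reads as a first-match rule list: check the questions top to bottom and stop at the first "yes". A minimal sketch of that rule in POSIX shell follows. The function name and argument order are hypothetical, and falling back to SEV-4 when every answer is "no" is an assumption, since the table does not cover that case.

```bash
# Hypothetical helper. Arguments are "yes" or "no", in the table's order:
#   $1 site completely down    $2 user data at risk   $3 payment processing broken
#   $4 users cannot log in     $5 major feature broken $6 workaround exists
classify_severity() {
  if [ "$1" = yes ] || [ "$2" = yes ] || [ "$3" = yes ]; then
    echo "SEV-1"
  elif [ "$4" = yes ] || [ "$5" = yes ]; then
    echo "SEV-2"
  elif [ "$6" = yes ]; then
    echo "SEV-3"
  else
    # Assumption: nothing matched, so treat it as low impact.
    echo "SEV-4"
  fi
}
```

For instance, `classify_severity no no no yes no yes` yields SEV-2: a login failure outranks the existence of a workaround because the earlier rule matches first.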
Gather context:
```bash
# Recent commits
git log --oneline -10

# Recent deploys
vercel ls --prod

# Error logs
# Check Vercel → Logs → Filter by error

# Database status
# Check Supabase/provider dashboard
```
3. COMMUNICATE (Immediately after assessment)
Internal notification template:
```
🚨 INCIDENT: [Brief description]
Severity: SEV-[1/2/3]
Impact: [What's broken]
Status: Investigating
ETA: [If known]
Point person: [Name]
```
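Filling in the notification by hand under pressure invites omissions, so the template can be rendered by a small function instead. This sketch is illustrative only: the function name `incident_notice` and its argument order are hypothetical, and the ETA defaults to "Unknown" when it is not supplied.

```bash
# Hypothetical helper: render the internal notification.
# Arguments: description, severity number, impact, point person, optional ETA.
incident_notice() {
  printf '🚨 INCIDENT: %s\n' "$1"
  printf 'Severity: SEV-%s\n' "$2"
  printf 'Impact: %s\n' "$3"
  printf 'Status: Investigating\n'
  printf 'ETA: %s\n' "${5:-Unknown}"
  printf 'Point person: %s\n' "$4"
}
```

Usage: `incident_notice "Checkout failing" 1 "All card payments" "Alex"` prints the six-line message, ready to paste or pipe into whatever chat tool the team uses.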
For SEV-1/SEV-2:
- Update status page (if exists)
- Notify stakeholders immediately
- Consider social media acknowledgment
4. MITIGATE (15-60 minutes)
Decision tree:
```
Is it a recent deploy?
├── YES → Can we roll back?
│   ├── YES → ROLLBACK (fastest)
│   └── NO → Why not?
│       ├── Database migration not reversible → Hotfix forward
│       └── Other reason → Document and hotfix
└── NO → What changed?
    ├── External dependency → Check provider status, implement fallback
    ├── Traffic spike → Scale up, enable rate limiting
    ├── Database issue → Check connections, indexes, query performance
    └── Unknown → Deep investigation
```
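The decision tree can be expressed as a function, which is useful when it needs to be embedded in a runbook bot or checklist script. This is a sketch with hypothetical names: `mitigation_action` and the cause keywords `dependency`, `traffic`, and `database` are illustrative labels for the tree's branches.

```bash
# Hypothetical helper: walk the mitigation decision tree.
#   $1 recent deploy? (yes/no)   $2 rollback possible? (yes/no)
#   $3 suspected cause when no recent deploy: dependency | traffic | database | anything else
mitigation_action() {
  if [ "$1" = yes ]; then
    if [ "$2" = yes ]; then
      echo "rollback"
    else
      # Covers both the irreversible-migration and "other reason" branches.
      echo "hotfix forward"
    fi
    return 0
  fi
  case "$3" in
    dependency) echo "check provider status, implement fallback" ;;
    traffic)    echo "scale up, enable rate limiting" ;;
    database)   echo "check connections, indexes, query performance" ;;
    *)          echo "deep investigation" ;;
  esac
}
```

Rollback is checked first because it is the fastest path in the tree; every other branch requires diagnosis before any action.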
Rollback procedure (Vercel):
```bash
# List recent deployments
vercel ls --prod

# Identify last known good deployment
# In Vercel Dashboard: Deployments → Find stable version → Promote to Production

# OR via CLI
vercel rollback [deployment-url]
```
Database rollback (if needed):
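```bash
# Check migration history
bunx prisma migrate status

# If a migration failed and its changes have been reverted, mark it as rolled back.
# Note: this only updates Prisma's migration history; it does not undo schema changes,
# so revert those first (for example with a manually written down script).
bunx prisma migrate resolve --rolled-back [migration-name]

# If not reversible, restore from backup
# Contact database provider or use point-in-time recovery
```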
5. RESOLVE (Until fixed)
Hotfix process (if rollback not possible):
```bash
# Create hotfix branch
git checkout -b hotfix/incident-[date]-[description]

# Make minimal fix
# ONLY fix the immediate problem
# No refactoring, no "while we're here"

# Test locally
bun test
bun run build

# Commit the fix (-a stages tracked files only; use git add for new files)
git commit -am "hotfix: [description]"

# Push the branch (SEV-1 uses the expedited PR below instead of the normal review process)
git push origin hotfix/incident-[date]-[description]

# Create expedited PR
gh pr create --title "🚨 HOTFIX: [description]" --body "SEV-1 incident fix. Details: [link to incident doc]"

# Get emergency approval and merge
```
6. VERIFY (After fix deployed)
```bash
# Confirm fix is live
curl -s https://your-app.vercel.app/api/health | jq

# Test the specific broken functionality
# [Run relevant manual tests]

# Monitor error rates for 15 minutes
# Check Sentry/Vercel for new errors
```
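The 15-minute monitoring window can be partly automated by repeating a probe at a fixed interval and stopping at the first failure. This is a minimal sketch, not a replacement for watching Sentry or Vercel; the function name `watch_probe` is hypothetical, and it accepts any command as the probe.

```bash
# Hypothetical helper: run a probe command repeatedly and fail fast.
#   $1 number of checks   $2 seconds between checks   remaining args: the probe command
watch_probe() {
  checks="$1"
  interval="$2"
  shift 2
  i=1
  while [ "$i" -le "$checks" ]; do
    if ! "$@" >/dev/null 2>&1; then
      echo "probe failed on check $i of $checks"
      return 1
    fi
    # Sleep between checks, but not after the last one.
    if [ "$i" -lt "$checks" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
  echo "all $checks checks passed"
}
```

Usage: `watch_probe 30 30 curl -sf -m 10 https://your-app.vercel.app/api/health` runs 30 checks half a minute apart, which covers roughly 15 minutes. The `-f` flag makes curl exit non-zero on HTTP error responses, so a 500 counts as a failed probe.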
7. CLOSE & DOCUMENT (Within 24 hours)
Update status:
```
✅ RESOLVED: [Brief description]
Duration: [X hours Y minutes]
Impact: [What was affected]
Root cause: [Brief explanation]
Fix: [What was done]
```
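The Duration field is easy to get wrong when computed by hand across hour boundaries. A small sketch that derives it from two Unix timestamps is shown here; the function name `format_duration` is hypothetical, and the compact `Xh Ym` output format is an illustrative choice.

```bash
# Hypothetical helper: format the time between two Unix timestamps (in seconds)
# as hours and minutes, rounding down to the whole minute.
format_duration() {
  start="$1"
  end="$2"
  total_min=$(( (end - start) / 60 ))
  echo "$(( total_min / 60 ))h $(( total_min % 60 ))m"
}
```

Usage: record `started=$(date +%s)` when the incident is declared, then call `format_duration "$started" "$(date +%s)"` at resolution. This assumes a `date` that supports `+%s`, as GNU and BSD versions do.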
Post-incident document (for SEV-1/SEV-2):
```markdown
# Incident Report: [Date] - [Title]

## Summary
- **Duration:** [Start time] - [End time] ([X hours Y minutes])
- **Severity:** SEV-[N]
- **Impact:** [Number of users affected, revenue impact if known]
- **Root cause:** [One sentence]

## Timeline
- [HH:MM] Incident detected via [source]
- [HH:MM] First responder acknowledged
- [HH:MM] Severity assessed as SEV-[N]
- [HH:MM] [Action taken]
- [HH:MM] Fix deployed
- [HH:MM] Verified resolved

## Root Cause Analysis
[Detailed explanation of what went wrong and why]

## What Went Well
- [List positive aspects of response]

## What Went Poorly
- [List areas for improvement]

## Action Items
- [ ] [Preventive measure 1] - Owner: [Name] - Due: [Date]
- [ ] [Preventive measure 2] - Owner: [Name] - Due: [Date]

## Lessons Learned
[What the team should remember for next time]
```
Quick Reference Card
SEV-1 Checklist
```
[ ] Health check status
[ ] Identify if recent deploy
[ ] Attempt rollback
[ ] Notify stakeholders
[ ] Update status page
[ ] Monitor continuously until resolved
[ ] Document within 24 hours
```
Common Issues & Quick Fixes
| Symptom | Likely Cause | Quick Fix |
|---|---|---|
| 500 errors everywhere | Bad deploy | Rollback |
| Database connection errors | Pool exhausted | Restart, increase pool |
| Slow responses (>5s) | N+1 queries, missing index | Add index, optimize query |
| Auth not working | Session/JWT issue | Check env vars, restart |
| Payment failing | Stripe webhook issue | Check Stripe dashboard |
| Site partially down | Single route broken | Check specific route code |
Emergency Contacts
| Service | Dashboard | Support |
|---|---|---|
| Vercel | vercel.com/dashboard | support@vercel.com |
| Supabase | app.supabase.com | support@supabase.com |
| Stripe | dashboard.stripe.com | support@stripe.com |
| Sentry | sentry.io | support@sentry.io |
Prevention
After every SEV-1/SEV-2, ask:
- Could we have detected this sooner? → Add monitoring/alerting
- Could we have prevented this? → Add tests, validation
- Could we have recovered faster? → Improve runbooks, automation
- Did our process work? → Update this document
Never Do During Incident
- ❌ Make unrelated changes ("while we're here...")
- ❌ Deploy untested code
- ❌ Blame individuals
- ❌ Hide or minimize the incident
- ❌ Skip documentation
- ❌ Forget to update status page
- ❌ Go silent during long incidents