Incident Management
Provide end-to-end incident management guidance covering detection, response, communication, and learning. Emphasizes SRE culture, blameless post-mortems, and structured processes for high-reliability operations.
When to Use This Skill
Apply this skill when:
- •Setting up incident response processes for a team
- •Designing on-call rotations and escalation policies
- •Creating runbooks for common failure scenarios
- •Conducting blameless post-mortems after incidents
- •Implementing incident communication protocols (internal and external)
- •Choosing incident management tooling and platforms
- •Improving MTTR and incident frequency metrics
Core Principles
Incident Management Philosophy
Declare Early and Often: Do not wait for certainty. Declaring an incident enables coordination, can be downgraded if needed, and prevents delayed response.
Mitigation First, Root Cause Later: Stop customer impact immediately (rollback, disable feature, failover). Debug and fix root cause after stability restored.
Blameless Culture: Assume good intentions. Focus on how systems failed, not who failed. Create psychological safety for honest learning.
Clear Command Structure: Assign Incident Commander (IC) to own coordination. IC delegates tasks but does not do hands-on debugging.
Communication is Critical: Internal coordination via dedicated channels, external transparency via status pages. Update stakeholders every 15-30 minutes during critical incidents.
Severity Classification
Standard severity levels with response times:
SEV0 (P0) - Critical Outage:
- •Impact: Complete service outage, critical data loss, payment processing down
- •Response: Page immediately 24/7, all hands on deck, executive notification
- •Example: API completely down, entire customer base affected
SEV1 (P1) - Major Degradation:
- •Impact: Major functionality degraded, significant customer subset affected
- •Response: Page during business hours, escalate off-hours, IC assigned
- •Example: 15% error rate, critical feature unavailable
SEV2 (P2) - Minor Issues:
- •Impact: Minor functionality impaired, edge case bug, small user subset
- •Response: Email/Slack alert, next business day response
- •Example: UI glitch, non-critical feature slow
SEV3 (P3) - Low Impact:
- •Impact: Cosmetic issues, no customer functionality affected
- •Response: Ticket queue, planned sprint
- •Example: Visual inconsistency, documentation error
For detailed severity decision framework and interactive classifier, see references/severity-classification.md.
Incident Roles
Incident Commander (IC):
- •Owns overall incident response and coordination
- •Makes strategic decisions (rollback vs. debug, when to escalate)
- •Delegates tasks to responders (does NOT do hands-on debugging)
- •Declares incident resolved when stability confirmed
Communications Lead:
- •Posts status updates to internal and external channels
- •Coordinates with stakeholders (executives, product, support)
- •Drafts post-incident customer communication
- •Cadence: Every 15-30 minutes for SEV0/SEV1
Subject Matter Experts (SMEs):
- •Hands-on debugging and mitigation
- •Execute runbooks and implement fixes
- •Provide technical context to IC
Scribe:
- •Documents timeline, actions, decisions in real-time
- •Records incident notes for post-mortem reconstruction
Assign roles based on severity:
- •SEV2/SEV3: Single responder
- •SEV1: IC + SME(s)
- •SEV0: IC + Communications Lead + SME(s) + Scribe
For detailed role responsibilities, see references/incident-roles.md.
On-Call Management
Rotation Patterns
Primary + Secondary:
- •Primary: First responder
- •Secondary: Backup if primary doesn't ack within 5 minutes
- •Rotation length: 1 week (optimal balance)
Follow-the-Sun (24/7):
- •Team A: US hours, Team B: Europe hours, Team C: Asia hours
- •Benefit: No night shifts, improved work-life balance
- •Requires: Multiple global teams
Tiered Escalation:
- •Tier 1: Junior on-call (common issues, runbook-driven)
- •Tier 2: Senior on-call (complex troubleshooting)
- •Tier 3: Team lead/architect (critical decisions)
Best Practices
- •Rotation length: 1 week per rotation
- •Handoff ceremony: 30-minute call to discuss active issues
- •Compensation: On-call stipend + time off after major incidents
- •Tooling: PagerDuty, Opsgenie, or incident.io
- •Limits: Max 2-3 pages per night; escalate if exceeded
Incident Response Workflow
Standard incident lifecycle:
Detection → Triage → Declaration → Investigation ↓ Mitigation → Resolution → Monitoring → Closure ↓ Post-Mortem (within 48 hours)
Key Decision Points
When to Declare: When in doubt, declare (can always downgrade severity)
When to Escalate:
- •No progress after 30 minutes
- •Severity increases (SEV2 → SEV1)
- •Specialized expertise needed
When to Close:
- •Issue resolved and stable for 30+ minutes
- •Monitoring shows all metrics at baseline
- •No customer-reported issues
For complete workflow details, see references/incident-workflow.md.
Communication Protocols
Internal Communication
Incident Slack Channel:
- •Format:
#incident-YYYY-MM-DD-topic-description - •Pin: Severity, IC name, status update template, runbook links
War Room: Video call for SEV0/SEV1 requiring real-time voice coordination
Status Update Cadence:
- •SEV0: Every 15 minutes
- •SEV1: Every 30 minutes
- •SEV2: Every 1-2 hours or at major milestones
External Communication
Status Page:
- •Tools: Statuspage.io, Instatus, custom
- •Stages: Investigating → Identified → Monitoring → Resolved
- •Transparency: Acknowledge issue publicly, provide ETAs when possible
Customer Email:
- •When: SEV0/SEV1 affecting customers
- •Timing: Within 1 hour (acknowledge), post-resolution (full details)
- •Tone: Apologetic, transparent, action-oriented
Regulatory Notifications:
- •Data Breach: GDPR requires notification within 72 hours
- •Financial Services: Immediate notification to regulators
- •Healthcare: HIPAA breach notification rules
For communication templates, see examples/communication-templates.md.
Runbooks and Playbooks
Runbook Structure
Every runbook should include:
- •Trigger: Alert conditions that activate this runbook
- •Severity: Expected severity level
- •Prerequisites: System state requirements
- •Steps: Numbered, executable commands (copy-pasteable)
- •Verification: How to confirm fix worked
- •Rollback: How to undo if steps fail
- •Owner: Team/person responsible
- •Last Updated: Date of last revision
Best Practices
- •Executable: Commands copy-pasteable, not just descriptions
- •Tested: Run during disaster recovery drills
- •Versioned: Track changes in Git
- •Linked: Reference from alert definitions
- •Automated: Convert manual steps to scripts over time
For runbook templates, see examples/runbooks/ directory.
Blameless Post-Mortems
Blameless Culture Tenets
Assume Good Intentions: Everyone made the best decision with information available.
Focus on Systems: Investigate how processes failed, not who failed.
Psychological Safety: Create environment where honesty is rewarded.
Learning Opportunity: Incidents are gifts of organizational knowledge.
Post-Mortem Process
1. Schedule Review (Within 48 Hours): While memory is fresh
2. Pre-Work: Reconstruct timeline, gather metrics/logs, draft document
3. Meeting Facilitation:
- •Timeline walkthrough
- •5 Whys Analysis to identify systemic root causes
- •What Went Well / What Went Wrong
- •Define action items with owners and due dates
4. Post-Mortem Document:
- •Sections: Summary, Timeline, Root Cause, Impact, What Went Well/Wrong, Action Items
- •Distribution: Engineering, product, support, leadership
- •Storage: Archive in searchable knowledge base
5. Follow-Up: Track action items in sprint planning
For detailed facilitation guide and template, see references/blameless-postmortems.md and examples/postmortem-template.md.
Alert Design Principles
Actionable Alerts Only:
- •Every alert requires human action
- •Include graphs, runbook links, recent changes
- •Deduplicate related alerts
- •Route to appropriate team based on service ownership
Preventing Alert Fatigue:
- •Audit alerts quarterly: Remove non-actionable alerts
- •Increase thresholds for noisy metrics
- •Use anomaly detection instead of static thresholds
- •Limit: Max 2-3 pages per night
Tool Selection
Incident Management Platforms
PagerDuty:
- •Best for: Established enterprises, complex escalation policies
- •Cost: $19-41/user/month
- •When: Team size 10+, budget $500+/month
Opsgenie:
- •Best for: Atlassian ecosystem users, flexible routing
- •Cost: $9-29/user/month
- •When: Using Atlassian products, budget $200-500/month
incident.io:
- •Best for: Modern teams, AI-powered response, Slack-native
- •When: Team size 5-50, Slack-centric culture
For detailed tool comparison, see references/tool-comparison.md.
Status Page Solutions
Statuspage.io: Most trusted, easy setup ($29-399/month) Instatus: Budget-friendly, modern design ($19-99/month)
Metrics and Continuous Improvement
Key Incident Metrics
MTTA (Mean Time To Acknowledge):
- •Target: < 5 minutes for SEV1
- •Improvement: Better on-call coverage
MTTR (Mean Time To Recovery):
- •Target: < 1 hour for SEV1
- •Improvement: Runbooks, automation
MTBF (Mean Time Between Failures):
- •Target: > 30 days for critical services
- •Improvement: Root cause fixes
Incident Frequency:
- •Track: SEV0, SEV1, SEV2 counts per month
- •Target: Downward trend
Action Item Completion Rate:
- •Target: > 90%
- •Improvement: Sprint integration, ownership clarity
Continuous Improvement Loop
Incident → Post-Mortem → Action Items → Prevention ↑ ↓ └──────────── Fewer Incidents ─────────────┘
Decision Frameworks
Severity Classification Decision Tree
Is production completely down or critical data at risk?
├─ YES → SEV0
└─ NO → Is major functionality degraded?
├─ YES → Is there a workaround?
│ ├─ YES → SEV1
│ └─ NO → SEV0
└─ NO → Are customers impacted?
├─ YES → SEV2
└─ NO → SEV3
Use interactive classifier: python scripts/classify-severity.py
Escalation Matrix
For detailed escalation guidance, see references/escalation-matrix.md.
Mitigation vs. Root Cause
Prioritize Mitigation When:
- •Active customer impact ongoing
- •Quick fix available (rollback, disable feature)
Prioritize Root Cause When:
- •Customer impact already mitigated
- •Fix requires careful analysis
Default: Mitigation first (99% of cases)
Anti-Patterns to Avoid
- •Delayed Declaration: Waiting for certainty before declaring incident
- •Skipping Post-Mortems: "Small" incidents still provide learning
- •Blame Culture: Punishing individuals prevents systemic learning
- •Ignoring Action Items: Post-mortems without follow-through waste time
- •No Clear IC: Multiple people leading creates confusion
- •Alert Fatigue: Noisy, non-actionable alerts cause on-call to ignore pages
- •Hands-On IC: IC should delegate debugging, not do it themselves
Implementation Checklist
Phase 1: Foundation (Week 1)
- • Define severity levels (SEV0-SEV3)
- • Choose incident management platform
- • Set up basic on-call rotation
- • Create incident Slack channel template
Phase 2: Processes (Weeks 2-3)
- • Create first 5 runbooks for common incidents
- • Set up status page
- • Train team on incident response
- • Conduct tabletop exercise
Phase 3: Culture (Weeks 4+)
- • Conduct first blameless post-mortem
- • Establish post-mortem cadence
- • Implement MTTA/MTTR dashboards
- • Track action items in sprint planning
Phase 4: Optimization (Months 3-6)
- • Automate incident declaration
- • Implement runbook automation
- • Monthly disaster recovery drills
- • Quarterly incident trend reviews
Integration with Other Skills
Observability: Monitoring alerts trigger incidents → Use incident-management for response
Disaster Recovery: DR provides recovery procedures → Incident-management provides operational response
Security Incident Response: Similar process with added compliance/forensics
Infrastructure-as-Code: IaC enables fast recovery via automated rebuild
Performance Engineering: Performance incidents trigger response → Performance team investigates post-mitigation
Examples and Templates
Runbook Templates:
- •
examples/runbooks/database-failover.md - •
examples/runbooks/cache-invalidation.md - •
examples/runbooks/ddos-mitigation.md
Post-Mortem Template:
- •
examples/postmortem-template.md- Complete blameless post-mortem structure
Communication Templates:
- •
examples/communication-templates.md- Status updates, customer emails
On-Call Handoff:
- •
examples/oncall-handoff-template.md- Weekly handoff format
Integration Scripts:
- •
examples/integrations/pagerduty-slack.py - •
examples/integrations/statuspage-auto-update.py - •
examples/integrations/postmortem-generator.py
Scripts
Interactive Severity Classifier:
python scripts/classify-severity.py
Asks questions to determine appropriate severity level based on impact and urgency.
Further Reading
Books:
- •Google SRE Book: "Postmortem Culture" (Chapter 15)
- •"The Phoenix Project" by Gene Kim
- •"Site Reliability Engineering" (Full book)
Online Resources:
- •Atlassian: "How to Run a Blameless Postmortem"
- •PagerDuty: "Incident Response Guide"
- •Google SRE: "Postmortem Culture: Learning from Failure"
Standards:
- •Incident Command System (ICS) - FEMA standard adapted for tech
- •ITIL Incident Management - Traditional IT service management