Incident Severity Levels
Overview
Severity levels provide a standardized way to classify incidents based on their impact, enabling appropriate response, resource allocation, and communication. Consistent severity classification ensures the right people respond with the right urgency.
Core Principle: "Severity drives response - classify quickly, respond appropriately."
1. Why Severity Levels Matter
Benefits of Clear Severity Definitions
code
✓ Appropriate resource allocation - Don't wake everyone for minor issues - Do mobilize all hands for critical outages ✓ Clear response expectations - Everyone knows what SEV1 means - Consistent response across teams ✓ Communication clarity - Stakeholders understand impact - Status page updates match severity ✓ Postmortem requirements - SEV0/1: Always postmortem - SEV3/4: Optional review ✓ SLA tracking - Measure response times by severity - Identify improvement areas ✓ Historical analysis - Trend severity over time - Identify systemic issues
Cost of Inconsistent Severity
code
❌ Over-escalation: - Alert fatigue - Wasted resources - Boy who cried wolf syndrome ❌ Under-escalation: - Delayed response - Increased customer impact - Missed SLA targets
2. Standard Severity Definitions
SEV0 / P0: Critical - Complete Outage
code
Definition: Complete service outage affecting all or nearly all users with severe business impact. Characteristics: - 100% or near-100% of users affected - Core functionality completely unavailable - Significant revenue impact - Data loss or security breach - No workaround available Examples: ✓ Entire website/app down (returns 500/503) ✓ Database completely unavailable ✓ Payment processing completely halted ✓ Data breach with customer data exposed ✓ Complete datacenter failure ✓ Critical security vulnerability being actively exploited Response: - Immediate all-hands response - War room established - Executive notification - Status page: "Major outage" - Customer communication: Immediate
SEV1 / P1: High - Major Functionality Broken
code
Definition: Major functionality unavailable or severely degraded, affecting significant portion of users. Characteristics: - 25-100% of users affected - Critical feature unavailable - Significant business impact - Limited or no workaround - Revenue impact Examples: ✓ Login system down (can't authenticate) ✓ Checkout broken (can't complete purchases) ✓ API returning 50%+ errors ✓ Database in read-only mode ✓ Major feature completely broken ✓ Significant data corruption Response: - Immediate response (< 15 minutes) - Senior engineer + team lead - War room for extended incidents - Status page: "Partial outage" - Customer communication: Within 30 minutes
SEV2 / P2: Medium - Important Feature Degraded
code
Definition: Important functionality degraded or unavailable, affecting subset of users. Characteristics: - 5-25% of users affected - Important (not critical) feature impacted - Moderate business impact - Workaround may exist - Performance degradation Examples: ✓ Search feature slow or returning poor results ✓ Email notifications delayed ✓ Dashboard loading slowly ✓ Some API endpoints timing out ✓ Single region experiencing issues ✓ Non-critical feature broken Response: - Response within 1 hour - On-call engineer + escalation if needed - Status page: "Degraded performance" (optional) - Customer communication: If prolonged
SEV3 / P3: Low - Minor Issue
code
Definition: Minor functionality issue with minimal user impact, workaround available. Characteristics: - < 5% of users affected - Minor feature impacted - Minimal business impact - Workaround available - Cosmetic or edge case Examples: ✓ UI element misaligned ✓ Minor data inconsistency ✓ Rare edge case bug ✓ Non-critical background job failing ✓ Internal tool slow ✓ Documentation outdated Response: - Response within 4 hours (business hours) - On-call engineer (low priority) - No status page update - No customer communication
SEV4 / P4: Minimal - Cosmetic Issue
code
Definition: Cosmetic issue or very minor bug with no user impact. Characteristics: - No user impact - Cosmetic only - Internal tooling - Nice-to-have improvement Examples: ✓ Typo in UI text ✓ Color scheme inconsistency ✓ Internal dashboard formatting ✓ Log message formatting ✓ Code style issue Response: - Response next business day or later - Regular development workflow - No on-call response needed - No communication required
3. Severity Assessment Criteria
The Four Dimensions
typescript
interface SeverityAssessment {
scope: {
usersAffected: number | string; // Absolute number or percentage
servicesAffected: string[];
regionsAffected: string[];
};
impact: {
functionalityLost: 'complete' | 'major' | 'partial' | 'minor' | 'cosmetic';
businessImpact: 'critical' | 'high' | 'medium' | 'low' | 'none';
revenueImpact: number; // $ per hour
};
duration: {
actual?: number; // minutes (if resolved)
projected?: number; // minutes (if ongoing)
};
workaround: {
available: boolean;
difficulty: 'easy' | 'moderate' | 'difficult' | 'none';
};
}
function calculateSeverity(assessment: SeverityAssessment): string {
// Complete outage
if (assessment.scope.usersAffected === '100%' &&
assessment.impact.functionalityLost === 'complete') {
return 'SEV0';
}
// Major functionality broken
if (assessment.impact.functionalityLost === 'major' ||
assessment.impact.businessImpact === 'critical' ||
(typeof assessment.scope.usersAffected === 'number' && assessment.scope.usersAffected > 10000)) {
return 'SEV1';
}
// Important feature degraded
if (assessment.impact.functionalityLost === 'partial' ||
assessment.impact.businessImpact === 'medium') {
return 'SEV2';
}
// Minor issue
if (assessment.impact.functionalityLost === 'minor') {
return 'SEV3';
}
// Cosmetic
return 'SEV4';
}
Scope: How Many Users?
code
100% of users → SEV0 (if critical functionality) 50-100% of users → SEV1 10-50% of users → SEV2 1-10% of users → SEV3 < 1% of users → SEV3 or SEV4 Examples: - All users can't login → SEV0 - Half of users seeing errors → SEV1 - 10% of users experiencing slow search → SEV2 - Single enterprise customer affected → SEV2 or SEV3 (depends on contract) - One user reports UI glitch → SEV4
Impact: How Severe?
code
Critical: Core business function unavailable - Can't process payments → SEV0/1 - Can't access data → SEV0/1 - Security breach → SEV0 High: Important function degraded - Slow checkout → SEV1/2 - Search not working → SEV1/2 - Email delays → SEV2 Medium: Nice-to-have function affected - Recommendations not showing → SEV2/3 - Analytics dashboard slow → SEV3 Low: Cosmetic or edge case - UI misalignment → SEV4 - Rare bug → SEV3
Duration: How Long?
code
Duration affects severity escalation: Initial classification: - SEV2: Important feature degraded After 4 hours: - Escalate to SEV1 (prolonged impact) After 8 hours: - Consider SEV0 (major incident) Example: - Search slow for 30 minutes → SEV2 - Search slow for 6 hours → SEV1 - Search down for 12 hours → SEV0
Workaround: Is There an Alternative?
code
No workaround → Higher severity Easy workaround → Lower severity Examples: - Login broken, no alternative → SEV0 - Login broken, can use SSO → SEV1 - Feature A broken, can use Feature B → SEV2 - UI button broken, can use keyboard shortcut → SEV3
4. Examples for Each Severity Level
SEV0 Examples
code
1. Complete Service Outage Symptom: Website returns 503 for all requests Impact: 100% of users can't access service Revenue: $50k/hour Severity: SEV0 2. Data Breach Symptom: Customer data exposed publicly Impact: All customers' PII at risk Legal: GDPR violation, regulatory fines Severity: SEV0 3. Payment Processing Halted Symptom: All payment transactions failing Impact: Can't process any orders Revenue: $100k/hour Severity: SEV0 4. Database Deleted Symptom: Production database dropped Impact: All data lost, service unusable Recovery: Hours to restore from backup Severity: SEV0 5. Critical Security Vulnerability Exploited Symptom: SQL injection being actively exploited Impact: Data exfiltration in progress Risk: Complete data compromise Severity: SEV0
SEV1 Examples
code
1. Login System Down Symptom: Authentication service returning errors Impact: Users can't login (existing sessions work) Affected: ~30% of users (those not logged in) Severity: SEV1 2. Checkout Broken Symptom: Payment submission fails Impact: Can't complete purchases Revenue: $20k/hour Severity: SEV1 3. Database Read-Only Mode Symptom: All write operations failing Impact: Can view data, can't create/update Affected: 100% of users (partial functionality) Severity: SEV1 4. API 50% Error Rate Symptom: Half of API requests failing Impact: Mobile app intermittently broken Affected: 50% of mobile users Severity: SEV1 5. Major Feature Completely Broken Symptom: Search returns no results Impact: Can't find products Business: Significant conversion impact Severity: SEV1
SEV2 Examples
code
1. Search Slow Symptom: Search takes 10s instead of 1s Impact: Poor user experience, some users give up Affected: All users (degraded, not broken) Severity: SEV2 2. Email Notifications Delayed Symptom: Emails sent 2 hours late Impact: Users don't get timely notifications Workaround: Check in-app notifications Severity: SEV2 3. Single Region Degraded Symptom: US-West region slow Impact: 20% of users (in that region) Workaround: None for those users Severity: SEV2 4. Admin Dashboard Unavailable Symptom: Internal admin tool down Impact: Support team can't access user data Affected: Internal users only Severity: SEV2 5. Some API Endpoints Timing Out Symptom: /api/recommendations timing out Impact: Recommendations not showing Workaround: Users can still browse/purchase Severity: SEV2
SEV3 Examples
code
1. UI Glitch Symptom: Button overlaps text on mobile Impact: Looks bad, but still functional Affected: Mobile users Severity: SEV3 2. Minor Data Inconsistency Symptom: User's last login time incorrect Impact: Cosmetic, doesn't affect functionality Affected: All users (minor) Severity: SEV3 3. Rare Edge Case Bug Symptom: Error when user has exactly 100 items Impact: Very few users affected Workaround: Remove one item Severity: SEV3 4. Non-Critical Background Job Failing Symptom: Daily analytics aggregation not running Impact: Internal reports outdated Workaround: Run manually Severity: SEV3 5. Internal Tool Slow Symptom: Developer dashboard takes 5s to load Impact: Internal productivity slightly reduced Affected: Engineering team only Severity: SEV3
SEV4 Examples
code
1. Typo in UI Symptom: "Sumbit" instead of "Submit" Impact: None (users understand) Affected: All users (cosmetic only) Severity: SEV4 2. Color Scheme Issue Symptom: Button color doesn't match design Impact: Purely aesthetic Affected: All users (cosmetic) Severity: SEV4 3. Documentation Outdated Symptom: API docs show old endpoint Impact: Developers might be confused Workaround: Check code or ask Severity: SEV4 4. Log Message Formatting Symptom: Logs missing timestamp Impact: Slightly harder to debug Affected: Engineers only Severity: SEV4 5. Code Style Inconsistency Symptom: Some files use tabs, others spaces Impact: None (linter catches it) Affected: Developers only Severity: SEV4
5. Severity and Response SLAs
Response Time SLAs
typescript
interface SeveritySLA {
severity: string;
acknowledgement: number; // minutes
initialResponse: number; // minutes
updateFrequency: number; // minutes
resolutionTarget: number; // hours
}
const severitySLAs: SeveritySLA[] = [
{
severity: 'SEV0',
acknowledgement: 5,
initialResponse: 10,
updateFrequency: 15,
resolutionTarget: 1
},
{
severity: 'SEV1',
acknowledgement: 15,
initialResponse: 30,
updateFrequency: 30,
resolutionTarget: 4
},
{
severity: 'SEV2',
acknowledgement: 60,
initialResponse: 120,
updateFrequency: 120,
resolutionTarget: 24
},
{
severity: 'SEV3',
acknowledgement: 240,
initialResponse: 480,
updateFrequency: 480,
resolutionTarget: 168 // 1 week
},
{
severity: 'SEV4',
acknowledgement: 1440, // 1 day
initialResponse: 2880, // 2 days
updateFrequency: 0, // No updates needed
resolutionTarget: 720 // 30 days
}
];
SLA Table
| Severity | Acknowledge | Initial Response | Update Frequency | Resolution Target |
|---|---|---|---|---|
| SEV0 | 5 min | 10 min | Every 15 min | 1 hour |
| SEV1 | 15 min | 30 min | Every 30 min | 4 hours |
| SEV2 | 1 hour | 2 hours | Every 2 hours | 24 hours |
| SEV3 | 4 hours | 8 hours | Daily | 1 week |
| SEV4 | 1 day | 2 days | None | 30 days |
6. Severity Escalation and De-escalation
When to Escalate Severity
code
Escalation Triggers: 1. Duration - SEV2 lasting > 4 hours → SEV1 - SEV1 lasting > 8 hours → SEV0 2. Scope Expansion - Initially 10% users → Now 50% users - Single region → Multiple regions 3. New Information - Thought it was cosmetic → Actually breaking functionality - Discovered data loss 4. Business Impact - Revenue impact higher than estimated - Major customer affected 5. Cascading Failures - One service down → Multiple services affected
typescript
// Auto-escalation logic
function checkEscalation(incident: Incident): boolean {
const duration = Date.now() - incident.startTime.getTime();
const durationHours = duration / (1000 * 60 * 60);
// SEV2 for > 4 hours → SEV1
if (incident.severity === 'SEV2' && durationHours > 4) {
escalateIncident(incident, 'SEV1', 'Duration exceeded 4 hours');
return true;
}
// SEV1 for > 8 hours → SEV0
if (incident.severity === 'SEV1' && durationHours > 8) {
escalateIncident(incident, 'SEV0', 'Duration exceeded 8 hours');
return true;
}
return false;
}
When to De-escalate Severity
code
De-escalation Triggers: 1. Partial Mitigation - SEV0 → SEV1: Core functionality restored, some features degraded - SEV1 → SEV2: Workaround implemented 2. Scope Reduction - 100% users → 10% users - All regions → Single region 3. Better Understanding - Thought 100% affected → Actually 10% - Thought critical → Actually non-critical 4. Temporary Workaround - SEV1 → SEV2: Manual workaround available
typescript
// De-escalation example
async function deEscalateIncident(
incident: Incident,
newSeverity: string,
reason: string
) {
await updateIncident(incident.id, {
severity: newSeverity,
timeline: [
...incident.timeline,
{
timestamp: new Date(),
event: `De-escalated from ${incident.severity} to ${newSeverity}`,
reason
}
]
});
await notifyStakeholders({
incident: incident.id,
message: `Incident de-escalated to ${newSeverity}: ${reason}`
});
}
7. Communication Requirements by Severity
SEV0/1: Maximum Communication
code
Internal: ✓ Create dedicated Slack channel (#inc-YYYY-NNN) ✓ Establish war room (video call) ✓ Page on-call team + escalation ✓ Notify executives (CTO, CEO for SEV0) ✓ Update every 15-30 minutes External: ✓ Update status page immediately ✓ Post on social media (if appropriate) ✓ Email affected customers ✓ Prepare customer-facing postmortem Status Page Updates: - Initial: "Investigating major outage affecting [service]" - Progress: "Identified issue, implementing fix" - Resolution: "Issue resolved, monitoring for stability" - Follow-up: "Postmortem available at [link]"
SEV2: Moderate Communication
code
Internal: ✓ Create incident channel (optional) ✓ Notify team lead ✓ Update every 2 hours ✓ No executive notification (unless prolonged) External: ✓ Update status page (if customer-facing) ✓ Email enterprise customers (if affected) ✓ No social media posts Status Page Updates: - "Experiencing degraded performance on [service]" - "Issue resolved"
SEV3/4: Minimal Communication
code
Internal: ✓ Create ticket in issue tracker ✓ Assign to engineer ✓ No real-time updates External: ✗ No status page update ✗ No customer communication ✗ Fix in regular release cycle
8. Post-Incident Requirements by Severity
SEV0/1: Mandatory Postmortem
code
Requirements: ✓ Full postmortem within 48 hours ✓ Root cause analysis (5 Whys) ✓ Timeline of events ✓ Action items with owners ✓ Executive review ✓ Share with entire engineering org ✓ Optional: Public postmortem Template: - Executive summary - Impact (users, revenue, duration) - Timeline - Root cause - What went well / wrong - Action items - Lessons learned
SEV2: Recommended Postmortem
code
Requirements: ✓ Lightweight postmortem (if prolonged or interesting) ✓ Brief root cause analysis ✓ Key learnings ✓ Action items ✓ Share with team Optional: - Full postmortem if valuable learnings - Skip if straightforward fix
SEV3/4: Optional Review
code
Requirements: ✓ Document fix in ticket ✓ Update runbook (if applicable) Optional: - Team discussion if pattern emerges - No formal postmortem
9. Resource Allocation by Severity
SEV0: All Hands
code
Mobilization: - Incident Commander (senior engineer or manager) - Technical Lead (architect or principal engineer) - On-call team (all available) - Subject matter experts (database, networking, etc.) - Communications Lead - Executive sponsor (CTO) War Room: - Video call (Zoom/Meet) - Dedicated Slack channel - Shared incident doc Duration: - Until resolved - Rotate responders if > 4 hours
SEV1: Core Team
code
Mobilization: - On-call engineer (primary) - Team lead or senior engineer - Subject matter expert (if needed) - Communications (if customer-facing) War Room: - Slack channel - Video call (if needed) Duration: - Until resolved or de-escalated
SEV2: On-Call + Backup
code
Mobilization: - On-call engineer - Escalate to team lead if not resolved in 2 hours Communication: - Slack thread or channel Duration: - Business hours response
SEV3/4: Single Engineer
code
Mobilization: - On-call engineer (low priority) - Or regular development workflow Communication: - Ticket comments Duration: - Fix in next sprint
10. On-Call Rotation Intensity
Severity-Based On-Call Tiers
code
Tier 1 (Primary On-Call): - Responds to all SEV0-SEV3 - 24/7 availability - 15-minute response SLA Tier 2 (Secondary On-Call): - Escalation from Tier 1 - Subject matter experts - 30-minute response SLA Tier 3 (Management): - SEV0 incidents only - Executive visibility - 1-hour response SLA
On-Call Compensation by Severity
code
SEV0: - Immediate response required - Compensation: On-call pay + overtime - Time off in lieu (TOIL) SEV1: - Urgent response required - Compensation: On-call pay - TOIL for extended incidents SEV2-4: - Business hours response acceptable - Compensation: Standard on-call pay
11. Severity Level Confusion (Common Mistakes)
Mistake 1: Confusing Impact with Effort
code
❌ Wrong: "This will take 2 weeks to fix → SEV0" ✓ Right: "100% of users can't login → SEV0" "UI typo affecting 0 users → SEV4 (even if 2-week fix)" Severity = User Impact, not Engineering Effort
Mistake 2: Internal vs External Impact
code
❌ Wrong: "Internal dashboard down → SEV1" ✓ Right: "Internal dashboard down → SEV2 or SEV3" (Unless it blocks customer support) Customer-facing > Internal tools
Mistake 3: Potential vs Actual Impact
code
❌ Wrong: "Security vulnerability discovered → SEV0" ✓ Right: "Security vulnerability being exploited → SEV0" "Security vulnerability discovered (not exploited) → SEV1 or SEV2" Actual impact > Potential impact
Mistake 4: Over-Escalation
code
❌ Wrong: "Single user reports UI glitch → SEV1" ✓ Right: "Single user reports UI glitch → SEV3 or SEV4" Don't cry wolf - save SEV0/1 for real emergencies
Mistake 5: Under-Escalation
code
❌ Wrong: "50% error rate for 2 hours → SEV2" ✓ Right: "50% error rate for 2 hours → SEV1" Don't minimize serious issues
12. Industry Standards Comparison
Tech Company Standards
code
Google: P0 = SEV0 (complete outage) P1 = SEV1 (major impact) P2 = SEV2 (moderate impact) P3 = SEV3 (minor impact) P4 = SEV4 (trivial) Amazon: SEV1 = Critical (customer-facing outage) SEV2 = High (significant degradation) SEV3 = Medium (minor degradation) SEV4 = Low (cosmetic) SEV5 = Trivial Microsoft: Sev A = Critical Sev B = High Sev C = Medium Sev D = Low Atlassian: P1 = Critical (system down) P2 = High (major feature broken) P3 = Medium (minor feature broken) P4 = Low (cosmetic)
ITIL Standards
code
Priority 1: Critical - Complete service outage - Response: Immediate - Resolution: 4 hours Priority 2: High - Significant degradation - Response: 1 hour - Resolution: 24 hours Priority 3: Medium - Minor degradation - Response: 4 hours - Resolution: 1 week Priority 4: Low - Cosmetic issue - Response: 1 day - Resolution: 1 month
13. Customizing Severity for Your Organization
Factors to Consider
code
1. Company Size - Startup: 3 levels (Critical, High, Low) - Enterprise: 5 levels (SEV0-SEV4) 2. Industry - Healthcare: Stricter (patient safety) - Gaming: More lenient (entertainment) - Finance: Stricter (regulatory) 3. Customer Base - B2C: User count matters - B2B: Contract SLAs matter - Enterprise: Single customer = high severity 4. Business Model - E-commerce: Revenue impact critical - SaaS: Uptime critical - Freemium: Paying users > free users
Customization Template
typescript
interface CustomSeverityDefinition {
level: string;
name: string;
description: string;
criteria: {
usersAffected: string;
businessImpact: string;
examples: string[];
};
response: {
acknowledgement: number;
initialResponse: number;
updateFrequency: number;
resolutionTarget: number;
};
communication: {
internal: string[];
external: string[];
};
postIncident: {
postmortemRequired: boolean;
executiveReview: boolean;
};
}
// Example: E-commerce company
const customSeverities: CustomSeverityDefinition[] = [
{
level: 'SEV0',
name: 'Critical - Revenue Impacting',
description: 'Complete inability to process orders',
criteria: {
usersAffected: '100% or major revenue impact',
businessImpact: '> $50k/hour revenue loss',
examples: [
'Checkout completely broken',
'Payment processing down',
'Website completely unavailable'
]
},
response: {
acknowledgement: 5,
initialResponse: 10,
updateFrequency: 15,
resolutionTarget: 1
},
communication: {
internal: ['All hands', 'Executive notification', 'War room'],
external: ['Status page', 'Email to all customers', 'Social media']
},
postIncident: {
postmortemRequired: true,
executiveReview: true
}
}
];
14. Real Incident Severity Examples
Example 1: GitLab Database Incident (2017)
code
Incident: Accidental database deletion Initial Classification: SEV0 Reasoning: - 100% of users affected - 6 hours of data lost - Service completely unavailable - No workaround Response: - All hands on deck - 18 hours to full recovery - Public postmortem published Correct severity: SEV0 ✓
Example 2: AWS S3 Outage (2017)
code
Incident: S3 US-EAST-1 outage Initial Classification: SEV0 Reasoning: - Thousands of websites affected - Complete S3 unavailability - 4-hour duration - Massive business impact Response: - All AWS teams mobilized - Detailed postmortem - Process changes implemented Correct severity: SEV0 ✓
Example 3: Slack Outage (2021)
code
Incident: Slack service disruption Initial Classification: SEV1 Reasoning: - Most users could still access (degraded) - Some features unavailable - Intermittent issues - Workarounds available Response: - Core team response - Status page updates - Resolved in 2 hours Correct severity: SEV1 ✓
Example 4: GitHub Actions Slow (2022)
code
Incident: GitHub Actions experiencing delays Initial Classification: SEV2 Reasoning: - Service still functional - Delays but not failures - Subset of users affected - Non-critical feature Response: - Engineering team investigation - Status page update - Resolved in 4 hours Correct severity: SEV2 ✓
Summary
Key takeaways for Severity Levels:
- •Classify based on user impact - Not engineering effort
- •Use consistent definitions - Everyone should agree what SEV1 means
- •Escalate when needed - Duration and scope changes matter
- •Communicate appropriately - SEV0 needs more communication than SEV4
- •Allocate resources correctly - Don't wake everyone for SEV3
- •Follow SLAs - Response times should match severity
- •Require postmortems for SEV0/1 - Learn from major incidents
- •Customize for your org - But stay close to industry standards
- •Avoid common mistakes - Don't over or under-escalate
- •Document everything - Severity classification reasoning
Related Skills
- •
41-incident-management/incident-triage- Initial assessment and classification - •
41-incident-management/escalation-paths- When and how to escalate - •
41-incident-management/stakeholder-communication- Communication by severity - •
40-system-resilience/postmortem-analysis- Post-incident learning