incident-management

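Implement incident management processes and escalation procedures. Configure on-call schedules and run post-incident reviews. Use when handling incidents in production environments.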

SKILL.md
--- frontmatter
name: incident-management
description: Implement incident management processes and escalation procedures. Configure on-call schedules and post-incident reviews. Use when managing production incidents.
license: MIT
metadata:
  author: devops-skills
  version: "1.0"

Incident Management

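Implement incident management processes that cover severity classification, a response workflow, the incident commander role, and post-incident review.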

Incident Severity

| Severity | Impact | Response | Example |
| --- | --- | --- | --- |
| SEV1 | Total outage | Immediate, all-hands | Site down |
| SEV2 | Major degradation | Urgent, on-call | Feature broken |
| SEV3 | Minor impact | Standard | Slow performance |
| SEV4 | Minimal | Next business day | Cosmetic issue |
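
The severity levels above can be encoded so that tooling applies them consistently. The following is a minimal sketch, not a prescribed implementation: the impact and response strings come from the table, while the `page_on_call` flag and the three boolean signals passed to `classify` are hypothetical and would need to match your own alerting data.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass(frozen=True)
class ResponsePolicy:
    impact: str
    response: str
    page_on_call: bool  # assumption: only SEV1 and SEV2 page someone immediately


# Impact and response text mirror the severity table; page_on_call is assumed.
POLICIES = {
    Severity.SEV1: ResponsePolicy("Total outage", "Immediate, all-hands", True),
    Severity.SEV2: ResponsePolicy("Major degradation", "Urgent, on-call", True),
    Severity.SEV3: ResponsePolicy("Minor impact", "Standard", False),
    Severity.SEV4: ResponsePolicy("Minimal", "Next business day", False),
}


def classify(service_down: bool, feature_broken: bool, degraded: bool) -> Severity:
    """Return the highest applicable severity for three hypothetical signals."""
    if service_down:
        return Severity.SEV1
    if feature_broken:
        return Severity.SEV2
    if degraded:
        return Severity.SEV3
    return Severity.SEV4
```

Checking the signals from most to least severe means an incident that is both degraded and fully down is always treated as SEV1.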

Incident Process

yaml
incident_workflow:
  1_detect:
    - Alerting triggers
    - Customer reports
    - Monitoring anomalies
    
  2_triage:
    - Severity assessment
    - Impact determination
    - Team notification
    
  3_respond:
    - Incident commander assigned
    - Communication established
    - Mitigation started
    
  4_resolve:
    - Root cause addressed
    - Service restored
    - Customer notified
    
  5_review:
    - Timeline documented
    - Root cause analysis
    - Action items created
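
One way to enforce the five-step workflow above is a small state machine that only allows moving to the next phase. This is an illustrative sketch: the `Phase` and `Incident` names are hypothetical, and the rule that phases cannot be skipped is an assumption rather than something the workflow states.

```python
from enum import Enum


class Phase(Enum):
    DETECT = 1
    TRIAGE = 2
    RESPOND = 3
    RESOLVE = 4
    REVIEW = 5


class Incident:
    """Tracks which workflow phase an incident is in."""

    def __init__(self, title: str) -> None:
        self.title = title
        self.phase = Phase.DETECT
        self.history = [Phase.DETECT]  # phases visited, in order

    def advance(self) -> Phase:
        """Move to the next phase; assumed rule: phases cannot be skipped."""
        if self.phase is Phase.REVIEW:
            raise ValueError("incident has already reached the review phase")
        self.phase = Phase(self.phase.value + 1)
        self.history.append(self.phase)
        return self.phase
```

Keeping a `history` list gives the review phase a starting point for the documented timeline.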

Incident Commander

yaml
ic_responsibilities:
  - Own incident resolution
  - Coordinate response teams
  - Manage communication
  - Make escalation decisions
  - Schedule post-mortem
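
Escalation decisions are easier to make under pressure when the path is written down in advance. The sketch below assumes a time-based chain in which an unacknowledged page moves to the next tier after a fixed wait. The tier names and the 15 and 30 minute thresholds are hypothetical placeholders, not values from this skill.

```python
from datetime import timedelta

# Hypothetical escalation chain: (time unacknowledged before this tier is notified, tier).
ESCALATION_CHAIN = [
    (timedelta(minutes=0), "primary on-call"),
    (timedelta(minutes=15), "secondary on-call"),
    (timedelta(minutes=30), "engineering manager"),
]


def current_escalation_target(unacknowledged_for: timedelta) -> str:
    """Return the last tier in the chain whose threshold has been reached."""
    target = ESCALATION_CHAIN[0][1]
    for threshold, who in ESCALATION_CHAIN:
        if unacknowledged_for >= threshold:
            target = who
    return target
```

An incident commander could still override the result; the function only reports where the predefined path currently points.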

Post-Incident Review

markdown
## Incident Summary
- Duration:
- Impact:
- Severity:

## Timeline

## Root Cause

## What Went Well

## What Could Be Improved

## Action Items
| Item | Owner | Due Date |
| --- | --- | --- |
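
The review template above can be pre-filled from incident data so that every review starts with the same structure. This is a minimal sketch under assumptions: the `Review` and `ActionItem` classes and their field names are hypothetical, and the duration is simply the difference between two timestamps you supply.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class ActionItem:
    item: str
    owner: str
    due_date: str


@dataclass
class Review:
    started: datetime
    resolved: datetime
    impact: str
    severity: str
    action_items: list[ActionItem] = field(default_factory=list)

    def to_markdown(self) -> str:
        """Render the post-incident review template with known fields filled in."""
        duration = self.resolved - self.started
        lines = [
            "## Incident Summary",
            f"- Duration: {duration}",
            f"- Impact: {self.impact}",
            f"- Severity: {self.severity}",
            "",
            "## Timeline",
            "",
            "## Root Cause",
            "",
            "## What Went Well",
            "",
            "## What Could Be Improved",
            "",
            "## Action Items",
            "| Item | Owner | Due Date |",
            "| --- | --- | --- |",
        ]
        # One table row per action item; free-text sections are left for the reviewers.
        lines += [f"| {a.item} | {a.owner} | {a.due_date} |" for a in self.action_items]
        return "\n".join(lines)
```

Only the summary and action items are generated. The timeline, root cause, and retrospective sections stay empty because they need human judgment during the review.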

Best Practices

  • Clear severity definitions
  • Defined escalation paths
  • Blameless post-mortems
  • Action item tracking
  • Regular training