
runbook-generator

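Use this skill when the user asks to create, generate, or write an incident runbook, playbook, or response procedure. It triggers on alert names, incident descriptions, or requests containing terms such as "runbook", "playbook", "incident response", "on-call procedure", or "troubleshooting guide". It also triggers when a monitoring alert is received and the user asks to document the response.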

SKILL.md
---
name: runbook-generator
description: Use this skill when the user asks to create, generate, or write an incident runbook, playbook, or response procedure. Triggers on alert names, incident descriptions, or requests containing words like "runbook", "playbook", "incident response", "on-call procedure", or "troubleshooting guide". Also triggers when given a monitoring alert and asked to document the response.
---

Incident Runbook Generator

Overview

Generate structured, actionable incident runbooks that follow the team's standard format. Every runbook produced by this skill is consistent in structure, tone, and level of detail, so it stays reliable under pressure at 3 AM.

Instructions

When asked to generate a runbook:

  1. Identify the incident type from the user's description (e.g., "high CPU", "connection pool exhaustion", "certificate expiry")
  2. Classify severity using the definitions below
  3. Generate the runbook following the exact template structure
  4. Output as a markdown file named `runbook-<incident-slug>.md` (e.g., `runbook-high-cpu-api-servers.md`)
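
The file-naming rule in step 4 can be sketched in a few lines. This is only an illustration: the skill does not define how the slug is derived, so the helper names `incident_slug` and `runbook_filename` and the exact slug rule (lowercase, runs of non-alphanumeric characters collapsed into one hyphen) are assumptions.

```python
import re


def incident_slug(description: str) -> str:
    """Lowercase the text and collapse each run of non-alphanumerics into one hyphen.

    Assumed slug rule; the skill itself only shows the example file name.
    """
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    if not slug:
        raise ValueError("description contains no usable characters for a slug")
    return slug


def runbook_filename(description: str) -> str:
    """Build the output file name in the runbook-<incident-slug>.md pattern."""
    return f"runbook-{incident_slug(description)}.md"


print(runbook_filename("High CPU: API servers"))
```

With this rule, the description "High CPU: API servers" yields the example file name from step 4, `runbook-high-cpu-api-servers.md`.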

Runbook Template

Every runbook MUST contain these sections in this exact order:

```markdown
# [Incident Title]

**Severity:** [SEV-1 | SEV-2 | SEV-3 | SEV-4]  
**Last Updated:** [date]  
**Owner:** [team name — leave as TBD if unknown]  
**Review Cadence:** Quarterly

## Symptoms

What does this incident look like? List the observable indicators.
- Alert name and threshold
- User-facing impact
- Dashboard signals

## Impact

Who and what is affected?
- Services impacted
- User population affected
- Business impact (revenue, SLA, compliance)

## Triage Checklist

Step-by-step diagnostic procedure. Each step should be a command or action.
1. [ ] First thing to check (include exact command)
2. [ ] Second thing to check
3. [ ] Third thing to check

## Mitigation

Immediate actions to reduce impact. NOT root cause fixes.
1. [ ] First mitigation step (include exact command)
2. [ ] Second mitigation step
3. [ ] Rollback procedure if applicable

## Resolution

Steps to fully resolve the underlying issue.
1. [ ] Resolution step with commands
2. [ ] Verification that the fix worked

## Escalation

When and how to escalate.
- **Escalate to [team]** if: [condition]
- **Page [role]** if: [condition]
- **Incident commander** if: SEV-1 or customer-facing for > [duration]

## Post-Incident

- [ ] Create post-incident review ticket
- [ ] Update this runbook if procedure changed
- [ ] Notify stakeholders via [channel]

## References

- Relevant dashboards: [links]
- Related runbooks: [links]
- Architecture docs: [links]
```
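
Because the template requires its sections in an exact order, a generated runbook can be checked mechanically. The following is a minimal sketch, not part of the skill: the function names are hypothetical, and it assumes sections are marked by level-2 `## ` headings exactly as in the template. It does not skip fenced code blocks, so a line starting with `## ` inside a code sample would be miscounted as a heading.

```python
import re

# Section headings in the order the template requires.
REQUIRED_SECTIONS = [
    "Symptoms", "Impact", "Triage Checklist", "Mitigation",
    "Resolution", "Escalation", "Post-Incident", "References",
]


def section_headings(markdown_text: str) -> list[str]:
    """Return the text of every level-2 heading, in document order."""
    pattern = re.compile(r"^## +(.+)$", flags=re.MULTILINE)
    return [match.group(1).strip() for match in pattern.finditer(markdown_text)]


def check_structure(markdown_text: str) -> list[str]:
    """Return a list of structural problems; an empty list means the runbook conforms."""
    problems = []
    lines = markdown_text.lstrip().splitlines()
    if not lines or not lines[0].startswith("# "):
        problems.append("missing level-1 incident title")
    found = section_headings(markdown_text)
    if found != REQUIRED_SECTIONS:
        missing = [s for s in REQUIRED_SECTIONS if s not in found]
        extra = [s for s in found if s not in REQUIRED_SECTIONS]
        if missing:
            problems.append("missing sections: " + ", ".join(missing))
        if extra:
            problems.append("unexpected sections: " + ", ".join(extra))
        if not missing and not extra:
            problems.append("sections out of order or duplicated")
    return problems
```

Calling `check_structure(text)` on a generated runbook returns an empty list when the title and all eight sections are present in template order.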

Severity Definitions

| Level | Definition | Response Time | Examples |
|-------|------------|---------------|----------|
| SEV-1 | Complete service outage or data loss risk | Immediate, all-hands | Full production down, data corruption, security breach |
| SEV-2 | Major feature degraded, significant user impact | < 30 minutes | Partial outage, severe latency, payment failures |
| SEV-3 | Minor feature degraded, limited user impact | < 2 hours | Single endpoint slow, non-critical service down |
| SEV-4 | Cosmetic or minimal impact | Next business day | Log noise, minor UI glitch, non-user-facing |
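
The severity definitions can be read as a most-severe-first decision rule. The sketch below encodes the table as data and applies that rule. The boolean inputs (`full_outage`, `data_loss_risk`, `major_degradation`, `minor_degradation`) are hypothetical simplifications of the prose definitions; in practice the classification is a judgment made from the incident description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Severity:
    level: str
    definition: str
    response_time: str


# Values copied from the severity table.
SEVERITIES = {
    "SEV-1": Severity("SEV-1", "Complete service outage or data loss risk", "Immediate, all-hands"),
    "SEV-2": Severity("SEV-2", "Major feature degraded, significant user impact", "< 30 minutes"),
    "SEV-3": Severity("SEV-3", "Minor feature degraded, limited user impact", "< 2 hours"),
    "SEV-4": Severity("SEV-4", "Cosmetic or minimal impact", "Next business day"),
}


def classify(
    *,
    full_outage: bool = False,
    data_loss_risk: bool = False,
    major_degradation: bool = False,
    minor_degradation: bool = False,
) -> Severity:
    """Pick the most severe level whose (assumed) condition holds; default to SEV-4."""
    if full_outage or data_loss_risk:
        return SEVERITIES["SEV-1"]
    if major_degradation:
        return SEVERITIES["SEV-2"]
    if minor_degradation:
        return SEVERITIES["SEV-3"]
    return SEVERITIES["SEV-4"]


severity = classify(major_degradation=True)
print(severity.level, severity.response_time)
```

Checking the most severe condition first means an incident that matches several definitions is assigned the highest applicable level.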

Style Guide

  • Tone: Terse, direct, action-oriented. No filler. No "you might want to consider..."
  • Commands: Always include exact CLI commands, not just descriptions. Use code blocks.
  • Checklists: Use `[ ]` checkboxes so responders can track progress
  • Assumptions: Assume the reader is an on-call engineer at 3 AM who has never seen this alert before
  • Specificity: Prefer `kubectl get pods -n production | grep CrashLoop` over "check the pods"
  • Time estimates: Include expected duration for each section where applicable
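
Two of the style rules above (no filler, checkboxes on steps) are simple enough to check automatically. This is a rough sketch under stated assumptions: `lint_runbook` is a hypothetical helper, only the first filler phrase comes from the style guide (the second is an illustrative addition), and every numbered list item is treated as a step that needs a checkbox.

```python
import re

# First phrase is quoted in the style guide; the second is an illustrative assumption.
FILLER_PHRASES = ["you might want to consider", "it may be worth"]

NUMBERED_STEP = re.compile(r"^\s*\d+\.\s+")
CHECKBOX_STEP = re.compile(r"^\s*\d+\.\s+\[[ xX]\]\s")


def lint_runbook(markdown_text: str) -> list[str]:
    """Return style warnings, one per finding, in line order."""
    warnings = []
    for number, line in enumerate(markdown_text.splitlines(), start=1):
        lowered = line.lower()
        for phrase in FILLER_PHRASES:
            if phrase in lowered:
                warnings.append(f"line {number}: filler phrase '{phrase}'")
        # A numbered step must carry a "[ ]" (or ticked "[x]") checkbox.
        if NUMBERED_STEP.match(line) and not CHECKBOX_STEP.match(line):
            warnings.append(f"line {number}: numbered step has no checkbox")
    return warnings


sample = (
    "1. [ ] Run `kubectl get pods -n production | grep CrashLoop`\n"
    "2. You might want to consider restarting the deployment\n"
)
for warning in lint_runbook(sample):
    print(warning)
```

The remaining rules (exact commands, specificity, time estimates) depend on context and are left to review.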