AgentSkillsCN

actionable-alerting-runbook-design

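Design effective alerts and runbooks for incident response. Proactively activate for: (1) creating alerting rules, (2) writing runbooks, (3) reducing alert fatigue, (4) setting up on-call rotation and escalation, (5) defining incident response procedures. Triggers: "alerting", "runbook", "on-call rotation", "PagerDuty", "incident", "alert fatigue", "escalation", "playbook"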

SKILL.md
--- frontmatter
name: actionable-alerting-runbook-design
version: "1.0"
description: >
  Designing effective alerts and runbooks for incident response.
  PROACTIVELY activate for: (1) Creating alerting rules, (2) Writing runbooks,
  (3) Reducing alert fatigue, (4) On-call escalation setup, (5) Incident response procedures.
  Triggers: "alerting", "runbook", "on-call", "pagerduty", "incident", "alert fatigue", "escalation", "playbook"
core-integration:
  techniques:
    primary: ["systematic_analysis"]
    secondary: ["structured_evaluation"]
  contracts:
    input: "none"
    output: "none"
  patterns: "none"
  rubrics: "none"

Actionable Alerting and Runbook Design

This skill provides expertise in designing alerts and runbooks for effective incident response.

Overview

Good alerting shortens the time to detect and resolve incidents. Bad alerting causes alert fatigue, and fatigued responders miss real issues.

Alerting Principles

What Makes an Alert Actionable?

  1. Specific: Clear about what's wrong
  2. Contextual: Includes relevant information
  3. Timely: Fires before users notice
  4. Actionable: Recipient can do something about it
  5. Linked: Points to runbook or dashboard
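The checklist above can be partly automated as a lint step over alert definitions. The following is a minimal sketch, assuming a hypothetical `AlertDefinition` record whose field names (`title`, `description`, `owner`, `runbook_url`, `dashboard_url`) are illustrative and not taken from any particular alerting tool. Timeliness depends on thresholds and evaluation intervals, so it is not checked here.

```python
from dataclasses import dataclass


@dataclass
class AlertDefinition:
    """Hypothetical alert record; the field names are assumptions for illustration."""
    title: str
    description: str = ""
    owner: str = ""
    runbook_url: str = ""
    dashboard_url: str = ""


def actionability_gaps(alert: AlertDefinition) -> list[str]:
    """Return the checklist items this alert definition fails to satisfy."""
    gaps = []
    # Specific: a one- or two-word title rarely says what is wrong (heuristic).
    if len(alert.title.split()) < 3:
        gaps.append("specific")
    # Contextual: the description should carry the relevant details.
    if not alert.description.strip():
        gaps.append("contextual")
    # Actionable: someone who can act must own the alert.
    if not alert.owner.strip():
        gaps.append("actionable")
    # Linked: at least one of runbook or dashboard must be referenced.
    if not (alert.runbook_url.strip() or alert.dashboard_url.strip()):
        gaps.append("linked")
    return gaps
```

A check like this could run in CI so that an alert missing an owner or a runbook link is rejected before it ever pages anyone.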

Alert Anti-Patterns

  • Flapping alerts: Constantly firing and resolving
  • Too sensitive: Alerts on normal variance
  • No runbook: Alert with no remediation guidance
  • Wrong audience: Alerting people who can't help
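One common way to address the first two anti-patterns is to require a condition to hold for several consecutive evaluations before firing, and to stay clear for several evaluations before resolving. The sketch below illustrates that idea; the class name `DebouncedAlert` and the parameters `fire_after` and `resolve_after` are hypothetical, and the default counts are arbitrary.

```python
class DebouncedAlert:
    """Fires only after sustained breaches and resolves only after sustained recovery."""

    def __init__(self, threshold: float, fire_after: int = 3, resolve_after: int = 3):
        self.threshold = threshold
        self.fire_after = fire_after        # consecutive breaches needed to fire
        self.resolve_after = resolve_after  # consecutive clear readings needed to resolve
        self.firing = False
        self._breaches = 0
        self._clears = 0

    def observe(self, value: float) -> bool:
        """Record one evaluation and return whether the alert is currently firing."""
        if value > self.threshold:
            self._breaches += 1
            self._clears = 0
        else:
            self._clears += 1
            self._breaches = 0

        if not self.firing and self._breaches >= self.fire_after:
            self.firing = True
        elif self.firing and self._clears >= self.resolve_after:
            self.firing = False
        return self.firing
```

With this approach a value that oscillates around the threshold neither fires nor resolves the alert on every evaluation; the trade-off is a slightly later first notification.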

Runbook Structure

markdown
# Alert: High API Error Rate

## Summary
API error rate exceeds 5% for 5 minutes

## Impact
Users experiencing failed requests

## Diagnosis Steps
1. Check error logs: [link]
2. Check recent deployments: [link]
3. Check database health: [link]

## Remediation Steps
1. If a recent deployment is the likely cause, roll back: `kubectl rollout undo...`
2. If the issue is in the database, scale it: `gcloud sql instances patch...`
3. If the cause is unknown, escalate to @team-leads

## Escalation
- L1: On-call engineer
- L2: Team lead (if not resolved within 15 min)
- L3: VP Engineering (if customer impact lasts more than 30 min)
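The escalation ladder in the example runbook can be expressed as a small function, which makes the thresholds explicit and testable. This is a sketch only: the function name `escalation_targets` and its parameters are hypothetical, and treating exactly 15 minutes as "not resolved in 15 min" is an assumption.

```python
def escalation_targets(minutes_unresolved: float,
                       customer_impact_minutes: float = 0.0) -> list[str]:
    """Return who should be engaged, following the example runbook's ladder."""
    # L1: the on-call engineer is always engaged.
    targets = ["On-call engineer"]
    # L2: team lead once the incident has gone 15 minutes without resolution
    # (assumption: the boundary itself counts as unresolved).
    if minutes_unresolved >= 15:
        targets.append("Team lead")
    # L3: VP Engineering once customers have been impacted for more than 30 minutes.
    if customer_impact_minutes > 30:
        targets.append("VP Engineering")
    return targets
```

For example, `escalation_targets(20)` returns the on-call engineer and the team lead, while an incident with no customer impact never reaches L3 regardless of its duration.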

Best Practices

  1. Alert on symptoms, not causes
  2. Use multi-window alerting (require the condition to hold over both a short and a long window) to reduce noise
  3. Include dashboards and runbook links in alerts
  4. Review and prune alerts quarterly
  5. Track the alert-to-incident ratio (how many alerts fire per real incident)
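Multi-window alerting, mentioned in the list above, can be sketched as follows. The alert fires only when the error rate exceeds the threshold over both a short and a long window, so a brief spike or an already-recovered problem does not page anyone. The function names, the per-minute `(requests, errors)` sample format, and the 5-minute and 60-minute windows are assumptions chosen for illustration; the 5% threshold mirrors the example runbook.

```python
def error_rate(samples: list[tuple[int, int]]) -> float:
    """Aggregate error rate over a list of (requests, errors) samples."""
    total_requests = sum(requests for requests, _ in samples)
    total_errors = sum(errors for _, errors in samples)
    return total_errors / total_requests if total_requests else 0.0


def multi_window_breach(per_minute: list[tuple[int, int]],
                        threshold: float = 0.05,
                        short_window: int = 5,
                        long_window: int = 60) -> bool:
    """True only if both the recent short window and the long window breach."""
    short = per_minute[-short_window:]  # is the problem still happening right now?
    long = per_minute[-long_window:]    # has it been significant over time?
    return error_rate(short) > threshold and error_rate(long) > threshold
```

The long window filters out short-lived spikes, and the short window lets the alert stop firing soon after recovery instead of waiting for the long window to drain.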

[Content to be expanded based on plugin_spec_agentient-observability.md specifications]