Actionable Alerting and Runbook Design

Name: actionable-alerting-runbook-design
Rating: 76
Author: Agentient

This skill provides expertise in designing alerts and runbooks for effective incident response.

Overview

Good alerting shortens the time to detect and resolve incidents. Bad alerting causes alert fatigue, and fatigued responders miss real issues.

Alerting Principles

What Makes an Alert Actionable?

•Specific: Clear about what's wrong
•Contextual: Includes relevant information
•Timely: Fires early enough to act on, ideally before users notice
•Actionable: Recipient can do something about it
•Linked: Points to runbook or dashboard
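
As a rough illustration, most of these properties can be checked mechanically when an alert is defined. The sketch below is only a sketch: the `Alert` class, its field names, and `actionability_gaps` are hypothetical and not tied to any particular alerting tool. "Timely" is left out because it depends on thresholds and evaluation intervals rather than on metadata.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Alert:
    """Hypothetical alert definition; field names are illustrative only."""
    name: str
    summary: str = ""                           # Specific: what is wrong
    labels: dict = field(default_factory=dict)  # Contextual: service, region, etc.
    owner: str = ""                             # Actionable: who can act on it
    runbook_url: str = ""                       # Linked: remediation guidance
    dashboard_url: str = ""                     # Linked: supporting graphs


def actionability_gaps(alert: Alert) -> list[str]:
    """Return the actionability properties the alert definition fails to meet."""
    gaps = []
    if not alert.summary.strip():
        gaps.append("specific: missing summary")
    if not alert.labels:
        gaps.append("contextual: no labels")
    if not alert.owner.strip():
        gaps.append("actionable: no owner")
    if not (alert.runbook_url or alert.dashboard_url):
        gaps.append("linked: no runbook or dashboard link")
    return gaps
```

A check like this could run in CI against alert definitions so that an alert without an owner or a link is caught before it ever pages anyone.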

Alert Anti-Patterns

•Flapping alerts: Constantly firing and resolving
•Too sensitive: Alerts on normal variance
•No runbook: Alert with no remediation guidance
•Wrong audience: Alerting people who can't help
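
Flapping in particular is easy to quantify. The following sketch assumes an alert's history is available as a list of booleans, one per evaluation; the function names and the default thresholds are illustrative assumptions. It counts state changes to detect flapping, and shows one common mitigation: firing only after the condition has held for several consecutive evaluations.

```python
def count_transitions(states):
    """Count how often the alert switched between firing and resolved."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


def is_flapping(states, max_transitions=4):
    """Treat an alert as flapping if it changed state too often in the window.

    max_transitions is an illustrative default, not a recommended value.
    """
    return count_transitions(states) > max_transitions


def debounce(breaches, hold=3):
    """Fire only after `hold` consecutive breaching evaluations."""
    fired = []
    streak = 0
    for breached in breaches:
        streak = streak + 1 if breached else 0
        fired.append(streak >= hold)
    return fired
```

For example, a threshold that is crossed on every other evaluation flaps when alerted on directly, but never fires once debounced with `hold=3`, while a sustained breach still fires after the third evaluation.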

Runbook Structure

```markdown
# Alert: High API Error Rate

## Summary
API error rate exceeds 5% for 5 minutes

## Impact
Users experiencing failed requests

## Diagnosis Steps
1. Check error logs: [link]
2. Check recent deployments: [link]
3. Check database health: [link]

## Remediation Steps
1. If recent deployment, rollback: `kubectl rollout undo...`
2. If database issue, scale: `gcloud sql instances patch...`
3. If unknown, escalate to: @team-leads

## Escalation
- L1: On-call engineer
- L2: Team lead (if not resolved in 15 min)
- L3: VP Engineering (if customer impact > 30 min)
```

Best Practices

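•Alert on symptoms (what users experience, such as failed requests), not causes
•Use multi-window alerting to reduce noise: require the condition to hold over both a long and a short window
•Include dashboard and runbook links in alerts
•Review and prune alerts quarterly
•Track the alert-to-incident ratio; many alerts per real incident suggests noisy alerting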
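
Multi-window alerting can be sketched as follows, assuming an SLO-based setup in which error ratios over a long and a short window are already available. The function names, the 99.9% target, and the 14.4 burn-rate threshold are illustrative assumptions, not values from this skill. The alert fires only when both windows burn error budget faster than the threshold: the long window filters out brief spikes, and the short window lets the alert clear soon after recovery.

```python
def burn_rate(error_ratio, slo_target):
    """How fast the error budget is being consumed (1.0 = exactly on budget)."""
    budget = 1.0 - slo_target
    return error_ratio / budget


def should_page(long_window_error_ratio, short_window_error_ratio,
                slo_target=0.999, threshold=14.4):
    """Page only if both windows exceed the burn-rate threshold.

    slo_target and threshold are illustrative defaults (assumptions).
    """
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )
```

With these defaults, a 2% error ratio in both windows pages, while a 2% ratio in the long window combined with a recovered short window does not.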
