Alerting Context
Authentication
IMPORTANT: Credentials are injected automatically by a proxy layer. Do NOT check for PAGERDUTY_API_KEY in environment variables - it won't be visible to you. Just run the scripts directly; authentication is handled transparently.
Why Alerting Context Matters
Before diving into logs and metrics, understand:
- Has this happened before? Check similar past incidents
- Who's responding? Know who's on-call and assigned
- What else is alerting? Correlated alerts reveal scope
- How long do similar issues take? MTTR sets expectations
Available Scripts
All scripts are in .claude/skills/alerting-context/scripts/
get_incident.py - Get Incident Details
bash
python .claude/skills/alerting-context/scripts/get_incident.py --id INCIDENT_ID [--timeline]

# Examples:
python .claude/skills/alerting-context/scripts/get_incident.py --id P123ABC
python .claude/skills/alerting-context/scripts/get_incident.py --id P123ABC --timeline
list_incidents.py - List Incidents with Filters
bash
python .claude/skills/alerting-context/scripts/list_incidents.py [--status STATUS] [--days N] [--limit N]

# Examples:
python .claude/skills/alerting-context/scripts/list_incidents.py
python .claude/skills/alerting-context/scripts/list_incidents.py --status triggered
python .claude/skills/alerting-context/scripts/list_incidents.py --status acknowledged --limit 10
python .claude/skills/alerting-context/scripts/list_incidents.py --days 30
calculate_mttr.py - Calculate Mean Time To Resolve
bash
python .claude/skills/alerting-context/scripts/calculate_mttr.py [--service SERVICE_ID] [--days N]

# Examples:
python .claude/skills/alerting-context/scripts/calculate_mttr.py
python .claude/skills/alerting-context/scripts/calculate_mttr.py --days 30
python .claude/skills/alerting-context/scripts/calculate_mttr.py --service PSERVICE123 --days 90
Investigation Workflow
Step 1: Get Current Incident Context
bash
# Get details of the current incident
python .claude/skills/alerting-context/scripts/get_incident.py --id P123ABC --timeline
Returns:
- Incident title, status, urgency
- Service affected
- Who acknowledged, and when
- Timeline of actions taken
Step 2: Find Similar Past Incidents
bash
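# Get resolved incidents from the last 30 days
python .claude/skills/alerting-context/scripts/list_incidents.py --days 30 --status resolved

# Check for patterns in a specific service.
# Note: --service does not appear in the list_incidents.py usage above; if the
# flag is rejected, drop it and filter the --days 90 output by service instead.
python .claude/skills/alerting-context/scripts/list_incidents.py --service PSERVICE123 --days 90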
Look for:
- Same alert title recurring → Known issue or flapping
- Cluster of alerts → Systemic problem
- Low ack rate → Possible alert fatigue
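The first signal in the list above, a recurring alert title, is easy to check mechanically. The following is a minimal sketch, assuming the incident list has already been parsed into dictionaries with a `title` key. That shape, the `recurring_titles` helper, and the sample data are all hypothetical and not the documented output of list_incidents.py.

```python
from collections import Counter


def recurring_titles(incidents, min_count=2):
    """Return (title, count) pairs for titles seen at least min_count times.

    `incidents` is assumed to be a list of dicts with a "title" key
    (hypothetical shape; adapt it to the real script output).
    """
    # Normalize case and surrounding whitespace so near-identical titles group together.
    counts = Counter(inc["title"].strip().lower() for inc in incidents)
    return [(title, n) for title, n in counts.most_common() if n >= min_count]


# Hypothetical sample data, for illustration only.
sample = [
    {"title": "High CPU on api-gateway"},
    {"title": "Disk usage > 90%"},
    {"title": "high cpu on api-gateway "},
]
print(recurring_titles(sample))  # [('high cpu on api-gateway', 2)]
```

Normalizing titles is a deliberate simplification; alerts that embed hostnames or timestamps in the title would need looser matching.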
Step 3: Check Historical MTTR
bash
# Get MTTR for this service
python .claude/skills/alerting-context/scripts/calculate_mttr.py --service PSERVICE123 --days 30
Returns:
- Average MTTR (minutes/hours)
- Median MTTR
- 95th percentile
- Fastest/slowest resolution
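To make the statistics above concrete, here is a small sketch of how they could be derived from a list of resolution times in minutes. It is not necessarily how calculate_mttr.py computes them: the `mttr_summary` helper is hypothetical, and the nearest-rank percentile method is an assumption chosen for simplicity.

```python
import math
import statistics


def mttr_summary(durations_min):
    """Summarize resolution times given in minutes."""
    if not durations_min:
        raise ValueError("need at least one resolved incident")
    ordered = sorted(durations_min)
    # Nearest-rank 95th percentile: the smallest value with at least
    # 95% of the samples at or below it.
    rank = math.ceil(95 * len(ordered) / 100)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
        "fastest": ordered[0],
        "slowest": ordered[-1],
    }


# Hypothetical durations in minutes, for illustration only.
print(mttr_summary([12, 30, 45, 18, 240]))
```

In this sample a single 240-minute outlier pulls the mean to 69 minutes while the median stays at 30, which is why it helps to read the median and the 95th percentile alongside the average.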
Quick Commands Reference
| Goal | Command |
|---|---|
| Get incident | get_incident.py --id P123ABC |
| With timeline | get_incident.py --id P123ABC --timeline |
| Active incidents | list_incidents.py --status triggered |
| Acknowledged | list_incidents.py --status acknowledged |
| Last 30 days | list_incidents.py --days 30 |
| Calculate MTTR | calculate_mttr.py --service X --days 30 |
Common Patterns
Pattern 1: "Is this a known issue?"
bash
# Search for similar alerts in the last 30 days
python .claude/skills/alerting-context/scripts/list_incidents.py --days 30

# Check the output for recurring alert titles
# Look for the same service and similar patterns
Pattern 2: "Escalation Investigation"
bash
# Get full incident details with timeline
python .claude/skills/alerting-context/scripts/get_incident.py --id P123ABC --timeline

# Check 'assignments' and 'acknowledgements' in the output
# The timeline shows escalation events
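If the timeline is available as structured data, the escalation and acknowledgement entries can be pulled out programmatically. The sketch below assumes each entry is a dictionary with `type` and `created_at` keys, using PagerDuty-style log entry type names such as `escalate_log_entry`. Whether get_incident.py emits exactly this shape is an assumption, and `escalation_events` is a hypothetical helper.

```python
def escalation_events(timeline):
    """Return (created_at, type) pairs for escalation and acknowledgement entries.

    `timeline` is assumed to be a list of dicts with "type" and "created_at"
    keys (hypothetical shape; adapt it to the real script output).
    """
    wanted = {"escalate_log_entry", "acknowledge_log_entry"}
    # ISO 8601 UTC timestamps in the same format sort correctly as strings.
    ordered = sorted(timeline, key=lambda entry: entry["created_at"])
    return [(e["created_at"], e["type"]) for e in ordered if e["type"] in wanted]


# Hypothetical timeline, for illustration only.
sample_timeline = [
    {"type": "acknowledge_log_entry", "created_at": "2024-01-01T10:35:00Z"},
    {"type": "trigger_log_entry", "created_at": "2024-01-01T10:00:00Z"},
    {"type": "escalate_log_entry", "created_at": "2024-01-01T10:30:00Z"},
]
for created_at, entry_type in escalation_events(sample_timeline):
    print(created_at, entry_type)
```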
Pattern 3: "SLA/MTTR Tracking"
bash
# Get MTTR for incident comparison
python .claude/skills/alerting-context/scripts/calculate_mttr.py --service PSERVICE123 --days 30

# Compare the current incident duration to the historical average
# If current > p95, this is an unusually long incident
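The p95 comparison described above can be sketched as follows. This assumes the incident's creation time is an ISO 8601 UTC string such as `2024-01-01T10:00:00Z` and that the p95 value is expressed in minutes. Both are assumptions about the script output, and `is_unusually_long` is a hypothetical helper.

```python
from datetime import datetime, timezone


def is_unusually_long(created_at, p95_minutes, now=None):
    """Return True when the incident has been open longer than the historical p95.

    `created_at` is assumed to be an ISO 8601 UTC timestamp string;
    `now` can be injected for testing and defaults to the current UTC time.
    """
    now = now or datetime.now(timezone.utc)
    # fromisoformat() in older Python versions rejects a trailing "Z".
    started = datetime.fromisoformat(created_at.replace("Z", "+00:00"))
    elapsed_min = (now - started).total_seconds() / 60
    return elapsed_min > p95_minutes


# Hypothetical values: an incident opened two hours before the reference time.
reference = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(is_unusually_long("2024-01-01T10:00:00Z", p95_minutes=90, now=reference))  # True
```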
Output Format
markdown
## Alerting Context Summary

### Current Incident
- **ID**: [incident_id]
- **Title**: [title]
- **Status**: [triggered/acknowledged/resolved]
- **Service**: [service_name]
- **Urgency**: [high/low]
- **Created**: [timestamp]
- **Duration**: [how long since created]

### On-Call
- **Primary**: [name] ([email])
- **Secondary**: [name] ([email])
- **Escalation Policy**: [policy_name]

### Historical Context
- **Similar incidents (30d)**: N incidents with same/similar title
- **Average MTTR for this service**: X minutes
- **This alert fires**: Z times/week on average

### Recommendations
- [If recurring] Review runbook for this alert
- [If long duration] Consider escalating
- [If noisy] Consider tuning alert threshold
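The **Duration** field in the template reads best as a compact, human-friendly value. One possible way to produce it is sketched below; the `format_duration` helper and its output style are illustrative choices, not something the scripts are known to provide.

```python
def format_duration(total_seconds):
    """Format elapsed seconds as a compact string such as '2h 05m' or '1d 3h'."""
    minutes = int(total_seconds) // 60
    days, remainder = divmod(minutes, 24 * 60)
    hours, mins = divmod(remainder, 60)
    if days:
        return f"{days}d {hours}h"
    if hours:
        return f"{hours}h {mins:02d}m"
    return f"{mins}m"


print(format_duration(7500))    # 2h 05m
print(format_duration(100000))  # 1d 3h
```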
Anti-Patterns to Avoid
- ❌ Ignoring past incidents - Always check if it's a known issue
- ❌ Not checking on-call - Know who's responding before investigating
- ❌ Missing correlated alerts - One incident might mask the real issue
- ❌ Forgetting MTTR context - Know what "normal" resolution looks like
- ❌ Unbounded queries - Always use time ranges to avoid timeouts
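As an illustration of the last point, a bounded query amounts to turning a day count into an explicit start and end timestamp. The sketch below shows one way a `--days N` value could map to such a window. How the scripts actually build their queries is not shown here, so the `time_window` helper is a hypothetical stand-in.

```python
from datetime import datetime, timedelta, timezone


def time_window(days, now=None):
    """Return (since, until) ISO 8601 UTC strings covering the last `days` days."""
    if days <= 0:
        raise ValueError("days must be positive")
    until = now or datetime.now(timezone.utc)
    since = until - timedelta(days=days)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return since.strftime(fmt), until.strftime(fmt)


# Hypothetical reference time, for illustration only.
reference = datetime(2024, 3, 31, 12, 0, 0, tzinfo=timezone.utc)
print(time_window(30, now=reference))
# ('2024-03-01T12:00:00Z', '2024-03-31T12:00:00Z')
```

Rejecting non-positive values keeps a typo from silently producing an empty or inverted range.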