Monitoring and Alerting

Name: monitoring-and-alerting
Rating: 88
Author: SanctifiedOps

Role framing: You are an observability lead. Your goal is to ensure early detection of issues across RPC, programs, and markets.

Initial Assessment

• Metrics selection
  - RPC: latency, error rates, slot lag.
  - Tx pipeline: submit latency, confirmation time, failure codes.
  - Program: error counts by code, compute units used.
  - Market: price, liquidity depth, volume, holder concentration.
• Instrumentation
  - Add logs with error codes; emit metrics from services/bots.
  - Subscribe to webhooks for program logs/events.
• Dashboards
  - Build views for user journeys (connect, sign, swap/mint) and infra (RPC health).
• Alerts
  - Set thresholds and runbooks (e.g., tx fail rate >3% over 5m -> switch RPC).
  - Pager paths with severity levels.
• Testing
  - Run fire-drill alerts; validate runbooks; confirm contacts are current.
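
The example rule under Alerts (tx fail rate >3% over 5m -> switch RPC) can be sketched as a sliding-window check. This is a minimal illustration, not a definitive implementation: the class name `FailRateAlert`, the `min_samples` guard, and the method names are assumptions introduced here, and the "switch RPC" action itself is left to the runbook.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FailRateAlert:
    """Sliding-window tx failure-rate check (hypothetical helper)."""

    threshold: float = 0.03   # 3% failure rate, from the example rule
    window_s: float = 300.0   # 5 minutes, from the example rule
    min_samples: int = 20     # assumption: do not fire on very few transactions
    _events: deque = field(default_factory=deque, init=False)  # (timestamp, failed)

    def record(self, ts: float, failed: bool) -> None:
        """Store one transaction outcome observed at time ts (seconds)."""
        self._events.append((ts, failed))
        self._evict(ts)

    def _evict(self, now: float) -> None:
        """Drop outcomes that have aged out of the window."""
        while self._events and self._events[0][0] <= now - self.window_s:
            self._events.popleft()

    def fail_rate(self, now: float) -> float:
        """Fraction of failed transactions inside the window ending at now."""
        self._evict(now)
        if not self._events:
            return 0.0
        failures = sum(1 for _, failed in self._events if failed)
        return failures / len(self._events)

    def should_fire(self, now: float) -> bool:
        """True when the window has enough samples and exceeds the threshold."""
        rate = self.fail_rate(now)
        return len(self._events) >= self.min_samples and rate > self.threshold
```

A bot would call `record()` after each submitted transaction and poll `should_fire()`; when it returns True, the on-call follows the runbook step (here, switching RPC). The `min_samples` guard is a design choice to keep a single failed transaction during a quiet period from paging anyone.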

Provide a monitoring plan: metrics list, dashboards needed, alert thresholds with runbooks, and an ownership map.
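
One way to keep such a plan checkable is to write it as data and validate that every alert points at a known metric, has a runbook, and has an owner. The sketch below is illustrative only: `AlertRule`, `validate_plan`, the metric names, the severity label, and the owner entry are all hypothetical placeholders, not part of any existing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    name: str
    metric: str      # must appear in the metrics list
    condition: str   # human-readable threshold
    severity: str    # assumed labels, e.g. "page" or "notify"
    runbook: str     # first action the responder takes
    owner: str       # key into the ownership map


def validate_plan(metrics, alerts, owners):
    """Return a list of problems; an empty list means the plan is consistent."""
    problems = []
    for rule in alerts:
        if rule.metric not in metrics:
            problems.append(f"{rule.name}: unknown metric {rule.metric!r}")
        if not rule.runbook.strip():
            problems.append(f"{rule.name}: missing runbook")
        if rule.owner not in owners:
            problems.append(f"{rule.name}: no owner entry for {rule.owner!r}")
    return problems


# Example plan; every name below is a placeholder.
METRICS = {"rpc_latency_ms", "rpc_slot_lag", "tx_fail_rate", "program_error_count"}
DASHBOARDS = ["user journeys: connect, sign, swap/mint", "infra: RPC health"]
OWNERS = {"infra": "infra on-call rotation"}
ALERTS = [
    AlertRule(
        name="tx-fail-rate",
        metric="tx_fail_rate",
        condition="> 3% over 5m",
        severity="page",
        runbook="switch RPC",
        owner="infra",
    ),
]

if __name__ == "__main__":
    for problem in validate_plan(METRICS, ALERTS, OWNERS):
        print(problem)
```

Running the validation in CI or before a fire drill surfaces alerts that have lost their runbook or owner.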

• Simple: Dashboard for tx success + RPC latency; alert to Slack on error spike; runbook to switch RPC.
• Complex: Full stack including program log parsing, pool depth alerts, holder concentration tracking; PagerDuty rotation with quarterly drills.
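
For the holder concentration tracking mentioned in the Complex tier, one common measure (an assumption here, since the text does not define the metric) is the share of total supply held by the top N holders. The function name `top_n_share` and its inputs are hypothetical; fetching the balances is out of scope for this sketch.

```python
def top_n_share(balances, n=10):
    """Fraction of total supply held by the n largest holders.

    balances: iterable of non-negative holder balances (assumed already fetched).
    Returns 0.0 when there is no supply to measure.
    """
    ordered = sorted(balances, reverse=True)
    total = sum(ordered)
    if total <= 0:
        return 0.0
    return sum(ordered[:n]) / total
```

An alert could then compare this value against a chosen limit, or against its own value from the previous day, and route through the same severity levels as the other alerts.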