---
name: observability
description: Use when implementing metrics, tracing, SLOs, alerting, or dashboards. Covers Prometheus/Grafana/OTel stack design, SLI/SLO frameworks, error budgets, burn-rate alerting, and distributed tracing strategy.
---

Observability

Decision Framework: What to Instrument

Golden Signals (prefer for services): Latency, Traffic, Errors, Saturation

  • RED (request-scoped): Rate, Errors, Duration
  • USE (resource-scoped): Utilization, Saturation, Errors

Pick RED for microservices and USE for infrastructure. Don't mix the two methods on the same component.

Metric Design Opinions

  • Always use histograms over summaries for latency -- histograms are aggregatable, summaries are not
  • Bucket defaults (upper bounds in seconds) [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 2, 5] cover most HTTP services
  • Exclude /health and /metrics endpoints from SLI calculations
  • Use recording rules for any query used in alerts or dashboards -- never put raw PromQL in alerts
  • Label cardinality kills Prometheus: never use user IDs, request IDs, or unbounded values as labels
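
To make the "histograms are aggregatable" point concrete, here is a minimal sketch in plain Python, not Prometheus code. It sums cumulative bucket counts from two instances and then estimates a quantile by linear interpolation, which is the same idea `histogram_quantile` applies. The instance counts are invented for illustration, and the helper names `merge` and `quantile` are hypothetical.

```python
# Upper bucket bounds in seconds: the defaults listed above plus the implicit +Inf bucket.
BOUNDS = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 2, 5, float("inf")]


def merge(*histograms):
    """Aggregate instances by summing cumulative bucket counts element-wise."""
    return [sum(counts) for counts in zip(*histograms)]


def quantile(q, cumulative, bounds=BOUNDS):
    """Estimate the q-quantile by linear interpolation inside the target bucket."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = cumulative[-1]
    if total == 0:
        return float("nan")
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    i = next(idx for idx, count in enumerate(cumulative) if count >= rank)
    if bounds[i] == float("inf"):
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return bounds[i - 1]
    lower = bounds[i - 1] if i > 0 else 0.0
    below = cumulative[i - 1] if i > 0 else 0
    return lower + (bounds[i] - lower) * (rank - below) / (cumulative[i] - below)


# Hypothetical cumulative counts from two instances of one service.
instance_a = [0, 0, 10, 40, 80, 95, 100, 100, 100, 100]
instance_b = [0, 0, 0, 20, 60, 90, 98, 100, 100, 100]
fleet = merge(instance_a, instance_b)
print(f"fleet p50 ~ {quantile(0.50, fleet):.3f}s, p99 ~ {quantile(0.99, fleet):.3f}s")
```

A summary exports precomputed quantiles per instance, so there is nothing equivalent to `merge` for it: averaging two p99 values does not give the fleet p99.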

Service Tier Classification

| Tier | Availability | Latency P99 | Examples |
| --- | --- | --- | --- |
| Critical | 99.95% | 100ms | Payment, auth |
| Essential | 99.9% | 500ms | Search, catalog |
| Standard | 99.5% | 1s | Recommendations |
| Best Effort | 99.0% | 2s | Batch, reporting |

Assign tiers before writing SLOs. Tier drives alert routing and error budget policy.
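
As a rough illustration of what each tier's availability target allows, the sketch below converts the percentages into minutes of full downtime. The 30-day window is an assumption (the tier table does not state one), and `allowed_downtime_minutes` is a hypothetical helper.

```python
# Availability targets copied from the tier table above.
TIER_AVAILABILITY_PCT = {
    "Critical": 99.95,
    "Essential": 99.9,
    "Standard": 99.5,
    "Best Effort": 99.0,
}


def allowed_downtime_minutes(availability_pct, window_days=30):
    """Minutes of total outage the target tolerates over the window (assumed 30d)."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - availability_pct) / 100


for tier, pct in TIER_AVAILABILITY_PCT.items():
    print(f"{tier:<12} {pct}% -> {allowed_downtime_minutes(pct):.1f} min per 30d")
```

Seeing the budget in minutes tends to make the tier conversation easier: the gap between Critical and Standard is roughly a factor of ten in tolerated outage time.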

SLO Framework

Error Budget Policy (non-negotiable escalation)

| Budget Remaining | Action |
| --- | --- |
| >50% | Normal velocity |
| 10-50% | Postpone risky changes |
| 1-10% | Freeze non-critical changes |
| 0% | Feature freeze, reliability only |
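
The policy can be sketched as a small lookup driven by request counts. This is an illustration under stated assumptions, not a reference implementation: the table leaves the range between 0% and 1% undefined, so the sketch treats anything below 1% as exhausted, and it puts exactly 50% in the "Postpone" row. The function names are hypothetical.

```python
def budget_remaining(slo, total_requests, bad_requests):
    """Fraction of the error budget still unspent (negative when overspent)."""
    allowed_bad = total_requests * (1 - slo)
    if allowed_bad <= 0:
        raise ValueError("need traffic and an SLO below 100% to have a budget")
    return 1 - bad_requests / allowed_bad


def policy_action(remaining):
    """Map remaining budget (0.0 to 1.0) to the escalation step in the table above."""
    if remaining > 0.50:
        return "Normal velocity"
    if remaining >= 0.10:
        return "Postpone risky changes"
    if remaining >= 0.01:
        return "Freeze non-critical changes"
    # Assumption: below 1% is handled like an exhausted budget.
    return "Feature freeze, reliability only"


# Hypothetical month: 1M requests, 400 failures, 99.9% SLO.
remaining = budget_remaining(slo=0.999, total_requests=1_000_000, bad_requests=400)
print(f"{remaining:.0%} of budget left -> {policy_action(remaining)}")
```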

Release Decision Matrix

| Budget Status | Low Risk | Medium Risk | High Risk |
| --- | --- | --- | --- |
| Healthy | Approve | Approve | Review |
| Warning | Review | Defer | Block |
| Critical | Defer | Block | Block |
| Exhausted | Block | Block | Block |
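
One way to encode the matrix for a release gate is a nested dictionary. The mapping from remaining budget to a status name is an assumption: the source does not define it, so this sketch borrows the thresholds from the Error Budget Policy table and treats anything under 1% as Exhausted.

```python
# Release decisions copied from the matrix above.
DECISIONS = {
    "Healthy": {"Low": "Approve", "Medium": "Approve", "High": "Review"},
    "Warning": {"Low": "Review", "Medium": "Defer", "High": "Block"},
    "Critical": {"Low": "Defer", "Medium": "Block", "High": "Block"},
    "Exhausted": {"Low": "Block", "Medium": "Block", "High": "Block"},
}


def budget_status(remaining):
    """Assumed thresholds: >50% Healthy, 10-50% Warning, 1-10% Critical, else Exhausted."""
    if remaining > 0.50:
        return "Healthy"
    if remaining >= 0.10:
        return "Warning"
    if remaining >= 0.01:
        return "Critical"
    return "Exhausted"


def release_decision(remaining, risk):
    """Look up the decision for a change with risk 'Low', 'Medium', or 'High'."""
    return DECISIONS[budget_status(remaining)][risk]


print(release_decision(0.30, "Medium"))
```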

Burn Rate Alert Thresholds

| Alert | Burn Rate | Windows (long + short) | Action |
| --- | --- | --- | --- |
| Fast burn | 14.4x | 1h + 5m | Page on-call |
| Slow burn | 3x | 6h + 30m | Create ticket |

Multi-window burn rate is the only correct SLO alerting pattern. Single-window alerts produce false positives or miss slow degradation.
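
The arithmetic behind these thresholds can be sketched as follows. Burn rate is the observed error ratio divided by the ratio the SLO allows, and an alert fires only when both the long and the short window exceed the threshold. The 30-day SLO period is an assumption, and the error ratios passed in are invented; in practice they would come from recording rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRule:
    name: str
    threshold: float
    long_window: str
    short_window: str
    action: str


# Thresholds and windows copied from the table above.
RULES = [
    BurnRule("fast", 14.4, "1h", "5m", "Page on-call"),
    BurnRule("slow", 3.0, "6h", "30m", "Create ticket"),
]


def burn_rate(error_ratio, slo):
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_ratio / (1 - slo)


def is_firing(rule, slo, error_ratio_by_window):
    """Both windows must exceed the threshold; the short one confirms it is ongoing."""
    long_rate = burn_rate(error_ratio_by_window[rule.long_window], slo)
    short_rate = burn_rate(error_ratio_by_window[rule.short_window], slo)
    return long_rate > rule.threshold and short_rate > rule.threshold


def budget_consumed(rate, window_hours, slo_period_hours=30 * 24):
    """Share of the period's budget spent by burning at `rate` for the window."""
    return rate * window_hours / slo_period_hours


# Hypothetical measurements for a 99.9% SLO.
observed = {"1h": 0.02, "5m": 0.02, "6h": 0.004, "30m": 0.0005}
for rule in RULES:
    print(rule.name, is_firing(rule, 0.999, observed), rule.action)
print(f"14.4x for 1h spends {budget_consumed(14.4, 1):.1%} of a 30d budget")
```

In the example the slow-burn rule stays quiet even though its 6h window is above 3x, because the 30m window shows the problem has already stopped. That reset behaviour is what a single long window cannot provide.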

Progressive SLO Rollout

Start at 99.0% for 1 month baseline, then tighten: 99.5% (2 months) -> 99.9% (3 months) -> 99.95% (ongoing). Never set SLO tighter than current measured reliability.
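
The "never tighter than measured" rule can be expressed as a guard on the rollout ladder. This is a minimal sketch: it ignores the dwell time at each step (1, 2, and 3 months above), and `next_target` is a hypothetical helper.

```python
# Targets from the rollout sequence above, in percent.
LADDER = [99.0, 99.5, 99.9, 99.95]


def next_target(current_pct, measured_pct):
    """Move up one rung only if measured reliability already meets that rung."""
    higher = [target for target in LADDER if target > current_pct]
    if higher and measured_pct >= higher[0]:
        return higher[0]
    return current_pct


print(next_target(99.0, 99.7))  # measured 99.7% supports the 99.5% rung
print(next_target(99.5, 99.7))  # but not the 99.9% rung yet
```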

SLO Templates

  • API service: availability (99.9% over 30d) + latency (95% of requests < 500ms over 30d)
  • Data pipeline: freshness (99% of batches within 30 min over 7d) + completeness (99.95% of records processed over 7d)

Distributed Tracing Strategy

Sampling Decisions

  • Dev/staging: 100% sampling
  • Production low-traffic (<1k rps): 10-50% probabilistic
  • Production high-traffic (>10k rps): 1% probabilistic or rate-limit to ~100 traces/sec
  • Always use ParentBased sampler so child spans follow parent's decision
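  • Force-sample all errors and high-latency requests regardless of probabilistic rate -- this needs tail-based sampling (for example in the Collector), because a head sampler decides before the outcome is known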
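
The parent-based and probabilistic rules above can be sketched in plain Python. This mirrors the idea behind OpenTelemetry's ParentBased and TraceIdRatioBased samplers but is not SDK code; the function names and the choice to compare the low 64 bits of the trace ID are assumptions made for illustration.

```python
MAX_64 = 2 ** 64


def ratio_sampled(trace_id_hex, ratio):
    """Deterministic decision: the same trace ID always gets the same answer."""
    low_64 = int(trace_id_hex, 16) & (MAX_64 - 1)
    return low_64 < int(ratio * MAX_64)


def parent_based(trace_id_hex, ratio, parent_sampled=None):
    """Follow the parent's decision when there is one, else sample by ratio."""
    if parent_sampled is not None:
        return parent_sampled
    return ratio_sampled(trace_id_hex, ratio)


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
print(parent_based(trace_id, 0.01))                       # root span, 1% rate
print(parent_based(trace_id, 0.01, parent_sampled=True))  # child follows parent
```

Deriving the decision from the trace ID rather than a random draw means every service that sees the same root trace reaches the same verdict, so traces are kept or dropped whole.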

Context Propagation

  • Use W3C traceparent header (not B3 or Jaeger-native) for new systems
  • Always inject trace_id into structured logs for correlation
  • Propagate context through async boundaries (queues, event buses) explicitly
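
For reference, a version-00 `traceparent` value has four dash-separated hex fields: version, 32-character trace ID, 16-character parent span ID, and flags. The sketch below parses one, adds `trace_id` to a JSON log line, and copies the header into queue message metadata. The helper names and the message envelope shape are hypothetical, and a real consumer would start a new span rather than reuse the header unchanged.

```python
import json
import re

TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return trace context from a version-00 header, or None if it is invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are defined as invalid by the W3C Trace Context spec.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def log_line(message, ctx):
    """Structured log entry carrying trace_id for log/trace correlation."""
    return json.dumps({"message": message, "trace_id": ctx["trace_id"]})


def to_queue_message(payload, traceparent):
    """Carry context across an async boundary in message metadata (assumed shape)."""
    return {"headers": {"traceparent": traceparent}, "body": payload}


header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(header)
print(log_line("order accepted", ctx))
print(to_queue_message({"order_id": 42}, header)["headers"])
```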

Backend Choice

  • Tempo (Grafana stack): prefer when already using Grafana; object-storage backed, cheap at scale
  • Jaeger: prefer when you need standalone deployment or Elasticsearch integration
  • Both support OTLP -- always send via OpenTelemetry Collector, never direct from app to backend

Alerting Opinions

  • Alert on symptoms, not causes -- alert on error rate, not "pod restarted"
  • Severity levels: critical (pages), warning (tickets), info (dashboard only)
  • Every alert must have a runbook link in annotations
  • Set the for: duration by severity: critical >= 2m, warning >= 5m, info >= 15m -- prevents flapping
  • Route critical to PagerDuty, warning to Slack channel, info to dashboard only
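
These rules lend themselves to a lint step in CI. The sketch below checks an alert rule that has already been loaded into a dictionary with the usual Prometheus fields (`for`, `labels`, `annotations`). The annotation name `runbook_url` is an assumption, since the source only asks for "a runbook link", and the duration parser handles only single-unit values such as `5m`.

```python
import re

# Minimum for: durations in seconds, taken from the list above.
MIN_FOR_SECONDS = {"critical": 120, "warning": 300, "info": 900}
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text):
    """Parse single-unit durations like '90s', '5m', '1h' into seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text)
    if not match:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def lint_alert(rule):
    """Return a list of problems; an empty list means the rule passes."""
    problems = []
    severity = rule.get("labels", {}).get("severity")
    if severity not in MIN_FOR_SECONDS:
        problems.append("severity must be critical, warning, or info")
    elif parse_duration(rule.get("for", "0s")) < MIN_FOR_SECONDS[severity]:
        problems.append(f"for: is shorter than the {severity} minimum")
    if not rule.get("annotations", {}).get("runbook_url"):
        problems.append("missing runbook link annotation")
    return problems


# Hypothetical rule that violates two of the opinions above.
rule = {"alert": "HighErrorRate", "for": "1m", "labels": {"severity": "critical"}}
print(lint_alert(rule))
```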

Stack Preferences

| Concern | Preferred Tool | Rationale |
| --- | --- | --- |
| Metrics | Prometheus + Thanos/Mimir | De facto standard, PromQL ecosystem |
| Visualization | Grafana | Dashboard-as-code, multi-datasource |
| Tracing | Tempo or Jaeger via OTel | OTLP-native, cost-effective |
| Logs | Loki or OpenSearch | Loki for Grafana stack, OpenSearch for complex queries |
| Collector | OpenTelemetry Collector | Vendor-neutral pipeline, single agent |
| Alerting | Alertmanager | Native Prometheus integration |

Gotchas

  • Prometheus rate() requires at least 2 data points in the window -- use [5m] minimum with 15s scrape interval
  • histogram_quantile is an estimate; accuracy degrades with poor bucket choices
  • OTel Collector batch processor default timeout is 200ms -- increase to 5-10s for production to reduce export overhead
  • Grafana dashboards without variables become unmaintainable past 3 services
  • Never set scrape intervals shorter than 10s in production -- it causes storage and CPU issues
  • Alertmanager grouping: group by alertname, namespace, service -- too broad silences everything, too narrow floods on-call
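
The `rate()` gotcha comes down to counting how many samples a range window is guaranteed to contain. A back-of-the-envelope sketch, assuming evenly spaced scrapes; the function names and the `tolerated_missed_scrapes` parameter are hypothetical:

```python
def min_samples_in_window(window_s, scrape_interval_s):
    """Fewest samples a window can contain when scrapes are evenly spaced."""
    return window_s // scrape_interval_s


def rate_window_ok(window_s, scrape_interval_s, tolerated_missed_scrapes=0):
    """rate() needs at least 2 samples left after allowing for failed scrapes."""
    return min_samples_in_window(window_s, scrape_interval_s) - tolerated_missed_scrapes >= 2


print(min_samples_in_window(300, 15))              # [5m] at a 15s scrape interval
print(rate_window_ok(30, 15))                      # [30s] works only if no scrape fails
print(rate_window_ok(30, 15, tolerated_missed_scrapes=1))
```

A window that only just holds two samples returns no data as soon as one scrape is late or fails, which is why the recommendation above leaves generous headroom.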