prometheus
Use this skill for building and operating a monitoring stack based on Prometheus, Alertmanager, and Grafana.
Defaults / assumptions to confirm
- •Deployment: kube-prometheus-stack / standalone
- •Alert routing: Alertmanager receivers (Slack/WeCom/PagerDuty)
- •Metrics source: app exporters, node-exporter, kube-state-metrics
- •Naming conventions and label cardinality constraints
Workflow
- •Understand what to measure
- •Identify golden signals: latency, traffic, errors, saturation.
- •Map business KPIs and critical user journeys to technical indicators.
- •Instrumentation guidance
- •Prefer stable metric names and bounded label sets.
- •Avoid high-cardinality labels (user_id, request_id, raw URLs).
- •Use histograms for latency (p50/p95/p99 via histogram_quantile).
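To make the histogram guidance concrete, here is a minimal Python sketch of the linear interpolation that histogram_quantile applies to cumulative buckets. It is a simplified illustration, not Prometheus's actual implementation; the function name estimate_quantile and the SAMPLE_BUCKETS values are hypothetical, and latencies are assumed to be non-negative.

```python
import math

# Hypothetical cumulative latency buckets: (upper bound in seconds, cumulative count).
SAMPLE_BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: iterable of (upper_bound, cumulative_count) pairs that includes
    a final (math.inf, total) pair, mirroring the le label of *_bucket series.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ordered = sorted(buckets)
    if not ordered or ordered[-1][0] != math.inf:
        raise ValueError("buckets must include a +Inf bucket")
    total = ordered[-1][1]
    if total == 0:
        return math.nan  # no observations recorded
    rank = q * total
    lower_bound, lower_count = 0.0, 0  # assumes the first bucket starts at 0
    for upper_bound, count in ordered:
        if count >= rank:
            if upper_bound == math.inf:
                # The quantile falls in the +Inf bucket, so report the
                # highest finite bound instead of interpolating.
                return ordered[-2][0] if len(ordered) > 1 else math.nan
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


p95 = estimate_quantile(0.95, SAMPLE_BUCKETS)
```

With these sample buckets the p95 estimate lands between the 0.5 s and 1.0 s bounds (roughly 0.78 s), and any quantile that falls in the +Inf bucket is capped at 1.0 s. This is why bucket boundaries should be dense around the latency target.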
- •Scraping configuration
- •Confirm scrape targets (ServiceMonitor/PodMonitor or static configs).
- •Ensure relabeling rules are correct; set scrape intervals/timeouts appropriately.
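One check that can be automated when reviewing scrape settings: Prometheus rejects a configuration whose scrape_timeout is longer than its scrape_interval. The sketch below shows that check in Python under simplifying assumptions: the job is already loaded into a plain dict, only simple durations such as "30s" or "1m30s" are handled, and the fallbacks are the documented global defaults (1m interval, 10s timeout). The helper names duration_ms and check_scrape_timing are hypothetical.

```python
import re

_UNITS_MS = {"ms": 1, "s": 1000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}
_PART = re.compile(r"(\d+)(ms|s|m|h|d)")


def duration_ms(text):
    """Convert a duration such as "30s" or "1m30s" to milliseconds."""
    parts = _PART.findall(text)
    # Reject input that the pattern did not consume completely.
    if not parts or "".join(number + unit for number, unit in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return sum(int(number) * _UNITS_MS[unit] for number, unit in parts)


def check_scrape_timing(job):
    """Return a list of problems found in one scrape job (a plain dict)."""
    interval = job.get("scrape_interval", "1m")  # assumed global default
    timeout = job.get("scrape_timeout", "10s")   # assumed global default
    problems = []
    if duration_ms(timeout) > duration_ms(interval):
        problems.append(
            f"scrape_timeout {timeout} exceeds scrape_interval {interval}"
        )
    return problems


# Hypothetical job definition used only for illustration.
print(check_scrape_timing({"job_name": "api", "scrape_interval": "10s",
                           "scrape_timeout": "30s"}))
```

An empty list means the timing is consistent; the same idea extends to other review rules, such as flagging very short intervals on expensive targets.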
- •Alert design (practical)
- •Alerts should be actionable and low-noise.
- •Use multi-window multi-burn-rate for SLO alerts where applicable.
- •Add a for: duration to avoid flapping; include runbook links in annotations.
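The arithmetic behind multi-window multi-burn-rate alerts can be sketched in a few lines of Python. This assumes a 30-day SLO period and the commonly cited pairing of a long window with a shorter confirmation window (for example 1h with 5m); the function names and sample ratios are hypothetical, and the real alert would be written as PromQL over recorded error ratios.

```python
SLO_PERIOD_HOURS = 30 * 24  # assumption: 30-day SLO period


def burn_rate(budget_fraction, window_hours, period_hours=SLO_PERIOD_HOURS):
    """Burn rate that spends budget_fraction of the error budget in window_hours."""
    return budget_fraction * period_hours / window_hours


def should_fire(long_error_ratio, short_error_ratio, slo, rate):
    """Fire only when both the long and the short window exceed the threshold.

    The long window shows the burn is significant; the short window shows it
    is still happening, so the alert resets soon after recovery.
    """
    threshold = rate * (1 - slo)
    return long_error_ratio > threshold and short_error_ratio > threshold


# Spending 2% of the budget within 1 hour corresponds to a burn rate of 14.4.
page_rate = burn_rate(0.02, 1)

# For a 99.9% SLO the error-ratio threshold is 14.4 * 0.001 = 0.0144.
still_burning = should_fire(0.02, 0.03, slo=0.999, rate=page_rate)
already_recovered = should_fire(0.02, 0.001, slo=0.999, rate=page_rate)
```

Here still_burning is true because both windows exceed the threshold, while already_recovered is false because the short window has dropped back, which is the property that keeps these alerts low-noise.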
- •Dashboarding
- •Provide per-service dashboards: RPS, p95 latency, error rate, resource usage.
- •Add drill-down: by route group, instance, and dependency.
- •Troubleshooting checklist
- •Missing metrics: target down, wrong labels, scrape failures, RBAC/network issues.
- •Wrong metrics: unit mismatch, unhandled counter resets, poorly chosen histogram buckets.
- •High load: cardinality explosion, overly frequent scrapes, heavy queries.
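When cardinality explosion is suspected, a rough upper bound on series count per metric helps locate the offending label. The Python sketch below multiplies the number of distinct values per label and accounts for the extra series a classic histogram exposes; the function name worst_case_series and the label counts are hypothetical, and real series counts are usually lower because not every combination occurs.

```python
from math import prod


def worst_case_series(label_values, histogram_buckets=0):
    """Upper bound on the number of time series for one metric.

    label_values: dict mapping label name to its number of distinct values.
    histogram_buckets: number of finite buckets; 0 for a counter or gauge.
    """
    combinations = prod(label_values.values())
    if histogram_buckets:
        # Per label combination: one _bucket series per finite bucket,
        # plus the +Inf bucket, _sum, and _count.
        return combinations * (histogram_buckets + 3)
    return combinations


# Hypothetical label sets for a latency histogram with 10 finite buckets.
bounded = worst_case_series({"route": 20, "method": 5, "status": 6}, 10)
with_user_id = worst_case_series(
    {"route": 20, "method": 5, "status": 6, "user_id": 10_000}, 10
)
```

In this example the bounded label set tops out at 7,800 series, while adding a user_id label with 10,000 values raises the bound to 78,000,000, which illustrates why unbounded labels are the first thing to check under high load.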
Outputs
- •Metrics plan: required metrics, labels, and thresholds.
- •Alert rules: PromQL + severity + routing + runbook.
- •Grafana dashboard layout and key panels.
- •Runbook: symptom → checks → mitigation → rollback.