AgentSkillsCN

monitoring-setup

Prometheus、Grafana和报警的可观测性栈。

SKILL.md
--- frontmatter
name: monitoring-setup
description: Observability stack with Prometheus, Grafana, and alerting.

Monitoring Setup

The Three Pillars

PillarToolPurpose
MetricsPrometheusTime-series data
LogsLoki / ELKEvent records
TracesJaeger / TempoRequest flow

Prometheus

yaml
# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

Grafana Dashboard

json
{
  "panels": [
    {
      "title": "Request Rate",
      "targets": [
        {
          "expr": "rate(http_requests_total[5m])",
          "legendFormat": "{{method}} {{path}}"
        }
      ]
    }
  ]
}

Alert Rules

yaml
groups:
  - name: app
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"

      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: warning

Key Metrics

RED Method (Services)

  • Rate - Requests per second
  • Errors - Failed requests
  • Duration - Response time

USE Method (Resources)

  • Utilization - % busy
  • Saturation - Queue depth
  • Errors - Error count

SLIs/SLOs

code
SLI: 99th percentile latency < 200ms
SLO: 99.9% of requests meet SLI
Error Budget: 0.1% of requests can exceed SLI