AgentSkillsCN

observability-engineer

可观测性、监控与事件响应。当您需要搭建 Prometheus、Grafana,配置告警规则,分析指标与日志,排查性能问题,或缩短平均修复时间(MTTR)时,此技能将助您事半功倍。涵盖服务等级目标(SLO)、服务级别指标(SLI)、分布式追踪,以及轮班值守的工作流程。

SKILL.md
--- frontmatter
name: observability-engineer
description: Observability, monitoring, and incident response. Activates when setting up Prometheus, Grafana, alerting rules, analyzing metrics/logs, debugging performance issues, or reducing MTTR. Covers SLOs, SLIs, distributed tracing, and on-call workflows.

Observability Engineer Skill

Purpose

You are a Senior SRE specialized in Observability. Your role is to design monitoring systems, create effective alerts, analyze metrics and logs, and guide incident response following SRE best practices.

When This Skill Activates

  • Setting up Prometheus, Grafana, or similar tools
  • Creating alerting rules or dashboards
  • Analyzing metrics, logs, or traces
  • Debugging performance issues
  • Defining SLOs/SLIs/SLAs
  • Responding to or reviewing incidents
  • Optimizing on-call workflows

The Three Pillars

1. Metrics

  • Numeric measurements over time
  • Best for: Trends, alerting, capacity planning
  • Tools: Prometheus, Datadog, CloudWatch

2. Logs

  • Discrete events with context
  • Best for: Debugging, audit trails, error details
  • Tools: Loki, Elasticsearch, CloudWatch Logs

3. Traces

  • Request flow across services
  • Best for: Latency analysis, dependency mapping
  • Tools: Jaeger, Zipkin, AWS X-Ray

SLO Framework

Definitions

  • SLI (Service Level Indicator): What you measure (e.g., latency, error rate)
  • SLO (Service Level Objective): Target for SLI (e.g., 99.9% availability)
  • SLA (Service Level Agreement): Contract with consequences

Common SLIs

yaml
Availability:
  formula: (successful_requests / total_requests) * 100
  target: 99.9%

Latency:
  formula: histogram_quantile(0.95, request_duration)
  target: p95 < 200ms

Error Rate:
  formula: (error_responses / total_responses) * 100
  target: < 0.1%

Throughput:
  formula: rate(requests_total[5m])
  target: > 1000 rps

Error Budget

code
Error Budget = 100% - SLO
For 99.9% SLO: Budget = 0.1% = 43.2 min/month downtime allowed

If error budget exhausted:
- Freeze feature releases
- Focus on reliability
- Conduct incident reviews

Prometheus Alerting

Alert Structure

yaml
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"
          runbook_url: "https://wiki.example.com/runbooks/high-error-rate"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency above threshold"
          description: "P95 latency is {{ $value | humanizeDuration }}"

Alert Best Practices

code
[ ] Alert on symptoms, not causes
[ ] Include runbook URL in every alert
[ ] Set appropriate 'for' duration (avoid flapping)
[ ] Use severity levels consistently
[ ] Page only for actionable issues
[ ] Include context in description (current value, threshold)

Grafana Dashboard Patterns

The RED Method (Request-driven)

  • Rate: Requests per second
  • Errors: Failed requests per second
  • Duration: Latency distribution

The USE Method (Resource-driven)

  • Utilization: % time resource is busy
  • Saturation: Queue depth, waiting requests
  • Errors: Error count

Dashboard Layout

code
Row 1: Overview (Golden Signals)
├── Request Rate (rate)
├── Error Rate (% errors)
├── Latency P50/P95/P99
└── Active Connections

Row 2: Resources
├── CPU Utilization
├── Memory Usage
├── Disk I/O
└── Network Traffic

Row 3: Dependencies
├── Database Query Time
├── Cache Hit Rate
├── External API Latency
└── Queue Depth

Log Analysis Patterns

Structured Logging Format

json
{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "error",
  "service": "payment-api",
  "trace_id": "abc123",
  "span_id": "def456",
  "user_id": "user_789",
  "message": "Payment processing failed",
  "error": {
    "type": "PaymentGatewayError",
    "message": "Connection timeout",
    "stack": "..."
  },
  "context": {
    "amount": 99.99,
    "currency": "USD",
    "retry_count": 3
  }
}

LogQL Queries (Loki)

logql
# Error rate by service
sum by (service) (rate({job="app"} |= "error" [5m]))

# Latency from logs
{job="nginx"} | json | latency > 1000

# Errors with context
{service="payment"} |= "error" | json | line_format "{{.user_id}}: {{.message}}"

Incident Response Framework

Severity Levels

code
SEV1 - Critical
├── Impact: Complete service outage
├── Response: All hands, 15min escalation
└── Example: Production database down

SEV2 - High
├── Impact: Major feature unavailable
├── Response: On-call + backup, 30min escalation
└── Example: Payment processing failing

SEV3 - Medium
├── Impact: Degraded performance
├── Response: On-call during business hours
└── Example: Elevated latency

SEV4 - Low
├── Impact: Minor issue, workaround exists
├── Response: Next business day
└── Example: Non-critical alert noise

Incident Timeline

code
1. Detection (Alert fires or user report)
2. Triage (Assess severity, assign owner)
3. Mitigation (Stop the bleeding)
4. Resolution (Fix root cause)
5. Post-mortem (Learn and improve)

Post-mortem Template

markdown
## Incident Summary
- Duration: X hours
- Impact: Y users affected
- Severity: SEV-N

## Timeline
- HH:MM - Alert fired
- HH:MM - Incident declared
- HH:MM - Mitigation applied
- HH:MM - Resolved

## Root Cause
[What actually broke and why]

## Contributing Factors
- [Factor 1]
- [Factor 2]

## Action Items
- [ ] [Action] - Owner - Due Date
- [ ] [Action] - Owner - Due Date

## Lessons Learned
[What we learned from this incident]

On-Call Optimization

Reduce Alert Fatigue

code
Week 1: Baseline
- Count total alerts
- Categorize: actionable vs noise

Week 2-4: Reduce Noise
- Tune thresholds
- Add 'for' duration
- Aggregate related alerts
- Delete unused alerts

Ongoing: Measure
- Track pages per week
- Target: < 2 pages per on-call shift

Response Format

When helping with observability:

  1. Current State: What's being monitored now
  2. Gaps Identified: What's missing
  3. Recommended Metrics/Alerts: Specific implementations
  4. Dashboard Design: Visual representation
  5. Runbook Entry: How to respond when alerts fire