# Observability Design

Define a monitoring, alerting, and logging strategy with explicit SLI/SLO metrics.

## SKILL.md
---
name: "Observability Design"
department: "operator"
description: "Monitoring, alerting, and logging strategy with SLI/SLO definitions"
version: 1
triggers:
  - "monitoring"
  - "alerting"
  - "logging"
  - "observability"
  - "Sentry"
  - "metrics"
  - "tracing"
  - "dashboard"
  - "SLO"

# Observability Design

## Purpose

Design a comprehensive observability strategy covering metrics, logging, tracing, alerting, and SLI/SLO definitions. The output is a monitoring architecture that enables rapid incident detection, diagnosis, and resolution.

## Inputs

- System architecture (services, databases, APIs, third-party dependencies)
- Current monitoring setup (existing tools, dashboards, alerts)
- Reliability requirements (SLA commitments, uptime targets)
- Team structure (on-call rotation, escalation paths)

## Process

### Step 1: Define Observability Pillars

Establish the three pillars for this system, then map them to use cases:

- Metrics: What to measure — request rate, error rate, latency, saturation, business KPIs
- Logs: What to record — request lifecycle, state changes, errors, audit events
- Traces: What to follow — cross-service request flows, database queries, external API calls
- Map each pillar to specific use cases: debugging, alerting, capacity planning, business intelligence

### Step 2: Design Metric Collection

Define the metric taxonomy:

- Application metrics: Request count, error count, latency histograms, queue depth, cache hit rate
- Infrastructure metrics: CPU, memory, disk I/O, network throughput, connection pool utilization
- Business metrics: Sign-ups, conversions, revenue events, feature adoption rates
- Custom instrumentation: Counters (events), gauges (current values), histograms (distributions) (see the sketch after this list)
- Specify metric naming conventions, label/tag strategy, and cardinality limits

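Where custom instrumentation comes up, a minimal sketch using the `prometheus_client` Python library (one SDK option, assumed here) could look like the following; the metric names mirror the Metric Catalog in the Output Format, while the bucket boundaries and handler are illustrative:

```python
# Sketch: counter, gauge, and histogram instrumentation with prometheus_client.
# Names and labels follow the Metric Catalog below; buckets are illustrative
# and should bracket the latency SLO thresholds (e.g., 200ms, 500ms).
import time

from prometheus_client import Counter, Gauge, Histogram

REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests",
    ["method", "path", "status"],  # bounded labels only: no user IDs or UUIDs
)
LATENCY = Histogram(
    "http_request_duration_ms", "Request latency (ms)",
    ["method", "path"],
    buckets=(5, 10, 25, 50, 100, 200, 500, 1000, 2500),
)
QUEUE_DEPTH = Gauge("worker_queue_depth", "Jobs waiting in the worker queue")

def handle_request(method: str, path: str) -> None:
    start = time.monotonic()
    status = 200  # ...real handler work here...
    LATENCY.labels(method, path).observe((time.monotonic() - start) * 1000)
    REQUESTS.labels(method, path, str(status)).inc()
```
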
### Step 3: Define Alert Thresholds and Escalation

Design the alerting strategy:

- Warning alerts: Early indicators — elevated error rate, latency creep, resources approaching limits
- Critical alerts: Immediate action required — service down, error rate spike, SLO burn rate exceeded (see the burn-rate sketch after this list)
- Escalation paths: Primary on-call → secondary → engineering lead → incident commander
- Runbook links: Every alert includes a link to its diagnosis and remediation runbook
- Alert fatigue prevention: Grouping, deduplication, silence windows, alert quality reviews

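For SLO burn-rate alerting, one common scheme (in the style of the multiwindow burn-rate alerts from the Google SRE Workbook) pages on a fast burn and files a ticket on a slow burn. A sketch; the windows and percentages are assumptions to tune per service:

```python
# Sketch: converting SLO burn rate into alert thresholds.
# A burn rate of 1.0 consumes exactly the 30-day error budget in 30 days.
SLO = 0.999
BUDGET = 1 - SLO  # fraction of requests allowed to fail

def burn_rate(error_rate: float) -> float:
    """How fast the error budget is being spent, as a multiple of 'on pace'."""
    return error_rate / BUDGET

# Fast burn (critical, page now): 2% of the budget gone in 1 hour => 14.4x.
FAST_BURN = 0.02 * 30 * 24
# Slow burn (warning, file a ticket): 10% of the budget gone in 3 days => 1.0x.
SLOW_BURN = 0.10 * 30 * 24 / 72

def classify(error_rate_1h: float, error_rate_3d: float) -> str | None:
    if burn_rate(error_rate_1h) >= FAST_BURN:
        return "critical"
    if burn_rate(error_rate_3d) >= SLOW_BURN:
        return "warning"
    return None
```
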
### Step 4: Plan Structured Logging

Design the logging architecture:

- Log levels: DEBUG (development only), INFO (normal operations), WARN (unexpected but handled), ERROR (requires attention)
- Structured fields: timestamp, service, request_id, user_id, action, duration_ms, status
- Correlation IDs: Request ID propagation across services for distributed request tracing (see the sketch after this list)
- PII redaction: Identify sensitive fields, implement automatic redaction/masking
- Log aggregation: Collection, indexing, retention periods, search capabilities

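A minimal sketch of structured JSON logging with only the Python standard library; the field names follow the Logging Schema in the Output Format, while the contextvar-based correlation ID and the service name are illustrative choices:

```python
# Sketch: structured JSON logs via the standard library. Field names mirror
# the Logging Schema; request_id rides a contextvar so every log line in a
# request shares the same correlation ID, including across async tasks.
import contextvars
import json
import logging

request_id_var = contextvars.ContextVar("request_id", default=None)

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": "api",  # assumption: this service's name
            "request_id": request_id_var.get(),
            "message": record.getMessage(),
        }
        # Copy structured extras passed via logging's `extra=` mechanism.
        for key in ("user_id", "action", "duration_ms", "status"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("checkout complete", extra={"user_id": "u-123", "duration_ms": 42})
```
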
### Step 5: Design Distributed Tracing

Plan request flow visibility:

- Span naming conventions: service.operation format, consistent across services
- Context propagation: How trace context passes between services (headers, message metadata)
- Sampling strategy: Head-based vs. tail-based sampling, sampling rate by endpoint or error status
- Trace enrichment: Adding business context (user tier, feature flag state) to spans
- Critical paths: Which request flows must always be traced (payments, auth, data mutations) (see the sketch after this list)

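A sketch of span naming, head-based sampling, and trace enrichment using the OpenTelemetry Python SDK (an assumed toolchain); exporter wiring and cross-service header propagation are omitted:

```python
# Sketch: span naming (service.operation), head-based sampling, and business
# context enrichment with the OpenTelemetry Python SDK.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep 10% of traces at the edge, but honor the parent's sampling decision
# so a sampled request stays fully traced across every downstream hop.
trace.set_tracer_provider(
    TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.10)))
)
tracer = trace.get_tracer("checkout")

def charge_card(user_tier: str) -> None:
    # Payments are a critical path: consider sampling these spans at 100%.
    with tracer.start_as_current_span("checkout.charge_card") as span:
        span.set_attribute("user.tier", user_tier)  # business-context enrichment
        ...  # call the payment provider; exceptions are recorded on the span
```
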
### Step 6: Specify Dashboard Requirements

Define dashboard hierarchy:

- Operational dashboards: Service health overview, real-time traffic, error rates, latency percentiles
- Business dashboards: User activity, feature adoption, conversion funnels, revenue metrics
- SLO dashboards: Error budget remaining, burn rate, SLO compliance history
- Incident dashboards: Pre-built investigation views for common failure modes
- Specify dashboard layout, refresh intervals, time range defaults, and access controls (see the sketch after this list)

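One way to keep dashboard requirements reviewable is to capture the hierarchy as declarative data that a provisioning pipeline could consume; every field name here is an illustrative assumption, not a fixed schema:

```python
# Sketch: dashboard hierarchy as declarative data. Field names are
# illustrative; adapt to your provisioning tool (e.g., Grafana JSON).
DASHBOARDS = [
    {
        "name": "Service Health",
        "audience": "on-call",
        "panels": ["traffic", "error_rate", "latency_p95", "saturation"],
        "refresh": "30s",
        "default_range": "1h",
    },
    {
        "name": "SLO Status",
        "audience": "engineering",
        "panels": ["error_budget_remaining", "burn_rate", "compliance_history"],
        "refresh": "5m",
        "default_range": "30d",
    },
    {
        "name": "Business Metrics",
        "audience": "product",
        "panels": ["adoption", "conversion_funnel", "revenue"],
        "refresh": "1h",
        "default_range": "7d",
    },
]
```
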
### Step 7: Define SLIs/SLOs

Establish reliability targets:

- Availability SLI: Successful requests / total requests (define "successful")
- Latency SLI: Proportion of requests faster than threshold (p50, p95, p99 targets)
- Error rate SLI: Proportion of requests without errors (define "error")
- SLO targets: e.g., 99.9% availability, p95 latency < 200ms, error rate < 0.1%
- Error budgets: Calculate error budget from SLO, define burn rate alerts (fast burn, slow burn) (see the worked example after this list)
- SLO review cadence: Weekly error budget check, monthly SLO review, quarterly target adjustment

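The error budgets in the Output Format table below fall directly out of the SLO targets; a worked example, assuming a 30-day window:

```python
# Worked example: a 30-day error budget derived from the SLO target.
# The results match the SLI/SLO table in the Output Format below.
MINUTES_30D = 30 * 24 * 60  # 43,200 minutes in the window

def budget_minutes(slo: float) -> float:
    return (1 - slo) * MINUTES_30D

print(budget_minutes(0.999))  # ~43.2 -> minutes of downtime allowed per 30d
print(budget_minutes(0.990))  # ~432  -> minutes of slow requests per 30d
```
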
## Output Format

````markdown
# Observability Design: [Service/Feature Name]

## Observability Architecture

```
[Application] → [Metrics Agent] → [Metrics Store] → [Dashboards]
       ↓                                 ↓
[Structured Logs] → [Log Aggregator] → [Log Search]   [Alerts] → [On-call]
       ↓
[Trace SDK] → [Trace Collector] → [Trace UI]
```

## Metric Catalog

| Metric Name | Type | Labels | Description | Alert Threshold |
|-------------|------|--------|-------------|-----------------|
| http_requests_total | counter | method, path, status | Request count | N/A |
| http_request_duration_ms | histogram | method, path | Request latency | p95 > 500ms |
| ... | ... | ... | ... | ... |

## Alert Catalog

| Alert Name | Severity | Condition | Duration | Runbook |
|------------|----------|-----------|----------|---------|
| HighErrorRate | critical | error_rate > 5% | 5m | [link] |
| LatencyDegraded | warning | p95 > 500ms | 10m | [link] |
| ... | ... | ... | ... | ... |

## Logging Schema

```json
{
  "timestamp": "ISO8601",
  "level": "INFO",
  "service": "api",
  "request_id": "uuid",
  "user_id": "string (optional)",
  "action": "string",
  "duration_ms": "number",
  "status": "number",
  "message": "string"
}
```

## SLI/SLO Definitions

| SLI | Measurement | SLO Target | Error Budget (30d) |
|-----|-------------|------------|--------------------|
| Availability | successful requests / total | 99.9% | 43.2 min downtime |
| Latency | requests < 200ms / total | 99.0% | 432 min slow |
| Error Rate | non-error requests / total | 99.9% | 0.1% errors |

## Dashboard Specifications

| Dashboard | Audience | Key Panels | Refresh |
|-----------|----------|------------|---------|
| Service Health | On-call | Traffic, errors, latency, saturation | 30s |
| SLO Status | Engineering | Error budget, burn rate, compliance | 5m |
| Business Metrics | Product | Adoption, conversions, revenue | 1h |
````

## Quality Checks

- [ ] All three observability pillars (metrics, logs, traces) are covered
- [ ] Every alert has a defined severity, threshold, and linked runbook
- [ ] Structured logging schema includes correlation IDs for distributed tracing
- [ ] PII fields are identified with redaction strategy
- [ ] SLIs are measurable and SLO targets are realistic for the service tier
- [ ] Error budgets are calculated with burn rate alert thresholds
- [ ] Dashboard hierarchy covers operational, business, and SLO views
- [ ] Sampling strategy balances trace coverage with storage costs

## Evolution Notes
<!-- Observations appended after each use -->