Monitoring & Observability Skill

Purpose

Standards for monitoring, metrics, alerting, and observability.

Auto-Invoke Triggers

  • Setting up monitoring infrastructure
  • Defining metrics and KPIs
  • Configuring alerts
  • Implementing distributed tracing

Three Pillars of Observability

| Pillar | Purpose | Tools |
|---|---|---|
| Logs | Event records | ELK, Loki, CloudWatch |
| Metrics | Numerical measurements | Prometheus, Datadog |
| Traces | Request flow | Jaeger, Zipkin, X-Ray |

Key Metrics (Golden Signals)

The Four Golden Signals

| Signal | Description | Example Metric |
|---|---|---|
| Latency | Response time | p50, p95, p99 latency |
| Traffic | Request volume | Requests per second |
| Errors | Failure rate | Error percentage |
| Saturation | Resource usage | CPU, memory utilization |

RED Method (Request-focused)

  • Rate - Requests per second
  • Errors - Failed requests per second
  • Duration - Request latency
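
The three RED values can be derived from raw request records for a fixed observation window. The sketch below is illustrative only: the `Request` record, the `red_metrics` helper, and the nearest-rank percentile are assumptions made for this example, not part of any monitoring library.

```python
from dataclasses import dataclass
import math


@dataclass
class Request:
    duration_s: float  # how long the request took, in seconds
    failed: bool       # True if the request counted as an error


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def red_metrics(requests, window_s):
    """Summarize one observation window using the RED method."""
    durations = [r.duration_s for r in requests]
    failures = sum(1 for r in requests if r.failed)
    return {
        "rate_rps": len(requests) / window_s,         # Rate
        "errors_rps": failures / window_s,            # Errors
        "duration_p50_s": percentile(durations, 50),  # Duration (median)
        "duration_p99_s": percentile(durations, 99),  # Duration (tail)
    }


# Hypothetical sample: 10 requests over a 5-second window, one of them failed.
sample = [Request(0.1 * i, failed=(i == 10)) for i in range(1, 11)]
summary = red_metrics(sample, window_s=5)
```

In production these values usually come from a metrics backend rather than from in-process lists; the point here is only how each letter of RED maps to a calculation.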

USE Method (Resource-focused)

  • Utilization - Resource % used
  • Saturation - Queue depth
  • Errors - Error count

Application Metrics

HTTP Endpoints

| Metric | Type | Description |
|---|---|---|
| http_requests_total | Counter | Total requests |
| http_request_duration_seconds | Histogram | Request latency |
| http_requests_in_flight | Gauge | Active requests |
| http_response_size_bytes | Histogram | Response size |

Database

| Metric | Type | Description |
|---|---|---|
| db_connections_active | Gauge | Active connections |
| db_query_duration_seconds | Histogram | Query time |
| db_errors_total | Counter | Query errors |

Business Metrics

| Metric | Type | Description |
|---|---|---|
| users_registered_total | Counter | New registrations |
| orders_created_total | Counter | Orders placed |
| payment_amount_total | Counter | Revenue |

Metric Types

| Type | Use Case | Example |
|---|---|---|
| Counter | Cumulative totals | Requests, errors |
| Gauge | Current value | Temperature, queue size |
| Histogram | Value distribution | Latency buckets |
| Summary | Quantiles | p50, p95, p99 |
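
To make the difference between the types concrete, here is a minimal in-memory sketch of a counter, a gauge, and a histogram. These classes are hypothetical teaching aids and deliberately not the API of any real client library; Summary is left out because streaming quantile estimation needs more machinery than fits here.

```python
import bisect


class Counter:
    """Cumulative total: only ever goes up."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount


class Gauge:
    """Current value: can go up or down."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations into buckets defined by upper bounds."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One extra slot catches observations above the largest bound.
        self.bucket_counts = [0] * (len(self.upper_bounds) + 1)
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # bisect_left returns the first bucket whose upper bound is >= value.
        index = bisect.bisect_left(self.upper_bounds, value)
        self.bucket_counts[index] += 1
        self.sum += value
        self.count += 1


requests_total = Counter()
queue_size = Gauge()
latency_seconds = Histogram([0.1, 0.5, 1.0])

requests_total.inc()
requests_total.inc()
queue_size.set(7)
for seconds in (0.05, 0.1, 0.3, 2.0):
    latency_seconds.observe(seconds)
```

A counter suits anything you want to take a rate of, a gauge suits values that are meaningful at a single instant, and a histogram lets the backend estimate percentiles from bucket counts.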

Naming Convention

code
{namespace}_{subsystem}_{name}_{unit}

http_server_request_duration_seconds
db_pool_connections_active
app_users_registered_total
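
A small helper can enforce this pattern when metrics are registered. The sketch below assumes lowercase snake_case segments, as in the examples above, and treats the unit as optional; `metric_name` is a hypothetical helper, not a library function.

```python
import re

# Each segment: lowercase snake_case, starting with a letter (assumed rule).
_SEGMENT = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def metric_name(namespace, subsystem, name, unit=""):
    """Join the parts as {namespace}_{subsystem}_{name}_{unit}, validating each one."""
    parts = [namespace, subsystem, name]
    if unit:
        parts.append(unit)
    for part in parts:
        if not _SEGMENT.match(part):
            raise ValueError(f"invalid metric name segment: {part!r}")
    return "_".join(parts)


duration_metric = metric_name("http", "server", "request_duration", "seconds")
pool_metric = metric_name("db", "pool", "connections_active")
```

Rejecting bad names at registration time is cheaper than renaming a metric after dashboards and alerts already depend on it.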

Alerting Strategy

Alert Severity

| Severity | Response Time | Action |
|---|---|---|
| Critical | Immediate | Page on-call |
| Warning | Within hours | Create ticket |
| Info | Next business day | Review |

Alert Rules

  • Alert on symptoms, not causes
  • Include runbook links
  • Set appropriate thresholds
  • Avoid alert fatigue
  • Group related alerts

What to Alert On

| Alert | Condition | Severity |
|---|---|---|
| Service down | Health check fails | Critical |
| High error rate | > 5% errors | Critical |
| High latency | p99 > 2s | Warning |
| High CPU | > 80% for 5 min | Warning |
| Disk space | < 20% free | Warning |
| SSL expiry | < 30 days | Warning |
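
The conditions in this table can be expressed as data and evaluated against a metrics snapshot. The following is a sketch under stated assumptions: the snapshot keys (`health_check_ok`, `error_rate`, `cpu_above_80pct_minutes`, and so on) are invented for this example, and a real setup would normally express these rules in the alerting system's own rule language.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class AlertRule:
    name: str
    severity: str
    is_firing: Callable[[Dict[str, float]], bool]


# Thresholds mirror the table; metric key names are hypothetical.
RULES = [
    AlertRule("Service down", "critical", lambda m: m["health_check_ok"] == 0),
    AlertRule("High error rate", "critical", lambda m: m["error_rate"] > 0.05),
    AlertRule("High latency", "warning", lambda m: m["latency_p99_s"] > 2.0),
    AlertRule("High CPU", "warning", lambda m: m["cpu_above_80pct_minutes"] >= 5),
    AlertRule("Disk space", "warning", lambda m: m["disk_free_fraction"] < 0.20),
    AlertRule("SSL expiry", "warning", lambda m: m["ssl_days_remaining"] < 30),
]


def firing_alerts(metrics):
    """Return (severity, name) for every rule whose condition holds."""
    return [(rule.severity, rule.name) for rule in RULES if rule.is_firing(metrics)]


# Hypothetical snapshot: elevated errors and low disk space, everything else fine.
snapshot = {
    "health_check_ok": 1,
    "error_rate": 0.08,
    "latency_p99_s": 1.2,
    "cpu_above_80pct_minutes": 2,
    "disk_free_fraction": 0.15,
    "ssl_days_remaining": 90,
}
alerts = firing_alerts(snapshot)
```

Keeping severity next to the condition makes it easy to route critical alerts to paging and warnings to a ticket queue.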

SLOs and SLIs

Service Level Indicators (SLIs)

  • Availability: Successful requests / Total requests
  • Latency: % requests < threshold
  • Throughput: Requests per second
  • Error rate: Failed requests / Total requests
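
Each SLI above is a simple ratio over counts collected in one window. A minimal sketch, assuming the counts are already available; the function and argument names are hypothetical:

```python
def compute_slis(total, failed, under_latency_threshold, window_s):
    """Compute the four SLIs from request counts for one window."""
    if total <= 0 or window_s <= 0:
        raise ValueError("need at least one request and a positive window")
    return {
        "availability": (total - failed) / total,    # successful / total
        "latency": under_latency_threshold / total,  # share of requests under the threshold
        "throughput_rps": total / window_s,          # requests per second
        "error_rate": failed / total,                # failed / total
    }


# Hypothetical window: 10,000 requests in 100 s, 5 failures, 9,900 under the latency threshold.
slis = compute_slis(total=10_000, failed=5, under_latency_threshold=9_900, window_s=100)
```

Note that availability and error rate are complements when both use the same definition of a failed request.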

Service Level Objectives (SLOs)

| SLO | Target | Error Budget |
|---|---|---|
| Availability | 99.9% | 43.8 min/month |
| Latency (p99) | < 500ms | - |
| Error rate | < 0.1% | - |

Error Budget

  • Monthly allowed downtime: (1 - SLO target) × minutes in the month
  • Spend on risky deployments
  • Freeze deploys when exhausted
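
The 43.8 min/month figure for a 99.9% availability target follows from multiplying the allowed failure fraction by the length of an average month. The sketch below works through that arithmetic and the freeze rule; the function names and the choice of an average month (365.25 / 12 days) are assumptions of this example.

```python
# Average month: 365.25 / 12 days = 30.4375 days = 43,830 minutes.
MINUTES_PER_MONTH = 365.25 / 12 * 24 * 60


def error_budget_minutes(slo_target, period_minutes=MINUTES_PER_MONTH):
    """Allowed downtime for the period, e.g. slo_target=0.999 for 99.9%."""
    return (1 - slo_target) * period_minutes


def budget_remaining_minutes(slo_target, downtime_minutes, period_minutes=MINUTES_PER_MONTH):
    """Budget left after subtracting downtime already incurred this period."""
    return error_budget_minutes(slo_target, period_minutes) - downtime_minutes


def deploys_frozen(slo_target, downtime_minutes, period_minutes=MINUTES_PER_MONTH):
    """Freeze deploys once the budget is exhausted."""
    return budget_remaining_minutes(slo_target, downtime_minutes, period_minutes) <= 0


monthly_budget = round(error_budget_minutes(0.999), 1)  # 43.8
```

With 30 minutes of downtime already spent, about 13.8 minutes remain and deploys continue; at 45 minutes the budget is gone and the freeze applies.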

Distributed Tracing

Concepts

| Term | Description |
|---|---|
| Trace | End-to-end request journey |
| Span | Single operation in trace |
| Context | Trace ID propagated across services |

What to Trace

  • Cross-service calls
  • Database queries
  • External API calls
  • Message queue operations

Trace Propagation

  • Pass trace context in headers
  • Standard: W3C Trace Context
  • Headers: traceparent, tracestate
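
A `traceparent` value in W3C Trace Context version 00 has four hyphen-separated hex fields: version, a 16-byte trace ID, an 8-byte parent (span) ID, and trace flags. The sketch below generates, parses, and propagates such a header using only the standard library. The helper names are hypothetical, `tracestate` handling is omitted, and headers from future versions are rejected for simplicity; in practice a tracing SDK normally does this for you.

```python
import re
import secrets

# version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex), all lowercase.
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace with random IDs."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header):
    """Return the header fields as a dict, or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "flags": flags,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(incoming_header):
    """Header for an outgoing call: same trace ID, new span ID, same flags."""
    parent = parse_traceparent(incoming_header)
    if parent is None:
        return new_traceparent()  # invalid or missing context: start a fresh trace
    return f"00-{parent['trace_id']}-{secrets.token_hex(8)}-{parent['flags']}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
```

The trace ID stays constant across every service a request touches, while each hop substitutes its own span ID as the parent for the next call.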

Health Checks

Endpoint Types

| Endpoint | Purpose | Response |
|---|---|---|
| /health | Basic liveness | 200 OK |
| /health/ready | Full readiness | 200 + deps status |
| /health/live | Process alive | 200 OK |

Readiness Check Components

  • Database connection
  • Cache connection
  • External service connectivity
  • Required configuration present

Health Response Format

json
{
  "status": "healthy",
  "checks": {
    "database": "healthy",
    "cache": "healthy",
    "external-api": "degraded"
  },
  "version": "1.2.3"
}
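
A readiness handler can build this response by running each dependency check and rolling the results up. The sketch below is one possible approach, with assumptions stated plainly: checks are hypothetical zero-argument callables returning "healthy", "degraded", or "unhealthy"; a check that raises is treated as "unhealthy"; and the overall status stays "healthy" unless some check is "unhealthy", which matches the sample above where a degraded dependency does not change the top-level status. Returning 503 for an unhealthy service is also an assumption.

```python
import json


def build_health_response(checks, version):
    """Run each check and return (http_status, body) in the format shown above."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = check()
        except Exception:
            # A check that cannot even run counts as unhealthy.
            results[name] = "unhealthy"

    any_unhealthy = any(status == "unhealthy" for status in results.values())
    overall = "unhealthy" if any_unhealthy else "healthy"
    http_status = 503 if any_unhealthy else 200
    return http_status, {"status": overall, "checks": results, "version": version}


# Hypothetical checks standing in for real connectivity probes.
status_code, body = build_health_response(
    {
        "database": lambda: "healthy",
        "cache": lambda: "healthy",
        "external-api": lambda: "degraded",
    },
    version="1.2.3",
)
payload = json.dumps(body, indent=2)
```

Teams that want load balancers to drain degraded instances could tighten the rollup so that any degraded check changes the overall status.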

Dashboard Design

Layout Principles

  • Most important metrics at top
  • Group related metrics
  • Use consistent time ranges
  • Include context (deploy markers)

Standard Panels

  1. Overview - Traffic, errors, latency
  2. Resources - CPU, memory, disk
  3. Dependencies - DB, cache, external APIs
  4. Business - Domain-specific metrics

Best Practices

  • Link to runbooks
  • Add annotations for deploys
  • Use consistent colors
  • Set reasonable refresh rates

Runbooks

Structure

  1. Alert description - What triggered
  2. Impact - User/business effect
  3. Diagnosis steps - How to investigate
  4. Resolution steps - How to fix
  5. Escalation - Who to contact

Required Runbooks

  • Service restart procedure
  • Database failover
  • Rollback deployment
  • Scale up/down
  • Incident communication

Best Practices

DO

  • Monitor the four golden signals
  • Set up alerts before incidents
  • Create runbooks for alerts
  • Use distributed tracing
  • Track SLOs and error budgets
  • Review dashboards regularly

DON'T

  • Alert on every metric
  • Ignore alert fatigue
  • Skip health checks
  • Forget to trace async operations
  • Set unrealistic SLOs
  • Neglect runbook maintenance

Observability Checklist

Metrics

  • Golden signals tracked
  • Custom business metrics
  • Resource utilization metrics
  • Dependency health metrics

Alerting

  • Critical alerts defined
  • Severity levels set
  • Runbooks linked
  • On-call rotation configured

Tracing

  • Trace context propagated
  • Key operations traced
  • Trace sampling configured

Dashboards

  • Service overview dashboard
  • Dependency dashboard
  • Business metrics dashboard
  • Deploy annotations enabled