
Observability Standards

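Use this skill when the user asks about "observability", "logging", "metrics", "tracing", "monitoring", "structured logging", "log format", "log levels", "distributed tracing", "OpenTelemetry", or "health checks", or needs guidance on implementing observability and monitoring requirements.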

SKILL.md
--- frontmatter
name: Observability Standards
description: This skill should be used when the user asks about "observability", "logging", "metrics", "tracing", "monitoring", "structured logging", "log format", "log levels", "distributed tracing", "OpenTelemetry", "health checks", or needs guidance on implementing observability and monitoring requirements.
version: 1.0.0

Observability Standards

Guidance for implementing observability requirements including logging, metrics, tracing, and monitoring configuration.

Tooling

Available Tools: If using Claude Code, the agents:sre-engineer agent specializes in observability setup and SLO management. The agents:devops-engineer agent can help configure monitoring infrastructure.

Logging Requirements

Structured Logging (MUST)

All logging MUST use a structured format (JSON preferred):

json
{
  "timestamp": "2024-01-15T10:30:00.000Z",
  "level": "info",
  "message": "Request processed",
  "service": "api-gateway",
  "trace_id": "abc123",
  "span_id": "def456",
  "duration_ms": 45,
  "status_code": 200
}
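As a rough illustration of how an entry like the one above could be produced, the sketch below uses only Python's standard `logging` module. The `JsonFormatter` class, the `build_logger` helper, and the `extra={"fields": ...}` convention are assumptions made for this example, not part of the standard.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service  # hypothetical service name supplied by the caller

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            # ISO 8601 timestamp in UTC, taken from the record's creation time
            "timestamp": created.isoformat(timespec="milliseconds"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "service": self.service,
        }
        # Optional structured fields passed via extra={"fields": {...}}
        # (a convention assumed for this sketch)
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(service: str) -> logging.Logger:
    """Create a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger("api-gateway")
logger.info("Request processed", extra={"fields": {"duration_ms": 45, "status_code": 200}})
```

The timestamp here ends in `+00:00` rather than `Z`; both are ISO 8601 with an explicit timezone.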

Log Levels (MUST)

Use consistent log levels with defined semantics:

| Level | Purpose | When to Use |
|-------|---------|-------------|
| ERROR | Errors requiring attention | Failures, exceptions |
| WARN | Potential issues | Degraded performance, retries |
| INFO | Significant events | Request completion, state changes |
| DEBUG | Detailed information | Development, troubleshooting |
| TRACE | Very detailed tracing | Deep debugging |
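Not every logging library ships all five levels. Python's standard `logging` module, for instance, has no TRACE level, so one way to map the table onto it is sketched below. The numeric value 5 and the `parse_level` helper are assumptions chosen for this example.

```python
import logging

# Assumed value below DEBUG (10); Python's logging has no built-in TRACE level
TRACE = 5
logging.addLevelName(TRACE, "TRACE")

# Level names from the table mapped to numeric logging levels
LEVELS = {
    "ERROR": logging.ERROR,
    "WARN": logging.WARNING,
    "INFO": logging.INFO,
    "DEBUG": logging.DEBUG,
    "TRACE": TRACE,
}


def parse_level(name: str) -> int:
    """Translate a level name (case-insensitive) into a numeric level."""
    try:
        return LEVELS[name.strip().upper()]
    except KeyError:
        raise ValueError(f"unknown log level: {name!r}") from None
```

A service could then read its level from configuration with something like `logger.setLevel(parse_level("warn"))`.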

Required Log Fields (MUST)

All log entries MUST include:

| Field | Description |
|-------|-------------|
| timestamp | ISO 8601 format with timezone |
| level | Log severity level |
| message | Human-readable description |
| service | Service/application name |
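A check along the following lines could run in tests or CI to confirm that emitted entries carry the required fields. It is a minimal sketch; the `check_log_line` function and its problem messages are hypothetical names introduced here.

```python
import json
from datetime import datetime

REQUIRED_FIELDS = ("timestamp", "level", "message", "service")


def check_log_line(line: str) -> list[str]:
    """Return a list of problems found in one JSON log line (empty if none)."""
    entry = json.loads(line)
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in entry]
    if "timestamp" in entry:
        try:
            # datetime.fromisoformat() before Python 3.11 rejects a trailing "Z"
            text = str(entry["timestamp"]).replace("Z", "+00:00")
            parsed = datetime.fromisoformat(text)
        except ValueError:
            problems.append("timestamp is not ISO 8601")
        else:
            if parsed.tzinfo is None:
                problems.append("timestamp has no timezone")
    return problems
```

Returning a list of problems rather than a boolean makes the failure output directly usable in an assertion message.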

Recommended Log Fields (SHOULD)

Log entries SHOULD include when applicable:

| Field | Description |
|-------|-------------|
| trace_id | Distributed trace identifier |
| span_id | Span identifier |
| user_id | User identifier (if authenticated) |
| request_id | Request correlation ID |
| duration_ms | Operation duration |

Sensitive Data (MUST NOT)

Logs MUST NOT contain:

  • Passwords or secrets
  • API keys or tokens
  • Personally identifiable information (PII)
  • Credit card numbers
  • Session tokens
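One common way to enforce the list above is to mask sensitive-looking keys before an entry is serialized. The sketch below does this by key name only; the `SENSITIVE_KEY` pattern and the `redact` helper are illustrative assumptions, and key matching alone will not catch PII embedded in free-text values.

```python
import re

# Key names treated as sensitive: an illustrative list, not an exhaustive one
SENSITIVE_KEY = re.compile(r"password|secret|token|api_?key|authorization|card", re.IGNORECASE)
REDACTED = "[REDACTED]"


def redact(value):
    """Return a copy of `value` with sensitive-looking keys masked.

    Recurses into nested dicts and lists; other values pass through unchanged.
    """
    if isinstance(value, dict):
        return {
            key: REDACTED if SENSITIVE_KEY.search(str(key)) else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value
```

For example, `redact({"user": "a", "password": "x"})` keeps `user` and replaces the password value with `[REDACTED]`.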

Metrics Requirements

Metric Types (MUST)

Implement appropriate metric types:

| Type | Purpose | Examples |
|------|---------|----------|
| Counter | Cumulative values | Request count, errors |
| Gauge | Current values | Queue size, connections |
| Histogram | Value distributions | Response time, payload size |
| Summary | Quantile calculations | P50, P95, P99 latencies |

Required Metrics (MUST)

Services MUST expose:

| Metric | Type | Description |
|--------|------|-------------|
| requests_total | Counter | Total requests by endpoint/status |
| request_duration_seconds | Histogram | Request latency |
| errors_total | Counter | Error count by type |
| active_connections | Gauge | Current connections |
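To make the counter semantics concrete, here is a small in-memory sketch that tracks `requests_total` by endpoint and status and renders it in the style of the Prometheus text format. The `Counter` class is written for this example only; a real service would normally rely on a metrics client library, and label values are not escaped here.

```python
from collections import defaultdict


class Counter:
    """A monotonically increasing counter keyed by label values."""

    def __init__(self, name: str, label_names: tuple) -> None:
        self.name = name
        self.label_names = label_names
        self._values = defaultdict(float)

    def inc(self, *label_values: str, amount: float = 1.0) -> None:
        """Increase the sample identified by `label_values`."""
        if len(label_values) != len(self.label_names):
            raise ValueError("label count mismatch")
        if amount < 0:
            raise ValueError("counters can only increase")
        self._values[label_values] += amount

    def render(self) -> str:
        """Render all samples, one per line, sorted by label values."""
        lines = [f"# TYPE {self.name} counter"]
        for label_values, value in sorted(self._values.items()):
            labels = ",".join(
                f'{key}="{val}"' for key, val in zip(self.label_names, label_values)
            )
            lines.append(f"{self.name}{{{labels}}} {value:g}")
        return "\n".join(lines)


requests_total = Counter("requests_total", ("endpoint", "status"))
requests_total.inc("/orders", "200")
requests_total.inc("/orders", "500")
print(requests_total.render())
```

Rejecting negative increments is what distinguishes a counter from a gauge such as `active_connections`, which may go up or down.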

Metric Naming (MUST)

Follow naming conventions:

code
# Format: <namespace>_<name>_<unit>
http_requests_total
http_request_duration_seconds
db_connections_active
cache_hits_total
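A naming convention is easier to keep when it is checked automatically. The sketch below validates only the snake_case shape with at least a namespace and a name; it does not enforce a unit suffix, since one of the examples above (`db_connections_active`) carries none. The `METRIC_NAME` pattern and `is_valid_metric_name` function are assumptions for illustration.

```python
import re

# Lowercase snake_case with at least two segments, e.g. "cache_hits_total"
METRIC_NAME = re.compile(r"[a-z][a-z0-9]*(?:_[a-z0-9]+)+")


def is_valid_metric_name(name: str) -> bool:
    """Check that a metric name is snake_case with a namespace and a name."""
    return METRIC_NAME.fullmatch(name) is not None
```

Names such as `HttpRequests`, `requests`, or `http__requests` are rejected by this check.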

Metric Labels (SHOULD)

Use consistent label naming:

| Label | Description |
|-------|-------------|
| service | Service name |
| endpoint | API endpoint |
| method | HTTP method |
| status | Response status |
| error_type | Error classification |

Distributed Tracing

Tracing Implementation (SHOULD)

Implement distributed tracing using OpenTelemetry:

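yaml
# OpenTelemetry Collector configuration
# Every component referenced in a pipeline must also be defined at the top level.
receivers:
  otlp:
    protocols:
      grpc:

processors:
  batch:

exporters:
  otlp:
    endpoint: "collector:4317"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]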

Trace Context (MUST)

When tracing is implemented, propagate context:

| Header | Standard |
|--------|----------|
| traceparent | W3C Trace Context |
| tracestate | W3C Trace Context |
| X-Request-ID | Request correlation |
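In practice an OpenTelemetry SDK handles propagation, but the mechanics can be sketched by hand. The example below parses a version `00` `traceparent` header (`version-traceid-parentid-flags`) and builds headers for a downstream call. The function names are hypothetical, and incoming header names are assumed to be lowercased already.

```python
import re
import secrets

# Version "00": 32-hex trace-id, 16-hex parent-id, 2-hex flags
TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header: str):
    """Return (trace_id, parent_id, flags), or None if the header is invalid."""
    match = TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero identifiers are invalid in W3C Trace Context
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id, parent_id, flags


def outgoing_headers(incoming: dict) -> dict:
    """Build headers for a downstream call, continuing the trace if one exists."""
    parsed = parse_traceparent(incoming.get("traceparent", ""))
    if parsed:
        trace_id, _, flags = parsed
    else:
        # No valid context: start a new trace, marked as sampled
        trace_id, flags = secrets.token_hex(16), "01"
    # Each hop gets a fresh span id in the parent-id position
    headers = {"traceparent": f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"}
    # tracestate is only forwarded alongside a valid traceparent
    if parsed and "tracestate" in incoming:
        headers["tracestate"] = incoming["tracestate"]
    # Reuse the caller's request id, or mint one if absent
    headers["x-request-id"] = incoming.get("x-request-id") or secrets.token_hex(8)
    return headers
```

The trace id stays constant across services while the span id changes at every hop, which is what lets a backend reassemble the call tree.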

Span Requirements (SHOULD)

Spans SHOULD include:

  • Operation name
  • Start/end timestamps
  • Status (OK, ERROR)
  • Relevant attributes
  • Error details (if applicable)
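The span fields listed above can be pictured with a small stand-in written in plain Python. This is not the OpenTelemetry API; the `Span` dataclass and the `span` context manager are hypothetical names used only to show which data a span carries and when each field is set.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str                                       # operation name
    attributes: dict = field(default_factory=dict)  # relevant attributes
    start_ns: int = 0                               # start timestamp (ns since epoch)
    end_ns: int = 0                                 # end timestamp (ns since epoch)
    status: str = "OK"                              # "OK" or "ERROR"
    error: Optional[str] = None                     # error details, if any


@contextmanager
def span(name: str, **attributes):
    """Record timing and outcome for the wrapped block."""
    current = Span(name=name, attributes=dict(attributes), start_ns=time.time_ns())
    try:
        yield current
    except Exception as exc:
        current.status = "ERROR"
        current.error = f"{type(exc).__name__}: {exc}"
        raise  # the caller still sees the original exception
    finally:
        current.end_ns = time.time_ns()


with span("db.query", table="orders") as query_span:
    pass  # the traced operation would run here
```

Setting the end timestamp in `finally` ensures failed operations are timed as well as successful ones.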

Health Checks

Health Endpoints (MUST)

Services MUST expose health endpoints:

| Endpoint | Purpose | Response |
|----------|---------|----------|
| /health | Basic health | 200 OK or 503 |
| /health/live | Liveness probe | 200 if running |
| /health/ready | Readiness probe | 200 if ready to serve |

Health Response Format (MUST)

json
{
  "status": "healthy",
  "checks": {
    "database": {
      "status": "healthy",
      "latency_ms": 5
    },
    "cache": {
      "status": "healthy",
      "latency_ms": 1
    }
  },
  "version": "1.2.3",
  "uptime_seconds": 3600
}
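A response of this shape could be assembled from individual dependency checks roughly as follows. The `run_checks` function is a hypothetical helper, each check is assumed to signal failure by raising, and the `"unhealthy"` value is an assumption since the example above only shows `"healthy"`.

```python
import time
from typing import Callable


def run_checks(checks: dict, version: str, started_at: float) -> tuple:
    """Run each check and return (http_status, response_body).

    `checks` maps a dependency name to a zero-argument callable.
    `started_at` is a time.monotonic() reading taken at service start.
    """
    results = {}
    for name, check in checks.items():
        begin = time.monotonic()
        try:
            check()
            status = "healthy"
        except Exception:
            status = "unhealthy"
        latency_ms = round((time.monotonic() - begin) * 1000)
        results[name] = {"status": status, "latency_ms": latency_ms}

    healthy = all(result["status"] == "healthy" for result in results.values())
    body = {
        "status": "healthy" if healthy else "unhealthy",
        "checks": results,
        "version": version,
        "uptime_seconds": int(time.monotonic() - started_at),
    }
    # 200 when every dependency passes, 503 otherwise
    return (200 if healthy else 503), body
```

A web handler for `/health` would call `run_checks` and return the body as JSON with the given status code.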

Dependency Checks (SHOULD)

Health checks SHOULD verify:

  • Database connectivity
  • Cache availability
  • External service reachability
  • Disk space adequacy
  • Memory availability

Alerting

Alert Configuration (MUST)

Define alerts for critical conditions:

| Condition | Severity | Response |
|-----------|----------|----------|
| Service down | Critical | Immediate page |
| Error rate > 5% | High | Page within 5 min |
| Latency P95 > 1s | Medium | Notify team |
| Disk > 80% | Warning | Create ticket |
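The thresholds in the table can be read as predicates over a snapshot of service state. The sketch below encodes them that way; real deployments would usually express these as rules in the monitoring system, and the `Snapshot` fields and `firing_alerts` function are names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    service_up: bool
    error_rate: float     # fraction of failed requests, 0.0 to 1.0
    latency_p95_s: float  # 95th percentile latency in seconds
    disk_used: float      # fraction of disk in use, 0.0 to 1.0


# (severity, response, condition), in the same order as the table
RULES = [
    ("Critical", "Immediate page", lambda s: not s.service_up),
    ("High", "Page within 5 min", lambda s: s.error_rate > 0.05),
    ("Medium", "Notify team", lambda s: s.latency_p95_s > 1.0),
    ("Warning", "Create ticket", lambda s: s.disk_used > 0.80),
]


def firing_alerts(snapshot: Snapshot) -> list:
    """Return (severity, response) for every rule whose condition holds."""
    return [(severity, response) for severity, response, condition in RULES if condition(snapshot)]
```

For instance, a snapshot with a 6% error rate and a 1.5 s P95 latency fires the High and Medium rules but not the others.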

Alert Requirements (MUST)

Alerts MUST include:

  • Clear description of condition
  • Severity level
  • Runbook link
  • Affected service/component

Implementation Checklist

  • Configure structured logging
  • Define log level policies
  • Implement required metrics
  • Set up health endpoints
  • Configure distributed tracing (if applicable)
  • Define alerting rules
  • Create runbooks for alerts
  • Verify sensitive data exclusion

Compliance Verification

bash
# Verify structured log output
app_command 2>&1 | jq .

# Check health endpoint
curl -s http://localhost:8080/health | jq .

# Verify metrics endpoint
curl -s http://localhost:8080/metrics | grep -E "^(http_|app_)"

# Check for sensitive data in logs
grep -r -i "password\|secret\|token" logs/ | wc -l
# Should be 0

Language-Specific Logging

Rust (tracing)

rust
use tracing::{info, instrument};

#[instrument]
fn process_request(id: &str) {
    info!(request_id = %id, "Processing request");
}

TypeScript (pino)

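typescript
import pino from "pino";

const logger = pino({
  level: "info",
  formatters: {
    // Emit the level as a label ("info") instead of pino's numeric default
    level: (label) => ({ level: label }),
  },
});

// Example values; in a real handler these come from the request context
const requestId = "req-123";
const duration = 45;

logger.info({ requestId, duration }, "Request processed");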

Python (structlog)

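python
import structlog
from structlog.processors import JSONRenderer, TimeStamper, add_log_level

# Emit JSON with "level" and ISO 8601 "timestamp" fields
structlog.configure(processors=[add_log_level, TimeStamper(fmt="iso"), JSONRenderer()])
logger = structlog.get_logger()
request_id, duration = "req-123", 0.045  # example values
logger.info("request_processed", request_id=request_id, duration=duration)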

Additional Resources

Reference Files

  • references/logging-config.md - Logging configuration guide
  • references/metrics-guide.md - Metrics implementation patterns

Examples

  • examples/otel-config.yaml - OpenTelemetry configuration
  • examples/alerting-rules.yaml - Prometheus alerting rules