Observability
Stack-agnostic patterns for building observable systems through logging, metrics, and tracing. This skill defines concepts and patterns, not specific library implementations.
Design Principles
- •Concepts over libraries: Teach patterns that work across any stack
- •Discover project tools: Check what observability tools the project uses
- •Research when implementing: Use WebSearch for current library recommendations
- •Follow project conventions: Match existing logging/metrics patterns
Three Pillars of Observability
1. Logging
Structured, contextual logs for debugging and auditing.
2. Metrics
Numerical measurements for monitoring and alerting.
3. Tracing
Distributed request tracking across services.
Structured Logging
Log Format Principles
A well-structured log entry should include:
{
"timestamp": "ISO 8601 format",
"level": "DEBUG/INFO/WARN/ERROR/FATAL",
"service": "service identifier",
"trace_id": "correlation ID for request tracking",
"message": "human-readable description",
"context": {
"relevant": "contextual data"
}
}
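As a rough illustration of the format above, the sketch below emits one JSON object per log line using only Python's standard `logging` module. The names `JsonFormatter` and `build_logger`, and the `checkout` service name, are hypothetical and not part of any project convention. Note that Python names its level `WARNING` rather than `WARN`.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service  # hypothetical service identifier

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service,
            # trace_id and context arrive through `extra=`; fall back when absent
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
            "context": getattr(record, "context", {}),
        }
        return json.dumps(entry)


def build_logger(service: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger("checkout")
logger.info("order placed", extra={"trace_id": "abc123", "context": {"order_id": 42}})
```

Passing `trace_id` and `context` through `extra=` keeps the message string constant, which makes log lines easier to group and search.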
Log Levels
| Level | Use Case | Example |
|---|---|---|
| DEBUG | Development details | Query params, cache hits |
| INFO | Normal operations | User created, request completed |
| WARN | Potential issues | Retry needed, deprecated API used |
| ERROR | Operation failed | Database connection failed |
| FATAL | Application crash | Unrecoverable error |
Best Practices
DO:
- •Use structured logging (JSON or similar)
- •Include correlation/trace IDs
- •Log at service boundaries
- •Include relevant context
- •Use consistent field names across services
DON'T:
- •Log sensitive data (passwords, tokens, PII)
- •Log at high frequency in loops
- •Use string concatenation for log messages
- •Log entire request/response bodies
- •Use print statements in production
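One way to enforce the sensitive-data rule is to pass every context object through a single redaction step before it reaches the logger. The sketch below assumes a hypothetical deny-list, `SENSITIVE_KEYS`; a real project would define its own list and may also need value-based detection.

```python
from typing import Any

# Hypothetical deny-list; adjust to the project's own definition of sensitive data.
SENSITIVE_KEYS = {"password", "token", "authorization", "ssn", "email"}


def redact(value: Any) -> Any:
    """Return a copy of `value` with sensitive dict keys masked."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value
```

Because `redact` builds a new structure, the caller's original data is left unchanged.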
Implementation
When implementing logging:
- •Discover existing patterns: Check how the project currently logs
- •Research current libraries:
WebSearch: "[language] structured logging library [year]"
- •Follow project conventions: Match existing log format and style
Metrics
Metric Types
| Type | Use Case | Example |
|---|---|---|
| Counter | Cumulative values (only increase) | Request count, errors |
| Gauge | Point-in-time values (can go up/down) | Active connections, queue size |
| Histogram | Distribution of values | Request latency, response size |
| Summary | Pre-calculated quantiles | p50, p99 latency |
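To make the differences between the types concrete, here is a minimal in-memory sketch of a counter, a gauge, and a histogram. It is not any real metrics library's API; the class and method names are illustrative assumptions, and the summary type is left out because it needs a quantile estimator.

```python
from bisect import bisect_left


class Counter:
    """Cumulative value that only increases."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount


class Gauge:
    """Point-in-time value that can go up or down."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


class Histogram:
    """Counts observations per upper-bound bucket, plus one overflow bucket."""

    def __init__(self, bounds: list[float]) -> None:
        self.bounds = sorted(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot: above every bound
        self.total = 0.0

    def observe(self, value: float) -> None:
        # bisect_left finds the first bound >= value, i.e. the bucket "value <= bound"
        self.counts[bisect_left(self.bounds, value)] += 1
        self.total += value
```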
Naming Conventions
Format: <namespace>_<name>_<unit>
Good examples:
- http_requests_total
- http_request_duration_seconds
- database_connections_active
- queue_messages_waiting
Bad examples:
- requests (no namespace, no unit)
- httpRequestDuration (camelCase, inconsistent)
- request-latency (hyphens, no unit)
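A convention like this can be checked mechanically, for example in a unit test or lint step. The sketch below assumes the rule "lowercase snake_case with at least two segments"; the function name is hypothetical, and the check cannot tell whether the final segment is actually a unit.

```python
import re

# Lowercase snake_case with at least two segments (a namespace plus a name).
METRIC_NAME = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)+")


def is_valid_metric_name(name: str) -> bool:
    """Return True when `name` matches the snake_case convention."""
    return METRIC_NAME.fullmatch(name) is not None
```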
Key Metrics Frameworks
RED Method (Request-oriented):
- •Rate: Requests per second
- •Errors: Error rate
- •Duration: Latency percentiles
USE Method (Resource-oriented):
- •Utilization: Percentage of time the resource is busy
- •Saturation: Queue length/backlog
- •Errors: Error count
Four Golden Signals:
- •Latency: Time to serve requests
- •Traffic: Demand on system
- •Errors: Rate of failed requests
- •Saturation: How "full" the system is
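As an illustration of the RED method, the sketch below derives rate, error ratio, and latency percentiles from a batch of request records collected over one window. The `Request` record and the function names are assumptions made for this example; production systems usually compute these from histograms in the metrics backend rather than from raw records.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float
    ok: bool


def percentile(sorted_values: list[float], p: int) -> float:
    """Nearest-rank percentile of an ascending, non-empty list."""
    rank = max(1, -(-p * len(sorted_values) // 100))  # ceil(p * n / 100)
    return sorted_values[rank - 1]


def red_summary(requests: list[Request], window_s: float) -> dict[str, float]:
    """Rate, Errors, and Duration for the requests seen in one window."""
    if not requests:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p50_s": 0.0, "p99_s": 0.0}
    durations = sorted(r.duration_s for r in requests)
    failed = sum(1 for r in requests if not r.ok)
    return {
        "rate_per_s": len(requests) / window_s,
        "error_ratio": failed / len(requests),
        "p50_s": percentile(durations, 50),
        "p99_s": percentile(durations, 99),
    }
```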
Implementation
When implementing metrics:
- •Identify what to measure: Use RED/USE/Golden Signals as guide
- •Discover existing setup: Check if project has metrics infrastructure
- •Research current tools:
WebSearch: "[language] metrics library [year]"
WebSearch: "metrics collection [your infrastructure] [year]"
Distributed Tracing
Trace Concepts
Trace (entire request journey)
├── Span A: API Gateway (parent)
│   ├── Span B: Auth Service
│   └── Span C: User Service
│       └── Span D: Database Query
- •Trace: End-to-end request journey across services
- •Span: Single operation within a trace
- •Context: Propagated trace/span IDs
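The relationship between traces and spans can be modeled with a small data structure: every span carries the shared trace ID plus the ID of its parent. This is a simplified sketch with hypothetical names (`Span`, `start_trace`), not a tracing library's API; it omits timing, attributes, and export.

```python
import secrets
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    trace_id: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))

    def child(self, name: str) -> "Span":
        """Start a child span that shares this span's trace."""
        return Span(name=name, trace_id=self.trace_id, parent_id=self.span_id)


def start_trace(name: str) -> Span:
    """Create a root span with a fresh 16-byte trace ID."""
    return Span(name=name, trace_id=secrets.token_hex(16))


# Rebuild the example tree: gateway -> (auth, user -> database query)
gateway = start_trace("API Gateway")
auth = gateway.child("Auth Service")
user = gateway.child("User Service")
query = user.child("Database Query")
```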
Trace Context Propagation
Standard headers for context propagation:
| Standard | Description |
|---|---|
| W3C Trace Context | Modern standard (traceparent, tracestate) |
| B3 | Zipkin format (X-B3-* headers) |
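For W3C Trace Context, the `traceparent` header has four dash-separated hex fields: version, trace ID, parent (span) ID, and flags. The sketch below parses and formats version `00` headers only; the spec's rules for future versions and for `tracestate` are not handled, and the function names are assumptions for illustration.

```python
import re
from typing import NamedTuple, Optional

# version 00 - 32 hex trace ID - 16 hex parent ID - 2 hex flags
TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a version-00 traceparent header, or return None if invalid."""
    match = TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def format_traceparent(ctx: TraceContext) -> str:
    """Build the header value to send on an outgoing request."""
    return f"00-{ctx.trace_id}-{ctx.parent_id}-{'01' if ctx.sampled else '00'}"
```

A service would typically parse the incoming header, create its own span ID, and send a new header with the same trace ID and its span ID as the parent.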
Best Practices
- •Propagate trace context across all service boundaries
- •Include trace IDs in logs for correlation
- •Sample traces in high-traffic environments
- •Add meaningful span names and attributes
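Sampling works best when every service reaches the same decision for a given trace, so that traces are kept or dropped whole. One possible approach, sketched below, derives the decision from the trace ID itself. It assumes trace IDs are uniformly random hex strings of at least 16 characters; `should_sample` is a hypothetical helper, not a library function.

```python
def should_sample(trace_id: str, rate: float) -> bool:
    """Keep roughly `rate` (0.0 to 1.0) of traces, deciding from the ID alone."""
    # Interpret the low 8 bytes of the hex trace ID as an integer in [0, 2**64)
    bucket = int(trace_id[-16:], 16)
    return bucket < rate * 2**64
```

Because the function has no state and no randomness of its own, two services given the same trace ID and rate always agree.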
Implementation
When implementing tracing:
- •Check existing setup: Does project already have tracing?
- •Research current standards:
WebSearch: "distributed tracing [language] [year]"
WebSearch: "[tracing platform] integration guide"
Health Checks
Endpoint Design
// GET /health
{
"status": "healthy|degraded|unhealthy",
"version": "app version",
"uptime_seconds": 3600,
"checks": {
"database": {
"status": "healthy",
"latency_ms": 5
},
"cache": {
"status": "healthy",
"latency_ms": 2
},
"external_api": {
"status": "degraded",
"message": "High latency"
}
}
}
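The top-level `status` has to be derived from the individual checks somehow. The sketch below assumes the simplest policy, reporting the worst status of any dependency; a project may instead treat some dependencies as optional. The names `overall_status` and `health_response` are hypothetical.

```python
import time

# Higher number = worse; keys mirror the status values in the response above.
SEVERITY = {"healthy": 0, "degraded": 1, "unhealthy": 2}
STARTED_AT = time.monotonic()


def overall_status(checks: dict[str, dict]) -> str:
    """Return the worst status reported by any dependency check."""
    statuses = [check["status"] for check in checks.values()]
    return max(statuses, key=lambda s: SEVERITY[s], default="healthy")


def health_response(version: str, checks: dict[str, dict]) -> dict:
    """Assemble the body returned by GET /health."""
    return {
        "status": overall_status(checks),
        "version": version,
        "uptime_seconds": int(time.monotonic() - STARTED_AT),
        "checks": checks,
    }
```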
Kubernetes Health Checks
| Check | Purpose | Failure Action |
|---|---|---|
| Liveness | Is app running? | Restart container |
| Readiness | Can handle requests? | Remove from load balancer |
| Startup | Has app started? | Restart container; liveness and readiness checks wait until it passes |
Alerting
Alert Design Principles
Good alerts include:
- •Clear, actionable name
- •Threshold with duration (avoid flapping)
- •Severity level
- •Link to runbook
- •Relevant labels/context
Avoid:
- •Flapping alerts (too sensitive thresholds)
- •Alerts that name a symptom but give no lead toward the root cause
- •Too many alerts (alert fatigue)
- •Alerts without runbooks
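The "threshold with duration" principle can be sketched as a small state machine: the alert fires only after the value has stayed above the threshold for a minimum time, and any recovery resets the timer. The class name and the example numbers are assumptions for illustration.

```python
from typing import Optional


class DurationAlert:
    """Fire only after the value stays above `threshold` for `for_seconds`."""

    def __init__(self, threshold: float, for_seconds: float) -> None:
        self.threshold = threshold
        self.for_seconds = for_seconds
        self._breach_started: Optional[float] = None

    def evaluate(self, value: float, now: float) -> bool:
        """Record one sample taken at time `now` and report whether to fire."""
        if value <= self.threshold:
            self._breach_started = None  # recovery resets the timer
            return False
        if self._breach_started is None:
            self._breach_started = now
        return now - self._breach_started >= self.for_seconds
```

With a five-minute duration, a brief spike that recovers within the window never pages anyone, which is what prevents flapping.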
SLO-Based Alerting
SLI: Proportion of requests that succeed in < 200ms
SLO: 99.9% of requests meet the SLI over 30 days
Error Budget: 0.1% = ~43 minutes/month
Alert when:
- Burn rate > 1x: Slow burn, low severity
- Burn rate > 10x: Fast burn, high severity
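The arithmetic behind these numbers is short enough to write out. In the sketch below, the burn rate is the observed error ratio divided by the error ratio the SLO allows, so a burn rate of 1 spends the budget exactly over the window. The function names and the severity labels are assumptions that mirror the thresholds above.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full outage the SLO tolerates over the window."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than allowed the budget is being spent."""
    return error_ratio / (1 - slo)


def severity(rate: float) -> str:
    """Map a burn rate to the alert severities listed above."""
    if rate > 10:
        return "high"
    if rate > 1:
        return "low"
    return "none"
```

For a 99.9% SLO over 30 days, `error_budget_minutes(0.999)` is about 43.2, and an observed error ratio of 2% gives a burn rate of about 20, which falls in the fast-burn range.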
Implementation Checklist
- • Structured logging configured
- • Log levels appropriate for environment
- • Sensitive data excluded from logs
- • Key metrics identified (RED/USE framework)
- • Metrics endpoint exposed (/metrics)
- • Trace context propagated across services
- • Health endpoints implemented (/health)
- • Alerts defined for critical paths
- • Runbooks linked to alerts
Rules (L1 - Hard)
Critical for security and operational safety.
- •NEVER log sensitive data (PII, tokens, passwords) - security requirement
- •ALWAYS propagate trace context across services (enables debugging)
- •ALWAYS include correlation IDs in logs (request tracing)
Defaults (L2 - Soft)
Important for operational quality. Override with reasoning when appropriate.
- •Use structured logging (not print/console.log)
- •Expose health check endpoints for orchestration
- •Discover existing project patterns before implementing
- •Use WebSearch for current library recommendations
- •Link alerts to runbooks
Guidelines (L3)
Recommendations for comprehensive observability.
- •Consider using RED/USE/Golden Signals frameworks for metrics
- •Prefer sampling traces in high-traffic environments
- •Consider SLO-based alerting over threshold-based