Observability & Monitoring Skill
Frameworks for implementing observability: structured logging, metrics, distributed tracing, alerting, and health checks.
When to Use
- Setting up application monitoring
- Implementing structured logging
- Adding metrics and dashboards
- Configuring distributed tracing
- Creating alerting rules
- Debugging production issues
Three Pillars of Observability
```
┌─────────────────┬─────────────────┬─────────────────┐
│      LOGS       │     METRICS     │     TRACES      │
├─────────────────┼─────────────────┼─────────────────┤
│ What happened   │ How is system   │ How do requests │
│ at specific     │ performing      │ flow through    │
│ point in time   │ over time       │ services        │
└─────────────────┴─────────────────┴─────────────────┘
```
Structured Logging
Log Levels
| Level | Use Case |
|---|---|
| ERROR | Unhandled exceptions, failed operations |
| WARN | Deprecated API, retry attempts |
| INFO | Business events, successful operations |
| DEBUG | Development troubleshooting |
Best Practice
```typescript
// Good: Structured with context
logger.info('User action completed', {
  action: 'purchase',
  userId: user.id,
  orderId: order.id,
  duration_ms: 150
});

// Bad: String interpolation
logger.info(`User ${user.id} completed purchase`);
```
See `templates/structured-logging.ts` for Winston setup and request middleware.
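To illustrate why the structured call is preferable, here is a minimal sketch of a JSON logger with a level threshold and child loggers that attach a request ID to every entry. It does not use Winston; `createLogger`, `Logger`, the `service` field, and the `requestId` value are hypothetical names chosen for illustration, and a production setup would more likely follow the template referenced above.

```typescript
type Level = 'error' | 'warn' | 'info' | 'debug';
type Context = Record<string, unknown>;

// Lower number = higher severity, in the same order as the log level table.
const LEVEL_PRIORITY: Record<Level, number> = { error: 0, warn: 1, info: 2, debug: 3 };

// Hypothetical logger shape used only in this sketch.
interface Logger {
  error(message: string, context?: Context): void;
  warn(message: string, context?: Context): void;
  info(message: string, context?: Context): void;
  debug(message: string, context?: Context): void;
  child(extra: Context): Logger;
}

function createLogger(
  minLevel: Level,
  baseContext: Context = {},
  write: (line: string) => void = (line) => console.log(line),
): Logger {
  const log = (level: Level, message: string, context: Context = {}): void => {
    // Drop entries that are less severe than the configured threshold.
    if (LEVEL_PRIORITY[level] > LEVEL_PRIORITY[minLevel]) return;
    // One JSON object per line, so every field stays machine-searchable.
    write(JSON.stringify({
      timestamp: new Date().toISOString(),
      level,
      message,
      ...baseContext,
      ...context,
    }));
  };
  return {
    error: (message, context) => log('error', message, context),
    warn: (message, context) => log('warn', message, context),
    info: (message, context) => log('info', message, context),
    debug: (message, context) => log('debug', message, context),
    // Child loggers add fields (for example a request ID) to every entry.
    child: (extra) => createLogger(minLevel, { ...baseContext, ...extra }, write),
  };
}

const logger = createLogger('info', { service: 'checkout' });
const requestLogger = logger.child({ requestId: 'req-123' });
requestLogger.info('User action completed', { action: 'purchase', duration_ms: 150 });
requestLogger.debug('Skipped because the threshold is info');
```

Because each field is a separate JSON key, a log backend can filter on `requestId` or `action` directly instead of parsing them out of an interpolated string.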
Metrics Collection
RED Method (Rate, Errors, Duration)
Essential metrics for any service:
- Rate - Requests per second
- Errors - Failed requests per second
- Duration - Request latency distribution
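As a rough illustration of the three RED numbers, the sketch below derives them from raw request samples collected over a fixed window. `RequestSample`, `summarizeRed`, and the sample values are hypothetical, and treating only 5xx responses as failures is an assumption; a real service would normally let a metrics backend compute these from counters and histograms.

```typescript
interface RequestSample {
  durationMs: number;
  statusCode: number;
}

interface RedSummary {
  ratePerSec: number;
  errorsPerSec: number;
  p95Ms: number;
}

// Nearest-rank percentile over raw samples; returns 0 for an empty input.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(Math.ceil((p / 100) * sorted.length), 1);
  return sorted[rank - 1] ?? 0;
}

function summarizeRed(samples: RequestSample[], windowSeconds: number): RedSummary {
  // Assumption: only 5xx responses count as failed requests.
  const failed = samples.filter((sample) => sample.statusCode >= 500).length;
  return {
    ratePerSec: samples.length / windowSeconds,
    errorsPerSec: failed / windowSeconds,
    p95Ms: percentile(samples.map((sample) => sample.durationMs), 95),
  };
}

const summary = summarizeRed(
  [
    { durationMs: 40, statusCode: 200 },
    { durationMs: 120, statusCode: 200 },
    { durationMs: 900, statusCode: 503 },
    { durationMs: 60, statusCode: 404 },
  ],
  2, // window length in seconds
);
console.log(summary);
```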
Prometheus Buckets
```typescript
// HTTP request latency buckets, in seconds
const httpLatencyBuckets = [0.01, 0.05, 0.1, 0.5, 1, 2, 5];

// Database query latency buckets, in seconds
const dbQueryLatencyBuckets = [0.001, 0.01, 0.05, 0.1, 0.5, 1];
```
See `templates/prometheus-metrics.ts` for full metrics configuration.
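To make the bucket boundaries concrete, the following sketch shows how a cumulative histogram counts observations: each bucket counts every observation less than or equal to its upper bound, with an implicit `+Inf` bucket equal to the total. It is a plain in-memory illustration, not the `prom-client` API; `createHistogram` is a hypothetical name and the observations are assumed to be in seconds.

```typescript
const HTTP_LATENCY_BUCKETS = [0.01, 0.05, 0.1, 0.5, 1, 2, 5];

interface HistogramSnapshot {
  buckets: Record<string, number>;
  count: number;
  sum: number;
}

// Hypothetical in-memory histogram with cumulative ("less than or equal") buckets.
function createHistogram(upperBounds: number[]) {
  const counts = upperBounds.map(() => 0);
  let total = 0;
  let sum = 0;
  return {
    observe(value: number): void {
      total += 1;
      sum += value;
      // Cumulative semantics: bump every bucket whose bound is >= the value.
      upperBounds.forEach((bound, index) => {
        if (value <= bound) counts[index] = (counts[index] ?? 0) + 1;
      });
    },
    snapshot(): HistogramSnapshot {
      const buckets: Record<string, number> = {};
      upperBounds.forEach((bound, index) => {
        buckets[String(bound)] = counts[index] ?? 0;
      });
      // The +Inf bucket always equals the total number of observations.
      buckets['+Inf'] = total;
      return { buckets, count: total, sum };
    },
  };
}

const httpLatency = createHistogram(HTTP_LATENCY_BUCKETS);
[0.03, 0.2, 0.2, 3].forEach((seconds) => httpLatency.observe(seconds));
console.log(httpLatency.snapshot().buckets);
```

Boundaries matter because percentile estimates can only be as precise as the bucket edges around the latencies of interest, which is why the HTTP and database lists cover different ranges.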
Distributed Tracing
OpenTelemetry Setup
Auto-instrument common libraries:
- Express/HTTP
- PostgreSQL
- Redis
Manual Spans
```typescript
tracer.startActiveSpan('processOrder', async (span) => {
  try {
    span.setAttribute('order.id', orderId);
    // ... work
  } finally {
    // End the span even if the work throws, so it is always exported
    span.end();
  }
});
```
See `templates/opentelemetry-tracing.ts` for full setup.
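One way to keep manual spans from leaking is to wrap the work in a helper that always ends the span and records failures. The sketch below assumes a small structural interface, `SpanLike`, that mirrors a subset of the method names on an OpenTelemetry span; `withSpan` and `createRecordingSpan` are hypothetical helpers, and the in-memory span exists only so the example runs without the OpenTelemetry SDK.

```typescript
// Structural subset of a tracing span (assumed shape for this sketch).
interface SpanLike {
  setAttribute(key: string, value: string | number | boolean): unknown;
  recordException(error: Error): void;
  end(): void;
}

// Hypothetical helper: runs `work`, records any failure, and always ends the span.
async function withSpan<T>(span: SpanLike, work: (span: SpanLike) => Promise<T>): Promise<T> {
  try {
    return await work(span);
  } catch (error) {
    span.recordException(error instanceof Error ? error : new Error(String(error)));
    throw error; // rethrow so callers still see the failure
  } finally {
    span.end();
  }
}

// In-memory stand-in for a real span, used only to show the call order.
function createRecordingSpan(): { span: SpanLike; events: string[] } {
  const events: string[] = [];
  const span: SpanLike = {
    setAttribute: (key, value) => { events.push(`attr ${key}=${value}`); },
    recordException: (error) => { events.push(`exception ${error.message}`); },
    end: () => { events.push('end'); },
  };
  return { span, events };
}

const demo = createRecordingSpan();
withSpan(demo.span, async (span) => {
  span.setAttribute('order.id', 'order-42');
  throw new Error('payment declined');
}).catch(() => console.log(demo.events));
```

With a real tracer, the same helper could be called from inside the `startActiveSpan` callback, passing the span it provides.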
Alerting Strategy
Severity Levels
| Level | Response Time | Examples |
|---|---|---|
| Critical (P1) | < 15 min | Service down, data loss |
| High (P2) | < 1 hour | Major feature broken |
| Medium (P3) | < 4 hours | Increased error rate |
| Low (P4) | Next day | Warnings |
Key Alerts
| Alert | Condition | Severity |
|---|---|---|
| ServiceDown | up == 0 for 1m | Critical |
| HighErrorRate | 5xx > 5% for 5m | Critical |
| HighLatency | p95 > 2s for 5m | High |
| LowCacheHitRate | < 70% for 10m | Medium |
See `templates/alerting-rules.yml` for Prometheus alerting rules.
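The "for 5m" part of each condition means the alert should fire only after the condition has held continuously for that long. The sketch below models that behavior in plain TypeScript so the state transitions are easy to follow; `AlertRule`, `createAlertEvaluator`, and the sample values are hypothetical, and the real evaluation would be done by the alerting system rather than application code.

```typescript
type AlertState = 'inactive' | 'pending' | 'firing';

interface AlertRule {
  name: string;
  forSeconds: number; // how long the condition must hold before firing
  condition: (value: number) => boolean;
}

// Hypothetical evaluator: returns a function to call on every new sample.
function createAlertEvaluator(rule: AlertRule) {
  let breachStartedAt: number | null = null;
  return (value: number, nowSeconds: number): AlertState => {
    if (!rule.condition(value)) {
      breachStartedAt = null; // any recovery resets the timer
      return 'inactive';
    }
    const startedAt = breachStartedAt ?? nowSeconds;
    breachStartedAt = startedAt;
    return nowSeconds - startedAt >= rule.forSeconds ? 'firing' : 'pending';
  };
}

// HighErrorRate from the table: 5xx ratio above 5% for 5 minutes.
const evaluateHighErrorRate = createAlertEvaluator({
  name: 'HighErrorRate',
  forSeconds: 300,
  condition: (errorRatio) => errorRatio > 0.05,
});

console.log(evaluateHighErrorRate(0.08, 0));
console.log(evaluateHighErrorRate(0.09, 120));
console.log(evaluateHighErrorRate(0.07, 300));
console.log(evaluateHighErrorRate(0.01, 360));
```

The intermediate pending state is what filters out short spikes: a breach that recovers before the duration elapses never reaches firing.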
Health Checks
Kubernetes Probes
| Probe | Purpose | Endpoint |
|---|---|---|
| Liveness | Is app running? | /health |
| Readiness | Ready for traffic? | /ready |
| Startup | Finished starting? | /startup |
Readiness Response
```json
{
  "status": "healthy|degraded|unhealthy",
  "checks": {
    "database": { "status": "pass", "latency_ms": 5 },
    "redis": { "status": "pass", "latency_ms": 2 }
  },
  "version": "1.0.0",
  "uptime": 3600
}
```
See `templates/health-checks.ts` for implementation.
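The readiness payload leaves open how per-dependency results roll up into the top-level `status`. One possible rule is sketched below. The `warn` and `fail` check statuses, the mapping to `degraded` and `unhealthy`, and the choice of HTTP 503 for an unhealthy service are all assumptions made for illustration; only `pass` and the three top-level values appear in the response shown above.

```typescript
type CheckStatus = 'pass' | 'warn' | 'fail'; // 'warn' and 'fail' are assumed values
type OverallStatus = 'healthy' | 'degraded' | 'unhealthy';

interface CheckResult {
  status: CheckStatus;
  latency_ms: number;
}

interface ReadinessResponse {
  status: OverallStatus;
  checks: Record<string, CheckResult>;
  version: string;
  uptime: number;
}

// Assumed roll-up rule: any failure is unhealthy, any warning is degraded.
function aggregateStatus(checks: Record<string, CheckResult>): OverallStatus {
  const statuses = Object.values(checks).map((check) => check.status);
  if (statuses.includes('fail')) return 'unhealthy';
  if (statuses.includes('warn')) return 'degraded';
  return 'healthy';
}

function buildReadinessResponse(
  checks: Record<string, CheckResult>,
  version: string,
  uptimeSeconds: number,
): ReadinessResponse {
  return { status: aggregateStatus(checks), checks, version, uptime: uptimeSeconds };
}

// Assumed convention: respond 503 only when the service should not receive traffic.
function httpStatusFor(status: OverallStatus): number {
  return status === 'unhealthy' ? 503 : 200;
}

const readiness = buildReadinessResponse(
  {
    database: { status: 'pass', latency_ms: 5 },
    redis: { status: 'warn', latency_ms: 250 },
  },
  '1.0.0',
  3600,
);
console.log(httpStatusFor(readiness.status), JSON.stringify(readiness));
```

Keeping `degraded` on a 200 response lets the instance stay in rotation while the payload still signals the problem to dashboards and alerts.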
Observability Checklist
Implementation
- JSON structured logging
- Request correlation IDs
- RED metrics (Rate, Errors, Duration)
- Business metrics
- Distributed tracing
- Health check endpoints
Alerting
- Service outage alerts
- Error rate thresholds
- Latency thresholds
- Resource utilization alerts
Dashboards
- Service overview
- Error analysis
- Performance metrics
Extended Thinking Triggers
Use Opus 4.5 extended thinking for:
- Incident investigation - Correlating logs, metrics, traces
- Alert tuning - Reducing noise, catching real issues
- Architecture decisions - Choosing monitoring solutions
- Performance debugging - Cross-service latency analysis
Templates Reference
| Template | Purpose |
|---|---|
| `structured-logging.ts` | Winston logger with request middleware |
| `prometheus-metrics.ts` | HTTP, DB, cache metrics with middleware |
| `opentelemetry-tracing.ts` | Distributed tracing setup |
| `alerting-rules.yml` | Prometheus alerting rules |
| `health-checks.ts` | Liveness, readiness, startup probes |