Observability & Monitoring Skill
See what's happening in production. Debug without reproducing. Understand system behavior at scale.
The Three Pillars
| Pillar | What | When | Tools |
|---|---|---|---|
| Logs | Discrete events | Debugging, auditing | Winston, Pino, Serilog |
| Metrics | Aggregated measurements | Alerting, dashboards | Prometheus, CloudWatch |
| Traces | Request flow across services | Distributed debugging | Jaeger, Zipkin |
Modern approach: OpenTelemetry unifies all three.
Logging Best Practices
Structured Logging
typescript
// ❌ Bad: Unstructured
console.log(`User ${userId} clicked button ${buttonId}`);
// ✅ Good: Structured
logger.info('Button clicked', {
userId,
buttonId,
timestamp: Date.now(),
sessionId: ctx.sessionId
});
Log Levels
| Level | Usage | Example |
|---|---|---|
| ERROR | Something failed, needs attention | Payment failed |
| WARN | Unexpected but handled | Retry succeeded |
| INFO | Business events | User logged in |
| DEBUG | Developer details | Cache hit/miss |
| TRACE | Verbose internals | Function entry/exit |
Correlation IDs
Track requests across services:
typescript
// Middleware to propagate trace ID
app.use((req, res, next) => {
req.traceId = req.headers['x-trace-id'] || uuid();
res.setHeader('x-trace-id', req.traceId);
next();
});
// Include in all logs
logger.info('Processing request', { traceId: req.traceId, ...data });
Metrics Patterns
The RED Method (Request-focused)
For services:
- •Rate: Requests per second
- •Errors: Failed requests per second
- •Duration: Request latency distribution
The USE Method (Resource-focused)
For infrastructure:
- •Utilization: % time resource busy
- •Saturation: Queue depth
- •Errors: Error count
Key Metric Types
| Type | Use Case | Example |
|---|---|---|
| Counter | Cumulative totals | requests_total |
| Gauge | Current value | temperature, queue_size |
| Histogram | Value distribution | request_duration_seconds |
| Summary | Quantiles | response_time_p99 |
Golden Signals (SRE)
- •Latency — Time to serve request
- •Traffic — Demand on system
- •Errors — Failed requests rate
- •Saturation — How full is the system
Distributed Tracing
Span Structure
code
Trace: user-checkout-abc123 ├── Span: api-gateway (50ms) │ ├── Span: auth-service (10ms) │ └── Span: order-service (35ms) │ ├── Span: inventory-check (8ms) │ └── Span: payment-service (20ms) │ └── Span: database-write (5ms)
Context Propagation
typescript
// OpenTelemetry automatic propagation
import { trace, context, propagation } from '@opentelemetry/api';
// Extract context from incoming request
const ctx = propagation.extract(context.active(), req.headers);
// Create span with parent context
const span = tracer.startSpan('process-order', undefined, ctx);
// Propagate to outgoing request
propagation.inject(context.active(), headers);
OpenTelemetry Setup
Node.js Quick Start
typescript
// tracing.ts - Load FIRST
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
const sdk = new NodeSDK({
traceExporter: new OTLPTraceExporter({
url: 'http://localhost:4318/v1/traces',
}),
instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
.NET Quick Start
csharp
// Program.cs
builder.Services.AddOpenTelemetry()
.WithTracing(tracing => tracing
.AddAspNetCoreInstrumentation()
.AddHttpClientInstrumentation()
.AddOtlpExporter());
Alerting Strategy
Alert Hierarchy
| Severity | Response | Example |
|---|---|---|
| P1/Critical | Wake someone up | Service down |
| P2/High | Fix within hours | Error rate > 5% |
| P3/Medium | Fix within days | Disk 80% |
| P4/Low | Fix when convenient | Deprecation warning |
Alert Anti-Patterns
❌ Alert fatigue — Too many non-actionable alerts ❌ Missing runbook — Alert with no remediation steps ❌ Threshold-only — Alert on static value, not trend ❌ No owner — Alert goes to void
Good Alert Template
yaml
alert: HighErrorRate expr: sum(rate(http_errors_total[5m])) / sum(rate(http_requests_total[5m])) > 0.05 for: 5m labels: severity: high team: backend annotations: summary: "Error rate above 5%" runbook: "https://runbooks.example.com/high-error-rate" dashboard: "https://grafana.example.com/d/errors"
Dashboard Design
Layout Principles
code
┌─────────────────────────────────────────────────────────┐ │ SERVICE HEALTH │ │ [Status] [Error Rate] [Latency P50] [Latency P99] │ ├─────────────────────────────────────────────────────────┤ │ TRAFFIC │ │ [Requests/sec graph over time] │ ├─────────────────────────────────────────────────────────┤ │ ERRORS │ LATENCY │ │ [Error breakdown by type] │ [Latency histogram] │ ├─────────────────────────────────────────────────────────┤ │ RESOURCES │ │ [CPU] [Memory] [Disk] [Network] │ └─────────────────────────────────────────────────────────┘
Dashboard Hierarchy
- •Overview — Executive view, all services
- •Service — Single service deep dive
- •Debug — Detailed metrics for investigation
Cloud Provider Tools
| Cloud | Metrics | Logs | Traces |
|---|---|---|---|
| Azure | Azure Monitor | Log Analytics | App Insights |
| AWS | CloudWatch | CloudWatch Logs | X-Ray |
| GCP | Cloud Monitoring | Cloud Logging | Cloud Trace |
Azure Application Insights
typescript
// Node.js
import { useAzureMonitor } from '@azure/monitor-opentelemetry';
useAzureMonitor({
azureMonitorExporterOptions: {
connectionString: process.env.APPLICATIONINSIGHTS_CONNECTION_STRING
}
});
VS Code Extension Observability
For VS Code extensions like Alex:
What to Monitor
| Metric | Why |
|---|---|
| Command execution time | User experience |
| Activation time | Startup performance |
| Error rates by command | Reliability |
| Memory usage | Resource efficiency |
| API call latency | External dependencies |
Telemetry Implementation
typescript
import * as vscode from 'vscode';
const telemetry = vscode.env.createTelemetryLogger({
sendEventData(eventName, data) {
// Send to your telemetry backend
},
sendErrorData(error, data) {
// Send errors with context
}
});
// Usage
telemetry.logUsage('command.executed', {
commandId: 'alex.meditate',
durationMs: 1500
});
Debugging Patterns
Log-Driven Debugging
- •Find error in logs
- •Get correlation ID
- •Search all logs with that ID
- •Reconstruct timeline
Trace-Driven Debugging
- •Find slow/failed trace
- •Examine span waterfall
- •Identify bottleneck span
- •Drill into that service
Metric-Driven Debugging
- •Notice anomaly in dashboard
- •Correlate with other metrics
- •Narrow time window
- •Switch to logs/traces for details
Implementation Checklist
New Service
- • Structured logging configured
- • Correlation ID propagation
- • Basic metrics (RED/USE)
- • Health check endpoint
- • OpenTelemetry instrumentation
- • Dashboard created
- • Alerts defined with runbooks
Production Readiness
- • Error rates < 0.1% baseline
- • P99 latency acceptable
- • Logs searchable and retained
- • Traces sampling configured
- • On-call runbooks written
Related Skills
- •performance-profiling — Deep dive into specific bottlenecks
- •incident-response — Using observability during outages
- •infrastructure-as-code — Deploying monitoring stack
- •security-review — Audit logging requirements
"If you can't measure it, you can't improve it." — Peter Drucker
Good observability means finding the problem before your users do.