AgentSkillsCN

Observability & Monitoring

通过日志、指标、追踪与告警,实现生产环境的全面可观测性——这三大支柱共同构筑起可观测性的坚实根基。

SKILL.md
--- frontmatter
name: "Observability & Monitoring"
description: "Production visibility through logs, metrics, traces, and alerting — the three pillars of observability"
applyTo: "**/*observ*,**/*monitor*,**/*telemetry*,**/*logging*,**/*metrics*,**/*traces*"

Observability & Monitoring Skill

See what's happening in production. Debug without reproducing. Understand system behavior at scale.

The Three Pillars

PillarWhatWhenTools
LogsDiscrete eventsDebugging, auditingWinston, Pino, Serilog
MetricsAggregated measurementsAlerting, dashboardsPrometheus, CloudWatch
TracesRequest flow across servicesDistributed debuggingJaeger, Zipkin

Modern approach: OpenTelemetry unifies all three.


Logging Best Practices

Structured Logging

typescript
// ❌ Bad: Unstructured
console.log(`User ${userId} clicked button ${buttonId}`);

// ✅ Good: Structured
logger.info('Button clicked', {
  userId,
  buttonId,
  timestamp: Date.now(),
  sessionId: ctx.sessionId
});

Log Levels

LevelUsageExample
ERRORSomething failed, needs attentionPayment failed
WARNUnexpected but handledRetry succeeded
INFOBusiness eventsUser logged in
DEBUGDeveloper detailsCache hit/miss
TRACEVerbose internalsFunction entry/exit

Correlation IDs

Track requests across services:

typescript
// Middleware to propagate trace ID
app.use((req, res, next) => {
  req.traceId = req.headers['x-trace-id'] || uuid();
  res.setHeader('x-trace-id', req.traceId);
  next();
});

// Include in all logs
logger.info('Processing request', { traceId: req.traceId, ...data });

Metrics Patterns

The RED Method (Request-focused)

For services:

  • Rate: Requests per second
  • Errors: Failed requests per second
  • Duration: Request latency distribution

The USE Method (Resource-focused)

For infrastructure:

  • Utilization: % time resource busy
  • Saturation: Queue depth
  • Errors: Error count

Key Metric Types

TypeUse CaseExample
CounterCumulative totalsrequests_total
GaugeCurrent valuetemperature, queue_size
HistogramValue distributionrequest_duration_seconds
SummaryQuantilesresponse_time_p99

Golden Signals (SRE)

  1. Latency — Time to serve request
  2. Traffic — Demand on system
  3. Errors — Failed requests rate
  4. Saturation — How full is the system

Distributed Tracing

Span Structure

code
Trace: user-checkout-abc123
├── Span: api-gateway (50ms)
│   ├── Span: auth-service (10ms)
│   └── Span: order-service (35ms)
│       ├── Span: inventory-check (8ms)
│       └── Span: payment-service (20ms)
│           └── Span: database-write (5ms)

Context Propagation

typescript
// OpenTelemetry automatic propagation
import { trace, context, propagation } from '@opentelemetry/api';

// Extract context from incoming request
const ctx = propagation.extract(context.active(), req.headers);

// Create span with parent context
const span = tracer.startSpan('process-order', undefined, ctx);

// Propagate to outgoing request
propagation.inject(context.active(), headers);

OpenTelemetry Setup

Node.js Quick Start

typescript
// tracing.ts - Load FIRST
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

.NET Quick Start

csharp
// Program.cs
builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddOtlpExporter());

Alerting Strategy

Alert Hierarchy

SeverityResponseExample
P1/CriticalWake someone upService down
P2/HighFix within hoursError rate > 5%
P3/MediumFix within daysDisk 80%
P4/LowFix when convenientDeprecation warning

Alert Anti-Patterns

Alert fatigue — Too many non-actionable alerts ❌ Missing runbook — Alert with no remediation steps ❌ Threshold-only — Alert on static value, not trend ❌ No owner — Alert goes to void

Good Alert Template

yaml
alert: HighErrorRate
expr: sum(rate(http_errors_total[5m])) / sum(rate(http_requests_total[5m])) > 0.05
for: 5m
labels:
  severity: high
  team: backend
annotations:
  summary: "Error rate above 5%"
  runbook: "https://runbooks.example.com/high-error-rate"
  dashboard: "https://grafana.example.com/d/errors"

Dashboard Design

Layout Principles

code
┌─────────────────────────────────────────────────────────┐
│                   SERVICE HEALTH                         │
│  [Status] [Error Rate] [Latency P50] [Latency P99]      │
├─────────────────────────────────────────────────────────┤
│                   TRAFFIC                                │
│  [Requests/sec graph over time]                         │
├─────────────────────────────────────────────────────────┤
│           ERRORS             │        LATENCY           │
│  [Error breakdown by type]   │  [Latency histogram]     │
├─────────────────────────────────────────────────────────┤
│                   RESOURCES                              │
│  [CPU] [Memory] [Disk] [Network]                        │
└─────────────────────────────────────────────────────────┘

Dashboard Hierarchy

  1. Overview — Executive view, all services
  2. Service — Single service deep dive
  3. Debug — Detailed metrics for investigation

Cloud Provider Tools

CloudMetricsLogsTraces
AzureAzure MonitorLog AnalyticsApp Insights
AWSCloudWatchCloudWatch LogsX-Ray
GCPCloud MonitoringCloud LoggingCloud Trace

Azure Application Insights

typescript
// Node.js
import { useAzureMonitor } from '@azure/monitor-opentelemetry';

useAzureMonitor({
  azureMonitorExporterOptions: {
    connectionString: process.env.APPLICATIONINSIGHTS_CONNECTION_STRING
  }
});

VS Code Extension Observability

For VS Code extensions like Alex:

What to Monitor

MetricWhy
Command execution timeUser experience
Activation timeStartup performance
Error rates by commandReliability
Memory usageResource efficiency
API call latencyExternal dependencies

Telemetry Implementation

typescript
import * as vscode from 'vscode';

const telemetry = vscode.env.createTelemetryLogger({
  sendEventData(eventName, data) {
    // Send to your telemetry backend
  },
  sendErrorData(error, data) {
    // Send errors with context
  }
});

// Usage
telemetry.logUsage('command.executed', {
  commandId: 'alex.meditate',
  durationMs: 1500
});

Debugging Patterns

Log-Driven Debugging

  1. Find error in logs
  2. Get correlation ID
  3. Search all logs with that ID
  4. Reconstruct timeline

Trace-Driven Debugging

  1. Find slow/failed trace
  2. Examine span waterfall
  3. Identify bottleneck span
  4. Drill into that service

Metric-Driven Debugging

  1. Notice anomaly in dashboard
  2. Correlate with other metrics
  3. Narrow time window
  4. Switch to logs/traces for details

Implementation Checklist

New Service

  • Structured logging configured
  • Correlation ID propagation
  • Basic metrics (RED/USE)
  • Health check endpoint
  • OpenTelemetry instrumentation
  • Dashboard created
  • Alerts defined with runbooks

Production Readiness

  • Error rates < 0.1% baseline
  • P99 latency acceptable
  • Logs searchable and retained
  • Traces sampling configured
  • On-call runbooks written

Related Skills

  • performance-profiling — Deep dive into specific bottlenecks
  • incident-response — Using observability during outages
  • infrastructure-as-code — Deploying monitoring stack
  • security-review — Audit logging requirements

"If you can't measure it, you can't improve it." — Peter Drucker

Good observability means finding the problem before your users do.