observability-audit

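Use this skill when you need to audit logging, monitoring, alerting, tracing, or metrics. It covers structured logging, error tracking, health checks, dashboards, distributed tracing, and incident detection capabilities.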

SKILL.md
--- frontmatter
name: observability-audit
description: "Use when auditing logging, monitoring, alerting, tracing, or metrics. Covers structured logging, error tracking, health checks, dashboards, distributed tracing, and incident detection capabilities."

Observability Audit

Overview

If you can't observe it, you can't debug it. If you can't debug it, you can't fix it. Observability is the difference between "something is wrong" and "here's exactly what's wrong, where, since when, and for whom."

Core principle: Every production system must answer four questions at any point in time: What happened? When? Why? To whom?

The Iron Law

code
NO PRODUCTION SERVICE WITHOUT STRUCTURED LOGGING, HEALTH CHECKS, AND ERROR TRACKING. NO DEPLOYMENT WITHOUT MONITORING. NO ALERT WITHOUT A RUNBOOK.

When to Use

  • "Can we debug production issues?"
  • Before deploying a new service
  • After a production incident (post-mortem revealed gaps)
  • When investigation takes too long
  • When errors are discovered by users instead of alerts
  • During any codebase audit
  • Before scaling a service

When NOT to Use

  • Prototypes or throwaway experiments (basic console logging is fine)
  • Static sites with no server component
  • If you're just adding a single log line (use code-review instead)
  • If the issue is specifically performance, not visibility (use performance-audit)

Anti-Shortcut Rules

code
YOU CANNOT:
- Say "logging is fine" — read actual log output and verify it's structured, contextual, and actionable
- Say "we have monitoring" — show the dashboard, verify it shows real-time data, check alerting rules
- Say "errors are tracked" — verify error grouping, notification routing, and triage workflow
- Skip checking what happens when a dependency goes down — simulate failure mentally or actually
- Assume health checks work — call the health endpoint and verify the response checks REAL dependencies
- Trust that alerts work — verify the notification pipeline end-to-end (alert → channel → person)
- Say "we'll add observability later" — it's needed BEFORE production, not after the first incident

Common Rationalizations (Don't Accept These)

| Rationalization | Reality |
|-----------------|---------|
| "We'll add logging when something breaks" | You won't know it's broken without logging. Circular dependency. |
| "Console.log is fine for now" | Unstructured text logs are unsearchable, unfilterable, and fill up disk fast. |
| "We don't need distributed tracing yet" | You will need it the moment your first cross-service bug appears. Retrofitting is painful. |
| "Nobody looks at the dashboard" | Then your dashboard shows the wrong things. Fix the dashboard; don't eliminate monitoring. |
| "Alerts are too noisy so we ignore them" | Noisy alerts are worse than no alerts. Tune them; don't disable them. |
| "Our error rate is low enough" | Low ≠ zero. One uncaught error in a payment flow is one too many. |
| "We know the system well enough" | You won't in 6 months, and neither will the person who replaces you. |

Iron Questions (Ask Before Concluding the Audit)

code
1. If this service crashed right now, how would we know? How quickly?
2. If response times doubled, which alert fires? In how many minutes?
3. Can I trace a single user request from entry to database and back?
4. If I search for a specific user's actions today, what do I find?
5. When was the last error? What was it? Is it resolved?
6. What's the difference between "service is down" and "service is degraded"? Can your health checks tell?
7. If a downstream dependency becomes slow (not down, just slow), does the system degrade gracefully?
8. Can an on-call engineer understand a 3AM alert without looking at code?
9. Are there any errors happening right now that nobody knows about?
10. What was the MTTR (mean time to resolution) for the last 3 incidents? Could better observability have reduced it?

The Audit Process

Phase 1: Structured Logging

Structured logging requirements:

json
{
  "timestamp": "2024-01-15T10:30:00.123Z",
  "level": "error",
  "message": "Payment processing failed",
  "service": "payment-service",
  "trace_id": "abc-123-def-456",
  "span_id": "ghi-789",
  "user_id": "user_456",
  "error": {
    "type": "PaymentGatewayError",
    "message": "Connection timeout after 30s",
    "code": "GATEWAY_TIMEOUT",
    "stack": "..."
  },
  "context": {
    "order_id": "order_789",
    "amount": 99.99,
    "currency": "USD",
    "gateway": "stripe",
    "attempt": 2
  },
  "duration_ms": 30042
}
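One way to emit entries shaped like the example above is a custom formatter on Python's standard `logging` module. This is a minimal sketch, not a prescribed implementation: the class name `JsonFormatter`, the helper `build_logger`, and the choice of which `extra=` fields to copy are all assumptions made for illustration.

```python
import json
import logging
from datetime import datetime, timezone

# Map stdlib level names onto the level names used in this document.
LEVEL_NAMES = {"WARNING": "warn", "CRITICAL": "fatal"}
# Fields copied from `extra=` when present (illustrative selection).
EXTRA_FIELDS = ("trace_id", "span_id", "user_id", "context", "duration_ms")


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": LEVEL_NAMES.get(record.levelname, record.levelname.lower()),
            "message": record.getMessage(),
            "service": self.service,
        }
        for key in EXTRA_FIELDS:
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        if record.exc_info:
            exc_type, exc, _tb = record.exc_info
            entry["error"] = {
                "type": exc_type.__name__,
                "message": str(exc),
                "stack": self.formatException(record.exc_info),
            }
        return json.dumps(entry, default=str)


def build_logger(service: str) -> logging.Logger:
    """Return a logger that writes one JSON object per line to stderr."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("payment-service")
    log.error(
        "Payment processing failed",
        extra={"trace_id": "abc-123-def-456", "user_id": "user_456", "duration_ms": 30042},
    )
```

Passing context through `extra=` keeps call sites short while the formatter owns the output schema, so a field rename happens in one place.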

Logging checklist:

| Check | Status | Detection |
|-------|--------|-----------|
| Structured format (JSON, not text) | | `grep -rn -e "console\.log" -e "print(" --include="*.ts" --include="*.py" --exclude-dir=node_modules .` |
| Consistent log levels (debug, info, warn, error) | | `grep -rn -e "logger\." -e "log\." --include="*.ts" --include="*.py" . \| head -30` |
| Request correlation IDs (trace_id) | | `grep -rn -e "trace_id" -e "correlation_id" -e "request_id" --include="*.ts" --include="*.py" .` |
| No PII in logs (passwords, tokens, SSN, emails) | | `grep -rn -e "password" -e "secret" -e "token" -e "ssn" --include="*.ts" --include="*.py" . \| grep -i log` |
| Error logs include stack traces | | Verify error handler attaches stack |
| Sufficient context for debugging | | Read 5 random log entries. Can you debug from them alone? |
| Log rotation / retention configured | | Check logging config for max size / days |
| Log aggregation (centralized collection) | | Check for logging service (CloudWatch, Datadog, ELK, Loki) |
| Sensitive data redacted | | Check for redaction middleware in logging pipeline |
| Request/response logging for APIs | | Check middleware for HTTP request logging |
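For the "No PII in logs" and "Sensitive data redacted" checks, it helps to know what redaction middleware roughly looks like. The sketch below is a key-based redactor; the `SENSITIVE_KEYS` set and the `redact` function name are illustrative assumptions, and a real pipeline would need a list tailored to its own data.

```python
from typing import Any

# Illustrative list of field names to mask; extend for your own schema.
SENSITIVE_KEYS = {"password", "secret", "token", "ssn", "authorization"}
MASK = "[REDACTED]"


def redact(value: Any) -> Any:
    """Return a copy of `value` with sensitive dict keys masked, recursively."""
    if isinstance(value, dict):
        return {
            key: MASK if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


if __name__ == "__main__":
    print(redact({"user_id": "user_456", "password": "hunter2", "items": [{"Token": "t-1"}]}))
```

Key-based masking does not catch secrets embedded in free-text messages, so the audit should still read real log output rather than rely on the middleware existing.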

Log levels guide (enforce consistency):

| Level | When | Example | Does NOT Include |
|-------|------|---------|------------------|
| debug | Detailed diagnostic info for developers | "Query took 45ms, returned 12 rows" | In production (should be disabled) |
| info | Normal operations worth recording | "User logged in", "Order created", "Deploy started" | Routine internal events (loop iterations) |
| warn | Unexpected but handled situations | "Rate limit approaching 80%", "Retry attempt 2/3", "Deprecated API called" | Things that are actually errors |
| error | Failures requiring attention | "Payment failed", "Database connection lost", "External API returned 500" | Things the code recovers from gracefully |
| fatal | Application cannot continue | "Config file missing", "Database unreachable on startup", "Port already in use" | Recoverable errors |

Common logging anti-patterns:

| Anti-Pattern | Example | Fix |
|--------------|---------|-----|
| Log and swallow | `catch (e) { log(e); }`, then nothing | Log AND handle appropriately |
| Generic messages | `logger.error("Error occurred")` | Include WHAT error, WHERE, in WHICH context |
| Missing context | `logger.error(error.message)` | Add user_id, request_id, input data |
| Sensitive data | `logger.info("User logged in", { password })` | Redact sensitive fields |
| Wrong level | `logger.error("User not found")` (a 404 is not an error) | Use warn for expected-but-unusual |
| Excessive logging | Logging every loop iteration | Log aggregates or milestones |
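The first three anti-patterns share one fix: say what failed, attach context, and let the failure propagate. A hedged Python sketch follows; `PaymentError`, `charge`, and the `gateway` callable are hypothetical names introduced only for this example.

```python
import logging
from typing import Callable

logger = logging.getLogger("payment-service")


class PaymentError(Exception):
    """Hypothetical domain error that callers are expected to handle."""


def charge(gateway: Callable[[str, float], str], order_id: str, user_id: str, amount: float) -> str:
    """Call the gateway; on timeout, log with context and re-raise as a domain error."""
    try:
        return gateway(order_id, amount)
    except TimeoutError as exc:
        # Specific message, stack trace attached, identifiers included.
        logger.error(
            "Payment gateway timed out",
            exc_info=True,
            extra={"user_id": user_id, "context": {"order_id": order_id, "amount": amount}},
        )
        # Handle, don't swallow: the caller decides whether to retry or fail the order.
        raise PaymentError(f"charge failed for order {order_id}") from exc
```

Chaining with `from exc` preserves the original exception as `__cause__`, so the root cause stays visible in whatever error tracker receives the domain error.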

Phase 2: Health Checks

code
1. DOES the service have a health endpoint? (/health, /healthz, /ready)
2. DOES it check actual dependencies? (DB connection, cache, external APIs)
3. IS it used by load balancers / orchestrators for routing decisions?
4. DOES it distinguish liveness vs readiness?
5. DOES it include version information?
6. IS the health check itself fast (< 500ms)?
7. DOES it avoid false positives? (e.g. reporting healthy when the service is actually degraded)

Liveness vs Readiness:

| Check | Purpose | What it Tests | Failure Action |
|-------|---------|---------------|----------------|
| Liveness (/healthz) | "Is the process running?" | Process alive, not deadlocked | Restart container |
| Readiness (/ready) | "Can it serve traffic?" | Dependencies reachable, migrations done | Remove from load balancer |
| Startup (/startup) | "Has it finished initializing?" | Warm-up complete, caches loaded | Wait longer before killing |

Health check response (gold standard):

json
{
  "status": "healthy",
  "checks": {
    "database": { "status": "healthy", "latency_ms": 5, "connection_pool": "8/20" },
    "cache": { "status": "healthy", "latency_ms": 1, "hit_rate": "94%" },
    "external_api": { "status": "degraded", "latency_ms": 2500, "note": "Slow but responding" },
    "disk": { "status": "healthy", "free_gb": 45.2, "usage_percent": 62 }
  },
  "version": "1.2.3",
  "commit": "abc1234",
  "uptime_seconds": 86400,
  "timestamp": "2024-01-15T10:30:00Z"
}
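A readiness handler that produces a response of this shape can be sketched as a set of per-dependency probes plus an aggregation rule. Everything named here (`run_check`, `readiness`, `liveness`, the `slow_ms` threshold, and the "worst status wins" rule) is an assumption for illustration. Teams may prefer a different policy, such as reporting overall healthy while a non-critical dependency is degraded, as the sample above does.

```python
import time
from typing import Callable, Dict, Tuple

SEVERITY = {"healthy": 0, "degraded": 1, "unhealthy": 2}


def run_check(probe: Callable[[], None], slow_ms: float = 1000.0) -> dict:
    """Run one dependency probe; a raised exception means unhealthy."""
    start = time.perf_counter()
    try:
        probe()
    except Exception as exc:
        return {"status": "unhealthy", "error": str(exc)}
    latency_ms = (time.perf_counter() - start) * 1000
    status = "degraded" if latency_ms > slow_ms else "healthy"
    return {"status": status, "latency_ms": round(latency_ms, 1)}


def readiness(probes: Dict[str, Callable[[], None]], version: str) -> Tuple[int, dict]:
    """Aggregate probes; return (HTTP status code, JSON-serializable body)."""
    checks = {name: run_check(probe) for name, probe in probes.items()}
    worst = max(
        (check["status"] for check in checks.values()),
        key=SEVERITY.__getitem__,
        default="healthy",
    )
    body = {"status": worst, "checks": checks, "version": version}
    # 503 tells the load balancer to stop routing; degraded still serves traffic.
    return (503 if worst == "unhealthy" else 200), body


def liveness() -> Tuple[int, dict]:
    """Liveness deliberately checks no dependencies: it only proves the process responds."""
    return 200, {"status": "alive"}
```

Keeping liveness free of dependency checks avoids restart loops when a database is down, since restarting the container would not fix that.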

Phase 3: Error Tracking

code
1. ARE errors captured and aggregated? (Sentry, Bugsnag, Datadog, Rollbar)
2. ARE errors grouped intelligently? (not 1000 identical alerts)
3. DO errors include user context, request context, and breadcrumbs?
4. ARE error notifications routed correctly? (critical → PagerDuty, low → Slack)
5. IS there error rate monitoring with thresholds?
6. ARE errors triaged and assigned? (not just collected)
7. IS there a distinction between expected errors (404, validation) and unexpected errors (500, null reference)?
8. ARE source maps uploaded for frontend error tracking?

Error tracking maturity model:

| Level | Description | Assessment |
|-------|-------------|------------|
| 0: None | Errors disappear into void | 🔴 Critical |
| 1: Basic | Errors logged to file, reviewed manually | 🟠 High risk |
| 2: Collected | Errors sent to tracking service (Sentry) | 🟡 Acceptable |
| 3: Managed | Errors grouped, prioritized, assigned | 🟢 Good |
| 4: Proactive | Error budgets, anomaly detection, auto-remediation | 🟢 Excellent |
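Two ideas from the questions above, grouping identical errors and separating expected from unexpected ones, can be illustrated in a few lines. This is a simplified sketch and not how any particular tracker works: hosted services such as Sentry use their own grouping algorithms. The functions `fingerprint` and `is_expected` are hypothetical.

```python
import hashlib
import re


def fingerprint(error_type: str, message: str, top_frame: str) -> str:
    """Group errors that differ only in embedded numbers (ids, counts, durations)."""
    normalized = re.sub(r"\d+", "<n>", message)
    raw = f"{error_type}|{top_frame}|{normalized}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:12]


def is_expected(status_code: int) -> bool:
    """Treat 4xx responses (404, validation failures) as expected; 5xx as unexpected."""
    return 400 <= status_code < 500


if __name__ == "__main__":
    a = fingerprint("PaymentGatewayError", "Order 789 timed out after 30s", "gateway.py:42")
    b = fingerprint("PaymentGatewayError", "Order 790 timed out after 31s", "gateway.py:42")
    print(a == b)
```

With a fingerprint like this, a thousand timeouts on different orders collapse into one issue with a count, which is what makes triage and assignment practical.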

Phase 4: Metrics

code
1. ARE key business metrics tracked? (signups, orders, revenue, churn)
2. ARE key technical metrics tracked? (response time, error rate, throughput, saturation)
3. IS there a dashboard? (Grafana, Datadog, CloudWatch)
4. ARE alerts configured for anomalies?
5. ARE metrics labeled with dimensions? (by endpoint, by user tier, by region)
6. ARE custom metrics defined for domain-specific concerns?
7. CAN you compare current metrics to historical baselines?

Essential metrics frameworks:

RED Method (for request-driven services):

| Metric | What It Measures | Alert Threshold Example |
|--------|------------------|-------------------------|
| Rate | Requests per second | < 50% of baseline for 5 min |
| Errors | Error rate (% or count) | > 1% error rate for 2 min |
| Duration | Response time (p50, p95, p99) | p99 > 2s for 5 min |
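To make the three RED numbers concrete, the sketch below derives them from a window of request records. It assumes an in-memory list and a nearest-rank percentile; the `Request` dataclass and `red_summary` function are illustrative names, and a production system would normally get these values from its metrics backend rather than compute them by hand.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    duration_ms: float
    status: int


def percentile(sorted_values: List[float], p: int) -> float:
    """Nearest-rank percentile of an ascending, non-empty list; p in 1..100."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def red_summary(requests: List[Request], window_seconds: float) -> dict:
    """Rate, Errors, Duration for one observation window."""
    if not requests:
        return {"rate_rps": 0.0, "error_rate": 0.0, "p50_ms": None, "p95_ms": None, "p99_ms": None}
    durations = sorted(r.duration_ms for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }
```

Reporting p95 and p99 next to p50 matters because an average can look healthy while a small share of users wait several seconds.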

USE Method (for infrastructure/resources):

| Metric | What It Measures | Alert Threshold Example |
|--------|------------------|-------------------------|
| Utilization | CPU, memory, disk, connections | > 80% for 10 min |
| Saturation | Queue depth, thread pool | > 90% capacity |
| Errors | Hardware errors, connection errors | Any non-zero |

The Four Golden Signals (Google SRE):

| Signal | Measures |
|--------|----------|
| Latency | How long requests take (distinguish success vs error latency) |
| Traffic | How much demand is placed on the system |
| Errors | Rate of failed requests |
| Saturation | How "full" the system is (most constrained resource) |

Phase 5: Distributed Tracing

code
1. ARE requests traceable across services? (OpenTelemetry, Jaeger, Zipkin)
2. ARE trace IDs propagated through the ENTIRE call chain? (HTTP headers, message queues, async jobs)
3. CAN you reconstruct a full request path from a single trace ID?
4. ARE slow operations visible as spans with duration?
5. ARE database queries captured as spans?
6. ARE external API calls captured as spans?
7. IS sampling configured appropriately? (100% for errors, 1-10% for normal traffic)
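The sampling rule in item 7 can be sketched as a deterministic decision derived from the trace ID, so every service reaches the same verdict for the same trace. The function name `should_sample` and the 5% default are assumptions. Note also that a decision made at the start of a request cannot know whether the request will fail, so keeping 100% of errors in practice usually means deciding after the trace completes (tail-based sampling).

```python
def should_sample(trace_id: str, is_error: bool, rate: float = 0.05) -> bool:
    """Keep every error trace; keep roughly `rate` of the rest.

    `trace_id` is assumed to be a hex string of at least 8 characters.
    """
    if is_error:
        return True
    # Map the last 32 bits of the trace id onto [0, 1).
    bucket = int(trace_id[-8:], 16) / 0x100000000
    return bucket < rate


if __name__ == "__main__":
    print(should_sample("4bf92f3577b34da6a3ce929d0e0e4736", is_error=False))
```

Deriving the decision from the ID rather than from a random draw avoids traces where one service recorded its spans and the next one dropped them.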

Trace propagation checklist:

| Boundary | Propagated? | Method |
|----------|-------------|--------|
| HTTP → HTTP | | W3C Trace Context header or B3 header |
| HTTP → Queue | | Message metadata |
| Queue → Worker | | Read from message metadata |
| Sync → Async | | Explicit context passing |
| Service → Database | | Span wrapping DB client |
| Service → External API | | Injected in outgoing headers |
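For the HTTP rows, the W3C Trace Context `traceparent` header has the form `00-<32 hex trace id>-<16 hex parent span id>-<2 hex flags>`. The sketch below parses an incoming header and builds the header for an outgoing call. It handles only version `00`, and the function names are illustrative. In a real service an OpenTelemetry propagator would normally do this, so treat the code as a way to see what to look for when auditing.

```python
import re
import secrets
from typing import Optional, Tuple

_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> Optional[Tuple[str, str, bool]]:
    """Return (trace_id, parent_span_id, sampled), or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero ids are invalid per the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


def child_traceparent(incoming: Optional[str]) -> str:
    """Build the header for an outgoing call: same trace id, fresh span id."""
    parsed = parse_traceparent(incoming) if incoming else None
    trace_id = parsed[0] if parsed else secrets.token_hex(16)
    sampled = parsed[2] if parsed else True
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


if __name__ == "__main__":
    print(child_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"))
```

The same trace ID can be carried in message metadata for the queue rows; what matters for the audit is that the ID written by the producer is the one the worker reads and logs.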

Phase 6: Alerting

code
1. ARE alerts defined for critical metrics? (error rate, latency, availability)
2. DO alerts have clear, actionable titles? (not "Alert #472")
3. DO alerts include runbooks or links? (what to do when this fires)
4. IS there alert severity tiering? (page vs warn vs info)
5. IS there alert fatigue? (> 50 alerts/week per team = too many)
6. DO alerts have proper thresholds? (not too sensitive, not too lax)
7. IS there an on-call rotation? (who gets paged?)
8. ARE alerts tested periodically? (fire drills for alerting)

Alert quality rubric:

| Quality | Bad Alert | Good Alert |
|---------|-----------|------------|
| Title | "Error detected" | "Payment API error rate > 5% for 3 min" |
| Context | None | Current rate: 8.2%, baseline: 0.3%, affected endpoints: /checkout |
| Action | "Investigate" | "Check payment gateway status, see runbook: link" |
| Routing | Everyone | On-call for payment-team |
| Threshold | Any error | > 5% error rate sustained for 3 minutes |
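The "sustained for 3 minutes" part of a good threshold is what separates a real incident from a momentary spike. A minimal sketch of that evaluation follows, assuming time-ordered `(timestamp, error_rate)` samples; the function name `should_fire` is hypothetical, and alerting systems such as Prometheus or Datadog provide this behavior through their own rule configuration.

```python
from typing import Iterable, Optional, Tuple


def should_fire(
    samples: Iterable[Tuple[float, float]],
    threshold: float,
    sustain_seconds: float,
) -> bool:
    """Fire only if the rate stays above `threshold` for `sustain_seconds` without a dip.

    `samples` are (unix_timestamp, error_rate) pairs in ascending time order.
    """
    breach_start: Optional[float] = None
    for timestamp, rate in samples:
        if rate > threshold:
            if breach_start is None:
                breach_start = timestamp
            if timestamp - breach_start >= sustain_seconds:
                return True
        else:
            # Any sample back under the threshold resets the clock.
            breach_start = None
    return False


if __name__ == "__main__":
    minute_samples = [(0, 0.08), (60, 0.09), (120, 0.082), (180, 0.082)]
    print(should_fire(minute_samples, threshold=0.05, sustain_seconds=180))
```

Requiring a sustained breach trades a few minutes of detection delay for far fewer pages, which is usually the right exchange for anything short of a full outage.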

Output Format

markdown
# Observability Audit: [Project Name]

## Summary
| Capability | Status | Tool | Maturity Level | Assessment |
|-----------|--------|------|---------------|------------|
| Structured Logging | ✅/⚠️/❌ | [Tool] | [0-4] | [Notes] |
| Health Checks | ✅/⚠️/❌ | — | [0-4] | [Notes] |
| Error Tracking | ✅/⚠️/❌ | [Tool] | [0-4] | [Notes] |
| Metrics & Dashboards | ✅/⚠️/❌ | [Tool] | [0-4] | [Notes] |
| Distributed Tracing | ✅/⚠️/❌ | [Tool] | [0-4] | [Notes] |
| Alerting | ✅/⚠️/❌ | [Tool] | [0-4] | [Notes] |

## Can Answer The Four Questions?
| Question | Answer | Confidence |
|----------|--------|------------|
| What happened? | ✅/❌ | [How quickly, how precisely] |
| When? | ✅/❌ | [Timestamp precision] |
| Why? | ✅/❌ | [Root cause discoverability] |
| To whom? | ✅/❌ | [User attribution capability] |

## Findings
[Standard severity format — sorted by severity]

## Severity Summary
| Severity | Count |
|----------|-------|
| 🔴 Critical | N |
| 🟠 High | N |
| 🟡 Medium | N |
| 🟢 Low | N |

## Verdict: [PASS / CONDITIONAL PASS / FAIL]

Red Flags — STOP and Investigate

  • console.log / print() as the primary logging strategy
  • No health check endpoint
  • No error tracking service (errors silently disappear)
  • No alerting on error rate spikes
  • PII in logs (passwords, tokens, personal data, IP addresses)
  • No request correlation IDs (can't trace a request across services)
  • Health check returns 200 even when database is down
  • Alerts that nobody responds to (alert fatigue)
  • Log files on disk with no rotation (disk fills up)
  • No distinction between expected and unexpected errors
  • Metrics without baselines (can't detect anomalies)
  • Traces that stop at service boundaries

Integration

  • Part of: Full audit with architecture-audit
  • Enables: incident-response capabilities (what you can observe, you can respond to)
  • Informs: performance-audit metrics analysis
  • Complements: security-audit for audit logging and intrusion detection
  • Prerequisites: ci-cd-audit for deployment marker integration