logging-monitoring

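Implement observability patterns, including structured logging, log levels, correlation IDs, metrics, and distributed tracing. Use this skill when adding structured logging, implementing correlation IDs for request tracing, configuring metrics collection, setting up distributed tracing, or designing alerting rules.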

SKILL.md
---
name: "logging-monitoring"
description: 'Implement observability patterns including structured logging, log levels, correlation IDs, metrics, and distributed tracing. Use when adding structured logging, implementing correlation IDs for request tracing, configuring metrics collection, setting up distributed tracing, or designing alerting rules.'
metadata:
  author: "AgentX"
  version: "1.0.0"
  created: "2025-01-15"
  updated: "2025-01-15"
---

Logging & Monitoring

Purpose: Implement observability for production systems.
Goal: Structured logs, correlation across requests, actionable metrics.
Note: For implementation, see C# Development or Python Development.


When to Use This Skill

  • Adding structured logging to applications
  • Implementing request correlation IDs
  • Configuring metrics collection
  • Setting up distributed tracing (OpenTelemetry)
  • Designing alerting rules and health checks

Prerequisites

  • Logging framework installed
  • Monitoring platform access

Decision Tree

Observability concern?
├─ What to log?
│   ├─ Request start/end → INFO with correlation ID
│   ├─ Expected errors → WARN (validation, not-found)
│   ├─ Unexpected errors → ERROR with stack trace
│   └─ Debug details → DEBUG (disabled in production)
├─ What NOT to log?
│   └─ PII, passwords, tokens, credit cards → NEVER
├─ Metrics needed?
│   ├─ RED metrics: Rate, Errors, Duration (for services)
│   └─ USE metrics: Utilization, Saturation, Errors (for resources)
├─ Distributed tracing?
│   └─ OpenTelemetry for cross-service correlation
└─ Alerting?
    ├─ SLO-based: alert on error budget burn rate
    └─ Avoid alert fatigue: page only for actionable issues
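
The SLO-based alerting branch of the tree above can be made concrete with a small burn-rate calculation. This is a minimal sketch, not a definitive alerting rule: the function names are hypothetical, and the default threshold of 14.4 is an assumption (a commonly cited fast-burn value for a one-hour window against a 30-day budget) that should be tuned per service.

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Ratio of the observed error rate to the error rate the SLO allows.

    A value of 1.0 means the error budget is being spent exactly on schedule;
    higher values mean the budget will run out early.
    """
    if total == 0:
        return 0.0  # no traffic, nothing burned
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return (errors / total) / error_budget


def should_page(errors: int, total: int, slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only when the budget is burning fast enough to need action.

    The threshold is an assumed value, not one taken from this document.
    """
    return burn_rate(errors, total, slo_target) >= threshold


if __name__ == "__main__":
    # 200 failures out of 10,000 requests against a 99.9% SLO:
    # the observed error rate (2%) is about 20x the allowed rate (0.1%).
    print(should_page(errors=200, total=10_000, slo_target=0.999))
    # 5 failures out of 10,000 is well under the assumed threshold.
    print(should_page(errors=5, total=10_000, slo_target=0.999))
```

Alerting on the ratio rather than on a raw error count keeps the rule meaningful as traffic grows or shrinks, which helps with the alert-fatigue concern in the same branch.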

Structured Logging

Concept

Log structured data (key-value pairs) instead of plain text for better searchability and analysis.

❌ Unstructured (hard to parse):
  "User user_123 logged in from 192.168.1.1 at 2024-01-15 10:30:00"

✅ Structured (machine-readable):
  {
    "event": "user_login",
    "user_id": "user_123",
    "ip_address": "192.168.1.1",
    "timestamp": "2024-01-15T10:30:00Z",
    "level": "INFO"
  }
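
One way to produce output shaped like the structured example above is a custom formatter on Python's standard `logging` module. This is a sketch under assumptions: `JsonFormatter`, `build_logger`, and the `extra={"fields": ...}` convention are hypothetical names chosen for illustration, and a dedicated structured-logging library may be a better fit in a real service.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Assumed convention: callers pass extra={"fields": {...}} and the
        # key-value pairs are merged into the top-level JSON object.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(name: str = "app") -> logging.Logger:
    """Create a logger that writes one JSON object per line to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.info(
        "user_login",
        extra={"fields": {"user_id": "user_123", "ip_address": "192.168.1.1"}},
    )
```

Keeping the event name short and stable (`user_login`) and putting the variable parts in fields is what makes the logs queryable by field later.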

Benefits

  • Searchable: Query by any field
  • Filterable: Show only errors, specific users, etc.
  • Aggregatable: Count events, calculate averages
  • Parseable: Tools can process automatically

Log Levels

Standard Levels

| Level | When to Use | Example |
|-------|-------------|---------|
| TRACE | Very detailed debugging | "Entering function with params: {x: 1, y: 2}" |
| DEBUG | Debugging information | "Cache hit for key: user_123" |
| INFO | Normal operations | "User logged in", "Order created" |
| WARN | Unexpected but recoverable | "Retry attempt 2 of 3", "Rate limit approaching" |
| ERROR | Failures requiring attention | "Payment failed", "Database connection lost" |
| FATAL | Application cannot continue | "Out of memory", "Configuration invalid" |
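
The split between expected and unexpected errors is easiest to see in a request handler. The sketch below assumes a hypothetical `NotFoundError` and an in-memory `orders` mapping. It logs request start and end at INFO, an expected not-found at WARN without a stack trace, and anything else at ERROR with the stack trace attached via `logger.exception`.

```python
import logging

logger = logging.getLogger("orders")


class NotFoundError(Exception):
    """Hypothetical expected error: the requested order does not exist."""


def handle_request(order_id: str, orders: dict) -> dict:
    logger.info("request_start order_id=%s", order_id)
    try:
        if order_id not in orders:
            raise NotFoundError(order_id)
        result = {"status": 200, "order": orders[order_id]}
    except NotFoundError:
        # Expected failure: the caller asked for something that is not there.
        # WARN is enough and a stack trace would only add noise.
        logger.warning("order_not_found order_id=%s", order_id)
        result = {"status": 404}
    except Exception:
        # Unexpected failure: logger.exception logs at ERROR level and
        # appends the current stack trace.
        logger.exception("order_lookup_failed order_id=%s", order_id)
        result = {"status": 500}
    logger.info("request_end order_id=%s status=%s", order_id, result["status"])
    return result


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    handle_request("a1", {"a1": {"total": 42}})  # INFO only
    handle_request("zz", {"a1": {"total": 42}})  # WARN
    handle_request("a1", None)                   # ERROR with stack trace
```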

Level Configuration by Environment

Development: DEBUG or TRACE
  - See detailed information for debugging

Staging: INFO
  - Normal operations plus warnings/errors

Production: INFO (or WARN)
  - Reduce noise, focus on significant events
  - Keep ERROR/FATAL always enabled
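
The per-environment defaults above can be resolved at startup from environment variables. This is a sketch, assuming hypothetical variable names `APP_ENV` and `LOG_LEVEL`. Python's standard `logging` module has no TRACE level, so the sketch maps development to DEBUG, and it treats FATAL as an alias for CRITICAL.

```python
import logging
import os

# Level names accepted by this sketch, mapped to stdlib numeric levels.
LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARN": logging.WARNING,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "FATAL": logging.CRITICAL,
    "CRITICAL": logging.CRITICAL,
}

# Defaults per environment, following the guidance above.
DEFAULT_LEVELS = {
    "development": "DEBUG",
    "staging": "INFO",
    "production": "INFO",
}


def resolve_level(env=None) -> int:
    """Pick a log level: explicit LOG_LEVEL wins, else the APP_ENV default."""
    env = os.environ if env is None else env
    app_env = env.get("APP_ENV", "production").lower()
    name = env.get("LOG_LEVEL", DEFAULT_LEVELS.get(app_env, "INFO")).upper()
    # Unknown names fall back to INFO rather than failing at startup.
    return LEVELS.get(name, logging.INFO)


if __name__ == "__main__":
    logging.basicConfig(level=resolve_level())
    logging.getLogger(__name__).info("logging configured")
```

Because the threshold only filters out lower levels, ERROR and FATAL messages stay enabled under every setting this sketch can produce.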

Best Practices Summary

| Practice | Description |
|----------|-------------|
| Structured logging | JSON format with key-value pairs |
| Correlation IDs | Trace requests across services |
| Appropriate levels | DEBUG in dev, INFO+ in prod |
| No sensitive data | Never log passwords, tokens, PII |
| Context in errors | Include what, why, and how to fix |
| Meaningful metrics | Track rate, errors, duration |
| Health checks | Liveness + readiness endpoints |
| Actionable alerts | Include runbooks, reduce noise |
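
The "No sensitive data" practice is easier to uphold when redaction happens in one place before fields reach the logger. The sketch below is illustrative only: the key list, the placeholder strings, and the simple email pattern are assumptions, and a real deployment would need a list reviewed against its own data.

```python
import re

# Assumed set of field names that must never be logged in clear text.
SENSITIVE_KEYS = {"password", "token", "authorization", "credit_card", "ssn"}

# Deliberately simple email pattern; it is not a full address validator.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(fields: dict) -> dict:
    """Return a copy of fields that is safe to pass to a logger."""
    clean = {}
    for key, value in fields.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"          # drop the value entirely
        elif isinstance(value, dict):
            clean[key] = redact(value)         # recurse into nested fields
        elif isinstance(value, str):
            clean[key] = EMAIL_RE.sub("[EMAIL]", value)  # mask embedded emails
        else:
            clean[key] = value
    return clean


if __name__ == "__main__":
    print(redact({
        "event": "login_failed",
        "password": "hunter2",
        "note": "contact john@example.com",
        "request": {"Token": "abc123", "attempt": 2},
    }))
```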

Observability Tools

| Category | Tools |
|----------|-------|
| Logging | ELK Stack, Splunk, Datadog Logs, CloudWatch Logs |
| Metrics | Prometheus + Grafana, Datadog, New Relic, CloudWatch |
| Tracing | Jaeger, Zipkin, Datadog APM, Application Insights |
| All-in-One | Datadog, New Relic, Dynatrace, Elastic Observability |

See Also: Error Handling, C# Development, Python Development

Troubleshooting

| Issue | Solution |
|-------|----------|
| Logs not appearing in monitoring platform | Check log level configuration, verify sink/exporter endpoint |
| Correlation IDs missing across services | Propagate W3C trace context headers in all HTTP calls |
| Alert fatigue from too many notifications | Set meaningful thresholds, group related alerts, add alert suppression windows |
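
For the missing-correlation-ID case, the fix amounts to forwarding the W3C `traceparent` header on every outgoing call. The sketch below shows the idea with the standard library only. The function names are hypothetical, it handles only version `00` of the header, and it assumes incoming header names are already lowercased. In practice an OpenTelemetry propagator would normally do this instead of hand-written code.

```python
import re
import secrets

# traceparent layout: version "00", 32-hex trace-id, 16-hex parent-id,
# and 2-hex trace-flags, separated by dashes.
TRACEPARENT_RE = re.compile(
    r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_traceparent() -> str:
    """Start a new trace with a random trace-id and span id, sampled flag set."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def outgoing_headers(incoming: dict) -> dict:
    """Continue the caller's trace if its header is valid, else start one.

    The trace-id is preserved so all services share one correlation ID;
    the parent-id is replaced with a fresh id for the current span.
    """
    match = TRACEPARENT_RE.match(incoming.get("traceparent", ""))
    if match is None or set(match.group(1)) == {"0"}:
        # Missing, malformed, or all-zero trace-id: do not propagate it.
        return {"traceparent": new_traceparent()}
    trace_id, _caller_span, flags = match.groups()
    return {"traceparent": f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"}


if __name__ == "__main__":
    incoming = {
        "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    }
    print(outgoing_headers(incoming))  # same trace-id, new span id
    print(outgoing_headers({}))        # brand-new trace
```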
