AgentSkillsCN

structured-logging

使用宽事件/规范日志行编写有效日志消息的指南。在编写日志代码、添加仪器、提高可观测性或审查日志语句时使用此方法。教授高基数、高维度的结构化日志记录,便于调试。

SKILL.md
--- frontmatter
name: structured-logging
description: Guide for writing effective log messages using wide events / canonical log lines. Use when writing logging code, adding instrumentation, improving observability, or reviewing log statements. Teaches high-cardinality, high-dimensionality structured logging that enables debugging.

Structured Logging

Core Principle

Emit one wide event per request per service, not scattered log statements.

  • Build context throughout the request lifecycle
  • Emit once at the end (in a finally block)
  • Optimize for querying, not writing
  • Include high-cardinality fields (user_id, request_id) that enable debugging
  • Include high-dimensionality (50+ fields) capturing full context

Anti-patterns

Avoid these common mistakes:

code
// String concatenation - loses structure
log("User " + userId + " payment failed: " + error)

// Scattered statements - 17 lines for one request
log("Starting request")
log("Validating token")
log("Token valid")
log("Fetching user")
log("User found")
log("Processing payment")
log("Payment failed")

// Low cardinality only - can't debug specific users
log({level: "error", message: "Payment failed"})

// Missing business context - no debugging power
log({user_id: "123", error: "failed"})

Wide Event Structure

One comprehensive event per request:

json
{
  "timestamp": "2025-01-15T10:23:45.612Z",
  "level": "error",

  "request_id": "req_8bf7ec2d",
  "trace_id": "abc123",
  "span_id": "span_456",

  "service": "checkout-service",
  "version": "2.4.1",
  "deployment_id": "deploy_789",
  "region": "us-east-1",

  "method": "POST",
  "path": "/api/checkout",
  "status_code": 500,
  "duration_ms": 1247,

  "user": {
    "id": "user_456",
    "subscription": "premium",
    "account_age_days": 847,
    "lifetime_value_cents": 284700
  },

  "cart": {
    "id": "cart_xyz",
    "item_count": 3,
    "total_cents": 15999,
    "coupon_applied": "SAVE20"
  },

  "payment": {
    "method": "card",
    "provider": "stripe",
    "latency_ms": 1089,
    "attempt": 3
  },

  "error": {
    "type": "PaymentError",
    "code": "card_declined",
    "message": "Card declined by issuer",
    "retriable": false,
    "stripe_decline_code": "insufficient_funds"
  },

  "feature_flags": {
    "new_checkout_flow": true,
    "express_payment": false
  },

  "timing": {
    "db_query_ms": 45,
    "external_api_ms": 1089,
    "db_queries_count": 3,
    "cache_hit": false
  }
}

For the comprehensive field reference, see references/wide-event-fields.md.

Log Levels

Use log levels to indicate severity and required action:

LevelWhen to UseExamples
ERRORRequest failed, needs investigationUnhandled exception, 5xx response, data corruption
WARNDegraded but recoveredRetry succeeded, fallback used, rate limit approaching, deprecated API called
INFONormal request completionThe canonical log line for successful requests
DEBUGDetailed diagnosticsCache lookups, query plans, intermediate state (usually sampled out in production)

The wide event's level field reflects the worst outcome of the request:

  • Request succeeded normally: info
  • Request succeeded after retry/fallback: warn
  • Request failed: error

Examples

Bad: Scattered Logs

code
log.info("Incoming request", {path: "/checkout"})
log.debug("Fetching user")
log.debug("User found", {user_id: "123"})
log.debug("Fetching cart")
log.info("Processing payment")
log.warn("Payment attempt 1 failed, retrying")
log.warn("Payment attempt 2 failed, retrying")
log.error("Payment failed after 3 attempts")

Good: Single Wide Event

code
log.error({
  request_id: "req_abc",
  path: "/checkout",
  method: "POST",
  status_code: 500,
  duration_ms: 3200,
  user: {id: "123", subscription: "premium"},
  cart: {item_count: 3, total_cents: 15999},
  payment: {
    provider: "stripe",
    attempts: 3,
    latency_ms: 2800
  },
  error: {
    type: "PaymentError",
    code: "card_declined",
    retriable: false
  }
})

Bad: Missing Context

code
log.error("Database query failed", {error: "timeout"})

Good: Full Context

code
log.error({
  request_id: "req_xyz",
  user: {id: "user_789", subscription: "enterprise"},
  operation: "fetch_orders",
  database: {
    host: "db-prod-01",
    query_duration_ms: 30000,
    timeout_ms: 30000,
    table: "orders",
    rows_scanned: 1500000
  },
  error: {
    type: "DatabaseError",
    code: "QUERY_TIMEOUT",
    message: "Query exceeded 30s timeout",
    retriable: true,
    suggestion: "Add index on orders.user_id or paginate results"
  }
})

Implementation Pattern

Build the event throughout the request, emit once at the end:

code
function loggingMiddleware(handler):
    return function(request):
        start_time = now()

        // Initialize wide event
        event = {
            request_id: request.id,
            trace_id: request.trace_id,
            timestamp: start_time,
            method: request.method,
            path: request.path,
            service: SERVICE_NAME,
            version: SERVICE_VERSION
        }

        // Make event available to handlers
        request.wide_event = event

        try:
            response = handler(request)
            event.status_code = response.status
            event.level = "info"
            return response

        catch error:
            event.status_code = 500
            event.level = "error"
            event.error = {
                type: error.name,
                message: error.message,
                code: error.code,
                retriable: error.retriable ?? false
            }
            throw error

        finally:
            event.duration_ms = now() - start_time
            logger.log(event)

In handlers, enrich with business context:

code
function checkoutHandler(request):
    event = request.wide_event

    // Add user context
    user = getUser(request)
    event.user = {
        id: user.id,
        subscription: user.tier,
        account_age_days: daysSince(user.created_at),
        lifetime_value_cents: user.ltv
    }

    // Add business context
    cart = getCart(user.id)
    event.cart = {
        id: cart.id,
        item_count: cart.items.length,
        total_cents: cart.total
    }

    // Track timing for external calls
    payment_start = now()
    result = processPayment(cart)
    event.payment = {
        provider: "stripe",
        latency_ms: now() - payment_start,
        attempt: result.attempt_number
    }

    return response(result)

Querying Power

Wide events enable queries that scattered logs cannot:

sql
-- Find all checkout failures for premium users with new checkout flow
SELECT * FROM logs
WHERE path = '/checkout'
  AND status_code >= 500
  AND user.subscription = 'premium'
  AND feature_flags.new_checkout_flow = true
  AND timestamp > now() - interval '1 hour'

-- Analyze payment failures by error code and user tier
SELECT
  error.code,
  user.subscription,
  count(*) as failures,
  avg(payment.latency_ms) as avg_latency
FROM logs
WHERE error.type = 'PaymentError'
GROUP BY error.code, user.subscription

-- Find slow requests for enterprise customers
SELECT * FROM logs
WHERE duration_ms > 2000
  AND user.subscription = 'enterprise'
ORDER BY duration_ms DESC