Observability Skill

Production monitoring, logging, and debugging infrastructure for CAIO projects. If you can't see it, you can't fix it.

Why This Matters

Without observability:

  • Users discover bugs before you do
  • Debugging takes hours instead of minutes
  • Post-mortems have no data to work from
  • Nobody can answer "what happened at 3am?"

Core Components

1. Structured Logging

typescript
// lib/logger.ts
import pino from "pino";

export const logger = pino({
  level: process.env.LOG_LEVEL || "info",
  formatters: {
    level: (label) => ({ level: label }),
  },
  base: {
    env: process.env.NODE_ENV,
    service: "your-app",
  },
});

// Usage
logger.info({ userId, action: "login" }, "User logged in");
logger.error({ err, requestId }, "Payment failed");

2. Request Tracing

typescript
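// middleware.ts
import { NextRequest, NextResponse } from "next/server";

export function middleware(request: NextRequest) {
  // Middleware runs on the Edge runtime by default, so use the global
  // Web Crypto API instead of importing Node's "crypto" module.
  const requestId = request.headers.get("x-request-id") || crypto.randomUUID();

  // Forward the ID to downstream route handlers as a request header
  const requestHeaders = new Headers(request.headers);
  requestHeaders.set("x-request-id", requestId);

  const response = NextResponse.next({ request: { headers: requestHeaders } });
  // Echo the ID back to the client so it can be quoted in error reports
  response.headers.set("x-request-id", requestId);

  return response;
}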
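
The middleware above accepts whatever x-request-id value the client sends. Because that value ends up in every log line, it may be worth validating it first. The following is a minimal sketch, assuming IDs should be 8 to 64 characters from a conservative character set; `resolveRequestId` and `SAFE_REQUEST_ID` are hypothetical names, not part of Next.js.

```typescript
// Hypothetical pattern: letters, digits, dot, underscore, and hyphen only.
// UUIDs (36 characters with hyphens) match it.
const SAFE_REQUEST_ID = /^[A-Za-z0-9._-]{8,64}$/;

// Accept a caller-supplied request ID only if it looks safe to log;
// otherwise fall back to a freshly generated one.
function resolveRequestId(
  incoming: string | null,
  generate: () => string,
): string {
  if (incoming !== null && SAFE_REQUEST_ID.test(incoming)) {
    return incoming;
  }
  return generate();
}
```

In the middleware this could be called as `resolveRequestId(request.headers.get("x-request-id"), () => crypto.randomUUID())`.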

3. Error Boundaries with Reporting

typescript
// components/ErrorBoundary.tsx
"use client";

import { Component, ErrorInfo, ReactNode } from "react";
import { logger } from "@/lib/logger";

interface Props {
  children: ReactNode;
  fallback: ReactNode;
}

interface State {
  hasError: boolean;
  error?: Error;
}

export class ErrorBoundary extends Component<Props, State> {
  state: State = { hasError: false };

  // Switch to the fallback UI on the next render
  static getDerivedStateFromError(error: Error): State {
    return { hasError: true, error };
  }

  // Report the error along with the component stack that produced it
  componentDidCatch(error: Error, errorInfo: ErrorInfo) {
    logger.error(
      {
        error: error.message,
        stack: error.stack,
        componentStack: errorInfo.componentStack,
      },
      "React error boundary caught error",
    );

    // Send to error tracking service
    // Sentry.captureException(error)
  }

  render() {
    if (this.state.hasError) {
      return this.props.fallback;
    }
    return this.props.children;
  }
}

4. API Route Instrumentation

typescript
// lib/api-wrapper.ts
import { NextRequest, NextResponse } from "next/server";
import { logger } from "@/lib/logger";

type Handler = (req: NextRequest) => Promise<NextResponse>;

export function withLogging(handler: Handler): Handler {
  return async (req: NextRequest) => {
    const start = Date.now();
    const requestId = req.headers.get("x-request-id") || "unknown";

    logger.info(
      {
        requestId,
        method: req.method,
        path: req.nextUrl.pathname,
      },
      "Request started",
    );

    try {
      const response = await handler(req);

      logger.info(
        {
          requestId,
          method: req.method,
          path: req.nextUrl.pathname,
          status: response.status,
          duration: Date.now() - start,
        },
        "Request completed",
      );

      return response;
    } catch (error) {
      logger.error(
        {
          requestId,
          method: req.method,
          path: req.nextUrl.pathname,
          error: error instanceof Error ? error.message : "Unknown error",
          duration: Date.now() - start,
        },
        "Request failed",
      );

      throw error;
    }
  };
}

5. Health Check Endpoint

typescript
// app/api/health/route.ts
import { NextResponse } from "next/server";
import { db } from "@/lib/db";

export async function GET() {
  const checks = {
    status: "healthy",
    timestamp: new Date().toISOString(),
    checks: {
      database: "unknown",
      memory: "unknown",
    },
  };

  // Database check
  try {
    await db.$queryRaw`SELECT 1`;
    checks.checks.database = "healthy";
  } catch {
    checks.checks.database = "unhealthy";
    checks.status = "degraded";
  }

  // Memory check
  const used = process.memoryUsage();
  const heapUsedMB = Math.round(used.heapUsed / 1024 / 1024);
  checks.checks.memory = heapUsedMB < 512 ? "healthy" : "warning";

  return NextResponse.json(checks, {
    status: checks.status === "healthy" ? 200 : 503,
  });
}
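
The handler above decides the overall status inline. As more checks are added, it can help to move that decision into a pure function that is easy to unit test. This is a sketch that mirrors the policy shown above (only an unhealthy check degrades the service; a memory warning does not); `CheckResult`, `HealthSummary`, and `summarizeHealth` are hypothetical names.

```typescript
type CheckResult = "healthy" | "warning" | "unhealthy" | "unknown";

interface HealthSummary {
  status: "healthy" | "degraded";
  httpStatus: 200 | 503;
}

// Only an "unhealthy" check degrades the service; "warning" and "unknown"
// are reported in the payload but do not fail the probe.
function summarizeHealth(checks: Record<string, CheckResult>): HealthSummary {
  const anyUnhealthy = Object.keys(checks).some(
    (name) => checks[name] === "unhealthy",
  );
  return anyUnhealthy
    ? { status: "degraded", httpStatus: 503 }
    : { status: "healthy", httpStatus: 200 };
}
```

Whether a warning should also return 503 depends on how your load balancer or uptime monitor reacts to failed probes, so treat the policy here as a starting point.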

Integration Points

Vercel (Recommended for CAIO)

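Vercel automatically captures anything written to stdout and stderr, including console.log, console.error, and pino's JSON output, so logs appear in the dashboard without extra setup.

The vercel.json below does not configure logging. It sets the memory and timeout limits for API route functions:

json
{
  "functions": {
    "app/api/**/*.ts": {
      "memory": 1024,
      "maxDuration": 30
    }
  }
}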

Sentry (Error Tracking)

bash
bun add @sentry/nextjs

typescript
// sentry.client.config.ts
import * as Sentry from "@sentry/nextjs";

Sentry.init({
  dsn: process.env.NEXT_PUBLIC_SENTRY_DSN,
  tracesSampleRate: 0.1,
  environment: process.env.NODE_ENV,
});

Axiom (Log Aggregation)

bash
bun add next-axiom

javascript
// next.config.js
// The withAxiom config wrapper is exported by the next-axiom package
const { withAxiom } = require("next-axiom");

module.exports = withAxiom({
  // your config
});

What to Log

Always Log

  • User authentication events (login, logout, failed attempts)
  • Authorization failures (403s)
  • Payment events (success, failure, amounts)
  • Data mutations (create, update, delete)
  • External API calls (request, response, latency)
  • Errors with full context

Never Log

  • Passwords or secrets
  • Full credit card numbers
  • Personal health information
  • Session tokens (log hash only)
  • Full request bodies with PII
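
One way to enforce the list above is to mask sensitive fields before an object reaches the logger. The following is a minimal sketch for plain JSON-like data; `SENSITIVE_KEYS` and `redactSensitive` are hypothetical names, and the key list is only an assumption about what your payloads contain. pino also has a built-in `redact` option that masks values by path, which may be a better fit if the field locations are known in advance.

```typescript
// Hypothetical key list (lowercase): extend it to match your own payloads.
const SENSITIVE_KEYS: readonly string[] = [
  "password",
  "secret",
  "token",
  "authorization",
  "cardnumber",
  "ssn",
];

// Returns a deep copy of `value` with sensitive fields masked.
// Intended for plain objects, arrays, and primitives only.
function redactSensitive(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => redactSensitive(item));
  }
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(source)) {
      // Key names are compared case-insensitively, so "cardNumber" is masked too.
      out[key] =
        SENSITIVE_KEYS.indexOf(key.toLowerCase()) !== -1
          ? "[REDACTED]"
          : redactSensitive(source[key]);
    }
    return out;
  }
  return value;
}
```

A call site would then look like `logger.info(redactSensitive({ userId, body }), "Request received")`.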

Log Levels

| Level | Use For |
| --- | --- |
| error | Exceptions, failed operations, things that need alerts |
| warn | Degraded service, approaching limits, recoverable issues |
| info | Business events, request lifecycle, audit trail |
| debug | Detailed debugging, only in development |
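
Levels act as a threshold: a logger configured at one level emits that level and everything more severe. The sketch below illustrates the comparison using pino's default numeric values for these four levels; `isLevelEnabled` and `defaultLevel` are hypothetical helpers written for illustration, not pino APIs.

```typescript
// pino's default numeric values; a higher number means higher severity.
const LEVEL_VALUES = { debug: 20, info: 30, warn: 40, error: 50 } as const;
type LogLevel = keyof typeof LEVEL_VALUES;

// A message is emitted only when its level is at or above the configured threshold.
function isLevelEnabled(configured: LogLevel, message: LogLevel): boolean {
  return LEVEL_VALUES[message] >= LEVEL_VALUES[configured];
}

// Hypothetical default: verbose in development, quieter everywhere else.
function defaultLevel(nodeEnv: string | undefined): LogLevel {
  return nodeEnv === "development" ? "debug" : "info";
}
```

With the threshold at info, debug messages are dropped while warn and error still get through, which matches the "only in development" guidance for debug.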

Alerts to Configure

| Alert | Threshold | Severity |
| --- | --- | --- |
| Error rate spike | >1% of requests | Critical |
| Response time p95 | >2s | Warning |
| Health check failing | 3 consecutive | Critical |
| Memory usage | >80% | Warning |
| Failed payments | Any | Critical |
| Failed logins (same IP) | >10/minute | Warning |
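
In practice these rules are usually configured in a monitoring service, but it helps to see what three of the thresholds mean as arithmetic. The sketch below evaluates the error rate, p95 latency, and health check rules over one window of request data; `WindowStats`, `percentile`, and `evaluateAlerts` are hypothetical names, and the nearest-rank percentile method is an assumption (monitoring tools may interpolate differently).

```typescript
interface WindowStats {
  totalRequests: number;
  failedRequests: number;
  latenciesMs: number[]; // one entry per request in the window
  consecutiveHealthFailures: number;
}

// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) return 0;
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Returns one message per threshold that is currently exceeded.
function evaluateAlerts(stats: WindowStats): string[] {
  const alerts: string[] = [];
  const errorRate =
    stats.totalRequests === 0 ? 0 : stats.failedRequests / stats.totalRequests;
  if (errorRate > 0.01) alerts.push("critical: error rate above 1%");
  if (percentile(stats.latenciesMs, 95) > 2000) {
    alerts.push("warning: p95 response time above 2s");
  }
  if (stats.consecutiveHealthFailures >= 3) {
    alerts.push("critical: health check failing");
  }
  return alerts;
}
```

Using p95 instead of the mean keeps a handful of very slow requests visible even when most traffic is fast.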

Debugging Production Issues

1. Start with Health Check

bash
curl -s https://your-app.vercel.app/api/health | jq

2. Check Recent Logs

  • Vercel Dashboard → Logs
  • Filter by request ID from error report
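
If logs have been exported from the dashboard, the same filter can be applied offline. This is a rough sketch, assuming newline-delimited JSON (pino's default output format) in which each entry carries a `requestId` field as in the examples above; `filterByRequestId` is a hypothetical helper.

```typescript
// Returns every log entry whose requestId matches, in original order.
function filterByRequestId(
  ndjson: string,
  requestId: string,
): Record<string, unknown>[] {
  const matches: Record<string, unknown>[] = [];
  for (const line of ndjson.split("\n")) {
    if (line.trim() === "") continue;
    try {
      const entry: unknown = JSON.parse(line);
      if (
        entry !== null &&
        typeof entry === "object" &&
        (entry as Record<string, unknown>).requestId === requestId
      ) {
        matches.push(entry as Record<string, unknown>);
      }
    } catch {
      // Skip lines that are not valid JSON, such as stray console output.
    }
  }
  return matches;
}
```

Reading the matching entries in order gives the full lifecycle of a single request, from "Request started" to the failure.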

3. Reproduce Locally

bash
# Replay the failed request with the same payload.
# To reproduce locally, point the URL at your local dev server instead.
curl -v -X POST https://your-app.vercel.app/api/problematic-endpoint \
  -H "Content-Type: application/json" \
  -d '{"same": "payload"}'

4. Check External Dependencies

  • Database: Connection pool exhaustion?
  • External APIs: Rate limited? Down?
  • Payment provider: Webhook failures?
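
For HTTP dependencies, the response status is often enough for a first guess at which of these questions applies. The sketch below maps a status code to a coarse label that can be attached to the log entry for an external call; `DependencyIssue` and `classifyStatus` are hypothetical names, and the categories are an assumption rather than any standard taxonomy.

```typescript
type DependencyIssue = "ok" | "rate_limited" | "client_error" | "dependency_down";

// Coarse first-pass triage based only on the HTTP status code.
function classifyStatus(status: number): DependencyIssue {
  if (status === 429) return "rate_limited"; // 429 Too Many Requests
  if (status >= 500) return "dependency_down"; // the provider itself is failing
  if (status >= 400) return "client_error"; // our request was rejected
  return "ok";
}
```

Logging this label alongside latency makes it possible to alert on rate limiting separately from outages. Connection pool exhaustion will not show up here, since it typically surfaces as a timeout before any response arrives.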

Checklist Before Production

  • Health check endpoint exists and returns meaningful data
  • Structured logging configured (not just console.log)
  • Error boundary wraps all pages
  • Sentry or equivalent configured
  • All payment events logged
  • All auth events logged
  • Log rotation/retention configured
  • Alert thresholds defined
  • Runbook for common issues exists