Observability Skill
Production monitoring, logging, and debugging infrastructure for CAIO projects. If you can't see it, you can't fix it.
Why This Matters
Without observability:
- •Bugs discovered by users, not by you
- •Debugging takes hours instead of minutes
- •No data for post-mortems
- •Can't answer "what happened at 3am?"
Core Components
1. Structured Logging
typescript
// lib/logger.ts
import pino from "pino";
export const logger = pino({
level: process.env.LOG_LEVEL || "info",
formatters: {
level: (label) => ({ level: label }),
},
base: {
env: process.env.NODE_ENV,
service: "your-app",
},
});
// Usage
logger.info({ userId, action: "login" }, "User logged in");
logger.error({ err, requestId }, "Payment failed");
2. Request Tracing
typescript
// middleware.ts
import { NextRequest, NextResponse } from "next/server";
import { randomUUID } from "crypto";
export function middleware(request: NextRequest) {
const requestId = request.headers.get("x-request-id") || randomUUID();
const response = NextResponse.next();
response.headers.set("x-request-id", requestId);
return response;
}
3. Error Boundaries with Reporting
typescript
// components/ErrorBoundary.tsx
"use client";
import { Component, ReactNode } from "react";
import { logger } from "@/lib/logger";
interface Props {
children: ReactNode;
fallback: ReactNode;
}
interface State {
hasError: boolean;
error?: Error;
}
export class ErrorBoundary extends Component<Props, State> {
state: State = { hasError: false };
static getDerivedStateFromError(error: Error): State {
return { hasError: true, error };
}
componentDidCatch(error: Error, errorInfo: React.ErrorInfo) {
logger.error(
{
error: error.message,
stack: error.stack,
componentStack: errorInfo.componentStack,
},
"React error boundary caught error",
);
// Send to error tracking service
// Sentry.captureException(error)
}
render() {
if (this.state.hasError) {
return this.props.fallback;
}
return this.props.children;
}
}
4. API Route Instrumentation
typescript
// lib/api-wrapper.ts
import { NextRequest, NextResponse } from "next/server";
import { logger } from "@/lib/logger";
type Handler = (req: NextRequest) => Promise<NextResponse>;
export function withLogging(handler: Handler): Handler {
return async (req: NextRequest) => {
const start = Date.now();
const requestId = req.headers.get("x-request-id") || "unknown";
logger.info(
{
requestId,
method: req.method,
path: req.nextUrl.pathname,
},
"Request started",
);
try {
const response = await handler(req);
logger.info(
{
requestId,
method: req.method,
path: req.nextUrl.pathname,
status: response.status,
duration: Date.now() - start,
},
"Request completed",
);
return response;
} catch (error) {
logger.error(
{
requestId,
method: req.method,
path: req.nextUrl.pathname,
error: error instanceof Error ? error.message : "Unknown error",
duration: Date.now() - start,
},
"Request failed",
);
throw error;
}
};
}
5. Health Check Endpoint
typescript
// app/api/health/route.ts
import { NextResponse } from "next/server";
import { db } from "@/lib/db";
export async function GET() {
const checks = {
status: "healthy",
timestamp: new Date().toISOString(),
checks: {
database: "unknown",
memory: "unknown",
},
};
// Database check
try {
await db.$queryRaw`SELECT 1`;
checks.checks.database = "healthy";
} catch {
checks.checks.database = "unhealthy";
checks.status = "degraded";
}
// Memory check
const used = process.memoryUsage();
const heapUsedMB = Math.round(used.heapUsed / 1024 / 1024);
checks.checks.memory = heapUsedMB < 512 ? "healthy" : "warning";
return NextResponse.json(checks, {
status: checks.status === "healthy" ? 200 : 503,
});
}
Integration Points
Vercel (Recommended for CAIO)
typescript
// Use Vercel's built-in logging
// Logs automatically captured from console.log, console.error
// For structured logging, use:
// vercel.json
{
"functions": {
"app/api/**/*.ts": {
"memory": 1024,
"maxDuration": 30
}
}
}
Sentry (Error Tracking)
bash
bun add @sentry/nextjs
typescript
// sentry.client.config.ts
import * as Sentry from "@sentry/nextjs";
Sentry.init({
dsn: process.env.NEXT_PUBLIC_SENTRY_DSN,
tracesSampleRate: 0.1,
environment: process.env.NODE_ENV,
});
Axiom (Log Aggregation)
bash
bun add @axiomhq/nextjs
typescript
// next.config.js
const { withAxiom } = require("@axiomhq/nextjs");
module.exports = withAxiom({
// your config
});
What to Log
Always Log
- •User authentication events (login, logout, failed attempts)
- •Authorization failures (403s)
- •Payment events (success, failure, amounts)
- •Data mutations (create, update, delete)
- •External API calls (request, response, latency)
- •Errors with full context
Never Log
- •Passwords or secrets
- •Full credit card numbers
- •Personal health information
- •Session tokens (log hash only)
- •Full request bodies with PII
Log Levels
| Level | Use For |
|---|---|
error | Exceptions, failed operations, things that need alerts |
warn | Degraded service, approaching limits, recoverable issues |
info | Business events, request lifecycle, audit trail |
debug | Detailed debugging, only in development |
Alerts to Configure
| Alert | Threshold | Severity |
|---|---|---|
| Error rate spike | >1% of requests | Critical |
| Response time p95 | >2s | Warning |
| Health check failing | 3 consecutive | Critical |
| Memory usage | >80% | Warning |
| Failed payments | Any | Critical |
| Failed logins (same IP) | >10/minute | Warning |
Debugging Production Issues
1. Start with Health Check
bash
curl https://your-app.vercel.app/api/health | jq
2. Check Recent Logs
- •Vercel Dashboard → Logs
- •Filter by request ID from error report
3. Reproduce Locally
bash
# Get the request that failed
curl -v -X POST https://your-app.vercel.app/api/problematic-endpoint \
-H "Content-Type: application/json" \
-d '{"same": "payload"}'
4. Check External Dependencies
- •Database: Connection pool exhaustion?
- •External APIs: Rate limited? Down?
- •Payment provider: Webhook failures?
Checklist Before Production
- • Health check endpoint exists and returns meaningful data
- • Structured logging configured (not just console.log)
- • Error boundary wraps all pages
- • Sentry or equivalent configured
- • All payment events logged
- • All auth events logged
- • Log rotation/retention configured
- • Alert thresholds defined
- • Runbook for common issues exists