Production Observability with OpenTelemetry
Purpose
Implement production-grade observability using OpenTelemetry as the 2025 industry standard. Covers the three pillars (metrics, logs, traces), LGTM stack deployment, and critical log-trace correlation patterns.
When to Use
Use when:
- •Building production systems requiring visibility into performance and errors
- •Debugging distributed systems with multiple services
- •Setting up monitoring, logging, or tracing infrastructure
- •Implementing structured logging with trace correlation
- •Configuring alerting rules for production systems
Skip if:
- •Building proof-of-concept without production deployment
- •System has < 100 requests/day (console logging may suffice)
The OpenTelemetry Standard (2025)
OpenTelemetry is the CNCF graduated project unifying observability:
┌────────────────────────────────────────────────────────┐ │ OpenTelemetry: The Unified Standard │ ├────────────────────────────────────────────────────────┤ │ │ │ ONE SDK for ALL signals: │ │ ├── Metrics (Prometheus-compatible) │ │ ├── Logs (structured, correlated) │ │ ├── Traces (distributed, standardized) │ │ └── Context (propagates across services) │ │ │ │ Language SDKs: │ │ ├── Python: opentelemetry-api, opentelemetry-sdk │ │ ├── Rust: opentelemetry, tracing-opentelemetry │ │ ├── Go: go.opentelemetry.io/otel │ │ └── TypeScript: @opentelemetry/api │ │ │ │ Export to ANY backend: │ │ ├── LGTM Stack (Loki, Grafana, Tempo, Mimir) │ │ ├── Prometheus + Jaeger │ │ ├── Datadog, New Relic, Honeycomb (SaaS) │ │ └── Custom backends via OTLP protocol │ │ │ └────────────────────────────────────────────────────────┘
Context7 Reference: /websites/opentelemetry_io (Trust: High, Snippets: 5,888, Score: 85.9)
The Three Pillars of Observability
1. Metrics (What is happening?)
Track system health and performance over time.
Metric Types: Counters (always increase), Gauges (up/down), Histograms (distributions), Summaries (percentiles).
Brief Example (Python):
from opentelemetry import metrics
meter = metrics.get_meter(__name__)
http_requests = meter.create_counter("http.server.requests")
http_requests.add(1, {"method": "GET", "status": 200})
2. Logs (What happened?)
Record discrete events with context.
CRITICAL: Always inject trace_id/span_id for log-trace correlation.
Brief Example (Python + structlog):
import structlog
from opentelemetry import trace
logger = structlog.get_logger()
span = trace.get_current_span()
ctx = span.get_span_context()
logger.info(
"processing_request",
trace_id=format(ctx.trace_id, '032x'),
span_id=format(ctx.span_id, '016x'),
user_id=user_id
)
See: references/structured-logging.md for complete configuration.
3. Traces (Where did time go?)
Track request flow across distributed services.
Key Concepts: Trace (end-to-end journey), Span (individual operation), Parent-Child (nested operations).
Brief Example (Python + FastAPI):
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor app = FastAPI() FastAPIInstrumentor.instrument_app(app) # Auto-traces all HTTP requests
See: references/opentelemetry-setup.md for SDK installation by language.
The LGTM Stack (Self-Hosted Observability)
LGTM = Loki (Logs) + Grafana (Visualization) + Tempo (Traces) + Mimir (Metrics)
┌────────────────────────────────────────────────────────┐ │ LGTM Architecture │ ├────────────────────────────────────────────────────────┤ │ │ │ ┌──────────────────────────────────────────────┐ │ │ │ Grafana Dashboard (Port 3000) │ │ │ │ Unified UI for Logs, Metrics, Traces │ │ │ └──────┬──────────────┬─────────────┬─────────┘ │ │ │ │ │ │ │ ▼ ▼ ▼ │ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │ │ Loki │ │ Tempo │ │ Mimir │ │ │ │ (Logs) │ │ (Traces) │ │(Metrics) │ │ │ │Port 3100 │ │Port 3200 │ │Port 9009 │ │ │ └────▲─────┘ └────▲─────┘ └────▲─────┘ │ │ │ │ │ │ │ └──────────────┴─────────────┘ │ │ │ │ │ ┌───────▼────────┐ │ │ │ Grafana Alloy │ │ │ │ (Collector) │ │ │ │ Port 4317/8 │ ← OTLP gRPC/HTTP │ │ └───────▲────────┘ │ │ │ │ │ OpenTelemetry Instrumented Apps │ │ │ └────────────────────────────────────────────────────────┘
Quick Start: Run examples/lgtm-docker-compose/docker-compose.yml for a complete LGTM stack.
See: references/lgtm-stack.md for production deployment guide.
Critical Pattern: Log-Trace Correlation
The Problem: Logs and traces live in separate systems. You see an error log but can't find the related trace.
The Solution: Inject trace_id and span_id into every log record.
Python (structlog)
import structlog
from opentelemetry import trace
logger = structlog.get_logger()
span = trace.get_current_span()
ctx = span.get_span_context()
logger.info(
"request_processed",
trace_id=format(ctx.trace_id, '032x'), # 32-char hex
span_id=format(ctx.span_id, '016x'), # 16-char hex
user_id=user_id
)
Rust (tracing)
use tracing::{info, instrument};
#[instrument(fields(user_id = %user_id))]
async fn process_request(user_id: u64) -> Result<Response> {
// trace_id/span_id automatically included
info!(user_id = user_id, "processing request");
Ok(result)
}
See: references/trace-context.md for Go and TypeScript patterns.
Query in Grafana
{job="api-service"} |= "trace_id=4bf92f3577b34da6a3ce929d0e0e4736"
Quick Setup Guide
1. Choose Your Stack
Decision Tree:
- •Greenfield: OpenTelemetry SDK + LGTM Stack (self-hosted) or Grafana Cloud (managed)
- •Existing Prometheus: Add Loki (logs) + Tempo (traces)
- •Kubernetes: LGTM via Helm, Alloy DaemonSet
- •Zero-ops: Managed SaaS (Grafana Cloud, Datadog, New Relic)
2. Install OpenTelemetry SDK
Bootstrap Script:
python scripts/setup_otel.py --language python --framework fastapi
Manual (Python):
pip install opentelemetry-api opentelemetry-sdk \
opentelemetry-instrumentation-fastapi \
opentelemetry-exporter-otlp
See: references/opentelemetry-setup.md for Rust, Go, TypeScript installation.
3. Deploy LGTM Stack
Docker Compose (development):
cd examples/lgtm-docker-compose docker-compose up -d # Grafana: http://localhost:3000 (admin/admin) # OTLP: localhost:4317 (gRPC), localhost:4318 (HTTP)
See: references/lgtm-stack.md for production Kubernetes deployment.
4. Configure Structured Logging
See: references/structured-logging.md for complete setup (Python, Rust, Go, TypeScript).
5. Set Up Alerting
See: references/alerting-rules.md for Prometheus and Loki alert patterns.
Auto-Instrumentation
OpenTelemetry auto-instruments popular frameworks:
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor app = FastAPI() FastAPIInstrumentor.instrument_app(app) # Auto-trace all HTTP requests
Supported: FastAPI, Flask, Django, Express, Gin, Echo, Nest.js
See: references/opentelemetry-setup.md for framework-specific setup.
Common Patterns
Custom Spans
from opentelemetry import trace
tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("fetch_user_details") as span:
span.set_attribute("user_id", user_id)
user = await db.fetch_user(user_id)
span.set_attribute("user_found", user is not None)
Error Tracking
from opentelemetry.trace import Status, StatusCode
with tracer.start_as_current_span("process_payment") as span:
try:
result = process_payment(amount, card_token)
span.set_status(Status(StatusCode.OK))
except PaymentError as e:
span.set_status(Status(StatusCode.ERROR, str(e)))
span.record_exception(e)
raise
See: references/trace-context.md for background job tracing and context propagation.
Validation and Testing
# Test log-trace correlation
# 1. Make request to your app
# 2. Copy trace_id from logs
# 3. Query in Grafana: {job="myapp"} |= "trace_id=<TRACE_ID>"
# Validate metrics
python scripts/validate_metrics.py
Integration with Other Skills
- •Dashboards: Embed Grafana panels, query Prometheus metrics
- •Feedback: Alert routing (Slack, PagerDuty), notification UI
- •Data-Viz: Time-series charts, trace waterfall, latency heatmaps
See: examples/fastapi-otel/ for complete integration.
Progressive Disclosure
Setup Guides:
- •
references/opentelemetry-setup.md- SDK installation (Python, Rust, Go, TypeScript) - •
references/structured-logging.md- structlog, tracing, slog, pino configuration - •
references/lgtm-stack.md- LGTM deployment (Docker, Kubernetes) - •
references/trace-context.md- Log-trace correlation patterns - •
references/alerting-rules.md- Prometheus and Loki alert templates
Examples:
- •
examples/fastapi-otel/- FastAPI + OpenTelemetry + LGTM - •
examples/axum-tracing/- Rust Axum + tracing + LGTM - •
examples/lgtm-docker-compose/- Production-ready LGTM stack
Scripts:
- •
scripts/setup_otel.py- Bootstrap OpenTelemetry SDK - •
scripts/generate_dashboards.py- Generate Grafana dashboards - •
scripts/validate_metrics.py- Validate metric naming
Key Principles
- •OpenTelemetry is THE standard - Use OTel SDK, not vendor-specific SDKs
- •Auto-instrumentation first - Prefer auto over manual spans
- •Always correlate logs and traces - Inject trace_id/span_id into every log
- •Use structured logging - JSON format, consistent field names
- •LGTM stack for self-hosting - Production-ready open-source stack
Common Pitfalls
Don't:
- •Use vendor-specific SDKs (use OpenTelemetry)
- •Log without trace_id/span_id context
- •Manually instrument what auto-instrumentation covers
- •Mix logging libraries (pick one: structlog, tracing, slog, pino)
Do:
- •Start with auto-instrumentation
- •Add manual spans only for business-critical operations
- •Use semantic conventions for span attributes
- •Export to OTLP (gRPC preferred over HTTP)
- •Test locally with LGTM docker-compose before production
Success Metrics
- •100% of logs include trace_id when in request context
- •Mean time to resolution (MTTR) decreases by >50%
- •Developers use Grafana as first debugging tool
- •80%+ of telemetry from auto-instrumentation
- •Alert noise < 5% false positives