Implementing Observability
Goal
Make the system's internal state inferable from its external outputs. Answer "Why is it slow?" and "Why did it fail?" without SSH-ing into a server.
When to Use
- •Before launching to production.
- •When debugging a performance bottleneck.
- •When integrating a new microservice or external API.
Instructions
1. Structured Logging
Text logs are hard to query. Use JSON.
- •Context: Every log must have
trace_id,request_id,user_id. - •Levels:
INFOfor normal ops,WARNfor handled issues,ERRORfor unhandled crashes.
json
{"level": "info", "msg": "User logged in", "user_id": 123, "trace_id": "abc-123"}
2. Distributed Tracing (OpenTelemetry)
Trace a request across boundaries (Frontend -> API -> DB).
- •Instrument HTTP clients and server frameworks.
- •Visualize the "waterfall" to find the slow span.
3. Golden Signals (Metrics)
Track the four key metrics for every service:
- •Latency: Time to serve a request.
- •Traffic: Request rate (RPS).
- •Errors: Rate of 5xx responses.
- •Saturation: CPU/Memory/Disk usage.
4. Alerting
Alert on symptoms (High Error Rate), not causes (High CPU).
- •Page: If
Error Rate > 1%for 5 minutes. - •Ticket: If
Disk Usage > 80%.
Constraints
✅ Do
- •DO: Use OpenTelemetry standards for portability.
- •DO: Correlate logs and traces (inject trace ID into logs).
- •DO: Sample high-volume traces (10%) to save costs, but keep 100% of errors.
❌ Don't
- •DON'T: Log PII (Emails, Passwords, Credit Cards).
- •DON'T: Create alerts that auto-resolve in seconds (flapping).
- •DON'T: Rely solely on "system up" checks; check "business logic working".
Output Format
- •
docker-compose.ymlwith Prometheus/Grafana/Jaeger (for dev). - •Code instrumentation (e.g.,
tracing.py).
Dependencies
- •
backend/managing-flask-middleware/SKILL.md(where instrumentation lives) - •
shared/debugging/SKILL.md