Java Observability Metrics (Design + Cardinality + SLO Readiness)

Scope

In scope

Out of scope

•Counter: monotonic count (requests_total, errors_total).
•Gauge: instantaneous value (queue_depth, memory_used).
•Timer: duration measurements (request_latency).
•Histogram/Distribution: bucketed latency or size distributions from which percentiles can be estimated.
•Summary (if your backend supports it): use carefully; quantiles computed per instance are often hard to aggregate across instances.
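The difference between the first three types can be sketched with plain JDK classes. This is a minimal illustration, not any metrics library's API: `MetricTypesSketch` and its nested `Counter`, `Gauge`, and `Timer` classes are hypothetical stand-ins for whatever your instrumentation library provides.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-ins for a metrics library's counter, gauge, and timer.
final class MetricTypesSketch {
    // Counter: only ever increases; a backend derives rates from successive values.
    static final class Counter {
        private final LongAdder count = new LongAdder();
        void increment() { count.increment(); }
        long value() { return count.sum(); }
    }

    // Gauge: holds the latest observed value and may go up or down.
    static final class Gauge {
        private final AtomicLong current = new AtomicLong();
        void set(long v) { current.set(v); }
        long value() { return current.get(); }
    }

    // Timer: tracks how many durations were recorded and their total, in nanoseconds.
    static final class Timer {
        private final LongAdder count = new LongAdder();
        private final LongAdder totalNanos = new LongAdder();
        void record(long nanos) { count.increment(); totalNanos.add(nanos); }
        long count() { return count.sum(); }
        double meanSeconds() {
            long n = count.sum();
            return n == 0 ? 0.0 : totalNanos.sum() / 1e9 / n;
        }
    }
}
```

A mean alone hides tail latency, which is why the histogram type is usually preferred for SLO work.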

•Traffic: http_server_requests_total
•Errors: http_server_errors_total or http_server_requests_total{outcome="error"}
•Latency: http_server_request_duration_seconds (histogram)
•Saturation: thread pool utilization, queue size, DB pool usage
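As a rough sketch of the traffic and error signals, the class below keeps one counter per `outcome` value and derives an error ratio from them. The class name `RequestOutcomeCounter` is hypothetical, and treating only 5xx responses as errors is an assumption; your service may classify outcomes differently.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical in-process model of http_server_requests_total{outcome=...}.
final class RequestOutcomeCounter {
    private final Map<String, LongAdder> byOutcome = new ConcurrentHashMap<>();

    // Assumption: only 5xx responses count as errors.
    void record(int httpStatus) {
        String outcome = httpStatus >= 500 ? "error" : "success";
        byOutcome.computeIfAbsent(outcome, k -> new LongAdder()).increment();
    }

    long count(String outcome) {
        LongAdder adder = byOutcome.get(outcome);
        return adder == null ? 0 : adder.sum();
    }

    // Errors divided by all requests; 0.0 when nothing has been recorded.
    double errorRatio() {
        long errors = count("error");
        long total = errors + count("success");
        return total == 0 ? 0.0 : (double) errors / total;
    }
}
```

Keeping `outcome` to two values means the label adds at most two series per metric, which keeps cardinality bounded.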

•Use snake_case.
•Include the base unit in the name where one applies, and a _total suffix on counters:
  - _seconds, _bytes, _total.
•Prefer a consistent prefix per domain:
  - http_server_*, db_*, cache_*, mq_*.
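These conventions are easy to check mechanically, for example in a unit test that walks the metric catalog. The following is a minimal sketch: `MetricNameCheck` is a hypothetical helper, and the prefix list simply mirrors the examples above.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper that checks a metric name against the conventions above.
final class MetricNameCheck {
    // Lowercase words separated by single underscores, starting with a letter.
    private static final Pattern SNAKE_CASE =
        Pattern.compile("[a-z][a-z0-9]*(_[a-z0-9]+)*");

    // Assumed set of approved domain prefixes; extend to match your catalog.
    private static final List<String> PREFIXES =
        List.of("http_server_", "db_", "cache_", "mq_");

    static boolean isValid(String name) {
        if (!SNAKE_CASE.matcher(name).matches()) {
            return false;
        }
        return PREFIXES.stream().anyMatch(name::startsWith);
    }
}
```

For example, `MetricNameCheck.isValid("http_server_requests_total")` passes, while `httpServerRequests` fails the snake_case rule.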

•Prefer histograms for latency:
  - enable bucketed distributions suitable for SLOs.
•Choose buckets that match SLO thresholds:
  - e.g., 50ms, 100ms, 200ms, 500ms, 1s, 2s, 5s.
•Avoid per-endpoint custom histograms unless truly needed.
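To show why bucket boundaries should line up with SLO thresholds, here is a minimal fixed-bucket histogram built on the JDK only. `LatencyHistogram` is a hypothetical class, not a library type; its bounds are the example thresholds above, expressed in seconds.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical fixed-bucket latency histogram.
final class LatencyHistogram {
    // Upper bounds in seconds: 50ms, 100ms, 200ms, 500ms, 1s, 2s, 5s.
    private final double[] upperBounds = {0.05, 0.1, 0.2, 0.5, 1.0, 2.0, 5.0};
    // One slot per bound plus a final overflow slot for anything slower.
    private final AtomicLongArray counts = new AtomicLongArray(upperBounds.length + 1);

    // Places the observation in the first bucket whose upper bound is >= seconds.
    void record(double seconds) {
        int i = 0;
        while (i < upperBounds.length && seconds > upperBounds[i]) {
            i++;
        }
        counts.incrementAndGet(i);
    }

    // Fraction of observations at or below the threshold. The threshold should
    // coincide with a bucket bound; otherwise the result undercounts.
    double fractionAtOrBelow(double thresholdSeconds) {
        long within = 0;
        long total = 0;
        for (int i = 0; i < counts.length(); i++) {
            long c = counts.get(i);
            total += c;
            if (i < upperBounds.length && upperBounds[i] <= thresholdSeconds) {
                within += c;
            }
        }
        return total == 0 ? 1.0 : (double) within / total;
    }
}
```

Because 200ms is a bucket bound, an SLO such as "requests complete within 200ms" can be evaluated exactly from the bucket counts. A threshold that falls between bounds can only be approximated.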

•Define the questions:
  - "What is p95 latency for /orders in prod?"
  - "What is error rate for dependency X?"
•Define the minimal metric set to answer them.
•Define label set and enforce boundedness.
•Implement instrumentation:
  - inbound HTTP
  - outbound HTTP clients
  - DB calls
  - cache
  - messaging consumers/producers
•Validate on a staging environment:
  - confirm label cardinality is bounded
  - confirm naming conventions
•Add dashboards and alerts:
  - SLO burn rate alerts (if used)
  - saturation alerts
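For the burn rate alerts mentioned in the last step, the underlying arithmetic is small: burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). The sketch below assumes a two-window alert condition, which is a common pattern but not something this workflow prescribes. `BurnRate` and its method names are hypothetical.

```java
// Hypothetical helper showing the arithmetic behind an SLO burn rate alert.
final class BurnRate {
    // Burn rate = observed error ratio / error budget, where budget = 1 - sloTarget.
    // A value of 1.0 means the budget is being spent exactly at the allowed pace.
    static double burnRate(long errors, long total, double sloTarget) {
        if (sloTarget <= 0.0 || sloTarget >= 1.0) {
            throw new IllegalArgumentException("sloTarget must be between 0 and 1");
        }
        if (total == 0) {
            return 0.0;
        }
        double errorBudget = 1.0 - sloTarget;
        return ((double) errors / total) / errorBudget;
    }

    // Assumed alert rule: fire only when both a short and a long window exceed
    // the threshold, so brief spikes alone do not page anyone.
    static boolean shouldAlert(double shortWindowBurn, double longWindowBurn, double threshold) {
        return shortWindowBurn >= threshold && longWindowBurn >= threshold;
    }
}
```

With a 99% target, 10 errors in 1,000 requests gives a burn rate of about 1.0, and 100 errors in 1,000 gives about 10.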

•docs/metrics.md (metric catalog + labels + intent)
•metrics/metric-catalog.yml (machine-readable inventory)
•metrics/cardinality-budget.md
•Code changes:
  - instrumentation wrappers
  - timers/histograms config
  - tests for bounded labels
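One possible shape for an instrumentation wrapper that enforces bounded labels is a per-key budget that collapses overflow values into a single bucket. This is a sketch under assumptions: `LabelBudget`, the `"other"` fallback value, and the per-key limit are all hypothetical choices rather than part of any library.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical guard that caps the number of distinct values per label key.
final class LabelBudget {
    private final int maxValuesPerKey;
    private final Map<String, Set<String>> seen = new ConcurrentHashMap<>();

    LabelBudget(int maxValuesPerKey) {
        this.maxValuesPerKey = maxValuesPerKey;
    }

    // Returns the value to emit: the original while the key is within budget,
    // otherwise the fallback "other" so no new time series is created.
    String admit(String key, String value) {
        Set<String> values = seen.computeIfAbsent(key, k -> ConcurrentHashMap.newKeySet());
        synchronized (values) {
            if (values.contains(value)) {
                return value;
            }
            if (values.size() < maxValuesPerKey) {
                values.add(value);
                return value;
            }
        }
        return "other";
    }

    int distinctValues(String key) {
        Set<String> values = seen.get(key);
        return values == null ? 0 : values.size();
    }
}
```

A test for bounded labels can then feed many distinct values through `admit` and assert that `distinctValues` never exceeds the budget.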

•Symptom: Prometheus/metrics backend memory spikes -> Fix: remove high-cardinality labels, normalize routes.
•Symptom: metrics not aggregatable -> Fix: avoid instance-specific dimensions; use consistent naming.
•Symptom: percentiles inconsistent -> Fix: use histograms with defined buckets; avoid per-instance summaries.
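Route normalization, the fix for the first symptom, can be sketched as follows. It assumes that path segments made up entirely of digits or shaped like UUIDs are identifiers; `RouteNormalizer` and the `{id}` placeholder are hypothetical. Where the web framework exposes the matched route template, using that template directly is usually the more reliable choice.

```java
import java.util.regex.Pattern;

// Hypothetical normalizer that turns raw request paths into bounded route labels.
final class RouteNormalizer {
    // A path segment that looks like a UUID, followed by "/" or end of path.
    private static final Pattern UUID_SEGMENT = Pattern.compile(
        "/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}(?=/|$)");
    // A path segment made up only of digits, followed by "/" or end of path.
    private static final Pattern NUMERIC_SEGMENT = Pattern.compile("/\\d+(?=/|$)");

    static String normalize(String path) {
        // Drop the query string so it never reaches a label.
        int q = path.indexOf('?');
        String p = q >= 0 ? path.substring(0, q) : path;
        // Replace UUIDs first so their leading digits are not partially matched.
        p = UUID_SEGMENT.matcher(p).replaceAll("/{id}");
        return NUMERIC_SEGMENT.matcher(p).replaceAll("/{id}");
    }
}
```

With this in place, `/orders/123` and `/orders/456` both map to the single label value `/orders/{id}`.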