Java Observability Tracing (OpenTelemetry + Propagation + Sampling)
Scope
In scope
- •OpenTelemetry tracing setup (SDK or auto-instrumentation agent).
- •Context propagation across services (W3C Trace Context, baggage policy).
- •Span naming + attributes using semantic conventions (HTTP, DB, messaging).
- •Sampling strategies that balance cost vs visibility.
- •Correlation with logs (traceId/spanId) and metrics.
Out of scope
- •Vendor-specific tracing UI configuration.
- •Full service topology governance.
Core mental model
Tracing answers: “Where did time go?” and “Which dependency caused failure?”
A trace is a tree/graph of spans. Each span represents a unit of work and captures timing, attributes, and events.
Propagation (must-have)
Standard propagation: W3C Trace Context
- •Use
traceparentandtracestateheaders for cross-service propagation. - •Ensure outbound HTTP and messaging producers inject propagation headers.
- •Ensure inbound handlers extract propagation headers.
Baggage policy
- •Use baggage sparingly: it propagates everywhere and can increase cost.
- •Never put secrets/PII in baggage.
- •Define an allowlist of baggage keys if you enable it.
Instrumentation approach (recommended default)
Option A: Auto-instrumentation first
- •Use the OpenTelemetry Java agent for fast baseline coverage:
- •inbound HTTP server spans
- •outbound HTTP client spans
- •common libraries (JDBC, messaging)
Pros: fastest adoption, fewer code changes.
Cons: may need manual spans for domain-specific operations.
Option B: Manual instrumentation for key domains
- •Add explicit spans for:
- •business workflows
- •critical background jobs
- •complex async pipelines
- •Use semantic attributes and stable span names.
Span design rules
Span naming
- •Use low-cardinality names:
- •
HTTP GET /v1/orders/{id} - •
DB SELECT orders - •
MQ consume orders.created
- •
- •Avoid embedding IDs in span names.
Attributes (tagging)
- •Use semantic conventions where available.
- •Keep attributes bounded; avoid IDs in attributes unless necessary and approved.
Events and errors
- •Record exceptions as span events (but do not leak secrets).
- •Use
statusto represent error outcome.
Sampling strategy (cost vs value)
Default recommended: Parent-based + TraceIdRatio
- •Keep sampling consistent across services.
- •Start with a low rate in prod (e.g., 1% to 10%), increase during incidents.
- •Always sample errors if your system supports tail sampling (advanced).
Operational knobs
- •Support runtime configuration for sampling (env-based).
- •Document how to temporarily increase sampling for debugging.
Export and resource identity
- •Export via OTLP (HTTP/gRPC) to your collector/backend.
- •Set resource attributes:
- •
service.name,service.version,deployment.environment. These must be consistent across services to enable cross-service queries.
- •
Correlation with logs
- •Inject
traceIdandspanIdinto logs:- •via MDC or OpenTelemetry-aware logging appenders.
- •This enables “logs <-> traces” pivoting during incidents.
Key spans map (template)
A minimal “span map” for a typical backend request:
- •Inbound HTTP span:
- •name:
HTTP <method> <route>
- •name:
- •Authentication/authorization span (optional, internal):
- •name:
auth.verify
- •name:
- •DB span:
- •name:
DB <operation> <table>
- •name:
- •Outbound dependency call span:
- •name:
HTTP <method> <upstream>
- •name:
- •Business workflow span:
- •name:
orders.create(manual)
- •name:
- •Response finalize span events:
- •errors, retries, timeouts as span events
Testing and validation
- •Smoke test: confirm traces exported in staging.
- •Propagation test: multi-service request produces a single trace.
- •Attribute linting: reject forbidden high-cardinality attributes.
- •Performance check: ensure overhead acceptable at target sampling rate.
Outputs / artifacts
- •
docs/tracing.md(propagation, sampling, span naming) - •
observability/span-naming.md - •
observability/attribute-policy.md - •Code changes:
- •OTel agent config or SDK bootstrap
- •manual spans in critical workflows
- •log correlation integration
- •Tests:
- •propagation tests
- •policy tests for forbidden attributes
Definition of Done (DoD)
- • Inbound and outbound requests generate spans.
- • W3C propagation works across services.
- • Sampling configured and documented.
- • Resource identity set consistently.
- • Logs include traceId/spanId for correlation.
- • Attribute policy prevents cardinality/security issues.
Guardrails (What NOT to do)
- •Never propagate secrets/PII via baggage or attributes.
- •Avoid high-cardinality attributes (IDs, full URLs, raw payloads).
- •Do not create a span for every tiny function (noise and overhead).
- •Avoid inconsistent sampling across services.
Common failure modes & fixes
- •Symptom: broken traces across services -> Fix: ensure
traceparentextraction/injection on all hops. - •Symptom: tracing cost too high -> Fix: reduce sampling; tighten attribute policy; focus manual spans.
- •Symptom: cannot find root cause -> Fix: add manual spans for domain workflows; record key events.
References (see references/)
- •
references/w3c-trace-context.md - •
references/w3c-baggage-policy.md - •
references/otel-java-agent.md - •
references/otel-semconv-http.md - •
references/sampling-playbook.md