Observability Triage (Log → Trace → Metrics)

Overview

This skill is for debugging with existing telemetry. It does not focus on adding instrumentation (use apply-observability-patterns when telemetry gaps block triage).

Goal: turn “something is broken/slow” into:

•a concrete symptom + impact statement,
•an evidence-backed hypothesis (or a small set of competing ones),
•a mitigation (rollback/flag/scale) when needed,
•a short list of fix + follow-up tasks.

Workflow

0) Establish ground truth (2–5 minutes)

Capture:

•Environment (local/dev/staging/prod) and time window (start/end).
•Symptom (what’s failing/slow) and impact (SLO/user-visible blast radius).
•One exemplar: request/trace ID, job run ID, message ID, or timestamped log line.

1) Logs (find the exemplar and its correlation IDs)

•Find the first error/timeout log line closest to the symptom window.
•Identify correlation keys (prefer stable IDs):
  •traceId, requestId, spanId
  •op (route template / RPC method / job name / message type)
  •error code/type (typed error envelope, gRPC status, HTTP status)
•Pull the full log story for the exemplar (start → downstream call(s) → failure).

Copy/paste helpers live in references/commands.md.
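
As a rough illustration of "pull the full log story for the exemplar", the sketch below filters JSON log lines by traceId and orders them by timestamp. It assumes one JSON object per line and that every line uses the same ISO 8601 UTC timestamp format, so string order matches time order. The field names `ts`, `level`, and `msg` and the helper `log_story` are hypothetical; only traceId and op come from this document.

```python
import json


def log_story(lines, trace_id):
    """Return parsed log records for one trace, oldest first.

    Assumes each record has an ISO 8601 'ts' field in a single, consistent
    UTC format (hypothetical field name).
    """
    records = []
    for line in lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not structured logs
        if isinstance(rec, dict) and rec.get("traceId") == trace_id:
            records.append(rec)
    records.sort(key=lambda r: r.get("ts", ""))
    return records


# Made-up sample lines, deliberately out of order.
sample = [
    '{"ts": "2024-05-01T10:00:02+00:00", "level": "error", "traceId": "abc", "op": "GET /orders/{id}", "msg": "upstream timeout"}',
    '{"ts": "2024-05-01T10:00:00+00:00", "level": "info", "traceId": "abc", "op": "GET /orders/{id}", "msg": "request start"}',
    "not json",
    '{"ts": "2024-05-01T10:00:01+00:00", "level": "info", "traceId": "zzz", "msg": "other request"}',
]

story = log_story(sample, "abc")
for rec in story:
    print(rec["ts"], rec["level"], rec["msg"])
```

In practice the same filter usually runs in the log backend's query language; the point is only that the story is keyed on one stable ID and read in time order.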

2) Trace (turn the exemplar into a dependency hypothesis)

If you have a traceId, use it.

•Open the trace and confirm the root span matches the suspected operation (op).
•Identify:
  •the slowest span(s),
  •the first error span(s),
  •retries (multiple similar child spans),
  •deadline/time budget signals (deadline exceeded, timeout errors).
•Convert that to a dependency statement:
  •“service A is timing out calling service B method X”
  •“DB query Y is slow / missing index / deadlocked”
  •“Queue consumer is failing on message type T (poison message)”
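
The trace checks above can be sketched over an exported span list. This is a minimal sketch, not any tracing backend's API: the span shape (`span_id`, `parent_id`, `name`, `start_ms`, `end_ms`, `error`) and the service names are assumptions. "Slowest" here means the slowest leaf span, since a parent span always covers its children, and "first error" means the failing span that finished earliest.

```python
from collections import Counter


def summarize_trace(spans):
    """Summarize one trace: slowest leaf span, first error span, retried calls."""
    parent_ids = {s["parent_id"] for s in spans}
    leaves = [s for s in spans if s["span_id"] not in parent_ids]
    slowest = max(leaves, key=lambda s: s["end_ms"] - s["start_ms"])

    errors = [s for s in spans if s["error"]]
    first_error = min(errors, key=lambda s: s["end_ms"]) if errors else None

    # Retries show up as several child spans with the same name under one parent.
    counts = Counter((s["parent_id"], s["name"]) for s in spans)
    retried = sorted(name for (_, name), n in counts.items() if n > 1)

    return {
        "slowest_leaf": slowest["name"],
        "first_error": first_error["name"] if first_error else None,
        "retried": retried,
    }


# Made-up trace: one root request, a cache lookup, and three failed attempts.
spans = [
    {"span_id": "1", "parent_id": None, "name": "GET /orders/{id}", "start_ms": 0, "end_ms": 3100, "error": True},
    {"span_id": "2", "parent_id": "1", "name": "cache.get", "start_ms": 2, "end_ms": 8, "error": False},
    {"span_id": "3", "parent_id": "1", "name": "InventoryService.Reserve", "start_ms": 10, "end_ms": 1010, "error": True},
    {"span_id": "4", "parent_id": "1", "name": "InventoryService.Reserve", "start_ms": 1020, "end_ms": 2020, "error": True},
    {"span_id": "5", "parent_id": "1", "name": "InventoryService.Reserve", "start_ms": 2030, "end_ms": 3030, "error": True},
]

summary = summarize_trace(spans)
print(summary)
```

For this sample the summary supports a dependency statement of the form "GET /orders/{id} is timing out calling InventoryService.Reserve, with retries".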

If you cannot find or interpret traces, fall back to logs + metrics and consider adding the missing telemetry via apply-observability-patterns.

3) Metrics (confirm blast radius + regression)

Use metrics to answer:

•Is this widespread or isolated to one tenant/route/method?
•Is it a new regression (deploy-correlated) or a gradual degradation (resource/saturation)?
•Is it primarily errors or latency?

Start with RED (rate, errors, duration) at the boundary (HTTP route / gRPC method / consumer group).
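
To make the RED check concrete, here is a small sketch that computes rate, error ratio, and p95 duration per op from raw request records. It assumes you can export records with `op`, `duration_ms`, and `error` fields (hypothetical names) for a known window; a real metrics backend would normally compute this from counters and histograms instead. The p95 uses the nearest-rank method.

```python
from collections import defaultdict


def red_by_op(requests, window_seconds):
    """Rate, error ratio, and nearest-rank p95 duration for each op."""
    by_op = defaultdict(list)
    for r in requests:
        by_op[r["op"]].append(r)

    summary = {}
    for op, rows in by_op.items():
        durations = sorted(r["duration_ms"] for r in rows)
        # Nearest-rank p95: ceil(0.95 * n), done in integer arithmetic.
        rank = (95 * len(durations) + 99) // 100
        summary[op] = {
            "rate_per_s": len(rows) / window_seconds,
            "error_ratio": sum(1 for r in rows if r["error"]) / len(rows),
            "p95_ms": durations[rank - 1],
        }
    return summary


# Made-up 10-second window: the 5 slowest order requests fail, health checks are fine.
requests = [
    {"op": "GET /orders/{id}", "duration_ms": 10 * (i + 1), "error": i >= 15}
    for i in range(20)
] + [
    {"op": "GET /health", "duration_ms": 5, "error": False}
    for _ in range(10)
]

summary = red_by_op(requests, window_seconds=10)
for op, red in summary.items():
    print(op, red)
```

Grouping by op first answers the "widespread or isolated" question directly: here only one route shows errors and elevated latency.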

4) Decide: mitigate vs investigate

If impact is high and evidence points to a recent change:

•roll back / disable the flag / reduce load / scale the critical dependency

If impact is moderate or unclear:

•tighten the hypothesis with 1–2 targeted checks (another exemplar trace, compare two instances, check downstream health)

5) Capture learnings (don’t lose the fix)

If you found a systemic gap, capture it:

•missing telemetry field contracts → apply-observability-patterns
•retries without idempotency / missing time budgets → apply-resilience-patterns
•repeated boundary logic across services → shared-platform-library
•cross-service pattern confusion → select-architecture-pattern

Guardrails

•Don’t log secrets/PII while triaging (even “temporarily”).
•Don’t use unbounded IDs as metric labels; use logs/traces for per-entity investigation.
•Don’t add retries as a debugging “fix” without idempotency/dedupe.
•Prefer a small number of exemplars (2–3) over “grep everything forever”.
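
One way to enforce the metric-label guardrail is an allowlist of bounded labels, sketched below. The label names in `ALLOWED_LABELS` and the helper `metric_labels` are hypothetical; the idea is only that per-entity IDs such as traceId never reach the metrics pipeline and stay in logs and traces.

```python
# Hypothetical allowlist: only labels with a small, bounded set of values.
ALLOWED_LABELS = {"env", "op", "status_class"}


def metric_labels(fields):
    """Keep only bounded labels; unbounded IDs are dropped from the metric."""
    return {k: v for k, v in fields.items() if k in ALLOWED_LABELS}


# Made-up request context containing both bounded and unbounded fields.
fields = {
    "env": "prod",
    "op": "GET /orders/{id}",
    "status_class": "5xx",
    "traceId": "abc",
    "userId": "u-123",
}

labels = metric_labels(fields)
dropped = sorted(set(fields) - set(labels))
print(labels)
print("kept out of metrics:", dropped)
```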

References

•Copy/paste commands: references/commands.md
•Scenario checklists (HTTP/gRPC/consumers): references/scenarios.md
•If telemetry is missing: apply-observability-patterns

Output Template

When using this skill, return:

•Symptom: what is failing/slow (include concrete ops: route/method/job/message type).
•Impact: who/what is affected and how badly (error rate %, latency p95, backlog size).
•Time window: start/end and whether it correlates with deploy/config change.
•Evidence: exemplar IDs + the key log/trace/metric observations.
•Hypothesis: most likely cause + 1 alternative (if applicable).
•Mitigation: what you did / recommend doing now (rollback/flag/scale).
•Fix plan: code/config changes to make it correct and durable.
•Follow-ups: telemetry gaps, runbook updates, tests, new invariants.
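
If the report is produced programmatically, the template can be held in a small structure so that no field is forgotten. This is a sketch under assumptions: the class `TriageReport` and its `render` method are hypothetical, and all values in the example are made up for illustration.

```python
from dataclasses import dataclass

# Output labels in template order, paired with the attribute that holds each value.
LABELS = [
    ("Symptom", "symptom"),
    ("Impact", "impact"),
    ("Time window", "time_window"),
    ("Evidence", "evidence"),
    ("Hypothesis", "hypothesis"),
    ("Mitigation", "mitigation"),
    ("Fix plan", "fix_plan"),
    ("Follow-ups", "follow_ups"),
]


@dataclass
class TriageReport:
    symptom: str
    impact: str
    time_window: str
    evidence: str
    hypothesis: str
    mitigation: str
    fix_plan: str
    follow_ups: str

    def render(self) -> str:
        """Render the report as one bullet per template field."""
        return "\n".join(f"•{label}: {getattr(self, attr)}" for label, attr in LABELS)


# Made-up example values.
report = TriageReport(
    symptom="GET /orders/{id} returns 5xx",
    impact="25% errors on one route, p95 190 ms",
    time_window="10:00-10:10 UTC, starts right after a deploy",
    evidence="traceId abc; three failed InventoryService.Reserve attempts",
    hypothesis="downstream timeout after deploy; alternative: pool exhaustion",
    mitigation="roll back the deploy",
    fix_plan="set an explicit time budget on the downstream call",
    follow_ups="add op to consumer logs; update the runbook",
)
print(report.render())
```

Because every field is required, constructing the report fails loudly if a section is missing.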