
observability-triage

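Triage issues in production or local environments using a log → trace → metrics workflow (HTTP/gRPC/async consumers). Use when debugging incidents, regressions, or SLO violations in a deployed enterprise web app; not for adding new instrumentation.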

SKILL.md
--- frontmatter
name: observability-triage
description: Triage production/local issues using a log → trace → metrics workflow (HTTP/gRPC/async consumers). Use when debugging incidents, regressions, or SLO violations in an already-instrumented enterprise web app; not for adding new instrumentation.

Observability Triage (Log → Trace → Metrics)

Overview

This skill is for debugging with existing telemetry. It does not focus on adding instrumentation (use apply-observability-patterns when telemetry gaps block triage).

Goal: turn “something is broken/slow” into:

  • a concrete symptom + impact statement,
  • an evidence-backed hypothesis (or a small set of competing ones),
  • a mitigation (rollback/flag/scale) when needed,
  • a short list of fix + follow-up tasks.

Workflow

0) Establish ground truth (2–5 minutes)

Capture:

  • Environment (local/dev/staging/prod) and time window (start/end).
  • Symptom (what’s failing/slow) and impact (SLO/user-visible blast radius).
  • One exemplar: request/trace ID, job run ID, message ID, or timestamped log line.

1) Logs (find the exemplar and its correlation IDs)

  1. Find the first error/timeout log line closest to the symptom window.
  2. Identify correlation keys (prefer stable IDs):
    • traceId, requestId, spanId
    • op (route template / RPC method / job name / message type)
    • error code/type (typed error envelope, gRPC status, HTTP status)
  3. Pull the full log story for the exemplar (start → downstream call(s) → failure).

Copy/paste helpers live in references/commands.md.
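
As an illustration of the three log steps above, here is a minimal Python sketch that works on JSON-formatted log lines. It assumes each line is a JSON object with `ts` (ISO-8601 UTC strings in one consistent format, so they sort lexicographically), `level`, and the correlation fields named above; those field names and the `errorCode` key are assumptions, not something this skill prescribes, so adjust them to your log schema.

```python
import json
from typing import Iterable, Optional

# Hypothetical field names; adapt to your own log schema.
CORRELATION_KEYS = ("traceId", "requestId", "spanId", "op", "errorCode")


def parse_lines(lines: Iterable[str]) -> list:
    """Parse JSON log lines, skipping blanks and anything that is not a JSON object."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            records.append(record)
    return records


def first_error(records: list, start: str, end: str) -> Optional[dict]:
    """Step 1: earliest error-level record inside the symptom window [start, end]."""
    candidates = [
        r for r in records
        if str(r.get("level", "")).lower() == "error"
        and start <= r.get("ts", "") <= end
    ]
    return min(candidates, key=lambda r: r["ts"], default=None)


def correlation_keys(record: dict) -> dict:
    """Step 2: pull out whichever correlation keys the record carries."""
    return {k: record[k] for k in CORRELATION_KEYS if k in record}


def log_story(records: list, trace_id: str) -> list:
    """Step 3: every record for one trace, in time order."""
    matching = [r for r in records if r.get("traceId") == trace_id]
    return sorted(matching, key=lambda r: r.get("ts", ""))
```

A typical use is `first_error(...)` to pick the exemplar, `correlation_keys(...)` to read its IDs, then `log_story(records, exemplar["traceId"])` to read the request from start to failure.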

2) Trace (turn the exemplar into a dependency hypothesis)

If you have a traceId, use it.

  1. Open the trace and confirm the root span matches the suspected operation (op).
  2. Identify:
    • the slowest span(s),
    • the first error span(s),
    • retries (multiple similar child spans),
    • deadline/time budget signals (deadline exceeded, timeout errors).
  3. Convert that to a dependency statement:
    • “Service A is timing out calling service B method X”
    • “DB query Y is slow / missing index / deadlocked”
    • “Queue consumer is failing on message type T (poison message)”
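
The signals in step 2 can also be read off an exported trace programmatically. The sketch below assumes spans have already been loaded into a simple `Span` record; that type and its field names are hypothetical and do not correspond to any particular tracing backend's export format. Repeated child spans with the same name under one parent are treated as suspected retries, although they can equally be legitimate fan-out, so confirm against span attributes or logs.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """Hypothetical flattened span record (not a real exporter schema)."""
    span_id: str
    parent_id: Optional[str]  # None for the root span
    name: str
    start_ms: int
    duration_ms: int
    error: bool = False


def slowest_spans(spans: list, n: int = 3) -> list:
    """Longest non-root spans; the root is excluded because it covers everything."""
    non_root = [s for s in spans if s.parent_id is not None]
    return sorted(non_root, key=lambda s: s.duration_ms, reverse=True)[:n]


def first_error_span(spans: list) -> Optional[Span]:
    """Earliest span flagged as an error, or None if the trace has no errors."""
    errors = [s for s in spans if s.error]
    return min(errors, key=lambda s: s.start_ms, default=None)


def suspected_retries(spans: list) -> dict:
    """(parent_id, name) pairs that occur more than once under the same parent."""
    counts = Counter((s.parent_id, s.name) for s in spans if s.parent_id is not None)
    return {key: n for key, n in counts.items() if n > 1}
```

If the first error span and the slowest span are the same downstream call, and it also shows up in `suspected_retries`, that supports a dependency statement of the first kind listed above.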

If you cannot find/interpret traces, fall back to logs + metrics and consider adding missing telemetry via apply-observability-patterns.

3) Metrics (confirm blast radius + regression)

Use metrics to answer:

  • Is this widespread or isolated to one tenant/route/method?
  • Is it a new regression (deploy-correlated) or a gradual degradation (resource/saturation)?
  • Is it primarily errors or latency?

Start with RED (rate, errors, duration) for the boundary (HTTP route / gRPC method / consumer group).
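
In practice these numbers come from your metrics backend, but the arithmetic is simple enough to sketch. The example below computes rate, error ratio, and p95 latency per operation from raw request records. The `Request` type, its fields, and the nearest-rank percentile method are assumptions chosen for illustration.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Request:
    """Hypothetical request record; real data would come from a metrics store."""
    op: str            # route template / RPC method / consumer group
    duration_ms: float
    ok: bool


def p95(values: list) -> Optional[float]:
    """Nearest-rank 95th percentile; None when there is no data."""
    if not values:
        return None
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def red_by_op(requests: list, window_s: float) -> dict:
    """Rate, error ratio, and p95 duration per op over a window of window_s seconds."""
    grouped = defaultdict(list)
    for r in requests:
        grouped[r.op].append(r)

    summary = {}
    for op, rs in grouped.items():
        errors = sum(1 for r in rs if not r.ok)
        summary[op] = {
            "rate_per_s": len(rs) / window_s,
            "error_ratio": errors / len(rs),
            "p95_ms": p95([r.duration_ms for r in rs]),
        }
    return summary
```

Running `red_by_op` once on a window before the suspected deploy and once after it gives a direct answer to the regression question; comparing ops within one window answers the widespread-versus-isolated question.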

4) Decide: mitigate vs investigate

If impact is high and evidence points to a recent change:

  • rollback / disable flag / reduce load / scale critical dependency

If impact is moderate or unclear:

  • tighten the hypothesis with 1–2 targeted checks (another exemplar trace, compare two instances, check downstream health)
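
One of the targeted checks above, comparing instances, can be sketched as follows. The input shape (pairs of instance name and success flag) and the thresholds `factor` and `floor` are assumptions for illustration, not recommended values.

```python
import statistics
from collections import Counter


def error_ratio_by_instance(requests: list) -> dict:
    """requests is a list of (instance_name, ok) pairs; returns error ratio per instance."""
    totals, errors = Counter(), Counter()
    for instance, ok in requests:
        totals[instance] += 1
        if not ok:
            errors[instance] += 1
    return {instance: errors[instance] / totals[instance] for instance in totals}


def outlier_instances(ratios: dict, factor: float = 3.0, floor: float = 0.01) -> list:
    """Instances whose error ratio is at least `factor` times the fleet median.

    `floor` keeps the baseline above zero so a healthy fleet (median 0) does not
    make every tiny error ratio look like an outlier. Both values are assumed.
    """
    if not ratios:
        return []
    baseline = max(statistics.median(ratios.values()), floor)
    return sorted(i for i, ratio in ratios.items() if ratio >= factor * baseline)
```

An empty result while the overall error ratio is high suggests the problem is fleet-wide (shared dependency, bad deploy); a single outlier points at one host, pod, or zone.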

5) Capture learnings (don’t lose the fix)

If you found a systemic gap, capture it:

Guardrails

  • Don’t log secrets/PII while triaging (even “temporarily”).
  • Don’t use unbounded IDs as metric labels; use logs/traces for per-entity investigation.
  • Don’t add retries as a debugging “fix” without idempotency/dedupe.
  • Prefer a small number of exemplars (2–3) over “grep everything forever”.
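
Two of these guardrails lend themselves to small helpers. The sketch below shows one possible way to redact sensitive fields before printing a record and to collapse raw paths into bounded route labels. The key list and the ID patterns are assumptions; the redaction is shallow (top-level keys only) and is not a substitute for a real PII policy.

```python
import re

# Hypothetical list of sensitive keys; extend it for your own payloads.
SENSITIVE_KEYS = {"password", "authorization", "token", "email"}

# Path segments treated as per-entity IDs: all digits, or a UUID.
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)


def redact(record: dict) -> dict:
    """Return a copy with sensitive top-level values replaced (nested values are untouched)."""
    return {
        key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        for key, value in record.items()
    }


def route_label(path: str) -> str:
    """Replace numeric and UUID path segments with {id} so metric labels stay bounded."""
    return "/".join(
        "{id}" if _ID_SEGMENT.match(segment) else segment
        for segment in path.split("/")
    )
```

For example, `route_label("/orders/12345/items")` yields `/orders/{id}/items`, which is safe as a metric label, while the concrete order ID stays in logs and traces.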

References

Output Template

When using this skill, return:

  • Symptom: what is failing/slow (include concrete ops: route/method/job/message type).
  • Impact: who/what is affected and how badly (errors %, latency p95, backlog size).
  • Time window: start/end and whether it correlates with deploy/config change.
  • Evidence: exemplar IDs + the key log/trace/metric observations.
  • Hypothesis: most likely cause + 1 alternative (if applicable).
  • Mitigation: what you did / recommend doing now (rollback/flag/scale).
  • Fix plan: code/config changes to make it correct and durable.
  • Follow-ups: telemetry gaps, runbook updates, tests, new invariants.
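
If the report is produced by tooling rather than by hand, the template above maps naturally onto a small record type. This is only a sketch: the `TriageReport` class and its `render` method are hypothetical names, and the plain-text layout is one choice among many.

```python
from dataclasses import dataclass, field


@dataclass
class TriageReport:
    """Hypothetical container mirroring the output template fields."""
    symptom: str
    impact: str
    time_window: str
    evidence: list
    hypothesis: str
    mitigation: str
    fix_plan: str
    follow_ups: list = field(default_factory=list)

    def render(self) -> str:
        """Render the report as plain text, one template field per line."""
        lines = [
            f"Symptom: {self.symptom}",
            f"Impact: {self.impact}",
            f"Time window: {self.time_window}",
            "Evidence:",
            *[f"  - {item}" for item in self.evidence],
            f"Hypothesis: {self.hypothesis}",
            f"Mitigation: {self.mitigation}",
            f"Fix plan: {self.fix_plan}",
            "Follow-ups:",
            *[f"  - {item}" for item in self.follow_ups],
        ]
        return "\n".join(lines)
```

Making every field except `follow_ups` required means an incomplete report fails at construction time instead of being sent with silent gaps.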