Specify Poker Observability Triage
Quick Start (Local Compose)
- •Grafana:
http://localhost:3001(datasources: Prometheus (Mimir), Loki, Tempo) - •Tail logs:
docker compose logs -f gateway
Workflow Decision Tree
A) You have a traceId (preferred)
- •Confirm the boundary error (usually
gateway):- •
docker compose logs --since 10m gateway | rg '<traceId>|\"level\":\"error\"'
- •
- •Loki (log correlation across services):
- •Explore → Loki → query:
- •
{service=~"gateway|game|balance|player|event|notify"} | json | traceId="<traceId>"
- •
- •Explore → Loki → query:
- •Tempo (distributed trace):
- •Explore → Tempo → paste
<traceId>(find the first error span; noteservice.name, RPC/HTTP route, and status).
- •Explore → Tempo → paste
- •Fix the root cause and re-run the smallest repro (one failing test or one request).
B) You do not have a traceId
- •Start from the “closest” boundary:
- •UI symptom →
gateway - •gRPC INTERNAL between services → callee service (
game,balance,player,event,notify)
- •UI symptom →
- •Loki search by stable identifiers:
- •Error string:
{service="gateway"} |= "ACCOUNT_NOT_FOUND" - •User/table IDs (from URL or payload):
{service="gateway"} |= "37d501f4-..."or{service="gateway"} | json | userId="..." - •HTTP path:
{service="gateway"} |= "/api/tables/"
- •Error string:
- •Extract
traceIdfrom the matching log line, then switch to workflow A.
Loki Queries (Useful Defaults)
- •Errors for one service:
{service="gateway"} | json | level="error" - •All services for one trace:
{service=~"gateway|game|balance|player|event|notify"} | json | traceId="<traceId>" - •gRPC failures:
{service=~"gateway|game|balance|player|event|notify"} | json | err.code!=0
Gameplay Stall Quick Checks
- •Stuck on
PREFLOP/ turn never advances:- •Inspect state:
curl -sS -H "Authorization: Bearer $TOKEN" "http://localhost:4000/api/tables/$TABLE_ID/state" - •Look for
hand.turnpointing at anEMPTY/inactive seat, or duplicateuser_idacross seats. - •Game logs:
docker compose logs --since 10m game | rg 'turn\\.repair\\.failed|NOT_YOUR_TURN' - •Deep dive: use
.codex/skills/specify-poker-game-stall-triage/SKILL.md.
- •Inspect state:
Tempo Tips
- •Gateway logs include
traceId/spanIdvia Pino + OTel context. - •If a trace is missing spans, confirm the failing request actually hit the compose stack (not a locally-started service) and that
OTEL_EXPORTER_OTLP_ENDPOINTis set for the service. - •If Grafana trace search fails with
invalid start ... value out of range, confirmtempo-grafana-proxyis running and Grafana’s Tempo datasource URL points tohttp://tempo-grafana-proxy:3200(Grafana uses ms, Tempoapi/searchexpects seconds). - •Service Map depends on metrics produced from traces (OTel Collector
servicegraph/spanmetrics); quick check:traces_service_graph_request_totalshould exist in the Prometheus datasource (backed by Mimir). - •For a focused “Grafana + Tempo traces/service map” playbook, use
.codex/skills/specify-poker-tempo-grafana-traces/SKILL.md.
Metrics (When It’s Intermittent)
- •Use Grafana dashboards first (they’re provisioned in
infra/grafana/dashboards/). - •If you need a quick “is it spiking?” view:
- •In Grafana Explore (Prometheus datasource), run
count by (job) (process_cpu_user_seconds_total)to confirm all services are emitting metrics. - •Look for request error-rate or latency panels; then pivot back to Loki/Tempo for the specific trace(s).
- •In Grafana Explore (Prometheus datasource), run
Scripts
scripts/
- •
scripts/loki_by_trace.py: Query Loki’s HTTP API for a giventraceIdand print matching log lines.