Root Cause Analysis with Kopai
A guide to debugging production issues with telemetry data (traces, logs, metrics) via the Kopai CLI.
Prerequisites
Ensure you have access to the Kopai app backend and that your services are configured to send their OpenTelemetry data to Kopai. See the otel-instrumentation skill for setup.
RCA Workflow Summary
- Find error traces
- Get full trace context
- Correlate logs with trace
- Check related metrics
- Identify root cause
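The first and third steps above can be sketched in a few lines of Python. This is only an illustration over hand-written sample records, not Kopai's actual output schema: `TraceId` is the field named in this guide, while `SpanName`, `StatusCode`, `Body`, and both helper functions are hypothetical names chosen for the example.

```python
import json
from collections import defaultdict

# Hypothetical sample records. Only "TraceId" comes from this guide;
# "SpanName", "StatusCode", and "Body" are assumed field names.
SPANS = json.loads("""
[
  {"TraceId": "t1", "SpanName": "GET /checkout", "StatusCode": "Error"},
  {"TraceId": "t1", "SpanName": "db.query", "StatusCode": "Ok"},
  {"TraceId": "t2", "SpanName": "GET /health", "StatusCode": "Ok"}
]
""")
LOGS = json.loads("""
[
  {"TraceId": "t1", "Body": "connection pool exhausted"},
  {"TraceId": "t2", "Body": "health check passed"},
  {"TraceId": "t1", "Body": "retry 1 of 3 failed"}
]
""")


def find_error_trace_ids(spans):
    """Step 1: collect the TraceId of every span marked as an error."""
    return {s["TraceId"] for s in spans if s.get("StatusCode") == "Error"}


def correlate_logs(logs, trace_ids):
    """Step 3: group log bodies by TraceId, keeping only the given traces."""
    grouped = defaultdict(list)
    for record in logs:
        if record.get("TraceId") in trace_ids:
            grouped[record["TraceId"]].append(record["Body"])
    return dict(grouped)


if __name__ == "__main__":
    failing = find_error_trace_ids(SPANS)
    for trace_id, bodies in correlate_logs(LOGS, failing).items():
        print(trace_id, bodies)
```

In practice the two lists would come from parsing the CLI's JSON output rather than inline strings, and the field names would need to be adjusted to whatever that output actually contains.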
Rules
1. Workflow (CRITICAL)
- workflow-find-errors - Find Error Traces
- workflow-get-context - Get Full Trace Context
- workflow-correlate-logs - Correlate Logs with Trace
- workflow-check-metrics - Check Related Metrics
2. Patterns (HIGH)
- pattern-http-errors - HTTP Error Debugging
- pattern-slow-requests - Slow Request Analysis
- pattern-distributed - Distributed Failure Tracing
- pattern-log-driven - Log-Driven Investigation
Read rules/<rule-name>.md for details.
Tips
- Always use --json for programmatic analysis
- Pipe to jq for filtering/aggregation
- Start with errors, then trace backwards
- Check span Duration to find bottlenecks
- Correlate TraceId across traces, logs, metrics
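As a rough illustration of the Duration tip, the sketch below ranks the spans of one trace by their `Duration` value to surface likely bottlenecks. The sample records, the `SpanName` field, and the `slowest_spans` helper are assumptions made for this example. `Duration` is compared as a plain number, in whatever unit the CLI reports.

```python
import json

# Hypothetical sample spans. "TraceId" and "Duration" are named in this
# guide; "SpanName" and the values themselves are made up for illustration.
SPANS = json.loads("""
[
  {"TraceId": "t1", "SpanName": "GET /checkout", "Duration": 950},
  {"TraceId": "t1", "SpanName": "db.query", "Duration": 870},
  {"TraceId": "t1", "SpanName": "cache.get", "Duration": 12},
  {"TraceId": "t2", "SpanName": "GET /health", "Duration": 3}
]
""")


def slowest_spans(spans, trace_id, top_n=2):
    """Return up to top_n spans of one trace, longest Duration first."""
    in_trace = [s for s in spans if s["TraceId"] == trace_id]
    return sorted(in_trace, key=lambda s: s["Duration"], reverse=True)[:top_n]


if __name__ == "__main__":
    for span in slowest_spans(SPANS, "t1"):
        print(span["SpanName"], span["Duration"])
```

A parent span's Duration normally includes the time spent in its children, so a child that is nearly as long as its parent (here `db.query` under `GET /checkout`) is usually the more useful lead.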
References
- trace-filters - Trace search filter options
- log-filters - Log search filter options
- metric-filters - Metric search filter options