Jaeger Tracing Analysis
Authentication
IMPORTANT: Credentials are injected automatically by a proxy layer. Do NOT look for credentials in environment variables; they are not visible to you. Run the scripts directly; authentication is handled transparently.
Configuration environment variables you CAN check (non-secret):
- JAEGER_URL - Jaeger Query API URL (e.g., http://jaeger-query:16686)
MANDATORY: Statistics-First Investigation
NEVER dump all traces. Always follow this pattern:
code
SERVICES → OPERATIONS → STATISTICS → SAMPLE TRACES
- List Services - Know what services exist
- List Operations - Understand endpoints/operations per service
- Get Statistics - Error rates, latency percentiles
- Sample Traces - Get specific traces after understanding the landscape
Available Scripts
All scripts are in .claude/skills/observability-jaeger/scripts/
SERVICE DISCOVERY
list_services.py - List All Traced Services
bash
python .claude/skills/observability-jaeger/scripts/list_services.py
# Output: List of all services sending traces to Jaeger
list_operations.py - List Operations for a Service
bash
python .claude/skills/observability-jaeger/scripts/list_operations.py <service>

# Example:
python .claude/skills/observability-jaeger/scripts/list_operations.py frontend
TRACE INVESTIGATION
get_traces.py - Search for Traces
bash
python .claude/skills/observability-jaeger/scripts/get_traces.py --service SERVICE [OPTIONS]

# Options:
#   --operation OPERATION   Filter by operation name
#   --tags KEY=VALUE        Filter by tags (can repeat)
#   --min-duration MS       Minimum duration in milliseconds
#   --max-duration MS       Maximum duration in milliseconds
#   --limit N               Max traces to return (default: 20)
#   --lookback HOURS        How far back to search (default: 1)

# Examples:
python .claude/skills/observability-jaeger/scripts/get_traces.py --service frontend --limit 10
python .claude/skills/observability-jaeger/scripts/get_traces.py --service checkout --min-duration 500
python .claude/skills/observability-jaeger/scripts/get_traces.py --service api --operation "HTTP GET /users" --limit 5
python .claude/skills/observability-jaeger/scripts/get_traces.py --service payment --tags error=true
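For orientation, the options above line up closely with the query parameters of the HTTP endpoint the Jaeger UI itself calls (`/api/traces`). The sketch below shows one way such a mapping could look. It is not the actual implementation of get_traces.py, which this document does not show; the function name `build_trace_query` and the `now_us` parameter are hypothetical, and `/api/traces` is an internal, unversioned Jaeger API whose parameters may differ between releases.

```python
import json
import time
from urllib.parse import urlencode


def build_trace_query(base_url, service, operation=None, tags=None,
                      min_duration_ms=None, max_duration_ms=None,
                      limit=20, lookback_hours=1, now_us=None):
    """Build a Jaeger /api/traces URL from get_traces.py-style options.

    Hypothetical helper. Assumes the Jaeger query service accepts
    service, operation, tags (JSON object), minDuration, maxDuration,
    limit, and start/end timestamps in microseconds.
    """
    end_us = now_us if now_us is not None else int(time.time() * 1_000_000)
    start_us = end_us - int(lookback_hours * 3600 * 1_000_000)
    params = {
        "service": service,
        "limit": limit,
        "start": start_us,  # bounded time window, never open-ended
        "end": end_us,
    }
    if operation:
        params["operation"] = operation
    if tags:
        # Tags are sent as one JSON object with string values.
        params["tags"] = json.dumps({k: str(v) for k, v in tags.items()},
                                    separators=(",", ":"))
    if min_duration_ms is not None:
        params["minDuration"] = f"{min_duration_ms}ms"
    if max_duration_ms is not None:
        params["maxDuration"] = f"{max_duration_ms}ms"
    return f"{base_url.rstrip('/')}/api/traces?{urlencode(params)}"


# Fixed "now" so the resulting URL is reproducible.
url = build_trace_query("http://jaeger-query:16686", "payment",
                        tags={"error": "true"}, min_duration_ms=500,
                        now_us=1_700_000_000_000_000)
print(url)
```

Every query built this way carries a service, a limit, and a time window, which matches the rule that trace searches should always be filtered.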
get_trace.py - Get Full Trace by ID
bash
python .claude/skills/observability-jaeger/scripts/get_trace.py <trace-id>

# Example:
python .claude/skills/observability-jaeger/scripts/get_trace.py abc123def456789
LATENCY ANALYSIS
get_slow_traces.py - Find Slow Traces
bash
python .claude/skills/observability-jaeger/scripts/get_slow_traces.py --service SERVICE [OPTIONS]

# Options:
#   --min-duration MS       Minimum duration threshold (default: 1000)
#   --operation OPERATION   Filter by specific operation
#   --limit N               Max traces to return (default: 20)
#   --lookback HOURS        How far back to search (default: 1)

# Examples:
python .claude/skills/observability-jaeger/scripts/get_slow_traces.py --service checkout --min-duration 500
python .claude/skills/observability-jaeger/scripts/get_slow_traces.py --service api --operation "POST /orders"
get_latency_stats.py - Latency Statistics
bash
python .claude/skills/observability-jaeger/scripts/get_latency_stats.py --service SERVICE [OPTIONS]

# Options:
#   --operation OPERATION   Filter by operation
#   --lookback HOURS        Time window (default: 1)

# Examples:
python .claude/skills/observability-jaeger/scripts/get_latency_stats.py --service frontend
python .claude/skills/observability-jaeger/scripts/get_latency_stats.py --service checkout --operation "POST /checkout"
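To make the percentile figures easier to interpret, here is a minimal sketch of how p50/p95/p99 can be derived from a list of trace durations using the nearest-rank method. How get_latency_stats.py actually computes its percentiles is not stated in this document, so treat the method, the function names, and the sample data as illustrative assumptions.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list (0 < pct <= 100)."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = math.ceil(pct * len(sorted_values) / 100)  # 1-based rank
    return sorted_values[max(rank, 1) - 1]


def latency_summary(durations_ms):
    """Summarize durations (milliseconds) as count plus p50/p95/p99."""
    ordered = sorted(durations_ms)
    return {
        "count": len(ordered),
        "p50": percentile(ordered, 50),
        "p95": percentile(ordered, 95),
        "p99": percentile(ordered, 99),
    }


# Illustrative sample: 100 requests, mostly fast, with a slow tail.
sample = [40] * 90 + [150] * 5 + [300] * 4 + [2500]
summary = latency_summary(sample)
print(summary)
```

In this sample the median stays at 40 ms while p95 and p99 expose the slow tail, which is why percentiles are worth checking before pulling individual traces.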
ERROR ANALYSIS
get_error_traces.py - Find Traces with Errors
bash
python .claude/skills/observability-jaeger/scripts/get_error_traces.py --service SERVICE [OPTIONS]

# Options:
#   --operation OPERATION   Filter by operation
#   --limit N               Max traces (default: 20)
#   --lookback HOURS        Time window (default: 1)

# Examples:
python .claude/skills/observability-jaeger/scripts/get_error_traces.py --service payment
python .claude/skills/observability-jaeger/scripts/get_error_traces.py --service api --operation "POST /checkout"
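As an illustration of what "a trace with errors" means at the span level, the sketch below picks out spans tagged `error=true`. It assumes trace data in the JSON shape returned by the Jaeger query API, where each span carries a `tags` list of `{key, type, value}` objects. The output format of get_error_traces.py is not specified here, so the field names and helper functions are assumptions.

```python
def tags_to_dict(span):
    """Flatten Jaeger's [{key, type, value}, ...] tag list into a dict."""
    return {tag["key"]: tag["value"] for tag in span.get("tags", [])}


def error_spans(trace):
    """Return spans tagged error=true (boolean or string form)."""
    found = []
    for span in trace.get("spans", []):
        value = tags_to_dict(span).get("error")
        if value is True or str(value).lower() == "true":
            found.append(span)
    return found


# Hand-written sample in the assumed Jaeger JSON shape.
sample_trace = {
    "traceID": "def456",
    "spans": [
        {"spanID": "001", "operationName": "POST /checkout",
         "tags": [{"key": "http.status_code", "type": "int64", "value": 500},
                  {"key": "error", "type": "bool", "value": True}]},
        {"spanID": "002", "operationName": "charge-card",
         "tags": [{"key": "span.kind", "type": "string", "value": "client"}]},
    ],
}
failing = [s["operationName"] for s in error_spans(sample_trace)]
print(failing)
```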
Investigation Workflow
Standard Latency Investigation
code
┌─────────────────────────────────────────────────────────────┐
│ 1. LIST SERVICES                                            │
│    python list_services.py                                  │
│    → Identify which service to investigate                  │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│ 2. GET LATENCY STATS                                        │
│    python get_latency_stats.py --service X                  │
│    → See p50, p95, p99 latencies per operation              │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
                      High Latency Found?
                 ┌─────────────┴─────────────┐
                 │                           │
                YES                         NO
                 │                           │
                 ▼                           ▼
┌─────────────────────────────┐  ┌───────────────────────────────────────────┐
│ 3a. GET SLOW TRACES         │  │ 3b. CHECK ERRORS                          │
│    python get_slow_traces.py│  │    python get_error_traces.py --service X │
│      --service X            │  │    → Look for error patterns              │
│    → Analyze slow paths     │  └───────────────────────────────────────────┘
└─────────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────────────────────────┐
│ 4. ANALYZE SPECIFIC TRACE                                   │
│    python get_trace.py <trace-id>                           │
│    → See full span tree, find bottleneck                    │
└─────────────────────────────────────────────────────────────┘
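The branch after step 2 can also be written down as a small decision function. This is only a sketch: the shape of the statistics dictionary, the function name `next_step`, and the 1000 ms threshold (borrowed from the documented default of `--min-duration` in get_slow_traces.py) are assumptions, since the workflow does not define what counts as "high latency".

```python
def next_step(stats_by_operation, service, threshold_ms=1000):
    """Choose step 3a or 3b from per-operation p99 latencies (ms).

    stats_by_operation is assumed to look like
    {"POST /checkout": {"p50": 120, "p95": 900, "p99": 2400}, ...}.
    """
    slow_ops = sorted(op for op, stats in stats_by_operation.items()
                      if stats["p99"] >= threshold_ms)
    if slow_ops:
        # 3a: one slow-trace search per operation over the threshold.
        return [
            f'get_slow_traces.py --service {service} '
            f'--operation "{op}" --min-duration {threshold_ms}'
            for op in slow_ops
        ]
    # 3b: latency looks normal, so look for error patterns instead.
    return [f"get_error_traces.py --service {service}"]


stats = {
    "POST /checkout": {"p50": 120, "p95": 900, "p99": 2400},
    "GET /cart": {"p50": 30, "p95": 80, "p99": 150},
}
commands = next_step(stats, "checkout")
print(commands)
```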
Quick Commands Reference
| Goal | Command |
|---|---|
| List all services | list_services.py |
| List operations | list_operations.py <service> |
| Get latency stats | get_latency_stats.py --service X |
| Find slow traces | get_slow_traces.py --service X --min-duration 500 |
| Find error traces | get_error_traces.py --service X |
| Get specific trace | get_trace.py <trace-id> |
| Search with tags | get_traces.py --service X --tags http.status_code=500 |
Trace Anatomy
Span Structure
code
Trace (trace_id: abc123)
├── Span: frontend (span_id: 001, duration: 250ms)
│   ├── Span: api-gateway (span_id: 002, duration: 200ms)
│   │   ├── Span: auth-service (span_id: 003, duration: 50ms)
│   │   └── Span: order-service (span_id: 004, duration: 120ms) ← bottleneck
│   │       └── Span: database (span_id: 005, duration: 100ms) ← root cause
│   └── Span: cache-lookup (span_id: 006, duration: 5ms)
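A tree like the one above is not stored as a tree: Jaeger returns a flat list of spans, and each child points at its parent through a `CHILD_OF` reference. The sketch below rebuilds the hierarchy from such a list. It assumes the Jaeger JSON field names (`spanID`, `operationName`, `references`, `startTime`, and `duration` in microseconds); the helpers and sample data are hypothetical, and service names stand in for operation names to mirror the example.

```python
from collections import defaultdict


def parent_id(span):
    """Return the CHILD_OF parent span ID, or None for a root span."""
    for ref in span.get("references", []):
        if ref.get("refType") == "CHILD_OF":
            return ref["spanID"]
    return None


def render_tree(spans):
    """Return indented 'name (duration)' lines, parents before children."""
    children = defaultdict(list)
    for span in spans:
        children[parent_id(span)].append(span)
    lines = []

    def walk(pid, depth):
        for span in sorted(children[pid], key=lambda s: s["startTime"]):
            ms = span["duration"] / 1000  # microseconds to milliseconds
            lines.append(f"{'  ' * depth}{span['operationName']} ({ms:g}ms)")
            walk(span["spanID"], depth + 1)

    walk(None, 0)
    return lines


def make_span(span_id, name, start_us, duration_us, parent=None):
    """Build a span dict in the assumed Jaeger JSON shape."""
    refs = [{"refType": "CHILD_OF", "spanID": parent}] if parent else []
    return {"spanID": span_id, "operationName": name, "startTime": start_us,
            "duration": duration_us, "references": refs}


spans = [
    make_span("005", "database", 80_000, 100_000, parent="004"),
    make_span("001", "frontend", 0, 250_000),
    make_span("006", "cache-lookup", 215_000, 5_000, parent="001"),
    make_span("002", "api-gateway", 10_000, 200_000, parent="001"),
    make_span("004", "order-service", 70_000, 120_000, parent="002"),
    make_span("003", "auth-service", 15_000, 50_000, parent="002"),
]
tree = render_tree(spans)
print("\n".join(tree))
```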
Common Tags
- http.method - HTTP method (GET, POST, etc.)
- http.url - Request URL
- http.status_code - Response status code
- error - Boolean, true if span has error
- span.kind - client, server, producer, consumer
- db.type - Database type (mysql, postgres, redis)
- db.statement - Database query (may be truncated)
Finding Bottlenecks
- Sort spans by duration (longest first)
- Look for the critical path (spans on the main request flow)
- Check whether a slow span has child spans (a slow child means the latency is propagated from further down)
- Check tags for error=true or error status codes (4xx or 5xx)
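One way to apply these checks is to compute each span's self time, meaning its duration minus the time spent in its direct children, so that a parent that is slow only because of a slow child does not mask the real culprit. The sketch below uses the durations from the Span Structure example in a simplified, hypothetical span shape (`id`, `parent`, `name`, `duration_ms`). It assumes child spans run sequentially; with parallel children, subtracting their summed durations undercounts the parent's self time.

```python
def self_times(spans):
    """Map span id -> duration minus summed direct-child durations (ms)."""
    child_total = {}
    for span in spans:
        if span["parent"] is not None:
            child_total[span["parent"]] = (
                child_total.get(span["parent"], 0) + span["duration_ms"]
            )
    return {s["id"]: s["duration_ms"] - child_total.get(s["id"], 0)
            for s in spans}


# Durations taken from the Span Structure example.
spans = [
    {"id": "001", "parent": None, "name": "frontend", "duration_ms": 250},
    {"id": "002", "parent": "001", "name": "api-gateway", "duration_ms": 200},
    {"id": "003", "parent": "002", "name": "auth-service", "duration_ms": 50},
    {"id": "004", "parent": "002", "name": "order-service", "duration_ms": 120},
    {"id": "005", "parent": "004", "name": "database", "duration_ms": 100},
    {"id": "006", "parent": "001", "name": "cache-lookup", "duration_ms": 5},
]
own = self_times(spans)
names = {s["id"]: s["name"] for s in spans}
worst = max(own, key=own.get)
print(names[worst], own[worst])
```

Here order-service is the second-longest child span at 120 ms, but only 20 ms of that is its own work; the database span underneath it accounts for the rest, which matches the "root cause" annotation in the example.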
Common Patterns
Find Slow Database Queries
bash
# Find traces with slow DB operations
python .claude/skills/observability-jaeger/scripts/get_traces.py \
  --service order-service \
  --tags db.type=postgres \
  --min-duration 100
Find HTTP Errors
bash
# Find HTTP 500 errors (this filter matches status code 500 only, not every 5xx code)
python .claude/skills/observability-jaeger/scripts/get_traces.py \
  --service api-gateway \
  --tags http.status_code=500

# Or use the error traces script
python .claude/skills/observability-jaeger/scripts/get_error_traces.py --service api-gateway
Compare Latency Across Services
bash
# Get stats for each service
python .claude/skills/observability-jaeger/scripts/get_latency_stats.py --service frontend
python .claude/skills/observability-jaeger/scripts/get_latency_stats.py --service api
python .claude/skills/observability-jaeger/scripts/get_latency_stats.py --service database
Anti-Patterns to Avoid
- Fetching all traces - Always use filters (service, time, duration)
- Skipping service discovery - Always start with list_services.py
- Ignoring latency stats - Get percentiles before diving into individual traces
- Focusing on single spans - Look at the full trace context
- Missing error tags - Always check for error=true in slow traces
- Unbounded time ranges - Always specify --lookback for time bounds
Output Format
When reporting trace findings, use this structure:
code
## Trace Analysis Summary

### Service: [service name]
### Time Window: [start] to [end]

### Latency Statistics
| Operation | p50 | p95 | p99 | Count |
|-----------|-----|-----|-----|-------|
| GET /api | 50ms | 150ms | 300ms | 1000 |

### Slow Traces Found
1. **Trace ID**: abc123
   - **Duration**: 2.5s
   - **Bottleneck**: database span (1.8s)
   - **Root Cause**: Slow query on orders table

### Error Traces Found
1. **Trace ID**: def456
   - **Error**: Connection refused to payment-service
   - **Impact**: 5xx returned to client

### Root Cause Hypothesis
[Based on trace analysis]

### Recommended Action
[Specific remediation step]
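If the latency figures are already available as structured data, the Latency Statistics table in this template can be generated rather than typed by hand. The helper below is a hypothetical sketch; the row layout `(operation, p50, p95, p99, count)` with millisecond values is an assumption, not something the scripts are documented to emit.

```python
def latency_table(rows):
    """Render (operation, p50_ms, p95_ms, p99_ms, count) rows as Markdown."""
    lines = [
        "| Operation | p50 | p95 | p99 | Count |",
        "|-----------|-----|-----|-----|-------|",
    ]
    for operation, p50, p95, p99, count in rows:
        lines.append(
            f"| {operation} | {p50}ms | {p95}ms | {p99}ms | {count} |"
        )
    return "\n".join(lines)


table = latency_table([("GET /api", 50, 150, 300, 1000)])
print(table)
```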