Observability Analysis
Core Principle: Statistics Before Samples
NEVER start by reading raw logs. Always begin with aggregated statistics:
- •Volume: How many logs in the time window?
- •Distribution: Which services/levels/error types?
- •Trends: Is it increasing, stable, or decreasing?
- •THEN sample: Get specific entries after understanding the landscape
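The ordering above can be sketched in a few lines of Python. This is only an illustration, not part of any backend skill: it assumes log records have already been fetched as dictionaries with hypothetical `timestamp`, `service`, and `level` keys, and it judges the trend by comparing the two halves of the window with an arbitrary 10% tolerance.

```python
from collections import Counter
from datetime import datetime, timedelta


def summarize(logs, start, end, tolerance=0.1):
    """Aggregate first: volume, distribution, and a coarse trend.

    The record keys and the half-window trend heuristic are assumptions
    made for this sketch.
    """
    in_window = [r for r in logs if start <= r["timestamp"] < end]
    midpoint = start + (end - start) / 2
    first = sum(1 for r in in_window if r["timestamp"] < midpoint)
    second = len(in_window) - first
    if second > first * (1 + tolerance):
        trend = "increasing"
    elif second < first * (1 - tolerance):
        trend = "decreasing"
    else:
        trend = "stable"
    return {
        "volume": len(in_window),
        "by_level": Counter(r["level"] for r in in_window),
        "by_service": Counter(r["service"] for r in in_window),
        "trend": trend,
    }


# Made-up records, for illustration only.
t0 = datetime(2024, 1, 1, 12, 0)
example_logs = [
    {"timestamp": t0 + timedelta(minutes=5), "service": "api", "level": "INFO"},
    {"timestamp": t0 + timedelta(minutes=40), "service": "api", "level": "ERROR"},
    {"timestamp": t0 + timedelta(minutes=50), "service": "db", "level": "ERROR"},
]
print(summarize(example_logs, t0, t0 + timedelta(hours=1)))
```

In practice each of these aggregates would usually be computed by the backend's own query language rather than client-side; the point is that sampling happens only after these numbers are known.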
Available Backends
IMPORTANT: Credentials are injected automatically by a proxy layer. Do NOT check for API keys in environment variables - they won't be there. Just use the backend scripts directly; authentication is handled transparently.
Available backends (invoke with /skill-name):
- •Coralogix (DataPrime) - /observability-coralogix
- •Datadog - /observability-datadog
- •Splunk (SPL) - /observability-splunk
- •Elasticsearch/OpenSearch - /observability-elasticsearch
- •Jaeger (Tracing) - /observability-jaeger
To check if a backend is working, try a simple query rather than checking env vars.
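A minimal sketch of such a probe follows. The `run_query` callable is hypothetical: it stands in for whatever wrapper invokes a backend script, and it is assumed to return a list of rows and to raise an exception on authentication or connectivity failures.

```python
def backend_is_working(run_query, probe="*"):
    """Run one cheap query and report whether the backend responded.

    `run_query` and the default probe string are assumptions for this
    sketch; substitute the real backend call and a trivial query in that
    backend's syntax.
    """
    try:
        rows = run_query(probe)
    except Exception as exc:  # any failure means the backend is unusable right now
        return False, f"{type(exc).__name__}: {exc}"
    return True, f"probe returned {len(rows)} rows"
```

The check deliberately never inspects environment variables; success or failure of the query itself is the only signal.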
Backend-Specific Skills
- •Coralogix: /observability-coralogix - DataPrime syntax, log/trace analysis
- •Datadog: /observability-datadog - DQL syntax, metrics and APM
- •Splunk: /observability-splunk - SPL syntax, saved searches
- •Elasticsearch: /observability-elasticsearch - Lucene/Query DSL
- •Jaeger: /observability-jaeger - Distributed tracing, latency analysis
Analysis Framework
Step 1: Get the Big Picture
- •Total log volume
- •Error rate and distribution
- •Which services are most affected
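As a rough illustration of Step 1, the three numbers can be derived from a list of records like this. The record keys (`service`, `level`) and the set of level names treated as errors are assumptions for the sketch, not something the backends guarantee.

```python
from collections import Counter

# Assumed level names; adjust to whatever the backend actually emits.
ERROR_LEVELS = {"ERROR", "CRITICAL", "FATAL"}


def big_picture(logs, top_n=3):
    """Return total volume, error count and rate, and the most affected services."""
    total = len(logs)
    errors = [r for r in logs if r["level"] in ERROR_LEVELS]
    per_service = Counter(r["service"] for r in errors)
    return {
        "total": total,
        "errors": len(errors),
        "error_rate": len(errors) / total if total else 0.0,
        "top_services": per_service.most_common(top_n),
    }
```

Ranking services by error count alone can hide a small service that is failing on every request, so a per-service error rate may be worth adding when volumes differ widely.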
Step 2: Identify Patterns
- •Error clustering (many errors in short time)
- •Temporal patterns (started at X time)
- •Service correlation (Service A errors → Service B errors)
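The three patterns above can be approximated with simple bucketing. The sketch below assumes error records carry hypothetical `timestamp` and `service` keys and that `start` is the beginning of the analysis window. It reports the busiest time bucket (clustering), the first error seen per service (temporal onset), and the order in which services started failing (a hint at correlation, not proof of causation).

```python
from collections import Counter
from datetime import datetime, timedelta


def find_patterns(errors, start, bucket=timedelta(minutes=1)):
    """Bucket errors by time and record when each service first failed."""
    counts = Counter()
    first_seen = {}
    for r in sorted(errors, key=lambda r: r["timestamp"]):
        idx = (r["timestamp"] - start) // bucket  # whole buckets since window start
        counts[start + idx * bucket] += 1
        first_seen.setdefault(r["service"], r["timestamp"])
    # Busiest bucket as (bucket_start, count); ties resolve to the earliest bucket.
    peak = max(counts.items(), key=lambda kv: kv[1]) if counts else None
    onset_order = sorted(first_seen, key=first_seen.get)
    return {"peak": peak, "first_seen": first_seen, "onset_order": onset_order}
```

If Service A consistently appears before Service B in `onset_order` across incidents, that supports the "Service A errors → Service B errors" reading, though a shared upstream dependency would produce the same ordering.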
Step 3: Sample Strategically
- •Sample from error peaks
- •Get examples of each distinct error type
- •Compare against baseline period
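One way to sketch Step 3 is to group errors by a normalized message "signature", keep a small sample per group, and compare counts against a baseline period. The `message` key and the normalization rule (replacing standalone numbers and hex values with a placeholder) are assumptions for this example; real error types may need a backend-specific field instead.

```python
import re
from collections import Counter


def error_signature(message):
    """Collapse standalone numbers and hex values so similar errors group together."""
    return re.sub(r"\b(0x[0-9a-fA-F]+|\d+)\b", "<n>", message)


def sample_per_type(errors, per_type=1):
    """Keep up to `per_type` example records for each distinct signature."""
    samples = {}
    for r in errors:
        kept = samples.setdefault(error_signature(r["message"]), [])
        if len(kept) < per_type:
            kept.append(r)
    return samples


def compare_to_baseline(incident_errors, baseline_errors):
    """Map each incident signature to (baseline_count, incident_count)."""
    now = Counter(error_signature(r["message"]) for r in incident_errors)
    before = Counter(error_signature(r["message"]) for r in baseline_errors)
    return {sig: (before.get(sig, 0), count) for sig, count in now.items()}
```

To sample from an error peak, pass in only the records from the busiest time bucket. Signatures whose baseline count is zero are the ones that are new in the incident window and usually deserve the first look.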
Output Format
When reporting observability findings, use this structure:
## Log Analysis Summary

### Time Window
- Start: [timestamp]
- End: [timestamp]
- Duration: X hours

### Statistics
- Total logs: X events
- Error count: Y events (Z%)
- Services affected: N services
- Error rate trend: [increasing/stable/decreasing]

### Top Error Services
1. [service1]: N errors
2. [service2]: M errors

### Error Patterns
- Primary error type: [description]
- First occurrence: [timestamp]
- Correlation: [deployment/traffic/external event]

### Sample Errors
[Quote 2-3 representative error messages with context]

### Root Cause Hypothesis
[Based on patterns observed]

### Confidence Level
[High/Medium/Low with explanation]