Log Analysis Methodology
Core Philosophy: Partition-First
NEVER start by reading raw log samples.
Log volume usually far exceeds what anyone can read line by line. Aggregating first and sampling second prevents:
- •Missing the forest for the trees
- •Wasting time on irrelevant data
- •Overwhelming context with noise
The 4-Step Process
Step 1: Get Statistics
Before ANY log search, understand the landscape:
CloudWatch Insights:
code
# How many errors?
filter @message like /ERROR/
| stats count(*) as total

# Error rate over time
filter @message like /ERROR/
| stats count(*) by bin(5m)

# What types of errors?
filter @message like /ERROR/
| parse @message /(?<error_type>[\w.]+Exception)/
| stats count(*) as occurrences by error_type
| sort occurrences desc
Datadog:
code
# Error distribution by service
service:* status:error | stats count by service

# Error types
service:myapp status:error | stats count by @error.kind
Questions to answer:
- •What's the total error volume?
- •Is it increasing, stable, or decreasing?
- •What are the unique error types?
- •Which services/hosts are affected?
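When the logs are only available as plain text rather than in a query platform, the same first-pass statistics can be approximated locally. The following is a minimal sketch, not a definitive implementation: the line shape (`<ISO timestamp> <LEVEL> <message>`), the function name `error_statistics`, and the sample lines are all assumptions made for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed line shape: "<ISO-8601 timestamp> <LEVEL> <message>"
LINE_RE = re.compile(r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)$")
# Same pattern as the parse step in the CloudWatch query above
ERROR_TYPE_RE = re.compile(r"([\w.]+Exception)")


def error_statistics(lines, bucket_minutes=5):
    """Return (total_errors, errors_per_bucket, errors_per_type).

    bucket_minutes is assumed to divide 60 evenly.
    """
    per_bucket = Counter()
    per_type = Counter()
    total = 0
    for line in lines:
        m = LINE_RE.match(line)
        if not m or m.group("level") != "ERROR":
            continue
        total += 1
        ts = datetime.fromisoformat(m.group("ts"))
        # Floor the timestamp to the start of its bucket
        floored = ts.replace(minute=ts.minute - ts.minute % bucket_minutes,
                             second=0, microsecond=0)
        per_bucket[floored] += 1
        found = ERROR_TYPE_RE.search(m.group("msg"))
        per_type[found.group(1) if found else "unknown"] += 1
    return total, per_bucket, per_type


# Hypothetical sample lines
sample = [
    "2024-05-01T10:01:12 ERROR java.net.SocketTimeoutException: read timed out",
    "2024-05-01T10:03:40 INFO request served",
    "2024-05-01T10:04:59 ERROR java.lang.NullPointerException at Foo.bar",
    "2024-05-01T10:07:05 ERROR java.net.SocketTimeoutException: read timed out",
]
total, per_bucket, per_type = error_statistics(sample)
print(total)
print(per_type.most_common())
```

The three return values map onto the first three questions: total volume, trend over time, and distinct error types.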
Step 2: Identify Patterns
Look for correlations:
Temporal patterns:
- •Did errors start at a specific time?
- •Is there periodicity (every hour, every day)?
- •Correlation with deployments or traffic spikes?
Service patterns:
- •Is one service the source?
- •Is the error propagating across services?
Error patterns:
- •What's the most frequent error?
- •Are errors clustered or distributed?
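The question "did errors start at a specific time?" can be answered mechanically from the per-bucket counts gathered in Step 1. The sketch below is one possible heuristic, assuming the first few buckets represent normal behaviour; the function name `find_onset`, the baseline length, and the three-sigma threshold are illustrative choices, not part of the methodology.

```python
from statistics import mean, pstdev


def find_onset(counts, baseline_buckets=6, sigmas=3.0):
    """Return the index of the first bucket above baseline mean + sigmas * stdev.

    Returns None when no bucket after the baseline exceeds the threshold.
    """
    baseline = counts[:baseline_buckets]
    # Floor the spread at 1.0 so a nearly flat baseline does not flag tiny changes
    threshold = mean(baseline) + sigmas * max(pstdev(baseline), 1.0)
    for i, c in enumerate(counts[baseline_buckets:], start=baseline_buckets):
        if c > threshold:
            return i
    return None


# Hypothetical error counts per 5-minute bucket
counts = [2, 3, 2, 4, 3, 2, 3, 4, 41, 57, 60]
onset = find_onset(counts)
print(onset)
```

The returned index identifies the bucket to compare against deployment and traffic timelines.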
Step 3: Sample Strategically
Only NOW read actual log samples:
Sample from anomalies:
- •Get logs from the peak error time
- •Get logs from normal time for comparison
Sample by error type:
- •Get examples of each distinct error type
- •Limit to 5-10 per type
Sample around events:
- •Logs before/after a deployment
- •Logs around a specific incident timestamp
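Sampling by error type with a per-type cap can be sketched as follows. This is an illustration under stated assumptions: events are taken to be `(error_type, message)` pairs already extracted from the logs, and `sample_per_type` is a hypothetical helper name.

```python
import random
from collections import defaultdict


def sample_per_type(events, per_type=5, seed=0):
    """Group messages by error type and keep at most per_type of each."""
    grouped = defaultdict(list)
    for error_type, message in events:
        grouped[error_type].append(message)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return {
        t: rng.sample(msgs, min(per_type, len(msgs)))
        for t, msgs in grouped.items()
    }


# Hypothetical events: one noisy type and one rare type
events = [("TimeoutException", f"timeout {i}") for i in range(50)]
events.append(("NullPointerException", "npe 0"))
picked = sample_per_type(events)
print({t: len(msgs) for t, msgs in picked.items()})
```

Capping per type keeps a single noisy error from crowding out rare ones, which is the point of sampling after partitioning.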
Step 4: Correlate with Events
Connect logs to system changes:
code
# Use git_log to find recent deployments
git_log --since="2 hours ago"

# Use get_deployment_history for K8s
get_deployment_history deployment=api-server

# Compare log patterns before/after changes
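The before/after comparison can be made concrete by counting errors in equal windows on either side of a change. The sketch below assumes error timestamps have already been extracted as `datetime` objects; `compare_around`, the 30-minute window, and the sample data are illustrative assumptions.

```python
from datetime import datetime, timedelta


def compare_around(error_times, event_time, window=timedelta(minutes=30)):
    """Count errors in the window before and the window after event_time."""
    before = sum(1 for t in error_times if event_time - window <= t < event_time)
    after = sum(1 for t in error_times if event_time <= t < event_time + window)
    return before, after


# Hypothetical deployment time and error timestamps (offsets in minutes)
deploy = datetime(2024, 5, 1, 10, 0)
errors = [deploy + timedelta(minutes=m) for m in (-25, -3, 1, 2, 2, 4, 9, 15)]
before, after = compare_around(errors, deploy)
print(before, after)
```

A large jump from `before` to `after` supports, but does not prove, the hypothesis that the change introduced the errors.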
Platform-Specific Tips
CloudWatch Insights
Best practices:
code
# Always bound the time range: set the query's start and end time
# (console time picker, or startTime/endTime on the StartQuery API)

# Use parse for structured extraction
parse @message /status=(?<status>\d+)/

# Aggregate before displaying
stats count(*) as hits by status
| sort hits desc
| limit 10
Common queries:
code
# Latency distribution
filter @type = "REPORT"
| stats avg(@duration) as avg,
        pct(@duration, 95) as p95,
        pct(@duration, 99) as p99

# Error messages with context
filter @message like /ERROR/
| fields @timestamp, @message
| sort @timestamp desc
| limit 20
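When these queries are run programmatically, the time window and result limit can be enforced in code so that no query goes out unbounded. The sketch below builds keyword arguments for boto3's `start_query` call on the CloudWatch Logs client without calling AWS; the helper name `bounded_query_params`, the one-hour default, and the log group name are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def bounded_query_params(log_group, query, end=None,
                         window=timedelta(hours=1), limit=20):
    """Build start_query keyword arguments with an explicit time window."""
    end = end or datetime.now(timezone.utc)
    start = end - window
    return {
        "logGroupName": log_group,
        "startTime": int(start.timestamp()),  # epoch seconds
        "endTime": int(end.timestamp()),      # epoch seconds
        "queryString": query,
        "limit": limit,
    }


params = bounded_query_params(
    "/aws/lambda/example-fn",  # hypothetical log group name
    "filter @message like /ERROR/ | stats count(*) as total",
    end=datetime(2024, 5, 1, 11, 0, tzinfo=timezone.utc),
)
print(params["endTime"] - params["startTime"])

# With AWS credentials configured, the dictionary could be passed as:
#   boto3.client("logs").start_query(**params)
```

Keeping the window as a required part of the parameter builder makes it hard to submit a query that scans the full retention period by accident.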
Datadog Logs
Query syntax:
code
# Filter by service and status
service:api-gateway status:error

# Field queries
@http.status_code:>=500

# Wildcard
@error.message:*timeout*

# Time comparison
service:api (now-1h TO now) vs (now-25h TO now-24h)
Kubernetes Logs
Use get_pod_logs wisely:
- •Always specify tail_lines (default: 100)
- •Filter to specific containers in multi-container pods
- •Use get_pod_events first for crashes/restarts
Anti-Patterns to Avoid
- •Dumping all logs - Never request unbounded log queries
- •Starting with samples - Always get statistics first
- •Ignoring time windows - Narrow to incident window
- •Missing correlation - Always connect to deployments/changes
- •Single-service focus - Check upstream/downstream services
Investigation Template
code
## Log Analysis Report

### Statistics
- Time window: [start] to [end]
- Total log volume: X events
- Error count: Y events (Z%)
- Error rate trend: [increasing/stable/decreasing]

### Top Error Types
1. [ErrorType1]: N occurrences - [description]
2. [ErrorType2]: M occurrences - [description]

### Temporal Pattern
- Errors started at: [timestamp]
- Correlation: [deployment X / traffic spike / external event]

### Sample Errors
[Quote 2-3 representative error messages]

### Root Cause Hypothesis
[Based on patterns, what's the likely cause?]