Log Analysis Methodology

Core Philosophy: Partition-First

NEVER start by reading raw log samples.

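Raw logs are high-volume and mostly noise. Partition-first means aggregating by time, service, and error type before reading any individual line. This approach prevents: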

•Missing the forest for the trees
•Wasting time on irrelevant data
•Overwhelming context with noise

The 4-Step Process

Step 1: Get Statistics

Before ANY log search, understand the landscape:

CloudWatch Insights:

code

# How many errors?
filter @message like /ERROR/
| stats count(*) as total

# Error rate over time
filter @message like /ERROR/
| stats count(*) by bin(5m)

# What types of errors?
filter @message like /ERROR/
| parse @message /(?<error_type>[\w.]+Exception)/
| stats count(*) as occurrences by error_type
| sort occurrences desc

Datadog:

code

# Error distribution by service
# (search query; apply "group by service" in the Log Explorer analytics view)
service:* status:error

# Error types
# (search query; apply "group by @error.kind" in the analytics view)
service:myapp status:error

Questions to answer:

•What's the total error volume?
•Is it increasing, stable, or decreasing?
•What are the unique error types?
•Which services/hosts are affected?
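
The same four questions can be answered offline on an exported batch of lines. The sketch below is illustrative only: the line format (`timestamp LEVEL service message`), the `summarize` helper, and the sample `LOGS` are assumptions, not part of any platform API. It returns aggregates only, never raw lines.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed line format: "<ISO-8601 timestamp> <LEVEL> <service> <message>"
LINE_RE = re.compile(r"^(?P<ts>\S+) (?P<level>[A-Z]+) (?P<service>\S+) (?P<msg>.*)$")
# Same pattern as the CloudWatch parse example above
ERROR_TYPE_RE = re.compile(r"([\w.]+Exception)")

# Hypothetical sample data
LOGS = [
    "2024-05-01T10:01:10 INFO api request ok",
    "2024-05-01T10:02:30 ERROR api java.net.SocketTimeoutException: read timed out",
    "2024-05-01T10:03:05 ERROR db java.sql.SQLException: connection refused",
    "2024-05-01T10:07:45 ERROR api java.net.SocketTimeoutException: read timed out",
]


def summarize(lines, bin_minutes=5):
    """Aggregate error statistics without returning any raw log lines."""
    by_type, by_service, by_bin = Counter(), Counter(), Counter()
    total = errors = 0
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines that do not fit the assumed format
        total += 1
        if m["level"] != "ERROR":
            continue
        errors += 1
        by_service[m["service"]] += 1
        found = ERROR_TYPE_RE.search(m["msg"])
        by_type[found.group(1) if found else "unknown"] += 1
        # Floor the timestamp to the start of its bin
        ts = datetime.fromisoformat(m["ts"])
        floored = ts.replace(minute=ts.minute - ts.minute % bin_minutes,
                             second=0, microsecond=0)
        by_bin[floored.isoformat()] += 1
    return {
        "total": total,
        "errors": errors,
        "by_type": dict(by_type),
        "by_service": dict(by_service),
        "by_bin": dict(sorted(by_bin.items())),
    }


if __name__ == "__main__":
    print(summarize(LOGS))
```

Reading `by_bin` in order answers the trend question; `by_type` and `by_service` answer the other two.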

Step 2: Identify Patterns

Look for correlations:

Temporal patterns:

•Did errors start at a specific time?
•Is there periodicity (every hour, every day)?
•Correlation with deployments or traffic spikes?

Service patterns:

•Is one service the source?
•Is the error propagating across services?

Error patterns:

•What's the most frequent error?
•Are errors clustered or distributed?
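
Two of these questions lend themselves to a quick numeric check once per-bin and per-service counts are available. The following is a rough sketch, not a statistical method endorsed by the source: `find_onset`, `concentration`, the baseline length, and the 3x threshold are all assumptions chosen for illustration.

```python
from statistics import median


def find_onset(counts, baseline_bins=6, factor=3.0):
    """Return the index of the first bin whose count exceeds
    factor x the median of the first `baseline_bins` bins, or None."""
    if len(counts) <= baseline_bins:
        return None
    # Guard against a zero baseline so a single error is not flagged
    baseline = max(median(counts[:baseline_bins]), 1)
    for i in range(baseline_bins, len(counts)):
        if counts[i] > factor * baseline:
            return i
    return None


def concentration(by_key):
    """Share of errors held by the largest key.
    Near 1.0 means clustered; near 1/len(by_key) means distributed."""
    total = sum(by_key.values())
    return max(by_key.values()) / total if total else 0.0


if __name__ == "__main__":
    per_bin = [2, 3, 2, 4, 3, 2, 3, 12, 40, 38]  # hypothetical 5-minute counts
    print(find_onset(per_bin))
    print(concentration({"api": 90, "db": 10}))
```

An onset index maps back to a timestamp through the bin size, which gives the time to compare against deployments and traffic changes.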

Step 3: Sample Strategically

Only NOW read actual log samples:

Sample from anomalies:

•Get logs from the peak error time
•Get logs from normal time for comparison

Sample by error type:

•Get examples of each distinct error type
•Limit to 5-10 per type

Sample around events:

•Logs before/after a deployment
•Logs around a specific incident timestamp
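
A minimal sketch of bounded sampling, assuming each record is already parsed into a dict with hypothetical `ts` and `error_type` keys. `sample_by_type` caps the number of examples per type, and `window` restricts records to a time range such as the peak or a quiet comparison period.

```python
from collections import defaultdict
from datetime import datetime


def sample_by_type(records, per_type=5):
    """Keep at most `per_type` records for each distinct error type."""
    samples = defaultdict(list)
    for rec in records:
        bucket = samples[rec["error_type"]]
        if len(bucket) < per_type:
            bucket.append(rec)
    return dict(samples)


def window(records, start, end):
    """Return records with start <= ts < end."""
    return [r for r in records if start <= r["ts"] < end]


if __name__ == "__main__":
    # Hypothetical parsed records
    records = [
        {"ts": datetime(2024, 5, 1, 10, m), "error_type": "TimeoutException"}
        for m in range(12)
    ] + [
        {"ts": datetime(2024, 5, 1, 10, 30), "error_type": "SQLException"},
        {"ts": datetime(2024, 5, 1, 10, 31), "error_type": "SQLException"},
    ]
    peak = window(records, datetime(2024, 5, 1, 10, 0), datetime(2024, 5, 1, 10, 15))
    for error_type, examples in sample_by_type(peak).items():
        print(error_type, len(examples))
```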

Step 4: Correlate with Events

Connect logs to system changes:

code

# Use git_log to find recent deployments
git_log --since="2 hours ago"

# Use get_deployment_history for K8s
get_deployment_history deployment=api-server

# Compare log patterns before/after changes
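
The before/after comparison can be sketched as follows, assuming the deployment timestamps (however they were obtained) and the error timestamps are already available as `datetime` values. The helper names and the 30-minute span are illustrative assumptions.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def last_change_before(change_times, onset):
    """Return the latest change timestamp at or before `onset`, or None."""
    ordered = sorted(change_times)
    idx = bisect_right(ordered, onset)
    return ordered[idx - 1] if idx else None


def before_after(error_times, change, span=timedelta(minutes=30)):
    """Count errors in equal-length windows before and after a change."""
    before = sum(1 for t in error_times if change - span <= t < change)
    after = sum(1 for t in error_times if change <= t < change + span)
    return before, after


if __name__ == "__main__":
    day = datetime(2024, 5, 1)
    deploys = [day.replace(hour=h) for h in (9, 10, 11)]  # hypothetical
    errors = [day.replace(hour=9, minute=45), day.replace(hour=10, minute=5),
              day.replace(hour=10, minute=6), day.replace(hour=10, minute=20)]
    suspect = last_change_before(deploys, onset=day.replace(hour=10, minute=7))
    print(suspect, before_after(errors, suspect))
```

A change that immediately precedes the onset is a lead to investigate, not proof of cause; the sampled logs from Step 3 should support it.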

Platform-Specific Tips

CloudWatch Insights

Best practices:

code

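# Always bound the time range. Logs Insights takes it from the query's
# time range (console time picker, or startTime/endTime on the StartQuery
# API call), so set it there rather than leaving a wide default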

# Use parse for structured extraction
parse @message /status=(?<status>\d+)/

# Aggregate before displaying
stats count(*) as requests by status | sort requests desc | limit 10

Common queries:

code

# Latency distribution (Lambda REPORT lines; @duration is in milliseconds)
filter @type = "REPORT"
| stats avg(@duration) as avg_ms,
        pct(@duration, 95) as p95_ms,
        pct(@duration, 99) as p99_ms

# Error messages with context
filter @message like /ERROR/
| fields @timestamp, @message
| sort @timestamp desc
| limit 20
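
When queries are run programmatically, the time window and result limit are explicit parameters of the request. The sketch below builds the arguments for boto3's `start_query` (epoch-second `startTime` and `endTime`) and polls `get_query_results`. The helper names, the one-hour default, and the log group name are assumptions; `run_query` needs boto3 and AWS credentials, so only the argument builder runs standalone.

```python
import time


def build_start_query_args(log_group, query, window_seconds=3600, limit=20, now=None):
    """Build keyword arguments for logs.start_query with a bounded time range."""
    end = int(now if now is not None else time.time())
    return {
        "logGroupName": log_group,
        "startTime": end - window_seconds,  # epoch seconds
        "endTime": end,                     # epoch seconds
        "queryString": query,
        "limit": limit,                     # cap on returned rows
    }


def run_query(args, poll_seconds=1.0):
    """Start the query and poll until it reaches a terminal status."""
    import boto3  # third-party; imported lazily so the builder works without it

    logs = boto3.client("logs")
    query_id = logs.start_query(**args)["queryId"]
    while True:
        resp = logs.get_query_results(queryId=query_id)
        if resp["status"] in ("Complete", "Failed", "Cancelled", "Timeout"):
            return resp
        time.sleep(poll_seconds)


if __name__ == "__main__":
    query = "filter @message like /ERROR/ | stats count(*) as total"
    # "/aws/lambda/demo" is a hypothetical log group name
    print(build_start_query_args("/aws/lambda/demo", query))
```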

Datadog Logs

Query syntax:

code

# Filter by service and status
service:api-gateway status:error

# Field queries
@http.status_code:>=500

# Wildcard
@error.message:*timeout*

# Time comparison: run the same query over two windows
# (for example now-1h to now, and now-25h to now-24h) and compare the counts
service:api
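
The time comparison amounts to running one query over two windows and comparing the counts. This sketch computes the two windows and the relative change; it does not call any Datadog API, and `comparison_windows` and `relative_change` are hypothetical helpers.

```python
from datetime import datetime, timedelta, timezone


def comparison_windows(now, span=timedelta(hours=1), shift=timedelta(hours=24)):
    """Return (current, baseline) windows as (start, end) pairs.
    With the defaults: now-1h..now and now-25h..now-24h."""
    current = (now - span, now)
    baseline = (now - span - shift, now - shift)
    return current, baseline


def relative_change(current_count, baseline_count):
    """Fractional change versus baseline, or None if the baseline is zero."""
    if baseline_count == 0:
        return None
    return (current_count - baseline_count) / baseline_count


if __name__ == "__main__":
    now = datetime(2024, 5, 2, 12, 0, tzinfo=timezone.utc)
    current, baseline = comparison_windows(now)
    print(current, baseline)
    print(relative_change(150, 50))  # hypothetical counts from the two runs
```

Comparing against the same hour on the previous day controls for daily traffic cycles, which a comparison against the preceding hour would not.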

Kubernetes Logs

Use get_pod_logs wisely:

•Always specify tail_lines (default: 100)
•Filter to specific containers in multi-container pods
•Use get_pod_events first for crashes/restarts
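
For readers working with kubectl directly, the same three habits correspond to standard flags. This sketch only builds argument lists and does not run anything; mapping `get_pod_logs` and `get_pod_events` onto these kubectl commands is an assumption, and the pod and container names are hypothetical.

```python
def kubectl_logs_args(pod, namespace="default", container=None, tail_lines=100, since=None):
    """Build a bounded `kubectl logs` command line."""
    if tail_lines is None or tail_lines <= 0:
        raise ValueError("tail_lines must be a positive integer; unbounded reads are refused")
    args = ["kubectl", "logs", pod, "--namespace", namespace, f"--tail={tail_lines}"]
    if container:
        # Needed when the pod runs more than one container
        args += ["--container", container]
    if since:
        args.append(f"--since={since}")  # e.g. "1h" or "30m"
    return args


def kubectl_events_args(pod, namespace="default"):
    """Build a `kubectl get events` command line scoped to one pod."""
    return ["kubectl", "get", "events", "--namespace", namespace,
            "--field-selector", f"involvedObject.name={pod}"]


if __name__ == "__main__":
    # Check events first for crashes and restarts, then read a bounded tail
    print(" ".join(kubectl_events_args("api-server-abc123")))
    print(" ".join(kubectl_logs_args("api-server-abc123", container="app", since="1h")))
```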

Anti-Patterns to Avoid

•Dumping all logs - Never run an unbounded log query; always set a time range and a limit
•Starting with samples - Always get statistics first
•Ignoring time windows - Narrow every query to the incident window
•Missing correlation - Always check logs against deployments and other changes
•Single-service focus - Check upstream and downstream services as well

Investigation Template

code

## Log Analysis Report

### Statistics
- Time window: [start] to [end]
- Total log volume: X events
- Error count: Y events (Z%)
- Error rate trend: [increasing/stable/decreasing]

### Top Error Types
1. [ErrorType1]: N occurrences - [description]
2. [ErrorType2]: M occurrences - [description]

### Temporal Pattern
- Errors started at: [timestamp]
- Correlation: [deployment X / traffic spike / external event]

### Sample Errors
[Quote 2-3 representative error messages]

### Root Cause Hypothesis
[Based on patterns, what's the likely cause?]
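
The Statistics and Top Error Types sections of the template can be filled mechanically from the aggregates gathered in Step 1. The sketch below assumes a hypothetical `stats` dict with the keys shown; the remaining sections (temporal pattern, samples, hypothesis) need the analyst's judgment and are left out.

```python
def render_report(stats):
    """Render the Statistics and Top Error Types sections of the template."""
    pct = 100 * stats["errors"] / stats["total"] if stats["total"] else 0.0
    lines = [
        "## Log Analysis Report",
        "",
        "### Statistics",
        f"- Time window: {stats['start']} to {stats['end']}",
        f"- Total log volume: {stats['total']} events",
        f"- Error count: {stats['errors']} events ({pct:.1f}%)",
        f"- Error rate trend: {stats['trend']}",
        "",
        "### Top Error Types",
    ]
    # Most frequent error types first
    ranked = sorted(stats["by_type"].items(), key=lambda kv: kv[1], reverse=True)
    for rank, (name, count) in enumerate(ranked, start=1):
        lines.append(f"{rank}. {name}: {count} occurrences")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical aggregates
    print(render_report({
        "start": "2024-05-01T10:00", "end": "2024-05-01T11:00",
        "total": 1000, "errors": 30, "trend": "increasing",
        "by_type": {"SQLException": 10, "TimeoutException": 20},
    }))
```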