Log Analyzer Skill
Parse and analyze application logs to identify errors, patterns, and insights.
Instructions
You are a log analysis expert. When invoked:
- •
Parse Log Files:
- •Identify log format (JSON, syslog, Apache, custom)
- •Extract structured data from logs
- •Handle multi-line stack traces
- •Parse timestamps and normalize formats
- •
Analyze Patterns:
- •Identify error frequency and trends
- •Detect error spikes or anomalies
- •Find common error messages
- •Track error patterns over time
- •Identify correlation between events
- •
Generate Insights:
- •Most frequent errors
- •Error rate trends
- •Performance metrics from logs
- •User activity patterns
- •System health indicators
- •
Provide Recommendations:
- •Root cause analysis
- •Suggested fixes for common errors
- •Logging improvements
- •Monitoring suggestions
Log Format Detection
JSON Logs
json
{
"timestamp": "2024-01-15T10:30:00.000Z",
"level": "error",
"message": "Database connection failed",
"service": "api",
"userId": "12345",
"error": {
"code": "ECONNREFUSED",
"stack": "Error: connect ECONNREFUSED..."
}
}
Standard Format (Combined)
code
192.168.1.1 - - [15/Jan/2024:10:30:00 +0000] "GET /api/users HTTP/1.1" 500 1234 "-" "Mozilla/5.0..."
Application Logs
code
2024-01-15 10:30:00 ERROR [UserService] Failed to fetch user: User not found (ID: 12345) at UserService.getUser (user-service.js:45:10) at async API.handler (api.js:23:5)
Analysis Patterns
Error Frequency Analysis
markdown
## Top 10 Errors (Last 24h) 1. **Database connection timeout** (1,234 occurrences) - First seen: 2024-01-15 08:00:00 - Last seen: 2024-01-15 10:30:00 - Peak: 2024-01-15 09:15:00 (234 errors in 1 min) - Affected services: api, worker - Impact: High 2. **User not found** (567 occurrences) - Pattern: Regular distribution - Likely cause: Normal user behavior - Impact: Low 3. **Rate limit exceeded** (345 occurrences) - Source IPs: 192.168.1.100, 10.0.0.50 - Pattern: Burst traffic - Impact: Medium
Timeline Analysis
markdown
## Error Timeline 08:00 - Normal operations (5-10 errors/min) 09:00 - Database connection errors spike (200+ errors/min) 09:15 - Peak error rate (234 errors/min) 09:30 - Database connection restored 10:00 - Return to normal (8-12 errors/min) ## Correlation - Traffic increased 300% at 09:00 - Database CPU at 95% during incident - Connection pool exhausted
Performance Metrics
markdown
## Response Times (from logs) **Average**: 234ms **P50**: 180ms **P95**: 450ms **P99**: 890ms **Slow Requests** (>1s): - /api/search: 2.3s avg (45 requests) - /api/reports: 1.8s avg (23 requests) **Fast Requests** (<100ms): - /api/health: 5ms avg - /api/status: 12ms avg
Usage Examples
code
@log-analyzer @log-analyzer app.log @log-analyzer --errors-only @log-analyzer --time-range "last 24h" @log-analyzer --pattern "database" @log-analyzer --format json
Report Format
markdown
# Log Analysis Report **Period**: 2024-01-15 00:00:00 to 2024-01-15 23:59:59 **Log File**: /var/log/app.log **Total Entries**: 145,678 **Errors**: 2,345 (1.6%) **Warnings**: 8,901 (6.1%) --- ## Executive Summary - **Critical Issues**: 3 - **High Priority**: 8 - **Medium Priority**: 15 - **Overall Health**: ⚠️ Degraded (Database issues detected) ### Key Findings 1. Database connection pool exhaustion at 09:00-09:30 2. Rate limiting triggered for 2 IP addresses 3. Slow query performance on search endpoint 4. Memory leak warning in worker service --- ## Critical Issues ### 1. Database Connection Pool Exhaustion **Severity**: Critical **Occurrences**: 1,234 **Time Range**: 09:00:00 - 09:30:00 **Impact**: Service degradation, failed requests **Error Pattern**:
Error: connect ETIMEDOUT Error: Too many connections Error: Connection pool timeout
code
**Root Cause Analysis**:
- Traffic spike (300% increase)
- Connection pool size: 10 (insufficient)
- Connections not being released properly
- No connection timeout configured
**Recommendations**:
1. Increase connection pool size to 50
2. Implement connection timeout (30s)
3. Review connection release logic
4. Add connection pool monitoring
5. Implement circuit breaker pattern
**Code Fix**:
```javascript
// Increase pool size
const pool = new Pool({
max: 50, // was: 10
min: 5,
acquireTimeoutMillis: 30000,
idleTimeoutMillis: 30000
});
// Ensure connections are released
try {
const client = await pool.connect();
const result = await client.query('SELECT * FROM users');
return result;
} finally {
client.release(); // Always release!
}
2. Memory Leak in Worker Service
Severity: Critical First Detected: 06:00:00 Pattern: Memory usage increasing 50MB/hour
Evidence:
code
06:00 - Memory: 512MB 09:00 - Memory: 662MB 12:00 - Memory: 812MB 15:00 - Memory: 962MB (WARNING threshold)
Likely Causes:
- •Event listeners not cleaned up
- •Cached data not being cleared
- •Circular references
Recommendations:
- •Add heap snapshot analysis
- •Review event listener cleanup
- •Implement cache eviction policy
- •Monitor with heap profiler
High Priority Issues
3. Slow Search Query Performance
Severity: High Endpoint: /api/search Occurrences: 45 requests Average Response: 2.3s (target: <500ms)
Slow Query Examples:
code
2024-01-15 10:15:23 WARN [SearchService] Query took 2,345ms SELECT * FROM products WHERE name LIKE '%keyword%' Rows examined: 1,234,567
Recommendations:
- •Add full-text search index
- •Implement pagination (limit results)
- •Use Elasticsearch for search
- •Add query result caching
4. Rate Limit Violations
Severity: High Affected IPs: 2 Requests Blocked: 345
Details:
- •
IP: 192.168.1.100 (245 blocked requests)
- •Pattern: Automated scraping
- •Recommendation: Consider permanent block
- •
IP: 10.0.0.50 (100 blocked requests)
- •Pattern: Burst traffic from legitimate user
- •Recommendation: Increase rate limit for authenticated users
Error Distribution
By Severity
- •ERROR: 2,345 (1.6%)
- •WARN: 8,901 (6.1%)
- •INFO: 134,432 (92.3%)
By Service
- •api: 1,567 errors
- •worker: 456 errors
- •scheduler: 234 errors
- •auth: 88 errors
By Error Type
- •Database errors: 1,234 (52.6%)
- •Validation errors: 567 (24.2%)
- •Rate limit errors: 345 (14.7%)
- •Authentication errors: 199 (8.5%)
Performance Metrics
Response Times
| Endpoint | Avg | P50 | P95 | P99 | Max |
|---|---|---|---|---|---|
| /api/users | 123ms | 95ms | 230ms | 450ms | 890ms |
| /api/search | 2,300ms | 1,800ms | 4,500ms | 6,200ms | 8,900ms |
| /api/posts | 156ms | 120ms | 280ms | 520ms | 780ms |
| /api/health | 5ms | 4ms | 8ms | 12ms | 25ms |
Traffic Patterns
- •Peak: 09:15:00 (1,234 req/min)
- •Average: 410 req/min
- •Quiet Period: 02:00-05:00 (45 req/min)
User Activity
Top Users by Request Count
- •User ID 12345: 2,345 requests
- •User ID 67890: 1,890 requests
- •User ID 11111: 1,456 requests
Failed Authentication Attempts
- •Total: 199
- •Unique Users: 45
- •Suspicious Pattern: User 99999 (23 failed attempts)
Recommendations
Immediate Actions (Today)
- •✓ Increase database connection pool
- •✓ Investigate memory leak in worker
- •✓ Block suspicious IP (192.168.1.100)
- •✓ Add monitoring for connection pool
Short Term (This Week)
- •Optimize search queries
- •Implement query result caching
- •Review event listener cleanup
- •Add circuit breaker for database
- •Increase rate limits for authenticated users
Long Term (This Month)
- •Migrate search to Elasticsearch
- •Implement comprehensive APM
- •Add automated log analysis
- •Set up predictive alerting
- •Improve error handling and logging
Logging Improvements
Missing Information
- •Request IDs (for tracing)
- •User context in some services
- •Performance metrics in worker logs
- •Structured error codes
Suggested Log Format
json
{
"timestamp": "2024-01-15T10:30:00.000Z",
"level": "error",
"requestId": "req-abc-123",
"service": "api",
"userId": "12345",
"endpoint": "/api/users",
"method": "GET",
"statusCode": 500,
"duration": 234,
"error": {
"code": "DB_CONNECTION_ERROR",
"message": "Database connection failed",
"stack": "..."
}
}
Monitoring Alerts to Set Up
- •Database Connection Errors > 10/min
- •Response Time P95 > 500ms
- •Error Rate > 2%
- •Memory Usage > 80%
- •Rate Limit Hits > 100/hour from single IP
code
## Analysis Techniques
### Regular Expression Patterns
```bash
# Find all errors
grep -E "ERROR|Exception|Failed" app.log
# Extract timestamps and errors
grep "ERROR" app.log | awk '{print $1, $2, $4}'
# Count error types
grep "ERROR" app.log | cut -d':' -f2 | sort | uniq -c | sort -nr
# Find slow requests
awk '$7 > 1000 {print $0}' access.log # Response time > 1s
Time-Based Analysis
bash
# Errors per hour
awk '{print $1" "$2}' app.log | cut -d':' -f1 | uniq -c
# Peak error times
grep "ERROR" app.log | cut -d' ' -f2 | cut -d':' -f1 | sort | uniq -c | sort -nr
Tools Integration
- •Elasticsearch + Kibana: Centralized logging and visualization
- •Splunk: Enterprise log management
- •Datadog: APM and log analysis
- •CloudWatch: AWS log aggregation
- •Grafana Loki: Open-source log aggregation
- •Papertrail: Simple log management
Notes
- •Always consider log volume and retention
- •Implement log rotation and archiving
- •Use structured logging (JSON) for easier parsing
- •Include request IDs for distributed tracing
- •Set up alerts for critical error patterns
- •Regular log analysis prevents incidents
- •Correlation with metrics provides better insights