Error Analysis Skill
Aggregate and analyze errors from discovered error tracking sources with pattern detection.
When to Use
- Error spike investigation
- Pattern detection across services
- User impact assessment
- Correlation with deployments
- Error trend analysis
- The user asks about "errors", "exceptions", or "why failing"
Error Analysis Approach
1. Discover Error Tracking Sources
Use capability-based discovery:
    # List available MCP servers
    ListMcpResourcesTool

    # Scan for error tracking capabilities by analyzing server descriptions:
    # - error-tracking capability: Dedicated exception/error tracking
    # - apm capability: APM tools that include error tracking
    # - logging capability: Log platforms with error search/aggregation
2. Aggregate Error Data
For each discovered source:
- Current error rate vs baseline
- Top errors by frequency
- New vs recurring errors
- User sessions affected
- Error distribution by service/region
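The aggregation above can be sketched in Python. This is a minimal sketch, assuming each source has already been normalized into simple records. The `ErrorEvent` type, its field names, and the `summarize` function are hypothetical and not part of any error tracker's API.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorEvent:
    """Hypothetical normalized error record; field names are assumptions."""
    fingerprint: str  # grouping key, e.g. exception type plus top frame
    service: str
    user_id: str
    timestamp: float  # seconds since epoch


def summarize(events, request_count, baseline_rate, known_fingerprints, top_n=3):
    """Summarize one analysis window of error events."""
    by_fingerprint = Counter(e.fingerprint for e in events)
    rate = len(events) / request_count if request_count else 0.0
    return {
        "error_rate": rate,
        # Ratio to baseline; None when no baseline is available.
        "rate_vs_baseline": rate / baseline_rate if baseline_rate else None,
        "top_errors": by_fingerprint.most_common(top_n),
        # Fingerprints not seen before this window count as new.
        "new_errors": sorted(set(by_fingerprint) - set(known_fingerprints)),
        "affected_users": len({e.user_id for e in events}),
        "by_service": dict(Counter(e.service for e in events)),
    }
```

Reporting the ratio to baseline alongside the raw rate keeps the summary comparable across services with very different traffic volumes.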
3. Detect Patterns
Look for:
- Error spikes: Sudden increase in error rate
- New errors: First seen in recent time window
- Error clusters: Related errors happening together
- User impact: Same users hitting multiple errors
- Service correlation: Errors across dependent services
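Two of these checks, spikes and new errors, reduce to simple comparisons. The following is a rough sketch, assuming error counts are already bucketed into equal time intervals and that a first-seen timestamp is known per error. The function names are hypothetical, and the default factor of 2.0 mirrors the ">2x baseline" threshold this skill uses for ERROR_SPIKE.

```python
from statistics import mean


def detect_spike(counts_per_bucket, baseline_buckets, factor=2.0):
    """Return indices of buckets whose count exceeds factor x the baseline mean.

    The first `baseline_buckets` entries are treated as the baseline window.
    """
    if not 0 < baseline_buckets < len(counts_per_bucket):
        raise ValueError("need a non-empty baseline and at least one later bucket")
    threshold = mean(counts_per_bucket[:baseline_buckets]) * factor
    return [
        i
        for i, count in enumerate(counts_per_bucket)
        if i >= baseline_buckets and count > threshold
    ]


def detect_new_errors(first_seen, window_start):
    """Return fingerprints first seen at or after window_start, sorted."""
    return sorted(fp for fp, ts in first_seen.items() if ts >= window_start)
```

With a baseline mean of zero, any non-zero bucket is reported as a spike, which may or may not be the desired behavior for low-traffic services.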
4. Correlate with Changes
Check for correlation with:
- Recent deployments
- Code changes via wicked-search
- Infrastructure changes
- Traffic patterns
- External dependency changes
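For deployments, correlation can start as a plain time-window check. The sketch below assumes deployment records are available as `(name, finished_at)` tuples; the function name and the 30-minute lookback are illustrative assumptions, not values defined by this skill.

```python
from datetime import datetime, timedelta


def correlate_with_deployments(error_start, deployments,
                               lookback=timedelta(minutes=30)):
    """Return deployments that finished within `lookback` before the errors
    began, most recent first.

    `deployments` is a list of (name, finished_at) tuples.
    """
    candidates = [
        (name, finished_at)
        for name, finished_at in deployments
        if timedelta(0) <= error_start - finished_at <= lookback
    ]
    return sorted(candidates, key=lambda d: d[1], reverse=True)


# Example usage with made-up deployment names and times.
start = datetime(2024, 1, 1, 12, 0)
deploys = [
    ("api v1", datetime(2024, 1, 1, 9, 0)),
    ("api v2", datetime(2024, 1, 1, 11, 50)),
    ("web v5", datetime(2024, 1, 1, 11, 40)),
    ("api v3", datetime(2024, 1, 1, 12, 10)),
]
suspects = correlate_with_deployments(start, deploys)
```

Temporal proximity only nominates suspects. A deployment that precedes the errors still needs confirming evidence, such as the failing code path appearing in its diff.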
5. Provide Investigation Path
Based on patterns:
- Root cause hypothesis
- Investigation steps
- Recommended actions (rollback, hotfix, etc.)
- Integration with wicked-engineering for debugging
Integration Discovery
| Capability | What to Look For | Provides |
|---|---|---|
| error-tracking | Exception tracking, crash reporting, error grouping | Stack traces, user context, grouping |
| apm | Performance monitoring with error tracking features | Errors with performance context |
| logging | Log platforms with error filtering and search | Error logs, patterns, search |
Fallback: Search code for error patterns via wicked-search (catch blocks, error handling, throw statements).
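One way to apply the table is a keyword scan over each server's description. The sketch below is illustrative only: the keyword lists are assumptions loosely derived from the "What to Look For" column, and `detect_capabilities` is a hypothetical helper, not an MCP API.

```python
# Assumed keywords per capability; tune these for the servers actually present.
CAPABILITY_KEYWORDS = {
    "error-tracking": ("exception tracking", "crash reporting", "error grouping"),
    "apm": ("performance monitoring", "apm"),
    "logging": ("log search", "log aggregation", "logging"),
}


def detect_capabilities(description):
    """Return the sorted capabilities whose keywords appear in a description."""
    text = description.lower()
    return sorted(
        capability
        for capability, keywords in CAPABILITY_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    )
```

An empty result for every server is the signal to use the wicked-search fallback.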
Output Format
## Error Analysis Report
**Analysis Time**: {timestamp}
**Time Range**: {period analyzed}
**Data Sources**: {list of integrations}
### Error Summary
**Current Error Rate**: {rate} ({change} from baseline)
**Total Errors**: {count} in last {period}
**Unique Errors**: {count} distinct error types
**Affected Users**: {count or percentage}
### Top Errors
| Error | Count | Users | First Seen | Trend |
|-------|-------|-------|------------|-------|
| {message} | {count} | {users} | {time} | {↑↓→} |
### Pattern Detection
**Pattern: {Pattern Name}**
- Type: [ERROR_SPIKE | NEW_ERROR | CASCADING_FAILURE | USER_CLUSTER]
- Description: {what the pattern indicates}
- Affected: {services, users, regions}
- Started: {timestamp}
- Correlation: {deployment, code change, etc.}
### Investigation Path
**Hypothesis**: {most likely root cause}
**Evidence**:
1. {supporting evidence point}
2. {supporting evidence point}
**Next Steps**:
1. {specific action to take}
2. {specific action to take}
**Engage**: wicked-engineering:debugger for code-level analysis
Common Error Patterns
Four main patterns to detect. See refs/patterns.md for detailed analysis guides.
ERROR_SPIKE
Sudden increase in error rate (>2x baseline). Often deployment-related.
NEW_ERROR
Error never seen before. Indicates new code path or edge case.
CASCADING_FAILURE
One error causing downstream errors. Check service dependencies.
USER_CLUSTER
Same users experiencing multiple errors. Often points to a user-specific data issue.
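The two structural patterns, CASCADING_FAILURE and USER_CLUSTER, can be sketched as small set operations. This is a minimal sketch under assumed inputs: a service dependency map and `(user_id, fingerprint)` pairs. Both function names are hypothetical.

```python
from collections import defaultdict


def cascade_roots(failing, depends_on):
    """Return failing services whose own dependencies are all healthy.

    These are the likeliest origins of a cascading failure.
    `depends_on` maps a service to the services it calls.
    """
    failing = set(failing)
    return sorted(
        service
        for service in failing
        if not (set(depends_on.get(service, ())) & failing)
    )


def find_user_clusters(events, min_distinct_errors=2):
    """Return users who hit at least `min_distinct_errors` different errors.

    `events` is an iterable of (user_id, fingerprint) pairs.
    """
    errors_by_user = defaultdict(set)
    for user_id, fingerprint in events:
        errors_by_user[user_id].add(fingerprint)
    return {
        user: sorted(fingerprints)
        for user, fingerprints in errors_by_user.items()
        if len(fingerprints) >= min_distinct_errors
    }
```

For example, if `web` calls `api`, `api` calls `db`, and all three are failing, `cascade_roots` returns only `db`.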
Integration with wicked-crew
When the crew completes the build phase:
- Compare error rates pre/post deployment
- Alert on new errors
- Detect error spikes
- Recommend rollback if critical
Emit events:
- observe:error:spike:warning
- observe:error:pattern:warning
- observe:correlation:found:success
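A post-deployment check might tie the comparison to the events as follows. This is a sketch under stated assumptions: the mapping of conditions to event names is a guess at the intent, the 2x and 10x factors are borrowed from this document's ERROR_SPIKE and CRITICAL thresholds, and `post_deploy_check` is a hypothetical function. The correlation event is left out because the source does not say when it fires.

```python
SPIKE_EVENT = "observe:error:spike:warning"
PATTERN_EVENT = "observe:error:pattern:warning"


def post_deploy_check(pre_rate, post_rate, new_errors,
                      spike_factor=2.0, rollback_factor=10.0):
    """Compare error rates around a deployment and pick events to emit."""
    if pre_rate > 0:
        ratio = post_rate / pre_rate
    else:
        # No errors before the deployment: any error afterwards is unbounded growth.
        ratio = float("inf") if post_rate > 0 else 1.0
    events = []
    if ratio > spike_factor:
        events.append(SPIKE_EVENT)
    if new_errors:  # assumption: new errors are reported as a pattern warning
        events.append(PATTERN_EVENT)
    return {
        "ratio": ratio,
        "events": events,
        "recommend_rollback": ratio > rollback_factor,
    }
```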
Integration with wicked-engineering
When errors are detected, engage the debugger with context:
- Error stack traces from discovered sources
- Log context from logging integrations
- Recent code changes from git
- Deployment timeline
Severity Classification
See refs/severity.md for detailed classification.
- CRITICAL: Error rate >10x baseline, critical path errors, data corruption
- HIGH: Error rate >3x baseline, affecting >10% users, payment/security errors
- MEDIUM: Error rate >1.5x baseline, affecting <10% users, non-critical features
- LOW: Error rate <1.5x baseline, cosmetic issues, logging errors
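These thresholds translate into a short classifier. The sketch below assumes that any single criterion is enough to reach a level, which the summary above does not state explicitly; refs/severity.md may define the combination differently. The function and parameter names are hypothetical.

```python
def classify_severity(rate_ratio, affected_user_fraction=0.0,
                      critical_path=False, data_corruption=False,
                      payment_or_security=False):
    """Map error signals to a severity level.

    `rate_ratio` is the current error rate divided by the baseline rate.
    Assumption: meeting any one criterion of a level is sufficient.
    """
    if rate_ratio > 10 or critical_path or data_corruption:
        return "CRITICAL"
    if rate_ratio > 3 or affected_user_fraction > 0.10 or payment_or_security:
        return "HIGH"
    if rate_ratio > 1.5:
        return "MEDIUM"
    return "LOW"
```

Checking the levels from most to least severe means the first match wins, so a data corruption error is CRITICAL even when the error rate is at baseline.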
Notes
- Focus on error rate changes, not absolute numbers
- New errors often more critical than recurring ones
- Always correlate with deployments and code changes
- Distinguish between symptoms and root causes
- Use traces to understand error propagation
- Consider user impact, not just error count