SKILL.md
---
name: error-patterns
description: Error recognition and diagnosis patterns for infrastructure troubleshooting. Use when identifying, categorizing, or resolving errors across platforms.
allowed-tools: Bash, Grep, Read, mcp__plugin_supabase_supabase__get_logs, mcp__plugin_supabase_supabase__execute_sql
---

Error Patterns Skill

Overview

This skill provides knowledge for recognizing, categorizing, and resolving common infrastructure errors. It covers error classification, diagnostic techniques, and resolution strategies.

Error Classification Framework

By Severity

| Severity | Definition | Response Time | Example |
| --- | --- | --- | --- |
| Critical | Service completely down | Immediate | Database unreachable |
| High | Major functionality broken | < 1 hour | Auth failures |
| Medium | Partial functionality affected | < 4 hours | Slow queries |
| Low | Minor issues, workarounds exist | < 24 hours | Deprecation warnings |

By Category

| Category | Subcategories | Typical Causes |
| --- | --- | --- |
| Database | Connection, Query, Transaction, Replication | Pool exhaustion, locks, slow queries |
| Network | DNS, Timeout, Connection | Misconfiguration, service down |
| Authentication | Token, Permission, Provider | Expired tokens, wrong credentials |
| Application | Logic, Memory, Timeout | Bugs, resource leaks |
| Infrastructure | Disk, CPU, Memory | Resource exhaustion |
| External | API, Service, Rate limit | Third-party issues |

By Pattern Type

| Pattern | Description | Example |
| --- | --- | --- |
| Transient | Self-resolving, retry works | Network blip |
| Persistent | Consistent, needs fix | Misconfiguration |
| Cascading | One failure causes others | DB down → API errors |
| Intermittent | Random occurrence | Race condition |
| Load-dependent | Appears under load | Connection exhaustion |
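
The three classification axes above can be carried together in a single record. The following is a minimal sketch, not part of the skill itself: the names `Severity`, `PatternType`, and `ClassifiedError` are hypothetical, and the response-time targets are taken from the severity table.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Severity(Enum):
    # Value is the response-time target in hours from the severity table;
    # None stands for "Immediate".
    CRITICAL = None
    HIGH = 1
    MEDIUM = 4
    LOW = 24


class PatternType(Enum):
    TRANSIENT = "transient"
    PERSISTENT = "persistent"
    CASCADING = "cascading"
    INTERMITTENT = "intermittent"
    LOAD_DEPENDENT = "load-dependent"


@dataclass(frozen=True)
class ClassifiedError:
    """Hypothetical record combining severity, category, and pattern type."""

    message: str
    severity: Severity
    category: str  # e.g. "Database", "Network", "Authentication"
    pattern: PatternType

    def response_deadline_hours(self) -> Optional[int]:
        """Return the response target in hours, or None for 'Immediate'."""
        return self.severity.value

    def retry_is_reasonable(self) -> bool:
        """Only transient errors are described as resolving on retry."""
        return self.pattern is PatternType.TRANSIENT


err = ClassifiedError(
    message="Connection refused to db:5432",
    severity=Severity.CRITICAL,
    category="Database",
    pattern=PatternType.PERSISTENT,
)
```

Keeping the pattern type next to the severity makes it harder to retry blindly on an error that actually needs a fix.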

Diagnostic Methodology

The 5 Whys

Ask "why" at each level until you reach the root cause:

code
Symptom: API returning 500 errors
  Why? → Database query failing
    Why? → Connection timeout
      Why? → Connection pool exhausted
        Why? → Connections not released
          Why? → Missing finally block in error handler

ROOT CAUSE: Code bug in error handling
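
The root cause in this example, a connection that is never released when the handler fails, can be reproduced in a few lines. This is an illustrative sketch only: `ToyPool` and both handlers are hypothetical stand-ins, not the API of any real database driver.

```python
class ToyPool:
    """Hypothetical fixed-size pool, used only to illustrate the leak."""

    def __init__(self, size: int) -> None:
        self.available = size

    def acquire(self) -> int:
        if self.available == 0:
            raise RuntimeError("connection pool exhausted")
        self.available -= 1
        return self.available  # stand-in for a connection handle

    def release(self, conn: int) -> None:
        self.available += 1


def leaky_handler(pool: ToyPool, fail: bool) -> None:
    conn = pool.acquire()
    if fail:
        raise ValueError("query failed")  # release() is never reached
    pool.release(conn)


def safe_handler(pool: ToyPool, fail: bool) -> None:
    conn = pool.acquire()
    try:
        if fail:
            raise ValueError("query failed")
    finally:
        pool.release(conn)  # runs on success and on error


def run_failures(handler, pool: ToyPool, attempts: int) -> int:
    """Run failing requests; return how many connections remain available."""
    for _ in range(attempts):
        try:
            handler(pool, fail=True)
        except ValueError:
            pass  # the query error itself
        except RuntimeError:
            break  # pool exhausted, later requests cannot connect
    return pool.available
```

With a pool of three, `run_failures(leaky_handler, ToyPool(3), 5)` drains the pool completely, while the `finally` variant leaves all three connections available.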

Timeline Analysis

Map events chronologically:

code
T-60m: Deployment completed
T-45m: Memory usage started climbing
T-30m: First slow query warning
T-15m: Connection pool warnings
T-0:   Service unavailable
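
A timeline like the one above can be generated from timestamped events. The sketch below assumes events arrive as `(timestamp, description)` pairs; the function name `build_timeline` and the date used in the example are arbitrary.

```python
from datetime import datetime, timedelta


def build_timeline(events, incident_at):
    """Sort (timestamp, description) pairs and label each one with its
    offset in minutes relative to the incident time."""
    lines = []
    for ts, description in sorted(events):
        delta_minutes = round((ts - incident_at).total_seconds() / 60)
        if delta_minutes == 0:
            label = "T-0"
        else:
            # The sign is kept explicitly: T-15m is before, T+5m is after.
            label = f"T{delta_minutes:+d}m"
        lines.append(f"{label}: {description}")
    return lines


incident = datetime(2024, 1, 1, 15, 0)  # arbitrary example time
events = [
    (incident, "Service unavailable"),
    (incident - timedelta(minutes=60), "Deployment completed"),
    (incident - timedelta(minutes=15), "Connection pool warnings"),
]
timeline = build_timeline(events, incident)
```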

Fault Tree

Break down possible causes:

code
                [Service Down]
                      |
        +-------------+-------------+
        |             |             |
    [Database]    [Network]    [Application]
        |             |             |
    +---+---+     +---+---+     +---+---+
    |       |     |       |     |       |
 [Conn]  [Query] [DNS]  [FW]  [OOM]  [Bug]
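
The same tree can be held as nested data so that every candidate cause is listed and checked off in turn. A minimal sketch, assuming a plain nested `dict` is sufficient; `FAULT_TREE` and `leaf_paths` are hypothetical names.

```python
# The fault tree from the diagram; an empty dict marks a leaf cause.
FAULT_TREE = {
    "Service Down": {
        "Database": {"Conn": {}, "Query": {}},
        "Network": {"DNS": {}, "FW": {}},
        "Application": {"OOM": {}, "Bug": {}},
    }
}


def leaf_paths(tree, prefix=()):
    """Return every root-to-leaf path; each path is one candidate cause
    to rule in or out."""
    paths = []
    for name, children in tree.items():
        path = prefix + (name,)
        if children:
            paths.extend(leaf_paths(children, path))
        else:
            paths.append(path)
    return paths


candidates = [" > ".join(path) for path in leaf_paths(FAULT_TREE)]
```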

Error Resolution Process

Step 1: Identify

  • What is the exact error message?
  • When did it start?
  • What's the impact?

Step 2: Categorize

  • Which category does this fall into?
  • Is it transient or persistent?
  • What's the severity?

Step 3: Investigate

  • Gather relevant logs
  • Check recent changes
  • Look for patterns

Step 4: Diagnose

  • Apply 5 Whys
  • Build timeline
  • Identify root cause

Step 5: Remediate

  • Apply immediate fix
  • Verify resolution
  • Document for prevention

Error Correlation Techniques

Cross-Platform Correlation

Match errors across systems:

code
14:30:01 [Railway]  Connection refused to db:5432
14:30:01 [Supabase] Too many connections
14:30:00 [GitHub]   Deployment completed
↑ Deployment triggered connection spike
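
One way to automate this matching is to merge log lines from all platforms, sort them by time, and group entries that fall close together. The sketch below is an assumption about how that could look: the `(timestamp, source, message)` tuple shape, the five-second window, and the extra unrelated log line are all illustrative.

```python
from datetime import datetime, timedelta


def correlate(entries, window_seconds=5):
    """Group (timestamp, source, message) entries that occur within
    window_seconds of the first entry in the group, and keep only the
    groups that span more than one platform."""
    groups = []
    for entry in sorted(entries, key=lambda e: e[0]):
        window = timedelta(seconds=window_seconds)
        if groups and entry[0] - groups[-1][0][0] <= window:
            groups[-1].append(entry)
        else:
            groups.append([entry])
    return [g for g in groups if len({source for _, source, _ in g}) > 1]


def at(hms):
    """Parse an HH:MM:SS time; the date part is irrelevant here."""
    return datetime.strptime(hms, "%H:%M:%S")


logs = [
    (at("14:30:01"), "Railway", "Connection refused to db:5432"),
    (at("14:30:01"), "Supabase", "Too many connections"),
    (at("14:30:00"), "GitHub", "Deployment completed"),
    (at("09:12:44"), "Railway", "Health check passed"),  # hypothetical noise
]
incidents = correlate(logs)
```

Sorting first puts the deployment event at the head of the group, which is usually where the trigger sits.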

Error Chains

Follow the cascade:

code
[1] Initial: Database connection timeout
[2] Result:  API endpoint returns 500
[3] Result:  Frontend shows error page
[4] Result:  User reports "site is down"

Impact Mapping

code
Error: Auth service down
├── Direct Impact
│   └── No new logins
├── Cascade Impact
│   ├── API requests fail (no token validation)
│   └── Realtime connections drop
└── User Impact
    └── All users affected

Resolution Strategies

Immediate Mitigation

| Strategy | Use When | Example |
| --- | --- | --- |
| Rollback | Recent deployment caused issue | git revert |
| Restart | Service stuck/crashed | Container restart |
| Scale up | Resource exhaustion | Add replicas |
| Failover | Primary system down | Switch to backup |
| Rate limit | Overload | Block/throttle traffic |
| Circuit break | Cascading failures | Disable failing component |

Root Cause Fix

| Cause | Fix Approach |
| --- | --- |
| Code bug | Deploy fix, add tests |
| Configuration | Update config, validate |
| Resource limit | Increase limits or optimize |
| External dependency | Add retry/fallback |
| Infrastructure | Scale or redesign |
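
For the "Add retry/fallback" approach, a common shape is a bounded retry with exponential backoff that falls back to a degraded result. This is a sketch under assumptions: `TransientError` is a hypothetical marker exception, and the attempt count and delays are placeholders to tune per dependency.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for errors considered safe to retry."""


def retry_with_fallback(operation, fallback, attempts=3, base_delay=0.1,
                        sleep=time.sleep):
    """Call operation; on TransientError wait base_delay * 2**attempt and
    try again. After the last failed attempt, return fallback()."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return fallback()


# Example: the dependency fails twice, then recovers.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("network blip")
    return "ok"


delays = []  # record the delays instead of actually sleeping
result = retry_with_fallback(flaky, lambda: "cached", sleep=delays.append)
```

Only errors classified as transient are retried here; a persistent error such as a misconfiguration would propagate immediately instead of being masked.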

Prevention

| Issue | Prevention |
| --- | --- |
| Connection leaks | Connection pooling, timeouts |
| Memory leaks | Profiling, limits |
| Slow queries | Indexes, query optimization |
| Deployment failures | Canary deployments, rollback automation |
| External failures | Circuit breakers, fallbacks |
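
A circuit breaker, listed above for external failures, stops calling a dependency after repeated errors and allows a trial call once a cool-down has passed. The following is a simplified sketch, not a production implementation: the class name, the thresholds, and the injectable `clock` parameter are assumptions made for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call skipped")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


# Demo with a fake clock so no real waiting is needed.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_after=10.0,
                         clock=lambda: now[0])


def failing_call():
    raise ConnectionError("dependency down")


outcomes = []
for _ in range(3):
    try:
        breaker.call(failing_call)
    except CircuitOpenError:
        outcomes.append("skipped")
    except ConnectionError:
        outcomes.append("failed")

now[0] = 11.0  # cool-down elapsed
outcomes.append(breaker.call(lambda: "ok"))
```

Skipping calls while the circuit is open is what limits a cascade: callers fail fast instead of piling up on a dependency that is already down.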

Common Resolution Templates

Database Connection Issues

markdown
## Issue: Database Connection Error

### Immediate Actions
1. Check connection count:
   SELECT count(*) FROM pg_stat_activity;
2. Identify connections stuck idle in a transaction:
   SELECT * FROM pg_stat_activity WHERE state = 'idle in transaction';
3. Kill stuck connections if safe:
   SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE ...;

### Root Cause Fix
- Add connection pooling (PgBouncer)
- Implement connection timeouts
- Fix connection leak in application code

### Prevention
- Monitor connection metrics
- Alert on pool usage > 80%
- Regular connection audits
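
The "Alert on pool usage > 80%" rule from the template above can be expressed as a small check. This sketch assumes the active count comes from the `pg_stat_activity` query in step 1 and the limit from PostgreSQL's `SHOW max_connections`; the function name and message format are hypothetical.

```python
def pool_usage_alert(active, max_connections, threshold=0.80):
    """Return an alert message when usage is strictly above threshold,
    otherwise None."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    usage = active / max_connections
    if usage > threshold:
        return (f"ALERT: connection usage {usage:.0%} "
                f"exceeds {threshold:.0%}")
    return None


alert = pool_usage_alert(active=85, max_connections=100)
```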

API Error Spike

markdown
## Issue: API 500 Errors

### Immediate Actions
1. Check API logs for error pattern
2. Identify failing endpoint(s)
3. Check downstream dependencies

### Root Cause Fix
- Fix code bug causing exception
- Handle edge cases
- Add proper error handling

### Prevention
- Add error monitoring
- Implement circuit breakers
- Add integration tests
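
To identify the failing endpoints named in the template above, server errors can be counted per path. A minimal sketch, assuming the logs have already been parsed into `(path, status)` pairs; the paths shown are made-up examples.

```python
from collections import Counter


def failing_endpoints(records, min_status=500):
    """Count responses with status >= min_status per path, most
    frequent first."""
    counts = Counter(path for path, status in records
                     if status >= min_status)
    return counts.most_common()


# Hypothetical parsed log records: (path, HTTP status).
records = [
    ("/api/orders", 500),
    ("/api/orders", 502),
    ("/api/users", 200),
    ("/api/login", 500),
    ("/api/orders", 200),
]
ranking = failing_endpoints(records)
```

A ranking dominated by one path points at a code bug in that handler, while errors spread evenly across paths suggest a shared downstream dependency.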

See common-errors.md for a catalog of specific errors and solutions.