error-recovery

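WHAT: Handle all kinds of errors gracefully through retry logic, automatic recovery, and graceful degradation. WHEN: Triggered when the user says "check errors", "recover system", or "retry failed". Trigger conditions: system failures, API errors, transient issues, watchdog alerts.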

SKILL.md
--- frontmatter
name: error-recovery
description: "WHAT: Handle errors gracefully with retry logic, automatic recovery, and graceful degradation. WHEN: User says 'check errors', 'recover system', 'retry failed'. Trigger on: system failures, API errors, transient issues, watchdog alerts."

Error Recovery & Graceful Degradation

When to Use

  • Handling transient errors (network timeouts, rate limits)
  • Recovering from authentication failures
  • Managing component outages gracefully
  • Retrying failed operations
  • Quarantining corrupted data
  • Running watchdog health checks

Error Categories

| Category | Examples | Recovery Strategy |
| --- | --- | --- |
| Transient | Network timeout, rate limit | Exponential backoff retry |
| Authentication | Expired token, revoked access | Token refresh or human alert |
| Logic | Misinterpreted data | Human review queue |
| Data | Corrupted file, missing field | Quarantine + alert |
| System | Process crash, disk full | Watchdog + auto-restart |
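
The categories above can be sketched as a small classifier that routes an exception to its recovery strategy. This is only an illustration: `ErrorCategory`, `RECOVERY_STRATEGY`, and `classify_error` are hypothetical names, and the mapping from built-in Python exceptions to categories is an assumption that a real project would replace with its own error types.

```python
from enum import Enum


class ErrorCategory(Enum):
    TRANSIENT = "transient"
    AUTHENTICATION = "authentication"
    LOGIC = "logic"
    DATA = "data"
    SYSTEM = "system"


# Recovery strategy per category, copied from the Error Categories table.
RECOVERY_STRATEGY = {
    ErrorCategory.TRANSIENT: "exponential backoff retry",
    ErrorCategory.AUTHENTICATION: "token refresh or human alert",
    ErrorCategory.LOGIC: "human review queue",
    ErrorCategory.DATA: "quarantine + alert",
    ErrorCategory.SYSTEM: "watchdog + auto-restart",
}


def classify_error(exc: BaseException) -> ErrorCategory:
    """Map an exception to a category (assumed mapping, adjust per project)."""
    # TimeoutError, ConnectionError and PermissionError are OSError subclasses,
    # so they must be checked before the generic OSError branch.
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return ErrorCategory.TRANSIENT
    if isinstance(exc, PermissionError):
        return ErrorCategory.AUTHENTICATION
    # json.JSONDecodeError is a ValueError; KeyError stands in for a missing field.
    if isinstance(exc, (ValueError, KeyError)):
        return ErrorCategory.DATA
    if isinstance(exc, (OSError, MemoryError)):
        return ErrorCategory.SYSTEM
    # Anything unrecognized goes to the human review queue.
    return ErrorCategory.LOGIC
```

Defaulting unknown errors to the Logic category is a conservative choice: an error nobody anticipated gets a human look instead of an automatic retry.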

Instructions

  1. Retry Failed Operation:

    bash
    python3 .claude/skills/error-recovery/scripts/main_operation.py --action retry \
      --operation-id OP_12345 \
      --max-attempts 3 \
      --backoff exponential
    
  2. Refresh Authentication Token:

    bash
    python3 .claude/skills/error-recovery/scripts/main_operation.py --action refresh-token \
      --service xero
    

    Services: gmail|xero|linkedin|facebook|instagram|twitter

  3. Quarantine Corrupted File:

    bash
    python3 .claude/skills/error-recovery/scripts/main_operation.py --action quarantine \
      --file "Needs_Action/corrupted_file.md" \
      --reason "JSON parse error"
    
  4. Check Component Health:

    bash
    python3 .claude/skills/error-recovery/scripts/main_operation.py --action health-check
    
  5. Process Recovery Queue:

    bash
    python3 .claude/skills/error-recovery/scripts/main_operation.py --action process-queue
    

    Retries queued operations that failed earlier.

  6. Run Watchdog Check:

    bash
    python3 .claude/skills/error-recovery/scripts/main_operation.py --action watchdog
    

    Checks and restarts crashed processes.

  7. View Error Log:

    bash
    python3 .claude/skills/error-recovery/scripts/main_operation.py --action errors --since 24h
    

Retry Configuration

python
RETRY_CONFIG = {
    "max_attempts": 3,
    "base_delay": 1,      # seconds
    "max_delay": 60,      # seconds
    "backoff": "exponential",  # or "linear"
    "jitter": True        # Add randomness to prevent thundering herd
}
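
A minimal sketch of how these settings could drive a retry loop. `compute_delay` and `retry` are hypothetical helpers, not the functions in `main_operation.py`, and the jitter shown is the "full jitter" variant (a uniform random delay between 0 and the capped backoff), which is one common choice among several.

```python
import random
import time

RETRY_CONFIG = {
    "max_attempts": 3,
    "base_delay": 1,      # seconds
    "max_delay": 60,      # seconds
    "backoff": "exponential",  # or "linear"
    "jitter": True,
}


def compute_delay(attempt, config=RETRY_CONFIG, rng=random.random):
    """Seconds to wait after failed attempt number `attempt` (1-based)."""
    if config["backoff"] == "exponential":
        delay = config["base_delay"] * 2 ** (attempt - 1)  # 1, 2, 4, ...
    else:
        delay = config["base_delay"] * attempt             # 1, 2, 3, ...
    delay = min(delay, config["max_delay"])                # never exceed the cap
    if config["jitter"]:
        delay = rng() * delay  # spread retries out so clients do not sync up
    return delay


def retry(operation, config=RETRY_CONFIG, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached."""
    for attempt in range(1, config["max_attempts"] + 1):
        try:
            return operation()
        except Exception:  # in real use, catch only transient error types
            if attempt == config["max_attempts"]:
                raise  # out of attempts: surface the last error
            sleep(compute_delay(attempt, config))
```

With jitter disabled, the waits after the first and second failures are 1 and 2 seconds; the third failure is re-raised because `max_attempts` is 3.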

Graceful Degradation Rules

| Component Down | System Behavior |
| --- | --- |
| Gmail API | Queue emails, process other watchers |
| Odoo API | Skip financial sync, use cached data |
| Social APIs | Queue posts, continue other operations |
| Orchestrator | Watchdog restarts within 30s |
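
The Gmail rule above can be sketched as a fallback wrapper: try the component, and if it is down, queue the work instead of failing the whole run. Everything here is illustrative. `degrade`, `send_email`, `gmail_down`, and the in-memory `pending_emails` queue are hypothetical names, and treating `ConnectionError` and `TimeoutError` as the outage signal is an assumption.

```python
from collections import deque

# In-memory stand-in for the on-disk recovery queue (assumption for this sketch).
pending_emails = deque()

# Errors treated as "component is down" (assumed; match these to your client library).
OUTAGE_ERRORS = (ConnectionError, TimeoutError)


def degrade(primary, fallback):
    """Run primary(); if the component is down, run fallback() instead of failing."""
    try:
        return primary()
    except OUTAGE_ERRORS:
        return fallback()


def send_email(message, gmail_send):
    """Send via gmail_send, or queue the message when the Gmail API is unreachable."""
    def queue_for_later():
        pending_emails.append(message)
        return "queued"
    return degrade(lambda: gmail_send(message), queue_for_later)


def gmail_down(message):
    """Simulated outage, used only to demonstrate the fallback path."""
    raise ConnectionError("Gmail API unreachable")


print(send_email("weekly report", gmail_down))  # prints "queued"
```

The same `degrade` helper would cover the Odoo rule by passing a fallback that returns cached data. Errors outside `OUTAGE_ERRORS` still propagate, so a logic bug is not silently hidden as an outage.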

Queue Locations

  • AI_Employee_Vault/Recovery_Queue/ - Failed operations awaiting retry
  • AI_Employee_Vault/Quarantine/ - Corrupted files with timestamps
  • AI_Employee_Vault/Alerts/ - Human review requests
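
As an illustration of "corrupted files with timestamps", a quarantine step might look like the sketch below. The `quarantine` function, the UTC timestamp prefix, and the `.reason.txt` sidecar file are assumptions made for the example; the actual naming scheme used by `main_operation.py` is not documented here.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path


def quarantine(file_path, reason, vault=Path("AI_Employee_Vault")):
    """Move a bad file into <vault>/Quarantine/ and record why it was moved."""
    src = Path(file_path)
    quarantine_dir = Path(vault) / "Quarantine"
    quarantine_dir.mkdir(parents=True, exist_ok=True)

    # Prefix the name with a UTC timestamp so repeated quarantines do not collide
    # (assumed naming scheme).
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dest = quarantine_dir / f"{stamp}_{src.name}"
    shutil.move(str(src), str(dest))

    # Sidecar note with the reason, so a human reviewer has context.
    note = dest.with_name(dest.name + ".reason.txt")
    note.write_text(reason + "\n", encoding="utf-8")
    return dest
```

Calling `quarantine("Needs_Action/corrupted_file.md", "JSON parse error")` would move the file out of the working folder, which keeps later runs from tripping over the same bad input.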

Watchdog Configuration

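yaml
# PM2 process settings (set per app in ecosystem.config.js); all times in milliseconds
restart_delay: 5000   # wait before restarting a crashed process
max_restarts: 10      # consecutive unstable restarts allowed before PM2 stops retrying
min_uptime: 30000     # a process must stay up this long to count as started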
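
To make the interaction between `max_restarts` and `min_uptime` concrete, here is a simplified model of the restart decision. It is a sketch of how such limits are commonly interpreted, not PM2's actual implementation, and `should_restart` is a hypothetical function.

```python
def should_restart(uptime_ms, unstable_restarts, min_uptime=30000, max_restarts=10):
    """Decide whether to restart a crashed process.

    Returns (restart, new_unstable_count).
    """
    if uptime_ms >= min_uptime:
        # The process ran long enough to count as started: reset the counter.
        return True, 0
    # Crashed too quickly: count it as an unstable restart.
    unstable_restarts += 1
    # Give up once the crash loop exceeds the allowed number of restarts.
    return unstable_restarts <= max_restarts, unstable_restarts
```

Under this model a process that keeps crashing within 30 seconds is restarted 10 times and then left stopped, which is the point at which a human alert makes more sense than another restart.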

Validation

  • Retry logic executes correctly
  • Token refresh works
  • Quarantine isolates bad data
  • Watchdog restarts processes
  • Error logs are comprehensive

See REFERENCE.md for error handling patterns.