AgentSkillsCN

error-recovery

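Error handling, checkpoint management, and recovery patterns for resilient agent workflows. Based on Anthropic's best practices for long-running agents and graceful degradation. Use when: - implementing complex workflows that may fail; - you need checkpoint and resume capabilities; - handling tool failures gracefully; - managing agent errors and retry logic; - building robust automation. Trigger phrases: error handling, recovery, checkpoint, resume, graceful degradation, retry, fallback.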

SKILL.md
--- frontmatter
name: error-recovery
description: |
  Error handling, checkpoint management, and recovery patterns for resilient agent workflows.
  Based on Anthropic's guidance for long-running agents and graceful degradation.

  Use when:
  - Implementing complex workflows that may fail
  - Need checkpoint/resume capabilities
  - Handling tool failures gracefully
  - Managing agent errors and retries
  - Building robust automation

  Trigger phrases: error handling, recovery, checkpoint, resume, graceful degradation, retry, fallback
allowed-tools: Read, Write, Edit, Glob, Grep, Bash, TodoWrite
model: sonnet
user-invocable: true

Error Recovery Patterns

Strategies for building resilient agent workflows that handle errors gracefully, checkpoint progress, and enable recovery from failures.

From Building Effective Agents:

"Agents should gain ground truth from the environment at each step (such as tool call results or code execution) to assess its progress."

"Agents can then pause for human feedback at checkpoints or when encountering blockers."

Core Principles

1. Fail Fast, Recover Quickly

Detect errors early, checkpoint frequently, resume from last known good state.

2. Ground Truth Verification

Always verify the result of each action before proceeding to the next.

3. Graceful Degradation

When the ideal path fails, fall back to alternative approaches rather than failing completely.

Checkpoint System

When to Checkpoint

| Event | Checkpoint Action |
|-------|-------------------|
| Feature completed | Update feature-list.json, commit |
| Significant file change | Update progress log |
| Before risky operation | Document current state |
| After successful test | Record passing state |
| Before external API calls | Save request context |

Checkpoint Format

json
{
  "checkpointId": "CP-001",
  "timestamp": "2025-01-16T14:30:00Z",
  "position": {
    "phase": "Implementation",
    "feature": "F003",
    "step": "Writing AuthService"
  },
  "state": {
    "filesModified": ["src/services/auth.ts"],
    "testsStatus": "passing",
    "lastSuccessfulAction": "Created AuthService class"
  },
  "recovery": {
    "nextAction": "Add login method to AuthService",
    "dependencies": ["User model exists", "Database connected"],
    "rollbackTo": "git commit abc123"
  }
}
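
As a minimal sketch of how a checkpoint in this shape could be persisted, the Python below writes the JSON to a temporary file and renames it over the target, so a crash mid-write does not leave a truncated checkpoint. The function names `save_checkpoint` and `load_checkpoint` are hypothetical and not part of this skill; the path is whatever progress file the workflow uses.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_checkpoint(path: Path, checkpoint: dict) -> None:
    """Write the checkpoint via a temp file plus rename (hypothetical helper)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(checkpoint, tmp, indent=2)
        os.replace(tmp_name, path)  # replaces the old checkpoint in one step
    except BaseException:
        # Remove the partial temp file and surface the original error.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def load_checkpoint(path: Path) -> Optional[dict]:
    """Return the last saved checkpoint, or None if nothing was saved yet."""
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))
```

On resume, `load_checkpoint` returning `None` would mean starting from the beginning; otherwise `recovery.nextAction` names the step to pick up.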

Checkpoint Implementation

markdown
## After each significant action:

1. **Verify Success**
   - Check tool output for errors
   - Run quick validation (lint, type check)
   - Confirm file was written correctly

2. **Record State**
   - Update .claude/workspaces/{workspace-id}/claude-progress.json
   - Add entry to progress log
   - Note files modified

3. **Document Recovery Path**
   - What to do if next step fails
   - How to roll back if needed
   - Dependencies for resumption

Error Categories and Responses

Category 1: Transient Errors

Temporary failures that may succeed on retry.

| Error Type | Example | Response |
|------------|---------|----------|
| Network timeout | API call failed | Retry with exponential backoff |
| Rate limiting | Too many requests | Wait and retry |
| Temporary file lock | File in use | Wait briefly, retry |

Retry Strategy with Exponential Backoff + Jitter:

code
max_retries = 3
base_delay = 1s

for attempt in 1..max_retries:
    result = try_operation()
    if result.success:
        return result
    if attempt < max_retries:
        # Exponential backoff with jitter prevents thundering herd
        # (no wait after the final attempt; escalate right away)
        jitter = random(0, 0.5 * base_delay)
        wait(base_delay * 2^attempt + jitter)

escalate_to_user("Operation failed after {max_retries} attempts")

Why Jitter Matters: Without jitter, multiple agents retrying simultaneously can overwhelm the system at the same intervals (thundering herd problem). Adding randomness spreads retry attempts.
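
The retry strategy above could be sketched in Python roughly as follows. `TransientError` and `retry_with_backoff` are hypothetical names introduced for illustration, and the `sleep` parameter exists only so the delay can be stubbed out in tests. The delay formula mirrors the pseudocode: `base_delay * 2^attempt` plus up to half a `base_delay` of jitter.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, rate limits)."""


def retry_with_backoff(
    operation: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation up to max_retries times, backing off between attempts."""
    for attempt in range(1, max_retries + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_retries:
                raise  # out of attempts: the caller escalates to the user
            # Jitter spreads out agents that would otherwise retry in lockstep.
            jitter = random.uniform(0, 0.5 * base_delay)
            sleep(base_delay * 2 ** attempt + jitter)
    raise ValueError("max_retries must be at least 1")
```

Only `TransientError` is retried here; any other exception propagates immediately, which keeps recoverable and fatal errors out of the retry loop.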

Category 2: Recoverable Errors

Errors that require a different approach but can still be handled.

| Error Type | Example | Response |
|------------|---------|----------|
| File not found | Expected file missing | Search for alternatives |
| Permission denied | Can't write to directory | Request user permission |
| Dependency missing | Package not installed | Install or use alternative |
| Test failure | New code breaks test | Analyze failure, fix code |

Recovery Strategy:

markdown
1. Log the error with full context
2. Analyze root cause
3. Determine alternative approach
4. If alternative exists:
   - Document deviation from original plan
   - Execute alternative
   - Verify success
5. If no alternative:
   - Document blocker
   - Ask user for guidance

Category 3: Fatal Errors

Errors that require human intervention.

| Error Type | Example | Response |
|------------|---------|----------|
| Authentication required | Missing API key | Ask user to provide |
| Data corruption | Invalid state | Stop and alert user |
| Security concern | Suspicious operation | Halt and report |
| Scope creep | Request exceeds boundaries | Clarify with user |

Fatal Error Protocol:

markdown
1. STOP all operations immediately
2. Checkpoint current state
3. Document error with full context:
   - What was attempted
   - What failed
   - Current state of files
   - Potential impact
4. Present clear options to user:
   - Fix and continue
   - Roll back and retry
   - Abort workflow
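
One way to route an error into the three categories is a small classifier. The sketch below is an assumption, not part of the skill: it maps a few built-in Python exceptions onto the categories from the tables above and treats anything it does not recognize as fatal, on the reasoning that stopping is the safer default. `Category` and `classify` are hypothetical names.

```python
from enum import Enum


class Category(Enum):
    TRANSIENT = "retry with backoff"
    RECOVERABLE = "try an alternative approach"
    FATAL = "stop and ask the user"


def classify(exc: BaseException) -> Category:
    """Map an exception to a response category (illustrative mapping only)."""
    # Timeouts and dropped connections may succeed on retry.
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return Category.TRANSIENT
    # Missing files, permissions, or packages call for a different approach.
    if isinstance(exc, (FileNotFoundError, PermissionError, ModuleNotFoundError)):
        return Category.RECOVERABLE
    # Assumption: unknown errors are treated as fatal rather than retried blindly.
    return Category.FATAL
```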

Graceful Degradation Patterns

Pattern 1: Fallback Chain

Try primary approach, fall back to alternatives.

code
Primary: Use preferred library
   │
   └─ (failed) ─▶ Fallback 1: Use alternative library
                      │
                      └─ (failed) ─▶ Fallback 2: Manual implementation
                                          │
                                          └─ (failed) ─▶ Ask user
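
The chain above can be sketched as a loop over named alternatives. This is a minimal illustration under assumed names (`run_fallback_chain`, `AllFallbacksFailed`): each step is tried in order, every failure is recorded, and only when all steps fail is the collected context raised so the user can be asked.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


class AllFallbacksFailed(Exception):
    """Raised when every step failed; carries each (name, error) pair."""

    def __init__(self, errors: List[Tuple[str, Exception]]):
        super().__init__("; ".join(f"{name}: {err}" for name, err in errors))
        self.errors = errors


def run_fallback_chain(steps: Sequence[Tuple[str, Callable[[], T]]]) -> Tuple[str, T]:
    """Return (step name, result) for the first step that succeeds."""
    errors: List[Tuple[str, Exception]] = []
    for name, step in steps:
        try:
            return name, step()
        except Exception as exc:
            # Keep the failure so the deviation from the plan can be documented.
            errors.append((name, exc))
    raise AllFallbacksFailed(errors)
```

Returning the step name alongside the result makes it easy to log which fallback was actually used.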

Pattern 2: Partial Success

Complete what's possible, report what's not.

markdown
## Partial Success Report

### Completed (3/5 features)
- [x] User registration
- [x] User login
- [x] Password reset

### Failed (2/5 features)
- [ ] OAuth integration - Error: Missing client_id
- [ ] 2FA - Error: SMS provider not configured

### Next Steps
1. Provide OAuth client_id in .env
2. Configure SMS provider in settings
3. Re-run /spec-plan for remaining features
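
A report like the one above could be produced by running every task, collecting failures instead of aborting on the first one, and rendering the two lists. The helper names `run_all` and `render_report` are hypothetical, and the sketch assumes each feature is a zero-argument callable that raises on failure.

```python
from typing import Callable, Dict, List, Tuple


def run_all(tasks: Dict[str, Callable[[], None]]) -> Tuple[List[str], Dict[str, str]]:
    """Run every task; return (completed names, {failed name: error message})."""
    completed: List[str] = []
    failed: Dict[str, str] = {}
    for name, task in tasks.items():
        try:
            task()
            completed.append(name)
        except Exception as exc:
            failed[name] = str(exc)  # record and move on to the next task
    return completed, failed


def render_report(completed: List[str], failed: Dict[str, str]) -> str:
    """Render the completed and failed sections in the report format above."""
    total = len(completed) + len(failed)
    lines = [f"### Completed ({len(completed)}/{total} features)"]
    lines += [f"- [x] {name}" for name in completed]
    lines += ["", f"### Failed ({len(failed)}/{total} features)"]
    lines += [f"- [ ] {name} - Error: {msg}" for name, msg in failed.items()]
    return "\n".join(lines)
```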

Pattern 3: Safe Mode

Continue with reduced functionality when errors occur.

markdown
Normal Mode:
- Full implementation with all features
- Complete test coverage
- Performance optimization

Safe Mode (on error):
- Core functionality only
- Basic tests
- Skip optimization
- Document what was skipped for later

Pattern 4: Circuit Breaker

Prevent cascading failures by stopping requests to failing services.

From AI Agent Best Practices:

"Retries and fallbacks try to recover from failures. Circuit breakers prevent a bad situation from spiraling further."

States:

code
CLOSED (normal) ──[failures >= threshold]──▶ OPEN (blocking)
     ▲                                           │
     │                                    [timeout expires]
     │                                           ▼
     └────────[success]────────── HALF-OPEN (testing)
                                           │
                                    [failure]
                                           ▼
                                      OPEN (blocking)

Implementation:

code
circuit_breaker:
  failure_threshold: 3       # Consecutive failures to open
  timeout: 30s               # Time before testing again
  state: CLOSED
  failure_count: 0           # Consecutive failures seen so far

on_operation():
  if state == OPEN:
    if timeout_expired:
      state = HALF_OPEN
    else:
      return "Service unavailable (circuit open)"

  result = try_operation()

  if success:
    if state == HALF_OPEN:
      state = CLOSED
    failure_count = 0
    return result
  else:
    failure_count++
    if failure_count >= failure_threshold:
      state = OPEN
      start_timeout()
    return error
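
A runnable Python sketch of the same state machine is shown below. The class and exception names are hypothetical, and the `clock` parameter is an assumption added so the timeout can be tested without waiting. A failure in HALF_OPEN re-opens the circuit immediately, matching the state diagram.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised instead of calling the operation while the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures to open
        self.timeout = timeout                      # seconds before testing again
        self.clock = clock
        self.state = "CLOSED"
        self.failure_count = 0
        self.opened_at = 0.0

    def call(self, operation: Callable[[], T]) -> T:
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.timeout:
                self.state = "HALF_OPEN"  # allow one trial request through
            else:
                raise CircuitOpenError("Service unavailable (circuit open)")
        try:
            result = operation()
        except Exception:
            self.failure_count += 1
            if self.state == "HALF_OPEN" or self.failure_count >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure streak.
        self.state = "CLOSED"
        self.failure_count = 0
        return result
```

Callers would typically catch `CircuitOpenError` and switch to a fallback or a partial result rather than retrying.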

When to Use:

| Scenario | Use Circuit Breaker |
|----------|---------------------|
| External API calls | Yes - prevents overwhelming failing API |
| Database operations | Yes - prevents connection exhaustion |
| File system operations | Maybe - depends on failure mode |
| In-memory operations | No - failures are immediate |

Agent-Specific Considerations:

When chaining multiple AI agents:

  • If each agent is 95% reliable, 3 agents = 86% overall reliability
  • Circuit breakers at each stage prevent cascading failures
  • Use partial results when possible instead of complete failure
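
The 86% figure follows from multiplying the per-agent success rates, assuming each agent fails independently of the others (an assumption; correlated failures would change the number). A short check, with `chain_reliability` as a hypothetical helper:

```python
import math
from typing import Iterable


def chain_reliability(step_reliabilities: Iterable[float]) -> float:
    """Probability that every step succeeds, assuming independent failures."""
    return math.prod(step_reliabilities)


print(f"{chain_reliability([0.95] * 3):.1%}")  # 85.7%
```

The same product shrinks quickly as the chain grows, which is why the source recommends a breaker at each stage and partial results where possible.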

Subagent Circuit Breaker Thresholds

Production-tested thresholds for autonomous agent loops (from Claude Code engineering best practices):

| Threshold | Value | Action |
|-----------|-------|--------|
| NO_PROGRESS | 3 loops | Stop after 3 loops with no file changes |
| SAME_ERROR | 5 times | Escalate after 5 identical errors |
| OUTPUT_DECLINE | 70% | Pause if output quality drops >70% |

Application: Track files_changed, error_messages, and output_quality for each subagent invocation. When thresholds are hit, stop retrying and communicate clearly with the user about the situation and options.
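
A minimal tracker for the first two thresholds might look like the sketch below. `LoopMonitor` is a hypothetical name, and the sketch assumes "identical errors" means consecutive invocations with the same message. OUTPUT_DECLINE is left out because the source does not say how output quality is measured.

```python
from dataclasses import dataclass
from typing import Optional

NO_PROGRESS_LIMIT = 3  # loops with no file changes
SAME_ERROR_LIMIT = 5   # identical errors in a row (assumed to be consecutive)


@dataclass
class LoopMonitor:
    loops_without_changes: int = 0
    last_error: Optional[str] = None
    same_error_count: int = 0

    def record(self, files_changed: int, error_message: Optional[str] = None) -> Optional[str]:
        """Record one subagent invocation; return the tripped threshold, if any."""
        if files_changed > 0:
            self.loops_without_changes = 0
        else:
            self.loops_without_changes += 1

        if error_message is None:
            self.last_error, self.same_error_count = None, 0
        elif error_message == self.last_error:
            self.same_error_count += 1
        else:
            self.last_error, self.same_error_count = error_message, 1

        if self.loops_without_changes >= NO_PROGRESS_LIMIT:
            return "NO_PROGRESS"
        if self.same_error_count >= SAME_ERROR_LIMIT:
            return "SAME_ERROR"
        return None
```

When `record` returns a threshold name, the loop would stop retrying and report the situation and options to the user.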

Recovery Workflows

Workflow 1: Resume After Crash

markdown
1. Identify current workspace ID (branch + path hash)
2. Read .claude/workspaces/{workspace-id}/claude-progress.json
3. Identify last checkpoint:
   - Position: "Phase 5, Feature F003, step 2"
   - Last action: "Created AuthService class"
   - Next action: "Add login method"
4. Verify file state:
   - Run `git status` to check uncommitted changes
   - Compare files to checkpoint expectation
5. If state is valid:
   - Continue from documented next action
6. If state is corrupted:
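   - Discard uncommitted changes: `git checkout -- .` (restores tracked files from the index; use `git reset --hard HEAD` to also drop staged changes)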
   - Resume from that checkpoint
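
Steps 4 to 6 could be sketched as a comparison between `git status --porcelain` output and the files the checkpoint expects to be modified. The validity rule used here (every dirty path must appear in `state.filesModified`) is an assumption made for illustration, as are the function names. Paths containing spaces or renames are not handled.

```python
import subprocess


def git_status_porcelain() -> str:
    """Return the machine-readable status of the current repository."""
    completed = subprocess.run(
        ["git", "status", "--porcelain"],
        capture_output=True, text=True, check=True,
    )
    return completed.stdout


def plan_resume(checkpoint: dict, status_output: str) -> str:
    """Decide whether to continue or roll back (illustrative rule only)."""
    expected = set(checkpoint["state"]["filesModified"])
    # Each porcelain line is a two-character status, a space, then the path.
    dirty = {line[3:] for line in status_output.splitlines() if line}
    unexpected = dirty - expected
    if unexpected:
        return "rollback: unexpected changes in " + ", ".join(sorted(unexpected))
    return "continue: " + checkpoint["recovery"]["nextAction"]
```

In a real run the second argument would be `git_status_porcelain()`; passing the text in keeps the decision logic testable without a repository.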

Workflow 2: Test Failure Recovery

markdown
1. Test fails after implementation
2. Analyze failure:
   - Read error message
   - Identify failing assertion
   - Trace to code change
3. Determine fix:
   - If bug in new code: Fix and re-run
   - If bug in test: Review test expectations
   - If design issue: Consult architect
4. Apply fix
5. Run full test suite
6. Update checkpoint only when all tests pass

Workflow 3: Merge Conflict Recovery

markdown
1. Conflict detected during pull/merge
2. Checkpoint current branch state
3. Analyze conflicts:
   - List conflicting files
   - Understand both versions
4. Resolve conflicts:
   - For each file, decide correct version
   - Test resolution locally
5. Commit resolution
6. Continue workflow

Integration with Progress Tracking

Error Logging in Progress File

json
{
  "log": [
    {
      "timestamp": "2025-01-16T14:30:00Z",
      "action": "Attempted OAuth integration",
      "status": "failed",
      "error": {
        "type": "ConfigurationError",
        "message": "Missing OAUTH_CLIENT_ID in environment",
        "recoverable": true,
        "resolution": "User must provide OAuth credentials"
      }
    }
  ],
  "blockers": [
    {
      "id": "B001",
      "description": "OAuth credentials required",
      "status": "waiting_for_user",
      "createdAt": "2025-01-16T14:30:00Z"
    }
  ]
}

Blocker Management

markdown
## Blocker Protocol

1. **Detect**: Identify that progress is blocked
2. **Document**: Add to blockers array in progress file
3. **Notify**: Inform user with clear description
4. **Wait**: Do not proceed past blocker
5. **Resolve**: Once user provides resolution:
   - Mark blocker as resolved
   - Log resolution action
   - Continue workflow

## Blocker States

| State | Meaning |
|-------|---------|
| waiting_for_user | User input/action required |
| investigating | Analyzing potential solutions |
| resolved | Blocker cleared |
| escalated | Requires external help |
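
The protocol above could be applied to the progress file structure with a few helpers. This is a sketch under assumptions: the function names are hypothetical, blocker IDs are numbered sequentially, and the `"status": "success"` log value is an illustrative choice since the source only shows a `"failed"` entry.

```python
from datetime import datetime, timezone


def _now() -> str:
    """UTC timestamp in the same format as the progress file examples."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def add_blocker(progress: dict, description: str) -> str:
    """Document a new blocker and return its ID."""
    blockers = progress.setdefault("blockers", [])
    blocker_id = f"B{len(blockers) + 1:03d}"
    blockers.append({
        "id": blocker_id,
        "description": description,
        "status": "waiting_for_user",
        "createdAt": _now(),
    })
    return blocker_id


def resolve_blocker(progress: dict, blocker_id: str, resolution: str) -> None:
    """Mark a blocker resolved and log the resolution action."""
    for blocker in progress.get("blockers", []):
        if blocker["id"] == blocker_id:
            blocker["status"] = "resolved"
            progress.setdefault("log", []).append({
                "timestamp": _now(),
                "action": f"Resolved {blocker_id}: {resolution}",
                "status": "success",  # assumed value, not defined by the source
            })
            return
    raise KeyError(blocker_id)


def is_blocked(progress: dict) -> bool:
    """True while any blocker is unresolved, so the workflow must wait."""
    return any(b["status"] != "resolved" for b in progress.get("blockers", []))
```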

Claude Code Specific Features

Using Checkpoints

Claude Code automatically creates checkpoints before each edit.

  • Safe experimentation: Try approaches without fear
  • Use /rewind: Roll back to previous state if needed
  • Esc twice: Cancel current operation and discuss

Recovery Commands

| Command | Use When |
|---------|----------|
| /rewind | Need to undo recent changes |
| /clear | Context too polluted, restart clean |
| git checkout -- file | Discard specific file changes |
| git stash | Temporarily save work in progress |

Anti-Patterns

| Anti-Pattern | Why Bad | Instead |
|--------------|---------|---------|
| Ignoring errors | Problems compound | Handle immediately |
| No checkpoints | Can't recover | Checkpoint frequently |
| Retry without backoff | May worsen issue | Use exponential backoff |
| Silent failures | Problems hidden | Always log and report |
| Continuing past blockers | Invalid state | Stop and resolve |

Rules (L1 - Hard)

Critical for reliable recovery and data safety.

  • ALWAYS checkpoint before risky operations (enables rollback)
  • ALWAYS verify success after each significant action (ground truth)
  • NEVER ignore error messages or warnings (problems compound)
  • NEVER continue past a blocker without user confirmation
  • NEVER lose work - commit early and often
  • ALWAYS apply circuit breaker thresholds for autonomous agent loops:
    • NO_PROGRESS: Stop after 3 loops with no file changes
    • SAME_ERROR: Escalate to user after 5 identical errors
    • OUTPUT_DECLINE: Pause if output quality drops >70%
  • MUST escalate to user when any circuit breaker threshold is reached

Defaults (L2 - Soft)

Important for operational quality. Override with reasoning when appropriate.

  • Document errors with full context (aids debugging)
  • Provide recovery options when reporting errors
  • Use exponential backoff for retries
  • Log all failure attempts with timestamps

Guidelines (L3)

Recommendations for robust error handling.

  • Consider testing recovery paths during development
  • Prefer graceful degradation over complete failure
  • Consider using Claude Code's /rewind for quick rollbacks