Error Recovery Patterns
Strategies for building resilient agent workflows that handle errors gracefully, checkpoint progress, and enable recovery from failures.
From Building Effective Agents:
"Agents should gain ground truth from the environment at each step (such as tool call results or code execution) to assess its progress."
"Agents can then pause for human feedback at checkpoints or when encountering blockers."
Core Principles
1. Fail Fast, Recover Quickly
Detect errors early, checkpoint frequently, resume from last known good state.
2. Ground Truth Verification
Always verify the result of each action before proceeding to the next.
3. Graceful Degradation
When ideal path fails, fall back to alternative approaches rather than complete failure.
Checkpoint System
When to Checkpoint
| Event | Checkpoint Action |
|---|---|
| Feature completed | Update feature-list.json, commit |
| Significant file change | Update progress log |
| Before risky operation | Document current state |
| After successful test | Record passing state |
| Before external API calls | Save request context |
Checkpoint Format
{
"checkpointId": "CP-001",
"timestamp": "2025-01-16T14:30:00Z",
"position": {
"phase": "Implementation",
"feature": "F003",
"step": "Writing AuthService"
},
"state": {
"filesModified": ["src/services/auth.ts"],
"testsStatus": "passing",
"lastSuccessfulAction": "Created AuthService class"
},
"recovery": {
"nextAction": "Add login method to AuthService",
"dependencies": ["User model exists", "Database connected"],
"rollbackTo": "git commit abc123"
}
}
Checkpoint Implementation
## After each significant action:
1. **Verify Success**
- Check tool output for errors
- Run quick validation (lint, type check)
- Confirm file was written correctly
2. **Record State**
- Update .claude/workspaces/{workspace-id}/claude-progress.json
- Add entry to progress log
- Note files modified
3. **Document Recovery Path**
- What to do if next step fails
- How to roll back if needed
- Dependencies for resumption
Error Categories and Responses
Category 1: Transient Errors
Temporary failures that may succeed on retry.
| Error Type | Example | Response |
|---|---|---|
| Network timeout | API call failed | Retry with exponential backoff |
| Rate limiting | Too many requests | Wait and retry |
| Temporary file lock | File in use | Wait briefly, retry |
Retry Strategy with Exponential Backoff + Jitter:
max_retries = 3
base_delay = 1s
for attempt in 1..max_retries:
result = try_operation()
if success:
return result
# Exponential backoff with jitter prevents thundering herd
jitter = random(0, 0.5 * base_delay)
wait(base_delay * 2^attempt + jitter)
escalate_to_user("Operation failed after 3 retries")
Why Jitter Matters: Without jitter, multiple agents retrying simultaneously can overwhelm the system at the same intervals (thundering herd problem). Adding randomness spreads retry attempts.
Category 2: Recoverable Errors
Errors that require different approach but can be handled.
| Error Type | Example | Response |
|---|---|---|
| File not found | Expected file missing | Search for alternatives |
| Permission denied | Can't write to directory | Request user permission |
| Dependency missing | Package not installed | Install or use alternative |
| Test failure | New code breaks test | Analyze failure, fix code |
Recovery Strategy:
1. Log the error with full context 2. Analyze root cause 3. Determine alternative approach 4. If alternative exists: - Document deviation from original plan - Execute alternative - Verify success 5. If no alternative: - Document blocker - Ask user for guidance
Category 3: Fatal Errors
Errors that require human intervention.
| Error Type | Example | Response |
|---|---|---|
| Authentication required | Missing API key | Ask user to provide |
| Data corruption | Invalid state | Stop and alert user |
| Security concern | Suspicious operation | Halt and report |
| Scope creep | Request exceeds boundaries | Clarify with user |
Fatal Error Protocol:
1. STOP all operations immediately 2. Checkpoint current state 3. Document error with full context: - What was attempted - What failed - Current state of files - Potential impact 4. Present clear options to user: - Fix and continue - Roll back and retry - Abort workflow
Graceful Degradation Patterns
Pattern 1: Fallback Chain
Try primary approach, fall back to alternatives.
Primary: Use preferred library
│
└─ (failed) ─▶ Fallback 1: Use alternative library
│
└─ (failed) ─▶ Fallback 2: Manual implementation
│
└─ (failed) ─▶ Ask user
Pattern 2: Partial Success
Complete what's possible, report what's not.
## Partial Success Report ### Completed (3/5 features) - [x] User registration - [x] User login - [x] Password reset ### Failed (2/5 features) - [ ] OAuth integration - Error: Missing client_id - [ ] 2FA - Error: SMS provider not configured ### Next Steps 1. Provide OAuth client_id in .env 2. Configure SMS provider in settings 3. Re-run /spec-plan for remaining features
Pattern 3: Safe Mode
Continue with reduced functionality when errors occur.
Normal Mode: - Full implementation with all features - Complete test coverage - Performance optimization Safe Mode (on error): - Core functionality only - Basic tests - Skip optimization - Document what was skipped for later
Pattern 4: Circuit Breaker
Prevent cascading failures by stopping requests to failing services.
From AI Agent Best Practices:
"Retries and fallbacks try to recover from failures. Circuit breakers prevent a bad situation from spiraling further."
States:
CLOSED (normal) ──[failures >= threshold]──▶ OPEN (blocking)
▲ │
│ [timeout expires]
│ ▼
└────────[success]────────── HALF-OPEN (testing)
│
[failure]
▼
OPEN (blocking)
Implementation:
circuit_breaker:
failure_threshold: 3 # Consecutive failures to open
timeout: 30s # Time before testing again
state: CLOSED
on_operation():
if state == OPEN:
if timeout_expired:
state = HALF_OPEN
else:
return "Service unavailable (circuit open)"
result = try_operation()
if success:
if state == HALF_OPEN:
state = CLOSED
failure_count = 0
return result
else:
failure_count++
if failure_count >= failure_threshold:
state = OPEN
start_timeout()
return error
When to Use:
| Scenario | Use Circuit Breaker |
|---|---|
| External API calls | Yes - prevents overwhelming failing API |
| Database operations | Yes - prevents connection exhaustion |
| File system operations | Maybe - depends on failure mode |
| In-memory operations | No - failures are immediate |
Agent-Specific Considerations:
When chaining multiple AI agents:
- •If each agent is 95% reliable, 3 agents = 86% overall reliability
- •Circuit breakers at each stage prevent cascading failures
- •Use partial results when possible instead of complete failure
Subagent Circuit Breaker Thresholds
Production-tested thresholds for autonomous agent loops (from Claude Code engineering best practices):
| Threshold | Value | Action |
|---|---|---|
| NO_PROGRESS | 3 loops | Stop after 3 loops with no file changes |
| SAME_ERROR | 5 times | Escalate after 5 identical errors |
| OUTPUT_DECLINE | 70% | Pause if output quality drops >70% |
Application: Track files_changed, error_messages, and output_quality for each subagent invocation. When thresholds are hit, stop retrying and communicate clearly with the user about the situation and options.
Recovery Workflows
Workflow 1: Resume After Crash
1. Identify current workspace ID (branch + path hash)
2. Read .claude/workspaces/{workspace-id}/claude-progress.json
3. Identify last checkpoint:
- Position: "Phase 5, Feature F003, step 2"
- Last action: "Created AuthService class"
- Next action: "Add login method"
4. Verify file state:
- Run `git status` to check uncommitted changes
- Compare files to checkpoint expectation
5. If state is valid:
- Continue from documented next action
6. If state is corrupted:
- Roll back to last commit: `git checkout -- .`
- Resume from that checkpoint
Workflow 2: Test Failure Recovery
1. Test fails after implementation 2. Analyze failure: - Read error message - Identify failing assertion - Trace to code change 3. Determine fix: - If bug in new code: Fix and re-run - If bug in test: Review test expectations - If design issue: Consult architect 4. Apply fix 5. Run full test suite 6. Update checkpoint only when all tests pass
Workflow 3: Merge Conflict Recovery
1. Conflict detected during pull/merge 2. Checkpoint current branch state 3. Analyze conflicts: - List conflicting files - Understand both versions 4. Resolve conflicts: - For each file, decide correct version - Test resolution locally 5. Commit resolution 6. Continue workflow
Integration with Progress Tracking
Error Logging in Progress File
{
"log": [
{
"timestamp": "2025-01-16T14:30:00Z",
"action": "Attempted OAuth integration",
"status": "failed",
"error": {
"type": "ConfigurationError",
"message": "Missing OAUTH_CLIENT_ID in environment",
"recoverable": true,
"resolution": "User must provide OAuth credentials"
}
}
],
"blockers": [
{
"id": "B001",
"description": "OAuth credentials required",
"status": "waiting_for_user",
"createdAt": "2025-01-16T14:30:00Z"
}
]
}
Blocker Management
## Blocker Protocol 1. **Detect**: Identify that progress is blocked 2. **Document**: Add to blockers array in progress file 3. **Notify**: Inform user with clear description 4. **Wait**: Do not proceed past blocker 5. **Resolve**: Once user provides resolution: - Mark blocker as resolved - Log resolution action - Continue workflow ## Blocker States | State | Meaning | |-------|---------| | waiting_for_user | User input/action required | | investigating | Analyzing potential solutions | | resolved | Blocker cleared | | escalated | Requires external help |
Claude Code Specific Features
Using Checkpoints
Claude Code automatically creates checkpoints before each edit.
- •Safe experimentation: Try approaches without fear
- •Use
/rewind: Roll back to previous state if needed - •Esc twice: Cancel current operation and discuss
Recovery Commands
| Command | Use When |
|---|---|
/rewind | Need to undo recent changes |
/clear | Context too polluted, restart clean |
git checkout -- file | Discard specific file changes |
git stash | Temporarily save work in progress |
Anti-Patterns
| Anti-Pattern | Why Bad | Instead |
|---|---|---|
| Ignoring errors | Problems compound | Handle immediately |
| No checkpoints | Can't recover | Checkpoint frequently |
| Retry without backoff | May worsen issue | Use exponential backoff |
| Silent failures | Problems hidden | Always log and report |
| Continuing past blockers | Invalid state | Stop and resolve |
Rules (L1 - Hard)
Critical for reliable recovery and data safety.
- •ALWAYS checkpoint before risky operations (enables rollback)
- •ALWAYS verify success after each significant action (ground truth)
- •NEVER ignore error messages or warnings (problems compound)
- •NEVER continue past a blocker without user confirmation
- •NEVER lose work - commit early and often
- •ALWAYS apply circuit breaker thresholds for autonomous agent loops:
- •NO_PROGRESS: Stop after 3 loops with no file changes
- •SAME_ERROR: Escalate to user after 5 identical errors
- •OUTPUT_DECLINE: Pause if output quality drops >70%
- •MUST escalate to user when any circuit breaker threshold is reached
Defaults (L2 - Soft)
Important for operational quality. Override with reasoning when appropriate.
- •Document errors with full context (aids debugging)
- •Provide recovery options when reporting errors
- •Use exponential backoff for retries
- •Log all failure attempts with timestamps
Guidelines (L3)
Recommendations for robust error handling.
- •Consider testing recovery paths during development
- •Prefer graceful degradation over complete failure
- •Consider using Claude Code's
/rewindfor quick rollbacks