AgentSkillsCN

failure-recovery

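Recovery strategies for stuck agents, context exhaustion, build failures, and confused reasoning. Use this skill when an agent is looping, response quality keeps degrading, builds fail repeatedly, or the user says "you're stuck" or "start over".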

SKILL.md
--- frontmatter
name: failure-recovery
description: "Recovery protocols for stuck agents, context exhaustion, build failures, and confusion. Use when agent is looping, responses degrading, builds failing repeatedly, or user says 'you're stuck' or 'start over'."
allowed-tools: Read, Bash, Glob, Grep

Failure Recovery

Recovery protocols for when things go wrong. Don't panic; follow these systematic approaches.

When To Use

  • Agent is looping or repeating the same action
  • Responses are getting shorter or missing context
  • Build/tests failing repeatedly (3+ times)
  • User says "you're stuck", "start over", "lost", "confused"
  • Project state feels inconsistent

Build Failure Recovery

Trigger: Tests fail, build errors, runtime crashes

yaml
recovery_build_failure:
  step_1_isolate:
    - "pytest tests/ -x --tb=short"  # Stop at first failure
    - "git diff HEAD~1"              # What changed?

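  step_2_rollback_test:
    - "git stash"                    # Park uncommitted changes
    - "git checkout HEAD~1"          # Detached HEAD at the previous commit
    - "pytest tests/"                # Verify it works
    - "git checkout -"               # Return to the original branch

  step_3_bisect_if_needed:
    - "git bisect start"
    - "git bisect bad HEAD"
    - "git bisect good [last-good-commit]"
    - "git bisect run pytest tests/ -x"  # Let git test each candidate commit
    - "git bisect reset"             # Return to the original HEAD when done

  step_4_fix:
    rule: "Fix the bug, don't refactor"
    avoid: "No cleanup, no improvements, just fix"

  step_5_verify:
    - "pytest tests/"
    - "git stash pop"  # Only if step 2 actually stashed changes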

Key principle: Smallest change that fixes the issue. Do NOT refactor while debugging.
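
To make the bisect step less opaque, here is a minimal sketch of the binary search that `git bisect` performs. It assumes a linear history ordered oldest to newest, a known-good first commit, a known-bad last commit, and a single good-to-bad transition. The function name, the commit ids, and the `is_good` callback are hypothetical stand-ins for "check out the commit and run the tests".

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_good: Callable[[str], bool]) -> str:
    """Return the earliest commit for which is_good() is False.

    Assumes commits[0] is good, commits[-1] is bad, and the history
    flips from good to bad exactly once.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] is good, commits[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(commits[mid]):
            lo = mid  # the breakage happened after mid
        else:
            hi = mid  # mid is already broken
    return commits[hi]


# Hypothetical history in which "d4" introduced the failure.
history = ["a1", "b2", "c3", "d4", "e5", "f6"]
culprit = first_bad_commit(history, lambda c: history.index(c) < history.index("d4"))
print(culprit)  # d4
```

Each round halves the candidate range, so roughly log2(n) test runs are needed rather than one per commit.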


Agent Confusion Recovery

Trigger: Looping, contradicting self, asking already-answered questions

yaml
recovery_agent_confusion:
  symptoms:
    - "Repeating same action without progress"
    - "Contradicting previous statements"
    - "Asking questions already answered"
    - "Making changes that break previously working code"

  step_1_restate:
    action: |
      STOP. Let's reset.
      Current phase: [Phase X]
      Current task: [Task Y]
      Last successful action: [What worked]

  step_2_narrow_focus:
    action: |
      Focus only on: [single file]
      Specific change: [exact change needed]
      Do not touch other files.

  step_3_verify_understanding:
    action: |
      Before proceeding, confirm:
      1. What file are we changing?
      2. What exact change?
      3. Why this change?

  step_4_checkpoint:
    action: "Update TODO.md before continuing"

Context Window Exhaustion

Trigger: Responses getting shorter, missing earlier context, forgetting decisions

yaml
recovery_context_exhaustion:
  symptoms:
    - "Responses noticeably shorter"
    - "Forgetting earlier decisions"
    - "Re-asking questions already answered"
    - "Missing important context from earlier"

  prevention:
    - "Use TODO.md for task state"
    - "Keep LLM-OVERVIEW.md updated"
    - "Don't paste entire files unnecessarily"
    - "Reference file:line instead of copying code"

  recovery:
    step_1_handoff:
      action: "Generate handoff state document"
      template: |
        ## HANDOFF STATE
        **Project**: [name]
        **Timestamp**: [now]

        ### Completed
        - [list from TODO.md Done section]

        ### In Progress
        - [current task + status]

        ### Next Action (BE SPECIFIC)
        1. Open file: [exact path]
        2. Find: [function/line]
        3. Do: [exact change]

        ### Critical Context
        - [decisions made]
        - [blockers encountered]

    step_2_new_session:
      action: |
        Start new session. Read in order:
        1. LLM-OVERVIEW.md
        2. TODO.md
        3. Handoff document
        Confirm understanding before proceeding.
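
The handoff template above can be filled in programmatically so that no section is forgotten. The following is a sketch only: the `HandoffState` class, its field names, and `render_handoff` are hypothetical, chosen to mirror the template's headings.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class HandoffState:
    """Hypothetical container; fields mirror the HANDOFF STATE template."""
    project: str
    completed: List[str]
    in_progress: str
    next_file: str
    next_target: str
    next_change: str
    critical_context: List[str] = field(default_factory=list)


def render_handoff(state: HandoffState, now: Optional[datetime] = None) -> str:
    """Render the handoff document as Markdown text."""
    stamp = (now or datetime.now(timezone.utc)).isoformat(timespec="seconds")
    lines = [
        "## HANDOFF STATE",
        f"**Project**: {state.project}",
        f"**Timestamp**: {stamp}",
        "",
        "### Completed",
        *[f"- {item}" for item in state.completed],
        "",
        "### In Progress",
        f"- {state.in_progress}",
        "",
        "### Next Action (BE SPECIFIC)",
        f"1. Open file: {state.next_file}",
        f"2. Find: {state.next_target}",
        f"3. Do: {state.next_change}",
        "",
        "### Critical Context",
        *[f"- {item}" for item in state.critical_context],
    ]
    return "\n".join(lines) + "\n"


# Example values are invented for illustration.
example = HandoffState(
    project="demo",
    completed=["parser"],
    in_progress="CLI flags (half done)",
    next_file="src/cli.py",
    next_target="parse_args",
    next_change="add --verbose",
    critical_context=["keep argparse"],
)
print(render_handoff(example))
```

Requiring `next_file`, `next_target`, and `next_change` as separate fields is one way to enforce the "BE SPECIFIC" rule: the document cannot be rendered without them.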

Predictive Context Management (PROACTIVE)

NEW in v7.4: Don't wait for exhaustion - predict and prevent.

Token Usage Estimation

Track a rolling average of tokens per operation:

| Operation | Avg Tokens |
|---|---|
| File read | ~500 |
| Grep result (per match) | ~200 |
| Tool output | ~1000 |
| Subagent summary | ~300 |
| User message | ~100 |
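
One possible way to keep such a rolling average is a fixed-size window per operation type. This is a sketch under assumptions: the `TokenTracker` class, the operation labels, and the window size of 20 are all hypothetical, and real token counts would have to come from whatever the agent runtime reports.

```python
from collections import defaultdict, deque


class TokenTracker:
    """Rolling average of token cost per operation type (hypothetical helper)."""

    def __init__(self, window: int = 20):
        # One bounded deque per operation; old samples fall off automatically.
        self._samples = defaultdict(lambda: deque(maxlen=window))

    def record(self, operation: str, tokens: int) -> None:
        self._samples[operation].append(tokens)

    def average(self, operation: str, default: float = 0.0) -> float:
        samples = self._samples.get(operation)
        if not samples:
            return default  # no data yet: fall back to a table value
        return sum(samples) / len(samples)


tracker = TokenTracker(window=3)
for cost in (400, 500, 600, 700):
    tracker.record("file_read", cost)

print(tracker.average("file_read"))                # 600.0 (only the last 3 samples count)
print(tracker.average("grep_match", default=200))  # 200 (table value used as a prior)
```

Passing the table figures as `default` lets the estimates start from the documented averages and drift toward observed values as samples accumulate.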

Pre-emptive Action Thresholds

| Context Level | Prediction | Action |
|---|---|---|
| <30% | Safe zone | Continue normally |
| 30-40% | Approaching limit | Start delegating exploration to subagents |
| 40-50% | Warning zone | Delegate ALL remaining discovery work |
| 50-60% | Critical zone | Create handoff checkpoint, delegate implementation |
| >60% | Danger zone | Stop, create handoff, instruct /compact |
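
The thresholds can be expressed as a small lookup function. The table leaves the boundaries ambiguous (40% appears in two bands), so this sketch assumes each band includes its lower bound and that exactly 60% counts as the danger zone. The function name is hypothetical.

```python
from typing import Tuple


def context_zone(used_fraction: float) -> Tuple[str, str]:
    """Map context usage (0.0 to 1.0) to (prediction, action) from the threshold table.

    Assumption: bands are lower-bound inclusive, e.g. 0.40 is "Warning zone".
    """
    if not 0.0 <= used_fraction <= 1.0:
        raise ValueError("used_fraction must be between 0.0 and 1.0")
    if used_fraction < 0.30:
        return ("Safe zone", "Continue normally")
    if used_fraction < 0.40:
        return ("Approaching limit", "Start delegating exploration to subagents")
    if used_fraction < 0.50:
        return ("Warning zone", "Delegate ALL remaining discovery work")
    if used_fraction < 0.60:
        return ("Critical zone", "Create handoff checkpoint, delegate implementation")
    return ("Danger zone", "Stop, create handoff, instruct /compact")


for level in (0.10, 0.35, 0.42, 0.52, 0.75):
    prediction, action = context_zone(level)
    print(f"{level:.0%}: {prediction} -> {action}")
```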

Pre-emptive Delegation Pattern

At 30% context, start offloading to subagents:

code
Context at 35%. Complex exploration ahead.

Action: Delegate to Explore agent
Prompt: "Find all files matching X, return summary only"
Result: Agent explores in isolated context
Main context: receives 300-token summary, not 5000-token search results

Pre-emptive Checkpoint Pattern

At 40% context, create checkpoint before continuing:

code
Context at 42%.

Action:
1. bd sync (save beads state)
2. Update TODO.md with current position
3. Continue with caution flag

If next operation would push >50%:
  → Create handoff immediately
  → Delegate remaining work to background agent
  → Report: "Checkpointed at [position]. Background agent continuing."
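
The "would the next operation push past 50%" check is simple arithmetic once a token estimate is available. A minimal sketch, assuming a hypothetical 200,000-token window and hypothetical function names; the real window size depends on the model in use.

```python
def projected_usage(used_tokens: int, next_op_tokens: int, window_tokens: int) -> float:
    """Fraction of the context window in use after the next operation."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    return (used_tokens + next_op_tokens) / window_tokens


def should_handoff_now(used_tokens: int, next_op_tokens: int,
                       window_tokens: int, limit: float = 0.50) -> bool:
    """True when the next operation would push usage above the limit."""
    return projected_usage(used_tokens, next_op_tokens, window_tokens) > limit


WINDOW = 200_000  # assumed window size, for illustration only
used = 84_000     # 42% of the assumed window

print(should_handoff_now(used, 5_000, WINDOW))   # False: 44.5% after the operation
print(should_handoff_now(used, 20_000, WINDOW))  # True: 52% after the operation
```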

Background Handoff Pattern

At 50% context, continue work via delegation instead of stopping:

code
Context at 52%. Would normally stop here.

Instead:
1. Create handoff document
2. Spawn background agent with remaining tasks:

   Task:
     subagent_type: general-purpose
     description: "Continue implementation"
     prompt: |
       Continuing from handoff: [path]
       Remaining tasks from beads:
       - [task 1]
       - [task 2]
       - [task 3]

       Implement each, commit, close in beads.
     run_in_background: true

3. Report to user:
   "Context high. Handed off to background agent.
    Agent ID: [id]
    Use 'bd poll-all' to check progress.
    Or start new session and 'resume handoff'."

4. User can:
   - Wait for background agent to finish
   - Start new session with /compact
   - Poll agent results via TaskOutput

Context Prediction Indicators

Watch for these signals to predict exhaustion:

| Signal | Meaning |
|---|---|
| Large file reads queued | Will consume significant context |
| Many grep matches expected | High token consumption ahead |
| User requesting "find all" | Exploration will be heavy |
| Complex implementation ahead | Multiple file reads/writes |
| Already at 25% with more to go | Pre-emptively delegate now |
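
These signals can be turned into a rough number by multiplying planned operations by the per-operation averages from the token usage table. The sketch below assumes hypothetical operation labels and a hypothetical helper name; the token figures are the approximate averages quoted earlier, not measurements.

```python
from typing import Dict

# Approximate per-operation averages from the token usage table.
# The dictionary keys are hypothetical labels.
AVG_TOKENS: Dict[str, int] = {
    "file_read": 500,
    "grep_match": 200,
    "tool_output": 1000,
    "subagent_summary": 300,
    "user_message": 100,
}


def estimate_queued_tokens(queued: Dict[str, int]) -> int:
    """Estimate the token cost of planned work, given {operation: count}."""
    unknown = set(queued) - set(AVG_TOKENS)
    if unknown:
        raise KeyError(f"no average recorded for: {sorted(unknown)}")
    return sum(AVG_TOKENS[op] * count for op, count in queued.items())


# A "find all" request expected to read 6 files and return 40 grep matches.
direct = estimate_queued_tokens({"file_read": 6, "grep_match": 40})
delegated = estimate_queued_tokens({"subagent_summary": 1})
print(direct, delegated)  # 11000 300
```

Comparing the two estimates makes the case for delegation concrete: the same exploration costs the main context a single summary instead of every intermediate result.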

Proactive vs Reactive

| Reactive (OLD) | Proactive (NEW) |
|---|---|
| Wait until 50% to warn | Start delegating at 30% |
| Stop work at high context | Continue via background agents |
| User must /compact manually | Background agent keeps working |
| Context lost on compact | Handoff preserves everything |
| Resume requires reading files again | Background agent has full context |

"I'm Lost" Recovery

Trigger: Request doesn't match any category, project state unclear, contradictory context

yaml
recovery_lost:
  symptoms:
    - "User request doesn't fit normal patterns"
    - "Project state is inconsistent"
    - "Previous context missing or contradictory"
    - "Agent unsure what to do next"

  recovery_protocol:
    step_1: "STOP. Don't guess or hallucinate."

    step_2_ask_user:
      action: |
        I want to make sure I help you correctly. Can you tell me:
        1. What's the end goal you're trying to achieve?
        2. Is this a new project, existing project, or continuation?
        3. What's the most important thing to get right?

    step_3: "Based on answers, re-run triage"

  anti_patterns:
    - "DO NOT pretend to understand when confused"
    - "DO NOT make up context that wasn't provided"
    - "DO NOT assume - ask"

Dependency Hell Recovery

Trigger: Package conflicts, version mismatches, environment issues

yaml
recovery_dependency_hell:
  step_1_isolate:
    - "python -m venv .venv-clean"
    - "source .venv-clean/bin/activate"

  step_2_minimal_install:
    - "pip install [core-deps-only]"
    - "Test if basic functionality works"

  step_3_add_incrementally:
    - "Add deps one at a time"
    - "Test after each addition"
    - "Stop when you find the conflict"

  step_4_pin_versions:
    - "pip freeze > requirements.lock"
    - "Document what versions work"
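
The "add one at a time, stop at the conflict" step can be sketched as a loop around a check function. Everything here is hypothetical: `find_conflicting_dep`, the package names, and the `works` callback, which in practice would install the given set into the clean environment and run a smoke test.

```python
from typing import Callable, List, Optional, Sequence


def find_conflicting_dep(core: Sequence[str], extras: Sequence[str],
                         works: Callable[[List[str]], bool]) -> Optional[str]:
    """Add extras one at a time; return the first one that breaks the environment."""
    installed = list(core)
    if not works(installed):
        raise RuntimeError("core dependencies already fail; fix those first")
    for dep in extras:
        installed.append(dep)
        if not works(installed):
            return dep  # stop as soon as the conflict appears
    return None  # everything installed cleanly


def fake_check(installed: List[str]) -> bool:
    """Simulated check: pretend libA and libC cannot coexist."""
    return not {"libA", "libC"} <= set(installed)


print(find_conflicting_dep(["libA"], ["libB", "libC", "libD"], fake_check))  # libC
```

Note that this reports the dependency whose addition triggered the failure; the other half of the conflict may be any package installed before it.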

Loop Detection (Autonomous Mode)

Trigger: Agent running autonomously, possibly stuck in infinite loop

yaml
recovery_loop_detection:
  detection_methods:
    iteration_count:
      threshold: 100
      action: "Stop after MAX_ITERATIONS"

    stuck_detection:
      check: "Compare beads state hash between iterations"
      threshold: 5  # Same state for 5 iterations
      action: "Stop, write to .agent/LAST_ERROR.md"

    error_threshold:
      consecutive_errors: 3
      action: "Stop and report"

  indicators:
    - "Same bd ready output for multiple iterations"
    - "Repeating same commit messages"
    - "No progress on any beads task"
    - "Iteration count > 50 with few tasks completed"

  recovery:
    step_1: "Write state to .agent/STATUS.md"
    step_2: "bd sync (save beads state)"
    step_3: "Exit gracefully (exit 1)"
    step_4: "User reviews .agent/ and decides next action"

  manual_check:
    commands:
      - "cat .agent/ITERATIONS.md"    # How many iterations?
      - "cat .agent/LAST_ERROR.md"    # What went wrong?
      - "bd list --json"              # What's the task state?
      - "git log --oneline -10"       # What was committed?

Key principle: Better to stop early and let user review than to burn compute looping.
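
The three detection methods can be combined in one small guard object that the autonomous loop consults after every iteration. This is a sketch, not the skill's actual implementation: the `LoopGuard` class and its method are hypothetical, and the state string stands in for whatever snapshot of task state is being hashed. The defaults match the thresholds listed above.

```python
import hashlib
from typing import Optional


class LoopGuard:
    """Tracks iteration count, unchanged state, and consecutive errors."""

    def __init__(self, max_iterations: int = 100, max_same_state: int = 5,
                 max_consecutive_errors: int = 3):
        self.max_iterations = max_iterations
        self.max_same_state = max_same_state
        self.max_consecutive_errors = max_consecutive_errors
        self.iterations = 0
        self._last_hash: Optional[str] = None
        self._same_state = 0
        self._errors = 0

    def observe(self, state: str, error: bool = False) -> Optional[str]:
        """Record one iteration; return a stop reason, or None to continue."""
        self.iterations += 1
        digest = hashlib.sha256(state.encode("utf-8")).hexdigest()
        if digest == self._last_hash:
            self._same_state += 1
        else:
            self._last_hash = digest
            self._same_state = 1
        self._errors = self._errors + 1 if error else 0  # reset on any success
        if self._errors >= self.max_consecutive_errors:
            return "consecutive errors"
        if self._same_state >= self.max_same_state:
            return "state unchanged"
        if self.iterations >= self.max_iterations:
            return "iteration limit"
        return None


guard = LoopGuard()
reasons = [guard.observe("ready: task-1") for _ in range(5)]
print(reasons)  # [None, None, None, None, 'state unchanged']
```

When `observe` returns a reason, the caller would follow the recovery steps above: write the status file, save state, and exit with a non-zero status.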


Recovery Decision Tree

code
Problem detected
    ├─ Build/test failure?     → Build Failure Recovery
    ├─ Agent acting weird?     → Agent Confusion Recovery
    ├─ Responses degrading?    → Context Window Exhaustion
    ├─ Package conflicts?      → Dependency Hell Recovery
    ├─ Infinite loop?          → Loop Detection Recovery
    ├─ Completely lost?        → "I'm Lost" Recovery
    └─ Unknown?                → Generate handoff, start fresh session
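
The tree reads as an ordered list of checks with a fallback. A minimal sketch, assuming hypothetical symptom labels and assuming that when several symptoms are present the first matching branch wins (the tree itself does not say):

```python
from typing import Iterable

# (symptom label, protocol) pairs in the same order as the tree.
# The labels are hypothetical; the protocol names come from this document.
RECOVERY_ROUTES = [
    ("build_failure", "Build Failure Recovery"),
    ("agent_weird", "Agent Confusion Recovery"),
    ("responses_degrading", "Context Window Exhaustion"),
    ("package_conflicts", "Dependency Hell Recovery"),
    ("infinite_loop", "Loop Detection Recovery"),
    ("completely_lost", '"I\'m Lost" Recovery'),
]
FALLBACK = "Generate handoff, start fresh session"


def choose_recovery(symptoms: Iterable[str]) -> str:
    """Return the first protocol whose symptom was observed, else the fallback."""
    observed = set(symptoms)
    for label, protocol in RECOVERY_ROUTES:
        if label in observed:
            return protocol
    return FALLBACK


print(choose_recovery(["package_conflicts"]))  # Dependency Hell Recovery
print(choose_recovery([]))                     # Generate handoff, start fresh session
```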

Anti-Patterns

During recovery, AVOID:

  • Refactoring while debugging
  • Making multiple changes at once
  • Guessing at solutions without verification
  • Continuing when confused
  • Skipping verification steps

ALWAYS:

  • Make one change at a time
  • Verify after each change
  • Update TODO.md with current state
  • Ask user when genuinely stuck

Keywords

stuck, looping, confused, lost, broken, failing, recovery, reset, start over, context exhaustion, dependency hell, build failure