ECOS Failure Recovery Skill
Overview
This skill teaches the Emasoft Chief of Staff (ECOS) how to detect, classify, and recover from agent failures in a multi-agent system coordinated via AI Maestro messaging.
When to use this skill:
- •When monitoring remote agent health
- •When an agent becomes unresponsive
- •When an agent crashes or terminates unexpectedly
- •When work must be transferred due to agent failure
- •When a replacement agent must be created and onboarded
Prerequisites
Before using this skill, ensure:
- •AI Maestro is running locally
- •Agent registry is accessible
- •Recovery scripts are available in scripts/
Instructions
- •Detect the failure using heartbeat monitoring, message delivery status, or task completion timeouts
- •Classify the failure as transient, recoverable, or terminal using the classification criteria
- •Apply the appropriate recovery strategy based on failure type
- •If recovery fails or failure is terminal, initiate agent replacement protocol
- •For critical deadlines, use emergency handoff to transfer work immediately
- •Document all incidents in the incident log and notify EAMA of outcomes
Checklist
Copy this checklist and track your progress:
## ECOS Failure Response Checklist Agent: _______________ Failure detected: _______________ ### Detection - [ ] Heartbeat status checked - [ ] AI Maestro agent status queried - [ ] Message delivery verified - [ ] Task progress reviewed ### Classification - [ ] Failure type determined: [ ] Transient [ ] Recoverable [ ] Terminal - [ ] Evidence documented - [ ] Incident logged ### Response (choose path) #### If Transient: - [ ] Waited for auto-recovery (< 5 min) - [ ] Verified agent responsive - [ ] Resumed normal monitoring #### If Recoverable: - [ ] Manager notified - [ ] Recovery strategy selected - [ ] Recovery attempted - [ ] Recovery verified OR escalated to replacement #### If Terminal: - [ ] Manager notified - [ ] Replacement approval requested - [ ] Artifacts preserved - [ ] Replacement agent created - [ ] Orchestrator notified - [ ] Handoff documentation sent - [ ] New agent acknowledged - [ ] Incident closed ### Emergency Handoff (if deadline critical): - [ ] Critical tasks identified - [ ] Orchestrator notified - [ ] Receiving agent assigned - [ ] Handoff documentation created - [ ] Work transferred - [ ] Deadline met OR escalated
Output
| Recovery Type | Output |
|---|---|
| Agent restart | Agent back online, state restored |
| Communication | Message queue cleared, connection restored |
| State | Corrupted state replaced with backup |
Quick Reference: Failure Response Workflow
DETECT --> CLASSIFY --> RESPOND
| | |
v v v
Heartbeat Transient? Wait & Retry
timeout? --> Yes --> (auto-recover)
| |
Message No
delivery |
failed? v
| Recoverable?
Agent --> Yes --> Restart / Wake
offline? | (intervention needed)
|
No
|
v
Terminal --> Replace Agent
(full protocol)
Procedures Summary
| Phase | Action | Reference Document |
|---|---|---|
| 1 | Detect failure | failure-detection.md |
| 2 | Classify severity | failure-classification.md |
| 3 | Attempt recovery | recovery-strategies.md |
| 4 | Replace if terminal | agent-replacement-protocol.md |
| 5 | Emergency handoff | work-handoff-during-failure.md |
Phase 1: Failure Detection
Before responding to a failure, ECOS must first detect that a failure has occurred.
Read references/failure-detection.md for:
- •1.1-1.2 Overview and when to use
- •1.3 Heartbeat monitoring via AI Maestro
- •1.4 Message delivery failure detection
- •1.5 Task completion timeout detection
- •1.6 Agent status API queries
- •1.7 Decision flowchart
| Mechanism | Signal | Response Time |
|---|---|---|
| Heartbeat timeout | Missed pings | 30-60 seconds |
| Message delivery failure | API error | Immediate |
| Message acknowledgment timeout | No ACK | 5-15 minutes |
| Task completion timeout | Stalled progress | Variable |
Phase 2: Failure Classification
Once detected, classify severity to determine response.
Read references/failure-classification.md for:
- •2.1-2.2 Overview and categories
- •2.3 Transient failures (auto-recover < 5 min)
- •2.4 Recoverable failures (intervention needed)
- •2.5 Terminal failures (replacement required)
- •2.6-2.8 Decision matrix and escalation
| Category | Severity | Recovery | Example |
|---|---|---|---|
| Transient | Low | Automatic (< 5 min) | Network hiccup, API rate limit |
| Recoverable | Medium | With intervention | Session hibernated, out of memory |
| Terminal | High | Replacement required | Host crash, disk corruption |
Phase 3: Recovery Strategies
For transient and recoverable failures, attempt recovery before escalating.
Read references/recovery-strategies.md for:
- •3.1-3.2 Overview and strategy selection
- •3.3 Wait and Retry (transient)
- •3.4 Restart Agent (soft/hard procedures)
- •3.5 Hibernate-Wake Cycle
- •3.6 Resource Adjustment
- •3.7-3.8 When to replace and flowchart
| Strategy | When to Use | Time to Recover |
|---|---|---|
| Wait and Retry | Transient failures | 1-5 minutes |
| Restart | Hung/crashed agent | 5-15 minutes |
| Hibernate-Wake | Idle/suspended session | 2-5 minutes |
| Resource Adjustment | Memory/disk exhaustion | 15-60 minutes |
| Replace | All above failed | 30-120 minutes |
Phase 4: Agent Replacement Protocol
When recovery fails or failure is terminal, create a replacement agent.
Read references/agent-replacement-protocol.md for:
- •4.1-4.2 Overview
- •4.3 Phase 1: Failure confirmation and artifact preservation
- •4.4 Phase 2: Manager notification and approval
- •4.5 Phase 3: Creating replacement agent
- •4.6 Phase 4: Orchestrator notification
- •4.7 Phase 5: Work handoff to new agent
- •4.8-4.9 Cleanup and complete checklist
Replacement Protocol Summary
ECOS detects terminal failure
|
v
ECOS notifies EAMA (manager) --> EAMA approves
|
v
ECOS coordinates new agent creation
|
v
ECOS notifies EOA (orchestrator) to:
- Generate handoff document
- Update GitHub Project kanban
|
v
ECOS sends handoff docs to new agent
|
v
New agent acknowledges and begins work
Key Consideration: Memory Loss
CRITICAL: The replacement agent has NO MEMORY of the old agent.
The new agent does not know what tasks were assigned, what work was in progress, or the project context. Therefore:
- •Orchestrator (EOA) must generate handoff documentation
- •EOA must reassign tasks in GitHub Project kanban
- •ECOS must send handoff docs to new agent
ROLE BOUNDARY: ECOS creates agents and sends context. EOA owns task assignment.
Phase 5: Emergency Work Handoff
When critical work cannot wait for full replacement protocol.
Read references/work-handoff-during-failure.md for:
- •5.1-5.2 Overview and when to use
- •5.3 Triggering emergency handoff
- •5.4 Creating emergency handoff documentation
- •5.5 Reassigning work during failure
- •5.6 Emergency handoff message formats
- •5.7 Post-failure work reconciliation
| Aspect | Regular Handoff | Emergency Handoff |
|---|---|---|
| Timing | After replacement ready | Immediately |
| Completeness | Full context | Minimum viable |
| Recipient | Replacement agent | Any available agent |
| Duration | Permanent | Temporary |
File Locations
| Data | Location |
|---|---|
| Heartbeat configuration | $CLAUDE_PROJECT_DIR/.ecos/agent-health/heartbeat-config.json |
| Task tracking | $CLAUDE_PROJECT_DIR/.ecos/agent-health/task-tracking.json |
| Incident log | $CLAUDE_PROJECT_DIR/.ecos/agent-health/incident-log.jsonl |
| Recovery log | $CLAUDE_PROJECT_DIR/.ecos/agent-health/recovery-log.jsonl |
| Handoff documents | $CLAUDE_PROJECT_DIR/thoughts/shared/handoffs/AGENT_NAME/ |
| Emergency handoffs | $CLAUDE_PROJECT_DIR/thoughts/shared/handoffs/emergency/ |
Manager Notification Priorities
| Situation | Priority | Message Type |
|---|---|---|
| Transient failure (pattern) | normal | escalation |
| Recoverable failure detected | high | failure-report |
| Recovery attempt failed | high | failure-report |
| Terminal failure detected | urgent | replacement-request |
| Emergency handoff initiated | urgent | emergency-handoff-notification |
| Replacement complete | normal | replacement-complete |
Troubleshooting
Common issues when recovering from agent failures.
Read references/troubleshooting.md for:
- •Agent shows online but unresponsive -> Verify hooks and send status inquiry
- •Cannot determine failure type -> Default to recoverable, attempt strategies in order
- •Manager does not respond -> Wait 15 min, send reminder, escalate to user if needed
- •New replacement agent fails to register -> Verify AI Maestro health and hooks
- •Emergency handoff deadline missed -> Document, notify stakeholders, conduct post-mortem
Handoff Validation Checklist
Before sending any handoff document (regular or emergency), validate using this checklist:
### Handoff Validation Checklist Before sending handoff: - [ ] All required fields present (from/to/type/UUID/task) - [ ] UUID is unique (check existing handoffs: `ls $CLAUDE_PROJECT_DIR/thoughts/shared/handoffs/`) - [ ] Target agent exists and is alive (`curl -s "http://localhost:23000/api/agents" | jq -r '.agents[].name'`) - [ ] All referenced files exist (`test -f <path> && echo "EXISTS" || echo "MISSING"`) - [ ] No placeholder [TBD] markers (`grep -r "\[TBD\]" handoff.md`) - [ ] Document is valid markdown (no broken links, proper formatting) - [ ] Acceptance criteria clearly defined - [ ] Current state accurately reflects reality - [ ] Contact information for questions provided
Required fields for failure recovery handoffs:
| Field | Description | Example |
|---|---|---|
from | Sending agent name | ecos-chief-of-staff |
to | Target agent name | replacement-agent-001 |
type | Handoff type | emergency-handoff, replacement-handoff |
UUID | Unique handoff identifier | EH-20250204-svgbbox-001 |
task | Task being handed off | Implement bounding box calculation |
failed_agent | Name of failed agent | libs-svg-svgbbox |
failure_reason | Why agent failed | Terminal crash - disk corruption |
Error Handling
| Error | Cause | Resolution |
|---|---|---|
| Agent unresponsive | Network issue or crash | Send ping, wait 30s, then classify |
| Recovery failed | State corrupted | Escalate to terminal, request replacement |
| Handoff rejected | Target agent busy | Queue handoff, retry in 5 minutes |
| AI Maestro unavailable | Server down | Use fallback file-based communication |
Examples
Recovery scenarios with step-by-step commands.
Read references/examples.md for:
- •Agent crash recovery (recoverable -> restart -> verify)
- •Terminal failure with replacement (3 crashes -> replace -> handoff)
- •Transient network failure (wait -> auto-recover)
- •Emergency handoff with deadline (immediate reassignment)
- •Quick command reference (heartbeat, status, restart, approval, handoff)
Resources
- •references/failure-detection.md - Detection procedures
- •references/failure-classification.md - Classification criteria
- •references/recovery-strategies.md - Recovery steps
- •references/agent-replacement-protocol.md - Replacement workflow
- •references/work-handoff-during-failure.md - Work transfer procedures
- •references/troubleshooting.md - Common issues and solutions
- •references/examples.md - Complete recovery examples