Backup Skill -- Automated Backup & Recovery
Purpose
Execute automated backups on schedule to enable fast recovery if agents fail:
- •Daily backup (07:00): Scoring weights and memory files from all agents
- •Weekly backup (Sunday 23:00): Full agent snapshots (SOUL.md, skills, memory)
- •On-change backup: Triggered when scoring weights or rules are adjusted
Backup types
Daily backup
- •What: Scoring weights (Z's priority_weights.json, etc.), memory files (agent registry, kaizen journal, lessons learned)
- •Where:
/backups/daily/YYYY-MM-DD/ - •When: 07:00 every day
- •Retention: Keep last 30 daily backups
- •Size: Small (~100KB)
Weekly backup
- •What: Full agent snapshots for all agents (Z, Jay, Rick, Leroy)
- •SOUL.md (agent charter)
- •All skill directories with Python scripts
- •All memory files
- •Where:
/backups/weekly/YYYY-Www/(ISO week format) - •When: Sunday 23:00
- •Retention: Keep last 12 weekly backups (3 months)
- •Size: Medium (~5-10MB per backup)
On-change backup
- •What: Triggered immediately when scoring weights or system rules are changed (human approval required)
- •Where:
/backups/on-change/YYYY-MM-DD-HHmmss/ - •Retention: Keep last 10 on-change backups
- •Purpose: Audit trail of all config changes
Recovery procedure
When an agent fails (state = DEAD):
- •EM triggers restart from most recent backup (daily or weekly)
- •Copy backup files to
/agents/[agent_id]/overwriting current state - •Restart agent process
- •Verify agent responds to heartbeat within 5 minutes
- •If recovery succeeds: log event, resume operations
- •If recovery fails: escalate to human with full diagnostic info
Implementation
backup_agent.py:
- •Reads
backup_schedule.jsonto determine what/when/where to back up - •Executes backup (copy files to backup directory)
- •Logs backup event to system-history.jsonl
- •Manages retention (delete old backups exceeding retention policy)
backup_schedule.json:
- •Configuration file with backup schedule and retention policies
- •Human-editable (to adjust retention, frequency, scope)
Backup verification
After each backup:
- •Verify all files copied successfully (checksum comparison)
- •Verify backup directory readable (no permission errors)
- •Log backup completion timestamp and file count
- •Alert human if backup fails
Disaster recovery
If EM itself crashes:
- •Restart EM from most recent backup
- •EM reads message queue from persistent log
- •Resume message routing from where it left off
- •Agents should have local queues (message buffering) while EM is down
If multiple agents fail simultaneously:
- •Attempt to restart all from backup in parallel
- •Respect dependency graph (start Z first, then Jay/Rick, then Leroy)
- •Verify each agent before continuing pipeline
- •If critical agents don't recover, escalate to human immediately