Long Task Execution Framework
Robust protocol for executing tasks that take multiple hours with confidence.
Core Principles
- •Atomicity: Every task must be decomposed into atomic sub-tasks < 10 minutes
- •Validation: Validate after every change, before every checkpoint
- •Checkpointing: Create checkpoint after every completed atomic task
- •Recovery: Automatic retry → rollback → escalate on failures
- •Determinism: Use scripts for critical operations (not LLM decisions)
Atomic Task Requirements
Each atomic task MUST:
- •Take < 10 minutes to execute
- •Have clear, verifiable success criteria
- •Be independently rollback-able
- •Have explicit dependencies listed
- •Have a rollback plan defined
Validation Protocol
After Every Code Change:
bash
npx tsc --noEmit --skipLibCheck # TypeScript (2-5s) npx eslint src --max-warnings 0 # Linting (1-3s)
After Atomic Task Completion:
bash
bash .claude/framework/validation-gate.sh atomic
Before Final Completion:
bash
bash .claude/framework/validation-gate.sh final
NEVER continue if validation fails. Stop and recover.
Checkpoint Strategy
Create checkpoint:
- •✅ AFTER each atomic task completes successfully
- •✅ BEFORE risky operations (migrations, major refactors)
- •✅ Every 15 minutes MAX (even if task incomplete)
- •❌ NEVER on validation failures
- •❌ NEVER during rollback
- •📁 Snapshot stored at
.claude/checkpoints/{id}/snapshot/(excludes.claude/checkpoints,.git,node_modules)
Checkpoint naming:
- •Pattern:
ckpt_{taskId}_{atomicTaskId}_{timestamp} - •Example:
ckpt_sprint-1_task1.3_1729612345
Checkpoint metadata:
Include context:
- •Decisions made ("Chose approach X over Y because...")
- •Blockers encountered
- •Files changed
- •Next task
Recovery Protocol
On failure, execute in order:
Level 1: Retry (3 attempts max)
bash
node .claude/framework/recovery.js --task-id=X --error='{"type":"...","message":"..."}'
- •Same approach, maybe transient error
- •Agent retries task automatically
Level 2: Rollback
- •Restore to last valid checkpoint
- •Try alternative approach
- •Agent retries with different strategy
Level 3: Escalate
- •Mark task as BLOCKED
- •Notify human with error details and options
- •Wait for manual intervention
State Machine
Valid state transitions:
code
INITIALIZED → PLANNING PLANNING → PLANNED | FAILED PLANNED → EXECUTING EXECUTING → VALIDATING | FAILED VALIDATING → CHECKPOINTING | FAILED CHECKPOINTING → EXECUTING | COMPLETED FAILED → ROLLING_BACK | ABANDONED ROLLING_BACK → EXECUTING | ABANDONED
Invalid transitions are errors.
Scripts Reference
All scripts are in .claude/framework/:
bash
# Logger (structured JSON logs)
node logger.js --event=EVENT_NAME --data='{"key":"value"}'
# Checkpoint Manager
node checkpoint-manager.js create --task-id=X --context='{"decisions":"..."}'
node checkpoint-manager.js restore --checkpoint-id=ckpt_xyz
node checkpoint-manager.js list
# Validation Gates
bash validation-gate.sh atomic # Fast: tsc + eslint
bash validation-gate.sh checkpoint # Medium: + tests
bash validation-gate.sh final # Full: + build (optional)
# Recovery
node recovery.js --task-id=X --error='{"type":"...","message":"..."}'
# Conditional Hook (called by hooks automatically)
node conditional-hook.js validate-ts
node conditional-hook.js checkpoint
node conditional-hook.js final-validation
Execution Loop
code
FOR EACH atomic task WHERE status !== 'completed':
1. Mark task as 'in_progress' (TodoWrite)
2. Execute task (write code, modify files)
→ Write hook triggers: validate-ts
3. Validate explicitly: validation-gate.sh atomic
4. If validation fails: call recovery.js
5. Mark task as 'completed' (TodoWrite)
→ TodoWrite hook triggers: checkpoint (automatic)
6. Continue to next task
WHEN all tasks done:
7. Final validation: validation-gate.sh final
8. Create final checkpoint
9. Generate report
10. Mark task as COMPLETED
Success Criteria
A long task is successfully completed when:
- •✅ All atomic tasks status = 'completed'
- •✅ Final validation passes (TypeScript + ESLint + Tests)
- •✅ Final checkpoint created
- •✅ Execution report generated
- •✅ Zero blockers or unresolved errors
Iron Laws
- •NEVER skip task decomposition - Long task WITHOUT atomic tasks = BLOCKED
- •NEVER continue on validation failure - Failed validation = STOP + RECOVER
- •NEVER exceed maxTime (10min) - If task takes longer = something is wrong, STOP
- •ALWAYS checkpoint after completed task - No exceptions
- •ALWAYS use state machine - Invalid transitions = ERROR
- •ALWAYS log structured - Every event, every decision (JSON format)
- •NEVER lose context - Checkpoints carry FULL context
- •ALWAYS self-heal - Try recovery before escalating to human
Performance Targets
- •Validation per task: < 5 seconds (npx tsc --noEmit)
- •Checkpoint creation: < 1 second
- •Recovery decision: < 30 seconds
- •Total overhead: < 5% of execution time
Use npx tsc --noEmit instead of npm run build for validation (90% faster).