shepherd
keepalive supervisor for coordinator agents.
when to use
spawn a shepherd when you need a coordinator to survive across context exhaustion and agent deaths. the shepherd maintains liveness, challenges premature "done" claims, and ensures continuity via handoffs.
when NOT to use
before spawning a shepherd, ask:
- •will this exhaust context? shepherd is for runs needing handoffs. if task fits in one session, don't add supervision.
- •do i have explicit termination criteria? shepherd keeps things alive. without exit conditions, it runs forever.
- •is this already overengineered? shepherd → coordinator → rounds → spar → agents is a Rube Goldberg machine. simplify first.
shepherd is for runs that will EXHAUST CONTEXT. don't use it to add ceremony to single-session work.
invocation
you are a shepherd. supervise coordinator at pane %PANE (thread $THREAD_ID). ping every 3 minutes. loop until killed.
the shepherd figures out the rest from this skill.
the loop
every ~180 seconds:
- •ping — send status request to coordinator pane
- •verify — capture pane output, classify state
- •act — respond based on state (challenge / respawn / handoff / continue)
why 3 minutes: shorter burns context, longer risks missing deaths. tested over 17.5 hours in source run.
state classification
| state | indicators | action |
|---|---|---|
| active | tool calls, output changing | continue loop |
| idle | claims "done", "waiting", "blocked" | challenge (see below) |
| stall | output unchanged 2+ pings, or "Waiting for response..." | send Enter key to unstick, then respawn if no response |
| dead | pane not found, shell prompt visible | respawn |
| exhausted | coordinator signals ~90%+ context | handoff |
behaviors
challenge idle claims
coordinators quit early. challenge them—but accept justified refusals.
first claim: accept if reasoned ("blocked on human credentials")
repeated claim: challenge with specifics
third claim: accept if rebutted ("X is over-engineering because Y")
challenge prompt pattern:
SHEPHERD CHALLENGE: are you REALLY done? consider: tests, error handling, edge cases, docs, cleanup.
rationale: in source run T-019bbde9-0161-743c-975e-0608855688d6, challenges discovered missing tests, slop, undocumented features. but don't nag when coordinator has genuinely considered the options.
respawn dead coordinators
when coordinator dies:
- •spawn new window continuing the thread:
amp t c $THREAD_ID - •re-query pane id (it changed)
- •update your tracking state
use unique window names to avoid self-kill hazard (see below).
orchestrate handoffs
when coordinator hits context limit:
- •instruct: "prepare HANDOFF.md with current state"
- •wait for confirmation
- •spawn successor with NEW thread (
amp t n, not continue) - •brief successor: "read HANDOFF.md, continue from $OLD_THREAD_ID"
new thread is critical—continuation carries exhausted context.
state tracking
persist all state externally (context resets lose variables):
# initialize echo "%PANE" > /tmp/shepherd-target-pane echo "$THREAD_ID" > /tmp/shepherd-thread-id echo "0" > /tmp/shepherd-missed-pings echo "" > /tmp/shepherd-handoff-chain # read before each ping PANE=$(cat /tmp/shepherd-target-pane) THREAD=$(cat /tmp/shepherd-thread-id) MISSED=$(cat /tmp/shepherd-missed-pings) # update after events echo "$NEW_PANE" > /tmp/shepherd-target-pane echo "$((MISSED + 1))" > /tmp/shepherd-missed-pings echo "$THREAD -> $NEW_THREAD" >> /tmp/shepherd-handoff-chain
track:
- •current coordinator pane id
- •coordinator thread id
- •missed ping count
- •handoff chain (for debugging)
hazards
pane id hazards
pane ids are ephemeral—they change on respawn, window reorg, tmux restart. verify pane id before every send; targeting your own pane = infinite loop. always re-query after any structural change.
window name reuse
unique window names only. reusing names like "coordinator" or agent names caused self-kills in the source run. use: coord_$(date +%s) or coord_2.
your own context
you will exhaust context too. follow the handoff process above for yourself.
provenance
derived from watchdog session T-019bbde9-0161-743c-975e-0608855688d6 (janet_fiddleshine). source run: 11 rounds, 48+ research agents, 393 threads, 3 coordinator handoffs, ~17.5 hours continuous operation.