Iterate Campaign
Diagnose and fix campaign failures through targeted investigation and iterative retries. This is the debugging counterpart to run-campaign — use it when a campaign has completed but the win rate is below target.
Arguments
- •
$1(required): Campaign ID to analyze (e.g.100) - •
$2(optional): Target win rate to aim for (default: 90%)
Overview
The iteration loop:
Analyze failures → Diagnose root cause → Fix code → Test → Rebuild images → Retry → Evaluate
↑ │
└──────────────────────────────────────────────────────────────────────────────-─┘
Steps
1. Analyze Campaign Failures
Get overall results and per-trial breakdown:
source $PROJECT_ROOT/.env uv run eval show <campaign_id> --remote
Then get per-trial chaos types for all failed trials:
uv run eval show --trial <trial_id> --remote
Group failures by chaos type. Common patterns:
- •All trials of a chaos type failed with 0 commands: Monitor never created a ticket (missing or broken invariant)
- •Trials resolved but scored FAILURE: Agent fixed the issue but final state check failed (timing or scoring bug)
- •Negative detect times: Pre-chaos ticket leakage (startup tickets not properly cleared)
- •Timeout (not resolved): Agent couldn't fix the issue within the time limit
2. Investigate Root Causes
For each failing chaos type, investigate systematically:
Check worker logs
If worker logs are available locally:
tail -100 /tmp/worker-*.log
Check trial details
uv run eval show --trial <id> --remote # Timing, commands, chaos type uv run eval show --trial <id> --remote --json # Full state data
Look at:
- •Timing: Is detect time negative (pre-chaos ticket)? Is it 0s (pre-existing ticket)?
- •Commands: Did the agent run any? Were they relevant to the chaos type?
- •Final state: Is it empty
{}? Does it show unhealthy stores?
Check invariant coverage
# What invariants exist? grep -n "def check_" subjects/tikv/observer/src/tikv_observer/invariants.py # What violations can the checker produce? grep -n "name=" subjects/tikv/observer/src/tikv_observer/invariants.py | grep CONFIG
Common root cause checklist
| Symptom | Likely Cause | Fix Location |
|---|---|---|
| 0 commands, no ticket | Missing invariant for chaos type | subjects/*/observer/invariants.py |
| 0 commands, ticket exists | Agent stuck on startup ticket | eval/src/eval/runner/worker.py (restart_operator) |
| Resolved but FAILURE | Final state captured too early | eval/src/eval/runner/worker.py (wait_healthy) |
| Resolved but FAILURE | State capture queries wrong endpoint | eval/src/eval/subjects/cloud/gcp/subject.py |
| Negative detect time | Startup ticket leaked past force-resolve | eval/src/eval/runner/worker.py (pre-chaos flow) |
| All PD-related chaos fails | Monitor uses single PD endpoint | eval/src/eval/runner/remote_operator.py (PD endpoints) |
| Invariant too sensitive | Fires during startup/stabilization | Adjust threshold or grace period in invariants.py |
| Invariant not sensitive enough | Doesn't fire for the chaos effect | Lower threshold in invariants.py |
3. Fix
Make code changes. Key files by concern:
| Concern | Files |
|---|---|
| Invariant detection | subjects/tikv/observer/src/tikv_observer/invariants.py |
| Observation data | subjects/tikv/observer/src/tikv_observer/subject.py |
| PD client / failover | subjects/tikv/observer/src/tikv_observer/pd_client.py |
| Worker trial flow | eval/src/eval/runner/worker.py |
| Remote operator | eval/src/eval/runner/remote_operator.py |
| State capture | eval/src/eval/subjects/cloud/gcp/subject.py |
| Trial scoring | eval/src/eval/analysis/scoring.py |
| Chaos injection | eval/src/eval/subjects/tikv/chaos.py (local), cloud/gcp/subject.py (cloud) |
4. Test
Run tests for all modified packages:
# Always run these uv run pytest packages/operator-core/tests/ -x -q uv run pytest subjects/tikv/observer/tests/ -x -q cd eval && uv run pytest tests/ -x -q --ignore=tests/test_chat_db_app_e2e.py
5. Rebuild & Push Images
Determine which images need rebuilding based on what changed:
| Changed | Rebuild |
|---|---|
subjects/*/observer/ | Operator image |
packages/operator-core/ | Operator image |
eval/src/eval/ | Worker image |
| Both | Both images |
# Operator (invariant/observer changes) cd $PROJECT_ROOT docker build --platform linux/amd64 -f subjects/tikv/Dockerfile.operator -t us-central1-docker.pkg.dev/operator-486214/eval/operator:latest . docker push us-central1-docker.pkg.dev/operator-486214/eval/operator:latest # Worker (eval code changes) docker build --platform linux/amd64 -t eval-worker -f eval/Dockerfile . docker tag eval-worker us-central1-docker.pkg.dev/operator-486214/eval/worker:latest docker push us-central1-docker.pkg.dev/operator-486214/eval/worker:latest
All builds MUST use --platform linux/amd64 on ARM Macs.
6. Create Targeted Retry Campaign
Create a campaign YAML with just the failing chaos types:
# eval/campaigns/debug/tikv-retry-<chaos>.yaml
name: tikv-retry-<description>
subjects: [tikv]
chaos_types:
- type: <failing_chaos_type_1>
- type: <failing_chaos_type_2>
trials_per_combination: 3
parallel: 3
cooldown_seconds: 10
include_baseline: false
cloud:
provider: gcp
operator:
enabled: true
image: us-central1-docker.pkg.dev/operator-486214/eval/operator:latest
Enqueue and start workers:
source $PROJECT_ROOT/.env
uv run eval run campaign campaigns/debug/tikv-retry-<chaos>.yaml --cloud=gcp --parallel 3
# Start local workers (3 is typical)
for i in 1 2 3; do
nohup uv run eval worker start --cloud=gcp --id=worker-$i \
--operator-image=us-central1-docker.pkg.dev/operator-486214/eval/operator:latest \
> /tmp/worker-$i.log 2>&1 &
done
7. Monitor & Evaluate
Wait for the campaign to complete with live progress:
source $PROJECT_ROOT/.env && uv run eval wait <new_campaign_id> --remote
Run this as a background Bash command so you can continue working while it runs. It will show live progress and exit with a summary when all trials finish.
When complete, compare results against the previous campaign:
- •Did the targeted chaos types improve?
- •Any regressions in previously-passing types?
- •Is the win rate at or above target?
8. Iterate or Finalize
If failures remain: Go back to step 2 with the new campaign's failed trials. Use eval logs <trial_id> --remote if Cloud Logging is enabled.
If target reached on retry: Run a full campaign to validate no regressions:
uv run eval run campaign campaigns/operations/tikv-all-chaos-cloud.yaml --cloud=gcp --parallel 3
When done: Kill workers, mark notable campaigns, commit fixes:
# Kill workers
ps aux | grep 'eval worker' | grep -v grep | awk '{print $2}' | xargs kill -9
# Mark campaign as notable
uv run eval notable <campaign_id> --remote
uv run eval note <campaign_id> --remote "<description of what was fixed>"
# Commit
git add <changed-files>
git commit -m "fix(eval): <description>"
git push
Environment
- •
$PROJECT_ROOTis the git repo root (parent of eval/) - •
.envat project root containsEVAL_DATABASE_URLandANTHROPIC_API_KEY - •Working directory is
eval/