AgentSkillsCN

iterate-campaign

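Analyze campaign failures in depth, pinpoint root causes, and keep fixing, rebuilding, and retrying until the win rate improves significantly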

SKILL.md
--- frontmatter
name: iterate-campaign
description: Analyze campaign failures, diagnose root causes, fix, rebuild, and retry until win rate improves
disable-model-invocation: true
allowed-tools: Bash, Read, Write, Edit, Glob, Grep, TaskCreate, TaskUpdate, TaskList, TaskOutput
argument-hint: <campaign-id> [--target-rate N%]

Iterate Campaign

Diagnose and fix campaign failures through targeted investigation and iterative retries. This is the debugging counterpart to run-campaign: use it when a campaign has completed but its win rate is below target.

Arguments

  • $1 (required): Campaign ID to analyze (e.g. 100)
  • $2 (optional): Target win rate to aim for, given as --target-rate N% (default: 90%)

Overview

The iteration loop:

code
Analyze failures → Diagnose root cause → Fix code → Test → Rebuild images → Retry → Evaluate
     ↑                                                                              │
     └──────────────────────────────────────────────────────────────────────────────┘

Steps

1. Analyze Campaign Failures

Get overall results and per-trial breakdown:

bash
source $PROJECT_ROOT/.env
uv run eval show <campaign_id> --remote

Then get per-trial chaos types for all failed trials:

bash
uv run eval show --trial <trial_id> --remote

Group failures by chaos type. Common patterns:

  • All trials of a chaos type failed with 0 commands: Monitor never created a ticket (missing or broken invariant)
  • Trials resolved but scored FAILURE: Agent fixed the issue but final state check failed (timing or scoring bug)
  • Negative detect times: Pre-chaos ticket leakage (startup tickets not properly cleared)
  • Timeout (not resolved): Agent couldn't fix the issue within the time limit
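
The grouping step can be sketched in Python as below. This is an illustration only: the record fields (`chaos_type`, `commands`, `resolved`, `detect_seconds`) are hypothetical stand-ins, not the actual schema printed by `eval show`, so adapt the names to the real output.

```python
from collections import defaultdict


def summarize_failures(failed_trials):
    """Group failed trials by chaos type and count the patterns listed above.

    Each trial is a dict with hypothetical keys: chaos_type, commands,
    resolved, detect_seconds.
    """
    by_type = defaultdict(list)
    for trial in failed_trials:
        by_type[trial["chaos_type"]].append(trial)

    summary = {}
    for chaos, trials in by_type.items():
        summary[chaos] = {
            "failed": len(trials),
            # Every trial ran 0 commands: the monitor probably never opened a ticket.
            "all_zero_commands": all(t.get("commands", 0) == 0 for t in trials),
            # Resolved but still failed: suspect timing or scoring.
            "resolved_but_failed": sum(1 for t in trials if t.get("resolved")),
            # Negative detect time: a ticket existed before chaos was injected.
            "negative_detect": sum(
                1
                for t in trials
                if t.get("detect_seconds") is not None and t["detect_seconds"] < 0
            ),
        }
    return summary
```

A chaos type whose summary shows `all_zero_commands` set is a candidate for a missing or broken invariant; a nonzero `resolved_but_failed` points toward the final state check instead.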

2. Investigate Root Causes

For each failing chaos type, investigate systematically:

Check worker logs

If worker logs are available locally:

bash
tail -100 /tmp/worker-*.log

Check trial details

bash
uv run eval show --trial <id> --remote        # Timing, commands, chaos type
uv run eval show --trial <id> --remote --json  # Full state data

Look at:

  • Timing: Is detect time negative (pre-chaos ticket)? Is it 0s (pre-existing ticket)?
  • Commands: Did the agent run any? Were they relevant to the chaos type?
  • Final state: Is it empty {}? Does it show unhealthy stores?
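
The three checks above can be applied to a single trial with a small helper. This is a sketch under assumptions: the keys `detect_seconds`, `commands`, `final_state`, `stores`, and `healthy` are hypothetical and may not match the structure emitted by `--json`.

```python
def trial_red_flags(trial):
    """Return the red flags from the checklist above for one trial dict.

    All keys read here are assumed names, not a documented schema.
    """
    flags = []

    detect = trial.get("detect_seconds")
    if detect is not None:
        if detect < 0:
            flags.append("negative_detect_time")  # ticket predates chaos injection
        elif detect == 0:
            flags.append("zero_detect_time")  # ticket already existed at injection

    if not trial.get("commands"):
        flags.append("no_commands")  # agent never acted

    final_state = trial.get("final_state")
    if not final_state:
        flags.append("empty_final_state")  # {} or missing
    elif any(not store.get("healthy", True) for store in final_state.get("stores", [])):
        flags.append("unhealthy_stores")

    return flags
```

An empty list means none of the listed symptoms apply, in which case the worker logs are the next place to look.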

Check invariant coverage

bash
# What invariants exist?
grep -n "def check_" subjects/tikv/observer/src/tikv_observer/invariants.py

# What violations can the checker produce?
grep -n "name=" subjects/tikv/observer/src/tikv_observer/invariants.py | grep CONFIG

Common root cause checklist

Symptom | Likely Cause | Fix Location
--- | --- | ---
0 commands, no ticket | Missing invariant for chaos type | subjects/*/observer/invariants.py
0 commands, ticket exists | Agent stuck on startup ticket | eval/src/eval/runner/worker.py (restart_operator)
Resolved but FAILURE | Final state captured too early | eval/src/eval/runner/worker.py (wait_healthy)
Resolved but FAILURE | State capture queries wrong endpoint | eval/src/eval/subjects/cloud/gcp/subject.py
Negative detect time | Startup ticket leaked past force-resolve | eval/src/eval/runner/worker.py (pre-chaos flow)
All PD-related chaos fails | Monitor uses single PD endpoint | eval/src/eval/runner/remote_operator.py (PD endpoints)
Invariant too sensitive | Fires during startup/stabilization | Adjust threshold or grace period in invariants.py
Invariant not sensitive enough | Doesn't fire for the chaos effect | Lower threshold in invariants.py
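
The last two rows trade off a threshold against a grace period. A generic way to picture that trade-off is a check that reports a violation only after the threshold has been exceeded continuously for the grace period. The class below is a hypothetical illustration of that pattern, not the actual interface used in invariants.py.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GracePeriodCheck:
    """Report a violation only after a sustained breach (illustrative only)."""

    threshold: float
    grace_seconds: float
    breach_started: Optional[float] = None

    def observe(self, value: float, now: float) -> bool:
        """Return True when the breach has lasted at least grace_seconds."""
        if value <= self.threshold:
            # Back within bounds: forget any breach in progress.
            self.breach_started = None
            return False
        if self.breach_started is None:
            self.breach_started = now
        return now - self.breach_started >= self.grace_seconds
```

In this sketch, raising `grace_seconds` suppresses noise during startup at the cost of slower detection, while lowering `threshold` makes the check fire for smaller chaos effects.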

3. Fix

Make code changes. Key files by concern:

Concern | Files
--- | ---
Invariant detection | subjects/tikv/observer/src/tikv_observer/invariants.py
Observation data | subjects/tikv/observer/src/tikv_observer/subject.py
PD client / failover | subjects/tikv/observer/src/tikv_observer/pd_client.py
Worker trial flow | eval/src/eval/runner/worker.py
Remote operator | eval/src/eval/runner/remote_operator.py
State capture | eval/src/eval/subjects/cloud/gcp/subject.py
Trial scoring | eval/src/eval/analysis/scoring.py
Chaos injection | eval/src/eval/subjects/tikv/chaos.py (local), cloud/gcp/subject.py (cloud)

4. Test

Run tests for all modified packages:

bash
# Always run these. The first two paths are relative to the repo root.
cd "$PROJECT_ROOT"
uv run pytest packages/operator-core/tests/ -x -q
uv run pytest subjects/tikv/observer/tests/ -x -q
cd eval && uv run pytest tests/ -x -q --ignore=tests/test_chat_db_app_e2e.py

5. Rebuild & Push Images

Determine which images need rebuilding based on what changed:

Changed | Rebuild
--- | ---
subjects/*/observer/ | Operator image
packages/operator-core/ | Operator image
eval/src/eval/ | Worker image
Both | Both images

bash
# Operator (invariant/observer changes)
cd $PROJECT_ROOT
docker build --platform linux/amd64 -f subjects/tikv/Dockerfile.operator -t us-central1-docker.pkg.dev/operator-486214/eval/operator:latest .
docker push us-central1-docker.pkg.dev/operator-486214/eval/operator:latest

# Worker (eval code changes)
docker build --platform linux/amd64 -t eval-worker -f eval/Dockerfile .
docker tag eval-worker us-central1-docker.pkg.dev/operator-486214/eval/worker:latest
docker push us-central1-docker.pkg.dev/operator-486214/eval/worker:latest

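On ARM Macs (Apple Silicon), every build MUST pass --platform linux/amd64; without it, Docker builds for the host architecture (arm64) instead of the linux/amd64 target.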
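
The rebuild decision can be derived mechanically from the list of changed files. The helper below is a minimal sketch that encodes the path-to-image mapping from this step; the function name and the idea of feeding it the output of `git diff --name-only` are assumptions, not part of the eval tooling.

```python
from fnmatch import fnmatch

# Path patterns taken from the rebuild mapping in this step.
REBUILD_RULES = [
    ("subjects/*/observer/*", "operator"),
    ("packages/operator-core/*", "operator"),
    ("eval/src/eval/*", "worker"),
]


def images_to_rebuild(changed_paths):
    """Return the sorted image names that the changed paths require."""
    images = set()
    for path in changed_paths:
        for pattern, image in REBUILD_RULES:
            # fnmatch's "*" also matches "/" so nested files are covered.
            if fnmatch(path, pattern):
                images.add(image)
    return sorted(images)
```

Paths outside the three prefixes map to no image, so a docs-only change returns an empty list.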

6. Create Targeted Retry Campaign

Create a campaign YAML with just the failing chaos types:

yaml
# eval/campaigns/debug/tikv-retry-<chaos>.yaml
name: tikv-retry-<description>
subjects: [tikv]
chaos_types:
  - type: <failing_chaos_type_1>
  - type: <failing_chaos_type_2>
trials_per_combination: 3
parallel: 3
cooldown_seconds: 10
include_baseline: false
cloud:
  provider: gcp
  operator:
    enabled: true
    image: us-central1-docker.pkg.dev/operator-486214/eval/operator:latest
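
When several chaos types fail, the YAML above can be generated rather than hand-edited. The sketch below renders the same fields as plain text. The function name is hypothetical, and the chaos type names passed to it must be whatever identifiers the failed trials actually report.

```python
OPERATOR_IMAGE = "us-central1-docker.pkg.dev/operator-486214/eval/operator:latest"


def render_retry_campaign(description, chaos_types, trials=3, parallel=3):
    """Render a retry campaign YAML document for the given failing chaos types."""
    if not chaos_types:
        raise ValueError("at least one failing chaos type is required")
    lines = [
        f"name: tikv-retry-{description}",
        "subjects: [tikv]",
        "chaos_types:",
        *[f"  - type: {chaos}" for chaos in chaos_types],
        f"trials_per_combination: {trials}",
        f"parallel: {parallel}",
        "cooldown_seconds: 10",
        "include_baseline: false",
        "cloud:",
        "  provider: gcp",
        "  operator:",
        "    enabled: true",
        f"    image: {OPERATOR_IMAGE}",
    ]
    return "\n".join(lines) + "\n"
```

Writing the returned string to a file under eval/campaigns/debug/ gives a campaign equivalent to the template above.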

Enqueue and start workers:

bash
source "$PROJECT_ROOT/.env"
cd "$PROJECT_ROOT/eval"  # the campaign path below is relative to eval/
uv run eval run campaign campaigns/debug/tikv-retry-<chaos>.yaml --cloud=gcp --parallel 3

# Start local workers (3 is typical)
for i in 1 2 3; do
  nohup uv run eval worker start --cloud=gcp --id=worker-$i \
    --operator-image=us-central1-docker.pkg.dev/operator-486214/eval/operator:latest \
    > /tmp/worker-$i.log 2>&1 &
done

7. Monitor & Evaluate

Wait for the campaign to complete with live progress:

bash
source $PROJECT_ROOT/.env && uv run eval wait <new_campaign_id> --remote

Run this as a background Bash command so you can continue working while it runs. It will show live progress and exit with a summary when all trials finish.

When complete, compare results against the previous campaign:

  • Did the targeted chaos types improve?
  • Any regressions in previously-passing types?
  • Is the win rate at or above target?
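
The three questions above can be answered from per-chaos-type tallies. The sketch below assumes win rate means wins divided by total trials and that results are available as `{chaos_type: (wins, total)}`; both the definition and the data shape are assumptions, not the output format of `eval show`.

```python
def rate(wins, total):
    """Win rate as a fraction; 0.0 when there were no trials."""
    return wins / total if total else 0.0


def compare_campaigns(previous, current, targeted, target=0.90):
    """Compare two campaigns given {chaos_type: (wins, total)} mappings."""
    improved, regressed = [], []
    for chaos, (wins, total) in current.items():
        if chaos not in previous:
            continue  # nothing to compare against
        before = rate(*previous[chaos])
        after = rate(wins, total)
        if chaos in targeted and after > before:
            improved.append(chaos)
        elif chaos not in targeted and after < before:
            regressed.append(chaos)

    total_wins = sum(wins for wins, _ in current.values())
    total_trials = sum(total for _, total in current.values())
    overall = rate(total_wins, total_trials)
    return {
        "improved": sorted(improved),
        "regressed": sorted(regressed),
        "overall": overall,
        "meets_target": overall >= target,
    }
```

A targeted retry campaign contains only the failing chaos types, so `regressed` is meaningful mainly when `current` comes from a full campaign.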

8. Iterate or Finalize

If failures remain: Go back to step 2 with the new campaign's failed trials. Use eval logs <trial_id> --remote if Cloud Logging is enabled.

If target reached on retry: Run a full campaign to validate no regressions:

bash
uv run eval run campaign campaigns/operations/tikv-all-chaos-cloud.yaml --cloud=gcp --parallel 3

When done: Kill workers, mark notable campaigns, commit fixes:

bash
# Kill workers
ps aux | grep 'eval worker' | grep -v grep | awk '{print $2}' | xargs kill -9

# Mark campaign as notable
uv run eval notable <campaign_id> --remote
uv run eval note <campaign_id> --remote "<description of what was fixed>"

# Commit
git add <changed-files>
git commit -m "fix(eval): <description>"
git push

Environment

  • $PROJECT_ROOT is the git repo root (parent of eval/)
  • .env at project root contains EVAL_DATABASE_URL and ANTHROPIC_API_KEY
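  • Working directory is eval/ unless a step says otherwise; the test paths in step 4 and the image builds in step 5 are relative to $PROJECT_ROOT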