Perform Sweep

Name: perform-sweep
Rating: 92
Author: bglick13

End-to-end workflow for running ablation experiments on the Diplomacy GRPO training pipeline.

Quick Reference

Phase	Action	Command
Configure	Create sweep.yaml	See YAML Reference
Validate	Dry run	`python scripts/launch_sweep.py <path> --dry-run`
Info	Show config	`python scripts/launch_sweep.py <path> --info`
Launch	Start sweep	`python scripts/launch_sweep.py <path>`
Status	Check progress	`python scripts/launch_sweep.py <path> --status`
List	List all sweeps	`python scripts/launch_sweep.py --list`
Analyze	Compare results	Use `experiment-analysis` skill

Workflow

1. Hypothesis Design

•Review recent experiments in experiments/experiment-tracker.md
•Identify one variable to test (e.g., horizon length, scoring function)
•Predict the expected outcome
•Document the reasoning in the hypothesis field of sweep.yaml

2. YAML Configuration

Create experiments/sweeps/<name>/sweep.yaml:

yaml

metadata:
  name: "my-ablation"
  description: "Testing hypothesis X"
  hypothesis: "Longer horizons should improve strategic play"
  experiment_tag_prefix: "my-ablation"

defaults:
  total_steps: 100

runs:
  A:
    name: "control"
    description: "Baseline configuration"
    config:
      experiment_tag: "${metadata.experiment_tag_prefix}-A"
  B:
    name: "treatment"
    description: "With longer horizon"
    config:
      rollout_horizon_years: 8
      experiment_tag: "${metadata.experiment_tag_prefix}-B"

See YAML Reference for the full schema.
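
To make the example config concrete, here is a minimal Python sketch of how each run's effective config could be derived from it: defaults are merged with the run-level config, and `${metadata.experiment_tag_prefix}` placeholders are resolved. The function names (`lookup`, `interpolate`, `resolve_runs`) and the "run config overrides defaults" merge rule are assumptions for illustration; the actual logic in scripts/launch_sweep.py may differ.

```python
import re

# Mirrors the sweep.yaml example as a plain dict, i.e. roughly what a YAML
# loader would return. Descriptions are omitted for brevity.
SWEEP = {
    "metadata": {
        "name": "my-ablation",
        "experiment_tag_prefix": "my-ablation",
    },
    "defaults": {"total_steps": 100},
    "runs": {
        "A": {
            "name": "control",
            "config": {"experiment_tag": "${metadata.experiment_tag_prefix}-A"},
        },
        "B": {
            "name": "treatment",
            "config": {
                "rollout_horizon_years": 8,
                "experiment_tag": "${metadata.experiment_tag_prefix}-B",
            },
        },
    },
}

# Matches placeholders of the form ${some.dotted.path}.
_PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def lookup(root, dotted_path):
    """Follow a dotted path such as 'metadata.experiment_tag_prefix'."""
    node = root
    for key in dotted_path.split("."):
        node = node[key]
    return node


def interpolate(value, root):
    """Resolve ${a.b} placeholders in strings; return other values unchanged."""
    if isinstance(value, str):
        return _PLACEHOLDER.sub(lambda m: str(lookup(root, m.group(1))), value)
    return value


def resolve_runs(sweep):
    """Return {run_id: effective_config}; run-level keys override defaults (assumed)."""
    resolved = {}
    for run_id, run in sweep["runs"].items():
        merged = {**sweep.get("defaults", {}), **run.get("config", {})}
        resolved[run_id] = {k: interpolate(v, sweep) for k, v in merged.items()}
    return resolved


if __name__ == "__main__":
    for run_id, config in resolve_runs(SWEEP).items():
        print(run_id, config)
```

Under these assumptions, run A inherits total_steps from defaults and differs from run B only in rollout_horizon_years and its tag.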

3. Validate Configuration

bash

# Show sweep info
python scripts/launch_sweep.py experiments/sweeps/<name>/ --info

# Dry run (validates config, shows what would run)
python scripts/launch_sweep.py experiments/sweeps/<name>/ --dry-run
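
Beyond the built-in dry run, it can help to confirm by hand that each treatment changes only the variable under test. The sketch below is a hypothetical pre-flight check, not part of launch_sweep.py: `diff_from_control` and `duplicate_tags` are illustrative names, and the input is assumed to be a dict of already-resolved run configs.

```python
from collections import Counter

# Assumed shape: {run_id: effective_config}, e.g. the example sweep after
# defaults are merged and tags are interpolated.
RESOLVED = {
    "A": {"total_steps": 100, "experiment_tag": "my-ablation-A"},
    "B": {
        "total_steps": 100,
        "rollout_horizon_years": 8,
        "experiment_tag": "my-ablation-B",
    },
}


def diff_from_control(resolved, control_id="A", ignore=("experiment_tag",)):
    """For each non-control run, list keys whose value differs from the control.

    Returns {run_id: {key: (control_value, run_value)}}. Keys missing on one
    side are reported with None for that side.
    """
    control = resolved[control_id]
    report = {}
    for run_id, config in resolved.items():
        if run_id == control_id:
            continue
        keys = (set(control) | set(config)) - set(ignore)
        report[run_id] = {
            key: (control.get(key), config.get(key))
            for key in sorted(keys)
            if control.get(key) != config.get(key)
        }
    return report


def duplicate_tags(resolved):
    """Return experiment_tag values used by more than one run."""
    counts = Counter(cfg.get("experiment_tag") for cfg in resolved.values())
    return sorted(tag for tag, n in counts.items() if tag is not None and n > 1)


if __name__ == "__main__":
    print(diff_from_control(RESOLVED))
    print(duplicate_tags(RESOLVED))
```

A treatment that differs from the control in more keys than intended, or two runs sharing a tag, is worth fixing before launch, since tags are what distinguish the runs when filtering in WandB.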

4. Launch and Monitor

bash

# Launch (fire-and-forget - runs in cloud)
python scripts/launch_sweep.py experiments/sweeps/<name>/

# Check status anytime
python scripts/launch_sweep.py experiments/sweeps/<name>/ --status

# List all sweeps
python scripts/launch_sweep.py --list

5. Analysis

After the sweep completes, use the experiment-analysis skill:

bash

# Full analysis for each run
uv run python .claude/skills/experiment-analysis/analyze_elo.py <run-name>

# Compare in WandB
# Filter by experiment_tag_prefix (e.g., "my-ablation")

Key Features

•Fire-and-forget: Launch and close your laptop - the sweep runs in the Modal cloud
•Auto-resume: If the Modal job times out (24hr max), the sweep automatically respawns
•Sequential execution: Runs one training job at a time (infra constraint)
•Progress tracking: State is saved after each run for recovery
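
The combination of sequential execution, per-run state saving, and auto-resume can be pictured with a small sketch. This is only an illustration of the general pattern, assuming a JSON state file and a hypothetical `train_fn` callback; the real sweep runner's state format and respawn mechanism on Modal are not shown in this document.

```python
import json
import tempfile
from pathlib import Path


def load_state(state_path):
    """Read the saved state, or start fresh if no state file exists yet."""
    path = Path(state_path)
    if path.exists():
        return json.loads(path.read_text())
    return {"completed": []}


def run_sweep(run_ids, train_fn, state_path):
    """Run trainings one at a time, persisting progress after each run.

    If the process dies partway (e.g. a timeout), calling run_sweep again
    with the same state_path skips runs that already finished.
    """
    state = load_state(state_path)
    for run_id in run_ids:
        if run_id in state["completed"]:
            continue  # finished in an earlier invocation
        train_fn(run_id)  # hypothetical: blocks until this run is done
        state["completed"].append(run_id)
        Path(state_path).write_text(json.dumps(state))
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        state_file = Path(tmp) / "sweep_state.json"
        final = run_sweep(["A", "B"], lambda run_id: print("training", run_id), state_file)
        print(final)
```

Because the state is written only after a run finishes, a run interrupted mid-training is retried from its start on resume in this sketch; whether the real pipeline resumes from a checkpoint instead is not stated here.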

Example Sweeps

See existing sweeps in experiments/sweeps/:

•longer-horizon-inverted-weight-ablation/ - 2x2 ablation on horizon and scoring

Integration

•Use experiment-analysis skill for post-sweep metrics analysis
•Results are logged to WandB with experiment_tag for filtering
•Document findings in results.md in the sweep directory