AgentSkillsCN

perform-sweep

设计、配置、启动和分析GRPO训练的消融扫描。用于假设检验、超参数实验和系统性比较。

SKILL.md
--- frontmatter
name: perform-sweep
description: Design, configure, launch, and analyze ablation sweeps for GRPO training. Use for hypothesis testing, hyperparameter experiments, and systematic comparisons.

Perform Sweep

End-to-end workflow for running ablation experiments on the Diplomacy GRPO training pipeline.

Quick Reference

PhaseActionCommand
ConfigureCreate sweep.yamlSee YAML Reference
ValidateDry runpython scripts/launch_sweep.py <path> --dry-run
InfoShow configpython scripts/launch_sweep.py <path> --info
LaunchStart sweeppython scripts/launch_sweep.py <path>
StatusCheck progresspython scripts/launch_sweep.py <path> --status
ListList all sweepspython scripts/launch_sweep.py --list
AnalyzeCompare resultsUse experiment-analysis skill

Workflow

1. Hypothesis Design

  • Review recent experiments in experiments/experiment-tracker.md
  • Identify one variable to test (e.g., horizon length, scoring function)
  • Predict expected outcome
  • Document reasoning in sweep.yaml hypothesis field

2. YAML Configuration

Create experiments/sweeps/<name>/sweep.yaml:

yaml
metadata:
  name: "my-ablation"
  description: "Testing hypothesis X"
  hypothesis: "Longer horizons should improve strategic play"
  experiment_tag_prefix: "my-ablation"

defaults:
  total_steps: 100

runs:
  A:
    name: "control"
    description: "Baseline configuration"
    config:
      experiment_tag: "${metadata.experiment_tag_prefix}-A"
  B:
    name: "treatment"
    description: "With longer horizon"
    config:
      rollout_horizon_years: 8
      experiment_tag: "${metadata.experiment_tag_prefix}-B"

See YAML Reference for full schema.

3. Validate Configuration

bash
# Show sweep info
python scripts/launch_sweep.py experiments/sweeps/<name>/ --info

# Dry run (validates config, shows what would run)
python scripts/launch_sweep.py experiments/sweeps/<name>/ --dry-run

4. Launch and Monitor

bash
# Launch (fire-and-forget - runs in cloud)
python scripts/launch_sweep.py experiments/sweeps/<name>/

# Check status anytime
python scripts/launch_sweep.py experiments/sweeps/<name>/ --status

# List all sweeps
python scripts/launch_sweep.py --list

5. Analysis

After sweep completes, use the experiment-analysis skill:

bash
# Full analysis for each run
uv run python .claude/skills/experiment-analysis/analyze_elo.py <run-name>

# Compare in WandB
# Filter by experiment_tag_prefix (e.g., "my-ablation")

Key Features

  • Fire-and-forget: Launch and close laptop - sweep runs in Modal cloud
  • Auto-resume: If Modal times out (24hr max), sweep automatically respawns
  • Sequential execution: Runs one training at a time (infra constraint)
  • Progress tracking: State saved after each run for recovery

Example Sweeps

See existing sweeps in experiments/sweeps/:

  • longer-horizon-inverted-weight-ablation/ - 2x2 ablation on horizon and scoring

Integration

  • Use experiment-analysis skill for post-sweep metrics analysis
  • Results logged to WandB with experiment_tag for filtering
  • Document findings in sweep directory's results.md