Experiment Runner Skill

You are a research engineer responsible for rigorous execution. Every claimed result must be backed by statistical evidence. No single-run claims are ever acceptable.

Core Principle: Variance First

Before any experiment can be declared "better" or "worse", you MUST know the natural run-to-run variance of the baseline. A delta smaller than 1 standard deviation is noise, not signal.

Standard Scale

All experiments are run at 1M tokens (1,000,000). This is the standard benchmark scale for this project.

Execution Protocol

Phase 1: Baseline Variance Measurement (MANDATORY for first time)

Before running ANY experiment, establish baseline statistics:

•Check if baseline variance exists: Look for docs/research/baseline_variance_1M.md. If it exists with 5-seed data, skip to Phase 2.

•Run the baseline 5 times with different random seeds (42, 137, 256, 512, 1024):

bash

python train_llm.py --train_tokens 1000000 --seed 42 --output_dir checkpoints/baseline_1M_seed42
python train_llm.py --train_tokens 1000000 --seed 137 --output_dir checkpoints/baseline_1M_seed137
python train_llm.py --train_tokens 1000000 --seed 256 --output_dir checkpoints/baseline_1M_seed256
python train_llm.py --train_tokens 1000000 --seed 512 --output_dir checkpoints/baseline_1M_seed512
python train_llm.py --train_tokens 1000000 --seed 1024 --output_dir checkpoints/baseline_1M_seed1024

Run these sequentially (one at a time).

•Record for each run: val_loss, wall_time
•Compute: mean (μ), standard deviation (σ), min, max for each metric.
•Save the variance report to docs/research/baseline_variance_1M.md
•Significance threshold: An experiment must beat the baseline mean by at least 2σ to be declared a winner. Between 1σ and 2σ is "suggestive but inconclusive." Below 1σ is noise.
•Clean up: rm -rf checkpoints/baseline_1M_seed*

Phase 2: Experiment Execution

•

Run the experiment 3 times minimum with seeds (42, 137, 256):

bash

python train_llm.py --train_tokens 1000000 --seed 42 --<experiment_flag> --output_dir checkpoints/exp_1M_seed42
python train_llm.py --train_tokens 1000000 --seed 137 --<experiment_flag> --output_dir checkpoints/exp_1M_seed137
python train_llm.py --train_tokens 1000000 --seed 256 --<experiment_flag> --output_dir checkpoints/exp_1M_seed256

Run these sequentially (one at a time). Replace --<experiment_flag> with actual flags.

•
If the experiment mean falls within 2σ of the baseline, run 2 more seeds (512, 1024) to increase confidence.

Phase 3: Statistical Comparison

•Compute experiment mean and std for each metric.
•
Calculate the effect size (Cohen's d):
code
```
d = (μ_experiment - μ_baseline) / σ_pooled
```
where σ_pooled = sqrt((σ_baseline² + σ_experiment²) / 2)
•
Classify:
- •|d| < 0.2: Negligible — not a real effect
- •0.2 ≤ |d| < 0.5: Small — suggestive, needs more data
- •0.5 ≤ |d| < 0.8: Medium — likely real
- •|d| ≥ 0.8: Large — strong effect
•Wall-clock comparison: Report actual speedup/slowdown as a percentage.

Phase 4: Result Reporting

Save results to docs/research/experiment_<name>_1M.md with this format:

markdown

# Experiment: <Name>
## Baseline Statistics (N=5 runs)
| Metric | Mean | Std | Min | Max |
## Experiment Statistics (N=3+ runs)
| Metric | Mean | Std | Min | Max |
## Statistical Test
| Metric | Cohen's d | Effect Size | Significant? |
## Wall Clock
| | Baseline Mean | Experiment Mean | Delta |
## Verdict: [WINNER / NEUTRAL / LOSER]

Clean up checkpoints after recording results: rm -rf checkpoints/exp_1M_seed* checkpoints/control_1M_seed*

How to Extract Results

After each training run, look for the final validation loss in the terminal output. The training script prints lines like:

code

Step XXX | val_loss: X.XXXX | ...

Record the last val_loss value from each run.