Name: training-pipeline
Rating: 92
Author: mzqef

Purpose

Entry point for RL training tasks. Routes to specialized skills.

Quick Start — Just Run Training

powershell

# === FAST PROBE RUN (benchmark hardware + validate changes) ===
# 200K steps: ~16s. Good for smoke testing reward changes.
uv run scripts/train.py --env vbot_navigation_section001 --train-backend torch --max-env-steps 200000 --check-point-interval 50

# === 2M PROBE (see actual learning trends) ===
# ~4 min. Enough to confirm reward gradient is alive.
uv run scripts/train.py --env vbot_navigation_section001 --train-backend torch --max-env-steps 2000000 --check-point-interval 100

# === SINGLE TRAINING RUN (full) ===
uv run scripts/train.py --env vbot_navigation_section001 --train-backend torch

# === AUTOML / HP + REWARD SEARCH (recommended for optimization) ===
uv run starter_kit_schedule/scripts/automl.py `
    --mode stage `
    --budget-hours 8 `
    --hp-trials 15

# === MONITOR AUTOML STATE ===
Get-Content starter_kit_schedule/progress/automl_state.yaml

# === PROGRESS WATCHER (generates WAKE_UP.md for agent context) ===
uv run starter_kit_schedule/scripts/progress_watcher.py --snapshot
uv run starter_kit_schedule/scripts/progress_watcher.py --watch --interval 120

# === MANUAL TRAINING WITH RENDERING ===
uv run scripts/train.py --env vbot_navigation_section001 --train-backend torch --render

# === EVALUATE A CHECKPOINT ===
uv run scripts/play.py --env vbot_navigation_section001

# === TENSORBOARD ===
uv run tensorboard --logdir runs/vbot_navigation_section001

Hardware Performance (measured)

Metric	Value
Backend	PyTorch (JAX NOT available)
GPU	NVIDIA (torch_gpu=True)
Training speed	~7,500-12,500 steps/sec
200K steps	~16s
2M steps	~4 min
5M steps (HP trial)	~7-8 min
50M steps (full run)	~70 min
100M steps	~2.2 hours

Critical Reward Function Knowledge

These issues were root-caused and fixed. Do NOT revert:

Problem	Root Cause	Fix
Robot never reaches target	`exp(-12.6/0.5)≈0` — dead gradient at far distance	Widened sigma to 5.0: `exp(-d/5.0)`
Robot learns "don't move"	`termination=-200` dominates per-step reward (~2-5)	Reduced to `-50`
No exploration	`entropy_loss_scale=0.0` in base PPOCfg	Set to `0.005` in VBotSection001PPOConfig
Only fixed 12.6m target	`pose_command_range=[0,10.2,0,0,10.2,0]` — no randomization	Randomized: `[-3,3,-π,3,12,π]`
Hardcoded reward scales	`approach_scale`, `arrival_bonus` etc not in scales dict	Moved all to `RewardConfig.scales`
No gradient at distance	Only exponential decay reward	Added `distance_progress` linear reward (1-d/d_max)
Robot falls and stops learning	Only penalties, no incentive to stay alive	Added `alive_bonus=0.5` per step

Skill Routing

Task	Skill
Multi-stage curriculum	→ `curriculum-learning`
Hyperparameter search	→ `hyperparameter-optimization`
Reward/penalty tuning	→ `reward-penalty-engineering`
Training execution	→ `training-campaign`
Competition strategy	→ `quadruped-competition-tutor`
Robot model configuration	→ `mjcf-xml-reasoning`
Visual debugging	→ `subagent-copilot-cli`

Key Pipeline Scripts

Script	Purpose
`scripts/train.py`	Single training run
`scripts/play.py`	Evaluate / play a checkpoint
`scripts/view.py`	View environment (no training)
`starter_kit_schedule/scripts/automl.py`	AutoML entry point — HP search, reward search, curriculum
`starter_kit_schedule/scripts/train_one.py`	Single trial subprocess (called by automl)
`starter_kit_schedule/scripts/evaluate.py`	Read TensorBoard logs for metrics
`starter_kit_schedule/scripts/progress_watcher.py`	Progress monitoring + WAKE_UP.md generation
`starter_kit_schedule/scripts/analyze.py`	Compare experiment results
`starter_kit_schedule/scripts/status.py`	Check training status

When to Read Each Skill

Question	Read
"How do I progress from flat to obstacles?"	`curriculum-learning`
"What learning rate should I use?"	`hyperparameter-optimization`
"Why is my robot bouncing?"	`reward-penalty-engineering` → diagnose phase
"How do I resume interrupted training?"	`training-campaign`
"What reward ideas have we already tried?"	`starter_kit_schedule/reward_library/`
"How do I systematically test a reward idea?"	`reward-penalty-engineering`
"How do I analyze training screenshots?"	`subagent-copilot-cli`
"Where are the robot's feet?"	`mjcf-xml-reasoning`
"What are the important competition rules?"	`quadruped-competition-tutor`
"Launch automl / HP search"	This skill — use Quick Start above

Known Issues (Already Fixed)

These issues have been resolved. Do NOT re-investigate or re-fix them:

Issue	Fix Applied	Location
numpy int64/float64 not JSON serializable	Added `_NumpyEncoder` class + `sample_from_space()` returns native Python types	`automl.py`
Import order: `@rlcfg` fails if env not registered	`train_one.py` imports `vbot` before `motrix_rl`	`train_one.py`
Dead reward gradient at distance	Changed sigma from 0.5 to 5.0, added `distance_progress` linear reward	`vbot_section001_np.py`
Zero entropy prevents exploration	Set `entropy_loss_scale=0.005`	`motrix_rl/cfgs.py`
Termination penalty too harsh	Reduced from -200 to -50, dominates per-step reward	`cfg.py`
Hardcoded reward scales	Moved `approach_scale`, `arrival_bonus`, `stop_scale`, `zero_ang_bonus` into RewardConfig.scales	`cfg.py` + `vbot_section001_np.py`
AutoML no TensorBoard data	`check_point_interval` not passed to rl_overrides. Now auto-calculated to ensure ≥5 data points	`automl.py`
Wrong TensorBoard tag for evaluation	Tag is `Reward / Instantaneous reward (mean)` not `Total reward`	`evaluate.py`
Fixed training target	`pose_command_range` was [0,10.2,0,0,10.2,0] (no randomization)	`cfg.py`

Managed Directories

Directory	Purpose
`starter_kit_schedule/`	Plans, configs, progress, automl state
`starter_kit_schedule/reward_library/`	Archived reward/penalty components & configs
`starter_kit_schedule/progress/`	AutoML state, run tracking
`starter_kit_log/`	Experiment logs, metrics
`runs/`	Training outputs (checkpoints, TensorBoard logs)

code