AgentSkillsCN

training-pipeline

为 VBot 四足 RL 训练提供索引技能。通往课程学习、超参数优化、奖励/惩罚工程,以及训练计划管理等专业技能。

SKILL.md
--- frontmatter
name: training-pipeline
description: Index skill for VBot quadruped RL training. Routes to specialized skills for curriculum learning, hyperparameter optimization, reward/penalty engineering, and campaign management.

Purpose

Entry point for RL training tasks. Routes to specialized skills.

Quick Start — Just Run Training

powershell
# === FAST PROBE RUN (benchmark hardware + validate changes) ===
# 200K steps: ~16s. Good for smoke testing reward changes.
uv run scripts/train.py --env vbot_navigation_section001 --train-backend torch --max-env-steps 200000 --check-point-interval 50

# === 2M PROBE (see actual learning trends) ===
# ~4 min. Enough to confirm reward gradient is alive.
uv run scripts/train.py --env vbot_navigation_section001 --train-backend torch --max-env-steps 2000000 --check-point-interval 100

# === SINGLE TRAINING RUN (full) ===
uv run scripts/train.py --env vbot_navigation_section001 --train-backend torch

# === AUTOML / HP + REWARD SEARCH (recommended for optimization) ===
uv run starter_kit_schedule/scripts/automl.py `
    --mode stage `
    --budget-hours 8 `
    --hp-trials 15

# === MONITOR AUTOML STATE ===
Get-Content starter_kit_schedule/progress/automl_state.yaml

# === PROGRESS WATCHER (generates WAKE_UP.md for agent context) ===
uv run starter_kit_schedule/scripts/progress_watcher.py --snapshot
uv run starter_kit_schedule/scripts/progress_watcher.py --watch --interval 120

# === MANUAL TRAINING WITH RENDERING ===
uv run scripts/train.py --env vbot_navigation_section001 --train-backend torch --render

# === EVALUATE A CHECKPOINT ===
uv run scripts/play.py --env vbot_navigation_section001

# === TENSORBOARD ===
uv run tensorboard --logdir runs/vbot_navigation_section001

Hardware Performance (measured)

MetricValue
BackendPyTorch (JAX NOT available)
GPUNVIDIA (torch_gpu=True)
Training speed~7,500-12,500 steps/sec
200K steps~16s
2M steps~4 min
5M steps (HP trial)~7-8 min
50M steps (full run)~70 min
100M steps~2.2 hours

Critical Reward Function Knowledge

These issues were root-caused and fixed. Do NOT revert:

ProblemRoot CauseFix
Robot never reaches targetexp(-12.6/0.5)≈0 — dead gradient at far distanceWidened sigma to 5.0: exp(-d/5.0)
Robot learns "don't move"termination=-200 dominates per-step reward (~2-5)Reduced to -50
No explorationentropy_loss_scale=0.0 in base PPOCfgSet to 0.005 in VBotSection001PPOConfig
Only fixed 12.6m targetpose_command_range=[0,10.2,0,0,10.2,0] — no randomizationRandomized: [-3,3,-π,3,12,π]
Hardcoded reward scalesapproach_scale, arrival_bonus etc not in scales dictMoved all to RewardConfig.scales
No gradient at distanceOnly exponential decay rewardAdded distance_progress linear reward (1-d/d_max)
Robot falls and stops learningOnly penalties, no incentive to stay aliveAdded alive_bonus=0.5 per step

Skill Routing

TaskSkill
Multi-stage curriculumcurriculum-learning
Hyperparameter searchhyperparameter-optimization
Reward/penalty tuningreward-penalty-engineering
Training executiontraining-campaign
Competition strategyquadruped-competition-tutor
Robot model configurationmjcf-xml-reasoning
Visual debuggingsubagent-copilot-cli

Key Pipeline Scripts

ScriptPurpose
scripts/train.pySingle training run
scripts/play.pyEvaluate / play a checkpoint
scripts/view.pyView environment (no training)
starter_kit_schedule/scripts/automl.pyAutoML entry point — HP search, reward search, curriculum
starter_kit_schedule/scripts/train_one.pySingle trial subprocess (called by automl)
starter_kit_schedule/scripts/evaluate.pyRead TensorBoard logs for metrics
starter_kit_schedule/scripts/progress_watcher.pyProgress monitoring + WAKE_UP.md generation
starter_kit_schedule/scripts/analyze.pyCompare experiment results
starter_kit_schedule/scripts/status.pyCheck training status

When to Read Each Skill

QuestionRead
"How do I progress from flat to obstacles?"curriculum-learning
"What learning rate should I use?"hyperparameter-optimization
"Why is my robot bouncing?"reward-penalty-engineering → diagnose phase
"How do I resume interrupted training?"training-campaign
"What reward ideas have we already tried?"starter_kit_schedule/reward_library/
"How do I systematically test a reward idea?"reward-penalty-engineering
"How do I analyze training screenshots?"subagent-copilot-cli
"Where are the robot's feet?"mjcf-xml-reasoning
"What are the important competition rules?"quadruped-competition-tutor
"Launch automl / HP search"This skill — use Quick Start above

Known Issues (Already Fixed)

These issues have been resolved. Do NOT re-investigate or re-fix them:

IssueFix AppliedLocation
numpy int64/float64 not JSON serializableAdded _NumpyEncoder class + sample_from_space() returns native Python typesautoml.py
Import order: @rlcfg fails if env not registeredtrain_one.py imports vbot before motrix_rltrain_one.py
Dead reward gradient at distanceChanged sigma from 0.5 to 5.0, added distance_progress linear rewardvbot_section001_np.py
Zero entropy prevents explorationSet entropy_loss_scale=0.005motrix_rl/cfgs.py
Termination penalty too harshReduced from -200 to -50, dominates per-step rewardcfg.py
Hardcoded reward scalesMoved approach_scale, arrival_bonus, stop_scale, zero_ang_bonus into RewardConfig.scalescfg.py + vbot_section001_np.py
AutoML no TensorBoard datacheck_point_interval not passed to rl_overrides. Now auto-calculated to ensure ≥5 data pointsautoml.py
Wrong TensorBoard tag for evaluationTag is Reward / Instantaneous reward (mean) not Total rewardevaluate.py
Fixed training targetpose_command_range was [0,10.2,0,0,10.2,0] (no randomization)cfg.py

Managed Directories

DirectoryPurpose
starter_kit_schedule/Plans, configs, progress, automl state
starter_kit_schedule/reward_library/Archived reward/penalty components & configs
starter_kit_schedule/progress/AutoML state, run tracking
starter_kit_log/Experiment logs, metrics
runs/Training outputs (checkpoints, TensorBoard logs)
code