Purpose
Guide competitors to top rankings in MotrixArena S1 quadruped navigation challenge:
- •Quadruped fundamentals - VBot 12-DOF robot architecture, gait patterns, stability
- •RL for robotics - PPO training, observation design, action spaces
- •Reward engineering - Sigmoid distance, collision detection, checkpoint training
- •Terrain strategies - Stairs, waves, rolling obstacles, celebration zones
- •Score optimization - Maximize points through bonus zones and efficient traversal
- •Submission preparation - Code, weights, video, technical documentation
Competition Overview: MotrixArena S1
Two-Phase Structure
| Phase | Terrain | Max Points | Time Limit |
|---|---|---|---|
| Stage 1 (navigation1) | Flat ground | 20 pts | Per-dog limit |
| Stage 2 (navigation2) | Obstacle course | 105 pts | Episode limits |
Scoring Summary
Stage 1 (20 pts total): └── 10 navigation dogs × 2 pts each Stage 2 Section 1 (20 pts): ├── Base traverse: ~5 pts ├── Smiley circles: 2×2 + 2×4 = 12 pts ├── Red packets: 3×2 = 6 pts └── Celebration dance: 2 pts Stage 2 Section 2 (60 pts): ├── Wave terrain: 8-12 pts ├── Stairs: 15-20 pts ├── Bridge/Riverbed: 10-15 pts └── Red packets: 6-12 pts (scattered) Stage 2 Section 3 (25 pts): ├── Rolling balls: 10-15 pts (dynamic avoidance) ├── Random terrain: 5 pts └── Final celebration: 5 pts
Competition Environment IDs
| Environment ID | Terrain | Training Focus |
|---|---|---|
vbot_navigation_section001 | Flat ground | Basic locomotion, Stage 1 (navigation1) |
vbot_navigation_section01 | Section 1 challenges | Bonus collection (navigation2) |
vbot_navigation_section02 | Section 2 terrain | Terrain traversal (navigation2) |
vbot_navigation_section03 | Section 3 | Rolling balls, finale (navigation2) |
vbot_navigation_stairs | Stairs + platforms | Stair climbing (navigation2) |
vbot_navigation_long_course | Full 30m course | End-to-end training (navigation2) |
VBot Robot Architecture
12-DOF Quadruped Structure
VBot Kinematic Tree: base (floating, 6 DOF via freejoint) ├── FR_hip → FR_thigh → FR_calf (3 DOF) ├── FL_hip → FL_thigh → FL_calf (3 DOF) ├── RR_hip → RR_thigh → RR_calf (3 DOF) └── RL_hip → RL_thigh → RL_calf (3 DOF)
Joint Configuration
| Joint Group | Name Pattern | Range (rad) | Function |
|---|---|---|---|
| Hip Abduction/Adduction | *_hip_joint | ±0.6 ~ ±1.0 | Lateral stability |
| Hip Flexion/Extension | *_thigh_joint | 0.5 ~ 1.2 | Forward stride |
| Knee Flexion | *_calf_joint | -2.5 ~ -1.2 | Ground clearance |
Default Standing Pose (radians):
default_joint_angles = {
"FR_hip_joint": 0.0, "FR_thigh_joint": 0.9, "FR_calf_joint": -1.8,
"FL_hip_joint": 0.0, "FL_thigh_joint": 0.9, "FL_calf_joint": -1.8,
"RR_hip_joint": 0.0, "RR_thigh_joint": 0.9, "RR_calf_joint": -1.8,
"RL_hip_joint": 0.0, "RL_thigh_joint": 0.9, "RL_calf_joint": -1.8,
}
Actuator Configuration
| Parameter | Value | Purpose |
|---|---|---|
| Control Mode | Position Servo (PD) | Stable joint tracking |
| kp (stiffness) | 80.0 N·m/rad | Position gain |
| kv (damping) | 6.0 N·m·s/rad | Velocity damping |
| action_scale | 0.25 | Scale [-1,1] actions to radians |
| Torque Limits | Hip/Thigh: ±17 N·m, Calf: ±34 N·m | From XML forcerange |
Observation Space (54 dimensions)
obs = [
base_linear_velocity, # 3D, normalized by 2.0
base_angular_velocity, # 3D (gyro), normalized by 0.25
projected_gravity, # 3D, gravity in body frame
joint_positions_relative, # 12D, normalized by 1.0
joint_velocities, # 12D, normalized by 0.05
last_actions, # 12D
velocity_commands, # 3D (vx, vy, yaw_rate)
position_error, # 2D (to target)
heading_error, # 1D
distance_normalized, # 1D
reached_flag, # 1D
stop_ready_flag, # 1D
]
Action Space (12 dimensions)
Output: 12 joint position offsets in range [-1, 1], scaled by action_scale:
target_position = default_angles + action * action_scale torque = kp * (target - current_pos) - kv * current_vel
Training Infrastructure
Campaign management, checkpoints, and monitoring: See
training-campaignskill.
Commands
# === PREFERRED: AutoML pipeline (handles HP search automatically) ===
uv run starter_kit_schedule/scripts/automl.py `
--mode stage `
--budget-hours 12 `
--hp-trials 8
# Basic training (2048 parallel envs)
uv run scripts/train.py --env vbot_navigation_section001
# With rendering (slower but visual feedback)
uv run scripts/train.py --env vbot_navigation_section001 --render
# PyTorch backend (Windows recommended)
uv run scripts/train.py --env vbot_navigation_stairs --train-backend torch
# View trained policy
uv run scripts/play.py --env vbot_navigation_section001
PPO Hyperparameters (VBotSection001PPOConfig)
Tuning and search: See
hyperparameter-optimizationskill.
@dataclass
class VBotSection001PPOConfig(PPOCfg):
policy_hidden_layer_sizes: tuple = (256, 128, 64)
value_hidden_layer_sizes: tuple = (256, 128, 64)
rollouts: int = 24 # Steps before update
learning_epochs: int = 8 # PPO epochs per update
mini_batches: int = 4 # Batches per epoch
discount_factor: float = 0.99 # Gamma
lambda_param: float = 0.95 # GAE lambda
learning_rate: float = 3e-4
learning_rate_scheduler_kl_threshold: float = 0.008
ratio_clip: float = 0.2 # PPO clip
value_clip: float = 0.2
entropy_loss_scale: float = 0.0 # No entropy bonus
value_loss_scale: float = 2.0
grad_norm_clip: float = 1.0
Training Timeline Expectations
| Environment | Convergence | Reward Range | Notes |
|---|---|---|---|
| Flat navigation | 500K steps | 20-40 | Fast, stable |
| Stairs | 2-5M steps | 15-30 | Medium difficulty |
| Section 001-013 | 5-10M steps | Variable | Curriculum helps |
| Long course | 10-20M steps | 50-80 | Requires staged training |
Reward Function Engineering
Current implementation:
starter_kit/navigation*/vbot/vbot_*_np.py→_compute_reward(). Reward weights:starter_kit/navigation*/vbot/cfg.py→RewardConfig.scalesdict. Exploration methodology (how to find/test/archive rewards): Seereward-penalty-engineeringskill. Archived configurations: Seestarter_kit_schedule/reward_library/.
Core Reward Principles
- •Dense rewards - Provide continuous feedback, not just goal completion
- •Balance exploration vs exploitation - Entropy bonus in early training
- •Avoid reward hacking - Test that high rewards = desired behavior
- •Smooth gradients - Sigmoid/exponential better than step functions
Navigation Task Reward Structure
reward_scales = {
# === Primary Navigation ===
"position_tracking": 2.0, # Exponential decay with distance
"fine_position_tracking": 2.0, # Extra reward when close
"heading_tracking": 1.0, # Face movement direction
"forward_velocity": 0.5, # Encourage speed toward goal
# === Stability Penalties ===
"orientation": -0.05, # Penalize body tilt
"lin_vel_z": -0.5, # Penalize vertical bounce
"ang_vel_xy": -0.05, # Penalize body rotation
"torques": -1e-5, # Energy efficiency
"dof_vel": -5e-5, # Smooth joint motion
"dof_acc": -2.5e-7, # Jerk reduction
"action_rate": -0.01, # Smooth policy
# === Terminal States ===
"termination": -200.0, # Body-ground collision
}
Advanced Reward Techniques
1. Sigmoid Distance Reward
def sigmoid_distance_reward(distance, distance_scale=5.0):
"""
Smoother than linear, provides gradient even far from goal.
R = 1 / (1 + exp(0.5 * distance / scale))
"""
distance_ratio = distance / distance_scale
return 1.0 / (1.0 + np.exp(0.5 * distance_ratio))
Comparison:
| Distance | Linear (1-d/5) | Sigmoid |
|---|---|---|
| 0m | 1.0 | 0.5 |
| 2m | 0.6 | 0.45 |
| 5m | 0.0 | 0.38 |
| 10m | -1.0 (clipped) | 0.27 |
2. Swing Time Reward (Gait Quality)
def swing_time_reward(foot_contacts, dt=0.01):
"""
Encourage natural gait rhythm (0.3-0.7s swing, optimal 0.5s).
Track time since last contact per foot.
"""
for foot in ['FR', 'FL', 'RR', 'RL']:
if foot_contacts[foot]:
swing_time = time_since_contact[foot]
if 0.3 <= swing_time <= 0.7:
reward += gaussian(swing_time, mean=0.5, std=0.1) * 0.2
time_since_contact[foot] = 0
else:
time_since_contact[foot] += dt
return reward
3. Collision Detection Reward
def collision_penalty(contact_forces, foot_names=['FR', 'FL', 'RR', 'RL']):
"""
Detect non-foot collisions using force sensor threshold.
F > 5 * |Fz| suggests horizontal collision (not standing).
"""
for geom, force in contact_forces.items():
if geom not in foot_names:
Fxy = np.linalg.norm(force[:2])
Fz = abs(force[2])
if Fxy > 5 * Fz: # Significant horizontal force
return -5.0 # Collision penalty
return 0.0
4. Checkpoint Training
def checkpoint_curriculum(checkpoints, robot_pos, reached_set):
"""
Incremental rewards for reaching waypoints.
Stage 2 Section 2 has ~8 checkpoints along the course.
"""
reward = 0.0
checkpoint_rewards = [5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5] # Increasing
for i, (cp_pos, cp_radius) in enumerate(checkpoints):
if i not in reached_set:
dist = np.linalg.norm(robot_pos[:2] - cp_pos)
if dist < cp_radius:
reward += checkpoint_rewards[i]
reached_set.add(i)
return reward
5. Anti-Stagnation Mechanism
def stagnation_penalty(position_history, threshold=0.1, window=10):
"""
Penalize if robot moves < 0.1m in 10 control cycles.
Progressive penalty increases with stagnation duration.
"""
if len(position_history) < window:
return 0.0
displacement = np.linalg.norm(position_history[-1] - position_history[-window])
if displacement < threshold:
stagnation_count += 1
return -0.1 * min(stagnation_count, 10) # Cap at -1.0
else:
stagnation_count = 0
return 0.0
6. Roll/Pitch Safety Boundary
def orientation_safety(roll, pitch, limit_deg=60):
"""
Terminate or heavily penalize dangerous tilts.
"""
limit_rad = np.deg2rad(limit_deg)
if abs(roll) > limit_rad or abs(pitch) > limit_rad:
return -10.0 # Emergency penalty
return 0.0
Terrain Traversal Strategies
Wave Terrain (Section 2A)
Challenges: Undulating surface, momentum maintenance, foot placement timing
Strategies:
- •Initial stabilization - Train standing on waves before walking
- •Adaptive stride length - Shorter steps on downslopes, longer on upslopes
- •COM (Center of Mass) control - Keep body low during transitions
- •Velocity modulation - Slow down approaching wave peaks
Reward additions:
# Height variance penalty (encourage smooth traversal)
height_variance = np.var(foot_heights)
reward -= 0.1 * height_variance
# Forward progress bonus on waves
if terrain_type == "wave" and forward_vel > 0.3:
reward += 0.5
Stairs (Section 2B)
Challenges: Step height clearance, balance during ascent/descent, edge detection
Strategies:
- •Slope adaptation - Detect inclination, adjust body pitch
- •Foot edge distance - Train to avoid step edges (slip risk)
- •Dynamics compensation - Higher knee lift during ascent
- •Slower velocity - Prioritize stability over speed
Reward additions:
# Knee lift bonus during stair ascent
if terrain_gradient > 0.1: # Ascending
knee_lift = -joint_pos['calf'] # More negative = more bent
reward += 0.2 * max(0, knee_lift - 1.5)
# Penalize foot slip (large tangential forces on steps)
if stair_contact and tangential_force > threshold:
reward -= 0.5
Rolling Balls (Section 3)
Challenges: Dynamic obstacles, trajectory prediction, reactive avoidance
Strategies:
- •Peripheral awareness - Include ball positions in observation
- •Conservative paths - Prefer edges over center
- •Pause & proceed - Wait for safe window if needed
- •Recovery training - Train to recover from ball impacts
Observation addition:
# Add ball observations (if visible) ball_positions = get_ball_positions() # [N, 3] relative positions ball_velocities = get_ball_velocities() # [N, 3] obs = np.concatenate([obs, ball_positions.flatten(), ball_velocities.flatten()])
Celebration Zones
Bonus points for:
- •Smiley circles: Stand still inside for 1-2 seconds
- •Red packets: Touch/pass through
- •Final dance: Execute specific motion pattern
Celebration reward:
def celebration_reward(robot_pos, celebration_zones, gyro):
"""
+2 to +5 pts for successful celebrations
"""
for zone_pos, zone_radius, zone_type in celebration_zones:
if np.linalg.norm(robot_pos[:2] - zone_pos) < zone_radius:
if zone_type == "smiley":
# Need to be stable (low rotation rate)
if np.linalg.norm(gyro) < 0.5:
return 2.0 if small_smiley else 4.0
elif zone_type == "red_packet":
return 2.0 # Instant bonus
return 0.0
Curriculum Strategy
Full curriculum plans, warm-start strategies, and promotion criteria: See
curriculum-learningskill. Reward exploration methodology: Seereward-penalty-engineeringskill.
Stage Overview
Stage 1: Flat Ground → Stage 2A: Waves → Stage 2B: Stairs → Stage 2C: Balls → Full Course
(500K) (2M) (3M) (2M) (5M)
Each stage warm-starts from the previous best checkpoint with reduced LR (0.3-0.5×).
Checkpoint Loading
# Load pretrained policy for curriculum
trainer = ppo.Trainer(
env_name="vbot_navigation_stairs",
sim_backend=None,
cfg_override={
"num_envs": 2048,
"checkpoint_path": "runs/vbot_navigation_section001/checkpoints/agent.pt"
}
)
trainer.train()
Scoring Optimization Tactics
Stage 1: Flat Ground (20 pts)
Target: 10/10 dogs, 2 pts each = 20 pts
Key metrics:
- •Distance traveled toward each target
- •Time to reach (faster = better tiebreaker)
- •No falls
Optimization:
- •Train to 100% success rate on flat
- •Maximize velocity while maintaining stability
- •Use
vbot_navigation_section001with tight position tolerance (0.3m)
Stage 2: Section 1 (20 pts)
Break down:
| Element | Points | Strategy |
|---|---|---|
| Base traverse | 5 | Just walk forward |
| Small smileys (×2) | 4 | Stop in circle, stay stable 1s |
| Large smileys (×2) | 8 | Stop in circle, stay stable 1s |
| Red packets (×3) | 6 | Slight detour, touch each |
Observation: Smileys require stopping, not just passing through!
Stage 2: Section 2 (60 pts)
Checkpoint strategy:
- •Place checkpoints every 2-3m along route
- •Assign increasing rewards (5→8 pts progression)
- •Track reached checkpoints in episode info
Time management:
- •Section 2 has 60 second episode limit
- •Average speed needed: 0.5 m/s for 30m course
- •Budget extra time for stairs (slowest segment)
Stage 2: Section 3 (25 pts)
Dynamic obstacles:
- •Ball positions change each episode
- •Need reactive policy, not memorized path
- •Include ball observations in training
Final celebration (5 pts):
- •Must execute specific motion at end zone
- •Train celebration as separate skill, then combine
Common Failure Modes & Fixes
| Failure | Symptom | Fix |
|---|---|---|
| Falls on flat ground | Body touches ground | Increase orientation penalty, add armature |
| Stuck on stair edge | No progress, foot scraping | Add foot clearance reward, knee lift bonus |
| Overshoots target | Oscillates around goal | Add velocity damping near target, fine tracking reward |
| Wobbly gait | Body rotation, unstable | Increase ang_vel_xy penalty, action smoothness |
| Slow convergence | Reward plateau | Check reward scaling, try curriculum |
| Ball collisions | Gets hit, falls | Add ball observations, train avoidance |
| Celebration failure | Moves through smileys | Add explicit stop reward, stability check |
Submission Checklist
Required Materials
- •
Code Repository
- •All training code
- •Environment configurations
- •Reward function implementations
- •
Policy Weights
- •Final checkpoint (
.ptor.safetensors) - •Include any preprocessing/normalization stats
- •Final checkpoint (
- •
Demo Video (≤3 min)
- •Show full course completion
- •Highlight challenging sections (stairs, balls)
- •Include timing overlay
- •
Technical Document (1-5 pages)
- •Architecture description
- •Reward function design rationale
- •Training curriculum
- •Key innovations
Pre-Submission Testing
# Test on evaluation environment uv run scripts/play.py --env vbot_navigation_section001 # Play long course uv run scripts/play.py --env vbot_navigation_long_course
Final Checks
- • No hardcoded positions or memorized trajectories
- • Handles position randomization
- • Recovers from perturbations
- • Completes course with no falls
- • More bonus zones collected
Quick Reference: Environment Configs
Modify Reward Scales
# In starter_kit/navigation1/vbot/cfg.py → RewardConfig class
@dataclass
class RewardConfig:
scales: dict[str, float] = field(
default_factory=lambda: {
"position_tracking": 2.0, # ← Adjust these
"termination": -200.0,
# ... add new rewards here
}
)
Add New Reward Terms
# In starter_kit/navigation1/vbot/vbot_section001_np.py → _compute_reward()
def _compute_reward(self, data, info, velocity_commands):
reward = np.zeros(self._num_envs)
# Existing rewards...
# Add custom reward
distance = np.linalg.norm(info["position_error"], axis=1)
reward += self._cfg.reward_config.scales.get("custom_distance", 0.0) * \
sigmoid_distance_reward(distance)
return reward
Extend Observation Space
# In vbot_section001_np.py
def __init__(self, cfg, num_envs=1):
# Change observation space size
self._observation_space = gym.spaces.Box(
low=-np.inf, high=np.inf,
shape=(54 + NEW_OBS_DIM,), # ← Add dimensions
dtype=np.float32
)
VBot XML Quick Reference
| Element | Location | Purpose |
|---|---|---|
| Robot model | xmls/vbot.xml (included) | Joint limits, masses |
| Contact params | <default class="foot"> | Friction, condim |
| Sensors | <sensor> | IMU, contact forces |
| Actuators | <actuator> | Joint servos |
Summary: Path to Top Ranking
- •Master flat ground first (Stage 1 = easy 20 pts)
- •Build curriculum for each terrain type
- •Engineer dense rewards with Sigmoid distance, checkpoints
- •Add safety penalties for orientation, collision
- •Include terrain-specific rewards (knee lift for stairs, etc.)
- •Test exhaustively before submission
- •Document innovations in technical report