Optimize Performance Skill
Optimize ML-Agents training and inference for maximum efficiency.
When to Use
- Training is too slow
- Running out of GPU memory
- Need faster iteration cycles
- Deploying to resource-constrained devices
- Scaling to many parallel environments
Quick Wins
1. Enable GPU Training
bash
# Verify that a GPU is available
python -c "import torch; print(torch.cuda.is_available())"
# Should print: True

# Training automatically uses the GPU if available
mlagents-learn config/ppo/3DBall.yaml --run-id=gpu_test
2. Increase Parallel Environments
bash
# Train with multiple environments in parallel
# (--num-envs > 1 requires a built executable passed via --env)
mlagents-learn config/ppo/3DBall.yaml --run-id=parallel --env=./builds/3DBall --num-envs=8
# More environments = faster data collection,
# but higher CPU and memory usage
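Since each extra environment costs CPU and memory, a reasonable starting value for `--num-envs` can be derived from the core count. The sketch below is a hypothetical heuristic, not part of ML-Agents: `suggest_num_envs`, the one reserved core, and the cap of 8 are all assumptions to tune for your machine.

```python
import os


def suggest_num_envs(cpu_count=None, reserve=1, cap=8):
    """Suggest a starting --num-envs value (hypothetical heuristic).

    Leaves `reserve` cores for the trainer process and never exceeds `cap`.
    """
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, min(cap, cpu_count - reserve))


if __name__ == "__main__":
    print(f"--num-envs={suggest_num_envs()}")
```

Treat the result as a first guess and adjust it after watching CPU and memory usage during a short run.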
3. Adjust Time Horizon
yaml
# In config.yaml:
behaviors:
  MyBehavior:
    time_horizon: 128 # Increase from 64
    # Larger horizon = more steps collected per agent before they are added to the buffer
    # Less biased but higher-variance return estimates; potentially less stable
Training Optimization
Hyperparameter Tuning
yaml
behaviors:
  MyBehavior:
    hyperparameters:
      # Larger batches = better GPU utilization
      batch_size: 2048 # Up from 1024
      buffer_size: 20480 # 10x batch_size
      # Fewer epochs = faster updates
      num_epoch: 3 # Down from 5
      # Larger learning rate = faster convergence (if stable)
      learning_rate: 5.0e-4 # Up from 3.0e-4
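The effect of these settings on update cost can be checked with simple arithmetic: each PPO policy update makes `num_epoch` passes over the buffer in minibatches of `batch_size`. The sketch below counts the resulting gradient steps. The helper name `gradient_steps_per_update` is hypothetical, and the "before" buffer size of 10240 is an assumption (10x the old batch size).

```python
def gradient_steps_per_update(batch_size, buffer_size, num_epoch):
    """Minibatch gradient steps performed for one full buffer of experience."""
    if buffer_size % batch_size != 0:
        raise ValueError("buffer_size should be a multiple of batch_size")
    return (buffer_size // batch_size) * num_epoch


# Assumed "before" values: batch 1024, buffer 10240, 5 epochs
before = gradient_steps_per_update(1024, 10240, 5)
# Values from the config above: batch 2048, buffer 20480, 3 epochs
after = gradient_steps_per_update(2048, 20480, 3)

print(before, after)  # 50 30
```

Fewer, larger gradient steps per update generally suit a GPU better, provided learning stays stable.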
Network Architecture
yaml
network_settings:
  # Smaller networks train faster
  hidden_units: 64 # Down from 128 or 256
  num_layers: 2 # Down from 3
  # But may reduce learning capacity
  # Balance speed vs. performance
Summary Frequency
yaml
behaviors:
  MyBehavior:
    summary_freq: 50000 # Up from 10000
    # Less frequent summaries = faster training
    # But less granular monitoring
Memory Optimization
GPU Memory
yaml
# Reduce memory usage:
hyperparameters:
  batch_size: 512 # Down from 1024
  buffer_size: 5120 # 10x batch_size

# Train fewer parallel environments
# (num_envs belongs under top-level env_settings, or use --num-envs on the CLI)
env_settings:
  num_envs: 2 # Down from 4 or 8
Buffer Management
yaml
behaviors:
  MyBehavior:
    max_steps: 1000000
    # Smaller buffer = less memory
    hyperparameters:
      buffer_size: 2048 # Keep it a multiple of batch_size, at least 2x
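To judge how much a smaller buffer actually saves, a rough estimate of its raw storage helps. The sketch below assumes a simplified per-step layout (observation, action, plus reward, value estimate, and done flag, all stored as 32-bit floats). That layout and the helper `estimate_buffer_mib` are illustrative assumptions, not the actual ML-Agents buffer format, and the result excludes network weights and activations.

```python
def estimate_buffer_mib(buffer_size, obs_size, action_size, bytes_per_value=4):
    """Rough lower bound on experience-buffer storage in MiB (assumed layout)."""
    # Assumed per-step contents: observation + action + reward + value + done flag
    values_per_step = obs_size + action_size + 3
    return buffer_size * values_per_step * bytes_per_value / (1024 ** 2)


# Small vector observation (8 floats) vs. an 84x84 RGB visual observation
vector_mib = estimate_buffer_mib(2048, obs_size=8, action_size=2)
visual_mib = estimate_buffer_mib(2048, obs_size=84 * 84 * 3, action_size=2)

print(f"vector obs: {vector_mib:.2f} MiB, visual obs: {visual_mib:.1f} MiB")
```

Under these assumptions, vector-observation buffers are small, so shrinking `buffer_size` matters most when observations are visual.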
Inference Optimization
Model Compression
yaml
# Create a smaller inference model (requires retraining with these settings):
network_settings:
  hidden_units: 32 # Minimal network
  num_layers: 2
  normalize: true # Keep normalization
Inference Settings in Unity
csharp
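// In Unity Behavior Parameters:
// - Model: Your .onnx model
// - Inference Device: GPU (if available)
// - Behavior Type: Inference Only
// Optimize decision frequency
using Unity.MLAgents;

public class OptimizedAgent : Agent
{
    public override void Initialize()
    {
        // DecisionPeriod is an instance field on the DecisionRequester component
        // (also editable in the Inspector); higher values mean fewer decisions.
        var requester = GetComponent<DecisionRequester>();
        if (requester != null)
        {
            requester.DecisionPeriod = 10; // Every 10 Academy steps (component default: 5)
        }
    }
}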
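The inference cost of a given decision period can be estimated with a quick calculation. The sketch below assumes Unity's default Fixed Timestep of 0.02 s and one Academy step per fixed update. The helper `decisions_per_second` is hypothetical and only illustrates the arithmetic.

```python
def decisions_per_second(decision_period, fixed_timestep=0.02):
    """Policy decisions (model forward passes) per agent per second of game time."""
    academy_steps_per_second = 1.0 / fixed_timestep
    return academy_steps_per_second / decision_period


for period in (1, 5, 10):
    print(f"DecisionPeriod={period}: {decisions_per_second(period):.0f} decisions/sec")
```

Multiplying the result by the number of agents gives a rough count of forward passes per second that the target device has to sustain.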
Profiling
Python Profiling
bash
# Profile training loop
python -m cProfile -o profile.stats \
-m mlagents.trainers.learn \
config.yaml --run-id=profile
# View results
python -c "import pstats; p=pstats.Stats('profile.stats'); p.sort_stats('cumulative').print_stats(20)"
GPU Profiling
bash
# Monitor GPU usage in real time
watch -n 1 nvidia-smi
# Look for:
# - GPU utilization > 80% (good)
# - Memory usage (shouldn't hit the limit)
# - Temperature (should be reasonable)
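For logging these numbers instead of watching them, `nvidia-smi` can emit machine-readable CSV. The sketch below parses that output. The function names and dictionary keys are illustrative, and it assumes every queried field is numeric (some GPUs report `[N/A]`, which this sketch does not handle).

```python
import subprocess

# Fields passed to nvidia-smi's --query-gpu option
QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_csv(text):
    """Parse output of `nvidia-smi --query-gpu=... --format=csv,noheader,nounits`."""
    rows = []
    for line in text.strip().splitlines():
        util, used, total = (float(value) for value in line.split(","))
        rows.append({"util_pct": util, "mem_used_mib": used, "mem_total_mib": total})
    return rows


def read_gpus():
    """Run nvidia-smi once and return one dict per GPU (requires an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)
```

Calling `read_gpus()` in a loop with a short sleep gives a simple utilization log to compare before and after a change.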
TensorBoard Profiling
python
# Enable profiling in custom training (advanced)
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    # Training code here
    pass

prof.export_chrome_trace("trace.json")
# View in chrome://tracing
Performance Benchmarks
Baseline Performance
bash
# Test training speed
# (set max_steps: 10000 in the config to keep the run short;
# max_steps is a config field, not a CLI flag)
time mlagents-learn config/ppo/3DBall.yaml \
  --run-id=speed_test \
  --env=./builds/3DBall
# Note the time taken
# Compare after optimizations
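The before/after comparison can also be scripted. The following is a minimal sketch using only the standard library. `time_command` and `speedup` are hypothetical helpers, and the command passed in is whatever baseline run you use.

```python
import subprocess
import time


def time_command(cmd):
    """Run a command to completion and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


def speedup(baseline_seconds, optimized_seconds):
    """How many times faster the optimized run is (values above 1.0 are an improvement)."""
    if optimized_seconds <= 0:
        raise ValueError("optimized_seconds must be positive")
    return baseline_seconds / optimized_seconds
```

For example, `time_command(["mlagents-learn", "config/ppo/3DBall.yaml", "--run-id=speed_test", "--env=./builds/3DBall"])` would time one run; repeating each measurement a few times reduces noise from startup costs.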
Measure FPS
bash
# Check environment throughput (steps per second)
# Divide the step count by the elapsed time reported in the training output
# Target: > 10K steps/sec for simple environments
# Complex environments: > 1K steps/sec
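Throughput can be computed from the training log rather than read by eye. The sketch below assumes summary lines containing `Step: <n>. Time Elapsed: <t> s`. That format is an assumption to check against your ML-Agents version, and the sample lines are made up for illustration.

```python
import re

# Assumed summary-line format; adjust the pattern if your output differs
LOG_PATTERN = re.compile(r"Step: (\d+)\. Time Elapsed: ([\d.]+) s")


def steps_per_second(log_lines):
    """Average steps/sec between the first and last matching summary lines."""
    samples = []
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if match:
            samples.append((int(match.group(1)), float(match.group(2))))
    if len(samples) < 2:
        raise ValueError("need at least two summary lines")
    (first_step, first_time), (last_step, last_time) = samples[0], samples[-1]
    if last_time <= first_time:
        raise ValueError("elapsed time must increase between samples")
    return (last_step - first_step) / (last_time - first_time)


# Made-up sample lines in the assumed format
sample = [
    "[INFO] 3DBall. Step: 12000. Time Elapsed: 20.0 s. Mean Reward: 1.2.",
    "[INFO] 3DBall. Step: 24000. Time Elapsed: 35.0 s. Mean Reward: 4.8.",
]
print(steps_per_second(sample))  # 800.0
```

Measuring between two summaries, rather than from time zero, keeps environment startup time out of the figure.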
Optimization Checklist
- GPU enabled and utilized
- Parallel environments configured (4-8 typical)
- Batch size optimized for GPU (1024-4096)
- Network architecture appropriate for task
- Time horizon balanced (64-128)
- Summary frequency reduced for production
- Memory usage within limits
- Decision frequency optimized in Unity
- Model size appropriate for deployment
Common Bottlenecks
CPU Bottleneck
Symptoms:
- Low GPU utilization
- High CPU usage
- Slow environment steps
Solutions:
- Increase num_envs (more parallel environments)
- Simplify environment logic
- Use compiled/optimized Unity builds
GPU Bottleneck
Symptoms:
- 100% GPU utilization
- Slow training despite fast environment
Solutions:
- Reduce batch_size or network size
- Enable mixed precision training (advanced)
- Upgrade GPU if possible
Memory Bottleneck
Symptoms:
- Training crashes with OOM
- System freezes
- Slow disk swapping
Solutions:
- Reduce buffer_size and batch_size
- Train fewer parallel environments
- Close other applications
- Use smaller network architecture
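The symptoms listed for the three bottlenecks can be sketched as a simple triage function. The 80% utilization threshold echoes the GPU profiling guidance. The 90% memory limit and the function `classify_bottleneck` itself are assumptions for illustration, not measured cutoffs.

```python
def classify_bottleneck(gpu_util_pct, cpu_util_pct, mem_used_frac,
                        high_util=80.0, mem_limit=0.9):
    """Map utilization readings to the most likely bottleneck (assumed thresholds)."""
    if mem_used_frac >= mem_limit:
        return "memory"  # close to OOM: shrink buffer/batch or use fewer environments
    if gpu_util_pct >= high_util:
        return "gpu"     # GPU saturated: reduce batch_size or network size
    if cpu_util_pct >= high_util:
        return "cpu"     # GPU idle while CPU is busy: environment stepping is slow
    return "none"


print(classify_bottleneck(gpu_util_pct=30, cpu_util_pct=95, mem_used_frac=0.5))   # cpu
print(classify_bottleneck(gpu_util_pct=100, cpu_util_pct=40, mem_used_frac=0.5))  # gpu
```

Memory is checked first because an out-of-memory crash ends the run regardless of how busy the CPU or GPU is.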
Advanced Optimizations
Mixed Precision Training
python
# In custom training code (advanced)
# compute_loss() and optimizer are placeholders for your own training code
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

optimizer.zero_grad()
with autocast():
    # Forward pass in FP16 where safe
    loss = compute_loss()
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
Distributed Training
bash
# Train across multiple GPUs (advanced)
# Requires custom setup
torchrun --nproc_per_node=4 \
  -m mlagents.trainers.learn \
  config.yaml --run-id=distributed
Monitoring Performance
bash
# Real-time monitoring
tensorboard --logdir=results --reload_interval=5
# Key metrics:
# - Steps per second: compare the step axis against wall time (Horizontal Axis: Relative or Wall)
# - Losses/Policy Loss: Should stabilize quickly
# - Policy/Learning Rate: Verify schedule
Expected Performance
| Environment | Steps/sec | GPU Usage | Time to Train |
|---|---|---|---|
| 3DBall | 20K-50K | 20-40% | 2-5 min |
| GridWorld | 15K-30K | 30-50% | 5-10 min |
| Hallway | 10K-20K | 40-60% | 10-20 min |
| SoccerTwos | 5K-10K | 60-80% | 30-60 min |
Rough figures, assuming GPU training with 4-8 parallel environments.
Related Skills
- train-ml-agent - Apply optimizations during training
- debug-training - Diagnose performance issues
- export-models - Create optimized inference models