Reward Function HOLD Bias Fix (v2.5.0)
Experiment Overview
| Item | Details |
|---|---|
| Date | 2024-12-27 |
| Goal | Fix reward function that teaches model HOLD is optimal |
| Environment | GPU-native PPO training, vectorized_env.py |
| Status | Success |
Context
Training on BTCUSD showed the model was not finding trading opportunities over a 2-year period. Investigation revealed that the reward function had a hidden HOLD bias.
Root Cause: Asymmetric payoff structure made HOLD the rational choice:
| Action | Reward If Correct | Reward If Wrong | Expected Value |
|---|---|---|---|
| HOLD | 0 | 0 | 0 (safe) |
| BUY/SELL | +0.35 to +0.90 | -0.52 (painful) | Negative after costs |
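To make the asymmetry concrete, the sketch below computes the win rate a trade needs before its expected reward beats HOLD's zero. It uses only the reward figures from the table; the function name is illustrative, and treating each trade as a simple win/lose outcome is a simplifying assumption that ignores the cost terms.

```python
def break_even_win_rate(win_reward: float, loss_penalty: float) -> float:
    """Win probability p at which p * win_reward - (1 - p) * loss_penalty == 0."""
    if win_reward <= 0 or loss_penalty <= 0:
        raise ValueError("win_reward and loss_penalty must be positive")
    return loss_penalty / (win_reward + loss_penalty)


LOSS_PENALTY = 0.52  # magnitude of the wrong-trade reward from the table

for win_reward in (0.35, 0.90):  # both ends of the correct-trade range
    p = break_even_win_rate(win_reward, LOSS_PENALTY)
    print(f"win reward {win_reward:+.2f}: break-even win rate {p:.1%}")
```

At the low end of the reward range the model would need to be right roughly 60% of the time just to match HOLD, before any slippage penalty is applied.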
The Math for BTCUSD:
- Slippage cost: 0.5% (crypto) per trade
- 5-bar horizon (5 hours at 1H timeframe)
- Bitcoin 5-hour volatility: ~0.3-0.5% average
- Expected move < slippage cost = negative EV for trading
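The arithmetic above can be checked with a few lines. This is a rough sketch, not the environment's reward code: it assumes the slippage cost is charged once per trade and that direction is predicted perfectly, so the result is an upper bound on the edge. The 0.2% figure is the reduced slippage cost adopted in this fix.

```python
def net_edge(expected_move: float, slippage_cost: float) -> float:
    """Expected absolute move minus trading cost, both as fractions of price."""
    return expected_move - slippage_cost


# 0.5% is the original crypto slippage, 0.2% the value used after the fix.
for slippage in (0.005, 0.002):
    for move in (0.003, 0.005):  # ends of the ~0.3-0.5% 5-hour volatility range
        edge = net_edge(move, slippage)
        print(f"slippage {slippage:.1%}, move {move:.1%}: net edge {edge:+.2%}")
```

With 0.5% slippage the best case is break-even, which is why never trading was the rational policy.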
Root Cause Analysis
| Issue | Problem | Impact |
|---|---|---|
| Slippage penalty too aggressive | 0.5% × 10 scaling = huge penalty | Model avoids trading entirely |
| Exploration bonus negligible | 0.01 × uncertainty = ~0.001 | No incentive to try trading |
| HOLD gets zero reward | HOLD = 0, wrong trade = -0.52 | Asymmetric payoff favors HOLD |
| Direction threshold too strict | 0.1% threshold vs 0.5% slippage | Correct predictions still lose money |
Verified Workflow
Step 1: Reduce Slippage Cost
# In vectorized_env.py, GPUEnvConfig
slippage_cost_crypto: float = 0.002  # Was 0.005 (0.5% -> 0.2%)
slippage_weight: float = 0.02  # Was 0.05
Step 2: Add Trading Incentive
# In vectorized_env.py, GPUEnvConfig
trading_incentive: float = 0.02 # NEW: Small bonus for non-HOLD actions
# In _calculate_rewards() after drawdown_penalty
trade_executed = (pred_direction != 0).float()
trading_incentive_reward = trade_executed * self.config.trading_incentive
# Add to combined reward
reward = (
    # ... existing components ...
    + trading_incentive_reward  # v2.5.0 - fix HOLD bias
) * risk_adjustment
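For readers without the full environment, here is a list-based stand-in for the tensor expression above. The function name is hypothetical, and the encoding of BUY as +1 and SELL as -1 is an assumption; the source only shows that 0 means HOLD.

```python
HOLD = 0  # from the source: pred_direction != 0 means a trade was executed


def trading_incentive_rewards(pred_directions, trading_incentive=0.02):
    """Return the per-sample bonus: trading_incentive for any non-HOLD action."""
    return [trading_incentive if d != HOLD else 0.0 for d in pred_directions]


# Assumed encoding for illustration: +1 = BUY, -1 = SELL, 0 = HOLD.
batch = [0, 1, -1, 0]
print(trading_incentive_rewards(batch))
```

The bonus is flat: it does not depend on whether the trade was correct, so it only shifts the baseline for trading relative to HOLD.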
Step 3: Increase Exploration Bonus
# In _calculate_rewards(), COMPONENT 5
exploration_bonus = 0.05 * uncertainty  # Was 0.01
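A quick magnitude check shows why the old multiplier was negligible. The uncertainty value of 0.1 is an assumption, inferred from the root-cause table's "0.01 × uncertainty = ~0.001"; actual values will vary per sample.

```python
def exploration_bonus(uncertainty: float, multiplier: float) -> float:
    """Bonus before any reward weighting is applied."""
    return multiplier * uncertainty


UNCERTAINTY = 0.1      # assumed typical value, implied by 0.01 * u = ~0.001
WRONG_TRADE = 0.52     # magnitude of the wrong-trade reward

for multiplier in (0.01, 0.05):  # old and new multipliers
    bonus = exploration_bonus(UNCERTAINTY, multiplier)
    share = bonus / WRONG_TRADE
    print(f"multiplier {multiplier}: bonus {bonus:.4f} ({share:.2%} of a wrong-trade penalty)")
```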
Step 4: Increase Direction Threshold
# In _calculate_rewards()
threshold = 0.003  # Was 0.001 (0.3% vs 0.1%)
# Must exceed slippage to be considered profitable
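The effect of the threshold is easiest to see on a single return. The sketch below is a hypothetical labeling function, not the code in `_calculate_rewards()`; the strict comparisons and the +1/-1/0 encoding are assumptions.

```python
def direction_label(future_return: float, threshold: float = 0.003) -> int:
    """Label a move as up (+1), down (-1), or flat (0) relative to the threshold."""
    if future_return > threshold:
        return 1
    if future_return < -threshold:
        return -1
    return 0


move = 0.002  # a 0.2% rise over the horizon
print(direction_label(move, threshold=0.001))  # counted as "up" under the old threshold
print(direction_label(move, threshold=0.003))  # counted as flat under the new threshold
```

Under the old 0.1% threshold, a 0.2% move counted as a correct BUY even though it could not cover 0.5% slippage. With the threshold at 0.3% and slippage at 0.2%, any move labeled as directional is larger than the cost of trading it.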
Step 5: Rebalance Reward Weights
# In GPUEnvConfig
direction_weight: float = 0.40  # Was 0.35 - primary signal
magnitude_weight: float = 0.05  # Was 0.10 - noisy component
pnl_weight: float = 0.25  # Keep
stop_tp_weight: float = 0.10  # Was 0.15
exploration_weight: float = 0.15  # Was 0.10
slippage_weight: float = 0.02  # Was 0.05
drawdown_penalty_weight: float = 0.03  # Was 0.05
Failed Attempts (Critical)
| Attempt | Why it Failed | Lesson Learned |
|---|---|---|
| Switch to 15-min timeframe | Smaller moves, same costs = worse HOLD bias | Fix reward function first, then consider timeframe |
| Just reduce slippage penalty | Model still biased toward HOLD | Need positive incentive for trading |
| Large trading incentive (0.1) | Caused overtrading | 0.02 is sufficient to break the tie |
| Remove slippage penalty entirely | Model overtrades, ignores costs | Need penalty, just not excessive |
Final Parameters
# vectorized_env.py - GPUEnvConfig (v2.5.0)

# Reward weights (8 components)
direction_weight: float = 0.40
magnitude_weight: float = 0.05
pnl_weight: float = 0.25
stop_tp_weight: float = 0.10
exploration_weight: float = 0.15
slippage_weight: float = 0.02
drawdown_penalty_weight: float = 0.03
trading_incentive: float = 0.02  # NEW

# Transaction costs
slippage_cost_crypto: float = 0.002  # Was 0.005
slippage_cost: float = 0.001  # Equity unchanged

# Direction threshold
threshold = 0.003  # Was 0.001 (in _calculate_rewards)

# Exploration bonus multiplier
exploration_bonus = 0.05 * uncertainty  # Was 0.01
Key Insights
- Asymmetric payoffs create bias - If HOLD=0 and wrong trade=-X, model learns HOLD is safe. Add small positive reward for trading to balance.
- Slippage must be < expected move - If cost to trade > expected profit, rational to never trade. Align slippage with actual broker costs (0.1-0.2%).
- Threshold should match slippage - Direction threshold (0.1%) below slippage (0.5%) means "correct" predictions still lose money. Set threshold >= slippage.
- Exploration needs real incentive - 0.01 multiplier is negligible. Increase to 0.05 for meaningful exploration bonus.
- Test with volatile assets first - BTCUSD has higher volatility, so if model won't trade BTC, it definitely won't trade lower-vol assets.
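Two of these insights can be turned into a pre-training sanity check. The sketch below is one possible form, under stated assumptions: `RewardSettings` is a hypothetical container (not the real `GPUEnvConfig`), and the 0.4% expected move is simply the midpoint of the ~0.3-0.5% range quoted earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardSettings:
    """Hypothetical subset of reward settings, all as fractions of price."""
    direction_threshold: float
    slippage_cost: float
    expected_move: float  # assumed typical move over the prediction horizon


def find_hold_bias_risks(s: RewardSettings) -> list:
    """Return a description of every condition that makes HOLD the rational choice."""
    risks = []
    if s.slippage_cost >= s.expected_move:
        risks.append("slippage cost is not below the expected move")
    if s.direction_threshold < s.slippage_cost:
        risks.append("direction threshold is below the slippage cost")
    return risks


before = RewardSettings(direction_threshold=0.001, slippage_cost=0.005, expected_move=0.004)
after = RewardSettings(direction_threshold=0.003, slippage_cost=0.002, expected_move=0.004)

print("before:", find_hold_bias_risks(before))
print("after: ", find_hold_bias_risks(after))
```

The pre-fix settings trip both checks, while the v2.5.0 values pass them.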
Expected Behavior After Fix
With the new reward function, expect:
- Trade rate: 30-60% (was ~5%)
- More balanced signal distribution (BUY/SELL/HOLD)
- Model takes trades when expected move > costs
- Still respects risk management (drawdown penalty)
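To compare a run against these expectations, the trade rate and signal distribution can be computed from logged actions. This is a minimal sketch: the function name is hypothetical, and the action encoding (0 = HOLD, +1 = BUY, -1 = SELL) is an assumption that should be adjusted to match the environment.

```python
from collections import Counter

ACTION_NAMES = {0: "HOLD", 1: "BUY", -1: "SELL"}  # assumed encoding


def action_summary(actions):
    """Return (trade_rate, distribution) for a non-empty sequence of action codes."""
    if not actions:
        raise ValueError("actions must be non-empty")
    n = len(actions)
    counts = Counter(actions)
    distribution = {name: counts.get(code, 0) / n for code, name in ACTION_NAMES.items()}
    trade_rate = sum(1 for a in actions if a != 0) / n
    return trade_rate, distribution


# Illustrative log of ten actions, not real training output.
sample = [0, 0, 1, -1, 0, 1, 0, 0, 0, 0]
rate, dist = action_summary(sample)
print(f"trade rate: {rate:.0%}")
print(dist)
```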
References
- alpaca_trading/gpu/vectorized_env.py: Lines 388-410 (GPUEnvConfig), 1622-1712 (_calculate_rewards)
- alpaca_trading/api/routes/signals.py: Lines 75, 136 (dashboard key fix: 'passed' -> 'pass')
- Literature: Risk-Aware RL Reward - multi-component reward design
- Skill: reward-scaling-calibration - related fix for reward_scale