AgentSkillsCN

reward-function-v330

当您需要调优强化学习奖励函数权重、修复奖励作弊问题,或解决损益梯度相关问题时,可使用此技能。

SKILL.md
--- frontmatter
name: reward-function-v330
description: "Use when tuning RL reward function weights, fixing reward hacking, or addressing P&L gradient issues"
author: Claude
date: 2025-01-21
version: v3.3.0

Reward Function v3.3.0 - Risk-Aware Composite Reward

Overview

ItemDetails
Date2025-01-21
GoalFix reward hacking and P&L gradient saturation
Filesalpaca_trading/gpu/vectorized_env.py, alpaca_trading/gpu/ppo_trainer_native.py
StatusSuccess

Context

v2.5.0 introduced trading_incentive (+0.02 for non-HOLD actions) to fix HOLD bias. However, this created the opposite problem: reward hacking via overtrading.

Additional issues:

  • Direction weight (40%) dominated despite being binary (+1/-1)
  • P&L weight (25%) too low despite being the actual objective
  • P&L saturated via tanh(pnl * 10) - destroyed gradient signal beyond ±0.1% moves
  • Exploration was limited due to low entropy coefficient (0.005)

Solution: Rebalanced Risk-Aware Reward

Weight Changes

Componentv2.5.0v3.3.0Rationale
direction_weight0.400.15Binary signal, shouldn't dominate
pnl_weight0.250.40Primary trading objective
magnitude_weight0.050.05Unchanged
exploration_weight0.150.15Unchanged
slippage_cost0.020.02Unchanged
drawdown_penalty0.030.03Unchanged
stop_tp_weight0.100.10Unchanged
trading_incentive0.02REMOVEDCaused reward hacking

P&L Reward Fix

python
# OLD (v2.5.0) - saturates quickly, destroys gradients
pnl_reward = torch.tanh(pnl * 10)  # Clips at ±1 for |pnl| > 0.1%

# NEW (v3.3.0) - linear with wider range, preserves gradients
pnl_reward = torch.clamp(pnl * 100, -2.0, 2.0)  # Linear until ±2% move

Why this matters:

  • tanh(pnl * 10) reaches 0.999 at pnl=0.3% - no gradient for larger moves
  • clamp(pnl * 100, -2, 2) preserves linear gradient until ±2% moves
  • Larger moves still get proportionally larger rewards

PPO Hyperparameters

Parameterv2.5.0v3.3.0Rationale
gamma0.9950.991Shorter horizon for hourly trading
ent_coef0.0050.0153x increase for exploration

Code Location

python
# vectorized_env.py - GPUEnvConfig dataclass (~line 479)
@dataclass
class GPUEnvConfig:
    # v3.3.0: Risk-Aware Composite Reward weights
    direction_weight: float = 0.15       # Reduced from 0.40
    magnitude_weight: float = 0.05       # Unchanged
    pnl_weight: float = 0.40             # Increased from 0.25
    exploration_weight: float = 0.15     # Unchanged
    slippage_cost_equity: float = 0.001  # Unchanged
    slippage_cost_crypto: float = 0.002  # Unchanged
    drawdown_penalty: float = 0.03       # Unchanged
    stop_tp_weight: float = 0.10         # Unchanged
    # REMOVED: trading_incentive (caused reward hacking)
python
# ppo_trainer_native.py - NativePPOConfig dataclass (~line 120)
@dataclass
class NativePPOConfig:
    gamma: float = 0.991      # v3.3.0: Reduced from 0.995
    ent_coef: float = 0.015   # v3.3.0: 3x increase from 0.005

Failed Attempts

AttemptWhy it FailedLesson Learned
trading_incentive = 0.02Created overtrading, model collected bonus constantlyRemove entirely, use proper P&L weighting
tanh(pnl * 10)Gradients vanish at ±0.1%, couldn't learn from larger movesUse linear clamp with wider range
direction_weight = 0.40Binary signal dominated continuous P&LReduce to 0.15, prioritize P&L
ent_coef = 0.005Insufficient exploration, model converged to local optima3x increase to 0.015
gamma = 0.995Too long horizon for hourly trading decisionsReduce to 0.991 for shorter horizon

Key Insights

Trading Incentive Creates Reward Hacking

The model learned: "Any trade gives +0.02, slippage penalty is ~0.02, so trade constantly and hope for winners." This is classic reward hacking - optimizing the surrogate (incentive) rather than the true objective (P&L).

P&L Should Be Primary Objective

Direction is a weak proxy - being "right" about direction but wrong about magnitude loses money. P&L directly measures what we want: profit.

Entropy Coefficient Balances Exploration

  • Too low (0.005): Model converges quickly but to local optima
  • Too high (>0.05): Model explores randomly, never converges
  • Sweet spot (0.015): Enough exploration to find good strategies

Gamma Affects Time Horizon

  • gamma=0.995 means 50% credit at step 139 (hourly = 6+ days)
  • gamma=0.991 means 50% credit at step 77 (hourly = 3+ days)
  • Shorter horizon more appropriate for hourly trading

Related Skills

  • reward-function-hold-bias - Historical context on HOLD bias fix
  • reward-scaling-calibration - reward_scale=0.001 calibration
  • training-improvements-v245 - LR warmup and validation improvements