---
name: account-aware-training
description: "Add account state (P&L, win rate, drawdown) to RL observations + drawdown penalty in rewards. Trigger when: (1) model needs account awareness, (2) training should penalize drawdowns, (3) upgrading obs_dim 5300→5600."
author: Claude Code
date: 2024-12-26
---

# Account-Aware RL Training (v2.4)

## Experiment Overview

| Item | Details |
| --- | --- |
| Date | 2024-12-26 |
| Goal | Make the RL model learn from account state (P&L, win rate, drawdown) |
| Environment | vectorized_env.py, inference_obs_builder.py, training notebook |
| Status | Success |

## Context

Prior to v2.4, the RL model was "blind" to account performance. It received:

- 53 features: price action, technicals, regime probabilities, calendar effects
- No information about cumulative P&L, win rate, or drawdown

**Problem:** The model could generate signals that were individually good but led to excessive drawdowns at the account level. It had no incentive to trade conservatively after losses.

**Solution:** Add 3 account-level features plus a drawdown penalty in rewards.

## Verified Workflow

### 1. Config Parameters (GPUEnvConfig)

```python
# In vectorized_env.py GPUEnvConfig dataclass (~line 405)
# Account-aware training (v2.4)
drawdown_penalty_threshold: float = 0.15  # Penalize when drawdown > 15%
drawdown_penalty_weight: float = 0.10     # Weight in reward function
```
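
For orientation, a trimmed standalone sketch of where these fields sit; the real `GPUEnvConfig` has many more fields than shown here:

```python
from dataclasses import dataclass

@dataclass
class GPUEnvConfig:
    # ... other env fields omitted in this sketch ...
    drawdown_penalty_threshold: float = 0.15  # drawdowns above 15% are penalized
    drawdown_penalty_weight: float = 0.10     # weight of the penalty in the reward
```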

### 2. Equity Tracking Tensors

```python
# In _init_state_tensors() after line 712
# Account-level equity tracking (v2.4)
self.initial_equity = torch.ones(self.n_envs, dtype=self.dtype, device=self.device)
self.peak_equity = torch.ones(self.n_envs, dtype=self.dtype, device=self.device)
self.current_equity = torch.ones(self.n_envs, dtype=self.dtype, device=self.device)
```

### 3. Reset Equity Tensors

```python
# In reset() after line 850
# Reset account-level equity tracking
self.initial_equity[env_ids] = 1.0
self.peak_equity[env_ids] = 1.0
self.current_equity[env_ids] = 1.0
```

### 4. Update Equity in step()

```python
# In step() after line 926
# Update account-level equity tracking (v2.4)
self.current_equity = self.initial_equity + self.total_pnl / (current_prices + 1e-8)
self.peak_equity = torch.maximum(self.peak_equity, self.current_equity)
```
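
A quick toy check of the high-water-mark update (standalone numbers, not from the repo): `torch.maximum` lets the peak ratchet up on new highs while never decreasing during drawdowns:

```python
import torch

peak = torch.tensor([1.00, 1.20])
current = torch.tensor([1.05, 1.10])  # env 0 hits a new high; env 1 is in drawdown
peak = torch.maximum(peak, current)
print(peak)  # tensor([1.0500, 1.2000]) -- env 1's peak is preserved
```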

### 5. Feature Count Update

```python
# In _calculate_obs_features() line 682
# Add account features
account = 3  # total_pnl_pct, rolling_win_rate, current_drawdown_pct
return base + technical + intraday + temporal + markov + extended + multi_window + account
# Result: 53 + 3 = 56 features
```

### 6. Account Features in Observations

```python
# In _get_observations() after line 1258, before sanitization

# === ACCOUNT-LEVEL FEATURES (3) - v2.4 ===

# Feature 1: Total P&L % (normalized to [-1, 1])
total_pnl_pct = self.total_pnl / (self.initial_equity + 1e-8)
total_pnl_pct_norm = torch.tanh(total_pnl_pct * 10)
obs[:, :, feat_idx] = total_pnl_pct_norm[env_ids].unsqueeze(1).expand(-1, self.config.window)
feat_idx += 1

# Feature 2: Rolling win rate (0.5 if no trades)
win_rate = torch.where(
    self.n_trades[env_ids] > 0,
    self.n_wins[env_ids].float() / self.n_trades[env_ids].float(),
    torch.full((n_envs,), 0.5, dtype=self.dtype, device=self.device)
)
obs[:, :, feat_idx] = win_rate.unsqueeze(1).expand(-1, self.config.window)
feat_idx += 1

# Feature 3: Current drawdown % [0, 1]
drawdown = (self.peak_equity[env_ids] - self.current_equity[env_ids]) / (self.peak_equity[env_ids] + 1e-8)
drawdown = torch.clamp(drawdown, 0.0, 1.0)
obs[:, :, feat_idx] = drawdown.unsqueeze(1).expand(-1, self.config.window)
feat_idx += 1
```
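
To see what the `tanh(... * 10)` squashing in Feature 1 does, a standalone toy example (illustrative values only):

```python
import torch

pnl_pct = torch.tensor([-0.50, -0.05, 0.00, 0.05, 0.50])  # -50% .. +50% P&L
print(torch.tanh(pnl_pct * 10))
# tensor([-0.9999, -0.4621,  0.0000,  0.4621,  0.9999]) -- bounded in [-1, 1]
```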

### 7. Drawdown Penalty in Rewards

```python
# In _calculate_rewards() after line 1618

# COMPONENT 7: Drawdown penalty (v2.4)
current_drawdown = (self.peak_equity - self.current_equity) / (self.peak_equity + 1e-8)
current_drawdown = torch.clamp(current_drawdown, 0.0, 1.0)

# Quadratic penalty when over threshold
drawdown_over_threshold = torch.clamp(current_drawdown - self.config.drawdown_penalty_threshold, min=0.0)
drawdown_penalty = -drawdown_over_threshold ** 2 * 10

# Add to reward combination:
reward = (
    self.config.direction_weight * direction_reward +
    self.config.magnitude_weight * magnitude_reward +
    self.config.pnl_weight * pnl_reward +
    self.config.stop_tp_weight * stop_tp_reward +
    self.config.exploration_weight * exploration_bonus +
    self.config.slippage_weight * slippage_penalty +
    self.config.drawdown_penalty_weight * drawdown_penalty  # NEW
) * risk_adjustment
```
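
Plugging toy numbers into the penalty above (0.15 threshold, before the 0.10 reward weight is applied) shows the shape of the quadratic: negligible just over the threshold, substantial deep in drawdown:

```python
import torch

threshold = 0.15
dd = torch.tensor([0.16, 0.20, 0.25])   # 16%, 20%, 25% drawdown
over = torch.clamp(dd - threshold, min=0.0)
print(-over ** 2 * 10)                  # tensor([-0.0010, -0.0250, -0.1000])
```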

### 8. Inference Observation Builder

```python
# In inference_obs_builder.py get_target_features_from_obs_dim()
if obs_dim == 5600:
    return 56  # v2.4 with account awareness (56 features x 100-bar window)
elif obs_dim == 5300:
    return 53  # v2.3
# ... legacy support
```

```python
# In build_inference_observation() after line 624
# === ACCOUNT-LEVEL FEATURES (3) - v2.4 ===
# Use neutral defaults during inference
if target_features >= 56:
    obs[:, feat_idx] = 0.0   # total_pnl_pct (no prior trades)
    feat_idx += 1
    obs[:, feat_idx] = 0.5   # win_rate (neutral prior)
    feat_idx += 1
    obs[:, feat_idx] = 0.0   # drawdown (no drawdown)
    feat_idx += 1
```
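
As a sanity check (a hypothetical standalone snippet, not repo code), the three neutral defaults should occupy the last three of the 56 feature slots:

```python
import torch

window, n_features = 100, 56
obs = torch.zeros(window, n_features)
obs[:, 53] = 0.0  # total_pnl_pct: no prior trades
obs[:, 54] = 0.5  # win_rate: neutral prior
obs[:, 55] = 0.0  # drawdown: none yet
assert obs.numel() == 5600  # the v2.4 obs_dim
```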

## Failed Attempts (Critical)

| Attempt | Why it Failed | Lesson Learned |
| --- | --- | --- |
| Account features with raw P&L values | P&L scale varies by price level | Use P&L percentage normalized with tanh |
| Win rate = 0 when no trades | Invalid input during initial episodes | Default to 0.5 (neutral prior); see the sketch below |
| Naive peak-equity update | Logical error let the peak decrease | Use torch.maximum() to track the high-water mark |
| Linear drawdown penalty | Too harsh at moderate levels | Quadratic scaling is gentler just above the threshold |
| Live inference with account state | Would need a real account connection | Use neutral defaults (0, 0.5, 0) for inference |
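
The win-rate row in isolation (toy tensors): a naive division yields NaN for zero-trade envs, while the `torch.where` guard from step 6 returns the 0.5 prior:

```python
import torch

n_wins = torch.tensor([6.0, 0.0])
n_trades = torch.tensor([10.0, 0.0])
naive = n_wins / n_trades  # tensor([0.6000, nan]) -- 0/0 poisons the obs
safe = torch.where(n_trades > 0,
                   n_wins / n_trades.clamp(min=1.0),
                   torch.full_like(n_trades, 0.5))
print(safe)                # tensor([0.6000, 0.5000])
```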

## Final Parameters

```yaml
# GPUEnvConfig (v2.4)
n_features: 56  # Was 53 in v2.3
drawdown_penalty_threshold: 0.15  # 15% drawdown starts penalty
drawdown_penalty_weight: 0.10     # Moderate weight in reward

# Feature breakdown (56 total)
base_features: 7              # price action basics
technical_features: 4         # intraday technicals
temporal_features: 7          # calendar features
markov_features: 12           # 4-chain regime probabilities
extended_features: 14         # extended technicals
multi_window_features: 9      # 20/50/100 bar windows
account_features: 3           # P&L %, win rate, drawdown %

# obs_dim = n_features * window = 56 * 100 = 5600
```
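
A quick arithmetic check of the breakdown (plain Python, nothing repo-specific):

```python
parts = {"base": 7, "technical": 4, "temporal": 7, "markov": 12,
         "extended": 14, "multi_window": 9, "account": 3}
n_features = sum(parts.values())   # 56
obs_dim = n_features * 100         # window = 100
assert (n_features, obs_dim) == (56, 5600)
```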

## Key Insights

- **Breaking change:** obs_dim 5300 → 5600 means v2.3 models CANNOT be used with v2.4 environments (see the guard sketch after this list)
- **Neutral inference:** Live trading uses the neutral defaults (0, 0.5, 0) since account state isn't tracked per prediction
- **Quadratic penalty:** The `** 2` makes the penalty gentle at 16% drawdown but harsh at 25%+
- **Normalized P&L:** `tanh(pnl * 10)` keeps values in [-1, 1] even for large P&L swings
- **0.5 win-rate prior:** Prevents model confusion during initial trades with no history
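
Given the breaking change, a small hypothetical guard (names assumed, not from the repo) that fails fast instead of silently feeding mis-shaped observations to an old checkpoint:

```python
def check_obs_dim(model_obs_dim: int, env_obs_dim: int) -> None:
    """Refuse to pair a model with an env whose obs_dim differs."""
    if model_obs_dim != env_obs_dim:
        raise ValueError(
            f"Model expects obs_dim={model_obs_dim} but env produces {env_obs_dim}; "
            "v2.3 (5300) checkpoints cannot run in v2.4 (5600) environments."
        )

check_obs_dim(5600, 56 * 100)   # OK
# check_obs_dim(5300, 5600)     # would raise ValueError
```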

## Expected Model Behavior

With account awareness, the model should learn to:

1. Reduce position sizing after losses (it sees the drawdown feature)
2. Be more selective after a poor win rate (it sees the win-rate feature)
3. Avoid compounding losses (the drawdown penalty kicks in at 15%)
4. Trade more aggressively when profitable (it sees positive P&L)

## References

- `alpaca_trading/gpu/vectorized_env.py`: lines 405 (config), 712 (tensors), 850 (reset), 926 (step), 1258 (obs)
- `alpaca_trading/gpu/inference_obs_builder.py`: lines 61-108 (feature detection), 624+ (account features)
- `notebooks/VSCode_Colab_Training_NATIVE.ipynb`: training notebook with v2.4 settings