---
name: account-aware-training
description: "Add account state (P&L, win rate, drawdown) to RL observations + drawdown penalty in rewards. Trigger when: (1) model needs account awareness, (2) training should penalize drawdowns, (3) upgrading obs_dim 5300→5600."
author: Claude Code
date: 2024-12-26
---

# Account-Aware RL Training (v2.4)

## Experiment Overview

| Item | Details |
| --- | --- |
| Date | 2024-12-26 |
| Goal | Make the RL model learn from account state (P&L, win rate, drawdown) |
| Environment | vectorized_env.py, inference_obs_builder.py, training notebook |
| Status | Success |

## Context

Prior to v2.4, the RL model was "blind" to account performance. It received:

- 53 features: price action, technicals, regime probabilities, calendar effects
- No information about cumulative P&L, win rate, or drawdown

**Problem:** The model could generate signals that were individually good but led to excessive drawdowns at the account level. It had no incentive to trade conservatively after losses.

**Solution:** Add 3 account-level features plus a drawdown penalty in rewards.

## Verified Workflow

### 1. Config Parameters (GPUEnvConfig)

```python
# In vectorized_env.py GPUEnvConfig dataclass (~line 405)
# Account-aware training (v2.4)
drawdown_penalty_threshold: float = 0.15  # Penalize when drawdown > 15%
drawdown_penalty_weight: float = 0.10     # Weight in reward function
```
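
For orientation, a trimmed standalone sketch of where these fields sit; the real `GPUEnvConfig` has many more fields than shown here:

```python
from dataclasses import dataclass

@dataclass
class GPUEnvConfig:
    # ... other env fields omitted in this sketch ...
    drawdown_penalty_threshold: float = 0.15  # drawdowns above 15% are penalized
    drawdown_penalty_weight: float = 0.10     # weight of the penalty in the reward
```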

### 2. Equity Tracking Tensors

```python
# In _init_state_tensors() after line 712
# Account-level equity tracking (v2.4)
self.initial_equity = torch.ones(self.n_envs, dtype=self.dtype, device=self.device)
self.peak_equity = torch.ones(self.n_envs, dtype=self.dtype, device=self.device)
self.current_equity = torch.ones(self.n_envs, dtype=self.dtype, device=self.device)
```

### 3. Reset Equity Tensors

```python
# In reset() after line 850
# Reset account-level equity tracking
self.initial_equity[env_ids] = 1.0
self.peak_equity[env_ids] = 1.0
self.current_equity[env_ids] = 1.0
```

### 4. Update Equity in step()

```python
# In step() after line 926
# Update account-level equity tracking (v2.4)
self.current_equity = self.initial_equity + self.total_pnl / (current_prices + 1e-8)
self.peak_equity = torch.maximum(self.peak_equity, self.current_equity)
```
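
A quick toy check of the high-water-mark update (standalone numbers, not from the repo): `torch.maximum` lets the peak ratchet up on new highs while never decreasing during drawdowns:

```python
import torch

peak = torch.tensor([1.00, 1.20])
current = torch.tensor([1.05, 1.10])  # env 0 hits a new high; env 1 is in drawdown
peak = torch.maximum(peak, current)
print(peak)  # tensor([1.0500, 1.2000]) -- env 1's peak is preserved
```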

### 5. Feature Count Update

```python
# In _calculate_obs_features() line 682
# Add account features
account = 3  # total_pnl_pct, rolling_win_rate, current_drawdown_pct
return base + technical + intraday + temporal + markov + extended + multi_window + account
# Result: 53 + 3 = 56 features
```

### 6. Account Features in Observations

```python
# In _get_observations() after line 1258, before sanitization

# === ACCOUNT-LEVEL FEATURES (3) - v2.4 ===

# Feature 1: Total P&L % (normalized to [-1, 1])
total_pnl_pct = self.total_pnl / (self.initial_equity + 1e-8)
total_pnl_pct_norm = torch.tanh(total_pnl_pct * 10)
obs[:, :, feat_idx] = total_pnl_pct_norm[env_ids].unsqueeze(1).expand(-1, self.config.window)
feat_idx += 1

# Feature 2: Rolling win rate (0.5 if no trades)
win_rate = torch.where(
    self.n_trades[env_ids] > 0,
    self.n_wins[env_ids].float() / self.n_trades[env_ids].float(),
    torch.full((n_envs,), 0.5, dtype=self.dtype, device=self.device)
)
obs[:, :, feat_idx] = win_rate.unsqueeze(1).expand(-1, self.config.window)
feat_idx += 1

# Feature 3: Current drawdown % [0, 1]
drawdown = (self.peak_equity[env_ids] - self.current_equity[env_ids]) / (self.peak_equity[env_ids] + 1e-8)
drawdown = torch.clamp(drawdown, 0.0, 1.0)
obs[:, :, feat_idx] = drawdown.unsqueeze(1).expand(-1, self.config.window)
feat_idx += 1
```
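
To see what the `tanh(... * 10)` squashing in Feature 1 does, a standalone toy example (illustrative values only):

```python
import torch

pnl_pct = torch.tensor([-0.50, -0.05, 0.00, 0.05, 0.50])  # -50% .. +50% P&L
print(torch.tanh(pnl_pct * 10))
# tensor([-0.9999, -0.4621,  0.0000,  0.4621,  0.9999]) -- bounded in [-1, 1]
```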

### 7. Drawdown Penalty in Rewards

```python
# In _calculate_rewards() after line 1618

# COMPONENT 7: Drawdown penalty (v2.4)
current_drawdown = (self.peak_equity - self.current_equity) / (self.peak_equity + 1e-8)
current_drawdown = torch.clamp(current_drawdown, 0.0, 1.0)

# Quadratic penalty when over threshold
drawdown_over_threshold = torch.clamp(current_drawdown - self.config.drawdown_penalty_threshold, min=0.0)
drawdown_penalty = -drawdown_over_threshold ** 2 * 10

# Add to reward combination:
reward = (
    self.config.direction_weight * direction_reward +
    self.config.magnitude_weight * magnitude_reward +
    self.config.pnl_weight * pnl_reward +
    self.config.stop_tp_weight * stop_tp_reward +
    self.config.exploration_weight * exploration_bonus +
    self.config.slippage_weight * slippage_penalty +
    self.config.drawdown_penalty_weight * drawdown_penalty  # NEW
) * risk_adjustment
```
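
Plugging toy numbers into the penalty above (0.15 threshold, before the 0.10 reward weight is applied) shows the shape of the quadratic: negligible just over the threshold, substantial deep in drawdown:

```python
import torch

threshold = 0.15
dd = torch.tensor([0.16, 0.20, 0.25])   # 16%, 20%, 25% drawdown
over = torch.clamp(dd - threshold, min=0.0)
print(-over ** 2 * 10)                  # tensor([-0.0010, -0.0250, -0.1000])
```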

### 8. Inference Observation Builder

```python
# In inference_obs_builder.py get_target_features_from_obs_dim()
if obs_dim == 5600:
    return 56  # v2.4 with account awareness (56 features x 100-bar window)
elif obs_dim == 5300:
    return 53  # v2.3
# ... legacy support
```

```python
# In build_inference_observation() after line 624
# === ACCOUNT-LEVEL FEATURES (3) - v2.4 ===
# Use neutral defaults during inference
if target_features >= 56:
    obs[:, feat_idx] = 0.0   # total_pnl_pct (no prior trades)
    feat_idx += 1
    obs[:, feat_idx] = 0.5   # win_rate (neutral prior)
    feat_idx += 1
    obs[:, feat_idx] = 0.0   # drawdown (no drawdown)
    feat_idx += 1
```
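
As a sanity check (a hypothetical standalone snippet, not repo code), the three neutral defaults should occupy the last three of the 56 feature slots:

```python
import torch

window, n_features = 100, 56
obs = torch.zeros(window, n_features)
obs[:, 53] = 0.0  # total_pnl_pct: no prior trades
obs[:, 54] = 0.5  # win_rate: neutral prior
obs[:, 55] = 0.0  # drawdown: none yet
assert obs.numel() == 5600  # the v2.4 obs_dim
```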

## Failed Attempts (Critical)

| Attempt | Why it Failed | Lesson Learned |
| --- | --- | --- |
| Account features with raw P&L values | P&L scale varies by price level | Use P&L percentage normalized with tanh |
| Win rate = 0 when no trades | Invalid input during initial episodes | Default to 0.5 (neutral prior); see the sketch below |
| Naive peak-equity update | Logical error let the peak decrease | Use torch.maximum() to track the high-water mark |
| Linear drawdown penalty | Too harsh at moderate levels | Quadratic scaling is gentler just above the threshold |
| Live inference with account state | Would need a real account connection | Use neutral defaults (0, 0.5, 0) for inference |
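
The win-rate row in isolation (toy tensors): a naive division yields NaN for zero-trade envs, while the `torch.where` guard from step 6 returns the 0.5 prior:

```python
import torch

n_wins = torch.tensor([6.0, 0.0])
n_trades = torch.tensor([10.0, 0.0])
naive = n_wins / n_trades  # tensor([0.6000, nan]) -- 0/0 poisons the obs
safe = torch.where(n_trades > 0,
                   n_wins / n_trades.clamp(min=1.0),
                   torch.full_like(n_trades, 0.5))
print(safe)                # tensor([0.6000, 0.5000])
```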

## Final Parameters

```yaml
# GPUEnvConfig (v2.4)
n_features: 56  # Was 53 in v2.3
drawdown_penalty_threshold: 0.15  # 15% drawdown starts penalty
drawdown_penalty_weight: 0.10     # Moderate weight in reward

# Feature breakdown (56 total)
base_features: 7              # price action basics
technical_features: 4         # intraday technicals
temporal_features: 7          # calendar features
markov_features: 12           # 4-chain regime probabilities
extended_features: 14         # extended technicals
multi_window_features: 9      # 20/50/100 bar windows
account_features: 3           # P&L %, win rate, drawdown %

# obs_dim = n_features * window = 56 * 100 = 5600
```
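
A quick arithmetic check of the breakdown (plain Python, nothing repo-specific):

```python
parts = {"base": 7, "technical": 4, "temporal": 7, "markov": 12,
         "extended": 14, "multi_window": 9, "account": 3}
n_features = sum(parts.values())   # 56
obs_dim = n_features * 100         # window = 100
assert (n_features, obs_dim) == (56, 5600)
```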

## Key Insights

- **Breaking change:** obs_dim 5300 → 5600 means v2.3 models CANNOT be used with v2.4 environments (see the guard sketch after this list)
- **Neutral inference:** Live trading uses the neutral defaults (0, 0.5, 0) since account state isn't tracked per prediction
- **Quadratic penalty:** The `** 2` makes the penalty gentle at 16% drawdown but harsh at 25%+
- **Normalized P&L:** `tanh(pnl * 10)` keeps values in [-1, 1] even for large P&L swings
- **0.5 win-rate prior:** Prevents model confusion during initial trades with no history
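
Given the breaking change, a small hypothetical guard (names assumed, not from the repo) that fails fast instead of silently feeding mis-shaped observations to an old checkpoint:

```python
def check_obs_dim(model_obs_dim: int, env_obs_dim: int) -> None:
    """Refuse to pair a model with an env whose obs_dim differs."""
    if model_obs_dim != env_obs_dim:
        raise ValueError(
            f"Model expects obs_dim={model_obs_dim} but env produces {env_obs_dim}; "
            "v2.3 (5300) checkpoints cannot run in v2.4 (5600) environments."
        )

check_obs_dim(5600, 56 * 100)   # OK
# check_obs_dim(5300, 5600)     # would raise ValueError
```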

## Expected Model Behavior

With account awareness, the model should learn to:

1. Reduce position sizing after losses (it sees the drawdown feature)
2. Be more selective after a poor win rate (it sees the win-rate feature)
3. Avoid compounding losses (the drawdown penalty kicks in at 15%)
4. Trade more aggressively when profitable (it sees positive P&L)

## References

- `alpaca_trading/gpu/vectorized_env.py`: lines 405 (config), 712 (tensors), 850 (reset), 926 (step), 1258 (obs)
- `alpaca_trading/gpu/inference_obs_builder.py`: lines 61-108 (feature detection), 624+ (account features)
- `notebooks/VSCode_Colab_Training_NATIVE.ipynb`: training notebook with v2.4 settings