Training Archive & Model Gating System (v2.4.1)
Experiment Overview
| Item | Details |
|---|---|
| Date | 2024-12-24 |
| Goal | Create mandatory post-training archival with automatic model classification |
| Environment | alpaca_trading.training package, Colab/local training |
| Status | Success |
Context
Training runs produced summaries with validation metrics, but there was no:
- Structured archival - Summaries were ephemeral JSON files
- Model classification - No clear criteria for deployment readiness
- Overfitting detection - No automatic checkpoint recommendations
- Historical reference - No way to track model improvements over time
The solution: TrainingArchiveManager with automatic gating based on fitness, profit factor, consistency, and drawdown thresholds.
Verified Workflow
Model Gating Thresholds
NOTE (v2.4.1): Thresholds are calibrated for reward_scale=0.001. MaxDD is a proxy metric reflecting reward volatility during validation, not actual equity drawdown. With conservative reward scaling:
- 8% proxy MaxDD = rewards staying mostly positive
- 15% proxy MaxDD = occasional negative reward streaks
| Classification | Fitness | PF | Consistency | MaxDD | Action |
|---|---|---|---|---|---|
| APPROVED | >= 0.70 | >= 1.8 | >= 85% | <= 8% | Deploy to production |
| REVIEW | 0.50-0.70 | 1.3-1.8 | 65-85% | 8-15% | Manual review required |
| DROP | < 0.50 | < 1.3 | < 65% | > 15% | Do not deploy |
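The table can be read as a three-way decision. The sketch below is one possible reading, not the code in gating.py: `Thresholds` and `classify` are hypothetical names, consistency and MaxDD are expressed as fractions, and it assumes that a single metric in the DROP band is enough to drop a model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical container mirroring the table (not ModelGatingConfig)."""
    approved_min_fitness: float = 0.70
    approved_min_pf: float = 1.8
    approved_min_consistency: float = 0.85
    approved_max_drawdown: float = 0.08
    review_min_fitness: float = 0.50
    review_min_pf: float = 1.3
    review_min_consistency: float = 0.65
    review_max_drawdown: float = 0.15


def classify(fitness, pf, consistency, max_dd, t=Thresholds()):
    """Return 'APPROVED', 'REVIEW', or 'DROP' for one model."""
    # APPROVED requires every metric to clear its strictest threshold.
    if (fitness >= t.approved_min_fitness
            and pf >= t.approved_min_pf
            and consistency >= t.approved_min_consistency
            and max_dd <= t.approved_max_drawdown):
        return "APPROVED"
    # Assumption: any single metric in the DROP band drops the model.
    if (fitness < t.review_min_fitness
            or pf < t.review_min_pf
            or consistency < t.review_min_consistency
            or max_dd > t.review_max_drawdown):
        return "DROP"
    # Everything else falls between the two bands.
    return "REVIEW"


print(classify(0.75, 1.9, 0.88, 0.06))  # every metric clears the APPROVED row
print(classify(0.75, 2.5, 0.70, 0.06))  # consistency sits in the REVIEW band
print(classify(0.75, 2.5, 0.60, 0.06))  # consistency sits in the DROP band
```

Under these assumptions, a high profit factor cannot compensate for weak consistency: the second and third calls differ only in consistency and land in different classes.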
Overfitting Detection
python
# Fitness decline from peak triggers checkpoint recommendation
fitness_decline_threshold = 0.05  # 5% decline from peak
fitness_oscillation_threshold = 0.10  # 10% swing = unstable training
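A minimal sketch of how these two thresholds might be applied to a fitness history. `detect_overfitting` is a hypothetical helper, not the function in gating.py, and it makes two assumptions the notes do not confirm: the decline is measured relative to the peak value, and a "swing" is the largest absolute change between consecutive validation points.

```python
def detect_overfitting(fitness_history,
                       decline_threshold=0.05,
                       oscillation_threshold=0.10):
    """Return (use_checkpoint, unstable, best_idx) for a list of fitness values."""
    if not fitness_history:
        return False, False, None

    peak = max(fitness_history)
    best_idx = fitness_history.index(peak)  # first validation point at the peak
    final = fitness_history[-1]

    # Assumption: decline is expressed as a fraction of the peak value.
    decline = (peak - final) / peak if peak > 0 else 0.0
    use_checkpoint = decline > decline_threshold

    # Assumption: oscillation is the largest step between consecutive points.
    swings = [abs(b - a) for a, b in zip(fitness_history, fitness_history[1:])]
    unstable = max(swings, default=0.0) > oscillation_threshold

    return use_checkpoint, unstable, best_idx


# Small decline (about 3.8% below peak), small steps: nothing is triggered.
print(detect_overfitting([0.70, 0.75, 0.78, 0.75]))  # (False, False, 2)

# 12.5% below peak with a 0.20 jump between points: both conditions trigger.
print(detect_overfitting([0.60, 0.80, 0.65, 0.70]))  # (True, True, 1)
```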
Archive Usage
python
from alpaca_trading.training import TrainingArchiveManager
# Archive a training run with automatic gating.
# summary_data is the run's training summary (its structure is not shown here).
archive_mgr = TrainingArchiveManager(archive_dir='training_archives')
archive = archive_mgr.archive_training_run(summary_data)
# Results
print(f"APPROVED: {archive.approved_count}")
print(f"REVIEW: {archive.review_count}")
print(f"DROP: {archive.dropped_count}")
# Get approved models for deployment
approved = archive_mgr.get_approved_models(archive.timestamp)
print(f"Ready for deployment: {approved}")
# Get symbol history across runs
history = archive_mgr.get_symbol_history('AAPL')
Archive Structure
code
training_archives/
├── index.json # Master index of all runs
├── {timestamp}/
│ ├── summary.json # Raw training summary
│ ├── model_assessments.json # Per-model gating decisions
│ └── recommendations.md # Human-readable report
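Because each run lives in its own directory, the per-model decisions can also be read back with nothing but the standard library. The sketch below relies only on the layout shown above; `load_assessments` is a hypothetical helper, and since the schema of model_assessments.json is not documented here, the parsed JSON is returned as-is.

```python
import json
from pathlib import Path


def load_assessments(archive_dir="training_archives"):
    """Map each run directory name to its parsed model_assessments.json."""
    root = Path(archive_dir)
    if not root.is_dir():
        return {}

    runs = {}
    for run_dir in sorted(root.iterdir()):
        assessments = run_dir / "model_assessments.json"
        # Skip index.json and any directory without an assessments file.
        if run_dir.is_dir() and assessments.is_file():
            with assessments.open(encoding="utf-8") as fh:
                runs[run_dir.name] = json.load(fh)
    return runs


# Lists the archived run names; empty if the directory does not exist yet.
print(sorted(load_assessments("training_archives")))
```

For normal use, TrainingArchiveManager remains the intended entry point; a reader like this is mainly useful for ad hoc inspection.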
Gating Configuration
python
from alpaca_trading.training import ModelGatingConfig
from alpaca_trading.training.gating import assess_model_quality  # defined in gating.py
# Custom thresholds (stricter than v2.4.1 defaults)
config = ModelGatingConfig(
    approved_min_fitness=0.80,      # Default: 0.70
    approved_min_pf=2.0,            # Default: 1.8
    approved_min_consistency=0.90,  # Default: 0.85
    approved_max_drawdown=0.05,     # Default: 0.08 (5% proxy MaxDD)
)
# Assess one model against the custom config
classification, flags, use_checkpoint, best_idx = assess_model_quality(
    final_fitness=0.75,
    final_pf=1.9,
    final_consistency=0.88,
    final_max_dd=0.06,  # 6% proxy MaxDD
    fitness_history=[0.70, 0.75, 0.78, 0.75],
    config=config,
)
Failed Attempts (Critical)
| Attempt | Why it Failed | Lesson Learned |
|---|---|---|
| Manual assessment | Inconsistent criteria, subjective | Use fixed thresholds in code |
| Single metric gating | Models with high PF but poor consistency slipped through | Require ALL thresholds met |
| No overfitting detection | Models deployed that had peaked earlier | Track fitness history, recommend checkpoints |
| ASCII-only reports | Unicode errors on Windows (emoji characters) | Use encoding='utf-8' on all file operations |
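The encoding lesson in the last row comes down to passing encoding='utf-8' explicitly instead of relying on the platform default, which on Windows is often a legacy code page that cannot represent emoji. A minimal sketch, with hypothetical helper names and an illustrative report line:

```python
import tempfile
from pathlib import Path


def write_report(path, text):
    """Write a report with an explicit encoding instead of the platform default."""
    Path(path).write_text(text, encoding="utf-8")


def read_report(path):
    """Read the report back with the same explicit encoding."""
    return Path(path).read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    report = Path(tmp) / "recommendations.md"
    content = "✅ APPROVED: AAPL\n"  # illustrative content, not real output
    write_report(report, content)
    # Print a boolean rather than the emoji, so the console encoding is not involved.
    print(read_report(report) == content)
```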
Key Insights
Why Multiple Thresholds
- Fitness alone is insufficient - High fitness can mask poor profit factor
- Consistency matters - A model with PF=5.0 but 50% consistency is risky
- Drawdown is critical - High-equity models can still blow up
- All thresholds must pass - A single weak metric can indicate problems
Checkpoint Recommendations
When use_checkpoint=True is returned:
- Model's final fitness declined >5% from peak
- best_idx indicates which validation point had peak fitness
- Calculate checkpoint update: checkpoint_update = (best_idx + 1) * validation_interval
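As a worked example of the formula, the sketch below finds the peak in a fitness history and converts its index into an update count. The history values and validation_interval=50 are made up for illustration, and the function name is hypothetical. The `+ 1` in the formula implies that the first validation point is recorded after one full interval.

```python
def checkpoint_update(fitness_history, validation_interval):
    """Return the update count of the validation point with peak fitness."""
    if not fitness_history:
        raise ValueError("fitness_history must not be empty")
    # Index of the first validation point that reached the peak fitness.
    best_idx = max(range(len(fitness_history)), key=fitness_history.__getitem__)
    # Validation point 0 corresponds to the end of the first interval.
    return (best_idx + 1) * validation_interval


history = [0.62, 0.81, 0.74, 0.70]  # peak at index 1
print(checkpoint_update(history, validation_interval=50))  # 100
```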
Flags Returned
| Flag | Meaning |
|---|---|
| FITNESS_DECLINE | Final < peak by >5% |
| UNSTABLE_TRAINING | Fitness oscillation >10% |
| LOW_FITNESS | Below REVIEW threshold |
| LOW_PF | Profit factor below threshold |
| LOW_CONSISTENCY | Consistency below threshold |
| HIGH_DRAWDOWN | Max drawdown above threshold |
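The table does not say which threshold each metric flag is compared against. The sketch below covers only the four metric flags and assumes they fire at the REVIEW floor (fitness 0.50, PF 1.3, consistency 65%, MaxDD 15%). `metric_flags` is a hypothetical helper, and the real assess_model_quality() may use different cut-offs.

```python
def metric_flags(fitness, pf, consistency, max_dd,
                 min_fitness=0.50, min_pf=1.3,
                 min_consistency=0.65, max_drawdown=0.15):
    """Return the metric-based flags raised for one model, in table order."""
    # Assumption: each flag is raised when the metric falls outside the REVIEW band.
    checks = [
        ("LOW_FITNESS", fitness < min_fitness),
        ("LOW_PF", pf < min_pf),
        ("LOW_CONSISTENCY", consistency < min_consistency),
        ("HIGH_DRAWDOWN", max_dd > max_drawdown),
    ]
    return [name for name, raised in checks if raised]


print(metric_flags(0.45, 1.5, 0.60, 0.20))
# ['LOW_FITNESS', 'LOW_CONSISTENCY', 'HIGH_DRAWDOWN']
print(metric_flags(0.75, 1.9, 0.88, 0.06))
# []
```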
Files Created
code
alpaca_trading/training/__init__.py   # Package exports
alpaca_trading/training/gating.py     # ModelGatingConfig, assess_model_quality()
alpaca_trading/training/archive.py    # TrainingArchiveManager
tests/test_training_archive.py        # 20 unit tests
References
- alpaca_trading/training/gating.py: Lines 40-128 (gating logic)
- alpaca_trading/training/archive.py: Lines 55-393 (archive manager)
- tests/test_training_archive.py: Full test suite
- CLAUDE.md: Model Gating Standards section