Empirical Config Builder - Research Notes

Experiment Overview

Item	Details
Date	2026-01-08
Goal	Replace hardcoded selection thresholds with data-driven values
Environment	Python 3.10+, SymbolDatabase, numpy
Status	Success

Context

Universe selection had many hardcoded "magic numbers":

•MIN_VOLUME_USD_EQUITY = 1_000_000 - why $1M?
•MIN_PRICE_EQUITY = 5.0 - why $5?
•SECTOR_TOP_PCT = 0.30 - why 30%?

These values were guessed when first written and never validated against actual market data. The empirical config builder instead derives them from percentiles of the market data stored in SymbolDatabase.

Parameters Analysis

Can Be Data-Driven (6 parameters)

Parameter	Derivation Method	Code
min_volume_equity	P50 of equity daily dollar volume	`volume_pct[50]`
min_price	P5 of equity prices	`price_pct[5]`
max_price	P99 of equity prices	`price_pct[99]`
sector_top_pct	`target_candidates / equities_passing_volume`	Calculated
min_per_sector	`median_sector_size / 10`	Calculated
max_per_sector	`median_sector_size`	Calculated
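
The percentile rows above amount to lookups into a percentile table. A minimal sketch, assuming per-symbol volumes and prices are already available as plain lists. The helper `percentile_table`, the positive-value filter, and the sample numbers are illustrative and not taken from `empirical_config.py`. It uses the standard library's `statistics.quantiles` to stay dependency-free, whereas the environment listed above uses numpy:

```python
from statistics import quantiles

def percentile_table(values, wanted=(5, 25, 50, 75, 99)):
    """Return {percentile: value}, interpolating linearly between data points."""
    # Assumption: missing or non-positive entries are dropped before ranking.
    clean = sorted(v for v in values if v is not None and v > 0)
    # n=100 yields 99 cut points; cut point i-1 is the i-th percentile.
    cuts = quantiles(clean, n=100, method="inclusive")
    return {p: cuts[p - 1] for p in wanted}

# Illustrative sample data, not real market values.
sample_volumes = [50_000, 120_000, 180_000, 400_000, 2_500_000]
sample_prices = [0.80, 4.50, 22.00, 95.00, 310.00]

volume_pct = percentile_table(sample_volumes)
price_pct = percentile_table(sample_prices)

min_volume_equity = volume_pct[50]   # median dollar volume
min_price = price_pct[5]             # cuts off the cheapest 5%
max_price = price_pct[99]            # cuts off the most expensive 1%
```

Keying the table by percentile keeps the call sites identical to the `volume_pct[50]` and `price_pct[5]` expressions in the table.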

Should Stay Hardcoded (Theory-Based)

Parameter	Value	Why Fixed
hurst_short_target	(0.30, 0.50)	Literature: H<0.5 = mean-reverting
hurst_long_target	(0.50, 0.70)	Literature: H>0.5 = trending
half_life_target_hours	(4, 24)	Trading frequency constraint
regime_duration_target	(5, 20)	Markov model requirement
scoring weights	Sum to 1.0	Design decision

Verified Workflow

1. Basic Usage (Notebook)

python

# In training notebook cell-14:
USE_EMPIRICAL_THRESHOLDS = True   # Enable empirical mode
TARGET_CANDIDATES = 1500          # Target candidate count

# Thresholds are automatically derived in cell-16

2. Programmatic Usage

python

from alpaca_trading.selection import SymbolDatabase
from alpaca_trading.selection.empirical_config import build_config_from_database

db = SymbolDatabase(db_path='data/symbol_database.db')
result = build_config_from_database(
    db=db,
    target_candidates=1500,       # How many candidates you want
    volume_percentile=50,         # P50 = median (top 50% by volume)
    price_percentile_low=5,       # Exclude bottom 5% (penny stocks)
    price_percentile_high=99,     # Exclude top 1% (too expensive)
)

# Use the derived config
config = result.config

# See what was derived
print(result.describe())
# Output:
# ======================================================================
# EMPIRICAL CONFIGURATION (derived from market data)
# ======================================================================
#
# DERIVED THRESHOLDS:
#   min_volume_equity             : $180,432  [P50 of equity volume]
#   min_price                     : 1.25  [P5 of equity price]
#   max_price                     : 892.50  [P99 of equity price]
#   sector_top_pct                : 41.67%  [calculated for 1500 target candidates]
#   min_per_sector                : 45  [median_sector_size / 10]
#   max_per_sector                : 450  [median_sector_size]

3. With Correlation Estimation (Advanced)

python

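from alpaca_trading.selection.empirical_config import build_full_empirical_config

# `db` is the SymbolDatabase from the previous example; `fetcher` is an
# existing data-fetcher instance (its construction is not shown in these notes).
result = build_full_empirical_config(
    db=db,
    data_fetcher=fetcher,         # Required for correlation
    target_candidates=1500,
    estimate_correlations=True,   # Compute actual correlations
)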

# max_correlation is now derived from P75 of pairwise correlations
print(f"max_correlation: {result.config.max_correlation:.2f}")

Output Structure

python

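from dataclasses import dataclass
from typing import Any, Dict

# SelectionConfig is the project's existing selection config class (import not shown).
@dataclass
class EmpiricalConfigResult:
    config: SelectionConfig            # Ready-to-use config
    thresholds: Dict[str, Any]         # All derived values
    derivation_method: Dict[str, str]  # How each was derived
    data_summary: Dict[str, Any]       # Market stats used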
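
Because `thresholds` and `derivation_method` describe the same parameters, a report can be produced by walking the two dictionaries together. The sketch below assumes they share keys and uses a hypothetical stand-in class, `ResultSketch`, and a hypothetical helper, `format_thresholds`; it is not the actual `describe()` implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class ResultSketch:
    """Hypothetical stand-in for EmpiricalConfigResult (config and data_summary omitted)."""
    thresholds: Dict[str, Any] = field(default_factory=dict)
    derivation_method: Dict[str, str] = field(default_factory=dict)

def format_thresholds(result: ResultSketch) -> str:
    """One line per threshold: name, value, and how it was derived."""
    lines = []
    for name, value in result.thresholds.items():
        # Assumption: both dicts are keyed by parameter name.
        method = result.derivation_method.get(name, "unknown")
        lines.append(f"{name:<30}: {value}  [{method}]")
    return "\n".join(lines)

sketch = ResultSketch(
    thresholds={"min_price": 1.25, "max_price": 892.50},
    derivation_method={
        "min_price": "P5 of equity price",
        "max_price": "P99 of equity price",
    },
)
report = format_thresholds(sketch)
print(report)
```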

Failed Attempts (Critical)

Attempt	Why it Failed	Lesson Learned
Fetching snapshots directly	Redundant API calls when DB exists	Use SymbolDatabase
Fixed percentiles for all	Different markets need different P values	Crypto uses P25 for volume
Using mean instead of median	Outliers skew mean significantly	Use percentiles (median = P50), not the mean
Deriving Hurst targets	Theory-based, not market-dependent	Keep as hardcoded
Same min_per_sector everywhere	Small sectors need protection	Use median_sector_size / 10
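
The mean-versus-median row is easy to reproduce with a toy sample. The numbers below are illustrative, not real market data: four ordinary symbols plus one very liquid outlier.

```python
from statistics import mean, median

# Illustrative daily dollar volumes: four ordinary symbols and one outlier.
volumes = [90_000, 150_000, 180_000, 260_000, 50_000_000]

mean_volume = mean(volumes)
median_volume = median(volumes)

print(f"mean:   ${mean_volume:,.0f}")
print(f"median: ${median_volume:,.0f}")
```

A single outlier drags the mean far above every ordinary symbol in the sample, while the median stays at a value that half the sample actually meets.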

Key Insights

•Volume percentile choice matters:
  - P25: Very inclusive (~7000 candidates)
  - P50: Balanced (~3600 candidates)
  - P75: Selective (~1800 candidates)
•Price percentiles:
  - P5 excludes penny stocks without guessing "$5"
  - P99 excludes extremely expensive stocks naturally
•Sector filtering auto-calculation:
  - sector_top_pct = target_candidates / equities_passing_volume
  - Clamped to [0.15, 0.50] to prevent extremes
  - min/max per sector derived from actual sector sizes
•Correlation threshold:
  - P75 of pairwise correlations is a reasonable threshold
  - Computing this requires historical data (expensive)
  - Optional - default 0.60 is usually fine
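
The sector filtering auto-calculation can be sketched as follows. The function name `derive_sector_params`, the rounding (floor to whole symbols), and the floor of 1 on `min_per_sector` are assumptions; the formulas and the [0.15, 0.50] clamp come from these notes.

```python
from statistics import median

def derive_sector_params(target_candidates, equities_passing_volume,
                         sector_sizes, pct_bounds=(0.15, 0.50)):
    """Derive sector_top_pct, min_per_sector and max_per_sector."""
    if equities_passing_volume <= 0:
        raise ValueError("equities_passing_volume must be positive")
    low, high = pct_bounds
    raw_pct = target_candidates / equities_passing_volume
    sector_top_pct = min(max(raw_pct, low), high)  # clamp to [low, high]

    median_sector_size = median(sector_sizes)
    return {
        "sector_top_pct": sector_top_pct,
        # Assumption: floor to whole symbols and never go below 1.
        "min_per_sector": max(1, int(median_sector_size // 10)),
        "max_per_sector": int(median_sector_size),
    }

# 1500 and ~3600 come from the notes; the sector sizes are made up so that
# their median is 450.
params = derive_sector_params(1500, 3600, [120, 300, 450, 600, 900])
print(f"sector_top_pct: {params['sector_top_pct']:.2%}")
```

With a much larger pool, for example `derive_sector_params(1500, 20000, ...)`, the raw ratio of 7.5% falls below the lower bound and is raised to 15%.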

Files Modified

File	Changes
`alpaca_trading/selection/empirical_config.py`	Added `build_config_from_database()`, `EmpiricalConfigResult`
`notebooks/training.ipynb`	Added `USE_EMPIRICAL_THRESHOLDS` option
`CLAUDE.md`	Added empirical config documentation

Typical Results

Parameter	Hardcoded	Empirical (volume_percentile=50)
min_volume_equity	$1,000,000	$180,432
min_price	$5.00	$1.25
max_price	$10,000	$892.50
sector_top_pct	30%	42%

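Observation: The hardcoded minimums (min_volume_equity, min_price) and sector_top_pct were more restrictive than the empirical values, while the hardcoded max_price was far looser. The tighter minimums explain why selection sometimes returned fewer candidates than expected.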
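
One way to check how restrictive a hardcoded threshold is: ask where it falls in the observed distribution. The helper `percentile_rank` and the sample volumes below are hypothetical and only illustrate the idea.

```python
def percentile_rank(values, threshold):
    """Percentage of values strictly below the threshold."""
    values = list(values)
    if not values:
        raise ValueError("values must not be empty")
    below = sum(1 for v in values if v < threshold)
    return 100.0 * below / len(values)

# Illustrative daily dollar volumes, not real market data.
sample_volumes = [50_000, 120_000, 180_000, 400_000, 2_500_000]

hardcoded_rank = percentile_rank(sample_volumes, 1_000_000)
print(f"$1,000,000 sits at about P{hardcoded_rank:.0f} of the sample")
```

A hardcoded minimum that lands well above P50 rejects more than half of the universe before any other filter runs.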

References

•Skill: symbol-database-selection - SymbolDatabase infrastructure
•Skill: per-sector-candidate-filtering - Sector filtering parameters
•Skill: symbol-selection-statistical - Statistical selection theory