Calibration Analysis Skill
Purpose
Systematic analysis of model predictions against realized outcomes. It shows where each model is wrong, under which conditions, and by how much.
When to Use
- •Daily model health checks
- •Debugging unexpected PnL patterns
- •Deciding which model component to improve next
- •Validating changes before/after deployment
- •Any time you want to know "is this model actually working?"
Prerequisites
- •measurement-infrastructure must be implemented and logging data
- •At least 24 hours of prediction data (preferably 1 week+)
Key Metrics
1. Brier Score
The mean squared error of probability predictions:
code
BS = (1/N) Σ (pᵢ - oᵢ)²
Where:
- •pᵢ = predicted probability
- •oᵢ ∈ {0, 1} = actual outcome
Interpretation:
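- •BS = 0: Perfect predictions
- •BS = 0.25: Equivalent to predicting 50% for everything (uninformative)
- •BS > 0.25: Worse than always predicting 50%
- •For imbalanced outcomes, compare against always predicting the base rate ō, which scores ō(1 - ō), below 0.25 whenever ō ≠ 0.5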
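The formula above can be computed directly from the logged pairs. The following is a minimal sketch; the function name `brier_score` and the choice to return `None` on empty or misaligned input are assumptions made for illustration, not part of the existing code.

```rust
/// Mean squared error between predicted probabilities and binary outcomes.
/// Returns None when the inputs are empty or have different lengths.
fn brier_score(predictions: &[f64], outcomes: &[bool]) -> Option<f64> {
    if predictions.is_empty() || predictions.len() != outcomes.len() {
        return None;
    }
    let sum_sq: f64 = predictions
        .iter()
        .zip(outcomes)
        .map(|(&p, &o)| {
            let target = if o { 1.0 } else { 0.0 };
            (p - target).powi(2)
        })
        .sum();
    Some(sum_sq / predictions.len() as f64)
}
```

For example, `brier_score(&[0.5; 4], &[true, false, true, false])` returns `Some(0.25)`: a constant 50% forecast scores 0.25 whatever the outcomes are.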
2. Brier Score Decomposition
Split Brier score into interpretable components:
code
BS = Reliability - Resolution + Uncertainty

Reliability = (1/N) Σₖ nₖ(p̄ₖ - ōₖ)²
  - Measures calibration quality
  - Lower is better
  - If your 70% predictions hit 70%, this is 0

Resolution = (1/N) Σₖ nₖ(ōₖ - ō)²
  - Measures discrimination ability
  - Higher is better
  - Do your high predictions differ from low predictions?

Uncertainty = ō(1 - ō)
  - Base rate variance
  - Not controllable, just inherent difficulty
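A small worked example can make the three terms concrete. The sketch below assumes forecasts are already grouped by identical predicted probability, in which case the decomposition is exact. The `Group` alias and `decompose` function are hypothetical names used only for this illustration.

```rust
/// One group of identical forecasts: (forecast probability, sample count, hits).
/// Assumption for this sketch: every group has at least one sample.
type Group = (f64, usize, usize);

/// Returns (reliability, resolution, uncertainty) for grouped forecasts.
fn decompose(groups: &[Group]) -> (f64, f64, f64) {
    let n: f64 = groups.iter().map(|g| g.1 as f64).sum();
    let hits: f64 = groups.iter().map(|g| g.2 as f64).sum();
    let o_bar = hits / n; // overall base rate

    let mut reliability = 0.0;
    let mut resolution = 0.0;
    for &(p_k, n_k, hits_k) in groups {
        let n_k = n_k as f64;
        let o_bar_k = hits_k as f64 / n_k; // realized rate in this group
        reliability += n_k * (p_k - o_bar_k).powi(2);
        resolution += n_k * (o_bar_k - o_bar).powi(2);
    }
    (reliability / n, resolution / n, o_bar * (1.0 - o_bar))
}
```

With five 20% forecasts that hit once and five 80% forecasts that hit four times, `decompose(&[(0.2, 5, 1), (0.8, 5, 4)])` gives Reliability = 0, Resolution = 0.09 and Uncertainty = 0.25, so BS = 0 - 0.09 + 0.25 = 0.16, which matches the directly computed mean squared error.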
3. Information Ratio
code
IR = Resolution / Uncertainty
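Interpretation:
- •IR > 1.0: Model predictions carry useful information
- •IR ≈ 1.0: Model is roughly as good as predicting base rate
- •IR < 1.0: Model is adding noise (REMOVE IT)
Caveat: in this decomposition Resolution can never exceed Uncertainty (between-bin variance is at most total variance), so IR as defined here always lies in [0, 1] and the thresholds above cannot separate good models from bad ones. The ratio Resolution / Reliability does cross 1.0: it is above 1 exactly when BS < Uncertainty, i.e. when the model beats the base-rate forecast.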
Implementation
Brier Score Decomposition
rust
#[derive(Debug, Clone)]
struct BrierDecomposition {
    brier_score: f64,
    reliability: f64,
    resolution: f64,
    uncertainty: f64,
    information_ratio: f64,
}

fn compute_brier_decomposition(
    predictions: &[f64],
    outcomes: &[bool],
    num_bins: usize,
) -> BrierDecomposition {
    assert_eq!(predictions.len(), outcomes.len(), "predictions and outcomes must align");
    assert!(num_bins > 0, "num_bins must be positive");
    // Note: empty input yields NaN fields; check sample counts before calling.
    let n = predictions.len() as f64;

    // Base rate
    let o_bar: f64 = outcomes.iter()
        .map(|&o| if o { 1.0 } else { 0.0 })
        .sum::<f64>() / n;

    // Bin predictions into equal-width bins over [0, 1]
    let mut bins: Vec<Vec<(f64, bool)>> = vec![vec![]; num_bins];
    for (&p, &o) in predictions.iter().zip(outcomes.iter()) {
        // p = 1.0 is clamped into the last bin
        let bin_idx = ((p * num_bins as f64) as usize).min(num_bins - 1);
        bins[bin_idx].push((p, o));
    }

    let mut reliability = 0.0;
    let mut resolution = 0.0;
    for bin in &bins {
        if bin.is_empty() { continue; }
        let n_k = bin.len() as f64;
        let p_bar_k: f64 = bin.iter().map(|(p, _)| p).sum::<f64>() / n_k;
        let o_bar_k: f64 = bin.iter()
            .map(|(_, o)| if *o { 1.0 } else { 0.0 })
            .sum::<f64>() / n_k;
        reliability += n_k * (p_bar_k - o_bar_k).powi(2);
        resolution += n_k * (o_bar_k - o_bar).powi(2);
    }
    reliability /= n;
    resolution /= n;

    let uncertainty = o_bar * (1.0 - o_bar);
    // Exact only when all predictions in a bin are equal; with wider bins this
    // approximates the directly computed mean squared error.
    let brier_score = reliability - resolution + uncertainty;
    // Guard against division by zero when every outcome is identical
    let information_ratio = resolution / uncertainty.max(1e-10);

    BrierDecomposition {
        brier_score,
        reliability,
        resolution,
        uncertainty,
        information_ratio,
    }
}
Calibration Curve
rust
struct CalibrationPoint {
    bin_center: f64,                 // Mean predicted probability in bin
    realized_rate: f64,              // Actual success rate in bin
    sample_count: usize,             // Number of samples in bin
    confidence_interval: (f64, f64), // 95% CI on realized rate
}

fn build_calibration_curve(
    predictions: &[f64],
    outcomes: &[bool],
    num_bins: usize,
) -> Vec<CalibrationPoint> {
    assert!(num_bins > 0, "num_bins must be positive");

    // Sort by prediction (total_cmp does not panic on NaN)
    let mut pairs: Vec<(f64, bool)> = predictions.iter()
        .zip(outcomes.iter())
        .map(|(&p, &o)| (p, o))
        .collect();
    pairs.sort_by(|a, b| a.0.total_cmp(&b.0));

    // Divide into equal-sized bins; the last bin absorbs the remainder
    let bin_size = pairs.len() / num_bins;
    let mut curve = Vec::new();
    for i in 0..num_bins {
        let start = i * bin_size;
        let end = if i == num_bins - 1 { pairs.len() } else { (i + 1) * bin_size };
        let bin = &pairs[start..end];
        if bin.is_empty() { continue; }

        let mean_pred: f64 = bin.iter().map(|(p, _)| p).sum::<f64>() / bin.len() as f64;
        let realized: f64 = bin.iter()
            .map(|(_, o)| if *o { 1.0 } else { 0.0 })
            .sum::<f64>() / bin.len() as f64;

        // Wilson score interval for confidence
        let ci = wilson_score_interval(realized, bin.len(), 0.95);

        curve.push(CalibrationPoint {
            bin_center: mean_pred,
            realized_rate: realized,
            sample_count: bin.len(),
            confidence_interval: ci,
        });
    }
    curve
}
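fn wilson_score_interval(p: f64, n: usize, confidence: f64) -> (f64, f64) {
    // Only the 95% level is implemented: z is fixed, not derived from `confidence`.
    debug_assert!((confidence - 0.95).abs() < 1e-9, "only 95% intervals are implemented");
    let z = 1.96; // 95% confidence
    let n = n as f64; // callers must pass n > 0
    let denominator = 1.0 + z * z / n;
    let center = (p + z * z / (2.0 * n)) / denominator;
    let margin = z * (p * (1.0 - p) / n + z * z / (4.0 * n * n)).sqrt() / denominator;
    ((center - margin).max(0.0), (center + margin).min(1.0))
}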
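One way to read the curve is to flag bins whose mean prediction lies outside the confidence interval on the realized rate. The sketch below illustrates that idea under stated assumptions: `miscalibrated_bins` and its `min_samples` parameter are hypothetical, and the struct is repeated only so the block is self-contained.

```rust
/// Mirrors the CalibrationPoint struct above.
struct CalibrationPoint {
    bin_center: f64,
    realized_rate: f64,
    sample_count: usize,
    confidence_interval: (f64, f64),
}

/// Hypothetical helper: returns (mean prediction, realized - predicted) for every
/// bin with at least `min_samples` samples whose mean prediction falls outside
/// the interval on the realized rate.
fn miscalibrated_bins(curve: &[CalibrationPoint], min_samples: usize) -> Vec<(f64, f64)> {
    curve
        .iter()
        .filter(|pt| pt.sample_count >= min_samples)
        .filter(|pt| {
            let (lo, hi) = pt.confidence_interval;
            pt.bin_center < lo || pt.bin_center > hi
        })
        .map(|pt| (pt.bin_center, pt.realized_rate - pt.bin_center))
        .collect()
}
```

A negative gap means the model is overconfident in that bin (it predicts more often than the event happens); a positive gap means it is underconfident.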
Conditional Calibration
Slice calibration by conditioning variables:
rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy)]
enum ConditioningVariable {
    VolatilityQuartile,
    FundingRegime,
    TimeOfDay,
    InventoryState,
    RecentFillRate,
    BookImbalance,
    Regime,
}

struct ConditionalCalibration {
    variable: ConditioningVariable,
    slices: HashMap<String, BrierDecomposition>,
}

// PredictionRecord is assumed to be the logged record type from measurement-infrastructure.
fn compute_conditional_calibration(
    records: &[PredictionRecord],
    prediction_extractor: impl Fn(&PredictionRecord) -> f64,
    outcome_extractor: impl Fn(&PredictionRecord) -> bool,
    condition: ConditioningVariable,
) -> ConditionalCalibration {
    // Group records by condition
    let mut groups: HashMap<String, Vec<(f64, bool)>> = HashMap::new();
    for record in records {
        let slice_name = get_slice_name(record, &condition);
        let pred = prediction_extractor(record);
        let outcome = outcome_extractor(record);
        groups.entry(slice_name)
            .or_default()
            .push((pred, outcome));
    }

    // Compute Brier decomposition for each slice
    let mut slices = HashMap::new();
    for (name, pairs) in groups {
        if pairs.len() < 100 { continue; } // Need minimum samples
        let preds: Vec<f64> = pairs.iter().map(|(p, _)| *p).collect();
        let outcomes: Vec<bool> = pairs.iter().map(|(_, o)| *o).collect();
        slices.insert(name, compute_brier_decomposition(&preds, &outcomes, 20));
    }

    ConditionalCalibration {
        variable: condition,
        slices,
    }
}
fn get_slice_name(record: &PredictionRecord, condition: &ConditioningVariable) -> String {
    let name = match condition {
        ConditioningVariable::VolatilityQuartile => {
            let sigma = record.market_state.sigma_bipower;
            if sigma < 0.00005 { "Q1_low" }
            else if sigma < 0.00010 { "Q2_medium" }
            else if sigma < 0.00020 { "Q3_high" }
            else { "Q4_extreme" }
        }
        ConditioningVariable::FundingRegime => {
            let funding = record.market_state.funding_rate;
            if funding > 0.0005 { "high_positive" }
            else if funding > 0.0 { "low_positive" }
            else if funding > -0.0005 { "low_negative" }
            else { "high_negative" }
        }
        ConditioningVariable::Regime => {
            let state = &record.market_state;
            if state.regime_cascade_prob > 0.5 { "cascade" }
            else if state.regime_volatile_prob > 0.5 { "volatile" }
            else if state.regime_trending_prob > 0.5 { "trending" }
            else { "quiet" }
        }
        // ... other conditions
        _ => "unknown",
    };
    name.to_string()
}
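The volatility thresholds above are fixed constants, so the four slices are true quartiles only if those constants match the observed distribution. As a rough sketch of one alternative, the cut points could be derived from the logged sigma values themselves. Both `quartile_cuts` and `volatility_slice` are hypothetical helpers, and the nearest-rank cut rule is an assumption chosen for simplicity.

```rust
/// Hypothetical helper: the three cut points that split `values` into quartiles.
/// Returns None if there are fewer than four finite values.
fn quartile_cuts(values: &[f64]) -> Option<[f64; 3]> {
    let mut sorted: Vec<f64> = values.iter().copied().filter(|v| v.is_finite()).collect();
    if sorted.len() < 4 {
        return None;
    }
    sorted.sort_by(|a, b| a.total_cmp(b));
    let n = sorted.len();
    // Nearest-rank style cut points at 25%, 50% and 75% of the sorted sample.
    Some([sorted[n / 4], sorted[n / 2], sorted[3 * n / 4]])
}

/// Labels a value using the same slice names as get_slice_name.
fn volatility_slice(sigma: f64, cuts: &[f64; 3]) -> &'static str {
    if sigma < cuts[0] {
        "Q1_low"
    } else if sigma < cuts[1] {
        "Q2_medium"
    } else if sigma < cuts[2] {
        "Q3_high"
    } else {
        "Q4_extreme"
    }
}
```

Cut points computed this way should come from a trailing window that excludes the day being analyzed, otherwise the slice boundaries move with the data they are meant to condition on.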
PnL Attribution
Decompose daily PnL to identify problem areas:
rust
use std::collections::HashMap;
use chrono::NaiveDate; // assumed: NaiveDate comes from the chrono crate

struct PnLAttribution {
    date: NaiveDate,
    gross_pnl: f64,
    // Decomposition
    spread_capture: f64,    // Revenue from bid-ask spread
    adverse_selection: f64, // Loss from fills before adverse moves (<= 0)
    inventory_cost: f64,    // Cost of holding inventory (mark-to-market)
    fees: f64,              // Exchange fees
    // By regime
    pnl_by_regime: HashMap<Regime, f64>,
    time_by_regime: HashMap<Regime, f64>, // Fraction of time
}

fn compute_pnl_attribution(records: &[PredictionRecord]) -> PnLAttribution {
    assert!(!records.is_empty(), "need at least one record to attribute PnL");
    let mut spread_capture = 0.0;
    let mut adverse_selection = 0.0;
    let mut fees = 0.0;
    let mut pnl_by_regime: HashMap<Regime, f64> = HashMap::new();
    let mut time_by_regime: HashMap<Regime, f64> = HashMap::new();

    for record in records {
        let Some(outcomes) = &record.outcomes else { continue };
        let regime = record.market_state.dominant_regime();

        for fill in &outcomes.fills {
            let level = &record.predictions.levels[fill.level_index];

            // Spread capture: distance from mid at fill time
            let mid_at_fill = fill.mark_price_at_fill;
            let spread_earned = match level.side {
                Side::Bid => mid_at_fill - fill.fill_price,
                Side::Ask => fill.fill_price - mid_at_fill,
            };
            spread_capture += spread_earned * fill.fill_size;

            // Adverse selection: price move against us after fill.
            // Signed so that a move against the new position is negative.
            let price_move_1s = fill.mark_price_1s_later - fill.mark_price_at_fill;
            let adverse = match level.side {
                Side::Bid => price_move_1s,  // Bought: negative if price dropped
                Side::Ask => -price_move_1s, // Sold: negative if price rose
            };
            // Keep only the losing side; favorable moves are not credited here
            adverse_selection += adverse.min(0.0) * fill.fill_size;

            // Fees
            fees += fill.fill_price * fill.fill_size * 0.00015; // 1.5 bps

            // By regime
            *pnl_by_regime.entry(regime).or_default() +=
                spread_earned * fill.fill_size + adverse.min(0.0) * fill.fill_size;
        }
        // Each record counts as one unit of time (assumes evenly spaced records)
        *time_by_regime.entry(regime).or_default() += 1.0;
    }

    // Normalize time fractions
    let total_time: f64 = time_by_regime.values().sum();
    for v in time_by_regime.values_mut() {
        *v /= total_time;
    }

    // Inventory cost computed separately from position changes
    let inventory_cost = compute_inventory_mtm(records);

    PnLAttribution {
        date: records[0].timestamp.date(),
        gross_pnl: spread_capture + adverse_selection - fees - inventory_cost,
        spread_capture,
        adverse_selection,
        inventory_cost,
        fees,
        pnl_by_regime,
        time_by_regime,
    }
}
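The sign conventions are easiest to check on a single fill. The sketch below isolates the per-fill arithmetic: spread capture is measured against the mid at fill time, and adverse selection is the losing part of the mark-price move over the next second, so it is never positive. The `attribute_fill` function and its flat argument list are hypothetical simplifications of the record types used above.

```rust
#[derive(Clone, Copy)]
enum Side {
    Bid,
    Ask,
}

/// Returns (spread_capture, adverse_selection) for one fill.
/// adverse_selection is <= 0: only moves against the new position count.
fn attribute_fill(
    side: Side,
    fill_price: f64,
    mid_at_fill: f64,
    mid_1s_later: f64,
    size: f64,
) -> (f64, f64) {
    let spread_earned = match side {
        Side::Bid => mid_at_fill - fill_price,
        Side::Ask => fill_price - mid_at_fill,
    };
    let price_move = mid_1s_later - mid_at_fill;
    // Signed PnL of the position opened by the fill over the next second.
    let post_fill_pnl = match side {
        Side::Bid => price_move,  // long after buying: loses when price drops
        Side::Ask => -price_move, // short after selling: loses when price rises
    };
    (spread_earned * size, post_fill_pnl.min(0.0) * size)
}
```

For a bid filled at 99.90 with the mid at 100.00 and size 2, a mid of 99.70 one second later gives a spread capture of about +0.20 and an adverse selection of about -0.60, so the fill loses money despite earning the half-spread. If the mid had instead risen to 100.30, adverse selection would be 0.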
Daily Report Template
code
=== Calibration Report: {date} ===

PnL Attribution
───────────────────────────────────────────
Gross PnL:               ${gross_pnl:>10.2}
├── Spread Capture:      ${spread_capture:>10.2}  {spread_status}
├── Adverse Selection:   ${adverse_selection:>10.2}  {as_status}
├── Inventory Cost:      ${inventory_cost:>10.2}
└── Fees:                ${fees:>10.2}

Model Calibration
───────────────────────────────────────────
                         Brier           IR            Status
Fill Prediction (1s):    {fp_brier:.3}   {fp_ir:.2}    {fp_status}
Fill Prediction (10s):   {fp10_brier:.3} {fp10_ir:.2}  {fp10_status}
Adverse Selection:       {as_brier:.3}   {as_ir:.2}    {as_status_model}
Volatility (RMSE):       {vol_rmse:.6}                 {vol_status}

Regime Distribution
───────────────────────────────────────────
            Time               PnL                 PnL/Hour
Quiet:      {quiet_time:>4.0%} ${quiet_pnl:>8.2}   ${quiet_rate:>6.2}/hr
Trending:   {trend_time:>4.0%} ${trend_pnl:>8.2}   ${trend_rate:>6.2}/hr
Volatile:   {vol_time:>4.0%}   ${vol_pnl:>8.2}     ${vol_rate:>6.2}/hr
Cascade:    {casc_time:>4.0%}  ${casc_pnl:>8.2}    ${casc_rate:>6.2}/hr

Conditional Calibration Issues
───────────────────────────────────────────
{conditional_issues}

Actionable Items
───────────────────────────────────────────
{action_items}
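The placeholders in the template map closely onto Rust's `format!` width and precision syntax, with one exception: Rust has no `%` format type, so a spec like `{quiet_time:>4.0%}` has to be rendered by scaling the fraction manually. The sketch below shows two rows rendered that way. The helper names `pnl_line` and `regime_line`, the column widths, and the `hours` argument are assumptions for illustration.

```rust
/// Renders one PnL line: left-aligned label, right-aligned dollar amount.
fn pnl_line(label: &str, amount: f64) -> String {
    format!("{:<22}${:>10.2}", label, amount)
}

/// Renders one regime row. Rust has no `%` format type, so the time
/// fraction is scaled to a percentage by hand.
fn regime_line(name: &str, time_fraction: f64, pnl: f64, hours: f64) -> String {
    // Avoid dividing by zero when no time was spent in the regime.
    let rate = if hours > 0.0 { pnl / hours } else { 0.0 };
    format!(
        "{:<10}{:>4.0}% ${:>8.2} ${:>6.2}/hr",
        name,
        time_fraction * 100.0,
        pnl,
        rate
    )
}
```

Fixed widths keep the columns aligned as long as amounts fit in the allotted width; larger values are printed in full and push the rest of the line to the right.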
Report Generation
rust
fn generate_daily_report(date: NaiveDate) -> String {
    let records = load_predictions_for_date(date);

    // PnL attribution
    let pnl = compute_pnl_attribution(&records);

    // Model calibration
    let fill_1s = compute_brier_decomposition(
        &extract_fill_predictions_1s(&records),
        &extract_fill_outcomes_1s(&records),
        20
    );
    let fill_10s = compute_brier_decomposition(
        &extract_fill_predictions_10s(&records),
        &extract_fill_outcomes_10s(&records),
        20
    );
    let adverse = compute_brier_decomposition(
        &extract_adverse_predictions(&records),
        &extract_adverse_outcomes(&records),
        20
    );

    // Conditional calibration
    let conditions = [
        ConditioningVariable::Regime,
        ConditioningVariable::VolatilityQuartile,
        ConditioningVariable::FundingRegime,
    ];
    let mut issues = Vec::new();
    for cond in conditions {
        let cal = compute_conditional_calibration(&records, /* ... */, cond);
        for (slice, decomp) in &cal.slices {
            if decomp.information_ratio < 1.0 {
                // `cond` was moved into the call above, so read it back from `cal`.
                // `{:?}` assumes ConditioningVariable derives Debug.
                issues.push(format!(
                    "- {:?} in {}: IR={:.2} (model adding noise)",
                    cal.variable, slice, decomp.information_ratio
                ));
            }
        }
    }

    // Action items
    let actions = generate_action_items(&pnl, &fill_1s, &adverse, &issues);

    format_report(/* ... */)
}
fn generate_action_items(
    pnl: &PnLAttribution,
    fill_cal: &BrierDecomposition,
    adverse_cal: &BrierDecomposition,
    issues: &[String],
) -> Vec<String> {
    let mut actions = Vec::new();

    // Adverse selection is biggest PnL drag
    if pnl.adverse_selection < -100.0
        && pnl.adverse_selection.abs() > pnl.spread_capture * 0.5
    {
        actions.push(
            "HIGH: Adverse selection > 50% of spread capture. Review adverse-selection-classifier."
                .to_string(),
        );
    }

    // Model IR below 1.0
    if fill_cal.information_ratio < 1.0 {
        actions.push(format!(
            "HIGH: Fill prediction IR={:.2}. Model adding noise. Consider removing or retraining.",
            fill_cal.information_ratio
        ));
    }
    if adverse_cal.information_ratio < 1.0 {
        actions.push(format!(
            "HIGH: Adverse selection IR={:.2}. Model adding noise. Review classifier.",
            adverse_cal.information_ratio
        ));
    }

    // Regime-specific issues
    if let Some(&cascade_pnl) = pnl.pnl_by_regime.get(&Regime::Cascade) {
        if cascade_pnl < -50.0 {
            actions.push(format!(
                "MEDIUM: Cascade regime PnL=${:.2}. Review liquidation detector thresholds.",
                cascade_pnl
            ));
        }
    }

    // Conditional calibration issues
    if !issues.is_empty() {
        actions.push(format!(
            "MEDIUM: {} conditional calibration issues. Review conditional slices.",
            issues.len()
        ));
    }

    actions
}
Alert Thresholds
Set up automated alerts:
rust
struct AlertThresholds {
    // Model health
    min_information_ratio: f64,       // 1.0 - below this, model is useless
    max_brier_score: f64,             // 0.25 - above this, worse than random
    // PnL
    max_daily_loss: f64,              // Dollar amount
    max_adverse_selection_ratio: f64, // AS / spread_capture
    // Regime
    max_cascade_loss: f64,            // Dollar amount in cascade regime
}

impl Default for AlertThresholds {
    fn default() -> Self {
        AlertThresholds {
            min_information_ratio: 1.0,
            max_brier_score: 0.25,
            max_daily_loss: 500.0,
            max_adverse_selection_ratio: 0.7,
            max_cascade_loss: 100.0,
        }
    }
}
fn check_alerts(report: &DailyReport, thresholds: &AlertThresholds) -> Vec<Alert> {
    let mut alerts = Vec::new();

    if report.fill_calibration.information_ratio < thresholds.min_information_ratio {
        alerts.push(Alert::Critical(format!(
            "Fill prediction IR={:.2} below threshold {}",
            report.fill_calibration.information_ratio,
            thresholds.min_information_ratio
        )));
    }

    if report.pnl.gross_pnl < -thresholds.max_daily_loss {
        // Report the loss as a positive dollar amount
        alerts.push(Alert::Critical(format!(
            "Daily loss ${:.2} exceeds threshold ${:.2}",
            -report.pnl.gross_pnl,
            thresholds.max_daily_loss
        )));
    }

    // ... more checks

    alerts
}
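The thresholds struct also carries `max_brier_score` and `max_adverse_selection_ratio`, which the code shown here does not use. The following sketch shows what checks against those two fields might look like. It is an illustration, not the existing implementation: the `Warning` variant, both function names, and the message wording are assumptions, and the functions take plain numbers rather than the `DailyReport` type.

```rust
#[derive(Debug, PartialEq)]
enum Alert {
    Critical(String),
    Warning(String), // hypothetical variant, not shown in the code above
}

/// Hypothetical check for max_adverse_selection_ratio.
/// `adverse_selection` is expected to be <= 0 (a loss).
fn check_adverse_selection_ratio(
    adverse_selection: f64,
    spread_capture: f64,
    max_ratio: f64,
) -> Option<Alert> {
    let loss = (-adverse_selection).max(0.0);
    if spread_capture <= 0.0 {
        // No spread revenue to compare against: any loss is a problem.
        return (loss > 0.0).then(|| {
            Alert::Critical("Adverse selection losses with no spread capture".to_string())
        });
    }
    let ratio = loss / spread_capture;
    (ratio > max_ratio).then(|| {
        Alert::Warning(format!(
            "Adverse selection is {:.0}% of spread capture (limit {:.0}%)",
            ratio * 100.0,
            max_ratio * 100.0
        ))
    })
}

/// Hypothetical check for max_brier_score.
fn check_brier(name: &str, brier_score: f64, max_brier: f64) -> Option<Alert> {
    (brier_score > max_brier).then(|| {
        Alert::Critical(format!(
            "{} Brier score {:.3} above {:.3}",
            name, brier_score, max_brier
        ))
    })
}
```

Returning `Option<Alert>` lets the caller collect results with `alerts.extend(...)`, since an `Option` iterates over zero or one item.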
Common Patterns
"Model is well-calibrated overall but fails in regime X"
This is the most common pattern. Solution:
- •Identify which regime has IR < 1.0
- •Either: train regime-specific model, or
- •Fall back to wider spreads / simpler model in that regime
"Calibration looks good but still losing money"
Possible causes:
- •Good calibration on the wrong metric (e.g., calibrated fill prediction but adverse selection is the real problem)
- •Execution slippage not captured in calibration
- •Latency effects (predictions stale by the time orders placed)
"IR > 1 but Brier score is high"
Model has good discrimination (can tell high from low) but poor calibration (predictions don't match frequencies). Fix with isotonic regression or Platt scaling.
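As a rough illustration of the isotonic regression option, the sketch below fits a non-decreasing step function from raw prediction to realized rate using the pool-adjacent-violators algorithm. The names `fit_isotonic` and `calibrate` are hypothetical, and the step-function lookup (no interpolation between blocks) is a simplification.

```rust
/// Pool-adjacent-violators fit of a non-decreasing map from raw prediction to
/// realized rate. Returns blocks of (largest raw prediction in the block,
/// calibrated probability), sorted by raw prediction.
fn fit_isotonic(predictions: &[f64], outcomes: &[bool]) -> Vec<(f64, f64)> {
    let mut pairs: Vec<(f64, f64)> = predictions
        .iter()
        .zip(outcomes)
        .map(|(&p, &o)| (p, if o { 1.0 } else { 0.0 }))
        .collect();
    pairs.sort_by(|a, b| a.0.total_cmp(&b.0));

    // Each block: (upper prediction bound, sum of outcomes, sample count).
    let mut blocks: Vec<(f64, f64, f64)> = Vec::new();
    for (p, o) in pairs {
        blocks.push((p, o, 1.0));
        // Merge while the last two blocks share a raw prediction or their
        // means are out of order.
        while blocks.len() >= 2 {
            let last = blocks[blocks.len() - 1];
            let prev = blocks[blocks.len() - 2];
            let tie = prev.0 == last.0;
            if !tie && prev.1 / prev.2 <= last.1 / last.2 {
                break;
            }
            blocks.pop();
            let merged = blocks.last_mut().unwrap();
            *merged = (last.0, prev.1 + last.1, prev.2 + last.2);
        }
    }
    blocks.into_iter().map(|(upper, sum, n)| (upper, sum / n)).collect()
}

/// Maps a raw prediction to the calibrated value of the first block covering it.
/// Predictions above every block use the last block.
fn calibrate(model: &[(f64, f64)], raw: f64) -> f64 {
    model
        .iter()
        .find(|(upper, _)| raw <= *upper)
        .or(model.last())
        .map(|&(_, q)| q)
        .unwrap_or(raw) // empty model: leave the prediction unchanged
}
```

The map should be fitted on a window that does not overlap the data used to evaluate it; fitting and scoring on the same records drives Reliability toward zero by construction and hides the problem it is meant to fix.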
Dependencies
- •Requires: measurement-infrastructure (prediction logs with outcomes)
- •Enables: All model improvement work, daily-calibration-report
Next Steps
After analyzing calibration:
- •Identify weakest component (lowest IR or biggest PnL drag)
- •Read that component's skill file
- •Use signal-audit to identify better features
- •Implement improvement
- •Re-run calibration to validate