Brier Score

Category: Decision-Making & Strategic Thinking
Source: Glenn W. Brier (1950) / Meteorology / Forecasting Science
Practitioner Score: 42/50 (Tier 1)

Overview

The Brier Score is a strictly proper scoring rule that measures the accuracy of probabilistic predictions. It calculates the mean squared error between predicted probabilities and actual outcomes, providing a single numeric measure of forecast quality. Lower scores indicate better accuracy: for the binary form used here, 0 is perfect and 1 is the worst possible (the multi-category form described later ranges from 0 to 2).

Core Insight: You can't improve what you don't measure. The Brier Score converts vague notions of "good forecasting" into quantifiable performance, enabling systematic improvement through calibration feedback.

Formula: BS = (1/N) Σ(fᵢ - oᵢ)²

•fᵢ = forecasted probability (0-1)
•oᵢ = actual outcome (1 if event occurred, 0 if not)
•N = number of forecasts
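
As a minimal sketch of the formula above, the Python function below averages the squared gaps between forecasts and outcomes. The function name `brier_score` and the sample numbers are illustrative assumptions, not part of the original definition.

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must have the same length")
    if not forecasts:
        raise ValueError("need at least one forecast")
    for f, o in zip(forecasts, outcomes):
        if not 0.0 <= f <= 1.0:
            raise ValueError(f"forecast {f} is outside [0, 1]")
        if o not in (0, 1):
            raise ValueError(f"outcome {o} must be 0 or 1")
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical data: three forecasts and what actually happened.
score = brier_score([0.7, 0.2, 0.9], [1, 0, 1])
print(round(score, 4))  # 0.0467
```

The input checks are a design choice: a forecast entered as 70 instead of 0.7 would otherwise inflate the score without any error being raised.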

When to Use

•Calibration assessment - Are your 70% predictions actually happening 70% of the time?
•Forecaster comparison - Which analyst/model produces more accurate predictions?
•Training feedback - Providing objective performance scores for improvement
•Decision validation - Evaluating quality of past predictions over time
•Model selection - Comparing machine learning models or forecasting systems

Anti-patterns:

•Single predictions (a rough minimum of 20+ is needed before the average means much)
•Non-probabilistic forecasts (bare yes/no calls with no stated confidence)
•Events without clear resolution criteria
•Immediate feedback needs (the score cannot be computed until outcomes resolve)

How to Execute

Step 1: Record Probabilistic Forecast

Action: Document your prediction as a probability between 0% and 100%

•Precision: Use granular probabilities (65%, not "likely")
•Timestamp: Record when prediction was made
•Resolution criteria: Define exactly what constitutes "event occurred"
•Output: Logged forecast with probability, date, and outcome definition

Step 2: Wait for Event Resolution

Action: Allow sufficient time for outcome to be determined

•Clear endpoint: Specify resolution date/trigger in advance
•Unambiguous outcome: 1 (occurred) or 0 (did not occur)
•No gaming: Outcome determination must be independent of forecaster
•Output: Resolved outcome (1 or 0)

Step 3: Calculate Individual Forecast Error

Action: Compute squared difference between forecast and outcome

•If event occurred (oᵢ = 1): Error = (1 - fᵢ)²
•If event did not occur (oᵢ = 0): Error = (0 - fᵢ)² = fᵢ²
•Example: Predicted 70% (0.7), event happened → (1 - 0.7)² = 0.09
•Output: Single forecast Brier score
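
Steps 1-3 can be sketched as a small record type that is logged at forecast time, resolved later, and then scored. The class name `Forecast`, its field names, and the sample question are hypothetical choices for illustration only.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Forecast:
    question: str
    probability: float            # Step 1: stated as 0.0-1.0, not "likely"
    made_on: date                 # Step 1: timestamp of the prediction
    resolution_criteria: str      # Step 1: what counts as "event occurred"
    resolves_by: date             # Step 2: endpoint fixed in advance
    outcome: Optional[int] = None # Step 2: 1 or 0 once resolved

    def __post_init__(self):
        if not 0.0 <= self.probability <= 1.0:
            raise ValueError("probability must be between 0 and 1")
        if not self.resolution_criteria.strip():
            raise ValueError("resolution criteria must be written down up front")

    def resolve(self, occurred: bool) -> None:
        self.outcome = 1 if occurred else 0

    def squared_error(self) -> float:
        # Step 3: (f - o)^2, defined only after resolution.
        if self.outcome is None:
            raise ValueError("forecast has not been resolved yet")
        return (self.probability - self.outcome) ** 2


f = Forecast(
    question="Hypothetical: will the release ship by the deadline?",
    probability=0.7,
    made_on=date(2024, 1, 5),
    resolution_criteria="Release tag published on or before the deadline",
    resolves_by=date(2024, 3, 1),
)
f.resolve(occurred=True)
print(round(f.squared_error(), 2))  # 0.09
```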

Step 4: Aggregate Across Multiple Forecasts

Action: Average squared errors across N predictions

•Minimum sample: 20+ forecasts for meaningful assessment
•Formula: BS = (Error₁ + Error₂ + ... + Errorₙ) / N
•Output: Overall Brier score for forecast set

Step 5: Interpret Score Against Benchmarks

Action: Compare your score to reference points

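•Perfect accuracy: 0.00 (not achievable in practice on genuinely uncertain events)
•Excellent: < 0.10 (superforecaster level)
•Good: 0.10 - 0.20 (well-calibrated forecaster)
•Average: 0.20 - 0.30 (typical expert; always forecasting 50% scores exactly 0.25)
•Poor: > 0.30 (worse than always forecasting 50%)
•Caveat: These bands are rules of thumb; achievable scores depend on how difficult and how lopsided the questions are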
•Output: Performance classification
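
To make the reference points concrete, the sketch below scores two naive baselines on the same outcomes: always forecasting 50%, and always forecasting the observed base rate. It also computes a Brier skill score (1 minus the ratio of your score to a reference score), a standard companion measure that the benchmarks above do not mention. All function names and data are illustrative assumptions.

```python
def brier_score(forecasts, outcomes):
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(outcomes)


def constant_forecast_score(outcomes, p):
    """Score earned by giving the same probability p to every question."""
    return brier_score([p] * len(outcomes), outcomes)


def brier_skill_score(score, reference_score):
    """1 = perfect, 0 = no better than the reference, negative = worse."""
    return 1.0 - score / reference_score


# Hypothetical outcomes with a 30% base rate, plus one forecaster's calls.
outcomes = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
my_forecasts = [0.8, 0.2, 0.1, 0.6, 0.3, 0.2, 0.1, 0.7, 0.4, 0.2]

mine = brier_score(my_forecasts, outcomes)
coin_flip = constant_forecast_score(outcomes, 0.5)
base_rate = sum(outcomes) / len(outcomes)
climatology = constant_forecast_score(outcomes, base_rate)

print(round(mine, 3))                                   # 0.068
print(round(coin_flip, 3))                              # 0.25
print(round(climatology, 3))                            # 0.21
print(round(brier_skill_score(mine, climatology), 2))   # 0.68
```

The base-rate baseline is usually the harder one to beat, which is why a raw score is easier to interpret next to it than next to fixed bands.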

Step 6: Decompose into Calibration vs. Resolution

Action: Break Brier score into skill components

•Calibration: Are X% forecasts correct X% of the time?
•Resolution: Can you distinguish different probability levels?
•Formula: BS = Reliability - Resolution + Uncertainty
•Output: Diagnostic breakdown identifying improvement areas
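
The decomposition can be sketched as follows, assuming forecasts are grouped by their exact stated probability (with that grouping the identity BS = Reliability - Resolution + Uncertainty holds exactly; grouping into wider bins makes it approximate). The function name and data are illustrative.

```python
from collections import defaultdict


def murphy_decomposition(forecasts, outcomes):
    """Return (reliability, resolution, uncertainty) for binary outcomes."""
    n = len(forecasts)
    base_rate = sum(outcomes) / n

    # Group outcomes by the exact probability that was forecast.
    groups = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        groups[f].append(o)

    reliability = 0.0  # gap between stated probability and observed frequency
    resolution = 0.0   # how far each group's frequency sits from the base rate
    for f, obs in groups.items():
        freq = sum(obs) / len(obs)
        reliability += len(obs) * (f - freq) ** 2
        resolution += len(obs) * (freq - base_rate) ** 2

    uncertainty = base_rate * (1 - base_rate)
    return reliability / n, resolution / n, uncertainty


# Hypothetical forecast set: ten forecasts at three probability levels.
forecasts = [0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1, 0.5, 0.5]
outcomes = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]

rel, res, unc = murphy_decomposition(forecasts, outcomes)
print(round(rel, 3), round(res, 3), round(unc, 3))  # 0.018 0.05 0.25
print(round(rel - res + unc, 3))                    # 0.218
```

Lower reliability is better (it is a calibration penalty), while higher resolution is better; uncertainty depends only on the events, not on the forecaster.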

Step 7: Implement Calibration Improvements

Action: Use insights to adjust forecasting behavior

•Overconfident (extreme forecasts that miss too often): Pull predictions toward 50%
•Underconfident (clustered near 50%): Increase differentiation
•Systematic bias: Adjust all forecasts by a consistent offset
•Output: Updated forecasting protocol
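
One simple way to express these three adjustments is as linear transforms around 50%, clamped to the valid range. This is only a sketch under that assumption; the helper names and the parameter values are hypothetical, and in practice the strength of each correction would be chosen from your own calibration data.

```python
def clamp(p):
    """Keep an adjusted probability inside [0, 1]."""
    return min(1.0, max(0.0, p))


def pull_toward_half(p, strength):
    """Overconfidence fix: strength 0 leaves p unchanged, 1 collapses it to 0.5."""
    return clamp(0.5 + (1.0 - strength) * (p - 0.5))


def push_from_half(p, factor):
    """Underconfidence fix: factor > 1 widens the distance from 0.5."""
    return clamp(0.5 + factor * (p - 0.5))


def apply_offset(p, offset):
    """Systematic-bias fix: shift every forecast by the same amount."""
    return clamp(p + offset)


print(round(pull_toward_half(0.95, 0.2), 2))  # 0.86
print(round(push_from_half(0.55, 2.0), 2))    # 0.6
print(round(apply_offset(0.40, -0.05), 2))    # 0.35
```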

Real-World Examples

Weather Forecasting:

•Meteorologists have been tracked with Brier scores for 50+ years
•Verification feedback contributed to dramatic improvements in precipitation forecasting
•Result: Today's 5-day forecast is about as accurate as a 1-day forecast was in the 1970s

Good Judgment Project:

•Superforecasters averaged 0.15-0.18 Brier scores
•Regular forecasters averaged 0.25-0.30
•Intelligence analysts (with classified info) averaged 0.30+
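•Scale note: The project scored forecasts with Brier's original formulation, which runs from 0 to 2, so confirm which scale a quoted figure uses before comparing it with the 0-1 benchmarks in Step 5
•Result: Evidence that systematic methodology can outperform credentialed expertise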

Betting and Prediction Markets:

•Brier scores can be used to evaluate how accurately bookmaker odds matched outcomes
•Prices on prediction markets (PredictIt, Polymarket) can be read as probabilities and scored the same way
•Result: Liquid, efficient markets tend to reflect well-calibrated probabilities

Integration Points

Complements:

•Superforecasting: Brier score measures how well the "Ten Commandments for Aspiring Superforecasters" work in practice
•Calibration: Primary diagnostic tool for improving calibration
•Bayesian Updating: Tracks whether belief updates improve accuracy
•Prediction Markets: Aggregation mechanism whose prices can be scored with the Brier score (traders are rewarded by profit, not by the score itself)

Enables:

•Performance tracking: Quantitative measure for deliberate practice
•A/B testing: Compare forecasting methodologies empirically
•Incentive design: Reward accurate probabilistic predictions

Common Pitfalls

Pitfall 1: Conflating Low Score with Skill

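•Warning sign: A score near 0.05 earned by always forecasting 5% on events that occur about 5% of the time (forecasts clustered at 51%/49% would instead score about 0.25)
•Fix: Check resolution - can you distinguish probability levels? Compare against a base-rate baseline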

Pitfall 2: Ignoring Calibration Components

•Warning sign: Good overall score masking systematic bias
•Fix: Decompose into reliability, resolution, and uncertainty

Pitfall 3: Small Sample Sizes

•Warning sign: Declaring "good forecaster" from 5 predictions
•Fix: Require minimum 20-50 forecasts before drawing conclusions
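
To see why a handful of forecasts says little, one option is a percentile bootstrap over the per-forecast squared errors, which gives a rough interval around the average. This is a sketch of one common resampling approach, not a method prescribed here; the function name, defaults, and data are assumptions.

```python
import random


def bootstrap_brier_interval(forecasts, outcomes, n_resamples=2000,
                             alpha=0.05, seed=0):
    """Return (score, lower, upper) using a percentile bootstrap."""
    rng = random.Random(seed)  # fixed seed so reruns give the same interval
    errors = [(f - o) ** 2 for f, o in zip(forecasts, outcomes)]
    n = len(errors)

    # Resample the squared errors with replacement and average each resample.
    means = sorted(
        sum(rng.choices(errors, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(errors) / n, lower, upper


# Hypothetical record of only five forecasts.
score, lower, upper = bootstrap_brier_interval(
    [0.7, 0.2, 0.9, 0.2, 0.6], [1, 0, 1, 1, 1]
)
print(f"score={score:.3f}, interval=({lower:.3f}, {upper:.3f})")
```

With so few forecasts the interval spans much of the range between "excellent" and "poor", which is the practical meaning of the warning sign above.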

Pitfall 4: Cherry-Picking

•Warning sign: Only tracking predictions you feel confident about
•Fix: Commit to scoring ALL forecasts in domain upfront

Pitfall 5: Resolution Ambiguity

•Warning sign: Disputes about whether event "really" occurred
•Fix: Define resolution criteria precisely when making forecast

Multi-Category Extension

For events with more than 2 outcomes (e.g., election with 3+ candidates):

Formula: BS = (1/N) Σᵢ Σⱼ (fᵢⱼ - oᵢⱼ)², where i indexes forecasts and j indexes outcome categories

Example: Three-way race forecast of A=60%, B=10%, C=30%

•If A wins: (1-0.6)² + (0-0.1)² + (0-0.3)² = 0.16 + 0.01 + 0.09 = 0.26
•If B wins: (0-0.6)² + (1-0.1)² + (0-0.3)² = 0.36 + 0.81 + 0.09 = 1.26
•If C wins: (0-0.6)² + (0-0.1)² + (1-0.3)² = 0.36 + 0.01 + 0.49 = 0.86

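Note: Multi-category scores range from 0 (perfect) to 2 (worst possible). Applied to a two-outcome event, this form gives exactly twice the binary score from the main formula, so state which form you are using when comparing scores.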
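
The three-way race example can be reproduced with a short sketch. The function name and the dictionary-based input format are illustrative assumptions.

```python
def multicategory_brier(forecasts, winners):
    """forecasts: list of {category: probability}; winners: list of categories."""
    total = 0.0
    for probs, winner in zip(forecasts, winners):
        if abs(sum(probs.values()) - 1.0) > 1e-9:
            raise ValueError("probabilities must sum to 1")
        if winner not in probs:
            raise ValueError(f"unknown category: {winner}")
        # Outcome is 1 for the winning category and 0 for every other one.
        total += sum(
            (p - (1.0 if category == winner else 0.0)) ** 2
            for category, p in probs.items()
        )
    return total / len(forecasts)


race = {"A": 0.6, "B": 0.1, "C": 0.3}
for winner in ("A", "B", "C"):
    print(winner, round(multicategory_brier([race], [winner]), 2))
# A 0.26
# B 1.26
# C 0.86
```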

Validation Checklist

• All forecasts are probabilistic (0-100%), not binary
• Resolution criteria defined before outcome known
• Outcomes recorded honestly (no retroactive "adjustments")
• Sample size sufficient (20+ forecasts minimum)
• Score decomposed into calibration and resolution components
• Comparison made to baseline (random guessing, other forecasters)
• Trend tracked over time to measure improvement

Practical Tips

Tracking System:

•Spreadsheet columns: Date, Question, Forecast %, Outcome (0/1), Error²
•Running average in final column
•Monthly review of calibration plots
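
The spreadsheet layout above translates directly into a small script. The sketch below assumes a CSV export with hypothetical column names (`date`, `question`, `forecast_pct`, `outcome`) and made-up rows, and adds the squared error and running average to each row.

```python
import csv
import io

# Hypothetical log; in practice this would be read from an exported file.
LOG = """date,question,forecast_pct,outcome
2024-01-05,Hypothetical question A,70,1
2024-01-12,Hypothetical question B,20,0
2024-02-02,Hypothetical question C,90,0
"""


def score_log(csv_text):
    """Return the rows with squared error and running Brier score added."""
    rows = []
    total = 0.0
    reader = csv.DictReader(io.StringIO(csv_text))
    for count, row in enumerate(reader, start=1):
        forecast = float(row["forecast_pct"]) / 100.0  # percent -> probability
        outcome = int(row["outcome"])
        error = (forecast - outcome) ** 2
        total += error
        rows.append({**row, "sq_error": error, "running_brier": total / count})
    return rows


scored = score_log(LOG)
for row in scored:
    print(f"{row['date']}  error={row['sq_error']:.2f}  "
          f"running={row['running_brier']:.4f}")
```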

Calibration Plot:

•X-axis: Your forecasted probability (grouped into bins)
•Y-axis: Actual frequency of occurrence
•Perfect calibration = diagonal line
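
The numbers behind a calibration plot can be computed without any plotting library. The sketch below assumes equal-width bins and returns one row per non-empty bin; the function name, bin count, and data are illustrative.

```python
def calibration_table(forecasts, outcomes, n_bins=10):
    """Return (mean forecast, observed frequency, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for f, o in zip(forecasts, outcomes):
        # A forecast of exactly 1.0 goes in the top bin rather than overflowing.
        index = min(int(f * n_bins), n_bins - 1)
        bins[index].append((f, o))

    table = []
    for members in bins:
        if not members:
            continue
        mean_forecast = sum(f for f, _ in members) / len(members)  # x-axis
        observed = sum(o for _, o in members) / len(members)       # y-axis
        table.append((mean_forecast, observed, len(members)))
    return table


# Hypothetical forecasts clustered around 25% and 75%.
forecasts = [0.72, 0.75, 0.78, 0.71, 0.22, 0.25]
outcomes = [1, 1, 0, 1, 0, 1]

for mean_forecast, observed, count in calibration_table(forecasts, outcomes):
    print(f"forecast~{mean_forecast:.2f}  observed={observed:.2f}  n={count}")
```

Points where the observed frequency matches the mean forecast lie on the diagonal; the count column matters because bins with only a few forecasts will wander off the diagonal by chance.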

Improvement Signals:

•Score decreasing over time (learning)
•Calibration plot approaching diagonal
•Resolution increasing (more differentiation)
