Calibration

Overview

Calibration is the alignment between predicted probabilities and actual outcome frequencies - if you claim "70% confidence" and those predictions come true 70% of the time, you're well-calibrated. If they happen 40% or 90% of the time, you're miscalibrated (overconfident or underconfident). Developed through Philip Tetlock's decade-long forecasting research, calibration training transformed amateur "superforecasters" into predictors who outperformed professional intelligence analysts by 30%.

The framework is metacognitive: you're not just predicting outcomes, you're learning to accurately assess the strength of your own beliefs. Most people are terrible at this - they confuse "feels very certain" with "is actually 95% likely" - leading to systematic overconfidence in business decisions, project timelines, and strategic bets.

Calibration requires rigorous tracking: record predictions with explicit percentages, measure Brier scores (accuracy metric), identify systematic biases (always too high/low?), and adjust. Over hundreds of predictions, well-calibrated forecasters develop intuitive sense of uncertainty levels.

When to Use

•Improving forecast accuracy in business planning, sales projections, product timelines
•Building prediction markets or forecasting tournaments within organizations
•Evaluating expertise - are "industry experts" actually calibrated or just confidently wrong?
•Training teams to express uncertainty honestly instead of false precision
•Debugging overconfidence bias in founders, executives, product managers
•Assessing risk - calibrated probabilities enable better expected value calculations

The Process

Step 1: Make Explicit Probability Predictions

Transform vague language ("probably," "unlikely," "very confident") into precise percentages. Force yourself to commit to a number.

Vague → Calibrated:

•❌ "We'll probably hit Q3 revenue target"
•✅ "I'm 65% confident we hit Q3 revenue target"
•❌ "This hire seems great"
•✅ "I'm 80% confident this hire succeeds in role for 18+ months"

Key technique: Use 10% increments for granularity (50%, 60%, 70%) instead of just "likely/unlikely."

Step 2: Track Predictions and Outcomes Systematically

Create a prediction log with: (1) prediction statement, (2) probability assigned, (3) date made, (4) outcome (success/failure), (5) date resolved.

Tracking spreadsheet columns:

•Prediction text
•Probability (%)
•Date predicted
•Actual outcome (1 = happened, 0 = didn't happen)
•Date resolved
•Category (hiring, product, revenue, etc.)

Volume matters: You need 50-100+ predictions to measure calibration accurately. Single predictions don't reveal systematic bias.

Step 3: Calculate Brier Score

Brier score measures prediction accuracy on 0-2 scale, where 0 = perfect, 2 = worst possible. Lower scores = better calibration + resolution.

Formula: Brier Score = (1/n) × Σ(forecast - outcome)²

•Forecast = your probability (0.7 for 70%)
•Outcome = 1 if happened, 0 if didn't
•n = number of predictions

Example:

•Prediction: "70% chance we ship feature by Q3" → shipped on time
•Brier: (0.7 - 1)² = 0.09
•Prediction: "90% chance candidate accepts offer" → they declined
•Brier: (0.9 - 0)² = 0.81 (large penalty for overconfidence)

Benchmark: Superforecasters achieve Brier scores of 0.15-0.25. Average forecasters: 0.35-0.50.

Step 4: Analyze Calibration Curve

Group predictions by probability bucket (60-70%, 70-80%, etc.) and compare predicted probability to actual frequency of outcomes.

Perfect calibration:

•60% predictions → 60% actually happened
•80% predictions → 80% actually happened
•90% predictions → 90% actually happened

Overconfidence pattern:

•80% predictions → only 60% happened (systematically too confident)

Underconfidence pattern:

•70% predictions → 85% happened (leaving accuracy on table)

Visualization: Plot predicted probability (x-axis) vs. actual frequency (y-axis). Perfect calibration = diagonal line.

Step 5: Identify and Correct Systematic Biases

Look for patterns in miscalibration - are you overconfident in certain domains? Underconfident in others? Do you cluster predictions at 50%, 80%, 90% instead of using full range?

Common biases to detect:

•Domain overconfidence: Great calibration in engineering estimates, terrible in hiring
•50% clustering: Defaulting to 50% when uncertain (should use wider range)
•Overconfidence in rare events: Saying 95% for things that happen 75% of time
•Outcome-blind anchoring: Anchoring on initial view, not updating with new data

Correction strategy: Apply domain-specific adjustments. If you're always 15 points too high on hiring, mentally subtract 15 points before recording prediction.

Step 6: Practice Deliberate Calibration Training

Calibration improves through feedback loops - predict, observe outcome, update mental model, repeat. Superforecasters practice daily.

Training exercises:

•Trivia calibration: Predict "% confident I know the answer" then check - forces you to distinguish 60% from 90%
•Same-day resolution: Predict events with quick feedback (will meeting run over time? will build pass?)
•External benchmarks: Compare your forecasts to prediction markets, other forecasters
•Pre-mortems: Before launching, predict "% chance we encounter X problem" - revisit after launch

Key insight: Calibration is a skill, not a trait. Even poor forecasters improve 20-30% with training.

Example Application

Situation: Product manager predicting feature ship dates over 6 months, tracking 25 predictions.

Application:

•Initial predictions: Average 85% confidence, actual delivery rate: 60% (severe overconfidence)
•Brier score: 0.42 (below average)
•Pattern identified: Always underestimate integration complexity, overestimate team velocity
•Adjustment: Reduce confidence by 15 points on features requiring backend integration
•Month 4-6 results: 72% confidence, 68% delivery rate (much better calibration)
•New Brier score: 0.28 (approaching superforecaster level)

Outcome: PM now expresses uncertainty honestly, sets realistic stakeholder expectations, and prioritizes better (focuses on high-confidence ships).

Example Application 2

Situation: Executive team forecasting quarterly OKR achievement, historically claiming "80% confident" but only hitting 50%.