Statistical Analysis for ML Experiments

Name: statistical-analysis
Rating: 92
Author: jlaws

Test Selection

Decision Table: Which Test to Use

Scenario	Test	Assumptions	When to Use
Compare 2 models, paired runs	Paired permutation test	None (non-parametric)	Default choice for ML
Compare 2 models, unpaired	Mann-Whitney U	Independent samples	Different datasets/splits
Compare 2 means, normal data	Paired t-test	Normality, paired	Large N (>30) runs
Compare >2 models	Friedman test	Paired, ordinal	Ranking across datasets
Post-hoc after Friedman	Nemenyi test	Same as Friedman	Pairwise model comparison
Compare proportions	McNemar's test	Paired binary	Classification correct/incorrect
Correlation of metrics	Spearman's rho	Monotonic relationship	Rank-based, robust to outliers
Effect of hyperparameters	ANOVA / Kruskal-Wallis	See specific test	Factorial ablation design

Decision Table: Confidence Interval Method

Situation	Method	N Seeds
Metric approximately normal	t-interval	>= 5
Unknown distribution	Bootstrap percentile	>= 10
Small N, heavy tails	BCa bootstrap	>= 15
Proportion (accuracy)	Wilson interval	Any

Bootstrap Confidence Intervals

python

import numpy as np
from scipy import stats

def bootstrap_ci(
    data: np.ndarray,
    statistic=np.mean,
    n_bootstrap: int = 10_000,
    confidence: float = 0.95,
    method: str = "percentile",
    seed: int = 42,
) -> tuple[float, float, float]:
    """Compute bootstrap confidence interval.

    Args:
        data: 1D array of observations (e.g., accuracy per seed).
        statistic: Function to compute on each resample.
        method: 'percentile' or 'bca'.

    Returns:
        (point_estimate, lower, upper)
    """
    rng = np.random.default_rng(seed)
    observed = statistic(data)
    n = len(data)

    boot_stats = np.array([
        statistic(rng.choice(data, size=n, replace=True))
        for _ in range(n_bootstrap)
    ])

    alpha = 1 - confidence
    if method == "percentile":
        lower = np.percentile(boot_stats, 100 * alpha / 2)
        upper = np.percentile(boot_stats, 100 * (1 - alpha / 2))
    elif method == "bca":
        # Bias-correction
        z0 = stats.norm.ppf(np.mean(boot_stats < observed))
        # Acceleration (jackknife)
        jackknife = np.array([
            statistic(np.delete(data, i)) for i in range(n)
        ])
        jack_mean = jackknife.mean()
        acc = np.sum((jack_mean - jackknife) ** 3) / (
            6 * np.sum((jack_mean - jackknife) ** 2) ** 1.5
        )
        # Adjusted percentiles
        z_alpha = stats.norm.ppf(alpha / 2)
        z_1alpha = stats.norm.ppf(1 - alpha / 2)
        p_lower = stats.norm.cdf(z0 + (z0 + z_alpha) / (1 - acc * (z0 + z_alpha)))
        p_upper = stats.norm.cdf(z0 + (z0 + z_1alpha) / (1 - acc * (z0 + z_1alpha)))
        lower = np.percentile(boot_stats, 100 * p_lower)
        upper = np.percentile(boot_stats, 100 * p_upper)
    else:
        raise ValueError(f"Unknown method: {method}")

    return observed, lower, upper

# Usage
accuracies = np.array([92.1, 92.5, 91.8, 92.3, 92.0, 91.9, 92.4, 92.2, 92.6, 91.7])
point, lo, hi = bootstrap_ci(accuracies, method="bca")
print(f"Accuracy: {point:.2f} [{lo:.2f}, {hi:.2f}] (95% BCa CI)")

Paired Permutation Test

python

def paired_permutation_test(
    scores_a: np.ndarray,
    scores_b: np.ndarray,
    n_permutations: int = 100_000,
    seed: int = 42,
) -> tuple[float, float]:
    """Two-sided paired permutation test.

    Returns:
        (observed_diff, p_value) where diff = mean(a) - mean(b).
    """
    rng = np.random.default_rng(seed)
    diffs = scores_a - scores_b
    observed = np.mean(diffs)

    count = 0
    for _ in range(n_permutations):
        signs = rng.choice([-1, 1], size=len(diffs))
        perm_diff = np.mean(diffs * signs)
        if abs(perm_diff) >= abs(observed):
            count += 1

    p_value = count / n_permutations
    return observed, p_value

# Usage
model_a = np.array([92.1, 92.5, 91.8, 92.3, 92.0])
model_b = np.array([91.5, 91.8, 91.2, 91.9, 91.4])
diff, p = paired_permutation_test(model_a, model_b)
print(f"Mean diff: {diff:.2f}, p-value: {p:.4f}")

Multiple Comparison Corrections

When to Correct

Scenario	Correct?	Method
Single primary metric, 2 models	No	Single test suffices
1 model vs 5 baselines	Yes	Holm-Bonferroni
5 metrics, 1 comparison	Yes	Holm-Bonferroni
Exploratory ablation (many combos)	Yes	Benjamini-Hochberg (FDR)
Pre-registered single comparison	No	Pre-registration controls

Implementation

python

from statsmodels.stats.multitest import multipletests

def correct_pvalues(
    p_values: list[float],
    method: str = "holm",  # 'bonferroni', 'holm', 'fdr_bh'
    alpha: float = 0.05,
) -> dict:
    """Apply multiple comparison correction.

    Methods:
        'bonferroni': Conservative. alpha/n per test.
        'holm': Step-down Bonferroni. Less conservative, controls FWER.
        'fdr_bh': Benjamini-Hochberg. Controls false discovery rate.
    """
    reject, corrected, _, _ = multipletests(p_values, alpha=alpha, method=method)
    return {
        "reject": reject.tolist(),
        "corrected_p": corrected.tolist(),
        "method": method,
    }

# Usage: comparing model against 4 baselines
raw_p = [0.01, 0.04, 0.03, 0.08]
result = correct_pvalues(raw_p, method="holm")
# result['reject'] -> [True, True, True, False]
# result['corrected_p'] -> [0.04, 0.08, 0.09, 0.08]

Power Analysis

python

from statsmodels.stats.power import TTestPower

def compute_required_seeds(
    effect_size: float,
    alpha: float = 0.05,
    power: float = 0.8,
) -> int:
    """How many seeds needed to detect a given effect size.

    effect_size: Cohen's d = (mean_a - mean_b) / pooled_std
        Small: 0.2, Medium: 0.5, Large: 0.8

    Typical ML scenario:
        If accuracy diff ~0.5% and std ~0.3%, d = 0.5/0.3 = 1.67 (large).
        If accuracy diff ~0.2% and std ~0.3%, d = 0.2/0.3 = 0.67 (medium).
    """
    analysis = TTestPower()
    n = analysis.solve_power(
        effect_size=effect_size,
        alpha=alpha,
        power=power,
        alternative="two-sided",
    )
    return int(np.ceil(n))

# How many seeds to detect 0.5% accuracy diff with 0.3% std?
d = 0.5 / 0.3  # ~1.67
n_seeds = compute_required_seeds(d)
print(f"Need {n_seeds} seeds per model")  # ~5

# Smaller effect: 0.2% diff, 0.3% std
d_small = 0.2 / 0.3  # ~0.67
n_seeds_small = compute_required_seeds(d_small)
print(f"Need {n_seeds_small} seeds per model")  # ~20

Ablation Study Design

Reporting Ablation Results

Component Removed	Accuracy	Delta	p-value
None (full model)	92.3 +/- 0.3	--	--
- Attention	90.1 +/- 0.4	-2.2	0.001
- Residual	91.8 +/- 0.3	-0.5	0.032
- Dropout	92.1 +/- 0.3	-0.2	0.241

Gotchas and Anti-Patterns

P-Hacking via Metric Selection

•Choosing the metric that gives significance after seeing results invalidates the test.
•Fix: Pre-register primary metric. Report all metrics regardless of significance.

Underpowered Experiments

•3 seeds is almost never enough. Most ML experiments need 5-20 seeds depending on variance.
•Running power analysis post-hoc to explain away non-significance is invalid.
•Fix: Run power analysis before experiments. Report effect sizes alongside p-values.

Invalid Independence Assumptions

•Runs on overlapping data splits are not independent. K-fold results are correlated.
•Training with different seeds on the same data violates independence for unpaired tests.
•Fix: Use paired tests when runs share data. Use corrected variance estimators for k-fold.

Misuse of T-Tests

•T-tests assume normality. With 3-5 seeds, you cannot verify this.
•T-tests on bounded metrics (accuracy in [0,1]) are technically wrong but often okay in practice for mid-range values.
•Fix: Default to non-parametric tests (permutation, Wilcoxon). Use t-tests only with N >= 30 or verified normality.

Reporting Pitfalls

•"Statistically significant" without effect size is meaningless. A 0.01% improvement can be significant with enough runs.
•Confidence intervals are more informative than p-values. Always report both.
•Standard deviation vs standard error: SD measures spread of runs, SE measures uncertainty of the mean. Report SD for reproducibility, SE or CI for comparison.