Statistical Testing and Simulation Validation
Overview
This skill provides patterns for testing probabilistic code (Monte Carlo simulations, statistical estimates, random sampling) in a way that is both rigorous and stable in CI. The goal is to eliminate flaky tests while maintaining meaningful statistical assertions.
Decision Tree: Which Testing Approach?
code
                +-----------------------------+
                |    What are you testing?    |
                +--------------+--------------+
                               |
         +---------------------+---------------------+
         v                     v                     v
+-----------------+   +-----------------+   +-----------------+
| Control flow /  |   | Distributional  |   | Edge cases /    |
| specific output |   | properties      |   | robustness      |
+--------+--------+   +--------+--------+   +--------+--------+
         |                     |                     |
         v                     v                     v
+-----------------+   +-----------------+   +-----------------+
| DETERMINISTIC   |   | SEEDED          |   | RANDOM-SEED     |
| MOCK            |   | STATISTICAL     |   | DISTRIBUTION    |
+-----------------+   +-----------------+   +-----------------+
Use Deterministic Mocking When:
- Testing control flow, not statistical properties
- Exact output values matter for the test
- Reproducing a specific edge case
- Testing error handling paths
Use Seeded Statistical Tests When:
- Validating distributional properties (mean, variance, percentiles)
- Need reproducibility for debugging failures
- CI stability is critical
- Testing convergence behavior
Use Random-Seed Distribution Tests When:
- Exploring edge cases in development (not CI)
- Validating robustness across seeds
- Complement to seeded tests, not replacement
Choosing N for Statistical Assertions
For Proportions (e.g., "50% of simulations should be profitable")
| Desired precision (approx. 95% margin of error, worst case p = 0.5) | Minimum N | Typical use |
|---|---|---|
| ±10% (rough) | 100 | Quick sanity check |
| ±5% (moderate) | 400 | Standard CI test |
| ±3% (good) | 1,000 | Important assertions |
| ±1% (precise) | 10,000 | Critical validations |
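The table values appear consistent with the usual normal-approximation formula N = z² p(1 − p) / E², rounded to convenient figures. The sketch below assumes a 95% confidence level (z = 1.96) and the worst case p = 0.5; the function names are illustrative, not part of any existing module.

```typescript
// Normal-approximation sample size for estimating a proportion.
// Assumptions: z = 1.96 (two-sided 95%), p = 0.5 is the worst case.
function requiredNForProportion(marginOfError: number, p = 0.5, z = 1.96): number {
  if (marginOfError <= 0 || marginOfError >= 1) {
    throw new RangeError("marginOfError must be in (0, 1)");
  }
  return Math.ceil((z * z * p * (1 - p)) / (marginOfError * marginOfError));
}

// Inverse view: the margin of error that a given N supports.
function marginOfErrorForN(n: number, p = 0.5, z = 1.96): number {
  if (n <= 0) throw new RangeError("n must be positive");
  return z * Math.sqrt((p * (1 - p)) / n);
}

const nForFivePercent = requiredNForProportion(0.05); // 385, rounded up to 400 in the table
const marginAt1000 = marginOfErrorForN(1000); // roughly 0.031
```

Checking `marginOfErrorForN` against the N actually used in a test is a quick way to confirm that an assertion's tolerance is not tighter than the sample can support.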
For Means (e.g., "average return should be ~7%")
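The required N depends on the variance of the quantity being averaged: the standard error of the mean is σ/√N, so halving the tolerance requires four times as many samples. Use power analysis or empirical calibration to choose N.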
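One way to do the calculation is to estimate the standard deviation from a small pilot run and then apply N ≈ (zσ/E)². This is a minimal sketch under a normal approximation; `sampleStdDev` and `requiredNForMean` are hypothetical helper names, and the 15% volatility in the usage line is only an illustrative input.

```typescript
// Sample standard deviation (n - 1 denominator), e.g. from a pilot run.
function sampleStdDev(xs: number[]): number {
  if (xs.length < 2) throw new RangeError("need at least two observations");
  const m = xs.reduce((s, x) => s + x, 0) / xs.length;
  const ss = xs.reduce((s, x) => s + (x - m) * (x - m), 0);
  return Math.sqrt(ss / (xs.length - 1));
}

// Smallest N whose 95% margin of error on the mean is <= tolerance.
// Assumption: the sample mean is approximately normal (z = 1.96).
function requiredNForMean(sigma: number, tolerance: number, z = 1.96): number {
  if (sigma < 0 || tolerance <= 0) throw new RangeError("invalid sigma or tolerance");
  const root = (z * sigma) / tolerance;
  return Math.ceil(root * root);
}

// Illustrative inputs: sigma = 0.15, tolerance = 0.01 (one percentage point).
const nForReturns = requiredNForMean(0.15, 0.01); // 865
```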
Confidence Interval Patterns
Pattern 1: Clopper-Pearson for Proportions
Use when testing binary outcomes (success/failure, profit/loss).
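A minimal sketch of the exact (Clopper-Pearson) interval follows. To avoid depending on a statistics library, it finds each bound by bisection on the binomial tail probability rather than through the beta quantile function; `binomialCdf`, `solveIncreasing`, and `clopperPearson` are names chosen for this example, and the count of 512 successes is a made-up input.

```typescript
// P(X <= k) for X ~ Binomial(n, p), summed in log space to avoid underflow.
function binomialCdf(k: number, n: number, p: number): number {
  if (k < 0) return 0;
  if (k >= n) return 1;
  if (p <= 0) return 1;
  if (p >= 1) return 0;
  const logOdds = Math.log(p) - Math.log1p(-p);
  let logPmf = n * Math.log1p(-p); // log P(X = 0)
  let total = Math.exp(logPmf);
  for (let i = 1; i <= k; i++) {
    logPmf += Math.log((n - i + 1) / i) + logOdds; // pmf(i) from pmf(i - 1)
    total += Math.exp(logPmf);
  }
  return Math.min(1, total);
}

// Bisection: finds p in [0, 1] with f(p) = target, for f increasing in p.
function solveIncreasing(f: (p: number) => number, target: number): number {
  let lo = 0;
  let hi = 1;
  for (let i = 0; i < 60; i++) {
    const mid = (lo + hi) / 2;
    if (f(mid) < target) lo = mid;
    else hi = mid;
  }
  return (lo + hi) / 2;
}

interface Interval {
  lower: number;
  upper: number;
}

function clopperPearson(successes: number, n: number, alpha = 0.05): Interval {
  if (!Number.isInteger(successes) || !Number.isInteger(n) || n <= 0 || successes < 0 || successes > n) {
    throw new RangeError("need integers with n > 0 and 0 <= successes <= n");
  }
  // Lower bound solves P(X >= successes | p) = alpha / 2.
  const lower = successes === 0
    ? 0
    : solveIncreasing((p) => 1 - binomialCdf(successes - 1, n, p), alpha / 2);
  // Upper bound solves P(X <= successes | p) = alpha / 2,
  // i.e. P(X > successes | p) = 1 - alpha / 2, which is increasing in p.
  const upper = successes === n
    ? 1
    : solveIncreasing((p) => 1 - binomialCdf(successes, n, p), 1 - alpha / 2);
  return { lower, upper };
}

// Hypothetical result: 512 profitable runs out of 1000, target proportion 0.5.
const ci = clopperPearson(512, 1000);
const targetIsPlausible = ci.lower <= 0.5 && 0.5 <= ci.upper;
```

In a test, the assertion would be that `targetIsPlausible` is true. With a fixed seed the outcome is deterministic, so the interval guards against logic changes rather than run-to-run noise.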
Pattern 2: Bootstrap for Complex Statistics
Use when testing medians, percentiles, or custom statistics.
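As an illustration, a percentile bootstrap can be sketched in a few lines. The resampling itself must be seeded, otherwise the interval reintroduces flakiness; this example assumes a small mulberry32 generator for that purpose, and `bootstrapCI` and `median` are hypothetical helpers rather than an existing API.

```typescript
// mulberry32: a small seedable PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function median(xs: number[]): number {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 === 0 ? (s[mid - 1] + s[mid]) / 2 : s[mid];
}

// Percentile bootstrap: resample with replacement, recompute the statistic,
// and take the alpha/2 and 1 - alpha/2 quantiles of the results.
function bootstrapCI(
  sample: number[],
  statistic: (xs: number[]) => number,
  rng: () => number,
  resamples = 2000,
  alpha = 0.05
): { lower: number; upper: number } {
  if (sample.length === 0) throw new RangeError("sample must not be empty");
  const stats: number[] = [];
  for (let b = 0; b < resamples; b++) {
    const resample = sample.map(() => sample[Math.floor(rng() * sample.length)]);
    stats.push(statistic(resample));
  }
  stats.sort((x, y) => x - y);
  const lowerIdx = Math.floor((alpha / 2) * resamples);
  const upperIdx = Math.min(resamples - 1, Math.ceil((1 - alpha / 2) * resamples) - 1);
  return { lower: stats[lowerIdx], upper: stats[upperIdx] };
}

// Illustrative data: the integers 1..101, whose median is 51.
const data = Array.from({ length: 101 }, (_, i) => i + 1);
const medianCI = bootstrapCI(data, median, mulberry32(42));
```

A test would then assert that the expected median lies between `medianCI.lower` and `medianCI.upper`. The percentile method is the simplest bootstrap variant; bias-corrected variants exist but are not shown here.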
Pattern 3: Tolerance Bands for Time Series
Use when testing simulated paths over time.
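One possible shape for such a check is sketched below: average the simulated paths step by step, then report every time step where the average leaves a band around an expected path. The band here is a fixed relative plus absolute tolerance, which is an assumption of this example; a band that widens with time may suit diffusive processes better. `meanPath` and `findBandViolations` are illustrative names.

```typescript
interface BandViolation {
  step: number;
  observed: number;
  lower: number;
  upper: number;
}

// Pointwise mean across simulated paths (all paths must have equal length).
function meanPath(paths: number[][]): number[] {
  if (paths.length === 0) throw new RangeError("need at least one path");
  const steps = paths[0].length;
  const out = new Array<number>(steps).fill(0);
  for (const path of paths) {
    if (path.length !== steps) throw new RangeError("paths must have equal length");
    for (let t = 0; t < steps; t++) out[t] += path[t] / paths.length;
  }
  return out;
}

// Returns every step where observed[t] falls outside
// expected[t] +/- (absTol + relTol * |expected[t]|).
function findBandViolations(
  observed: number[],
  expected: number[],
  relTol: number,
  absTol = 0
): BandViolation[] {
  if (observed.length !== expected.length) {
    throw new RangeError("observed and expected must have equal length");
  }
  const violations: BandViolation[] = [];
  observed.forEach((value, step) => {
    const halfWidth = absTol + relTol * Math.abs(expected[step]);
    const lower = expected[step] - halfWidth;
    const upper = expected[step] + halfWidth;
    if (value < lower || value > upper) {
      violations.push({ step, observed: value, lower, upper });
    }
  });
  return violations;
}

// Two toy paths whose pointwise mean is [100, 100, 101].
const avg = meanPath([[100, 110, 121], [100, 90, 81]]);
const violations = findBandViolations(avg, [100, 100, 100], 0.02);
```

Returning the list of violations, rather than a single boolean, lets a failing test report which time steps drifted and by how much.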
Anti-Patterns
Point Estimate Assertion (BAD)
typescript
// BAD: Asserts exact value from random process
expect(mean(results)).toBe(0.5); // Will fail ~always
Tight Tolerance with Small N (BAD)
typescript
// BAD: N=100 can't support 2-decimal precision
expect(mean(results)).toBeCloseTo(0.5, 2); // Flaky
Random Seed in CI (BAD)
typescript
// BAD: Different seed each run = non-reproducible failures
const results = runSimulations({ n: 1000 }); // No seed!
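For contrast with the three anti-patterns above, here is a rough sketch of a seeded run whose tolerance is derived from the standard error instead of being picked by hand. `simulateFlips` is a hypothetical stand-in for `runSimulations` (fair coin flips, 1 = profit), mulberry32 is assumed as the seedable generator, and the choice of 4 standard errors is a judgment call, not a rule from the original text.

```typescript
// mulberry32: a small seedable PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical stand-in for runSimulations: n fair coin flips.
function simulateFlips(n: number, seed: number): number[] {
  const rng = mulberry32(seed);
  return Array.from({ length: n }, () => (rng() < 0.5 ? 1 : 0));
}

function mean(xs: number[]): number {
  return xs.reduce((s, x) => s + x, 0) / xs.length;
}

// True when |observed - expected| is at most k standard errors (sigma / sqrt(n)).
function withinStandardErrors(
  observed: number,
  expected: number,
  sigma: number,
  n: number,
  k = 4
): boolean {
  return Math.abs(observed - expected) <= (k * sigma) / Math.sqrt(n);
}

// GOOD: explicit seed, explicit N, tolerance tied to N.
const flips = simulateFlips(10000, 42);
const meanIsPlausible = withinStandardErrors(mean(flips), 0.5, 0.5, flips.length);
```

In Jest this would become `expect(meanIsPlausible).toBe(true)`. Because the seed is fixed, a failure reproduces exactly on the next run.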
Seed Management Strategy
Development Seeds vs. CI Seeds
typescript
// config/test-seeds.ts
export const SEEDS = {
  // CI seeds: fixed, documented, produce known-good results
  ci: {
    monteCarlo: 42,
    portfolioSim: 123,
    stressTest: 7777,
  },
  // Development: use current timestamp for exploration
  development: () => Date.now(),
};
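A timestamp seed is only useful for exploration if it is recorded, so a small resolver that logs the development seed can help. The sketch below is one possible arrangement: `resolveSeed`, the `isCI` flag, and the injectable `log` callback are assumptions made for this example, and how `isCI` is detected is left to the project.

```typescript
type SeedName = "monteCarlo" | "portfolioSim" | "stressTest";

// Mirrors the CI seeds from config/test-seeds.ts.
const CI_SEEDS: Record<SeedName, number> = {
  monteCarlo: 42,
  portfolioSim: 123,
  stressTest: 7777,
};

// Returns the fixed seed in CI; otherwise a timestamp-derived seed that is
// logged so a surprising local run can be replayed.
function resolveSeed(
  name: SeedName,
  isCI: boolean,
  log: (msg: string) => void = console.log
): number {
  if (isCI) return CI_SEEDS[name];
  const seed = Date.now() >>> 0; // reduce to an unsigned 32-bit integer
  log(`[seed] ${name} = ${seed} (re-run with this seed to reproduce)`);
  return seed;
}

const ciSeed = resolveSeed("monteCarlo", true);
```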
Documenting Seed Behavior
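For each CI seed, document the expected statistical properties (for example mean, standard deviation, and key percentiles) so that a change in simulation logic shows up as a deviation from the documented values.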
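One lightweight way to keep this documentation close to the tests is a typed baseline record plus a comparison helper. Everything in the sketch below is hypothetical: the `SeedBaseline` shape, the `compareToBaseline` helper, and in particular the numbers, which are placeholders rather than measured results.

```typescript
interface SeedBaseline {
  seed: number;
  n: number;
  expectedMean: number;
  expectedStdDev: number;
  tolerance: number; // absolute tolerance applied to both statistics
  note: string;
}

// Placeholder values for illustration only; replace with recorded results.
const monteCarloBaseline: SeedBaseline = {
  seed: 42,
  n: 1000,
  expectedMean: 0.07,
  expectedStdDev: 0.15,
  tolerance: 0.001,
  note: "Update whenever simulation logic changes intentionally.",
};

// Returns a human-readable message for each statistic that drifted.
function compareToBaseline(
  observed: { mean: number; stdDev: number },
  baseline: SeedBaseline
): string[] {
  const problems: string[] = [];
  if (Math.abs(observed.mean - baseline.expectedMean) > baseline.tolerance) {
    problems.push(
      `mean ${observed.mean} deviates from documented ${baseline.expectedMean} (seed ${baseline.seed})`
    );
  }
  if (Math.abs(observed.stdDev - baseline.expectedStdDev) > baseline.tolerance) {
    problems.push(
      `stdDev ${observed.stdDev} deviates from documented ${baseline.expectedStdDev} (seed ${baseline.seed})`
    );
  }
  return problems;
}
```

A seeded test can then assert that `compareToBaseline(...)` returns an empty array, so a drift produces a message naming the seed and the statistic that moved.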
Property-Based Testing for Invariants
Use for properties that must hold regardless of random outcomes:
typescript
import * as fc from 'fast-check';

test('portfolio value is always non-negative', () => {
  fc.assert(
    fc.property(
      fc.integer({ min: 1, max: 100 }),
      fc.integer({ min: 100, max: 10000 }),
      fc.integer({ min: 0, max: 999999 }),
      (periods, n, seed) => {
        const results = runSimulations({ periods, n, seed });
        return results.every(r => r.finalValue >= 0);
      }
    ),
    { numRuns: 50 }
  );
});
Integration with Phoenix Workflows
When Monte Carlo or statistical code changes:
- Before merge: Run seeded statistical tests (standard CI)
- After merge: Run random-seed distribution tests (nightly job)
- On failure: Check if seed behavior changed, recalibrate if needed
- Document: Update seed documentation with new expected values