Test Health Audit
Systematic review of a test suite for flakiness, performance, and coverage. Spawns a team to audit, fix, optimize, and verify.
When to Use
- Tests pass/fail inconsistently (flaky)
- Test suite is slow or getting slower
- Coverage is unknown or declining
- Before a release or after a large feature lands
- CI is unreliable due to test issues
The Process
```dot
digraph audit {
    rankdir=TB;
    "Spawn audit agents (parallel)" [shape=box];
    "Flakiness audit" [shape=box];
    "Performance audit" [shape=box];
    "Fix flaky tests" [shape=box];
    "Optimize slow tests" [shape=box];
    "10x verification run" [shape=box];
    "Push PR" [shape=box];
    "Spawn audit agents (parallel)" -> "Flakiness audit";
    "Spawn audit agents (parallel)" -> "Performance audit";
    "Flakiness audit" -> "Fix flaky tests";
    "Performance audit" -> "Optimize slow tests";
    "Fix flaky tests" -> "10x verification run";
    "Optimize slow tests" -> "10x verification run";
    "10x verification run" -> "Push PR";
}
```
Phase 1: Parallel Audits
Spawn two research agents simultaneously:
Flakiness Audit (test-automator or QA agent):
- Identify all flaky tests and categorize root causes
- Search for these patterns across ALL test files:
| Pattern | Risk | Example |
|---|---|---|
| Shared filesystem paths | HIGH | Tests reading/writing same directory |
| Non-seeded RNG | HIGH | `thread_rng()`, `rand::rng()` in test paths |
| Timing dependencies | MEDIUM | `sleep()`, `Instant::now()`, elapsed checks |
| Global/static mutable state | HIGH | `lazy_static`, shared `Mutex`, `AtomicBool` |
| Execution order dependencies | MEDIUM | Tests that assume prior test ran |
| Probabilistic assertions | LOW | Monte Carlo with tight margins |
- Produce ranked report: test name, file:line, root cause, severity, fix approach
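As a starting point for the pattern search, a plain substring scan is often enough to build a first candidate list. The sketch below is a std-only Rust illustration, not a complete detector: `NEEDLES`, `Finding`, and `scan_source` are hypothetical names, the needle list covers only some rows of the table, and substring matches will include false positives that still need manual review.

```rust
/// Risk level, mirroring the Risk column of the pattern table.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Risk {
    Low,
    Medium,
    High,
}

/// One suspicious line found in a test file.
#[derive(Debug, PartialEq)]
pub struct Finding {
    pub line: usize, // 1-based line number
    pub needle: &'static str,
    pub risk: Risk,
}

/// Hypothetical needle list (an assumption); extend it per project.
const NEEDLES: &[(&str, Risk)] = &[
    ("thread_rng()", Risk::High),
    ("rand::rng()", Risk::High),
    ("lazy_static", Risk::High),
    ("AtomicBool", Risk::High),
    ("sleep(", Risk::Medium),
    ("Instant::now()", Risk::Medium),
];

/// Scan source text and return findings, highest risk first.
pub fn scan_source(src: &str) -> Vec<Finding> {
    let mut findings = Vec::new();
    for (idx, text) in src.lines().enumerate() {
        for &(needle, risk) in NEEDLES {
            if text.contains(needle) {
                findings.push(Finding { line: idx + 1, needle, risk });
            }
        }
    }
    // Stable sort: line order is preserved within each risk level.
    findings.sort_by(|a, b| b.risk.cmp(&a.risk));
    findings
}
```

Calling `scan_source` on the contents of each test file yields the raw material for the ranked report; root cause and fix approach still have to be filled in by the auditing agent.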
Performance Audit (performance-engineer agent):
- Profile test suite timing per file and per test
- Identify slowest tests and why they're slow:
| Pattern | Fix |
|---|---|
| Excessive loop counts (50k+) | Reduce to minimum proving the point |
| Brute-force probabilistic triggering | Pre-computed seeds or boosted parameters |
| Redundant negative assertions (P=0) | 100 iterations sufficient for structural impossibility |
| Expensive AI computation in tests | Reduce search depth or board size |
| Duplicate coverage across files | Consolidate or remove redundant tests |
- Produce ranked report: test name, time contribution, optimization, expected savings
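The ranking step of the performance report can be sketched as follows, assuming per-test timings have already been collected by whatever means the project uses. `TimingRow` and `rank_by_time` are hypothetical names introduced for illustration.

```rust
use std::time::Duration;

/// One row of the ranked performance report.
#[derive(Debug)]
pub struct TimingRow {
    pub name: String,
    pub time: Duration,
    pub share_percent: f64, // share of total suite time
}

/// Rank tests by duration, slowest first, and compute each test's share.
pub fn rank_by_time(timings: &[(&str, Duration)]) -> Vec<TimingRow> {
    let total: Duration = timings.iter().map(|(_, d)| *d).sum();
    let mut rows: Vec<TimingRow> = timings
        .iter()
        .map(|(name, d)| TimingRow {
            name: name.to_string(),
            time: *d,
            // Guard against division by zero for an empty or instant suite.
            share_percent: if total.is_zero() {
                0.0
            } else {
                100.0 * d.as_secs_f64() / total.as_secs_f64()
            },
        })
        .collect();
    rows.sort_by(|a, b| b.time.cmp(&a.time));
    rows
}
```

Sorting by share of total time keeps attention on the few tests that dominate the suite rather than on tests that are merely slower than average.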
Phase 2: Parallel Fixes
Spawn fix agents based on audit findings. Common fix patterns:
Flakiness fixes:
- Shared filesystem -> isolated temp directories per test
- Non-seeded RNG -> seeded RNG or boosted parameters to make outcome deterministic
- Timing dependencies -> generous tolerances, or remove the timing dependency entirely
- Global state -> per-test initialization
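As an illustration of the first fix, the sketch below gives each test its own scratch directory and removes it when the test ends. It uses only the standard library; `TestDir` is a hypothetical helper, and the naming scheme (label, process id, counter) is an assumption. Projects that already depend on a temp-directory crate would normally use that instead.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::sync::atomic::{AtomicU64, Ordering};

/// Counter that makes directory names unique within one test process.
static NEXT_ID: AtomicU64 = AtomicU64::new(0);

/// A scratch directory owned by one test and deleted on drop.
pub struct TestDir {
    path: PathBuf,
}

impl TestDir {
    pub fn new(label: &str) -> io::Result<Self> {
        let id = NEXT_ID.fetch_add(1, Ordering::Relaxed);
        // Process id + counter avoids collisions between parallel tests
        // and between concurrently running test binaries.
        let name = format!("{label}-{}-{id}", std::process::id());
        let path = std::env::temp_dir().join(name);
        fs::create_dir_all(&path)?;
        Ok(TestDir { path })
    }

    pub fn path(&self) -> &Path {
        &self.path
    }
}

impl Drop for TestDir {
    fn drop(&mut self) {
        // Best-effort cleanup; errors are ignored so drop never panics.
        let _ = fs::remove_dir_all(&self.path);
    }
}
```

A test would create `let dir = TestDir::new("save-game")?;` at the top and build every path from `dir.path()`, so no two tests ever read or write the same location.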
Performance fixes:
- Structural impossibility tests (P=0 by code path): reduce to 100-1k iterations
- Rare event triggering: boost parameters to increase probability + seeded RNG
- Statistical distribution tests: reduce samples and widen thresholds to match (sampling error grows roughly as 1/sqrt(n))
- Combat/simulation loops: give test characters overwhelming stats so the outcome is decided in a few iterations
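The "boost parameters + seeded RNG" fix might look like the sketch below. `SplitMix64` is a small hand-written generator standing in for whatever seeded RNG the project already uses; `is_critical` and `count_criticals` are a hypothetical system under test, not code from any real project.

```rust
/// Minimal SplitMix64 generator, used here as a stand-in for a seeded RNG.
pub struct SplitMix64(u64);

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        SplitMix64(seed)
    }

    pub fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform float in [0, 1).
    pub fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Hypothetical code under test: a hit is critical with probability `crit_chance`.
pub fn is_critical(rng: &mut SplitMix64, crit_chance: f64) -> bool {
    rng.next_f64() < crit_chance
}

/// Count critical hits over `attempts` rolls, starting from a fixed seed.
pub fn count_criticals(seed: u64, crit_chance: f64, attempts: u32) -> u32 {
    let mut rng = SplitMix64::new(seed);
    (0..attempts)
        .filter(|_| is_critical(&mut rng, crit_chance))
        .count() as u32
}
```

With the seed fixed, every run sees the same sequence, so the test either always passes or always fails. Boosting `crit_chance` to `1.0` in the test makes the rare branch fire on the first roll, which removes the need for tens of thousands of iterations just to trigger it.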
Phase 3: Coverage Check
After fixes, verify coverage hasn't regressed:
```bash
# Rust (requires the cargo-llvm-cov subcommand)
cargo llvm-cov --summary-only

# JS/TS
npx jest --coverage --coverageReporters=text-summary

# Python (requires the pytest-cov plugin)
pytest --cov --cov-report=term
```
Flag any modules below project threshold.
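Once per-module percentages have been read off the coverage summary, the flagging step is a simple filter. In this sketch the module names, the percentages, and the function `below_threshold` are all hypothetical; the threshold is whatever the project has agreed on.

```rust
/// Return modules whose coverage is below `threshold_percent`, lowest first.
pub fn below_threshold<'a>(
    coverage: &[(&'a str, f64)],
    threshold_percent: f64,
) -> Vec<(&'a str, f64)> {
    let mut flagged: Vec<(&'a str, f64)> = coverage
        .iter()
        .copied()
        .filter(|&(_, pct)| pct < threshold_percent)
        .collect();
    // Lowest coverage first, so the worst modules lead the report.
    flagged.sort_by(|a, b| a.1.total_cmp(&b.1));
    flagged
}
```

For example, `below_threshold(&[("parser", 91.5), ("net", 62.0)], 80.0)` returns only the `net` entry.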
Phase 4: Verification
Run the full test suite 10 times consecutively:
```bash
# Rust
for i in $(seq 1 10); do
  echo "Run $i"
  cargo test 2>&1 | grep "test result:"
done

# JS/TS
for i in $(seq 1 10); do
  echo "Run $i"
  npx jest 2>&1 | grep "Tests:"
done
```
Pass criteria: 0 failures across all 10 runs.
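To see why repeated runs matter, consider a test that fails with probability p on any given run. Assuming runs are independent (an assumption that shared caches or machine load can violate), the chance that at least one of n runs fails is 1 - (1 - p)^n. The sketch below computes this; the function names are hypothetical.

```rust
/// Probability that at least one of `runs` independent runs fails,
/// given a per-run failure probability `p` in [0, 1].
pub fn detection_probability(p: f64, runs: u32) -> f64 {
    1.0 - (1.0 - p).powi(runs as i32)
}

/// Smallest number of runs that detects a flake with at least `confidence`.
/// Assumes 0 < p < 1 and 0 < confidence < 1.
pub fn runs_needed(p: f64, confidence: f64) -> u32 {
    ((1.0 - confidence).ln() / (1.0 - p).ln()).ceil() as u32
}
```

A test that flakes 10% of the time is caught by a single run only 10% of the time, and by 10 runs about 65% of the time. Ten clean runs are therefore strong evidence against frequent flakes, while rarer ones may need more runs to surface.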
Team Composition
| Role | Count | Task |
|---|---|---|
| QA / test-automator | 1 | Flakiness audit |
| performance-engineer | 1 | Performance audit |
| QA / general-purpose | 1-3 | Fix flaky tests (parallelize by category) |
| Dev / general-purpose | 1 | Optimize slow tests |
Scale QA agents based on number of flaky test categories found.
Common Mistakes
| Mistake | Fix |
|---|---|
| Reducing iteration count too aggressively | Keep enough for statistical confidence (3x expected hits minimum) |
| Fixing flakiness by adding sleep/retry | Fix root cause (isolation, determinism), not symptoms |
| Modifying production code to fix tests | Only change test code unless the production code has a genuine bug |
| Ignoring Monte Carlo tests as "probably fine" | Check margins are generous (2-5x expected range) |
| Skipping the 10x verification | Flakiness is probabilistic; 1 run proves nothing |