A/B Test Setup
1️⃣ Purpose & Scope
Ensure every A/B test is valid, rigorous, and safe before a single line of code is written.
- •Prevents "peeking"
- •Enforces statistical power
- •Blocks invalid hypotheses
2️⃣ Pre-Requisites
You must have:
- •A clear user problem
- •Access to an analytics source
- •Roughly estimated traffic volume
Hypothesis Quality Checklist
A valid hypothesis includes:
- •Observation or evidence
- •Single, specific change
- •Directional expectation
- •Defined audience
- •Measurable success criteria
3️⃣ Hypothesis Lock (Hard Gate)
Before designing variants or metrics, you MUST:
- •Present the final hypothesis
- •Specify:
- •Target audience
- •Primary metric
- •Expected direction of effect
- •Minimum Detectable Effect (MDE)
Ask explicitly:
“Is this the final hypothesis we are committing to for this test?”
Do NOT proceed until confirmed.
4️⃣ Assumptions & Validity Check (Mandatory)
Explicitly list assumptions about:
- •Traffic stability
- •User independence
- •Metric reliability
- •Randomization quality
- •External factors (seasonality, campaigns, releases)
If assumptions are weak or violated:
- •Warn the user
- •Recommend delaying or redesigning the test
5️⃣ Test Type Selection
Choose the simplest valid test:
- •A/B Test – single change, two variants
- •A/B/n Test – multiple variants, higher traffic required
- •Multivariate Test (MVT) – interaction effects, very high traffic
- •Split URL Test – major structural changes
Default to A/B unless there is a clear reason otherwise.
6️⃣ Metrics Definition
Primary Metric (Mandatory)
- •Single metric used to evaluate success
- •Directly tied to the hypothesis
- •Pre-defined and frozen before launch
Secondary Metrics
- •Provide context
- •Explain why results occurred
- •Must not override the primary metric
Guardrail Metrics
- •Metrics that must not degrade
- •Used to prevent harmful wins
- •Trigger test stop if significantly negative
7️⃣ Sample Size & Duration
Define upfront:
- •Baseline rate
- •MDE
- •Significance level (typically 95%)
- •Statistical power (typically 80%)
Estimate:
- •Required sample size per variant
- •Expected test duration
Do NOT proceed without a realistic sample size estimate.
8️⃣ Execution Readiness Gate (Hard Stop)
You may proceed to implementation only if all are true:
- •Hypothesis is locked
- •Primary metric is frozen
- •Sample size is calculated
- •Test duration is defined
- •Guardrails are set
- •Tracking is verified
If any item is missing, stop and resolve it.
Running the Test
During the Test
DO:
- •Monitor technical health
- •Document external factors
DO NOT:
- •Stop early due to “good-looking” results
- •Change variants mid-test
- •Add new traffic sources
- •Redefine success criteria
Analyzing Results
Analysis Discipline
When interpreting results:
- •Do NOT generalize beyond the tested population
- •Do NOT claim causality beyond the tested change
- •Do NOT override guardrail failures
- •Separate statistical significance from business judgment
Interpretation Outcomes
| Result | Action |
|---|---|
| Significant positive | Consider rollout |
| Significant negative | Reject variant, document learning |
| Inconclusive | Consider more traffic or bolder change |
| Guardrail failure | Do not ship, even if primary wins |
Documentation & Learning
Test Record (Mandatory)
Document:
- •Hypothesis
- •Variants
- •Metrics
- •Sample size vs achieved
- •Results
- •Decision
- •Learnings
- •Follow-up ideas
Store records in a shared, searchable location to avoid repeated failures.
Refusal Conditions (Safety)
Refuse to proceed if:
- •Baseline rate is unknown and cannot be estimated
- •Traffic is insufficient to detect the MDE
- •Primary metric is undefined
- •Multiple variables are changed without proper design
- •Hypothesis cannot be clearly stated
Explain why and recommend next steps.
Key Principles (Non-Negotiable)
- •One hypothesis per test
- •One primary metric
- •Commit before launch
- •No peeking
- •Learning over winning
- •Statistical rigor first
Final Reminder
A/B testing is not about proving ideas right. It is about learning the truth with confidence.
If you feel tempted to rush, simplify, or “just try it” — that is the signal to slow down and re-check the design.