Experiment Design
Types of Experiments
1. A/B Test (Two Variants)
What: Compare two versions (A vs B)
Example:
- •Control (A): Blue "Buy Now" button
- •Treatment (B): Green "Buy Now" button
When to Use:
- •Testing single change
- •Clear hypothesis
- •Binary decision (ship or don't ship)
Pros:
- •Simple to implement
- •Easy to analyze
- •Clear winner
Cons:
- •Only tests one change
- •Can't test interactions
2. Multivariate Test (Multiple Changes)
What: Test multiple changes simultaneously
Example:
- •Variable 1: Button color (Blue, Green, Red)
- •Variable 2: Button text ("Buy Now", "Add to Cart", "Get Started")
- •Variants: 3 × 3 = 9 combinations
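As a minimal sketch, the full set of combinations can be enumerated with a Cartesian product. The dictionary keys `color` and `text` are illustrative names, not part of any testing tool.

```python
from itertools import product

colors = ["Blue", "Green", "Red"]
texts = ["Buy Now", "Add to Cart", "Get Started"]

# Full-factorial design: every color is paired with every text
variants = [{"color": c, "text": t} for c, t in product(colors, texts)]

print(len(variants))
for number, variant in enumerate(variants, start=1):
    print(number, variant["color"], "/", variant["text"])
```

Each added variable multiplies the variant count, which is why traffic requirements grow so quickly.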
When to Use:
- •Testing multiple elements
- •Want to find best combination
- •Have enough traffic
Pros:
- •Test interactions between variables
- •Find optimal combination
Cons:
- •Requires much more traffic
- •Complex analysis
- •Longer test duration
3. Sequential Testing
What: Monitor results continuously and stop early once there is a clear winner
Example:
- •Start A/B test
- •Check results daily
- •Stop once the adjusted significance threshold is crossed (could be day 3 or day 14)
When to Use:
- •Want to ship winners fast
- •High traffic
- •Using tools that support it (Statsig, GrowthBook)
Pros:
- •Faster results
- •Less opportunity cost
Cons:
- •Requires special statistical methods
- •Not interchangeable with a traditional A/B test, where peeking invalidates the p-value
4. Holdout Groups (Long-Term Effects)
What: Keep a small % of users on the old experience long after launch
Example:
- •95% of users: New feature
- •5% of users: Old experience (holdout)
When to Use:
- •Measure long-term effects
- •Detect delayed negative impacts
- •Validate cumulative changes
Pros:
- •Detects long-term issues
- •Measures true impact
Cons:
- •Some users get worse experience
- •Requires ongoing monitoring
When to Experiment
✅ Experiment When:
Significant Features (High Impact):
- •Major redesign
- •New pricing model
- •Core flow changes
Uncertain Outcomes:
- •Don't know if it will work
- •Conflicting opinions
- •No clear data
Multiple Solution Options:
- •Two different approaches
- •Want to pick the best
Optimization Opportunities:
- •Incremental improvements
- •Conversion optimization
- •Engagement optimization
❌ Don't Experiment When:
Obvious Bugs/Fixes:
- •Broken functionality
- •Security issues
- •Legal compliance
Very Low Traffic:
- •Can't reach statistical significance
- •Would take months
Trivial Changes:
- •Copy typo fix
- •Minor styling adjustment
Ethical Issues:
- •Manipulative dark patterns
- •Harmful to users
Experiment Design Process
Step 1: Define Hypothesis
Template:
"If we [change], then [metric] will [improve by X%], because [reasoning]."
Example:
"If we change the CTA button from blue to green, then click-through rate will increase by 10%, because green is more attention-grabbing."
Step 2: Choose Metrics
Primary Metric: What you're optimizing
- •Example: Click-through rate
Secondary Metrics: Other important outcomes
- •Example: Conversion rate, revenue per user
Counter Metrics: Watch for negatives
- •Example: Bounce rate, time on page
Step 3: Determine Sample Size
Inputs:
- •Baseline conversion rate: 5%
- •Expected improvement: 10% relative lift (5% → 5.5%)
- •Significance level: 0.05 (95% confidence)
- •Power: 0.80 (80% chance of detecting effect)
Output:
- •Sample size needed: ~31,000 users per variant
Tools:
- •Evan Miller's calculator: https://www.evanmiller.org/ab-testing/sample-size.html
- •Optimizely sample size calculator
Step 4: Set Test Duration
Factors:
- •Sample size needed
- •Daily traffic
- •Weekly patterns (run at least 1-2 weeks)
- •Business cycles
Example:
- •Sample size: 31,000 per variant (62,000 total)
- •Daily traffic: 5,000
- •Duration: 62,000 / 5,000 = 12.4 days → Run for 2 weeks
Step 5: Design Variants
Control (A): Current experience
Treatment (B): New experience
Best Practices:
- •Change only one thing (for A/B test)
- •Make change meaningful (not trivial)
- •Ensure variants are distinct
Step 6: Launch Test
Checklist:
- • Hypothesis documented
- • Metrics instrumented
- • Sample size calculated
- • Randomization working
- • QA tested both variants
- • Monitoring dashboard ready
Step 7: Analyze Results
Check:
- •Statistical significance (p < 0.05)
- •Practical significance (is improvement meaningful?)
- •Secondary metrics (any red flags?)
- •Segment analysis (works for everyone?)
Step 8: Decide (Ship, Iterate, Kill)
Ship if:
- •Positive, significant, no red flags
Iterate if:
- •Mixed results, some segments good
Kill if:
- •Negative, not significant, opportunity cost too high
Choosing Metrics
Primary Metric (What We're Optimizing)
Characteristics:
- •Directly tied to hypothesis
- •Sensitive to change
- •Measurable in test duration
Examples:
- •Click-through rate (CTR)
- •Conversion rate
- •Sign-up completion rate
- •Time to first action
Bad Primary Metrics:
- •Revenue (too noisy, delayed)
- •Retention (takes too long to measure)
- •NPS (survey-based, low sample)
Secondary Metrics (Guardrails, Side Effects)
Purpose: Ensure we're not breaking other things
Examples:
- •Revenue per user
- •Engagement (sessions per user)
- •Feature adoption
- •Customer satisfaction
Counter Metrics (Watch for Negatives)
Purpose: Detect unintended negative consequences
Examples:
- •Bounce rate (users leaving immediately)
- •Error rate (technical issues)
- •Support tickets (confusion)
- •Churn rate (users leaving)
Example: Checkout Flow Test
Hypothesis:
"If we reduce checkout from 5 steps to 3 steps, conversion will increase by 15%."
Metrics:
- •Primary: Checkout conversion rate
- •Secondary: Average order value, time to complete checkout
- •Counter: Cart abandonment rate, error rate, support tickets
Statistical Significance
P-Value < 0.05 (95% Confidence)
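What it Means:
- •If there were truly no difference, a result at least this extreme would occur less than 5% of the time
- •It is not a 95% probability that the effect is real; it caps the false positive rate at 5%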
Example:
- •Control: 5.0% conversion
- •Treatment: 5.5% conversion
- •P-value: 0.03 ✅ (< 0.05, statistically significant)
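A p-value like the one in this example typically comes from a two-proportion z-test. The sketch below uses only the standard library; the example gives no sample size, so the 18,800 users per variant are an assumed figure chosen for illustration.

```python
import math

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of the standard normal distribution
    return math.erfc(abs(z) / math.sqrt(2))

# Assumed counts: 5.0% vs 5.5% conversion with 18,800 users per variant
p = two_proportion_p_value(940, 18_800, 1_034, 18_800)
print(f"p-value: {p:.3f}")
```

The same 0.5-point gap yields a larger p-value with fewer users, so the rates alone never determine significance.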
Interpretation:
"If there were no real difference, we would see a gap this large less than 5% of the time, so we reject the null hypothesis."
Statistical Power (80%+)
What it Means:
- •80% chance of detecting an effect if it exists
- •Reduces false negatives
Example:
- •Power: 80%
- •Means: 20% chance of missing a real effect
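Power can be approximated for a planned test with a normal approximation. This is a rough sketch: the function name and the inputs (5.0% vs 5.5% conversion) are illustrative assumptions.

```python
import math
from statistics import NormalDist

def approximate_power(p1, p2, n_per_variant, alpha=0.05):
    """Approximate power of a two-sided two-proportion test."""
    normal = NormalDist()
    z_alpha = normal.inv_cdf(1 - alpha / 2)
    # Standard error of the difference using each arm's own variance
    se = math.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variant)
    return normal.cdf(abs(p2 - p1) / se - z_alpha)

for n in (10_000, 31_200):
    print(n, f"{approximate_power(0.05, 0.055, n):.0%}")
```

An underpowered test is more likely to miss a real effect than to find it, so check power before launch.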
Minimum Detectable Effect (MDE)
What it Means:
- •Smallest effect size you can reliably detect
- •Depends on sample size
Example:
- •Baseline: 5% conversion
- •Sample size: 10,000 per variant
- •MDE: about 0.9% absolute (18% relative) at 95% confidence and 80% power
- •Can detect: 5.0% → 5.9% or larger
Trade-off:
- •Larger sample size → Smaller MDE (detect smaller effects)
- •Smaller sample size → Larger MDE (only detect big effects)
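The trade-off can be made concrete by solving the sample size relationship for the effect instead. The sketch below is an approximation that applies the baseline variance to both arms; the function name is illustrative.

```python
import math
from statistics import NormalDist

def minimum_detectable_effect(baseline, n_per_variant, alpha=0.05, power=0.80):
    """Approximate absolute MDE for a two-sided two-proportion test."""
    normal = NormalDist()
    z_total = normal.inv_cdf(1 - alpha / 2) + normal.inv_cdf(power)
    # Simplification: both arms are assumed to share the baseline variance
    return z_total * math.sqrt(2 * baseline * (1 - baseline) / n_per_variant)

mde = minimum_detectable_effect(0.05, 31_000)
print(f"absolute MDE: {mde:.4f}")
print(f"relative MDE: {mde / 0.05:.1%}")
```

Because the MDE shrinks with the square root of the sample size, halving it takes roughly four times the traffic.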
Sample Size Calculation
Formula (Simplified)
n = (Z_α/2 + Z_β)² × (p₁(1-p₁) + p₂(1-p₂)) / (p₁ - p₂)²
Where:
- n = sample size per variant
- Z_α/2 = 1.96 (for 95% confidence)
- Z_β = 0.84 (for 80% power)
- p₁ = baseline conversion rate
- p₂ = expected conversion rate
Example Calculation
Inputs:
- •Baseline conversion rate (p₁): 5% = 0.05
- •Expected improvement: 10% relative lift
- •New conversion rate (p₂): 5.5% = 0.055
- •Significance level (α): 0.05
- •Power (1-β): 0.80
Calculation:
n = (1.96 + 0.84)² × (0.05×0.95 + 0.055×0.945) / (0.05 - 0.055)²
n = 7.84 × (0.0475 + 0.0520) / 0.000025
n = 7.84 × 0.0995 / 0.000025
n ≈ 31,200 per variant
Total sample size: 62,400 users
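The simplified formula can be wrapped in a small helper. This sketch takes the Z values from the standard library rather than the rounded 1.96 and 0.84, so the result differs slightly from the hand calculation; the function name is illustrative.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(p1, p2, alpha=0.05, power=0.80):
    """Sample size per variant from the simplified two-proportion formula."""
    normal = NormalDist()
    z_alpha = normal.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = normal.inv_cdf(power)           # about 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

n = sample_size_per_variant(0.05, 0.055)
print(f"per variant: {n:,}  total: {2 * n:,}")
```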
Using Online Calculators
Evan Miller's Calculator:
- •Go to https://www.evanmiller.org/ab-testing/sample-size.html
- •Enter baseline conversion rate: 5%
- •Enter minimum detectable effect: 10% (relative)
- •Get sample size: ~31,000 per variant
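Optimizely Calculator:
- •Go to Optimizely sample size calculator
- •Enter baseline: 5%
- •Enter minimum detectable effect: 10% (relative)
- •Get sample size: the figure may differ from the ~31,000 above, because the calculator is built around Optimizely's sequential Stats Engine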
Test Duration
Minimum Duration: 1-2 Weeks
Why:
- •Capture weekly patterns (weekday vs weekend)
- •Avoid day-of-week bias
- •Account for user behavior cycles
Example:
- •Don't run Monday-Wednesday only
- •Run at least Monday-Sunday (1 full week)
Full Business Cycles
Examples:
- •E-commerce: Include payday (1st and 15th of month)
- •B2B SaaS: Include full week (avoid Friday-only)
- •Seasonal: Avoid holidays (unless testing holiday-specific)
Enough Data for Significance
Formula:
Duration = Sample Size Needed / Daily Traffic
Example:
- •Sample size: 62,000 total
- •Daily traffic: 5,000
- •Duration: 62,000 / 5,000 = 12.4 days
- •Run for: 2 weeks (14 days)
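The duration rule, including rounding up to full weeks so every weekday is covered equally, could be sketched as follows. The function name is illustrative.

```python
import math

def plan_duration(total_sample_size, daily_traffic):
    """Return (raw days needed, days rounded up to whole weeks)."""
    raw_days = total_sample_size / daily_traffic
    # Round up to full weeks to avoid day-of-week bias
    planned_days = math.ceil(raw_days / 7) * 7
    return raw_days, planned_days

raw_days, planned_days = plan_duration(62_000, 5_000)
print(f"needed: {raw_days:.1f} days, run for: {planned_days} days")
```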
Not Too Long (Opportunity Cost)
Trade-off:
- •Longer test = More confidence
- •Longer test = Delayed learnings, slower iteration
Guideline:
- •Most tests: 1-4 weeks
- •High-traffic sites: 1-2 weeks
- •Low-traffic sites: 2-4 weeks
- •Don't run > 1 month (diminishing returns)
Experiment Variants
Control (Current Experience)
What: The existing experience
Example:
- •Current checkout flow (5 steps)
- •Current button color (blue)
- •Current pricing page
Purpose: Baseline for comparison
Treatment (New Experience)
What: The proposed change
Example:
- •New checkout flow (3 steps)
- •New button color (green)
- •New pricing page
Purpose: Test hypothesis
Multiple Treatments (If Testing Different Approaches)
Example:
- •Control: 5-step checkout
- •Treatment A: 3-step checkout (combine steps)
- •Treatment B: 1-page checkout (all on one page)
Traffic Split:
- •Control: 33%
- •Treatment A: 33%
- •Treatment B: 34%
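Analysis:
- •Compare each treatment to control
- •Compare treatments to each other
- •Correct for multiple comparisons (e.g. Bonferroni: divide α by the number of comparisons), since each extra comparison raises the false positive risk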
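With three variants there are three pairwise comparisons, and each one adds to the false positive risk. The sketch below applies a Bonferroni correction, one conservative option among several; the variant names are illustrative.

```python
from itertools import combinations

variants = ["control", "treatment_a", "treatment_b"]
alpha = 0.05

# Every pair that the analysis will compare
comparisons = list(combinations(variants, 2))

# Bonferroni: split the overall alpha evenly across the comparisons
adjusted_alpha = alpha / len(comparisons)

for first, second in comparisons:
    print(f"{first} vs {second}: require p < {adjusted_alpha:.4f}")
```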
Randomization
User-Level Randomization (Consistent Experience)
What: Each user always sees same variant
How:
const variant = hashUserId(userId) % 2 === 0 ? 'control' : 'treatment';
When to Use:
- •Logged-in users
- •Want consistent experience
- •Testing flows (multi-step)
Pros:
- •Consistent experience
- •No confusion
Cons:
- •Requires user ID
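The `hashUserId` helper above is left undefined. One possible way to get a stable assignment is sketched below in Python. The experiment name used as a salt is an assumption, added so the same user can land in different groups across experiments.

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "checkout-test") -> str:
    """Deterministically map a user to 'control' or 'treatment'."""
    # Salting with the experiment name decorrelates assignments across tests
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 2
    return "control" if bucket == 0 else "treatment"

print(assign_variant("user-42"))
print(assign_variant("user-42"))  # same user, same variant
```

A cryptographic hash is used rather than Python's built-in `hash()`, which is randomized per process for strings.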
Session-Level (For Anonymous Users)
What: Each session sees same variant (but different sessions can differ)
How:
const variant = hashSessionId(sessionId) % 2 === 0 ? 'control' : 'treatment';
When to Use:
- •Anonymous users
- •Single-page tests
Pros:
- •Works for anonymous users
Cons:
- •Same user can see different variants across sessions
Stratified Sampling (For Segments)
What: Ensure even distribution across segments
Example:
- •Segment 1: Free users (50% control, 50% treatment)
- •Segment 2: Paid users (50% control, 50% treatment)
Why:
- •Avoid imbalanced segments
- •Enable segment analysis
Common Pitfalls
1. Peeking (Stopping Test Early When "Winning")
Problem:
Day 3: Treatment is winning! (p = 0.04) → Ship it!
Day 7: Treatment is losing... (p = 0.12) → Oops.
Why It's Bad:
- •Increases false positive rate
- •P-value fluctuates during test
Solution:
- •Decide sample size upfront
- •Don't look until test completes
- •Or use sequential testing (proper method)
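The inflation in false positives can be seen in a small A/A simulation, where both arms share the same true rate so every "significant" result is false. All parameters below (300 simulations, 10 looks, 200 users per arm per look) are arbitrary illustration values.

```python
import math
import random

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))

def simulate(n_sims=300, looks=10, users_per_look=200, rate=0.05, seed=1):
    """Return (false positive rate with peeking, rate with one final look)."""
    rng = random.Random(seed)
    peeking_hits = fixed_hits = 0
    for _ in range(n_sims):
        conv_a = conv_b = n = 0
        significant_at_any_look = False
        for _ in range(looks):
            conv_a += sum(rng.random() < rate for _ in range(users_per_look))
            conv_b += sum(rng.random() < rate for _ in range(users_per_look))
            n += users_per_look
            p = p_value(conv_a, n, conv_b, n)
            if p < 0.05:
                significant_at_any_look = True
        peeking_hits += significant_at_any_look
        fixed_hits += p < 0.05  # only the final look counts here
    return peeking_hits / n_sims, fixed_hits / n_sims

peeking_rate, fixed_rate = simulate()
print(f"stop at first p < 0.05: {peeking_rate:.1%} false positives")
print(f"single final analysis:  {fixed_rate:.1%} false positives")
```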
2. Sample Ratio Mismatch (Uneven Splits)
Problem:
Expected: 50% control, 50% treatment
Actual: 48% control, 52% treatment
Why It's Bad:
- •Indicates randomization bug
- •Results may be invalid
Solution:
- •Check sample ratio before analyzing
- •Investigate if mismatch > 1%, or if a chi-square test on the counts shows the split is unlikely to be chance
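Whether a mismatch is a real problem depends on how many users are in the test, which a chi-square goodness-of-fit check captures. The sketch assumes 10,000 total users; the `p < 0.001` alert threshold is a commonly used convention, not something the text prescribes.

```python
import math

def srm_p_value(n_control, n_treatment, expected_ratio=0.5):
    """Chi-square (1 degree of freedom) p-value for the observed split."""
    total = n_control + n_treatment
    expected_control = total * expected_ratio
    expected_treatment = total * (1 - expected_ratio)
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    # Survival function of chi-square with 1 df, via the normal tail
    return math.erfc(math.sqrt(chi2 / 2))

# Assumed counts: a 48% / 52% split across 10,000 users
p = srm_p_value(4_800, 5_200)
print(f"p = {p:.6f}", "-> investigate" if p < 0.001 else "-> looks fine")
```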
3. Novelty Effect (Users Trying New Thing)
Problem:
Week 1: Treatment is winning! (+20%)
Week 4: Treatment is same as control (0%)
Why It's Bad:
- •Users try new thing out of curiosity
- •Effect fades over time
Solution:
- •Run test longer (2-4 weeks)
- •Use holdout group for long-term measurement
- •Segment by new vs returning users
4. Seasonality (Testing During Holidays)
Problem:
Test during Black Friday: +50% conversion
Test during normal week: +5% conversion
Why It's Bad:
- •Holiday behavior is different
- •Results don't generalize
Solution:
- •Avoid testing during holidays
- •Or run test across multiple weeks (include holiday + normal)
Sequential Testing
What is Sequential Testing?
Traditional A/B Test:
- •Decide sample size upfront
- •Run until sample size reached
- •Analyze once at end
Sequential Testing:
- •Monitor continuously
- •Stop early if clear winner
- •Adjust significance threshold
How It Works
Algorithm:
- •Use adjusted significance threshold (not 0.05)
- •Account for multiple looks
- •Stop when threshold crossed
Example (Simplified):
Day 1: p = 0.10 → Continue
Day 3: p = 0.03 → Continue
Day 5: p = 0.001 → Stop! (clear winner)
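To show why p = 0.03 does not stop the test, here is a deliberately simple sketch that splits α evenly across a planned number of looks (a Bonferroni-style correction). This is an assumption for illustration only: the platforms listed below use more efficient methods, and the five planned looks are a made-up figure.

```python
def first_stopping_look(p_values_by_day, planned_looks, alpha=0.05):
    """Return (day to stop or None, per-look threshold)."""
    # Conservative correction: each look gets an equal share of alpha
    threshold = alpha / planned_looks
    for day, p in p_values_by_day:
        if p < threshold:
            return day, threshold
    return None, threshold

observed = [(1, 0.10), (3, 0.03), (5, 0.001)]
stop_day, threshold = first_stopping_look(observed, planned_looks=5)
print(f"per-look threshold: {threshold:.3f}, stop on day: {stop_day}")
```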
Tools That Support Sequential Testing
- •Statsig: Built-in sequential testing
- •GrowthBook: Bayesian statistics
- •Optimizely: Stats Engine (sequential)
Benefits
- •Faster results (stop early if clear winner)
- •Less opportunity cost
- •Detect large effects quickly
Drawbacks
- •Requires special tools
- •Raw p-values can't be compared against the usual 0.05 threshold
- •More complex
Holdout Groups
What is a Holdout Group?
Definition: Small % of users kept on the old experience long after launch
Example:
- •95% of users: New feature
- •5% of users: Old experience (holdout)
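A stable 95/5 split can be built on the same hashing idea as user-level randomization. In this sketch the salt string and the function name are hypothetical.

```python
import hashlib

def in_holdout(user_id: str, holdout_percent: int = 5,
               salt: str = "global-holdout") -> bool:
    """Return True if the user belongs to the holdout group."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return bucket < holdout_percent

users = [f"user-{i}" for i in range(20_000)]
holdout_share = sum(in_holdout(u) for u in users) / len(users)
print(f"holdout share: {holdout_share:.1%}")
```

Keeping the salt fixed keeps the same users in the holdout for its whole lifetime.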
Why Use Holdout Groups?
Measure Long-Term Effects:
- •A/B test shows +10% conversion in 2 weeks
- •Holdout shows +5% conversion after 6 months
- •Learning: Effect diminishes over time
Detect Delayed Negative Impacts:
- •A/B test shows +15% signups
- •Holdout shows +10% churn after 3 months
- •Learning: Feature attracts wrong users
How Long to Keep Holdout?
Guideline:
- •1-3 months for most features
- •6-12 months for major changes
- •Permanent only for a standing holdout that measures cumulative impact across many launches
When to Remove Holdout?
Remove if:
- •No long-term differences detected
- •Opportunity cost too high (5% of users on worse experience)
- •Feature is critical (everyone should have it)
Experiment Analysis
Step 1: Compare Primary Metric
Example:
- •Control: 5.0% conversion
- •Treatment: 5.5% conversion
- •Lift: +10% relative
- •P-value: 0.03 ✅
Decision: Treatment is statistically significantly better.
Step 2: Check Secondary Metrics
Example:
- •Revenue per user: $10.50 (control) vs $11.20 (treatment) ✅
- •Time to checkout: 3.2 min (control) vs 2.8 min (treatment) ✅
Decision: Secondary metrics also improved.
Step 3: Check Counter Metrics
Example:
- •Bounce rate: 30% (control) vs 32% (treatment) ⚠️
- •Error rate: 0.5% (control) vs 0.5% (treatment) ✅
Decision: Slight increase in bounce rate, investigate.
Step 4: Segment Analysis
Did it work for everyone?
| Segment | Control | Treatment | Lift |
|---|---|---|---|
| Mobile | 4.5% | 5.2% | +15% ✅ |
| Desktop | 5.5% | 5.8% | +5% ✅ |
| Free users | 3.0% | 3.6% | +20% ✅ |
| Paid users | 7.0% | 7.1% | +1% ⚠️ |
Learning: Works great for mobile and free users, minimal impact on paid users.
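The Lift column is the relative change from control to treatment. A minimal sketch of that calculation, using the rates from the table:

```python
segments = {
    "Mobile": (0.045, 0.052),
    "Desktop": (0.055, 0.058),
    "Free users": (0.030, 0.036),
    "Paid users": (0.070, 0.071),
}

def relative_lift(control, treatment):
    """Relative change of the treatment rate versus the control rate."""
    return (treatment - control) / control

for name, (control, treatment) in segments.items():
    print(f"{name}: {relative_lift(control, treatment):+.1%}")
```

Segment results rest on smaller samples than the overall test, so each segment needs its own significance check before acting on it.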
Step 5: Statistical Significance
Check:
- •P-value < 0.05 ✅
- •Confidence interval doesn't include 0 ✅
Example:
- •Lift: +10%
- •95% CI: [+5%, +15%]
- •Interpretation: We're 95% confident the true lift is between 5% and 15%.
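A confidence interval for the difference between two conversion rates can be sketched with a normal approximation. The counts (40,000 users per variant) are assumed, and the interval here is for the absolute difference; dividing by the control rate gives only a rough relative figure.

```python
import math
from statistics import NormalDist

def diff_confidence_interval(conv_a, n_a, conv_b, n_b, level=0.95):
    """Wald interval for the absolute difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# Assumed counts: 5.0% vs 5.5% conversion, 40,000 users per variant
low, high = diff_confidence_interval(2_000, 40_000, 2_200, 40_000)
print(f"absolute difference: [{low:.4f}, {high:.4f}]")
print(f"approx. relative lift: [{low / 0.05:+.1%}, {high / 0.05:+.1%}]")
```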
Step 6: Practical Significance
Is the improvement meaningful?
Example:
- •Statistically significant: Yes (p = 0.04)
- •Lift: +0.1% (5.0% → 5.005%)
- •Decision: Not practically significant (too small to matter)
Guideline:
- •Small lift but high volume → Ship (e.g., +0.1 percentage points on 1M users = 1,000 more conversions)
- •Large lift but low volume → Maybe ship (e.g., +50 percentage points on 100 users = 50 more conversions)
Decision Framework
Ship If:
✅ Positive: Treatment is better than control
✅ Significant: P-value < 0.05
✅ No Red Flags: Secondary and counter metrics look good
✅ Works for Key Segments: At least works for majority
Example:
- •Conversion: +10% (p = 0.03) ✅
- •Revenue: +8% (p = 0.05) ✅
- •Bounce rate: No change ✅
- •Works for mobile and desktop ✅
- •Decision: Ship!
Iterate If:
⚠️ Mixed Results: Some metrics up, some down
⚠️ Works for Some Segments Only: E.g., only mobile, not desktop
⚠️ Close to Significance: P = 0.06 (just missed)
Example:
- •Conversion: +10% (p = 0.03) ✅
- •Revenue: -5% (p = 0.08) ⚠️
- •Decision: Iterate. Conversion is up but revenue is down. Investigate why.
Kill If:
❌ Negative: Treatment is worse than control
❌ Not Significant: P-value > 0.05
❌ Opportunity Cost Too High: Could be working on better ideas
Example:
- •Conversion: +2% (p = 0.15) ❌
- •Took 4 weeks to test
- •Decision: Kill. Not significant, move on to next idea.
Tools
Feature Flags
LaunchDarkly:
- •Feature flag management
- •Gradual rollouts
- •Kill switches
Split.io:
- •Feature flags + experimentation
- •Real-time metrics
Unleash:
- •Open-source feature flags
- •Self-hosted option
Experimentation Platforms
Optimizely:
- •Full-stack experimentation
- •Visual editor for web
- •Stats Engine (sequential testing)
VWO (Visual Website Optimizer):
- •A/B testing for web
- •Heatmaps, session recordings
- •Visual editor
GrowthBook:
- •Open-source experimentation
- •Bayesian statistics
- •Feature flags
Statsig:
- •Modern experimentation platform
- •Sequential testing
- •Free tier
Analytics
Amplitude:
- •Product analytics
- •Funnel analysis
- •Cohort analysis
Mixpanel:
- •Event-based analytics
- •A/B test analysis
- •Retention analysis
PostHog:
- •Open-source product analytics
- •Feature flags
- •Session replay
A/B Testing for Engineers
1. Feature Flag Implementation
Node.js (LaunchDarkly):
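const express = require('express');
const LaunchDarkly = require('launchdarkly-node-server-sdk');

const app = express();
const client = LaunchDarkly.init(process.env.LAUNCHDARKLY_SDK_KEY);

app.get('/checkout', async (req, res) => {
  // Assumes auth middleware has already populated req.user
  const user = {
    key: req.user.id,
    email: req.user.email,
    custom: {
      plan: req.user.plan
    }
  };

  // The third argument is the fallback served if the flag cannot be evaluated
  const showNewCheckout = await client.variation('new-checkout-flow', user, false);

  if (showNewCheckout) {
    res.render('checkout-new');
  } else {
    res.render('checkout-old');
  }
});

// Wait for the SDK to load flag data before accepting traffic
// (top-level await is not available in CommonJS modules)
client.waitForInitialization().then(() => app.listen(3000)); // port is an example value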
Python (Statsig):
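import os

# Assumes a Flask app with Flask-Login providing current_user
from flask import Flask, render_template
from flask_login import current_user
from statsig import statsig
from statsig.statsig_user import StatsigUser

app = Flask(__name__)
statsig.initialize(os.environ['STATSIG_SERVER_KEY'])

@app.route('/checkout')
def checkout():
    # check_gate expects a StatsigUser object rather than a plain dict
    user = StatsigUser(
        user_id=str(current_user.id),
        email=current_user.email,
        custom={'plan': current_user.plan},
    )
    show_new_checkout = statsig.check_gate(user, 'new_checkout_flow')
    if show_new_checkout:
        return render_template('checkout_new.html')
    else:
        return render_template('checkout_old.html')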
2. Metric Instrumentation
Segment (Event Tracking):
const Analytics = require('analytics-node');
const analytics = new Analytics(process.env.SEGMENT_WRITE_KEY);

// Track checkout started
analytics.track({
  userId: user.id,
  event: 'Checkout Started',
  properties: {
    variant: showNewCheckout ? 'treatment' : 'control',
    cart_value: cart.total,
    items_count: cart.items.length
  }
});

// Track checkout completed
analytics.track({
  userId: user.id,
  event: 'Checkout Completed',
  properties: {
    variant: showNewCheckout ? 'treatment' : 'control',
    order_id: order.id,
    revenue: order.total
  }
});
3. Data Pipeline
Architecture:
Application
↓ (events)
Segment
↓ (forwards to)
├── Amplitude (analytics)
├── Mixpanel (analytics)
├── Data Warehouse (BigQuery, Snowflake)
└── Statsig (experimentation)
4. Results Dashboard
Grafana Dashboard:
{
  "dashboard": {
    "title": "A/B Test: New Checkout Flow",
    "panels": [
      {
        "title": "Conversion Rate by Variant",
        "targets": [
          {
            "expr": "sum(checkout_completed{variant='control'}) / sum(checkout_started{variant='control'})",
            "legendFormat": "Control"
          },
          {
            "expr": "sum(checkout_completed{variant='treatment'}) / sum(checkout_started{variant='treatment'})",
            "legendFormat": "Treatment"
          }
        ]
      },
      {
        "title": "Sample Size",
        "targets": [
          {
            "expr": "sum(checkout_started{variant='control'})",
            "legendFormat": "Control"
          },
          {
            "expr": "sum(checkout_started{variant='treatment'})",
            "legendFormat": "Treatment"
          }
        ]
      }
    ]
  }
}
Real Experiment Examples
Example 1: Button Color Test (Classic)
Hypothesis:
"If we change the CTA button from blue to orange, click-through rate will increase by 10%, because orange is more attention-grabbing."
Test:
- •Control: Blue button
- •Treatment: Orange button
- •Sample size: 10,000 per variant
- •Duration: 1 week
Results:
- •Control: 5.2% CTR
- •Treatment: 5.7% CTR
- •Lift: +9.6%
- •P-value: 0.04 ✅
Decision: Ship orange button.
Example 2: Checkout Flow Optimization
Hypothesis:
"If we reduce checkout from 5 steps to 3 steps, conversion will increase by 15%, because users abandon due to flow length."
Test:
- •Control: 5-step checkout
- •Treatment: 3-step checkout (combined steps)
- •Sample size: 50,000 per variant
- •Duration: 2 weeks
Results:
- •Control: 8.5% conversion
- •Treatment: 9.8% conversion
- •Lift: +15.3%
- •P-value: 0.001 ✅
Secondary Metrics:
- •Time to checkout: 4.2 min → 3.1 min ✅
- •Error rate: 2.1% → 1.8% ✅
Decision: Ship 3-step checkout.
Example 3: Pricing Page Variants
Hypothesis:
"If we show annual pricing first (instead of monthly), annual plan adoption will increase by 25%, because of the anchoring effect."
Test:
- •Control: Monthly pricing shown first
- •Treatment: Annual pricing shown first
- •Sample size: 20,000 per variant
- •Duration: 3 weeks
Results:
- •Control: 12% annual adoption
- •Treatment: 18% annual adoption
- •Lift: +50%
- •P-value: 0.001 ✅
Counter Metrics:
- •Overall conversion: 10.5% → 10.2% ⚠️ (slight drop)
Decision: Ship, but monitor overall conversion.
Example 4: Onboarding Flow
Hypothesis:
"If we add an interactive tutorial in onboarding, activation rate will increase by 30%, because users don't know how to get started."
Test:
- •Control: No tutorial
- •Treatment: Interactive tutorial (5 steps)
- •Sample size: 15,000 per variant
- •Duration: 2 weeks
Results:
- •Control: 25% activation rate
- •Treatment: 28% activation rate
- •Lift: +12%
- •P-value: 0.08 ❌ (not significant)
Segment Analysis:
- •New users: +20% (p = 0.03) ✅
- •Returning users: +2% (p = 0.5) ❌
Decision: Iterate. Show tutorial only to new users.
Advanced: Bayesian A/B Testing
Traditional (Frequentist) A/B Testing
Approach:
- •Null hypothesis: No difference between A and B
- •P-value: Probability of seeing a result at least this extreme if the null is true
- •Reject null if p < 0.05
Interpretation:
"If A and B were truly identical, a difference this large would appear less than 5% of the time."
Bayesian A/B Testing
Approach:
- •Prior belief: What we believe before test
- •Likelihood: Data from test
- •Posterior belief: Updated belief after test
Interpretation:
"There's a 95% probability that B is better than A."
Benefits of Bayesian
Easier to Interpret:
- •"95% probability B is better" (intuitive)
- •vs "p = 0.03" (confusing)
Can Stop Early:
- •Less exposed to the peeking problem, though frequent early stopping can still inflate false positives
- •Stop when confident enough
Incorporates Prior Knowledge:
- •Use historical data
- •More accurate with small samples
Tools That Use Bayesian
- •GrowthBook: Bayesian by default
- •VWO: Bayesian engine option
- •Google Optimize: Bayesian (sunset in September 2023)
Example
Test:
- •Control: 5.0% conversion (1000 users)
- •Treatment: 5.5% conversion (1000 users)
Frequentist:
- •P-value: about 0.62 (not significant)
- •Decision: Can't conclude
Bayesian:
- •Probability B > A: about 69% (uniform prior)
- •Expected lift: +10%
- •Decision: Leaning better, but not confident enough (need 95%)
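The probability that B beats A can be estimated by sampling from each variant's posterior. This sketch assumes a uniform Beta(1, 1) prior and the counts implied by the test above (50 and 55 conversions out of 1,000); the draw count and seed are arbitrary.

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior is Beta(1 + conversions, 1 + non-conversions)
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws

probability = prob_b_beats_a(50, 1_000, 55, 1_000)
print(f"P(B > A) is roughly {probability:.0%}")
```

An informative prior built from historical data would shift this estimate, which is the main lever Bayesian tools expose.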
Summary
Quick Reference
Experiment Types:
- •A/B test: Two variants
- •Multivariate: Multiple changes
- •Sequential: Stop early
- •Holdout: Long-term measurement
When to Experiment:
- •Significant features
- •Uncertain outcomes
- •Multiple options
- •Optimization
Process:
- •Define hypothesis
- •Choose metrics
- •Calculate sample size
- •Set duration
- •Design variants
- •Launch
- •Analyze
- •Decide
Metrics:
- •Primary: What we're optimizing
- •Secondary: Guardrails
- •Counter: Watch for negatives
Statistical Significance:
- •P-value < 0.05
- •Power > 80%
- •Minimum detectable effect
Common Pitfalls:
- •Peeking
- •Sample ratio mismatch
- •Novelty effect
- •Seasonality
Decision Framework:
- •Ship: Positive, significant, no red flags
- •Iterate: Mixed results
- •Kill: Negative, not significant
Tools:
- •Feature flags: LaunchDarkly, Split.io
- •Experimentation: Optimizely, Statsig, GrowthBook
- •Analytics: Amplitude, Mixpanel, PostHog