Experiment Design

Types of Experiments

1. A/B Test (Two Variants)

What: Compare two versions (A vs B)

Example:

•Control (A): Blue "Buy Now" button
•Treatment (B): Green "Buy Now" button

When to Use:

•Testing single change
•Clear hypothesis
•Binary decision (ship or don't ship)

Pros:

•Simple to implement
•Easy to analyze
•Clear winner

Cons:

•Only tests one change
•Can't test interactions

2. Multivariate Test (Multiple Changes)

What: Test multiple changes simultaneously

Example:

•Variable 1: Button color (Blue, Green, Red)
•Variable 2: Button text ("Buy Now", "Add to Cart", "Get Started")
•Variants: 3 × 3 = 9 combinations
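The combination count above can be enumerated directly. The sketch below builds the full-factorial set of variants and estimates total traffic. The 31,000 users-per-variant figure is a hypothetical input chosen for illustration, not a recommendation.

```python
from itertools import product

colors = ["Blue", "Green", "Red"]
texts = ["Buy Now", "Add to Cart", "Get Started"]

# Full-factorial design: every color is paired with every text.
variants = [{"color": c, "text": t} for c, t in product(colors, texts)]

# Assumption for illustration: each combination needs the same sample
# as a single A/B arm (here a hypothetical 31,000 users).
users_per_variant = 31_000
total_users = len(variants) * users_per_variant

print(len(variants), total_users)
```

Adding one more option to either variable adds a whole row or column of combinations, which is why traffic requirements grow so quickly.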

When to Use:

•Testing multiple elements
•Want to find best combination
•Have enough traffic

Pros:

•Test interactions between variables
•Find optimal combination

Cons:

•Requires much more traffic
•Complex analysis
•Longer test duration

3. Sequential Testing

What: Continuously monitor and stop early if clear winner

Example:

•Start A/B test
•Check results daily
•Stop when the adjusted significance threshold is crossed (could be day 3 or day 14)

When to Use:

•Want to ship winners fast
•High traffic
•Using tools that support it (Statsig, GrowthBook)

Pros:

•Faster results
•Less opportunity cost

Cons:

•Requires special statistical methods
•Fixed-sample p-values are not valid under repeated looks, so thresholds must be adjusted

4. Holdout Groups (Long-Term Effects)

What: Keep small % of users on old experience for an extended period

Example:

•95% of users: New feature
•5% of users: Old experience (holdout)

When to Use:

•Measure long-term effects
•Detect delayed negative impacts
•Validate cumulative changes

Pros:

•Detects long-term issues
•Measures true impact

Cons:

•Some users get worse experience
•Requires ongoing monitoring

When to Experiment

✅ Experiment When:

•Significant Features (High Impact)
  - Major redesign
  - New pricing model
  - Core flow changes
•Uncertain Outcomes
  - Don't know if it will work
  - Conflicting opinions
  - No clear data
•Multiple Solution Options
  - Two different approaches
  - Want to pick the best
•Optimization Opportunities
  - Incremental improvements
  - Conversion optimization
  - Engagement optimization

❌ Don't Experiment When:

•Obvious Bugs/Fixes
  - Broken functionality
  - Security issues
  - Legal compliance
•Very Low Traffic
  - Can't reach statistical significance
  - Would take months
•Trivial Changes
  - Copy typo fix
  - Minor styling adjustment
•Ethical Issues
  - Manipulative dark patterns
  - Harmful to users

Experiment Design Process

Step 1: Define Hypothesis

Template:

"If we [change], then [metric] will [improve by X%], because [reasoning]."

Example:

"If we change the CTA button from blue to green, then click-through rate will increase by 10%, because green is more attention-grabbing."

Step 2: Choose Metrics

Primary Metric: What you're optimizing

•Example: Click-through rate

Secondary Metrics: Other important outcomes

•Example: Conversion rate, revenue per user

Counter Metrics: Watch for negatives

•Example: Bounce rate, time on page

Step 3: Determine Sample Size

Inputs:

•Baseline conversion rate: 5%
•Expected improvement: 10% relative lift (5% → 5.5%)
•Significance level: 0.05 (95% confidence)
•Power: 0.80 (80% chance of detecting effect)

Output:

•Sample size needed: ~31,000 users per variant

Tools:

•Evan Miller's calculator: https://www.evanmiller.org/ab-testing/sample-size.html
•Optimizely sample size calculator

Step 4: Set Test Duration

Factors:

•Sample size needed
•Daily traffic
•Weekly patterns (run at least 1-2 weeks)
•Business cycles

Example:

•Sample size: 31,000 per variant (62,000 total)
•Daily traffic: 5,000
•Duration: 62,000 / 5,000 = 12.4 days → Run for 2 weeks

Step 5: Design Variants

Control (A): Current experience

Treatment (B): New experience

Best Practices:

•Change only one thing (for A/B test)
•Make change meaningful (not trivial)
•Ensure variants are distinct

Step 6: Launch Test

Checklist:

Step 7: Analyze Results

Check:

•Statistical significance (p < 0.05)
•Practical significance (is improvement meaningful?)
•Secondary metrics (any red flags?)
•Segment analysis (works for everyone?)

Step 8: Decide (Ship, Iterate, Kill)

Ship if:

•Positive, significant, no red flags

Iterate if:

•Mixed results, some segments good

Kill if:

•Negative, not significant, opportunity cost too high

Choosing Metrics

Primary Metric (What We're Optimizing)

Characteristics:

•Directly tied to hypothesis
•Sensitive to change
•Measurable in test duration

Examples:

•Click-through rate (CTR)
•Conversion rate
•Sign-up completion rate
•Time to first action

Bad Primary Metrics:

•Revenue (too noisy, delayed)
•Retention (takes too long to measure)
•NPS (survey-based, low sample)

Secondary Metrics (Guardrails, Side Effects)

Purpose: Ensure we're not breaking other things

Examples:

•Revenue per user
•Engagement (sessions per user)
•Feature adoption
•Customer satisfaction

Counter Metrics (Watch for Negatives)

Purpose: Detect unintended negative consequences

Examples:

•Bounce rate (users leaving immediately)
•Error rate (technical issues)
•Support tickets (confusion)
•Churn rate (users leaving)

Example: Checkout Flow Test

Hypothesis:

"If we reduce checkout from 5 steps to 3 steps, conversion will increase by 15%."

Metrics:

•Primary: Checkout conversion rate
•Secondary: Average order value, time to complete checkout
•Counter: Cart abandonment rate, error rate, support tickets

Statistical Significance

P-Value < 0.05 (95% Confidence)

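What it Means:

•If there were truly no difference between variants, a result at least this extreme would occur less than 5% of the time
•It is not the probability that the effect is real; it caps the false positive rate at 5%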

Example:

•Control: 5.0% conversion
•Treatment: 5.5% conversion
•P-value: 0.03 ✅ (< 0.05, statistically significant)
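A p-value like the one in this example can be computed with a two-proportion z-test. The sketch below uses only the standard library. The user counts (19,000 per variant) are hypothetical, picked so that 5.0% vs 5.5% lands near the p-value quoted above.

```python
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rates (pooled z-test)."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (rate_b - rate_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical counts: 19,000 users per variant, 5.0% vs 5.5% conversion.
p_value = two_proportion_p_value(950, 19_000, 1_045, 19_000)
print(f"p = {p_value:.3f}")  # p = 0.029
```

The same rates with fewer users would give a larger p-value, so a conversion gap alone says nothing about significance without the sample size.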

Interpretation:

"If control and treatment truly performed the same, we would see a gap this large only about 3% of the time, so we reject the no-difference hypothesis."

Statistical Power (80%+)

What it Means:

•80% chance of detecting an effect if it exists
•Reduces false negatives

Example:

•Power: 80%
•Means: 20% chance of missing a real effect

Minimum Detectable Effect (MDE)

What it Means:

•Smallest effect size you can reliably detect
•Depends on sample size

Example:

•Baseline: 5% conversion
•Sample size: 10,000 per variant
•MDE: ~0.9% absolute (~18% relative) at 80% power and 0.05 significance
•Can detect: 5.0% → 5.9% or larger

Trade-off:

•Larger sample size → Smaller MDE (detect smaller effects)
•Smaller sample size → Larger MDE (only detect big effects)
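The trade-off can be made concrete with a rough MDE estimate. This is a sketch under simplifying assumptions: it uses the normal approximation and applies the baseline variance to both variants, so it slightly understates the true MDE. The function name and sample sizes are illustrative.

```python
from statistics import NormalDist

def approx_mde(baseline, n_per_variant, alpha=0.05, power=0.80):
    """Approximate absolute MDE for a two-sided test on conversion rates.

    Uses the baseline variance for both variants, which slightly
    understates the MDE when the treatment rate is higher.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    se = (2 * baseline * (1 - baseline) / n_per_variant) ** 0.5
    return (z_alpha + z_beta) * se

for n in (10_000, 31_000, 100_000):
    mde = approx_mde(0.05, n)
    print(f"n={n:>7,}: MDE ~ {mde * 100:.2f} pp absolute")
```

Because the MDE shrinks with the square root of the sample size, halving the MDE takes roughly four times the traffic.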

Sample Size Calculation

Formula (Simplified)

code

n = (Z_α/2 + Z_β)² × (p₁(1-p₁) + p₂(1-p₂)) / (p₁ - p₂)²

Where:
- n = sample size per variant
- Z_α/2 = 1.96 (for 95% confidence)
- Z_β = 0.84 (for 80% power)
- p₁ = baseline conversion rate
- p₂ = expected conversion rate

Example Calculation

Inputs:

•Baseline conversion rate (p₁): 5% = 0.05
•Expected improvement: 10% relative lift
•New conversion rate (p₂): 5.5% = 0.055
•Significance level (α): 0.05
•Power (1-β): 0.80

Calculation:

code

n = (1.96 + 0.84)² × (0.05×0.95 + 0.055×0.945) / (0.05 - 0.055)²
n = 7.84 × (0.0475 + 0.052) / 0.000025
n = 7.84 × 0.0995 / 0.000025
n ≈ 31,200 per variant

Total sample size: 62,400 users
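The same calculation can be scripted so the inputs are easy to vary. This sketch implements the simplified formula above with the standard library. It uses unrounded z-values, so the result is slightly higher than the hand calculation with 1.96 and 0.84.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(p1, p2, alpha=0.05, power=0.80):
    """Users needed per variant to detect p1 -> p2 (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

n = sample_size_per_variant(0.05, 0.055)
print(f"{n:,} per variant, {2 * n:,} total")
```

Online calculators may use slightly different formulas, so small differences from this figure are expected.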

Using Online Calculators

Evan Miller's Calculator:

•Go to https://www.evanmiller.org/ab-testing/sample-size.html
•Enter baseline conversion rate: 5%
•Enter minimum detectable effect: 10% (relative)
•Get sample size: ~31,000 per variant

Optimizely Calculator:

•Go to Optimizely sample size calculator
•Enter baseline: 5%
•Enter minimum detectable effect: 0.5% (absolute)
•Get sample size: ~31,000 per variant

Test Duration

Minimum Duration: 1-2 Weeks

Why:

•Capture weekly patterns (weekday vs weekend)
•Avoid day-of-week bias
•Account for user behavior cycles

Example:

•Don't run Monday-Wednesday only
•Run at least Monday-Sunday (1 full week)

Full Business Cycles

Examples:

•E-commerce: Include payday (1st and 15th of month)
•B2B SaaS: Include full week (avoid Friday-only)
•Seasonal: Avoid holidays (unless testing holiday-specific)

Enough Data for Significance

Formula:

code

Duration = Sample Size Needed / Daily Traffic

Example:

•Sample size: 62,000 total
•Daily traffic: 5,000
•Duration: 62,000 / 5,000 = 12.4 days
•Run for: 2 weeks (14 days)
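The duration arithmetic, including the round-up to whole weeks, can be sketched as follows. The function name and the 7-day minimum are assumptions for illustration.

```python
from math import ceil

def duration_days(total_sample, daily_traffic, min_days=7):
    """Days needed to reach the sample size, rounded up to whole weeks."""
    raw_days = ceil(total_sample / daily_traffic)
    days = max(raw_days, min_days)
    return ceil(days / 7) * 7  # whole weeks avoid day-of-week bias

print(duration_days(62_000, 5_000))  # 14
```

Rounding up to full weeks means every weekday appears equally often in both variants.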

Not Too Long (Opportunity Cost)

Trade-off:

•Longer test = More confidence
•Longer test = Delayed learnings, slower iteration

Guideline:

•Most tests: 1-4 weeks
•High-traffic sites: 1-2 weeks
•Low-traffic sites: 2-4 weeks
•Don't run > 1 month (diminishing returns)

Experiment Variants

Control (Current Experience)

What: The existing experience

Example:

•Current checkout flow (5 steps)
•Current button color (blue)
•Current pricing page

Purpose: Baseline for comparison

Treatment (New Experience)

What: The proposed change

Example:

•New checkout flow (3 steps)
•New button color (green)
•New pricing page

Purpose: Test hypothesis

Multiple Treatments (If Testing Different Approaches)

Example:

•Control: 5-step checkout
•Treatment A: 3-step checkout (combine steps)
•Treatment B: 1-page checkout (all on one page)

Traffic Split:

•Control: 33%
•Treatment A: 33%
•Treatment B: 34%

Analysis:

•Compare each treatment to control
•Compare treatments to each other
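Comparing every pair of arms means running several tests on the same data, which raises the chance of at least one false positive. One simple and conservative way to handle this is a Bonferroni correction. The sketch below is one option, not the only valid method, and the arm names are placeholders.

```python
from itertools import combinations

def bonferroni_threshold(arms, alpha=0.05):
    """List all pairwise comparisons and the per-comparison threshold."""
    pairs = list(combinations(arms, 2))
    return pairs, alpha / len(pairs)

pairs, threshold = bonferroni_threshold(["control", "treatment_a", "treatment_b"])
for a, b in pairs:
    print(f"{a} vs {b}: significant only if p < {threshold:.4f}")
```

With three arms there are three comparisons, so each one is judged against roughly 0.0167 instead of 0.05.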

Randomization

User-Level Randomization (Consistent Experience)

What: Each user always sees same variant

How:

javascript

const variant = hashUserId(userId) % 2 === 0 ? 'control' : 'treatment';

When to Use:

•Logged-in users
•Want consistent experience
•Testing flows (multi-step)

Pros:

•Consistent experience
•No confusion

Cons:

•Requires user ID

Session-Level (For Anonymous Users)

What: Each session sees same variant (but different sessions can differ)

How:

javascript

const variant = hashSessionId(sessionId) % 2 === 0 ? 'control' : 'treatment';

When to Use:

•Anonymous users
•Single-page tests

Pros:

•Works for anonymous users

Cons:

•Same user can see different variants across sessions

Stratified Sampling (For Segments)

What: Ensure even distribution across segments

Example:

•Segment 1: Free users (50% control, 50% treatment)
•Segment 2: Paid users (50% control, 50% treatment)

Why:

•Avoid imbalanced segments
•Enable segment analysis
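One way to get an even split inside each segment is to shuffle users within the segment and alternate assignments. This sketch assumes the user list is known upfront, which is often not true for live traffic. In that case a per-segment hash or counter would be needed instead. All names here are illustrative.

```python
import random
from collections import defaultdict

def stratified_assign(users, seed=0):
    """users: list of (user_id, segment). Returns {user_id: variant}.

    Shuffles users within each segment, then alternates variants so
    every segment is split as evenly as possible.
    """
    rng = random.Random(seed)
    by_segment = defaultdict(list)
    for user_id, segment in users:
        by_segment[segment].append(user_id)

    assignment = {}
    for ids in by_segment.values():
        rng.shuffle(ids)
        for i, user_id in enumerate(ids):
            assignment[user_id] = "control" if i % 2 == 0 else "treatment"
    return assignment

users = [(f"free-{i}", "free") for i in range(100)]
users += [(f"paid-{i}", "paid") for i in range(20)]
assignment = stratified_assign(users)
paid_control = sum(assignment[f"paid-{i}"] == "control" for i in range(20))
print(paid_control)  # 10
```

With plain randomization, a small segment such as the 20 paid users could easily end up split 13 to 7 by chance.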

Common Pitfalls

1. Peeking (Stopping Test Early When "Winning")

Problem:

code

Day 3: Treatment is winning! (p = 0.04) → Ship it!
Day 7: Treatment is losing... (p = 0.12) → Oops.

Why It's Bad:

•Increases false positive rate
•P-value fluctuates during test

Solution:

•Decide sample size upfront
•Don't look until test completes
•Or use sequential testing (proper method)
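The inflation is easy to see in a simulated A/A test, where both arms share the same true rate and every "significant" result is a false positive. The sketch below compares stopping at any of 10 looks with checking only once at the end. The rate, look count, and traffic per look are arbitrary assumptions.

```python
import random
from statistics import NormalDist

def two_sided_p(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test p-value (pooled standard error)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate_aa(rate=0.2, looks=10, users_per_look=200, sims=400, seed=7):
    """Return (false positive rate with peeking, rate with one final look)."""
    rng = random.Random(seed)
    peek_hits = final_hits = 0
    for _ in range(sims):
        conv_a = conv_b = n = 0
        crossed = False
        for _ in range(looks):
            conv_a += sum(rng.random() < rate for _ in range(users_per_look))
            conv_b += sum(rng.random() < rate for _ in range(users_per_look))
            n += users_per_look
            p = two_sided_p(conv_a, n, conv_b, n)
            crossed = crossed or p < 0.05  # would have stopped at this look
        peek_hits += crossed
        final_hits += p < 0.05  # p from the last look only
    return peek_hits / sims, final_hits / sims

peek_rate, final_rate = simulate_aa()
print(f"stop at any look: {peek_rate:.1%}, single final look: {final_rate:.1%}")
```

The single-look rate should sit near the nominal 5%, while the any-look rate is typically several times higher.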

2. Sample Ratio Mismatch (Uneven Splits)

Problem:

code

Expected: 50% control, 50% treatment
Actual: 48% control, 52% treatment

Why It's Bad:

•Indicates randomization bug
•Results may be invalid

Solution:

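•Check sample ratio before analyzing
•Test the observed counts with a chi-square test instead of eyeballing percentages; investigate if the p-value is very small (p < 0.001 is a common threshold)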
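Whether a given split is suspicious depends on how many users are behind it. A chi-square goodness-of-fit test captures this. The sketch below uses the standard library only, and the p < 0.001 cutoff is a commonly used convention rather than a fixed rule.

```python
from math import erfc, sqrt

def srm_p_value(control_n, treatment_n, expected_control_share=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom)."""
    total = control_n + treatment_n
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control_n - expected_control) ** 2 / expected_control
            + (treatment_n - expected_treatment) ** 2 / expected_treatment)
    # Survival function of the chi-square distribution with 1 df.
    return erfc(sqrt(chi2 / 2))

for control_n, treatment_n in [(48, 52), (4_800, 5_200)]:
    p = srm_p_value(control_n, treatment_n)
    flag = "investigate" if p < 0.001 else "ok"
    print(f"{control_n} vs {treatment_n}: p = {p:.5f} ({flag})")
```

Both inputs are a 48/52 split, but only the one with 10,000 users is strong evidence of a randomization problem.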

3. Novelty Effect (Users Trying New Thing)

Problem:

code

Week 1: Treatment is winning! (+20%)
Week 4: Treatment is same as control (0%)

Why It's Bad:

•Users try new thing out of curiosity
•Effect fades over time

Solution:

•Run test longer (2-4 weeks)
•Use holdout group for long-term measurement
•Segment by new vs returning users

4. Seasonality (Testing During Holidays)

Problem:

code

Test during Black Friday: +50% conversion
Test during normal week: +5% conversion

Why It's Bad:

•Holiday behavior is different
•Results don't generalize

Solution:

•Avoid testing during holidays
•Or run test across multiple weeks (include holiday + normal)

Sequential Testing

What is Sequential Testing?

Traditional A/B Test:

•Decide sample size upfront
•Run until sample size reached
•Analyze once at end

Sequential Testing:

•Monitor continuously
•Stop early if clear winner
•Adjust significance threshold

How It Works

Algorithm:

•Use adjusted significance threshold (not 0.05)
•Account for multiple looks
•Stop when threshold crossed

Example (Simplified):

code

Day 1: p = 0.10 → Continue
Day 3: p = 0.03 → Continue
Day 5: p = 0.001 → Stop! (clear winner)
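The example above can be reproduced with a deliberately simple adjustment: splitting the 0.05 budget evenly across the planned looks (a Bonferroni correction). This is a conservative stand-in for illustration. The platforms listed below use more efficient methods, and the 14 planned looks are an assumption.

```python
def first_stopping_day(p_values_by_day, planned_looks, alpha=0.05):
    """Return the first day whose p-value crosses the adjusted threshold.

    Splitting alpha evenly across the planned looks keeps the overall
    false positive rate at or below alpha, at the cost of some power.
    """
    threshold = alpha / planned_looks
    for day, p in p_values_by_day:
        if p < threshold:
            return day, threshold
    return None, threshold

# The p-values from the example above, with 14 planned daily looks.
observed = [(1, 0.10), (3, 0.03), (5, 0.001)]
stop_day, threshold = first_stopping_day(observed, planned_looks=14)
print(stop_day, round(threshold, 5))
```

Day 3 would count as significant against an unadjusted 0.05, but it does not cross the per-look threshold, so the test continues until day 5.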

Tools That Support Sequential Testing

•Statsig: Built-in sequential testing
•GrowthBook: Bayesian statistics
•Optimizely: Stats Engine (sequential)

Benefits

•Faster results (stop early if clear winner)
•Less opportunity cost
•Detect large effects quickly

Drawbacks

•Requires special tools
•Can't use traditional p-value
•More complex

Holdout Groups

What is a Holdout Group?

Definition: Small % of users kept on old experience for an extended period

Example:

•95% of users: New feature
•5% of users: Old experience (holdout)

Why Use Holdout Groups?

Measure Long-Term Effects:

•A/B test shows +10% conversion in 2 weeks
•Holdout shows +5% conversion after 6 months
•Learning: Effect diminishes over time

Detect Delayed Negative Impacts:

•A/B test shows +15% signups
•Holdout shows +10% churn after 3 months
•Learning: Feature attracts wrong users

How Long to Keep Holdout?

Guideline:

•1-3 months for most features
•6-12 months for major changes
•Longer only when measuring the cumulative impact of many launches

When to Remove Holdout?

Remove if:

•No long-term differences detected
•Opportunity cost too high (5% of users on worse experience)
•Feature is critical (everyone should have it)

Experiment Analysis

Step 1: Compare Primary Metric

Example:

•Control: 5.0% conversion
•Treatment: 5.5% conversion
•Lift: +10% relative
•P-value: 0.03 ✅

Decision: Treatment is statistically significantly better.

Step 2: Check Secondary Metrics

Example:

•Revenue per user: $10.50 (control) vs $11.20 (treatment) ✅
•Time to checkout: 3.2 min (control) vs 2.8 min (treatment) ✅

Decision: Secondary metrics also improved.

Step 3: Check Counter Metrics

Example:

•Bounce rate: 30% (control) vs 32% (treatment) ⚠️
•Error rate: 0.5% (control) vs 0.5% (treatment) ✅

Decision: Slight increase in bounce rate, investigate.

Step 4: Segment Analysis

Did it work for everyone?

Segment	Control	Treatment	Lift
Mobile	4.5%	5.2%	+15% ✅
Desktop	5.5%	5.8%	+5% ✅
Free users	3.0%	3.6%	+20% ✅
Paid users	7.0%	7.1%	+1% ⚠️

Learning: Works great for mobile and free users, minimal impact on paid users.

Step 5: Statistical Significance

Check:

•P-value < 0.05 ✅
•Confidence interval doesn't include 0 ✅

Example:

•Lift: +10%
•95% CI: [+5%, +15%]
•Interpretation: We're 95% confident the true lift is between 5% and 15%.
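A confidence interval for the absolute difference in conversion rates can be sketched with a normal approximation. Note that this is an interval in percentage points, not the relative-lift interval shown in the example above, and the counts are hypothetical.

```python
from statistics import NormalDist

def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald interval for (treatment rate - control rate)."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    se = (rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b) ** 0.5
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = rate_b - rate_a
    return diff - z * se, diff + z * se

# Hypothetical counts: 19,000 users per variant, 5.0% vs 5.5% conversion.
low, high = diff_confidence_interval(950, 19_000, 1_045, 19_000)
print(f"{low * 100:+.2f} pp to {high * 100:+.2f} pp")
```

An interval that stays above zero agrees with a significant two-sided test, and its width shows how precisely the effect is estimated.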

Step 6: Practical Significance

Is the improvement meaningful?

Example:

•Statistically significant: Yes (p = 0.04)
•Lift: +0.1% relative (5.0% → 5.005%)
•Decision: Not practically significant (too small to matter)

Guideline:

•Small lift but high volume → Ship (e.g., +0.1 percentage points on 1M users = 1,000 more conversions)
•Large lift but low volume → Maybe ship (e.g., +50 percentage points on 100 users = 50 more conversions)

Decision Framework

Ship If:

✅ Positive: Treatment is better than control
✅ Significant: P-value < 0.05
✅ No Red Flags: Secondary and counter metrics look good
✅ Works for Key Segments: At least works for majority

Example:

•Conversion: +10% (p = 0.03) ✅
•Revenue: +8% (p = 0.05) ✅
•Bounce rate: No change ✅
•Works for mobile and desktop ✅
•Decision: Ship!

Iterate If:

⚠️ Mixed Results: Some metrics up, some down
⚠️ Works for Some Segments Only: E.g., only mobile, not desktop
⚠️ Close to Significance: P = 0.06 (just missed)

Example:

•Conversion: +10% (p = 0.03) ✅
•Revenue: -5% (p = 0.08) ⚠️
•Decision: Iterate. Conversion is up but revenue is down. Investigate why.

Kill If:

❌ Negative: Treatment is worse than control
❌ Not Significant: P-value > 0.05
❌ Opportunity Cost Too High: Could be working on better ideas

Example:

•Conversion: +2% (p = 0.15) ❌
•Took 4 weeks to test
•Decision: Kill. Not significant, move on to next idea.

Tools

Feature Flags

LaunchDarkly:

•Feature flag management
•Gradual rollouts
•Kill switches

Split.io:

•Feature flags + experimentation
•Real-time metrics

Unleash:

•Open-source feature flags
•Self-hosted option

Experimentation Platforms

Optimizely:

•Full-stack experimentation
•Visual editor for web
•Stats Engine (sequential testing)

VWO (Visual Website Optimizer):

•A/B testing for web
•Heatmaps, session recordings
•Visual editor

GrowthBook:

•Open-source experimentation
•Bayesian statistics
•Feature flags

Statsig:

•Modern experimentation platform
•Sequential testing
•Free tier

Analytics

Amplitude:

•Product analytics
•Funnel analysis
•Cohort analysis

Mixpanel:

•Event-based analytics
•A/B test analysis
•Retention analysis

PostHog:

•Open-source product analytics
•Feature flags
•Session replay

A/B Testing for Engineers

1. Feature Flag Implementation

Node.js (LaunchDarkly):

javascript

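const express = require('express');
const LaunchDarkly = require('launchdarkly-node-server-sdk');

const app = express();
const client = LaunchDarkly.init(process.env.LAUNCHDARKLY_SDK_KEY);

app.get('/checkout', async (req, res) => {
  // Assumes authentication middleware has populated req.user.
  const user = {
    key: req.user.id,
    email: req.user.email,
    custom: {
      plan: req.user.plan
    }
  };

  // The third argument is the fallback value if the flag cannot be evaluated.
  const showNewCheckout = await client.variation('new-checkout-flow', user, false);

  if (showNewCheckout) {
    res.render('checkout-new');
  } else {
    res.render('checkout-old');
  }
});

// Top-level await is not available in CommonJS modules, so start the
// server (port 3000 is a placeholder) once the SDK has initialized.
client.waitForInitialization().then(() => app.listen(3000));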

Python (Statsig):

python

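import os

from flask import Flask, render_template
from flask_login import current_user
from statsig import statsig
from statsig.statsig_user import StatsigUser

app = Flask(__name__)
statsig.initialize(os.environ['STATSIG_SERVER_KEY'])

@app.route('/checkout')
def checkout():
    # check_gate expects a StatsigUser object rather than a plain dict.
    # current_user assumes Flask-Login is configured for this app.
    user = StatsigUser(
        user_id=str(current_user.id),
        email=current_user.email,
        custom={'plan': current_user.plan},
    )

    show_new_checkout = statsig.check_gate(user, 'new_checkout_flow')

    if show_new_checkout:
        return render_template('checkout_new.html')
    else:
        return render_template('checkout_old.html')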

2. Metric Instrumentation

Segment (Event Tracking):

javascript

const Analytics = require('analytics-node');
const analytics = new Analytics(process.env.SEGMENT_WRITE_KEY);

// Track checkout started
analytics.track({
  userId: user.id,
  event: 'Checkout Started',
  properties: {
    variant: showNewCheckout ? 'treatment' : 'control',
    cart_value: cart.total,
    items_count: cart.items.length
  }
});

// Track checkout completed
analytics.track({
  userId: user.id,
  event: 'Checkout Completed',
  properties: {
    variant: showNewCheckout ? 'treatment' : 'control',
    order_id: order.id,
    revenue: order.total
  }
});

3. Data Pipeline

Architecture:

code

Application
    ↓ (events)
Segment
    ↓ (forwards to)
├── Amplitude (analytics)
├── Mixpanel (analytics)
├── Data Warehouse (BigQuery, Snowflake)
└── Statsig (experimentation)

4. Results Dashboard

Grafana Dashboard:

json

{
  "dashboard": {
    "title": "A/B Test: New Checkout Flow",
    "panels": [
      {
        "title": "Conversion Rate by Variant",
        "targets": [
          {
            "expr": "sum(checkout_completed{variant='control'}) / sum(checkout_started{variant='control'})",
            "legendFormat": "Control"
          },
          {
            "expr": "sum(checkout_completed{variant='treatment'}) / sum(checkout_started{variant='treatment'})",
            "legendFormat": "Treatment"
          }
        ]
      },
      {
        "title": "Sample Size",
        "targets": [
          {
            "expr": "sum(checkout_started{variant='control'})",
            "legendFormat": "Control"
          },
          {
            "expr": "sum(checkout_started{variant='treatment'})",
            "legendFormat": "Treatment"
          }
        ]
      }
    ]
  }
}

Real Experiment Examples

Example 1: Button Color Test (Classic)

Hypothesis:

"If we change the CTA button from blue to orange, click-through rate will increase by 10%, because orange is more attention-grabbing."

Test:

•Control: Blue button
•Treatment: Orange button
•Sample size: 10,000 per variant
•Duration: 1 week

Results:

•Control: 5.2% CTR
•Treatment: 5.7% CTR
•Lift: +9.6%
•P-value: 0.04 ✅

Decision: Ship orange button.

Example 2: Checkout Flow Optimization

Hypothesis:

"If we reduce checkout from 5 steps to 3 steps, conversion will increase by 15%, because users abandon due to flow length."

Test:

•Control: 5-step checkout
•Treatment: 3-step checkout (combined steps)
•Sample size: 50,000 per variant
•Duration: 2 weeks

Results:

•Control: 8.5% conversion
•Treatment: 9.8% conversion
•Lift: +15.3%
•P-value: 0.001 ✅

Secondary Metrics:

•Time to checkout: 4.2 min → 3.1 min ✅
•Error rate: 2.1% → 1.8% ✅

Decision: Ship 3-step checkout.

Example 3: Pricing Page Variants

Hypothesis:

"If we show annual pricing first (instead of monthly), annual plan adoption will increase by 25%, because of the anchoring effect."

Test:

•Control: Monthly pricing shown first
•Treatment: Annual pricing shown first
•Sample size: 20,000 per variant
•Duration: 3 weeks

Results:

•Control: 12% annual adoption
•Treatment: 18% annual adoption
•Lift: +50%
•P-value: 0.001 ✅

Counter Metrics:

•Overall conversion: 10.5% → 10.2% ⚠️ (slight drop)

Decision: Ship, but monitor overall conversion.

Example 4: Onboarding Flow

Hypothesis:

"If we add an interactive tutorial in onboarding, activation rate will increase by 30%, because users don't know how to get started."

Test:

•Control: No tutorial
•Treatment: Interactive tutorial (5 steps)
•Sample size: 15,000 per variant
•Duration: 2 weeks

Results:

•Control: 25% activation rate
•Treatment: 28% activation rate
•Lift: +12%
•P-value: 0.08 ❌ (not significant)

Segment Analysis:

•New users: +20% (p = 0.03) ✅
•Returning users: +2% (p = 0.5) ❌

Decision: Iterate. Show tutorial only to new users, and confirm the segment result with a follow-up test.

Advanced: Bayesian A/B Testing

Traditional (Frequentist) A/B Testing

Approach:

•Null hypothesis: No difference between A and B
•P-value: Probability of seeing a result at least this extreme if the null is true
•Reject null if p < 0.05

Interpretation:

"If there were no real difference, a result this extreme would occur less than 5% of the time."

Bayesian A/B Testing

Approach:

•Prior belief: What we believe before test
•Likelihood: Data from test
•Posterior belief: Updated belief after test

Interpretation:

"There's a 95% probability that B is better than A."

Benefits of Bayesian

•Easier to Interpret:
  - "95% probability B is better" (intuitive)
  - vs "p = 0.03" (confusing)
•Can Stop Early:
  - Less sensitive to peeking, though stopping rules should still be set upfront
  - Stop when confident enough
•Incorporates Prior Knowledge:
  - Use historical data
  - More stable with small samples when the prior is well chosen

Tools That Use Bayesian

•GrowthBook: Bayesian by default
•VWO: Bayesian engine option
•Google Optimize: Bayesian (discontinued in September 2023)

Example

Test:

•Control: 5.0% conversion (1000 users)
•Treatment: 5.5% conversion (1000 users)

Frequentist:

•P-value: ~0.62 (not significant)
•Decision: Can't conclude

Bayesian:

•Probability B > A: ~69% (with uniform priors)
•Expected lift: +10%
•Decision: Possibly better, but not confident enough (need 95%)
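The "Probability B > A" figure can be estimated by sampling from each variant's posterior. The sketch below assumes uniform Beta(1, 1) priors and uses Monte Carlo draws from the standard library. Commercial tools may use different priors or closed-form calculations, so their numbers can differ.

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000, seed=1):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws

# 5.0% vs 5.5% conversion on 1,000 users each, as in the test above.
probability = prob_b_beats_a(50, 1_000, 55, 1_000)
print(f"P(B > A) ~ {probability:.2f}")
```

With only 1,000 users per variant the two posteriors overlap heavily, so the probability stays well short of a 95% decision threshold.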

Summary

Quick Reference

Experiment Types:

•A/B test: Two variants
•Multivariate: Multiple changes
•Sequential: Stop early
•Holdout: Long-term measurement

When to Experiment:

•Significant features
•Uncertain outcomes
•Multiple options
•Optimization

Process:

•Define hypothesis
•Choose metrics
•Calculate sample size
•Set duration
•Design variants
•Launch
•Analyze
•Decide

Metrics:

•Primary: What we're optimizing
•Secondary: Guardrails
•Counter: Watch for negatives

Statistical Significance:

•P-value < 0.05
•Power > 80%
•Minimum detectable effect

Common Pitfalls:

•Peeking
•Sample ratio mismatch
•Novelty effect
•Seasonality

Decision Framework:

•Ship: Positive, significant, no red flags
•Iterate: Mixed results
•Kill: Negative, not significant

Tools:

•Feature flags: LaunchDarkly, Split.io
•Experimentation: Optimizely, Statsig, GrowthBook
•Analytics: Amplitude, Mixpanel, PostHog