A/B Testing Statistical Analysis

This skill provides guidance on correctly analyzing A/B test results using appropriate statistical methods.

Overview

A/B testing compares a control group against a treatment group to determine if a change has a statistically significant effect. The key challenge is choosing the right statistical test based on the metric type.

Choosing the Right Statistical Test

Binary Metrics (Conversion, Click-through, etc.)

For metrics that are 0/1 outcomes (did user convert?), use a two-proportion z-test:

python

from statsmodels.stats.proportion import proportions_ztest

# counts: number of successes in each group
# nobs: total observations in each group
counts = [treatment_conversions, control_conversions]
nobs = [treatment_total, control_total]

# Two-sided test
stat, p_value = proportions_ztest(counts, nobs, alternative='two-sided')

Why not chi-squared? The two-proportion z-test is mathematically equivalent for 2x2 tables but directly gives you the z-statistic which can be useful for confidence intervals.

Continuous Metrics (Revenue, Time, etc.)

For continuous measurements, use Welch's t-test (does not assume equal variances):

python

from scipy import stats

# Two-sided Welch's t-test
stat, p_value = stats.ttest_ind(
    treatment_values,
    control_values,
    equal_var=False  # Welch's t-test
)

Important: Use equal_var=False to get Welch's t-test, which is more robust than Student's t-test when sample sizes or variances differ between groups.

When to Use Which Test

Metric Type	Examples	Test
Binary (0/1)	Conversion, Click, Purchase	Two-proportion z-test
Continuous	Revenue, Time, Page views	Welch's t-test
Count data	Number of items	Welch's t-test (if mean > 5)

Multiple Comparison Corrections

When testing multiple hypotheses, the probability of at least one false positive increases. Apply corrections:

Bonferroni Correction

The simplest and most conservative approach:

python

# If testing k hypotheses at significance level alpha:
adjusted_alpha = alpha / k

# A result is significant only if p_value < adjusted_alpha
significant = p_value < (0.05 / num_tests)

Example: Testing 3 metrics with α = 0.05:

•Adjusted threshold: 0.05 / 3 = 0.0167
•Only p-values below 0.0167 are considered significant

When to Apply Bonferroni

Apply Bonferroni when:

•Testing multiple metrics in the same experiment
•Comparing one treatment against multiple controls
•Running multiple experiments and want to control family-wise error rate

Do NOT apply across independent experiments if you accept some false positives.

Calculating Effect Sizes

Relative Lift (Relative Change)

The most common way to express A/B test results:

python

# Relative lift = (treatment - control) / control
relative_lift = (treatment_mean - control_mean) / control_mean

Interpretation: A lift of 0.15 means the treatment is 15% better than control.

Conversion Rate Calculation

python

import pandas as pd

# For a dataframe with 'variant' and 'converted' columns
control_data = df[df['variant'] == 'control']
treatment_data = df[df['variant'] == 'treatment']

control_rate = control_data['converted'].mean()
treatment_rate = treatment_data['converted'].mean()

Complete Analysis Workflow

Step 1: Load and Split Data

python

import pandas as pd

df = pd.read_csv('experiment.csv')
control = df[df['variant'] == 'control']
treatment = df[df['variant'] == 'treatment']

Step 2: Analyze Binary Metric

python

from statsmodels.stats.proportion import proportions_ztest

# Calculate rates
control_rate = control['converted'].mean()
treatment_rate = treatment['converted'].mean()

# Run test
counts = [treatment['converted'].sum(), control['converted'].sum()]
nobs = [len(treatment), len(control)]
_, p_value = proportions_ztest(counts, nobs, alternative='two-sided')

# Calculate lift
lift = (treatment_rate - control_rate) / control_rate

Step 3: Analyze Continuous Metric

python

from scipy import stats

# Calculate means
control_mean = control['revenue'].mean()
treatment_mean = treatment['revenue'].mean()

# Run Welch's t-test
_, p_value = stats.ttest_ind(
    treatment['revenue'],
    control['revenue'],
    equal_var=False
)

# Calculate lift
lift = (treatment_mean - control_mean) / control_mean

Step 4: Apply Multiple Testing Correction

python

num_tests = 3  # e.g., conversion, revenue, duration
adjusted_alpha = 0.05 / num_tests  # 0.0167

# Determine significance
is_significant = p_value < adjusted_alpha

Power Analysis

Power analysis helps determine how many additional samples are needed to detect an effect. Use this when a result is not statistically significant but you want to know if more data could help.

For Continuous Metrics (t-test)

python

from statsmodels.stats.power import TTestIndPower
import numpy as np

def additional_samples_needed(control_data, treatment_data, alpha, power=0.8):
    """Calculate additional samples needed for significance."""
    control_mean = control_data.mean()
    treatment_mean = treatment_data.mean()
    pooled_std = np.sqrt((control_data.var() + treatment_data.var()) / 2)

    if pooled_std == 0 or control_mean == treatment_mean:
        return 0

    # Cohen's d effect size
    effect_size = abs(treatment_mean - control_mean) / pooled_std

    power_analysis = TTestIndPower()
    required_n = power_analysis.solve_power(
        effect_size=effect_size,
        alpha=alpha,
        power=power,
        alternative='two-sided'
    )

    current_n = (len(control_data) + len(treatment_data)) / 2
    return max(0, int(np.ceil(required_n - current_n)))

For Binary Metrics (proportions)

python

from statsmodels.stats.power import zt_ind_solve_power
import numpy as np

def additional_samples_proportion(control_prop, treatment_prop, n_control, n_treatment, alpha, power=0.8):
    """Calculate additional samples needed for proportion test."""
    if control_prop == treatment_prop:
        return 0

    # Cohen's h effect size for proportions
    effect_size = 2 * (np.arcsin(np.sqrt(treatment_prop)) - np.arcsin(np.sqrt(control_prop)))

    required_n = zt_ind_solve_power(
        effect_size=abs(effect_size),
        alpha=alpha,
        power=power,
        alternative='two-sided'
    )

    current_n = (n_control + n_treatment) / 2
    return max(0, int(np.ceil(required_n - current_n)))

Key Concepts

•Power (1 - β): Probability of detecting a true effect. Typically 0.8 (80%)
•
Effect size: Standardized measure of the difference between groups
- •Cohen's d for continuous: (mean1 - mean2) / pooled_std
- •Cohen's h for proportions: 2 * (arcsin(√p1) - arcsin(√p2))
•Alpha (α): Significance level (e.g., 0.05 or Bonferroni-adjusted)

When to Use

•Result is not significant but effect looks promising
•Planning sample size for future experiments
•Deciding whether to continue collecting data

Common Pitfalls

•Using chi-squared for proportions: While valid, proportions_ztest is more direct
•Forgetting equal_var=False: Student's t-test assumes equal variances
•Not correcting for multiple tests: Inflates false positive rate
•Division by zero in lift: Handle cases where control mean is 0
•Confusing one-tailed vs two-tailed: Use two-tailed unless you have a strong prior
•Ignoring power analysis: A non-significant result doesn't mean no effect exists

Dependencies

bash

pip install scipy statsmodels pandas numpy

Key imports:

•scipy.stats.ttest_ind - Welch's t-test
•statsmodels.stats.proportion.proportions_ztest - Two-proportion z-test