Who this is for
- Applied Scientists and ML practitioners running A/B tests on models, ranking systems, or product features.
- Data Scientists supporting product teams with experiment analysis.
- Engineers and analysts who need to make ship/no-ship decisions grounded in evidence.
Prerequisites
- Basic probability (events, independence) and familiarity with averages and proportions.
- Comfort with simple algebra and reading charts.
- Some exposure to A/B testing terminology (control, treatment, metric).
Why this matters
As an Applied Scientist, you will regularly answer questions like: Did the new ranking model truly improve CTR? Is the uplift in revenue real or just noise? Can we safely roll out the feature without harming core metrics? Statistical significance gives a principled way to manage risk and decide with confidence.
- Make launch decisions: Quantify evidence for or against a change.
- Communicate rigor: Share p-values and confidence intervals that stakeholders trust.
- Protect users and business: Use guardrails and power analysis to avoid costly mistakes.
Concept explained simply
Hypothesis testing in one minute
- Null hypothesis (H0): No true effect (e.g., no difference between control and treatment).
- Alternative (H1): There is an effect (e.g., treatment is different).
- p-value: The probability, if H0 were true, of seeing a result at least as extreme as yours. A small p-value means your data would be unlikely under H0.
- Significance level (alpha, α): Your false-positive risk budget, commonly 0.05.
- Confidence interval (CI): A plausible range for the true effect. A 95% CI that excludes 0 aligns with a 5% two-sided test.
- Power (1−β): Chance you detect a real effect of a given size. Higher power means fewer false negatives.
Mental model
Think of testing like a risk budget. You can afford α (say 5%) chance of being wrong when declaring a win. A low p-value spends that budget to make a claim; a confidence interval shows the realistic range of impact. Power tells you whether your test is sensitive enough to detect a meaningful win.
Quick glossary
- Type I error (false positive): Declaring a win when there is none. Controlled by α.
- Type II error (false negative): Missing a real win. Reduced by a larger sample size, lower variance, or a larger true effect.
- MDE (Minimum Detectable Effect): The smallest effect you want to be able to detect at your chosen power.
Worked examples
Example 1: Conversion rate difference (proportions z-test)
Setup: Control A: n=50,000, conversions=5,500 (11.00%). Treatment B: n=50,000, conversions=5,720 (11.44%).
- Observed difference: 0.1144 − 0.1100 = 0.0044 (0.44 pp).
- Pooled rate: p = (5,500 + 5,720) / 100,000 = 0.1122.
- SE = sqrt[p(1−p)(1/nA + 1/nB)] ≈ sqrt(0.1122×0.8878×(2/50,000)) ≈ 0.0020.
- z = 0.0044 / 0.0020 ≈ 2.20 → two-sided p ≈ 0.028.
- 95% CI for difference: 0.0044 ± 1.96×0.0020 ≈ [0.0005, 0.0083] → positive and excludes 0.
Decision: Significant at α=0.05. Estimated uplift is about 0.44 pp (95% CI roughly 0.05–0.83 pp).
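To reproduce these numbers programmatically, here is a minimal sketch of the pooled two-proportion z-test using only the Python standard library; the function and variable names (e.g., two_prop_ztest) are illustrative, and the CI uses the pooled SE to match the simplification in the worked example.

```python
from math import sqrt
from statistics import NormalDist

def two_prop_ztest(conv_a, n_a, conv_b, n_b, alpha=0.05):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    # Pooled rate and standard error under H0 (no true difference)
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 for alpha=0.05
    ci = (diff - z_crit * se, diff + z_crit * se)  # pooled-SE CI, as in the example
    return diff, z, p_value, ci

diff, z, p, ci = two_prop_ztest(5_500, 50_000, 5_720, 50_000)
print(f"diff={diff:.4f}, z={z:.2f}, p={p:.3f}, 95% CI=({ci[0]:.4f}, {ci[1]:.4f})")
# Expected: diff=0.0044, z≈2.20, p≈0.028, CI≈(0.0005, 0.0083)
```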
Example 2: Revenue per user (means t-test, large n)
Setup: A: n=2,000, mean=$5.20, sd=$4.00. B: n=2,000, mean=$5.45, sd=$4.00.
- Difference: 0.25.
- SE = sqrt(sd^2/nA + sd^2/nB) = sqrt(16/2000 + 16/2000) = sqrt(0.016) ≈ 0.1265.
- t ≈ 0.25 / 0.1265 ≈ 1.98 → p ≈ 0.048 (two-sided).
- 95% CI: 0.25 ± 1.96×0.1265 ≈ [0.002, 0.498].
Decision: Barely significant at 5%. The plausible uplift ranges from near-zero to $0.50 per user.
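A similar sketch for the comparison of means, computed from summary statistics with the normal approximation (reasonable at n=2,000 per arm, which is why the 1.96 multiplier above works); variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mean_a, sd_a, n_a = 5.20, 4.00, 2_000
mean_b, sd_b, n_b = 5.45, 4.00, 2_000

diff = mean_b - mean_a
se = sqrt(sd_a**2 / n_a + sd_b**2 / n_b)           # standard error of the difference
z = diff / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))       # two-sided, normal approximation
ci = (diff - 1.96 * se, diff + 1.96 * se)

print(f"diff={diff:.2f}, se={se:.4f}, z={z:.2f}, p={p_value:.3f}, CI=({ci[0]:.3f}, {ci[1]:.3f})")
# Expected: diff=0.25, se≈0.1265, z≈1.98, p≈0.048, CI≈(0.002, 0.498)
```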
Example 3: Guardrail metric (refund rate)
Setup: A: 40/2,000 = 2.00%. B: 46/2,000 = 2.30%. Difference: +0.30 pp.
- Pooled p ≈ (86/4,000) = 0.0215.
- SE ≈ sqrt(0.0215×0.9785×(1/2000 + 1/2000)) ≈ 0.0046.
- z ≈ 0.0030 / 0.0046 ≈ 0.65 → p ≈ 0.51 (not significant).
- 95% CI ≈ 0.0030 ± 1.96×0.0046 ≈ [−0.006, 0.012].
Decision: Not significant; possible harm up to ~1.2 pp cannot be ruled out. Consider whether primary wins justify this risk or run longer to increase power on the guardrail.
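For a guardrail like this, the most useful single number is often the CI's upper bound: the worst plausible harm you cannot yet rule out. A minimal sketch, again with illustrative names:

```python
from math import sqrt

ref_a, n_a = 40, 2_000
ref_b, n_b = 46, 2_000

diff = ref_b / n_b - ref_a / n_a                        # +0.0030
p_pool = (ref_a + ref_b) / (n_a + n_b)                  # 0.0215
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))  # ≈ 0.0046
upper = diff + 1.96 * se                                # worst plausible harm at 95%
print(f"diff={diff:.4f}, 95% CI upper bound={upper:.4f}")
# Expected: upper ≈ +0.012, i.e. a refund-rate increase of up to ~1.2 pp is not ruled out
```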
Power and sample size planning
Before running, choose an MDE that matters and plan for power (commonly 80%). A quick approximation for proportions (balanced groups):
n_per_group ≈ 2 × (Z_{1−α/2} + Z_{power})^2 × p × (1−p) / MDE^2
- Use baseline rate for p.
- Z-values: 1.96 for α=0.05 (two-sided); 0.84 for 80% power.
- Round up, then add buffer for missing data and outliers.
Mini planner example
- Baseline conversion p=0.10, target MDE=0.005 (0.5 pp), α=0.05, power=80%.
- Compute: 2×(1.96+0.84)^2×0.10×0.90 / 0.005^2 ≈ 2×7.84×0.09 / 0.000025 ≈ 2×0.7056 / 0.000025 ≈ 56,448 per group.
- If daily traffic is 20k per group, you need at least 3 days of data; in practice, run a full weekly cycle to cover day-of-week effects (see the sketch below).
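A small sample-size helper following the approximation above; n_per_group is an illustrative name, and using exact Z-values (rather than the rounded 1.96 and 0.84) gives a slightly larger answer than the hand calculation.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(p_baseline, mde, alpha=0.05, power=0.80):
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ≈ 1.96 for alpha=0.05
    z_power = NormalDist().inv_cdf(power)           # ≈ 0.84 for 80% power
    n = 2 * (z_alpha + z_power) ** 2 * p_baseline * (1 - p_baseline) / mde ** 2
    return ceil(n)

n = n_per_group(0.10, 0.005)
print(n)                          # 56,512 with exact Z-values; 56,448 with rounded 1.96 and 0.84
print(ceil(n / 20_000), "days at 20k users per group per day (before any buffer)")
```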
Multiple comparisons and peeking
- Multiple metrics/variants inflate false positives. Simple safeguard: Bonferroni (divide α by the number of tests; see the sketch after this list). Conservative but easy.
- Sequential peeking (stopping early on a good result) inflates Type I error unless you use an alpha-spending plan or a sequential method. If you must peek, predefine rules.
- Pre-specify primary metric and analysis window. Use secondary metrics for diagnostics and add context, not to fish for wins.
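A minimal sketch of the Bonferroni safeguard; the metric names and p-values below are invented purely for illustration.

```python
# Compare each metric's p-value against the Bonferroni-adjusted alpha.
p_values = {"ctr": 0.012, "revenue_per_user": 0.041, "dwell_time": 0.30}

alpha = 0.05
alpha_adjusted = alpha / len(p_values)   # split the risk budget across tests

for metric, p in p_values.items():
    verdict = "significant" if p < alpha_adjusted else "not significant"
    print(f"{metric}: p={p:.3f} vs adjusted alpha={alpha_adjusted:.4f} -> {verdict}")
```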
Handling skew and variance
- Skewed spend/revenue: Consider log-transform, winsorization/trimmed means, or nonparametric tests (e.g., Mann–Whitney) and bootstrap confidence intervals (see the sketch after this list).
- Variance reduction: Stratification, regression adjustment, or CUPED (using pre-experiment covariates) can reduce noise and improve power if planned up front and applied symmetrically to both groups.
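A minimal sketch of a percentile-bootstrap CI for the difference in mean revenue per user, run on simulated lognormal (skewed) data; the distribution parameters and number of resamples are illustrative choices, not recommendations.

```python
import random
from statistics import mean

random.seed(7)
# Simulated skewed revenue per user for two arms (illustrative parameters)
rev_a = [random.lognormvariate(1.00, 1.0) for _ in range(2_000)]
rev_b = [random.lognormvariate(1.05, 1.0) for _ in range(2_000)]

def bootstrap_diff_ci(a, b, n_boot=2_000, conf=0.95):
    diffs = []
    for _ in range(n_boot):
        a_resampled = random.choices(a, k=len(a))   # resample with replacement
        b_resampled = random.choices(b, k=len(b))
        diffs.append(mean(b_resampled) - mean(a_resampled))
    diffs.sort()
    lo = diffs[int(((1 - conf) / 2) * n_boot)]
    hi = diffs[int((1 - (1 - conf) / 2) * n_boot) - 1]
    return lo, hi

print("Observed diff:", round(mean(rev_b) - mean(rev_a), 3))
print("95% bootstrap CI:", tuple(round(x, 3) for x in bootstrap_diff_ci(rev_a, rev_b)))
```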
Exercises
Do these exercises to practice. Use the checklist to confirm your work, then compare with the solutions.
Exercise 1: Is the conversion uplift significant?
Control: n=40,000, conv=4,400. Treatment: n=40,000, conv=4,640.
- Compute the difference in conversion rate, the pooled SE, the z-score, two-sided p-value, and a 95% CI.
- Decide at α=0.05.
Exercise 2: Plan sample size for power
Baseline p=0.08, target MDE=0.005 (absolute), α=0.05, power=80%.
- Estimate n per group using the proportions approximation.
- How long would you need to run with 30k users per variant per day?
Checklist
- [ ] I computed pooled rates and SE correctly.
- [ ] I reported both p-value and 95% CI.
- [ ] I stated a clear decision at α=0.05.
- [ ] I justified runtime from sample size and traffic.
Common mistakes and self-check
- Peeking without correction: Early stopping on a lucky spike inflates false positives. Self-check: Did we predefine stop rules or use a sequential method?
- Misreading p-values: p=0.03 is not the probability H0 is true. It is the probability of the observed (or more extreme) result if H0 were true. Self-check: Can you explain p-value without saying “probability H0 is true”?
- Ignoring power: “Not significant” can mean “not enough data.” Self-check: Did we run a power analysis against a meaningful MDE?
- Wrong unit of analysis: User-level vs session-level mixing can bias results. Self-check: Are assignments and metrics aligned at the same unit?
- Multiple comparisons: Testing many variants/metrics raises false positives. Self-check: Did we adjust or pre-specify a primary metric?
- Seasonality and novelty effects: Short runs may mislead. Self-check: Did we cover a full weekly cycle and monitor stabilization?
- Switching tails post hoc: Choosing one-sided after seeing direction inflates error. Self-check: Was test sidedness pre-specified?
- Randomization imbalance: Rare but possible. Self-check: Compare key covariates across groups.
Practical projects
- Build a power calculator: Input baseline, MDE, α, power; output runtime given traffic.
- Bootstrap CIs for revenue per user and compare against a t-test on simulated skewed data.
- Implement CUPED-style variance reduction with a pre-experiment metric and show power gains.
- Simulate sequential peeking to visualize inflated false positives vs an alpha-spending approach.
Learning path
- Experiment design basics: units, randomization, primary vs guardrail metrics.
- Statistical significance: p-values, CIs, Type I/II errors (this lesson).
- Power and sample size: MDE planning and runtime.
- Multiple comparisons and sequential testing: safe monitoring.
- Advanced analysis: variance reduction, nonparametrics, and bootstrap.
- Interpretation and decision-making: trade-offs and risk framing for stakeholders.
Next steps
- Set up a pre-analysis plan template: hypotheses, metrics, α, power, runtime, stopping rules.
- Automate a report that shows effect size, p-value, CI, power achieved, and guardrail checks.
- Review one past experiment: would the decision change if you used CIs and power planning?
Mini challenge
Quick check: A: 6,000/60,000 (10%). B: 6,480/62,000 (~10.45%). Is this significant at α=0.05?
Answer
- Difference ≈ 0.1045 − 0.1000 = 0.0045.
- Pooled p ≈ (6,000+6,480)/(122,000) ≈ 0.1023.
- SE ≈ sqrt(0.1023×0.8977×(1/60,000 + 1/62,000)) ≈ 0.0017.
- z ≈ 0.0045 / 0.0017 ≈ 2.6 → p ≈ 0.009 (two-sided).
- Decision: Significant at 5%. 95% CI roughly [0.0011, 0.0079].
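A quick self-contained check of these numbers (illustrative variable names):

```python
from math import sqrt
from statistics import NormalDist

conv_a, n_a = 6_000, 60_000
conv_b, n_b = 6_480, 62_000

diff = conv_b / n_b - conv_a / n_a
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = diff / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
ci = (diff - 1.96 * se, diff + 1.96 * se)
print(f"diff={diff:.4f}, se={se:.4f}, z={z:.2f}, p={p_value:.3f}, CI=({ci[0]:.4f}, {ci[1]:.4f})")
# Expected: diff≈0.0045, se≈0.0017, z≈2.60, p≈0.009, CI≈(0.0011, 0.0079)
```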
Quick Test
Take the quick test below to check understanding.