Why this matters
As a Data Analyst, you will often plan experiments and answer: How much traffic do we need? How long should we run the test? Can we detect this lift? Getting sample size and power right avoids underpowered tests (wasted time) and overlong tests (wasted opportunity).
- Decide if a feature can be tested this sprint based on required users/sessions.
- Set Minimum Detectable Effect (MDE) so stakeholders know what lift is realistically detectable.
- Estimate test runtime using daily traffic and target power.
- Explain trade-offs: faster decision vs result sensitivity.
Concept explained simply
Think of an A/B test like taking a photo in low light:
- Sample size = exposure time. More samples = clearer picture.
- Power (1 − β) = your camera’s ability to capture a real object. Commonly 80% or 90%.
- Alpha (α) = false alarm rate you accept (usually 5% two-sided).
- MDE = the smallest lift you care to detect (effect size). Smaller MDE needs more data.
- Variance/SD = shakiness of your hand. More variance = more samples needed.
- Baseline = current performance (e.g., 5% conversion).
Mental model: You set how sensitive you want the test to be (MDE) and how confident you want to be (alpha, power). The math tells you the needed sample size.
Practical formulas you can use safely
These approximations work well for planning with equal group sizes and two-sided tests:
- Two-sample proportions (e.g., conversion rate)
n per group ≈ [ 2 × p × (1 − p) × (Zα/2 + Zβ)2 ] / MDE2
Where p = baseline rate (e.g., 0.05), MDE = absolute difference (e.g., +1 percentage point = 0.01). - Two-sample means (e.g., average order value)
n per group ≈ [ 2 × σ2 × (Zα/2 + Zβ)2 ] / MDE2
Where σ = standard deviation of the metric, MDE = absolute difference in the mean.
What Z values should I use?
- Two-sided α = 0.05 → Zα/2 ≈ 1.96
- Power = 80% → Zβ ≈ 0.84
- Power = 90% → Zβ ≈ 1.28
One-sided vs two-sided tests
Two-sided is safer for most product tests. One-sided can reduce sample size but only use it when you truly care about one direction and would not act on a significant effect in the other direction.
Worked examples
- Conversion rate (proportions):
Goal: Detect +1 percentage point on a 5% baseline (0.05 → 0.06), α=0.05 two-sided, power=80%.
p = 0.05, MDE = 0.01, Zα/2 = 1.96, Zβ = 0.84. Sum = 2.8; (Sum)2 = 7.84.
2 × p(1−p) = 2 × 0.05 × 0.95 = 0.095.
n ≈ (0.095 × 7.84) / 0.012 = 0.7448 / 0.0001 ≈ 7,448 per group (≈ 14,896 total). - Click-through rate (proportions, relative MDE):
Baseline 2% (0.02). Target relative lift 15% → absolute MDE = 0.02 × 0.15 = 0.003.
2 × p(1−p) = 2 × 0.02 × 0.98 = 0.0392. With α=0.05, power=80%: factor = 7.84.
n ≈ (0.0392 × 7.84) / 0.0032 ≈ 0.3073 / 0.000009 ≈ 34,149 per group. - Average order value (means):
Baseline mean $50, SD $60, MDE = +$5, α=0.05, power=90%.
Zα/2 = 1.96, Zβ = 1.28 → Sum = 3.24; (Sum)2 = 10.4976.
2 × σ2 = 2 × 3,600 = 7,200.
n ≈ (7,200 × 10.4976) / 52 ≈ 75,583 / 25 ≈ 3,024 per group.
Step-by-step planning checklist
- Clarify the decision: what action will you take if the test wins/loses?
- Choose primary metric and its type (proportion or mean).
- Estimate baseline and variance (or SD); use recent, representative data.
- Set MDE that is both meaningful and feasible.
- Pick α (usually 0.05 two-sided) and desired power (80% or 90%).
- Compute n per group and translate to days using your daily traffic.
- Sanity-check duration against seasonality and release cadence.
- Plan guardrails (e.g., sample ratio mismatch, data quality checks).
Exercises — practice
Do these without a calculator first to estimate, then compute precisely.
Exercise 1: Conversion rate sample size
You have a 3% signup rate. You want to detect an absolute lift of +0.6 percentage points (MDE = 0.006) with α=0.05 (two-sided) and power=80%. Use the simplified proportions formula.
- Round up to the nearest whole number.
- Report per-group and total sample sizes.
Hints
- Zα/2 ≈ 1.96, Zβ ≈ 0.84 → (sum)^2 ≈ 7.84.
- 2 × p(1−p) with p=0.03.
- MDE = 0.006.
Exercise 2: Mean metric sample size
Average session length is 8 minutes with SD = 5 minutes. You want to detect a decrease of 0.8 minutes (MDE = 0.8) with α=0.05 (two-sided) and power=80%. Use the means formula.
- Round up to the nearest whole number.
- Report per-group and total sample sizes.
Hints
- Zα/2 ≈ 1.96, Zβ ≈ 0.84 → (sum)^2 ≈ 7.84.
- σ = 5 → σ^2 = 25.
- MDE = 0.8 → MDE^2 = 0.64.
- Checklist: Did you use absolute MDE values? Did you round up? Did you state assumptions (two-sided, equal split)?
Common mistakes and self-checks
- Using relative MDE in a formula that expects absolute MDE. Convert 15% relative on 2% baseline to 0.003 absolute.
- Underestimating variance/SD. Use recent data; consider seasonality and segmentation.
- Peeking and stopping early. If you look often without proper corrections, your α inflates; pre-plan duration.
- Mismatched tails. Planning with one-sided α but analyzing two-sided changes the error rates.
- Ignoring allocation ratio. If not 50/50, required total n increases; use a calculator that handles unequal splits.
- Overly tiny MDEs. If MDE is too small, required n or duration may be impractical; reset expectations.
- Poor baseline estimate. Baseline should reflect the audience and time window of the test.
Quick self-check
- If you halve the MDE, sample size should roughly quadruple. Does your number reflect that?
- If SD doubles (for means), n should roughly quadruple. Check your result.
- Lowering α or increasing power should increase n. Did your plan reflect the trade-off?
Practical projects
- Spreadsheet calculator: Build a simple sheet with inputs (baseline, SD, MDE, α, power) and outputs (n per group, test duration). Include a toggle for proportions vs means.
- Traffic-to-time planner: Add daily unique counts per page and compute estimated days to reach required n with a 10% buffer.
- Guardrail checklist: Create a one-pager you attach to every test plan: sample ratio match, data freshness, outlier handling, and seasonality notes.
- Pilot variance check: Run a short A/A or pre-test measurement to validate SD/variance assumptions before the main test.
Learning path
- Before this: Understanding metrics, variance, and basic hypothesis testing.
- This lesson: Choose α, power, and MDE; compute required sample size and estimated duration.
- Next: Test design details (guardrails, bucketing, segmentation) and interpreting results (p-values, confidence intervals, lift).
Next steps
- Pick one upcoming A/B test and fill out the checklist above.
- Compute two versions: an 80% power plan and a 90% power plan; discuss trade-offs with stakeholders.
- Document assumptions (baseline window, tails, allocation) so analysis matches the plan.
Mini challenge
You have 12,000 daily unique visitors to a page. Baseline conversion is 4%. You want to detect a +0.8 percentage point lift (to 4.8%), α=0.05 two-sided, power=80%, equal split.
- Estimate n per group and total.
- How many days will the test take with a 10% buffer?
Suggested approach
Use n per group ≈ [2 × p(1−p) × (Zα/2 + Zβ)^2] / MDE^2. Then divide required total by daily eligible traffic per variant.
Quick test
Take the quick test below to lock in your understanding. Everyone can take it for free; progress is saved only if you are logged in.