Why this matters
As a Marketing Analyst, you will plan and judge A/B tests for landing pages, emails, pricing, creatives, and funnels. Choosing the right sample size and run time prevents two costly outcomes: stopping too early (false winners) and running too long (wasting traffic and time). Mastering these basics helps you ship reliable wins faster.
- Estimate how many users you need before starting.
- Plan how long to run a test based on traffic.
- Avoid peeking mistakes and day-of-week bias.
Concept explained simply
You want to detect whether Variant B truly performs differently from Variant A. Two practical decisions control this:
- How small a change you care to detect (Minimum Detectable Effect, MDE).
- How confident you want to be (significance level, alpha) and how often you want to catch true effects (power).
Inputs you typically choose:
- Baseline metric: e.g., conversion rate (proportions) or average order value (means) and its variability.
- MDE: smallest effect size that matters (absolute or relative).
- Alpha: commonly 0.05 (two-sided) for marketing tests.
- Power: commonly 0.8 (80%) or 0.9 (90%).
- Allocation: often 50/50 for two variants to finish faster.
- Daily traffic or events to estimate calendar days.
Mental model: a resolution dial
Think of MDE as a resolution dial. The smaller the change you want to see, the more data you need: dial it toward tiny effects and tests take longer; accept only bigger effects and tests finish faster.
- Smaller MDE → larger sample → longer run.
- Higher traffic or more events/day → shorter calendar time.
- More variability (noisier metric) → larger sample.
- Two-sided tests are safer defaults and usually require a bit more data than one-sided.
Quick formulas you can use
These are standard approximations to plan before you test. They are close enough for scoping and prioritization. For precise planning, use a stats calculator with the same inputs.
Two-proportion test (e.g., conversion rate)
Per-variant sample size n ≈ 2 × p̄ × (1 − p̄) × (Zα/2 + Zβ)^2 / δ^2, where:
- p1 = baseline rate, p2 = target rate, δ = |p2 − p1| (absolute difference)
- p̄ = (p1 + p2) / 2
- Zα/2 = 1.96 for α = 0.05 (two-sided)
- Zβ = 0.84 for 80% power; 1.28 for 90% power
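A minimal Python sketch of this formula (the function name and defaults are illustrative; assumes Python 3.8+ for statistics.NormalDist):

```python
from math import ceil
from statistics import NormalDist

def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.8):
    """Per-variant sample size for a two-sided two-proportion test
    (pooled-variance approximation, matching the formula above)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    p_bar = (p1 + p2) / 2
    delta = abs(p2 - p1)
    n = 2 * p_bar * (1 - p_bar) * (z_alpha + z_beta) ** 2 / delta ** 2
    return ceil(n)
```

Because it uses exact Z-values (1.9600, 0.8416) rather than the rounded ones above, its answers run slightly higher than the hand math.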
Two-mean test (e.g., average order value)
Per-variant sample size n ≈ 2 × (Zα/2 + Zβ)^2 × σ^2 / δ^2, where:
- σ is the standard deviation of the metric
- δ is the absolute difference you want to detect (MDE)
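The same idea for means, continuing the sketch above (same imports; names again illustrative):

```python
def sample_size_two_means(sd, delta, alpha=0.05, power=0.8):
    """Per-variant sample size for a two-sided two-mean test,
    assuming both variants share standard deviation `sd`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n = 2 * (z_alpha + z_beta) ** 2 * sd ** 2 / delta ** 2
    return ceil(n)
```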
From sample size to run time
- Compute per-variant sample size n from a formula or calculator.
- Estimate daily exposure per variant: daily_total × allocation (e.g., 50%).
- Days ≈ n / daily_exposure_per_variant. Add a buffer to cover full weekly cycles.
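These three steps translate into a small helper; the min_days floor is an assumption that encodes the "full weekly cycles" advice:

```python
def runtime_days(n_per_variant, daily_total, allocation=0.5, min_days=7):
    """Calendar days to reach n per variant, floored at one full week."""
    daily_per_variant = daily_total * allocation
    return max(ceil(n_per_variant / daily_per_variant), min_days)
```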
Worked examples
Example 1: Landing page conversion uplift
Goal: Detect a 10% relative uplift, baseline 5% → target 5.5%.
- Inputs: p1 = 0.05, p2 = 0.055, δ = 0.005, p̄ = 0.0525
- α = 0.05 (two-sided), power = 0.8 → Zα/2 = 1.96, Zβ = 0.84
The math
n ≈ 2 × 0.0525 × 0.9475 × (1.96 + 0.84)^2 / 0.005^2
= 2 × 0.04974 × 7.84 / 0.000025 ≈ 0.780 / 0.000025 ≈ 31,200 per variant (approx.)
Runtime with 10,000 visitors/day, 50/50 split → 5,000 per variant/day → ~6–7 days. Good practice: run at least 1–2 full weeks to cover weekdays/weekends.
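With the sketches above, Example 1 reproduces in two lines (exact Z-values give a slightly larger n than the rounded hand math):

```python
n = sample_size_two_proportions(0.05, 0.055)  # ≈ 31,235 per variant
days = runtime_days(n, daily_total=10_000)    # 6.25 days, floored to 7
```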
Example 2: Email open rate
Goal: Detect +2 percentage points, baseline 20% → 22%.
- Inputs: p1 = 0.20, p2 = 0.22, δ = 0.02, p̄ = 0.21
- α = 0.05 (two-sided), power = 0.9 → Zα/2 = 1.96, Zβ = 1.28
The math
n ≈ 2 × 0.21 × 0.79 × (1.96 + 1.28)^2 / 0.02^2
= 2 × 0.1659 × 10.4976 / 0.0004 ≈ 3.483 / 0.0004 ≈ 8,700 per variant (approx.)
If your list has 100,000 recipients split 50/50, one send can be enough.
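As a sanity check, the same function reproduces this example (note the 90% power argument):

```python
n = sample_size_two_proportions(0.20, 0.22, power=0.9)  # ≈ 8,716 (hand math: ≈ 8,700)
```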
Example 3: Average order value (means)
Goal: Detect +$3 AOV, baseline SD ≈ $20.
- Inputs: σ = 20, δ = 3
- α = 0.05 (two-sided), power = 0.8 → Zα/2 = 1.96, Zβ = 0.84
The math
n ≈ 2 × (1.96 + 0.84)^2 × 20^2 / 3^2
= 2 × 7.84 × 400 / 9 = 6,272 / 9 ≈ 697 per variant (approx.)
With ~300 orders/day total (150 per variant), you need ~5 days of orders; run for at least one full week.
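And Example 3 with the means sketch:

```python
n = sample_size_two_means(sd=20, delta=3)  # ≈ 698 per variant
days = runtime_days(n, daily_total=300)    # ~4.7 days of orders, floored to 7
```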
How to plan runtime safely
- Always pre-set MDE, alpha, power, and guardrails.
- Compute n per variant; estimate days from your traffic.
- Run for whole business cycles (commonly ≥ 1–2 full weeks).
- Avoid peeking to stop at the first p < 0.05; wait until both the sample-size target and the minimum duration are met.
- Prefer 50/50 allocation for speed unless there are strong risk reasons.
Guardrails to track during the test
- Traffic quality: bots filtered, tracking firing consistently.
- Variant parity: sample counts per variant are balanced within a few percent.
- No major marketing or site changes mid-test that affect segments unevenly.
Common mistakes and how to self-check
- Stopping early on a spike: Self-check by confirming both sample-size target and minimum calendar duration are met.
- Choosing an unrealistically tiny MDE: Sanity-check against expected business impact and traffic; if runtime is months, increase MDE or choose a higher-impact change.
- Using visitor counts when the unit is sessions or orders: Match sample to the unit your metric is computed on.
- Ignoring variance for means tests: You need an SD estimate; use historical data.
- Not covering full weeks: Seasonal bias can flip results. Include at least one full weekday-weekend cycle.
Exercises
Try these. Then compare with the solutions.
Exercise 1
You plan a conversion test with baseline 3% and want to detect +0.6 percentage points (absolute) at α = 0.05 (two-sided), power = 0.8. Estimate the per-variant sample size and the minimum days if you get 20,000 visitors/day at 50/50 split.
Exercise 2
You plan an AOV test. SD ≈ $45, MDE = $4, α = 0.05 (two-sided), power = 0.8. Estimate per-variant sample size. If you have 500 orders/day total (evenly split), how many days do you need (before adding weekly-cycle buffers)?
Exercise solutions
Exercise 1 solution (summary)
Two-proportion formula with p1 = 0.03, p2 = 0.036 → δ = 0.006, p̄ = 0.033. n ≈ 13,900 per variant. With 10,000 per variant/day (50% of 20k), ≈ 1.4 days; still run at least a full week.
Exercise 2 solution (summary)
Two-mean formula: n ≈ 2 × (1.96 + 0.84)^2 × 45^2 / 4^2 ≈ 1,980 per variant. With 250 orders/variant/day, ≈ 8 days; round to full weeks.
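If you built the sketches above, both answers check out in two lines (exact Z-values land within about 1% of the rounded hand math):

```python
print(sample_size_two_proportions(0.03, 0.036))  # ≈ 13,915 (hand math: ≈ 13,900)
print(sample_size_two_means(sd=45, delta=4))     # ≈ 1,987 (hand math: ≈ 1,980)
```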
Checklist before you launch
- MDE chosen and meaningful for the business.
- Alpha and power set (e.g., 0.05 and 0.8).
- Baseline and, if needed, SD estimated from recent data.
- Per-variant sample size computed.
- Planned start and end dates cover at least one full weekly cycle.
- Allocation set (often 50/50) and tracking QA done.
Mini challenge
You have 12,000 daily sessions, baseline add-to-cart rate 8%. You care about a 1 percentage point absolute lift. Alpha 0.05, power 0.8. Roughly estimate per-variant n and how many days you would run at 50/50. Then list two guardrails you would monitor. Compare to the worked examples to sanity-check your answer.
Who this is for
- Marketing Analysts planning or reviewing A/B tests.
- Growth, CRM, and Product Marketers who need reliable experiment timelines.
- Designers/PMs collaborating on test roadmaps.
Prerequisites
- Basic statistics: proportions, means, standard deviation.
- Comfort with metrics you test (e.g., conversion rate, AOV, CTR).
- Access to recent baseline data for your metric.
Learning path
- Start here: MDE, alpha, power, baseline, and time-to-sample.
- Next: Test design choices (allocation, tail direction, guardrails).
- Then: Interpreting results (confidence intervals, lift, uncertainty).
- Advanced: Sequential testing and multiple comparisons control.
Practical projects
- Build a sample-size sheet: Create a spreadsheet with inputs (baseline, MDE, alpha, power, SD) and outputs (per-variant n, runtime). Include both proportions and means.
- Backtest a past experiment: Using historical data, recompute the needed sample size and check if the original run met it. Summarize risks if it did not.
- Traffic budgeting: For your next quarter’s test ideas, estimate n and calendar days for each. Prioritize by impact and feasibility.
Next steps
- Apply these formulas to your next planned test and compare with a calculator.
- Add minimum duration rules to your team playbook (e.g., ≥ 1–2 full weeks).
- Track a small set of guardrails (traffic balance, bot filtering, tracking health).