Why this matters
As a Product Analyst, you will be asked questions like: "When can we call this A/B test?" "Can we launch before the sale ends?" "How much traffic do we need to detect a 5% lift?" Experiment duration planning answers these with defensible numbers so stakeholders can make timely, low-risk decisions.
- Plan realistic launch dates and avoid underpowered tests.
- Balance time-to-decision with statistical rigor.
- Prevent false wins from peeking or seasonality.
Concept explained simply
Duration is how long you must run an experiment to collect enough high-quality data to detect the effect you care about. It depends on three things:
- How many samples you need (sample size).
- How quickly you collect those samples (traffic per variant).
- How long outcomes take to materialize (measurement delay).
For a binary metric (e.g., conversion rate), a common approximation for equal-sized variants is:
n_per_variant ≈ 2 × (Z_(1−α/2) + Z_power)^2 × p × (1−p) / δ^2
- p = baseline rate (e.g., 0.05)
- δ = absolute Minimum Detectable Effect (MDE), e.g., +0.5 percentage points = 0.005
- α = significance level (commonly 0.05, two-sided, so Z_(1−α/2) ≈ 1.96)
- power = typically 0.8 (Z_power ≈ 0.84)
For a continuous metric (e.g., revenue per user):
n_per_variant ≈ 2 × (Z_(1−α/2) + Z_power)^2 × σ^2 / δ^2
- σ = standard deviation of the metric
- δ = absolute lift you want to detect
Then convert to calendar time:
days ≈ n_per_variant / daily_eligible_units_per_variant
Add buffers for ramp-up, weekly cycles, and any measurement delay (e.g., D7 retention requires at least 7 extra days after exposure).
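The two sample-size formulas and the days conversion can be sketched in Python using only the standard library; the function names here are my own, and statistics.NormalDist supplies the Z values:

```python
from statistics import NormalDist

def n_per_variant_binary(p, delta, alpha=0.05, power=0.8):
    """Approximate per-variant sample size for a two-sided test on a proportion."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ≈ 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # ≈ 0.84 for power = 0.8
    return 2 * (z_alpha + z_power) ** 2 * p * (1 - p) / delta ** 2

def n_per_variant_continuous(sigma, delta, alpha=0.05, power=0.8):
    """Approximate per-variant sample size for a two-sided test on a mean."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return 2 * (z_alpha + z_power) ** 2 * sigma ** 2 / delta ** 2

def days_needed(n_per_variant, daily_per_variant, delay_days=0):
    """Calendar days: recruitment time plus any outcome-measurement delay."""
    return n_per_variant / daily_per_variant + delay_days
```

Because these use exact (unrounded) Z values, results differ slightly from hand calculations done with 1.96 and 0.84.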
Mental model
- Levers: smaller MDE or higher variance → bigger sample → longer duration. More traffic → shorter duration. Delayed outcomes → longer calendar time.
- Quality gates: ensure stable traffic, avoid mid-test changes, cover full weekly cycles.
- Decision rule: commit to stop criteria (e.g., target sample size or fixed calendar window) before you start.
Quick reference: typical Z values
- Two-sided α = 0.05 → Z_(1−α/2) ≈ 1.96
- Power = 0.80 → Z_power ≈ 0.84
- Power = 0.90 → Z_power ≈ 1.28
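These values need not be memorized; Python's standard-library statistics.NormalDist can reproduce them:

```python
from statistics import NormalDist

z = NormalDist().inv_cdf  # inverse CDF of the standard normal

print(round(z(1 - 0.05 / 2), 2))  # two-sided alpha = 0.05 -> 1.96
print(round(z(0.80), 2))          # power = 0.80 -> 0.84
print(round(z(0.90), 2))          # power = 0.90 -> 1.28
```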
Inputs you need
- Primary metric type (binary or continuous) and unit of analysis (user, session, order).
- Baseline value (rate or mean) and variability (p(1−p) or standard deviation).
- MDE (absolute); optionally relate it to a % lift for stakeholders.
- Alpha (typically 0.05, two-sided) and power (typically 0.8).
- Traffic estimates: eligible population per day and allocation ratio (commonly 50/50).
- Outcome delay (e.g., D7 retention), ramp plan, and expected seasonality.
- Experiment length constraints (deadlines, releases, campaigns).
Worked examples
Example 1: Signup conversion
Goal: detect a +0.5 pp absolute lift from 5.0% to 5.5% (≈ +10% relative). α = 0.05 (two-sided), power = 0.8. Eligible traffic: 20,000 users/day, 50/50 allocation.
n_per_variant ≈ 2 × (1.96 + 0.84)^2 × 0.05 × 0.95 / 0.005^2
(1.96 + 0.84)^2 = 7.84; p(1−p) = 0.0475
n_per_variant ≈ 2 × 7.84 × 0.0475 / 0.000025 ≈ 29,800
Daily per variant ≈ 10,000 → ≈ 3.0 days of traffic. Practical plan: 1-day ramp + full week coverage → run ~8 days total.
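As a quick check, the arithmetic above can be reproduced directly, using the rounded Z values 1.96 and 0.84 from the quick reference:

```python
z_sum_sq = (1.96 + 0.84) ** 2                     # 7.84 (rounded Z values)
n = 2 * z_sum_sq * 0.05 * 0.95 / 0.005 ** 2       # binary formula
print(round(n))                                   # 29792 users per variant
print(round(n / 10_000, 1))                       # 3.0 days of traffic per variant
```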
Example 2: Average order value (AOV)
Goal: detect +$2 lift. Baseline mean = $50, σ = $70. α = 0.05, power = 0.8. Daily orders across both variants ≈ 3,000 (per variant ≈ 1,500).
n_per_variant ≈ 2 × 7.84 × 70^2 / 2^2 = 2 × 7.84 × 4,900 / 4 ≈ 19,200
Days ≈ 19,200 / 1,500 ≈ 12.8 days. Practical plan: 1–2 day ramp + 2 full weeks to cover weekly cycles.
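Reproducing the AOV arithmetic in a couple of lines (rounded Z values, so the result is approximate):

```python
n = 2 * 7.84 * 70 ** 2 / 2 ** 2    # continuous formula: 19208 per variant
days = n / 1_500                    # 1,500 orders/day per variant
print(round(n), round(days, 1))     # 19208 12.8
```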
Example 3: D7 retention
Goal: detect +2 pp absolute lift (25% → 27%). α = 0.05, power = 0.8. p(1−p) = 0.25 × 0.75 = 0.1875. Daily new eligible users = 15,000 (per variant 7,500). Outcome delay = 7 days.
n_per_variant ≈ 2 × 7.84 × 0.1875 / 0.02^2 = 2 × 1.47 / 0.0004 ≈ 7,350
Recruitment time ≈ 7,350 / 7,500 ≈ 1 day, then wait 7 days for D7 outcome → ~8 days minimum. Plan for a full 2-week calendar to cover weekly cycles and delays.
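The retention example adds a measurement delay to the calendar math; a quick sketch:

```python
import math

n = 2 * 7.84 * 0.25 * 0.75 / 0.02 ** 2     # binary formula: 7350 per variant
recruit_days = math.ceil(n / 7_500)         # 1 day of recruitment
total_days = recruit_days + 7               # + 7-day wait for the D7 outcome
print(round(n), total_days)                 # 7350 8
```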
Plan your experiment duration: step-by-step
- Define your primary metric and unit of analysis.
- Collect baseline and variability (historical data or prior tests).
- Choose α and power with stakeholders.
- Set a practical MDE that is meaningful to the business.
- Compute sample size per variant (proportion or mean formula).
- Estimate daily eligible traffic per variant and compute raw days.
- Add buffers: ramp-up, weekly cycle coverage, outcome delays.
- Pre-register stopping rule and guardrails (e.g., SRM checks).
- Communicate a calendar plan (start, checkpoints, expected stop).
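The buffering steps above can be sketched as a small helper. The policy here (ramp days first, then enough post-ramp days to hit the sample size, measure delayed outcomes, and cover full weeks) is one reasonable choice, not a standard:

```python
import math

def calendar_days(n_per_variant, daily_per_variant, ramp_days=1,
                  delay_days=0, min_weekly_cycles=1):
    """Hypothetical buffer policy: ramp, then raw recruitment days plus any
    outcome delay, padded to cover at least min_weekly_cycles full weeks."""
    raw = math.ceil(n_per_variant / daily_per_variant)
    return ramp_days + max(raw + delay_days, 7 * min_weekly_cycles)

print(calendar_days(29_800, 10_000, ramp_days=1))              # Example 1 -> 8
print(calendar_days(7_350, 7_500, ramp_days=0, delay_days=7))  # Example 3 -> 8
```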
Mini task: convert % MDE to absolute
If baseline conversion is 6% and stakeholders want +12% relative lift, absolute MDE δ = 0.06 × 0.12 = 0.0072 (0.72 pp).
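The conversion is a one-line multiplication; the helper name is just for illustration:

```python
def relative_to_absolute_mde(baseline, relative_lift):
    """Convert a relative lift (e.g., 0.12 for +12%) to an absolute delta."""
    return baseline * relative_lift

print(round(relative_to_absolute_mde(0.06, 0.12), 4))  # 0.0072, i.e. 0.72 pp
```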
Checklist before you start
- Primary metric, baseline, and MDE are documented.
- Alpha and power agreed with stakeholders.
- Traffic estimate verified against recent weeks.
- Ramp policy defined (e.g., 20% → 50% → 100%).
- Outcome delays accounted for (e.g., retention).
- Plan covers at least one full weekly cycle.
- Stopping rule is written and shared.
- Guardrails set: SRM check, error logs, crash rate, availability.
Exercises (practice)
Work these out with the formulas above, then compare your answers with the provided solutions.
Exercise 1: Binary metric duration
Baseline signup rate p = 8%. Absolute MDE δ = 0.8 pp (10% relative). α = 0.05 (two-sided), power = 0.8. Eligible traffic = 12,000 users/day total, 50/50 allocation. Compute sample size per variant and approximate days of traffic. Recommend a calendar plan including ramp and weekly coverage.
Exercise 2: Continuous metric duration
Primary metric: average session duration. Baseline mean = 120 s, σ = 90 s. Absolute MDE δ = 6 s. α = 0.05, power = 0.8. Eligible traffic = 4,000 users/day total, 50/50. Compute sample size per variant and days. What calendar duration would you propose and why?
Common mistakes and self-check
- Counting total visitors, not eligible units. Self-check: does your denominator match your unit of analysis?
- Using relative MDE without converting to absolute. Self-check: convert to pp or absolute delta before formulas.
- Ignoring outcome delays (e.g., D7 retention). Self-check: add delay days after last exposure needed for measurement.
- Ending mid-week with strong weekday seasonality. Self-check: ensure at least 1 full weekly cycle.
- Peeking at results and stopping early without corrections. Self-check: adhere to your pre-registered stop rule.
- Traffic instability during test (launches, promos). Self-check: note major events; pause or extend if conditions change.
- SRM (sample ratio mismatch) ignored. Self-check: monitor allocation; large deviations indicate issues—investigate.
Practical projects
- Build a duration calculator sheet: inputs (p or mean, σ, α, power, MDE, traffic) → outputs (n per variant, days, calendar plan).
- Audit a past experiment: recompute required duration, compare to actual run, note risks and what you’d change.
- Create a one-page test plan template including duration rationale, stop rule, ramp plan, and guardrails.
Next steps
- Learn variance reduction (e.g., covariate adjustment) to reduce required sample size.
- Study sequential testing basics if you must look early; use proper alpha spending.
- Practice communicating duration trade-offs with product and engineering.
Mini challenge
You plan a pricing page test. Baseline conversion = 3.5%, absolute MDE = 0.35 pp, α = 0.05, power = 0.8, traffic = 14,000/day total, 50/50. There’s a 7-day sale next week. How do you set duration so you cover a full week and avoid confounding? Write your plan in 3–4 bullet points.
One possible approach
- Compute n_per_variant with the binary formula; convert to days using 7,000/day per variant.
- Include a 1-day ramp and ensure at least 1 full week of normal traffic (avoid mixing sale vs. non-sale weeks).
- If the sale cannot be avoided, plan the test entirely within the sale window or entirely outside it, not spanning both.
- Document the stopping rule and guardrails (SRM, outages).
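One way to run the numbers for the challenge (using the rounded Z values from the quick reference, so the figures are approximate):

```python
p, delta = 0.035, 0.0035                 # baseline 3.5%, MDE 0.35 pp
n = 2 * 7.84 * p * (1 - p) / delta ** 2  # binary formula
days = n / 7_000                         # 7,000 users/day per variant
print(round(n), round(days, 1))          # 43232 6.2
```

With roughly 6 raw days plus a ramp day, the test fits inside one week, so it can be scheduled entirely within a normal (non-sale) week, or entirely within the sale window if that is unavoidable.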