Why this matters
Choosing the right experiment duration is one of the highest-leverage decisions you make as a Data Scientist. Too short, and you risk false positives or missed real effects. Too long, and you waste time and expose users to a poor variant longer than necessary. Seasonality (day-of-week, monthly cycles, holidays) can skew results if you stop at the wrong time.
- Estimate when you will reach the needed sample size for your Minimum Detectable Effect (MDE).
- Plan runs to cover seasonality cycles (e.g., full weeks).
- Control novelty effects and avoid biased stopping decisions.
Who this is for
- Data Scientists and Analysts running A/B/n tests or switchback tests.
- Product managers who need realistic experiment timelines.
- Engineers enabling experimentation who want guardrail-aware plans.
Prerequisites
- Basic hypothesis testing (alpha, power) and confidence intervals.
- Understanding of conversion rates or continuous metrics (mean, variance).
- Ability to estimate traffic volume per group per day.
Concept explained simply
Think of duration as filling a bucket to a marked line. The faucet flow (traffic) pulses by day and season. You must fill to the line (required sample size) and stop at a good moment: when one or more full cycles have passed, so a single pulse doesn’t mislead you.
Key inputs
- Baseline level/variance: for proportions p (conversion rate) or standard deviation σ for continuous metrics.
- MDE (minimum detectable effect): absolute or relative (convert relative to absolute units for formulas).
- Significance (alpha) and power (1 - beta).
- Daily events per group (after eligibility filters).
- Seasonality pattern: day-of-week, monthly/quarterly, holidays, marketing campaigns.
- Novelty or ramp time: initial stabilization period to exclude or downweight.
Simple formulas (back-of-envelope)
For a binary metric (conversion rate) with baseline p and absolute MDE = d (in proportion points, e.g., 0.005 for 0.5 pp), an approximate per-group sample size is:
n_per_group ≈ 2 × p × (1 − p) × (Z_{1−α/2} + Z_{power})² / d²
For a continuous metric with standard deviation σ and absolute MDE = d:
n_per_group ≈ 2 × σ² × (Z_{1−α/2} + Z_{power})² / d²
Duration (days) ≈ ceil(n_per_group / daily_events_per_group)
Good practice: commit to a fixed horizon that includes at least one full seasonality cycle (e.g., a full week), and stop only at cycle boundaries (end of week).
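These back-of-envelope formulas are easy to script. Below is a minimal Python sketch (function names are illustrative) that uses the standard library's NormalDist for the z-values instead of the rounded constants:

```python
import math
from statistics import NormalDist

def n_per_group_binary(p, d, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided test on a proportion.
    p: baseline conversion rate, d: absolute MDE in proportion points."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil(2 * p * (1 - p) * z**2 / d**2)

def n_per_group_continuous(sigma, d, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a continuous metric with std dev sigma."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil(2 * sigma**2 * z**2 / d**2)

def naive_duration_days(n_per_group, daily_events_per_group):
    """Days to reach the required sample size, ignoring seasonality."""
    return math.ceil(n_per_group / daily_events_per_group)

# Example: 5% baseline, detect 0.5 pp absolute lift, 10k users/group/day
n = n_per_group_binary(0.05, 0.005)
print(n, naive_duration_days(n, 10_000))  # ~29.8k per group, 3 days
```

Note that exact quantiles give a slightly larger n than the hand-rounded 1.96 + 0.84; either is fine for planning.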
What Z-values should I use?
- α = 0.05 (two-sided) → Z_{1−α/2} ≈ 1.96
- Power = 0.80 → Z_{power} ≈ 0.84
- Power = 0.90 → Z_{power} ≈ 1.28
These are standard approximations for planning.
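If you'd rather compute these quantiles than memorize them, the standard normal quantile function is available in Python's standard library:

```python
from statistics import NormalDist

z = NormalDist().inv_cdf  # quantile function of the standard normal

alpha = 0.05
print(round(z(1 - alpha / 2), 2))  # two-sided alpha = 0.05 → 1.96
print(round(z(0.80), 2))           # power 0.80 → 0.84
print(round(z(0.90), 2))           # power 0.90 → 1.28
```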
Seasonality playbook
- Day-of-week: Always cover at least one full week. If n is large or traffic variable, prefer multiples of full weeks.
- Known events (launches, holidays): Avoid or explicitly include as full cycles. Use blackouts (pause enrollment) if needed.
- Marketing bursts: Start after ramp settles, or run for full campaign windows.
- Switchbacks (if applicable to supply/demand systems): Use full cycles of assignment to cover temporal patterns.
Worked examples
Example 1: Website conversion A/B
Goal: Detect a 10% relative lift from 5% baseline. So absolute MDE d = 0.10 × 0.05 = 0.005 (0.5 pp). α = 0.05, power = 0.80 → Z sum ≈ 1.96 + 0.84 = 2.80.
n_per_group ≈ 2 × 0.05 × 0.95 × (2.80)² / 0.005² = 2 × 0.0475 × 7.84 / 0.000025 ≈ 0.095 × 7.84 / 0.000025 ≈ 0.7448 / 0.000025 ≈ 29,792 users per group.
Traffic: 20,000 eligible users per day total, 50/50 split → 10,000 per group per day. Duration for sample size ≈ ceil(29,792 / 10,000) = 3 days.
But: day-of-week seasonality exists. Plan: run at least one full week and include a 2-day novelty burn-in excluded from analysis. Practical plan: 9 calendar days total; analyze the last 7 days ending at a weekday boundary.
Example 2: Continuous metric (revenue per user)
Baseline σ = $3, MDE d = $0.20, α = 0.05, power = 0.90 → Z sum = 1.96 + 1.28 = 3.24.
n_per_group ≈ 2 × 3² × (3.24)² / 0.20² = 18 × 10.4976 / 0.04 ≈ 188.96 / 0.04 ≈ 4,724 users per group.
Traffic: 1,000 users per group per day → 4.7 → 5 days. Seasonality rule: round up to a full week. Practical plan: 7 days; optionally exclude day 1 as novelty.
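Both worked examples plug into the same shared formula; this short sketch (helper name is illustrative) reproduces the arithmetic with the rounded z-sums used above:

```python
import math

def n_per_group(var_term, z_sum, d):
    # var_term is p*(1-p) for proportions, sigma**2 for continuous metrics
    return 2 * var_term * z_sum**2 / d**2

# Example 1: p = 0.05, 10% relative lift → d = 0.005, z_sum = 1.96 + 0.84
n1 = n_per_group(0.05 * 0.95, 2.80, 0.005)
print(round(n1))                # ≈ 29,792 users per group
print(math.ceil(n1 / 10_000))   # naive duration at 10k/day/group → 3 days

# Example 2: sigma = 3, d = 0.20, z_sum = 1.96 + 1.28
n2 = n_per_group(3**2, 3.24, 0.20)
print(round(n2))                # ≈ 4,724 users per group
```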
Example 3: Uneven weekly traffic
Per-group daily events: Mon–Thu 6k, Fri 8k, Sat 12k, Sun 12k. Weekly total = 56k per group. Required n_per_group = 50k. Depending on the start day, you can reach 50k in as few as 6 days (e.g., starting Wednesday), but stopping mid-week over-represents weekend behavior. Plan: stop only at a week boundary. One full week covers the cycle and exceeds the sample need.
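The uneven-traffic accrual in Example 3 can be checked with a short simulation (variable and function names are illustrative):

```python
import itertools
import math

# Per-group daily events from Example 3, keyed by weekday
daily = {"Mon": 6000, "Tue": 6000, "Wed": 6000, "Thu": 6000,
         "Fri": 8000, "Sat": 12000, "Sun": 12000}
order = list(daily)

def naive_days(start_day, needed):
    """Days until cumulative events reach `needed`, starting on `start_day`."""
    start = order.index(start_day)
    total = 0
    for day_count, i in enumerate(itertools.count(start), start=1):
        total += daily[order[i % 7]]
        if total >= needed:
            return day_count

days = naive_days("Wed", 50_000)
print(days)                      # 6 — naive stop lands mid-week (a Monday)
print(math.ceil(days / 7) * 7)   # 7 — round up to a full week before stopping
```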
Example 4: Holiday interference
You plan a pricing test across late November. Black Friday/Cyber Monday will spike traffic and change buyer intent. Options:
- Exclude enrollment during the holiday (blackout) and extend the run to complete a full post-holiday week.
- Run two full-week phases: pre-holiday and post-holiday; analyze each phase separately and then meta-analyze if appropriate.
How to plan duration (step-by-step)
- Define metric, baseline (p or σ), MDE (absolute), alpha, and power.
- Compute back-of-envelope n_per_group with the formulas above.
- Estimate daily events per group and get a naive day count = ceil(n_per_group / daily_events_per_group).
- Map seasonality cycles. For day-of-week effects, round up to a whole number of weeks.
- Add novelty stabilization (e.g., 1–2 days) to the calendar; either exclude it from analysis or pre-register a weighting scheme.
- Pre-register stop rules: stop only at cycle boundaries; avoid peeking-based early stopping.
- Check guardrails (e.g., error rates, latency). If guardrails degrade, define safety stop conditions.
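The computational steps above can be combined into one small planner. This is a sketch, not a standard API; the function name, output keys, and the one-week seasonality floor are illustrative choices:

```python
import math
from statistics import NormalDist

def plan_duration(p, relative_mde, daily_per_group,
                  alpha=0.05, power=0.80, novelty_days=1):
    """Back-of-envelope duration plan for a binary metric (sketch)."""
    d = relative_mde * p                      # relative → absolute MDE
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    n = 2 * p * (1 - p) * z**2 / d**2
    naive = math.ceil(n / daily_per_group)
    weeks = max(1, math.ceil(naive / 7))      # cover >= 1 full week
    return {
        "n_per_group": math.ceil(n),
        "naive_days": naive,
        "analysis_days": weeks * 7,
        "calendar_days": weeks * 7 + novelty_days,  # burn-in excluded from analysis
    }

# Reproduces Example 1's plan: analyze 7 days after a 2-day novelty burn-in
print(plan_duration(p=0.05, relative_mde=0.10, daily_per_group=10_000, novelty_days=2))
```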
Mini reference: absolute vs relative MDE
- Relative MDE → absolute: d = relative × baseline. Example: 10% of 5% = 0.5 pp = 0.005.
- For continuous metrics, specify d in the same units as the metric (e.g., $0.20).
Common mistakes and self-check
- Stopping mid-week: can bias results if weekends/weekday behavior differ. Self-check: Does the run end at a week boundary?
- Ignoring novelty: early user curiosity inflates metrics. Self-check: Compare day 1–2 vs later days; large drifts suggest novelty.
- MDE confusion: using relative MDE in a formula that needs absolute. Self-check: Confirm units of d before calculating.
- Peeking repeatedly without correction: raises false-positive risk. Self-check: Did you commit to a fixed horizon or use a sequential method?
- Traffic overestimation: assuming all users are eligible. Self-check: Use observed eligible traffic from recent weeks.
- Holiday blindness: running across major events unintentionally. Self-check: Review the calendar for launches, holidays, and campaigns.
Pre-launch checklist
- Baseline and MDE are documented; d is absolute.
- n_per_group computed for primary metric; guardrails listed.
- Daily eligible traffic per group is realistic.
- Duration covers ≥1 full week (or ≥1 full relevant cycle).
- Novelty period decided (exclude or include with note).
- Stop rule: end at cycle boundary; no unplanned peeking.
- Holiday/campaign calendar reviewed; blackout or split phases planned.
Exercises (practice here, then check solutions)
- Exercise 1: Baseline p = 0.04, target relative MDE = 12.5%, two-sided test with α = 0.05, power = 0.80, total eligible traffic = 16,000/day (50/50 split). Compute n_per_group, naive duration, then propose a seasonality-safe plan with novelty handling.
- Exercise 2: You need n_per_group = 80,000 events for your add-to-cart rate. Your per-group daily events are Mon–Thu 7,000, Fri 9,000, Sat 13,000, Sun 11,000. Plan a stop date if you start on Tuesday, considering novelty = 1 day and week-boundary stopping.
Practical projects
- Create a duration planner: a small spreadsheet that takes baseline, MDE, alpha, power, and daily traffic by weekday to output recommended end dates at week boundaries.
- Seasonality audit: analyze 8 weeks of historical metric data to quantify day-of-week lift factors; propose experiment run rules based on the findings.
- Post-hoc sensitivity: simulate different MDEs and powers to show how duration changes; present trade-offs to stakeholders.
Learning path
- Before: Hypothesis testing basics, power and sample size.
- This lesson: Duration planning and seasonality coverage.
- Next: Guardrails, novelty effects, and sequential testing considerations.
Next steps
- Use the checklist on your next A/B test plan.
- Build or adapt a duration spreadsheet for your team.
- Take the Quick Test below.
Mini challenge
You can only run for 10 days this month due to a planned campaign. Your naive calculation says 6 days is enough. Propose a plan that respects day-of-week coverage, includes novelty control, and avoids the campaign window. Write the exact calendar dates and which days you will analyze.