Why this matters
As a Data Analyst, your A/B test recommendations can change product direction, pricing, or campaign spend. Avoiding common pitfalls and biases keeps decisions trustworthy, protects revenue, and builds credibility with product and engineering teams.
- Decide whether to ship a new onboarding screen without accidentally overestimating its impact.
- Catch tracking and bucketing issues early so weeks of test data are not wasted.
- Communicate clear, defensible outcomes to stakeholders.
Concept explained simply
An A/B test is like flipping coins from two boxes to see which box has more heads. Pitfalls and biases are ways your coins get mixed up, your scale is off, or you stop counting too soon. The result looks real but is misleading.
Mental model
Think of three layers of risk:
- Assignment risks: People are not split the way you think (Sample Ratio Mismatch).
- Timing risks: You check too early or at a biased time (peeking, seasonality, novelty).
- Measurement risks: You measure the wrong unit or miscount events (wrong unit of analysis, instrumentation errors).
Core pitfalls and biases (with fixes)
Sample Ratio Mismatch (SRM)
Planned split is 50/50, but observed traffic deviates beyond random chance (e.g., 53/47 with large N). Usually caused by bucketing or tracking issues.
- Fix: Run an SRM check (chi-square or binomial test) before analyzing outcomes; a minimal check is sketched after this list.
- Typical causes: Late event firing, bots filtered unevenly, users excluded in one variant, geo/device imbalances due to routing.
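A minimal sketch of the SRM check in Python, assuming you already have the observed user counts per variant (the counts below are illustrative and match Worked example 1 further down). It uses scipy's chi-square goodness-of-fit test against the planned split; flagging SRM below p < 0.001 is a common convention, not a hard rule.

```python
# Minimal SRM check: compare observed assignment counts to the planned split.
# Counts below are illustrative; replace with your experiment's numbers.
from scipy.stats import chisquare

observed = [102_340, 97_660]          # users actually bucketed into A and B
planned_ratio = [0.5, 0.5]            # planned 50/50 split
total = sum(observed)
expected = [r * total for r in planned_ratio]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.1f}, p = {p_value:.2e}")

# A tiny p-value (p < 0.001 is a common alert threshold) signals SRM:
# stop and investigate bucketing/tracking before reading any outcome metrics.
if p_value < 0.001:
    print("Possible SRM - do not analyze outcomes yet.")
```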
Peeking / Stopping early
Repeatedly checking p-values and stopping when significant inflates false positives.
- Fix: Predefine minimum sample size and duration, or use sequential methods with alpha spending. Do not stop on the first p < 0.05 if you planned a fixed-horizon test.
Multiple comparisons
Testing many metrics/segments/variants increases Type I error.
- Fix: Pre-register a primary metric; treat others as secondary/exploratory. Consider correction methods or false discovery control if making multiple simultaneous decisions.
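A minimal sketch of false discovery control across secondary metrics using statsmodels' multipletests with the Benjamini-Hochberg method; the p-values below are made-up placeholders for illustration.

```python
# Sketch: control false discoveries across several secondary metrics.
# The p-values below are illustrative placeholders, not real results.
from statsmodels.stats.multitest import multipletests

secondary_pvalues = [0.04, 0.012, 0.30, 0.049, 0.21]  # hypothetical secondary metrics

reject, adjusted, _, _ = multipletests(secondary_pvalues, alpha=0.05, method="fdr_bh")

for raw, adj, sig in zip(secondary_pvalues, adjusted, reject):
    print(f"raw p={raw:.3f}  BH-adjusted p={adj:.3f}  significant={sig}")
# Metrics that looked "significant" on raw p-values often drop out after adjustment,
# which is the point: fewer false discoveries when scanning many metrics at once.
```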
Wrong unit of randomization/analysis
Randomizing by session but analyzing by user (or vice versa) breaks independence and can inflate variance or bias.
- Fix: Align randomization and analysis unit. If users see both variants across sessions, expect dilution and interference.
Seasonality and timing
Holidays, weekends, paydays, marketing bursts, or product outages can skew results.
- Fix: Run full business cycles when relevant, include guardrail metrics, and check for time trends before concluding.
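A minimal pandas sketch of the time-trend check, assuming an exposure-level table with hypothetical columns date, variant, and converted; it computes the daily lift so you can spot holiday spikes, outages, or drift before trusting the pooled average.

```python
# Sketch: check for time trends before trusting the pooled result.
# Assumes a DataFrame `df` with hypothetical columns: date, variant ("A"/"B"), converted (0/1).
import pandas as pd

def daily_lift(df: pd.DataFrame) -> pd.DataFrame:
    daily = (
        df.groupby(["date", "variant"])["converted"]
          .mean()
          .unstack("variant")          # columns: A and B conversion rates per day
    )
    daily["abs_lift"] = daily["B"] - daily["A"]
    return daily

# print(daily_lift(df))  # look for holiday spikes, outages, or a fading novelty effect
```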
Novelty and learning effects
Users (and internal teams) adapt over time. Early spikes or dips may fade.
- Fix: Run long enough for stabilization; evaluate trend during the test window, not just overall averages.
Selection and instrumentation bias
Biased inclusion/exclusion rules or broken tracking create misleading effects.
- Fix: Verify event completeness across variants; audit funnels; compare pre-experiment baselines between groups.
Interference and contamination
Users in different variants affect each other (e.g., referrals, shared carts, social features).
- Fix: Randomize at a higher unit (e.g., user or geo), or design cluster experiments.
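A minimal sketch of deterministic cluster-level bucketing via hashing, assuming you randomize whole geos; the cluster ids and salt below are illustrative, not a specific platform's API.

```python
# Sketch: assign whole clusters (e.g., geos) to a variant so connected users
# share the same experience and interference stays within a variant.
import hashlib

def assign_cluster(cluster_id: str, salt: str = "geo-experiment-2024") -> str:
    """Deterministically map a cluster (not an individual user) to A or B."""
    digest = hashlib.sha256(f"{salt}:{cluster_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

print(assign_cluster("city-berlin"))   # every user in this geo gets the same variant
print(assign_cluster("city-madrid"))
```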
Underpowered tests
Too few users or too short a duration leads to noisy results and flip-flopping significance.
- Fix: Power analysis before launch; set minimum detectable effect (MDE) and duration.
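A minimal power-analysis sketch with statsmodels, assuming a baseline conversion of 5% and an MDE of +0.5 percentage points (both illustrative); it returns the users needed per variant for 80% power at alpha = 0.05, two-sided.

```python
# Sketch: pre-launch sample-size calculation for a two-proportion test.
# Baseline rate and MDE below are illustrative assumptions, not real numbers.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.05          # assumed control conversion rate
mde_abs = 0.005          # minimum detectable effect: +0.5 percentage points
effect = proportion_effectsize(baseline + mde_abs, baseline)  # Cohen's h

n_per_variant = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, ratio=1.0, alternative="two-sided"
)
print(f"Required users per variant: {n_per_variant:,.0f}")
# Divide by expected daily traffic per variant to get the minimum duration in days.
```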
One-tailed vs two-tailed misuse
Choosing one-tailed after seeing the direction of effect inflates false positives.
- Fix: Predefine test type. Use two-tailed unless you have a strict directional hypothesis and will not consider the opposite outcome.
Worked examples
Example 1: SRM check
Plan: 50/50 split. Observed: A=102,340 users, B=97,660 users (total 200,000).
- Expected per group at 50/50: 100,000 each.
- Chi-square (1 df) ≈ 2 × (2340² / 100,000) ≈ 109.5; p-value ≪ 0.001.
- Conclusion: SRM. Do not trust outcome metrics; investigate bucketing/tracking.
Example 2: Peeking pitfall
Plan: 14-day test. At day 5, p=0.047 on the primary metric.
- Because you planned a fixed horizon, early stopping raises false positives.
- Conclusion: Continue to the planned duration or use a pre-specified sequential method next time.
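A minimal simulation of the peeking problem, assuming both variants share the same true conversion rate, so any "significant" stop is a false positive. It runs a two-proportion z-test at every daily peek of a 14-day test and counts how often early stopping would have triggered; the traffic and rate numbers are illustrative.

```python
# Sketch: simulate how daily peeking inflates false positives under a true null.
# Both arms use the same conversion rate, so every early "win" is a false positive.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
DAYS, USERS_PER_DAY, RATE, RUNS = 14, 2_000, 0.05, 2_000

def peeks_falsely(rng) -> bool:
    a = b = a_n = b_n = 0
    for _ in range(DAYS):
        a += rng.binomial(USERS_PER_DAY, RATE); a_n += USERS_PER_DAY
        b += rng.binomial(USERS_PER_DAY, RATE); b_n += USERS_PER_DAY
        p_pool = (a + b) / (a_n + b_n)
        se = np.sqrt(p_pool * (1 - p_pool) * (1 / a_n + 1 / b_n))
        z = (a / a_n - b / b_n) / se
        if 2 * norm.sf(abs(z)) < 0.05:   # "significant" at today's peek
            return True
    return False

false_stops = sum(peeks_falsely(rng) for _ in range(RUNS))
print(f"Stopped early on a false positive in {false_stops / RUNS:.1%} of runs (nominal alpha: 5%).")
```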
Example 3: Wrong unit of analysis
You randomize by session, so heavy users open many sessions and can see both variants. You then analyze by user and attribute each user to the variant of their best-performing session.
- Result: The effect is biased upward (a winner's curse on users' best sessions).
- Fix: Randomize and analyze consistently at the user level, or restrict to first exposure.
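A minimal pandas sketch of the first-exposure fix, assuming an exposure log with hypothetical columns user_id, timestamp, variant, and converted; it keeps each user's first exposure only and aggregates at the user level so the analysis unit matches the intent.

```python
# Sketch: restrict to first exposure per user and analyze at the user level,
# so the analysis unit matches the randomization unit.
# Assumes a DataFrame `exposures` with hypothetical columns:
# user_id, timestamp, variant, converted (0/1).
import pandas as pd

def user_level_results(exposures: pd.DataFrame) -> pd.DataFrame:
    first = (
        exposures.sort_values("timestamp")
                 .drop_duplicates(subset="user_id", keep="first")  # first exposure only
    )
    return (
        first.groupby("variant")["converted"]
             .agg(users="count", conversion_rate="mean")
    )

# print(user_level_results(exposures))
```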
Example 4: Novelty effect
A new UI shows −3% conversion in the first 3 days and no difference thereafter; days 1–3 account for 80% of the run.
- A short test would conclude harm; a longer window shows stabilization.
- Fix: Ensure duration covers the adjustment period.
Checklists you can use
Pre-flight (before launch)
- ☐ Define one primary metric and success criteria.
- ☐ Set MDE, sample size, and minimum duration.
- ☐ Choose unit of randomization and confirm analysis unit matches.
- ☐ Validate tracking events fire equally in A and B on a test environment.
- ☐ Plan guardrail metrics (e.g., error rate, latency).
- ☐ Document stopping rules (fixed or sequential).
Post-run (before reading outcomes)
- ☐ SRM check passes.
- ☐ No major outages or marketing shocks skewed traffic.
- ☐ Exposure integrity verified (users not bouncing between variants unintentionally).
- ☐ Check time trends; ensure no novelty-only effects.
- ☐ Analyze pre-specified primary metric first; label others as exploratory.
Exercises
Try these and then open the solutions to compare your approach.
Exercise 1: Spot SRM quickly
Plan: 50/50 split. Observed users: Variant A = 62,700; Variant B = 57,300 (total 120,000). Should you proceed with outcome analysis?
Exercise 2: Early stop or not?
Plan: 2-week test. You pre-committed to a fixed-horizon analysis. On day 6, your dashboard shows p=0.03 improvement for the primary metric. What do you do, and why?
- ☐ After solving, compare to the official solution below.
- ☐ If unsure, revisit the Pre-flight and Post-run checklists.
Solutions for Exercises
Exercise 1 solution
Expected at 50/50: 60,000 each. Chi-square (1 df) ≈ 2 × (2700² / 60,000) ≈ 243 (very high); p ≪ 0.001. This is SRM. Do not analyze outcomes; investigate assignment/tracking.
Exercise 2 solution
Do not stop. You planned a fixed-horizon test; peeking inflates false positives. Continue to the planned duration or redesign with a sequential method and alpha spending in the future.
Common mistakes and how to self-check
- Mistake: Declaring a win on a mid-test p-value. Self-check: Did we predefine stopping rules?
- Mistake: Ignoring SRM. Self-check: Run SRM before any outcome analysis.
- Mistake: Mixing units (session vs user). Self-check: Does the analysis unit match assignment?
- Mistake: Fishing across many segments. Self-check: Is the result on the pre-registered primary metric?
- Mistake: Short tests during volatile periods. Self-check: Any shocks in the test window?
- Mistake: Broken event tracking in one variant. Self-check: Compare event fire rates on null periods or placebo events.
Practical projects
- Build an SRM checker in a spreadsheet: inputs (A count, B count, planned ratio) and auto-calculated p-value.
- Simulate peeking: generate two equal Bernoulli streams, check significance daily, and log how often you would have stopped incorrectly.
- Create a pre-launch A/B testing checklist tailored to your product’s metrics and systems.
- Design a cluster experiment plan for a feature with social interference (e.g., referrals).
Who this is for
- Data Analysts who run or interpret A/B tests.
- Product Managers who must make launch decisions.
- Engineers building experiment assignment and logging.
Prerequisites
- Basic probability and p-values.
- Understanding of experiment assignment and metrics.
- Comfort with reading dashboards and event logs.
Learning path
- Review experiment design basics (randomization, metrics, power).
- Learn to run SRM checks and verify instrumentation.
- Practice with time and seasonality diagnostics.
- Handle multiple comparisons and pre-register primary metrics.
- Align unit of randomization and analysis.
- Communicate results with caveats and next steps.
Mini challenge
Scenario: A test ran for 5 days during a holiday sale. Planned split 50/50. Observed: A=154,000 users, B=146,000 users. Primary metric is purchase conversion. Analyst stopped early at p=0.049 after checking daily. Secondary metrics (AOV, retention) were also scanned; one showed p=0.04. List at least four issues and how you would fix them next time.
Show one possible answer
- SRM likely (counts differ at large N) — run SRM and fix bucketing/tracking before outcomes.
- Peeking with fixed-horizon plan — predefine stopping rules or use sequential methods.
- Holiday confound — include full business cycle or avoid unusual periods.
- Multiple comparisons on secondary metrics — pre-register one primary metric; correct or label exploratory.
Next steps
- Use the checklists for your next experiment.
- Complete the quick test below to confirm understanding.
- Move on to deeper topics like test power and variance reduction after you can reliably pass SRM and peeking checks.