
Common Pitfalls and Biases

Learn Common Pitfalls and Biases for free with explanations, exercises, and a quick test (for Data Analysts).

Published: December 20, 2025 | Updated: December 20, 2025

Why this matters

As a Data Analyst, your A/B test recommendations can change product direction, pricing, or campaign spend. Avoiding common pitfalls and biases keeps decisions trustworthy, protects revenue, and builds credibility with product and engineering teams.

  • Decide whether to ship a new onboarding screen without accidentally overestimating its impact.
  • Catch tracking and bucketing issues early so weeks of test data are not wasted.
  • Communicate clear, defensible outcomes to stakeholders.

Concept explained simply

An A/B test is like flipping coins from two boxes to see which box has more heads. Pitfalls and biases are ways your coins get mixed up, your scale is off, or you stop counting too soon. The result looks real but is misleading.

Mental model

Think of three layers of risk:

  • Assignment risks: People are not split the way you think (Sample Ratio Mismatch).
  • Timing risks: You check too early or at a biased time (peeking, seasonality, novelty).
  • Measurement risks: You measure the wrong unit or miscount events (wrong unit of analysis, instrumentation errors).

Core pitfalls and biases (with fixes)

Sample Ratio Mismatch (SRM)

Planned split is 50/50, but observed traffic deviates beyond random chance (e.g., 53/47 with large N). Usually caused by bucketing or tracking issues.

  • Fix: Run an SRM check (chi-square or binomial test) before analyzing outcomes.
  • Typical causes: Late event firing, bots filtered unevenly, users excluded in one variant, geo/device imbalances due to routing.

Peeking / Stopping early

Repeatedly checking p-values and stopping when significant inflates false positives.

  • Fix: Predefine minimum sample size and duration, or use sequential methods with alpha spending. Do not stop on the first p < 0.05 if you planned a fixed-horizon test.

Multiple comparisons

Testing many metrics/segments/variants increases Type I error.

  • Fix: Pre-register a primary metric; treat others as secondary/exploratory. Consider correction methods or false discovery control if making multiple simultaneous decisions.
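
As an illustration, here is a minimal sketch of false discovery control across several secondary metrics in Python, assuming statsmodels is available; the p-values are made up:

```python
# Illustrative Benjamini-Hochberg correction across secondary metrics.
# The p-values below are hypothetical, not from a real experiment.
from statsmodels.stats.multitest import multipletests

secondary_p_values = [0.012, 0.049, 0.20, 0.03, 0.61]  # one per secondary metric
reject, p_adjusted, _, _ = multipletests(secondary_p_values, alpha=0.05, method="fdr_bh")

for raw, adj, significant in zip(secondary_p_values, p_adjusted, reject):
    print(f"raw p = {raw:.3f} -> adjusted p = {adj:.3f}, flag: {significant}")
```
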
Wrong unit of randomization/analysis

Randomizing by session but analyzing by user (or vice versa) breaks independence and can inflate variance or bias.

  • Fix: Align randomization and analysis unit. If users see both variants across sessions, expect dilution and interference.

Seasonality and timing

Holidays, weekends, paydays, marketing bursts, or product outages can skew results.

  • Fix: Run full business cycles when relevant, include guardrail metrics, and check for time trends before concluding.
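
As a sketch, here is one way to check for time trends in Python, assuming an event-level dataframe with hypothetical columns date, variant, and converted, and variants labeled A and B (the file name is also illustrative):

```python
# Minimal daily-trend check before concluding; file, column, and variant names are hypothetical.
import pandas as pd

events = pd.read_csv("experiment_events.csv", parse_dates=["date"])

# Daily conversion rate per variant, so weekend/holiday swings become visible.
daily = (
    events.groupby([events["date"].dt.date, "variant"])["converted"]
    .mean()
    .unstack("variant")
)
daily["lift_B_minus_A"] = daily["B"] - daily["A"]  # does the lift drift over time?
print(daily)
```
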
Novelty and learning effects

Users (and internal teams) adapt over time. Early spikes or dips may fade.

  • Fix: Run long enough for stabilization; evaluate trend during the test window, not just overall averages.

Selection and instrumentation bias

Biased inclusion/exclusion rules or broken tracking create misleading effects.

  • Fix: Verify event completeness across variants; audit funnels; compare pre-experiment baselines between groups.

Interference and contamination

Users in different variants affect each other (e.g., referrals, shared carts, social features).

  • Fix: Randomize at a higher unit (e.g., user or geo), or design cluster experiments.
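
One common way to randomize at a higher unit is deterministic hashing of the cluster ID; a minimal sketch, where the salt and geo ID are made-up placeholders:

```python
# Deterministic cluster-level assignment: every user in a geo gets the same variant.
# The salt and geo ID below are illustrative placeholders.
import hashlib

def assign_variant(cluster_id: str, salt: str = "onboarding_test_2025") -> str:
    digest = hashlib.sha256(f"{salt}:{cluster_id}".encode()).hexdigest()
    return "B" if int(digest, 16) % 2 == 1 else "A"

print(assign_variant("geo_US_CA"))  # the same input always maps to the same variant
```
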
Underpowered tests

Too few users or too short a duration leads to noisy results and flip-flopping significance.

  • Fix: Power analysis before launch; set minimum detectable effect (MDE) and duration.
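
A minimal pre-launch power calculation in Python, assuming statsmodels and illustrative numbers (10% baseline conversion, 1 percentage-point MDE):

```python
# Sketch of a sample-size calculation for a two-proportion test.
# The baseline rate and MDE below are illustrative assumptions.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10          # current conversion rate
mde = 0.01               # smallest lift worth detecting (10% -> 11%)
effect_size = proportion_effectsize(baseline + mde, baseline)

n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, alternative="two-sided"
)
print(f"Users needed per variant: {n_per_variant:,.0f}")
```
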
One-tailed vs two-tailed misuse

Choosing one-tailed after seeing the direction of effect inflates false positives.

  • Fix: Predefine test type. Use two-tailed unless you have a strict directional hypothesis and will not consider the opposite outcome.
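
To see why the choice must be made up front, compare the two readings of the same test statistic; the z value here is made up:

```python
# The same z-statistic read one-tailed vs two-tailed (z value is illustrative).
from scipy.stats import norm

z = 1.80
p_two_tailed = 2 * norm.sf(abs(z))  # ~0.072: not significant at alpha = 0.05
p_one_tailed = norm.sf(z)           # ~0.036: "significant" only if the tail was picked after seeing the direction
print(f"two-tailed p = {p_two_tailed:.3f}, one-tailed p = {p_one_tailed:.3f}")
```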

Worked examples

Example 1: SRM check

Plan: 50/50 split. Observed: A=102,340 users, B=97,660 users (total 200,000).

  • Expected per group at 50/50: 100,000 each.
  • Chi-square ≈ (2340^2 / 100,000) + ((−2340)^2 / 100,000) = 2 × (2340^2 / 100,000) ≈ 109.5. p-value ≪ 0.001.
  • Conclusion: SRM. Do not trust outcome metrics; investigate bucketing/tracking.
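
A minimal version of this check in Python, assuming scipy is available; the counts are the ones from the example:

```python
# SRM check for Example 1: planned 50/50 split, observed counts from the example.
from scipy.stats import chisquare

observed = [102_340, 97_660]          # users in A and B
expected = [sum(observed) * 0.5] * 2  # planned 50/50 split

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.1f}, p = {p_value:.3g}")
# chi-square ≈ 109.5 with p ≪ 0.001 -> SRM; investigate bucketing before reading outcomes.
```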

Example 2: Peeking pitfall

Plan: 14-day test. At day 5, p=0.047 on the primary metric.

  • Because you planned a fixed horizon, early stopping raises false positives.
  • Conclusion: Continue to the planned duration or use a pre-specified sequential method next time.
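
The inflation is easy to see in a simulation under the null (no real difference); a rough sketch with daily z-tests on proportions, where the traffic numbers and baseline rate are illustrative:

```python
# Simulate daily peeking under the null: two identical Bernoulli streams,
# a z-test on proportions each day, stopping at the first p < 0.05.
# Traffic volume and baseline rate are illustrative assumptions.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
runs, days, users_per_day, p_true = 2000, 14, 2000, 0.10
false_positives = 0

for _ in range(runs):
    a = rng.binomial(users_per_day, p_true, size=days).cumsum()  # cumulative conversions, A
    b = rng.binomial(users_per_day, p_true, size=days).cumsum()  # cumulative conversions, B
    n = users_per_day * np.arange(1, days + 1)                   # cumulative users per variant
    pooled = (a + b) / (2 * n)
    se = np.sqrt(pooled * (1 - pooled) * 2 / n)
    z = (a / n - b / n) / se
    p_daily = 2 * norm.sf(np.abs(z))
    if (p_daily < 0.05).any():        # "stop at the first significant day"
        false_positives += 1

print(f"False positive rate with daily peeking: {false_positives / runs:.1%}")
# Typically well above the nominal 5% of a fixed-horizon test.
```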

Example 3: Wrong unit of analysis

Randomize by session; heavy users open many sessions and can see both variants. You analyze by user and attribute the best session to treatment.

  • Result: Biased upward effect (winner’s curse on best sessions).
  • Fix: Randomize and analyze consistently at the user level, or restrict to first exposure.

Example 4: Novelty effect

A new UI shows a −3% change in conversion during the first 3 days, then flat results thereafter. Days 1–3 accounted for 80% of the run.

  • A short test would conclude harm; a longer window shows stabilization.
  • Fix: Ensure duration covers the adjustment period.

Checklists you can use

Pre-flight (before launch)

  • ☐ Define one primary metric and success criteria.
  • ☐ Set MDE, sample size, and minimum duration.
  • ☐ Choose unit of randomization and confirm analysis unit matches.
  • ☐ Validate tracking events fire equally in A and B on a test environment.
  • ☐ Plan guardrail metrics (e.g., error rate, latency).
  • ☐ Document stopping rules (fixed or sequential).

Post-run (before reading outcomes)

  • ☐ SRM check passes.
  • ☐ No major outages or marketing shocks skewed traffic.
  • ☐ Exposure integrity verified (users not bouncing between variants unintentionally).
  • ☐ Check time trends; ensure no novelty-only effects.
  • ☐ Analyze pre-specified primary metric first; label others as exploratory.

Exercises

Try these and then open the solutions to compare your approach.

Exercise 1: Spot SRM quickly

Plan: 50/50 split. Observed users: Variant A = 62,700; Variant B = 57,300 (total 120,000). Should you proceed with outcome analysis?

Exercise 2: Early stop or not?

Plan: 2-week test. You pre-committed to a fixed-horizon analysis. On day 6, your dashboard shows p=0.03 improvement for the primary metric. What do you do, and why?

  • ☐ After solving, compare to the official solution below.
  • ☐ If unsure, revisit the Pre-flight and Post-run checklists.

Solutions for Exercises

Exercise 1 solution

Expected at 50/50: 60,000 each. Chi-square ≈ (2700^2/60,000) × 2 ≈ 243 (very high); p ≪ 0.001. This is SRM. Do not analyze outcomes; investigate assignment/tracking.

Exercise 2 solution

Do not stop. You planned a fixed-horizon test; peeking inflates false positives. Continue to the planned duration or redesign with a sequential method and alpha spending in the future.

Common mistakes and how to self-check

  • Mistake: Declaring a win on a mid-test p-value. Self-check: Did we predefine stopping rules?
  • Mistake: Ignoring SRM. Self-check: Run SRM before any outcome analysis.
  • Mistake: Mixing units (session vs user). Self-check: Does the analysis unit match assignment?
  • Mistake: Fishing across many segments. Self-check: Is the result on the pre-registered primary metric?
  • Mistake: Short tests during volatile periods. Self-check: Any shocks in the test window?
  • Mistake: Broken event tracking in one variant. Self-check: Compare event fire rates on null periods or placebo events.

Practical projects

  • Build an SRM checker in a spreadsheet: inputs (A count, B count, planned ratio) and auto-calculated p-value.
  • Simulate peeking: generate two equal Bernoulli streams, check significance daily, and log how often you would have stopped incorrectly.
  • Create a pre-launch A/B testing checklist tailored to your product’s metrics and systems.
  • Design a cluster experiment plan for a feature with social interference (e.g., referrals).

Who this is for

  • Data Analysts who run or interpret A/B tests.
  • Product Managers who must make launch decisions.
  • Engineers building experiment assignment and logging.

Prerequisites

  • Basic probability and p-values.
  • Understanding of experiment assignment and metrics.
  • Comfort with reading dashboards and event logs.

Learning path

  1. Review experiment design basics (randomization, metrics, power).
  2. Learn to run SRM checks and verify instrumentation.
  3. Practice with time and seasonality diagnostics.
  4. Handle multiple comparisons and pre-register primary metrics.
  5. Align unit of randomization and analysis.
  6. Communicate results with caveats and next steps.

Mini challenge

Scenario: A test ran for 5 days during a holiday sale. Planned split 50/50. Observed: A=154,000 users, B=146,000 users. Primary metric is purchase conversion. Analyst stopped early at p=0.049 after checking daily. Secondary metrics (AOV, retention) were also scanned; one showed p=0.04. List at least four issues and how you would fix them next time.

Show one possible answer
  • SRM likely (counts differ at large N) — run SRM and fix bucketing/tracking before outcomes.
  • Peeking with fixed-horizon plan — predefine stopping rules or use sequential methods.
  • Holiday confound — include full business cycle or avoid unusual periods.
  • Multiple comparisons on secondary metrics — pre-register one primary metric; correct or label exploratory.

Next steps

  • Use the checklists for your next experiment.
  • Complete the quick test below to confirm understanding. The quick test is available to everyone; only logged-in users have saved progress.
  • Move on to deeper topics like test power and variance reduction after you can reliably pass SRM and peeking checks.

Common Pitfalls and Biases — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
