Who this is for
Product Analysts and experiment owners who launch or QA A/B tests and need to ensure groups are comparable before trusting results.
Prerequisites
- Basic A/B testing concepts (variants, metrics, exposures).
- Descriptive statistics (means, proportions, standard deviation).
- Comfort reading simple statistical test outputs (t-test, chi-square).
Learning path
- Understand what randomization checks are and why they matter.
- Learn core checks: Sample Ratio Mismatch (SRM), covariate balance, missingness.
- Run checks with simple formulas and thresholds.
- Diagnose and act on issues without overreacting.
- Practice with worked examples and a quick test.
Why this matters
Randomization checks tell you if your A/B test actually created comparable groups. If randomization failed or traffic was skewed, you can get misleading lifts. Real tasks where this matters:
- QA a newly launched experiment for SRM within the first hours.
- Verify device, geography, and traffic-source balance before reporting results.
- Catch instrumentation bugs that send more qualified users to one variant.
- Decide whether to continue, pause, or re-run a test after detecting imbalance.
Concept explained simply
Randomization checks are early diagnostic tests that confirm your A and B groups are similar on factors unrelated to the treatment. You compare pre-treatment attributes (e.g., device, country, prior usage) across variants using simple statistics. If differences are too large, either fix the assignment or adjust your analysis.
Mental model
Imagine two jars filled by flipping a fair coin for each marble. If the coin or the process is faulty, one jar ends up with more red marbles (certain user types). Randomization checks look inside the jars early to ensure both jars have similar marbles before judging which jar is heavier (treatment effect).
What to check
- Sample Ratio Mismatch (SRM): Are observed allocations close to the planned split (e.g., 50/50)?
- Covariate balance (pre-treatment): device, platform, country/region, traffic source, user tenure, pre-period activity/spend.
- Missingness/eligibility: are data missing at similar rates across variants?
- Exposure parity: are both groups eligible and exposed under the same rules?
Quick thresholds to remember
- SRM: chi-square test on counts, often flag if p < 0.01.
- Standardized Mean Difference (SMD) for numeric covariates: |SMD| < 0.1 generally acceptable.
- Chi-square test for categorical distributions: a large p-value gives no evidence of imbalance; flag small p-values (e.g., p < 0.05), but apply practical judgment (the sketch below turns these rules of thumb into pass/fail flags).
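A minimal Python sketch of these thresholds as pass/fail flags; the function name, inputs, and cutoffs are illustrative, not a standard API:

```python
# Illustrative cutoffs; tune to your organization's standards.
SRM_ALPHA = 0.01   # flag SRM when the chi-square p-value is below this
SMD_LIMIT = 0.10   # flag numeric covariates when |SMD| exceeds this
CAT_ALPHA = 0.05   # flag categorical covariates when the chi-square p-value is below this

def flag_checks(srm_p, smd_by_covariate, cat_p_by_covariate):
    """Return the checks/covariates that breach the thresholds above."""
    flags = {}
    if srm_p < SRM_ALPHA:
        flags["srm"] = srm_p
    for name, smd in smd_by_covariate.items():
        if abs(smd) > SMD_LIMIT:
            flags[name] = smd
    for name, p in cat_p_by_covariate.items():
        if p < CAT_ALPHA:
            flags[name] = p
    return flags

print(flag_checks(0.003, {"prior_7d_sessions": -0.04}, {"device": 0.30}))
# {'srm': 0.003}
```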
Step-by-step workflow
- Check SRM first. Compare observed counts to planned split using a chi-square goodness-of-fit test.
- Check covariate balance. For numeric variables (e.g., prior 7-day sessions), compute SMD; for categorical variables (e.g., device), use chi-square test.
- Check missingness. Compare missing rates of key fields across variants with a two-proportion test (see the sketch after this list).
- Investigate if flagged. Review assignment logic, eligibility filters, bot filters, rollout timing, and any targeting or holdouts.
- Decide. If issues are minor, proceed with covariate-adjusted analysis; if major (e.g., SRM, misrouting), pause and fix before continuing.
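For the missingness check, a pooled two-proportion z-test is enough. A minimal sketch, with hypothetical counts of users missing the country field:

```python
from math import sqrt
from scipy.stats import norm

def missingness_ztest(missing_a, n_a, missing_b, n_b):
    """Two-sided pooled two-proportion z-test comparing missing-data rates."""
    rate_a, rate_b = missing_a / n_a, missing_b / n_b
    pooled = (missing_a + missing_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / se
    return z, 2 * norm.sf(abs(z))

# Hypothetical: country is missing for 180/10,000 users in A vs 230/10,050 in B
z, p = missingness_ztest(180, 10_000, 230, 10_050)
print(f"z = {z:.2f}, p = {p:.3f}")
```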
Worked examples
Example 1: Numeric covariate — SMD
Goal: Check balance on pre-experiment 7-day sessions.
- Variant A: n=10,000, mean=4.10, sd=3.0
- Variant B: n=10,050, mean=4.22, sd=3.1
Pooled SD ≈ sqrt(((9999*3.0^2)+(10049*3.1^2))/(10000+10050-2)) ≈ 3.05. SMD = (4.10−4.22)/3.05 ≈ −0.039. |SMD|=0.039 < 0.1 ⇒ balanced.
Interpretation
The groups have very similar prior engagement; any outcome difference is unlikely due to prior sessions.
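The same calculation as a small helper, using the pooled-SD version of SMD shown above:

```python
from math import sqrt

def smd(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_sd = sqrt(((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2))
    return (mean_a - mean_b) / pooled_sd

print(round(smd(4.10, 3.0, 10_000, 4.22, 3.1, 10_050), 3))  # -0.039
```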
Example 2: Categorical covariate — device (chi-square)
Counts:
- A: Mobile=6,400, Desktop=3,200, Tablet=400
- B: Mobile=6,300, Desktop=3,500, Tablet=250
Totals: A=10,000, B=10,050. Expected counts for each device are proportional to each variant's share of total traffic. A chi-square test of independence on the 2×3 table (df = 2) gives chi-square ≈ 48.7, p < 0.001.
Interpretation
The small p-value flags a device imbalance, driven mainly by Tablet share (4.0% in A vs 2.5% in B) and Desktop share (32.0% vs 34.8%). Investigate whether tablet users are assigned, logged, or filtered differently, and judge whether a shift of this size could plausibly move the outcome metric before reporting.
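To reproduce the result, the test can be run on the 2×3 table with scipy (a minimal sketch):

```python
from scipy.stats import chi2_contingency

device_counts = [
    [6_400, 3_200, 400],  # A: Mobile, Desktop, Tablet
    [6_300, 3_500, 250],  # B: Mobile, Desktop, Tablet
]
chi2, p, dof, expected = chi2_contingency(device_counts)
print(f"chi2 = {chi2:.1f}, df = {dof}, p = {p:.2g}")
```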
Example 3: SRM
Planned 50/50 split. First 1 hour after launch: A=5,400 users, B=4,600 users (total 10,000). Under 50/50, expected each=5,000. Chi-square statistic = sum((obs−exp)^2/exp) = (400^2/5000) + ((−400)^2/5000) = 32 + 32 = 64, p < 0.001 (df=1) ⇒ SRM flagged.
Interpretation
Pause and investigate: assignment code, feature flag bucketing, geo rollouts, bots, or eligibility filters.
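The same check in code, using scipy's chi-square goodness-of-fit test (a minimal sketch):

```python
from scipy.stats import chisquare

observed = [5_400, 4_600]   # users in A and B after the first hour
expected = [5_000, 5_000]   # planned 50/50 split of 10,000 users
stat, p = chisquare(observed, f_exp=expected)
print(f"chi2 = {stat:.0f}, p = {p:.2g}")  # chi2 = 64, p far below 0.001 -> SRM flagged
```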
Exercises
Do these before taking the Quick Test. The exercises are available to everyone; only logged-in users get saved progress.
- Exercise ex1 mirrors the data and questions below.
- Exercise ex2 focuses on diagnosis and action planning.
Checklist before you start
- Identify planned split and total sample size.
- List pre-treatment covariates to check (device, country, traffic source, prior usage).
- Pick tests: chi-square for categorical, SMD for numeric, proportion test for missingness.
- Decide thresholds (e.g., SRM p < 0.01, |SMD| < 0.1).
Common mistakes and self-check
- Using post-treatment variables (e.g., conversions) for balance checks. Self-check: only use data from before exposure.
- Overreacting to tiny p-values with huge sample sizes. Self-check: also review effect sizes (SMD) and practical impact.
- Ignoring SRM because outcomes look good. Self-check: SRM questions the validity of all estimates.
- Not re-checking after rollout stages. Self-check: re-run checks after major traffic shifts.
- Multiple testing without context. Self-check: expect a few small p-values by chance; look for patterns and magnitude.
Practical projects
- Build a reusable randomization check template: inputs (counts, means/SDs, category tables) and outputs (SRM p-value, SMDs, chi-square p-values, pass/fail flags). A starter skeleton follows this list.
- Create a covariate balance dashboard for top geos, devices, and prior engagement bands.
- Draft a runbook: what to do when SRM or imbalance is detected (contacts, logs to pull, decision tree).
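One possible starting point for the template project; the function name, inputs, and defaults are illustrative, not a prescribed interface:

```python
from scipy.stats import chisquare, chi2_contingency

def randomization_report(counts, planned_split, smds, category_tables,
                         srm_alpha=0.01, smd_limit=0.10, cat_alpha=0.05):
    """Summarize SRM, numeric balance (SMD), and categorical balance checks."""
    total = sum(counts)
    expected = [total * share for share in planned_split]
    _, srm_p = chisquare(counts, f_exp=expected)

    report = {
        "srm": {"p": srm_p, "pass": srm_p >= srm_alpha},
        "numeric": {name: {"smd": v, "pass": abs(v) <= smd_limit}
                    for name, v in smds.items()},
        "categorical": {},
    }
    for name, table in category_tables.items():
        _, p, _, _ = chi2_contingency(table)
        report["categorical"][name] = {"p": p, "pass": p >= cat_alpha}
    return report

# Reusing numbers from the worked examples above
print(randomization_report(
    counts=[10_000, 10_050],
    planned_split=[0.5, 0.5],
    smds={"prior_7d_sessions": -0.039},
    category_tables={"device": [[6_400, 3_200, 400], [6_300, 3_500, 250]]},
))
```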
Mini challenge
Your test shows SRM p=0.003 in the first 2 hours, driven by more traffic in B from one country. After a geo rollout completes, SRM disappears. What do you report? Write 3 bullet points: (1) finding, (2) cause hypothesis, (3) impact on analysis plan.
Next steps
- Automate: schedule daily randomization checks for active experiments.
- Augment: add covariate-adjusted estimators to reduce variance when balanced.
- Document: include a “Randomization Checks” section in every experiment readout.
Quick Test
Ready to check your understanding? The Quick Test is available to everyone; only logged-in users get saved progress.