Why this matters
Clean, trustworthy data is the foundation of any A/B test. Without quality checks, you risk shipping the wrong variant, losing money, or eroding user trust.
- As a Data Analyst, you will validate randomization, monitor traffic splits, catch tracking breaks, deduplicate events, and confirm metrics are computed correctly before making go/no-go calls.
- Business impact: avoiding bad launches, detecting bugs early, and making confident decisions.
Concept explained simply
Data quality checks are quick guardrails before, during, and after an experiment to ensure the numbers you analyze match reality.
Mental model: The Quality Gate
- Collect: Events fire as expected for all variants.
- Assign: Users are randomly assigned at the intended split (e.g., 50/50).
- Track: No missing, duplicated, or delayed events.
- Aggregate: Metrics compute on the same population and time window.
- Analyze: Invariant metrics are balanced and assumptions hold.
If any gate fails, pause and fix before trusting results.
What to check before, during, and after an A/B test
Before launch: readiness checks
- Eligibility & exposure: Define who can enter the test (e.g., new users only). Ensure each user enters only once.
- Tracking validation: Trigger key events (view, click, purchase) in a test environment; confirm they arrive with correct user_id, variant, and timestamp.
- Duplicate protection: Decide dedupe keys (e.g., user_id + day for conversion) and document them.
- Time windows: Confirm consistent timezone and attribution window (e.g., 7-day conversion).
- Guardrail metrics: Choose safety metrics (e.g., crash rate, refund rate) to monitor for harm.
- Plan SRM (Sample Ratio Mismatch) thresholds: Typically flag at p < 0.01 for a 50/50 split.
During the test: monitoring
- Traffic split (SRM): Check daily if allocation matches plan (e.g., 50/50). Investigate if not.
- Event completeness: Compare event volumes to baseline. Watch for sudden drops (broken tracking) or spikes (bots/duplication).
- Latency & backfills: Verify expected reporting delays. Avoid acting on partial data.
- Duplicates/outliers: Track unusual per-user event bursts; apply dedupe rules.
- Invariant metrics balance: Device mix, geography, new vs returning should be similar across variants.
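The event-completeness check above can be sketched as a simple trailing-mean threshold rule. This is a minimal illustration; the counts and the 30% threshold are made up, and a production monitor would account for weekly seasonality.

```python
def volume_alert(daily_counts, threshold=0.30):
    """Flag the latest day if its event volume deviates from the trailing mean.

    daily_counts: chronological list of daily event counts.
    threshold: relative deviation that triggers an alert (illustrative default).
    """
    *history, today = daily_counts
    baseline = sum(history) / len(history)  # trailing mean of prior days
    deviation = (today - baseline) / baseline
    return abs(deviation) > threshold, deviation

# A sudden drop on the last day suggests broken tracking.
alert, dev = volume_alert([10_200, 9_900, 10_050, 10_150, 6_400])
print(alert, round(dev, 2))  # True -0.36
```

A spike in the positive direction trips the same rule, which is the bot/duplication case.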
Before analysis: final validations
- Final SRM check on exposures.
- De-duplicate conversions based on the documented rule (e.g., user_id + day).
- Filter bots/test accounts and out-of-scope users (not meeting eligibility).
- Consistent denominator: Ensure metrics use the same population (e.g., unique users exposed).
- Stable window: Lock analysis window and attribution consistently across variants.
- Overlaps: Check if users were in conflicting experiments that could bias results; if yes, segment or exclude.
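The consistent-denominator and bot-filter rules can be sketched against a hypothetical exposure log (the field names and rows are invented for illustration):

```python
# Hypothetical exposure log: (user_id, variant, is_bot).
exposures = [
    ("u1", "A", False), ("u2", "A", False), ("u3", "A", True),
    ("u4", "B", False), ("u5", "B", False), ("u1", "A", False),  # u1 exposed twice
]

def exposed_users(rows, variant):
    """Unique, non-bot exposed users — the shared denominator for all metrics."""
    return {uid for uid, v, bot in rows if v == variant and not bot}

print(len(exposed_users(exposures, "A")))  # 2 (u3 is a bot, u1 counted once)
print(len(exposed_users(exposures, "B")))  # 2
```

Computing every metric over the same `exposed_users` set prevents silent denominator drift between metrics and variants.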
Worked examples
Example 1: SRM (Sample Ratio Mismatch)
Planned split: 50/50. Observed exposures: Control=48,937; Variant=51,963; Total=100,900.
- Expected per group: 100,900/2 = 50,450.
- Compute chi-square: sum((obs - exp)^2 / exp) over both groups.
- Control: (48,937 - 50,450)^2 / 50,450; Variant: (51,963 - 50,450)^2 / 50,450.
- Result ≈ 90.8 (df=1) → p-value far < 0.01 → SRM detected.
- Action: Pause analysis; investigate assignment bugs, eligibility filters, traffic routing.
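The SRM check above can be scripted with the standard library alone, using the fact that for df=1 the chi-square survival function equals erfc(sqrt(x/2)). The function name and threshold default are illustrative.

```python
from math import erfc, sqrt

def srm_check(obs_a, obs_b, p_threshold=0.01):
    """Chi-square SRM check for a planned 50/50 split (df=1)."""
    total = obs_a + obs_b
    expected = total / 2
    stat = (obs_a - expected) ** 2 / expected + (obs_b - expected) ** 2 / expected
    # For df=1, the chi-square survival function is erfc(sqrt(x/2)).
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value, p_value < p_threshold

stat, p, srm = srm_check(48_937, 51_963)
print(round(stat, 1), srm)  # 90.8 True
```

For other planned splits (e.g., 90/10), replace the equal expected counts with the planned proportions times the total.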
Example 2: Deduplicating conversions
Goal: Conversion is per-user success. Some users fire multiple conversion events.
- Variant A users (5): conversions by user = [1, 2 (dup), 0, 1, 0] → unique converters = 3/5 = 60%.
- Variant B users (5): conversions by user = [1, 1, 3 (dup), 0, 0] → unique converters = 3/5 = 60%.
- Naive event counting would overstate B (due to the one user firing 3 events). Dedupe keeps one conversion per user as defined.
Action: Always compute conversion rate as unique converters / exposed users unless your metric definition differs.
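The dedupe rule in Example 2 can be sketched with a hypothetical event list (user ids are made up; the per-user counts mirror the example above):

```python
# Hypothetical conversion events: (user_id, variant). a2 fired twice, b3 three times.
conversion_events = [
    ("a1", "A"), ("a2", "A"), ("a2", "A"), ("a4", "A"),
    ("b1", "B"), ("b2", "B"), ("b3", "B"), ("b3", "B"), ("b3", "B"),
]
exposed = {"A": 5, "B": 5}  # unique users exposed per variant

def conversion_rate(events, variant, n_exposed):
    # A set deduplicates automatically: one conversion per user, maximum.
    converters = {user for user, v in events if v == variant}
    return len(converters) / n_exposed

print(conversion_rate(conversion_events, "A", exposed["A"]))  # 0.6
print(conversion_rate(conversion_events, "B", exposed["B"]))  # 0.6
```

Counting raw events instead would give A = 4/5 and B = 5/5, overstating B exactly as the example warns.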
Example 3: Timezone misalignment
Control timestamps are logged in UTC, Variant timestamps in local time. Weekend traffic looks higher in the Variant group.
- Symptom: Daily seasonality patterns differ across variants.
- Check: Compare timestamp distributions; verify timezone headers and ETL transformations.
- Fix: Standardize timezone pre-aggregation (e.g., convert all to UTC) and rebuild aggregates.
Hands-on exercises
Do these to practice. Model solutions are provided below each exercise.
Exercise 1 — Detect SRM and decide action
Given: Planned split 50/50. Observed exposures: Control = 48,937; Variant = 51,963.
- Your task: Perform an SRM check using a chi-square (df=1) or a z-test for proportions.
- Decision rule: Flag SRM if p < 0.01.
- Deliverable: A one-line verdict and your reasoning.
Show solution
Total = 100,900; expected per group = 50,450. Deviations = ±1,513.
Chi-square ≈ 2*(1,513^2 / 50,450) ≈ 90.8 → p-value ≪ 0.01 → SRM detected.
Verdict: SRM detected (p < 0.01). Pause and investigate randomization/traffic routing.
Exercise 2 — Deduplicate conversions and recompute rate
Dataset (10 users): Variant A users have conversions per user of [1, 2, 0, 1, 0]; Variant B users have [1, 1, 3, 0, 0]. Conversion is defined per user (cap at one conversion per user).
- Your task: Compute conversion rate per group after dedupe.
- Deliverable: CR_A, CR_B and a short note about impact of dedupe.
Show solution
A unique converters: users with ≥1 conversion = 3/5 → 60%.
B unique converters: users with ≥1 conversion = 3/5 → 60%.
Note: Without dedupe, B would appear higher due to multiple events from one user.
Checklists
Pre-launch checklist
- [ ] Eligibility is defined and implemented.
- [ ] Key events fire with correct user_id, variant, and timestamp.
- [ ] Dedupe rule written (e.g., user_id + day).
- [ ] Timezone and attribution windows fixed and documented.
- [ ] Guardrails set (e.g., crash/refund rate).
- [ ] SRM threshold and monitoring plan agreed.
Daily monitoring checklist
- [ ] SRM within tolerance.
- [ ] Event volumes stable vs baseline.
- [ ] No spikes from bots or duplication.
- [ ] Invariant metrics balanced (device, geo, new/returning).
- [ ] Data latency within expected bounds.
Pre-analysis checklist
- [ ] Final SRM check passed.
- [ ] Duplicates removed per rule.
- [ ] Bots/test accounts excluded.
- [ ] Same population and time window across variants.
- [ ] Overlapping experiments considered.
Common mistakes and how to self-check
1) Ignoring SRM
Self-check: Run chi-square on exposures. If p < 0.01, pause and debug assignment.
2) Counting events, not users
Self-check: Confirm metric definition. For user-level conversion, dedupe to one per user (or per user-day if defined).
3) Inconsistent timezones
Self-check: Plot hourly/daily patterns per variant. Convert all timestamps to a single timezone before aggregation.
4) Missing tracking in one variant
Self-check: Compare event firing rates by variant. A sudden drop in only one variant indicates a tracking bug.
5) Population mismatch
Self-check: Verify inclusion criteria applied identically. Compare invariant distributions (device, geo, returning).
6) Overlapping experiments contamination
Self-check: Check experiment enrollment overlaps. Re-run analysis on non-overlapping users or segment by overlap.
7) Bot/QA traffic
Self-check: Filter known test accounts, unrealistic activity, and suspicious IP/device patterns.
8) Inconsistent denominators
Self-check: Ensure both variants use the same denominator (e.g., unique exposed users).
Practical projects
- Build an SRM monitor: Given daily exposures by variant, compute chi-square and flag alerts when p < 0.01.
- Event audit: For a week of logs, identify duplicate conversions and propose a dedupe key. Quantify impact on CR.
- Invariant balance dashboard: Track device, geo, and new/returning mix by variant during a live test.
- Timezone harmonizer: Take mixed-timezone data and produce a UTC-aligned dataset with a reproducible recipe.
Learning path
- Start: Data Quality Checks (this page)
- Next: Metric design and guardrails
- Then: Power, sample size, and duration
- Later: Interpreting results, lifts, and confidence intervals
Who this is for
- Data Analysts working with product, growth, or marketing experiments.
- Engineers and PMs who need reliable experiment readouts.
- Anyone running A/B tests who wants decisions backed by clean data.
Prerequisites
- Basic statistics (proportions, chi-square or z-tests).
- Comfort with event/metric definitions.
- Familiarity with your analytics stack and IDs (user_id, session_id).
Mini challenge
You see Variant B has 7% fewer pageviews/user but identical conversion rate. SRM is fine. Device mix shows B has 10% more low-end devices. Propose two quality checks and one segmentation you’d run before concluding Variant B is worse for engagement.
- Tip: Think tracking completeness, latency, and invariant balance.
Take the Quick Test
Ready to check your understanding? Take the Quick Test below. Everyone can take it for free; sign in if you want your progress saved.