Why this matters
Data Scientists often run several experiments at the same time. If these collide, your metrics can be biased or diluted, leading to wrong decisions and wasted traffic.
- Prioritization: choose which tests can safely run in parallel.
- Design: pick traffic splits and bucketing that reduce collisions.
- Inference: adjust for multiple comparisons to avoid false wins.
- Operations: set guardrails to catch harmful interactions quickly.
Concept explained simply
When two or more experiments run together, they can influence each other. This is called interaction or interference. Example: a pricing test and a notification test both try to change the same user behavior (purchase). The effect you measure for one test may be inflated, reduced, or even reversed because of the other.
Mental model
Think of your product as a marketplace of limited attention and supply. Experiments compete for the same users, impressions, sessions, or supply (e.g., recommendations, ad slots). Interactions occur when experiments share any of these resources.
Common interaction channels
- Sample overlap: the same user/session can be in multiple tests.
- Surface overlap: tests change the same page, slot, or component.
- Saturation/spillover: one user's treatment affects others (network effects).
- Carryover: earlier exposures affect later behavior (e.g., novelty, learning).
- Shared systems: ranking, recommendations, auctions, caches.
Key risks and when they appear
- Contamination: A user sees both treatments; effects mix.
- Collision: Two experiments try to fill the same UI slot.
- Suppression/cannibalization: One experiment steals conversions from another.
- Metric drift: A shared system (like search ranking) shifts baselines globally.
- Multiplicity: Many tests inflate Type I error (false positives).
Planning: parallel vs sequential
- Run sequentially when: high interaction risk, safety-critical changes, or global systems (ranking/auction) are involved.
- Run in parallel when: disjoint audiences or surfaces, or when using strong isolation (mutually exclusive groups, MEGs) and guardrails.
Quick decision flow
- Do tests share users or surfaces? If yes, risk is medium-high.
- Does one test affect allocation/supply for the other (ranking, inventory)? If yes, run sequentially or use strong isolation.
- Is interaction itself of interest? If yes, prefer a factorial design.
Designing for minimal interaction
Mutually Exclusive Groups (MEGs)
Partition users into non-overlapping buckets. Each experiment only uses its assigned MEG. This greatly reduces interference at the user level.
- Trade-off: Each experiment gets less traffic, increasing runtime.
- Tip: Keep a small global holdout/unused MEG for emergencies or future priority tests.
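In practice, MEG assignment is usually a salted hash of a stable ID mapped to fixed bucket ranges, with a separate salt for variant assignment inside each experiment. A minimal sketch; the salts, bucket ranges, and group names below are illustrative:

```python
import hashlib

# Illustrative MEG layout: bucket ranges out of 100 (the holdout gets 34 buckets).
MEGS = {"meg1_checkout": (0, 33), "meg2_banner": (33, 66), "meg3_holdout": (66, 100)}

def bucket(unit_id: str, salt: str, n_buckets: int = 100) -> int:
    """Map an ID to a stable bucket in [0, n_buckets) via a salted hash."""
    digest = hashlib.sha256(f"{salt}:{unit_id}".encode()).hexdigest()
    return int(digest, 16) % n_buckets

def assign_meg(user_id: str) -> str:
    """Return the mutually exclusive group that owns this user."""
    b = bucket(user_id, salt="meg-layout-v1")
    for name, (lo, hi) in MEGS.items():
        if lo <= b < hi:
            return name
    return "unassigned"  # unreachable while the ranges cover 0-99

def assign_variant(user_id: str, experiment_salt: str) -> str:
    """Within a MEG, an independent salt splits users 50/50 into variants."""
    return "treatment" if bucket(user_id, salt=experiment_salt, n_buckets=2) else "control"

print(assign_meg("user_12345"), assign_variant("user_12345", "checkout-flow-v2"))
```

Using one salt for the MEG layout and a different salt per experiment keeps variant assignment independent of group assignment, so the same bucketing code serves both purposes.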
Choose the right bucketing unit
- User-level: stable across sessions; best for most product experiments.
- Session-level: avoids carryover but can fragment the user experience.
- Cluster-level: for network effects (e.g., geo, org, school, household).
Factorial (e.g., 2×2) vs independent tests
- Use factorial when you want to estimate interaction explicitly (A, B, and A×B).
- Use independent MEGs when you want to avoid interactions and keep analyses simpler.
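To make the factorial option concrete, here is a sketch of estimating both main effects and the interaction with OLS on simulated user-level data; the column names and effect sizes are made up for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 20_000
df = pd.DataFrame({
    "a": rng.integers(0, 2, n),  # 1 if the user is in treatment of experiment A
    "b": rng.integers(0, 2, n),  # 1 if the user is in treatment of experiment B
})
# Simulated outcome: each change helps on its own, but together they partly cancel.
df["y"] = (0.10 + 0.02 * df["a"] + 0.03 * df["b"]
           - 0.015 * df["a"] * df["b"] + rng.normal(0, 0.3, n))

# "a * b" expands to a + b + a:b, so the a:b coefficient is the interaction.
model = smf.ols("y ~ a * b", data=df).fit(cov_type="HC1")
print(model.summary().tables[1])
```

The coefficient on a:b is the interaction term; a significant negative value would indicate that the two treatments partly cancel each other.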
Geo/cluster experiments
When spillovers are likely (social, delivery time, marketplace), randomize clusters (cities, orgs). Analyze with cluster-aware methods.
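A sketch of a cluster-aware analysis, assuming one row per user, a geo column identifying the randomization unit, and treatment assigned per geo (everything below is simulated):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_geos, users_per_geo = 40, 500
geo_shock = rng.normal(0, 0.5, n_geos)      # shared shock within each geo
treated_geo = rng.integers(0, 2, n_geos)    # randomization happens at the geo level

rows = []
for g in range(n_geos):
    y = 1.0 + 0.1 * treated_geo[g] + geo_shock[g] + rng.normal(0, 1, users_per_geo)
    rows.append(pd.DataFrame({"geo": g, "treated": treated_geo[g], "y": y}))
df = pd.concat(rows, ignore_index=True)

# Cluster-robust standard errors: inference at the unit of randomization.
model = smf.ols("y ~ treated", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["geo"]}
)
print(model.summary().tables[1])
```

The key point is that standard errors are clustered at the unit of randomization; treating users as if they were independently randomized would understate uncertainty.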
Holdouts and ghost/virtual slots
- Global holdout: a small portion of traffic untouched by major changes to track baseline drift.
- Ghost experiments: simulate allocation without user exposure to measure supply-side impact safely.
Guardrail metrics
- Reliability: latency, error rate, crash rate.
- Business: cannibalization (channel mix), long-term retention.
- Ethics: user trust, complaint rate.
Statistical considerations with multiple experiments
- Multiple comparisons: control false discoveries across many tests using Benjamini–Hochberg (FDR) or Holm–Bonferroni (FWER).
- Sequential looks: if you peek, use proper sequential methods; otherwise expect inflated Type I error.
- Power budget: splitting traffic across MEGs reduces power; compensate by running longer, using variance reduction (e.g., CUPED; see the sketch after this list), or accepting a larger MDE.
- Reporting: declare primary metrics and analysis plan before launch to avoid p-hacking.
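The CUPED adjustment mentioned above takes only a few lines once a pre-exposure covariate is available. A minimal sketch, assuming a pre-period metric x per user; the synthetic data and coefficients are illustrative:

```python
import numpy as np

def cuped_adjust(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Return y - theta * (x - mean(x)); x must be measured before exposure."""
    theta = np.cov(y, x)[0, 1] / np.var(x)
    return y - theta * (x - x.mean())

# Synthetic illustration: the adjustment cuts variance while leaving the mean unchanged.
rng = np.random.default_rng(2)
x = rng.normal(50, 10, 100_000)            # pre-period metric (e.g., prior spend)
y = 0.8 * x + rng.normal(0, 5, 100_000)    # in-experiment metric
y_adj = cuped_adjust(y, x)
print(f"variance reduction: {y.var() / y_adj.var():.1f}x")
```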
FDR vs FWER: quick guide
- Control FDR (e.g., Benjamini–Hochberg) when running many exploratory tests; allows more discoveries with some tolerated false positives.
- Control FWER (e.g., Holm–Bonferroni) for few, high-stakes tests; stricter, fewer false positives.
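Both procedures are available in statsmodels. A sketch with illustrative p-values from five concurrent tests:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Illustrative p-values from five concurrent experiments' primary metrics.
pvals = np.array([0.004, 0.011, 0.019, 0.032, 0.30])

# FDR control (Benjamini-Hochberg): tolerant, suited to many exploratory tests.
reject_bh, _, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
# FWER control (Holm-Bonferroni): stricter, suited to few high-stakes tests.
reject_holm, _, _, _ = multipletests(pvals, alpha=0.05, method="holm")

for p, bh, holm in zip(pvals, reject_bh, reject_holm):
    print(f"p={p:.3f}  reject under BH={bh}  under Holm={holm}")
```

With these inputs, BH rejects more hypotheses than Holm, which is the usual pattern: FDR control trades a few expected false positives for extra discoveries.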
Worked examples
Example 1: Checkout flow and free-shipping banner
Two tests: A) new checkout flow, site-wide; B) free-shipping banner on cart/checkout.
- Risk: same users and same surfaces; strong collision.
- Design: create 3 MEGs: MEG1 for A, MEG2 for B, MEG3 reserved/holdout.
- Traffic: If equally split, each experiment gets ~33% of users; with 50/50 variants, each variant gets ~16.7%.
Sample size implication
If each variant needs 200k users and total daily traffic is 2M users, then per variant: 2M × 16.7% ≈ 334k/day. The sample size is reached in under a day, but extend the run for seasonality and stability.
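The same arithmetic as a quick check:

```python
daily_traffic = 2_000_000       # total eligible users per day
meg_share = 1 / 3               # each of the three MEGs owns ~33% of users
variant_share = 0.5             # 50/50 split inside the MEG
required_per_variant = 200_000  # from the power calculation

per_variant_per_day = daily_traffic * meg_share * variant_share
days_needed = required_per_variant / per_variant_per_day
print(f"{per_variant_per_day:,.0f} users/variant/day -> {days_needed:.1f} days to reach sample size")
# Assumes mostly new users arrive each day; repeat visitors accumulate more slowly.
# Even so, run at least one or two full weeks to cover weekly seasonality.
```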
Example 2: Notification send-time vs pricing test
Pricing changes overall demand; notifications change timing and frequency of sessions.
- Risk: cross-test effects on purchase propensity and session volume.
- Design options: run sequentially (pricing first), or cluster by country/time-of-day and stagger rollouts.
- Guardrails: revenue per user, notification opt-out rate, session count.
Recommended plan
Run pricing test first (short ramp with strict guardrails). Then notification test. If parallel is required, allocate MEGs by user and cap notification send volume equally across MEGs to avoid saturation differences.
Example 3: Ranking change and ads auction experiment
Search ranking affects which items and ads appear, altering the auction landscape.
- Risk: global baseline shift; ads experiment effects become confounded.
- Design: run ranking test in a dedicated MEG with a global holdout; use ghost ads in ranking test to measure supply changes without user exposure.
- Decision: prefer sequential rollout unless you can simulate and isolate ad allocation.
Analysis tip
Track drift in the global holdout. If holdout shows shifts during the ranking test, adjust your ad experiment timing or rebase metrics post-ranking rollout.
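One simple drift screen is a two-sample comparison of the holdout metric before vs during the ranking rollout. A sketch on simulated daily aggregates; real values would come from your metrics store:

```python
import numpy as np
from scipy import stats

# Illustrative daily revenue-per-user in the global holdout, before and during
# the ranking rollout.
rng = np.random.default_rng(3)
before = rng.normal(1.00, 0.03, 28)
during = rng.normal(1.04, 0.03, 28)

# Welch's t-test on daily aggregates: a coarse but simple drift screen.
t_stat, p_value = stats.ttest_ind(during, before, equal_var=False)
shift_pct = 100 * (during.mean() / before.mean() - 1)
print(f"holdout shift: {shift_pct:+.1f}%  (p = {p_value:.4f})")
if p_value < 0.05:
    print("Baseline drift detected: delay or rebase the ads experiment.")
```

Daily aggregates with Welch's t-test are a coarse screen; for noisy metrics, use a longer pre-period or a sequential monitor.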
Practical checklist
- [ ] Define primary metric(s) for each test and guardrails shared across tests.
- [ ] Map surfaces and resources each test touches (pages, slots, systems).
- [ ] Choose bucketing unit (user/session/cluster) and confirm stability.
- [ ] Decide parallel vs sequential; if parallel, assign MEGs.
- [ ] Allocate traffic and estimate runtime under power constraints.
- [ ] Plan multiplicity control (FDR or FWER) and pre-register analysis.
- [ ] Set monitoring: daily sanity checks and alert thresholds for guardrails.
- [ ] Keep a small global holdout to detect baseline drift.
Exercises
Do Exercise 1 below. Then compare with the solution.
Common mistakes and self-check
- Mistake: Assuming MEGs remove all interactions. Self-check: Do tests still share supply (ads, recs, caches)? If yes, risk remains.
- Mistake: Peeking daily without correction. Self-check: Are you using sequential methods or fixed horizon? If not, adjust.
- Mistake: Overlapping surfaces in parallel tests. Self-check: Draw a page-slot map and mark ownership per experiment.
- Mistake: Ignoring network spillovers. Self-check: Could treated users affect untreated ones (social, delivery times)? Consider cluster randomization.
- Mistake: No multiplicity plan. Self-check: Document FDR/FWER approach before launch.
Mini challenge
You have three proposed tests next month: (1) Homepage hero redesign, (2) Search ranking tweak, (3) Email subject line test. Create a one-paragraph plan stating which run in parallel, which are sequential, the bucketing unit, MEG allocation, and the multiplicity method. Keep it under 120 words.
Who this is for
- Data Scientists and Analysts designing or interpreting experiments.
- Product Managers planning roadmaps with concurrent tests.
- Engineers owning experimentation platforms.
Prerequisites
- Basic A/B testing: randomization, control vs treatment, sample size, MDE.
- Understanding of primary/secondary metrics and variance reduction basics.
Learning path
- Review A/B fundamentals and metric design.
- Learn bucketing and assignment stability.
- Study interference types: contamination, spillover, carryover.
- Practice MEGs, factorial designs, and cluster experiments.
- Apply multiplicity control (BH, Holm) and sequential monitoring.
- Build an operational playbook: registry, guardrails, ramp plans.
Practical projects
- Create an experiment registry template with fields for surfaces, MEG, guardrails, and multiplicity plan.
- Simulate two interacting tests in a spreadsheet (or with the Python starter after this list): show bias from overlap and the improvement from MEGs.
- Design a 2×2 factorial for banner × pricing, including analysis plan for main effects and interaction.
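For the simulation project, a Python starter; the effect sizes and the negative interaction are invented, and the point is the gap between the naive overlapping estimate and the MEG estimate:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 400_000
BASE, LIFT_A, LIFT_B, INTERACTION = 0.10, 0.020, 0.030, -0.025  # made-up effects

def conversion(a, b):
    """Bernoulli conversion with a negative A x B interaction."""
    p = BASE + LIFT_A * a + LIFT_B * b + INTERACTION * a * b
    return (rng.random(len(a)) < p).astype(int)

# Overlapping design: any user can be in both tests at once.
a = rng.integers(0, 2, N)
b = rng.integers(0, 2, N)
y = conversion(a, b)
overlap_est = y[a == 1].mean() - y[a == 0].mean()

# MEG design: each user belongs to exactly one test; the other flag stays 0.
meg_a = rng.integers(0, 2, N).astype(bool)
a2 = np.where(meg_a, rng.integers(0, 2, N), 0)
b2 = np.where(~meg_a, rng.integers(0, 2, N), 0)
y2 = conversion(a2, b2)
meg_est = y2[meg_a & (a2 == 1)].mean() - y2[meg_a & (a2 == 0)].mean()

print(f"true effect of A alone : {LIFT_A:.3f}")
print(f"overlapping estimate   : {overlap_est:.3f}  (diluted by the B interaction)")
print(f"MEG estimate           : {meg_est:.3f}")
```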
Next steps
- Draft a parallel testing policy for your team (when allowed, how isolated, required guardrails).
- Set up a small always-on global holdout.
- Standardize FDR control for weekly releases.