Why this matters
Data Scientists often run several experiments at the same time. If these collide, your metrics can be biased or diluted, leading to wrong decisions and wasted traffic.
- Prioritization: choose which tests can safely run in parallel.
- Design: pick traffic splits and bucketing that reduce collisions.
- Inference: adjust for multiple comparisons to avoid false wins.
- Operations: set guardrails to catch harmful interactions quickly.
Concept explained simply
When two or more experiments run together, they can influence each other. This is called interaction or interference. Example: a pricing test and a notification test both try to change the same user behavior (purchase). The effect you measure for one test may be inflated, reduced, or even reversed because of the other.
Mental model
Think of your product as a marketplace of limited attention and supply. Experiments compete for the same users, impressions, sessions, or supply (e.g., recommendations, ad slots). Interactions occur when experiments share any of these resources.
Common interaction channels
- Sample overlap: the same user/session can be in multiple tests.
- Surface overlap: tests change the same page, slot, or component.
- Saturation/spillover: one user's treatment affects others (network effects).
- Carryover: earlier exposures affect later behavior (e.g., novelty, learning).
- Shared systems: ranking, recommendations, auctions, caches.
Key risks and when they appear
- Contamination: A user sees both treatments; effects mix.
- Collision: Two experiments try to fill the same UI slot.
- Suppression/cannibalization: One experiment steals conversions from another.
- Metric drift: A shared system (like search ranking) shifts baselines globally.
- Multiplicity: Many tests inflate Type I error (false positives).
Planning: parallel vs sequential
- Run sequentially when: high interaction risk, safety-critical changes, or global systems (ranking/auction) are involved.
- Run in parallel when: disjoint audiences or surfaces, or when using strong isolation (mutually exclusive groups, MEGs) and guardrails.
Quick decision flow
- Do tests share users or surfaces? If yes, risk is medium-high.
- Does one test affect allocation/supply for the other (ranking, inventory)? If yes, run sequentially or use strong isolation.
- Is interaction itself of interest? If yes, prefer a factorial design.
Designing for minimal interaction
Mutually Exclusive Groups (MEGs)
Partition users into non-overlapping buckets. Each experiment only uses its assigned MEG. This greatly reduces interference at the user level.
- Trade-off: Each experiment gets less traffic, increasing runtime.
- Tip: Keep a small global holdout/unused MEG for emergencies or future priority tests.
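In practice, MEG assignment is usually a salted hash of a stable ID mapped to fixed bucket ranges, with a separate salt for variant assignment inside each experiment. A minimal sketch; the salts, bucket ranges, and group names below are illustrative:

```python
import hashlib

# Illustrative MEG layout: bucket ranges out of 100 (the holdout gets 34 buckets).
MEGS = {"meg1_checkout": (0, 33), "meg2_banner": (33, 66), "meg3_holdout": (66, 100)}

def bucket(unit_id: str, salt: str, n_buckets: int = 100) -> int:
    """Map an ID to a stable bucket in [0, n_buckets) via a salted hash."""
    digest = hashlib.sha256(f"{salt}:{unit_id}".encode()).hexdigest()
    return int(digest, 16) % n_buckets

def assign_meg(user_id: str) -> str:
    """Return the mutually exclusive group that owns this user."""
    b = bucket(user_id, salt="meg-layout-v1")
    for name, (lo, hi) in MEGS.items():
        if lo <= b < hi:
            return name
    return "unassigned"  # unreachable while the ranges cover 0-99

def assign_variant(user_id: str, experiment_salt: str) -> str:
    """Within a MEG, an independent salt splits users 50/50 into variants."""
    return "treatment" if bucket(user_id, salt=experiment_salt, n_buckets=2) else "control"

print(assign_meg("user_12345"), assign_variant("user_12345", "checkout-flow-v2"))
```

Using one salt for the MEG layout and a different salt per experiment keeps variant assignment independent of group assignment, so the same bucketing code serves both purposes.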
Choose the right bucketing unit
- User-level: stable across sessions; best for most product experiments.
- Session-level: avoids carryover but can fragment the user experience.
- Cluster-level: for network effects (e.g., geo, org, school, household).
Factorial (e.g., 2×2) vs independent tests
- Use factorial when you want to estimate interaction explicitly (A, B, and A×B).
- Use independent MEGs when you want to avoid interactions and keep analyses simpler.
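To make the factorial option concrete, here is a sketch of estimating both main effects and the interaction with OLS on simulated user-level data; the column names and effect sizes are made up for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 20_000
df = pd.DataFrame({
    "a": rng.integers(0, 2, n),  # 1 if the user is in treatment of experiment A
    "b": rng.integers(0, 2, n),  # 1 if the user is in treatment of experiment B
})
# Simulated outcome: each change helps on its own, but together they partly cancel.
df["y"] = (0.10 + 0.02 * df["a"] + 0.03 * df["b"]
           - 0.015 * df["a"] * df["b"] + rng.normal(0, 0.3, n))

# "a * b" expands to a + b + a:b, so the a:b coefficient is the interaction.
model = smf.ols("y ~ a * b", data=df).fit(cov_type="HC1")
print(model.summary().tables[1])
```

The coefficient on a:b is the interaction term; a significant negative value would indicate that the two treatments partly cancel each other.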
Geo/cluster experiments
When spillovers are likely (social, delivery time, marketplace), randomize clusters (cities, orgs). Analyze with cluster-aware methods.
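A sketch of a cluster-aware analysis, assuming one row per user, a geo column identifying the randomization unit, and treatment assigned per geo (everything below is simulated):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_geos, users_per_geo = 40, 500
geo_shock = rng.normal(0, 0.5, n_geos)      # shared shock within each geo
treated_geo = rng.integers(0, 2, n_geos)    # randomization happens at the geo level

rows = []
for g in range(n_geos):
    y = 1.0 + 0.1 * treated_geo[g] + geo_shock[g] + rng.normal(0, 1, users_per_geo)
    rows.append(pd.DataFrame({"geo": g, "treated": treated_geo[g], "y": y}))
df = pd.concat(rows, ignore_index=True)

# Cluster-robust standard errors: inference at the unit of randomization.
model = smf.ols("y ~ treated", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["geo"]}
)
print(model.summary().tables[1])
```

The key point is that standard errors are clustered at the unit of randomization; treating users as if they were independently randomized would understate uncertainty.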
Holdouts and ghost/virtual slots
- Global holdout: a small portion of traffic untouched by major changes to track baseline drift.
- Ghost experiments: simulate allocation without user exposure to measure supply-side impact safely.
Guardrail metrics
- Reliability: latency, error rate, crash rate.
- Business: cannibalization (channel mix), long-term retention.
- Ethics: user trust, complaint rate.
Statistical considerations with multiple experiments
- Multiple comparisons: control false discoveries across many tests using Benjamini–Hochberg (FDR) or Holm–Bonferroni (FWER).
- Sequential looks: if you peek, use proper sequential methods; otherwise expect inflated Type I error.
- Power budget: splitting traffic across MEGs reduces power; compensate by running longer, using variance reduction (e.g., CUPED; see the sketch after this list), or accepting a larger MDE.
- Reporting: declare primary metrics and analysis plan before launch to avoid p-hacking.
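The CUPED adjustment mentioned above takes only a few lines once a pre-exposure covariate is available. A minimal sketch, assuming a pre-period metric x per user; the synthetic data and coefficients are illustrative:

```python
import numpy as np

def cuped_adjust(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Return y - theta * (x - mean(x)); x must be measured before exposure."""
    theta = np.cov(y, x)[0, 1] / np.var(x)
    return y - theta * (x - x.mean())

# Synthetic illustration: the adjustment cuts variance while leaving the mean unchanged.
rng = np.random.default_rng(2)
x = rng.normal(50, 10, 100_000)            # pre-period metric (e.g., prior spend)
y = 0.8 * x + rng.normal(0, 5, 100_000)    # in-experiment metric
y_adj = cuped_adjust(y, x)
print(f"variance reduction: {y.var() / y_adj.var():.1f}x")
```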
FDR vs FWER: quick guide
- Control FDR (e.g., Benjamini–Hochberg) when running many exploratory tests; allows more discoveries with some tolerated false positives.
- Control FWER (e.g., Holm–Bonferroni) for few, high-stakes tests; stricter, fewer false positives.
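Both procedures are available in statsmodels. A sketch with illustrative p-values from five concurrent tests:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Illustrative p-values from five concurrent experiments' primary metrics.
pvals = np.array([0.004, 0.011, 0.019, 0.032, 0.30])

# FDR control (Benjamini-Hochberg): tolerant, suited to many exploratory tests.
reject_bh, _, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
# FWER control (Holm-Bonferroni): stricter, suited to few high-stakes tests.
reject_holm, _, _, _ = multipletests(pvals, alpha=0.05, method="holm")

for p, bh, holm in zip(pvals, reject_bh, reject_holm):
    print(f"p={p:.3f}  reject under BH={bh}  under Holm={holm}")
```

With these inputs, BH rejects more hypotheses than Holm, which is the usual pattern: FDR control trades a few expected false positives for extra discoveries.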
Worked examples
Example 1: Checkout flow and free-shipping banner
Two tests: A) new checkout flow, site-wide; B) free-shipping banner on cart/checkout.
- Risk: same users and same surfaces; strong collision.
- Design: create 3 MEGs: MEG1 for A, MEG2 for B, MEG3 reserved/holdout.
- Traffic: If equally split, each experiment gets ~33% of users; with 50/50 variants, each variant gets ~16.7%.
Sample size implication
If each variant needs 200k users and total daily traffic is 2M users, then per variant: 2M × 16.7% ≈ 334k/day. The sample size is reached in under a day, but extend the run for seasonality and stability.
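The same arithmetic as a quick check:

```python
daily_traffic = 2_000_000       # total eligible users per day
meg_share = 1 / 3               # each of the three MEGs owns ~33% of users
variant_share = 0.5             # 50/50 split inside the MEG
required_per_variant = 200_000  # from the power calculation

per_variant_per_day = daily_traffic * meg_share * variant_share
days_needed = required_per_variant / per_variant_per_day
print(f"{per_variant_per_day:,.0f} users/variant/day -> {days_needed:.1f} days to reach sample size")
# Assumes mostly new users arrive each day; repeat visitors accumulate more slowly.
# Even so, run at least one or two full weeks to cover weekly seasonality.
```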
Example 2: Notification send-time vs pricing test
Pricing changes overall demand; notifications change timing and frequency of sessions.
- Risk: cross-test effects on purchase propensity and session volume.
- Design options: run sequentially (pricing first), or cluster by country/time-of-day and stagger rollouts.
- Guardrails: revenue per user, notification opt-out rate, session count.
Recommended plan
Run pricing test first (short ramp with strict guardrails). Then notification test. If parallel is required, allocate MEGs by user and cap notification send volume equally across MEGs to avoid saturation differences.
Example 3: Ranking change and ads auction experiment
Search ranking affects which items and ads appear, altering the auction landscape.
- Risk: global baseline shift; ads experiment effects become confounded.
- Design: run ranking test in a dedicated MEG with a global holdout; use ghost ads in ranking test to measure supply changes without user exposure.
- Decision: prefer sequential rollout unless you can simulate and isolate ad allocation.
Analysis tip
Track drift in the global holdout. If holdout shows shifts during the ranking test, adjust your ad experiment timing or rebase metrics post-ranking rollout.
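One simple drift screen is a two-sample comparison of the holdout metric before vs during the ranking rollout. A sketch on simulated daily aggregates; real values would come from your metrics store:

```python
import numpy as np
from scipy import stats

# Illustrative daily revenue-per-user in the global holdout, before and during
# the ranking rollout.
rng = np.random.default_rng(3)
before = rng.normal(1.00, 0.03, 28)
during = rng.normal(1.04, 0.03, 28)

# Welch's t-test on daily aggregates: a coarse but simple drift screen.
t_stat, p_value = stats.ttest_ind(during, before, equal_var=False)
shift_pct = 100 * (during.mean() / before.mean() - 1)
print(f"holdout shift: {shift_pct:+.1f}%  (p = {p_value:.4f})")
if p_value < 0.05:
    print("Baseline drift detected: delay or rebase the ads experiment.")
```

Daily aggregates with Welch's t-test are a coarse screen; for noisy metrics, use a longer pre-period or a sequential monitor.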
Practical checklist
- [ ] Define primary metric(s) for each test and guardrails shared across tests.
- [ ] Map surfaces and resources each test touches (pages, slots, systems).
- [ ] Choose bucketing unit (user/session/cluster) and confirm stability.
- [ ] Decide parallel vs sequential; if parallel, assign MEGs.
- [ ] Allocate traffic and estimate runtime under power constraints.
- [ ] Plan multiplicity control (FDR or FWER) and pre-register analysis.
- [ ] Set monitoring: daily sanity checks and alert thresholds for guardrails.
- [ ] Keep a small global holdout to detect baseline drift.
Exercises
Do Exercise 1 below. Then compare with the solution.
Common mistakes and self-check
- Mistake: Assuming MEGs remove all interactions. Self-check: Do tests still share supply (ads, recs, caches)? If yes, risk remains.
- Mistake: Peeking daily without correction. Self-check: Are you using sequential methods or fixed horizon? If not, adjust.
- Mistake: Overlapping surfaces in parallel tests. Self-check: Draw a page-slot map and mark ownership per experiment.
- Mistake: Ignoring network spillovers. Self-check: Could treated users affect untreated ones (social, delivery times)? Consider cluster randomization.
- Mistake: No multiplicity plan. Self-check: Document FDR/FWER approach before launch.
Mini challenge
You have three proposed tests next month: (1) Homepage hero redesign, (2) Search ranking tweak, (3) Email subject line test. Create a one-paragraph plan stating which run in parallel, which are sequential, the bucketing unit, MEG allocation, and the multiplicity method. Keep it under 120 words.
Who this is for
- Data Scientists and Analysts designing or interpreting experiments.
- Product Managers planning roadmaps with concurrent tests.
- Engineers owning experimentation platforms.
Prerequisites
- Basic A/B testing: randomization, control vs treatment, sample size, MDE.
- Understanding of primary/secondary metrics and variance reduction basics.
Learning path
- Review A/B fundamentals and metric design.
- Learn bucketing and assignment stability.
- Study interference types: contamination, spillover, carryover.
- Practice MEGs, factorial designs, and cluster experiments.
- Apply multiplicity control (BH, Holm) and sequential monitoring.
- Build an operational playbook: registry, guardrails, ramp plans.
Practical projects
- Create an experiment registry template with fields for surfaces, MEG, guardrails, and multiplicity plan.
- Simulate two interacting tests in a spreadsheet (or with the Python starter after this list): show bias from overlap and the improvement from MEGs.
- Design a 2×2 factorial for banner × pricing, including analysis plan for main effects and interaction.
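For the simulation project, a Python starter; the effect sizes and the negative interaction are invented, and the point is the gap between the naive overlapping estimate and the MEG estimate:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 400_000
BASE, LIFT_A, LIFT_B, INTERACTION = 0.10, 0.020, 0.030, -0.025  # made-up effects

def conversion(a, b):
    """Bernoulli conversion with a negative A x B interaction."""
    p = BASE + LIFT_A * a + LIFT_B * b + INTERACTION * a * b
    return (rng.random(len(a)) < p).astype(int)

# Overlapping design: any user can be in both tests at once.
a = rng.integers(0, 2, N)
b = rng.integers(0, 2, N)
y = conversion(a, b)
overlap_est = y[a == 1].mean() - y[a == 0].mean()

# MEG design: each user belongs to exactly one test; the other flag stays 0.
meg_a = rng.integers(0, 2, N).astype(bool)
a2 = np.where(meg_a, rng.integers(0, 2, N), 0)
b2 = np.where(~meg_a, rng.integers(0, 2, N), 0)
y2 = conversion(a2, b2)
meg_est = y2[meg_a & (a2 == 1)].mean() - y2[meg_a & (a2 == 0)].mean()

print(f"true effect of A alone : {LIFT_A:.3f}")
print(f"overlapping estimate   : {overlap_est:.3f}  (diluted by the B interaction)")
print(f"MEG estimate           : {meg_est:.3f}")
```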
Next steps
- Draft a parallel testing policy for your team (when allowed, how isolated, required guardrails).
- Set up a small always-on global holdout.
- Standardize FDR control for weekly releases.