
Experiment Planning

Learn Experiment Planning for free with explanations, exercises, and a quick test (for Applied Scientists).

Published: January 7, 2026 | Updated: January 7, 2026

Who this is for

This lesson is for Applied Scientists and ML practitioners who need to turn ideas into safe, measurable experiments. You will plan offline evaluations, online A/B tests, and hybrid rollouts.

  • Early-career Applied Scientists designing their first product experiments
  • Data Scientists moving from analysis to decision-focused experimentation
  • Engineers and PMs collaborating on ML model launches

Prerequisites

  • Basic statistics: probability, confidence intervals, p-values
  • Familiarity with classification/ranking metrics (e.g., precision, AUC, NDCG)
  • Comfort with Python/R or SQL for analysis

Why this matters

As an Applied Scientist, your models must improve real outcomes, not just offline metrics. Strong experiment planning helps you:

  • Turn vague ideas into testable hypotheses
  • Pick the right design: offline, online A/B, interleaving, switchback, or quasi-experiments
  • Define success metrics and guardrails to protect user experience and revenue
  • Estimate sample size and duration before writing code
  • Make confident decisions and avoid costly false launches

Concept explained simply

An experiment is a structured way to answer one question: Did the change cause an improvement? Planning ensures you know what to measure, how to measure it, and when to stop.

Mental model

  • Idea to Decision pipeline: Idea → Hypothesis → Variables → Design → Analysis Plan → Decision
  • Five Ws + H: Why (goal), What (treatment), Who (unit), Where (scope), When (duration), How (metrics and analysis)
  • Risk-first thinking: Define guardrails so you know when to stop or rollback

Key components of an experiment plan

  1. Problem statement: What decision will this experiment inform?
  2. Hypotheses: Null (H0: no change) and Alternative (H1: improvement). Example H1: the new model increases add-to-cart rate.
  3. Design:
    • Offline: holdout set, cross-validation, replay/shadow modes
    • Online: A/B test, interleaving (ranking), switchback (time-based), cluster randomization (stores, cities)
  4. Unit of randomization: user, session, device, store, city. Avoid interference across units.
  5. Primary success metric and guardrails:
    • Primary: the one metric used for the launch decision
    • Guardrails: safety metrics (e.g., bounce rate, latency, support tickets)
    • Define precisely: numerator, denominator, window, inclusion rules
  6. MDE and sample size: Choose a minimum detectable effect (MDE) that matters to the business. Estimate per-group n and expected duration.
  7. Assignment and ramp: 1% → 10% → 50% → 100% with SRM (sample ratio mismatch) checks at each stage (see the SRM check sketch after this list).
  8. Data collection plan: exposures, variants, timestamps, identifiers, metrics, logging health checks.
  9. Analysis plan: statistical test, outlier policy, heterogeneity cuts, stopping rules, multiple testing policy.
  10. Risk, ethics, and rollback criteria: define before launch.
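
The SRM check from step 7 can be scripted as a chi-square goodness-of-fit test of observed assignment counts against the intended split. A minimal sketch follows; the counts and the 0.001 alert threshold are illustrative assumptions.

```python
# Minimal SRM (sample ratio mismatch) check: chi-square goodness-of-fit test of
# observed assignment counts against the configured split. Counts are made up.
from scipy.stats import chisquare

observed = [50_421, 49_399]            # users actually assigned to control, treatment
intended_split = [0.5, 0.5]            # the split you configured
expected = [sum(observed) * share for share in intended_split]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
if p_value < 0.001:                    # strict threshold: SRM alarms should be rare but loud
    print(f"Possible SRM (p = {p_value:.2e}): pause the ramp and debug assignment/logging")
else:
    print(f"No SRM detected (p = {p_value:.3f})")
```
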
Power/MDE quick guide
  • Two-sided alpha = 0.05 and power = 0.8 are common defaults
  • Rule-of-thumb for proportions: n per group ≈ 16 · p(1−p) / Δ², where p is the baseline rate and Δ is the absolute MDE
  • For continuous metrics, variance matters: n per group ≈ 2 · (z_{α/2} + z_β)² · σ² / Δ², where σ is the metric's standard deviation
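
Both rules of thumb are easy to script. Here is a minimal sketch using exact z-values from scipy; the example numbers mirror Exercise 1 below.

```python
# Per-group sample size for a two-sided test at the given alpha and power.
from scipy.stats import norm

def n_per_group_proportion(p_baseline, delta_abs, alpha=0.05, power=0.8):
    """Approximate n per group for a proportion metric (normal approximation)."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    # 2·z² ≈ 15.7 at the defaults, which is where the "16" shortcut comes from.
    return 2 * z**2 * p_baseline * (1 - p_baseline) / delta_abs**2

def n_per_group_continuous(sigma, delta_abs, alpha=0.05, power=0.8):
    """Approximate n per group for a continuous metric with standard deviation sigma."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 2 * z**2 * sigma**2 / delta_abs**2

# 8% baseline rate, +0.4 pp absolute MDE (Exercise 1): ≈ 72,200 with exact z-values,
# versus ≈ 73,600 from the 16·p(1−p)/Δ² shortcut.
print(round(n_per_group_proportion(0.08, 0.004)))
```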

Worked examples

Example 1: Search ranking model

  • Hypothesis: New ranker increases session CTR.
  • Design: Offline evaluation with NDCG@10 and error analysis. Then online interleaving for faster sensitivity, followed by A/B test for business impact.
  • Unit: user
  • Primary metric: session CTR; Guardrails: latency p95, bounce rate
  • MDE: 1% relative CTR lift; Duration: compute via baseline CTR and traffic
  • Analysis: difference-in-means on user-level session CTR; heterogeneity by device and country
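
A sketch of that analysis step, assuming user-level data with hypothetical columns variant and session_ctr:

```python
# Difference-in-means on user-level session CTR, with Welch's t-test for the p-value.
# Assumes a pandas DataFrame with hypothetical columns: user_id, variant, session_ctr.
import pandas as pd
from scipy.stats import ttest_ind

def ctr_readout(df: pd.DataFrame) -> dict:
    control = df.loc[df["variant"] == "control", "session_ctr"]
    treatment = df.loc[df["variant"] == "treatment", "session_ctr"]
    abs_lift = treatment.mean() - control.mean()
    _, p_value = ttest_ind(treatment, control, equal_var=False)  # Welch's t-test
    return {"abs_lift": abs_lift, "rel_lift": abs_lift / control.mean(), "p_value": p_value}

# Heterogeneity cuts: repeat the readout per device or country, e.g.
# df.groupby("device").apply(ctr_readout)
```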

Example 2: Fraud model threshold change

  • Hypothesis: Higher threshold reduces false declines without raising fraud rate.
  • Design: Shadow mode for a week to collect decisions and outcomes; then A/B
  • Unit: transaction
  • Primary metric: false decline rate; Guardrails: chargeback rate, manual review load
  • MDE: absolute −0.2 pp in false declines
  • Analysis: cost-weighted impact = savings − losses; sequential monitoring discouraged unless pre-specified
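
The cost-weighted impact in the analysis step is plain arithmetic once per-unit economics are agreed on; the unit values below are illustrative assumptions.

```python
# Cost-weighted impact of the threshold change: margin recovered from prevented
# false declines minus losses from any additional fraud. All unit values are made up.
def cost_weighted_impact(prevented_false_declines, extra_fraud_txns,
                         avg_order_value=60.0, margin=0.15, avg_fraud_loss=80.0):
    savings = prevented_false_declines * avg_order_value * margin
    losses = extra_fraud_txns * avg_fraud_loss
    return savings - losses

# 1,200 recovered orders vs 50 extra fraudulent transactions: 10,800 - 4,000 = +6,800
print(cost_weighted_impact(prevented_false_declines=1_200, extra_fraud_txns=50))
```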

Example 3: Onboarding recommendations

  • Hypothesis: Personalized onboarding increases 7-day retention.
  • Design: A/B test with holdback; consider switchback if strong day-of-week effects
  • Unit: new user
  • Primary metric: 7-day retention; Guardrails: time-to-first-value, support contacts per 1k users
  • MDE: 3% relative lift; Duration: multiple weeks to observe retention window
  • Analysis: stratify by acquisition channel; check SRM and event logging completeness
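
A sketch of the stratified readout, assuming a per-user table with hypothetical columns channel, variant, and retained_7d:

```python
# 7-day retention lift stratified by acquisition channel.
# Assumes hypothetical columns: user_id, channel, variant, retained_7d (0 or 1).
import pandas as pd

def retention_by_channel(df: pd.DataFrame) -> pd.DataFrame:
    rates = (df.groupby(["channel", "variant"])["retained_7d"]
               .mean()
               .unstack("variant"))
    rates["abs_lift"] = rates["treatment"] - rates["control"]
    rates["rel_lift"] = rates["abs_lift"] / rates["control"]
    return rates

# Pair this with an SRM check per channel and a logging-completeness check
# (e.g. share of exposed users with at least one onboarding event recorded).
```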

Exercises

Try these. Compare your work with the solutions, then use the checklist to self-review.

Exercise 1: Sample size for a conversion test

Baseline add-to-cart rate is 8%. You need to detect a 5% relative lift (absolute +0.4 pp). Two-sided alpha 0.05, power 0.8. Estimate the per-group sample size using the rule-of-thumb for proportions.

Hints
  • Absolute MDE Δ = 0.004
  • Use n ≈ 16 · p(1−p) / Δ²

Solution

p = 0.08, Δ = 0.004. Compute p(1−p) = 0.0736. n ≈ 16 · 0.0736 / 0.000016 = 1.1776 / 0.000016 ≈ 73,600 users per group (≈147,200 total).
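
As a cross-check, statsmodels' power calculator gives a similar answer; it uses an arcsine effect size, so it differs slightly from the shortcut.

```python
# Cross-check the rule-of-thumb with statsmodels' normal-approximation power solver.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.084, 0.08)      # 8% baseline vs 8.4% (+0.4 pp)
n = NormalIndPower().solve_power(effect_size=effect, alpha=0.05,
                                 power=0.8, alternative="two-sided")
print(round(n))   # ≈ 74,000 per group, in the same ballpark as the ≈ 73,600 shortcut
```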

Exercise 2: Choose the right design

You are testing a pricing algorithm across 50 cities. Users travel between nearby cities, and weekdays vs weekends differ a lot. Propose an experiment design, unit of randomization, guardrails, and a reasonable duration.

Hints
  • Avoid spillovers across users who might cross city boundaries
  • Control for day-of-week effects

Solution

Use cluster randomization at the city level with a switchback schedule (city-week or city-day). For example, half the cities start as treatment and half as control, then swap weekly. Guardrails: order cancellation rate, ETA accuracy, customer support contacts per 1k trips, and driver acceptance rate. Duration: 4–6 weeks to cover multiple weekly cycles and reduce variance.
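
One way to generate the city-week switchback schedule from this solution (the city names, seed, and 6-week horizon are placeholders):

```python
# City-week switchback: half the cities start in treatment, half in control,
# and every city flips arm each week. City list, seed, and horizon are placeholders.
import random

def switchback_schedule(cities, n_weeks=6, seed=42):
    rng = random.Random(seed)
    cities = list(cities)
    rng.shuffle(cities)
    starts_treated = set(cities[: len(cities) // 2])
    schedule = {}
    for week in range(1, n_weeks + 1):
        for city in cities:
            treated_at_start = city in starts_treated
            treated_now = treated_at_start if week % 2 == 1 else not treated_at_start
            schedule[(city, week)] = "treatment" if treated_now else "control"
    return schedule

plan = switchback_schedule([f"city_{i:02d}" for i in range(50)])
print(plan[("city_00", 1)], plan[("city_00", 2)])   # each city alternates week to week
```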

Self-checklist

  • Did you state a clear hypothesis and the exact primary metric?
  • Is the unit of randomization chosen to minimize interference?
  • Are guardrails defined with formulas and thresholds?
  • Is MDE realistic given traffic and business value?
  • Is the analysis plan fixed before looking at results?
  • Do you have a ramp and rollback plan?

Common mistakes and how to self-check

  • Vague metrics: Fix by writing exact numerators/denominators and time windows.
  • Ignoring interference: Choose clusters or switchbacks when users interact.
  • No pre-specified stopping rule: Pre-register duration and decision thresholds.
  • Overly small MDE: Align with business value and traffic; run power analysis.
  • Multiple peeks without correction: Limit interim looks, or pre-specify an alpha-spending plan.
  • SRM overlooked: Monitor variant counts; if SRM is significant, pause and debug.
  • Logging gaps: Add health checks and a plan to exclude affected windows.

Practical projects

  1. Offline to Online Pipeline
    • Pick a public dataset or your historical logs
    • Train a baseline and a new model; compare with cross-validation
    • Write a one-page experiment plan: hypotheses, metrics, MDE, analysis
    • Simulate an A/B by bootstrapping users to estimate duration
  2. Guardrail Dashboard
    • Define 3 guardrails relevant to your product
    • Create daily trend charts with alert thresholds
    • Document rollback triggers and who to page
  3. Interleaving Sandbox (Ranking)
    • Implement balanced interleaving for two rankers
    • Simulate clicks and measure preference
    • Compare sensitivity vs A/B on the same synthetic data
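
For project 3, here is a minimal interleaving sketch. It uses team-draft interleaving (a close cousin of balanced interleaving with a simpler click-attribution rule), and the rankings and clicks are synthetic.

```python
# Team-draft interleaving of two rankings: each round a coin flip decides which
# ranker picks first, and every displayed document is credited to the ranker
# that contributed it. Clicks on a ranker's documents count as wins for it.
import random

def team_draft_interleave(ranking_a, ranking_b, k=10, seed=0):
    rng = random.Random(seed)
    interleaved, team = [], {}                 # team: doc -> "A" or "B"
    ia = ib = 0
    while len(interleaved) < k and (ia < len(ranking_a) or ib < len(ranking_b)):
        order = ("A", "B") if rng.random() < 0.5 else ("B", "A")
        for side in order:
            if len(interleaved) >= k:
                break
            ranking, idx = (ranking_a, ia) if side == "A" else (ranking_b, ib)
            while idx < len(ranking) and ranking[idx] in team:
                idx += 1                       # skip documents already shown
            if idx < len(ranking):
                doc = ranking[idx]
                interleaved.append(doc)
                team[doc] = side
                idx += 1
            if side == "A":
                ia = idx
            else:
                ib = idx
    return interleaved, team

def preference(clicked_docs, team):
    wins_a = sum(team.get(doc) == "A" for doc in clicked_docs)
    wins_b = sum(team.get(doc) == "B" for doc in clicked_docs)
    return wins_a - wins_b                     # > 0 favors ranker A, < 0 favors B

shown, team = team_draft_interleave(["d1", "d2", "d3", "d4"], ["d3", "d1", "d5", "d6"])
print(shown, preference(["d3", "d5"], team))
```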

Learning path

  • Day 1–2: Principles of hypotheses, metrics, guardrails
  • Day 3–4: Power/MDE, duration estimates, SRM checks
  • Day 5–6: Designs (A/B, interleaving, switchback, clusters)
  • Day 7: Write a complete experiment plan and review with a peer

Mini challenge

You plan to test a new recommendation diversity boost.

  • Write a one-sentence hypothesis
  • Pick a primary metric and two guardrails
  • Choose a design and unit
  • Set a realistic MDE
  • State a rollout and rollback plan

Example answer

Hypothesis: Diversified ranking increases session GMV by 2% without hurting CTR or latency. Design: A/B at user level. Primary metric: session GMV per user. Guardrails: CTR (no worse than −1%), p95 latency (no worse than +20 ms). MDE: 1.5–2% relative GMV. Ramp: 1% → 10% → 50% → 100%, rolling back if any guardrail is breached.

Next steps

  • Draft a one-page plan for your next model change and review it with your PM/engineer
  • Set up automatic SRM and logging health checks
  • Run the quick test below to confirm understanding

Experiment Planning — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
