Who this is for
This lesson is for Applied Scientists and ML practitioners who need to turn ideas into safe, measurable experiments. You will plan offline evaluations, online A/B tests, and hybrid rollouts. It is especially useful for:
- Early-career Applied Scientists designing their first product experiments
- Data Scientists moving from analysis to decision-focused experimentation
- Engineers and PMs collaborating on ML model launches
Prerequisites
- Basic statistics: probability, confidence intervals, p-values
- Familiarity with classification/ranking metrics (e.g., precision, AUC, NDCG)
- Comfort with Python/R or SQL for analysis
Why this matters
As an Applied Scientist, your models must improve real outcomes, not just offline metrics. Strong experiment planning helps you:
- Turn vague ideas into testable hypotheses
- Pick the right design: offline, online A/B, interleaving, switchback, or quasi-experiments
- Define success metrics and guardrails to protect user experience and revenue
- Estimate sample size and duration before writing code
- Make confident decisions and avoid costly false launches
Concept explained simply
An experiment is a structured way to answer one question: Did the change cause an improvement? Planning ensures you know what to measure, how to measure it, and when to stop.
Mental model
- Idea to Decision pipeline: Idea → Hypothesis → Variables → Design → Analysis Plan → Decision
- Five Ws + H: Why (goal), What (treatment), Who (unit), Where (scope), When (duration), How (metrics and analysis)
- Risk-first thinking: Define guardrails so you know when to stop or rollback
Key components of an experiment plan
- Problem statement: What decision will this experiment inform?
- Hypotheses: Null H0 (no change) and Alternative H1 (improvement). Example H1: the new model increases add-to-cart rate.
- Design:
  - Offline: holdout set, cross-validation, replay/shadow modes
  - Online: A/B test, interleaving (ranking), switchback (time-based), cluster randomization (stores, cities)
- Unit of randomization: user, session, device, store, city. Avoid interference across units.
- Primary success metric and guardrails:
  - Primary: the one metric used for the launch decision
  - Guardrails: safety metrics (e.g., bounce rate, latency, support tickets)
  - Define precisely: numerator, denominator, window, inclusion rules
- MDE and sample size: Choose a minimal detectable effect that matters. Estimate per-group n and expected duration.
- Assignment and ramp: 1% → 10% → 50% → 100% with SRM (sample ratio mismatch) checks.
- Data collection plan: exposures, variants, timestamps, identifiers, metrics, logging health checks.
- Analysis plan: statistical test, outlier policy, heterogeneity cuts, stopping rules, multiple testing policy.
- Risk, ethics, and rollback criteria: define before launch.
Power/MDE quick guide
- Two-sided alpha = 0.05, power = 0.8 is a common default
- Rule-of-thumb for proportions: n per group ≈ 16 · p(1−p) / Δ², where p is the baseline rate and Δ is the absolute MDE (the 16 comes from 2 · (Zα/2 + Zβ)² ≈ 15.7 at the defaults above)
- For continuous metrics, variance matters: n per group ≈ 2 · (Zα/2 + Zβ)² · σ² / Δ²
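These rules of thumb are easy to script. Below is a minimal Python sketch, assuming the default two-sided alpha = 0.05 and power = 0.8; the function names and example inputs are illustrative.

```python
# Per-group sample size from the rules of thumb above.
# Assumes two-sided alpha = 0.05 and power = 0.8 (z-values hardcoded).
from math import ceil

Z_ALPHA_2 = 1.96  # two-sided alpha = 0.05
Z_BETA = 0.84     # power = 0.8

def n_per_group_proportion(p, delta):
    """Detect an absolute lift `delta` on a baseline rate `p`."""
    return ceil(2 * (Z_ALPHA_2 + Z_BETA) ** 2 * p * (1 - p) / delta ** 2)

def n_per_group_continuous(sigma, delta):
    """Detect an absolute shift `delta` on a metric with standard deviation `sigma`."""
    return ceil(2 * (Z_ALPHA_2 + Z_BETA) ** 2 * sigma ** 2 / delta ** 2)

# Example: 8% baseline rate, +0.4 pp absolute MDE gives roughly 72,000 per group
# (the "16" shortcut gives about 73,600 because it rounds 2 * (1.96 + 0.84)^2 up to 16).
print(n_per_group_proportion(0.08, 0.004))
print(n_per_group_continuous(sigma=25.0, delta=0.5))  # ~39,200 per group
```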
Worked examples
Example 1: Search ranking model
- Hypothesis: New ranker increases session CTR.
- Design: Offline evaluation with NDCG@10 and error analysis, then online interleaving for a fast, sensitive comparison, followed by an A/B test to measure business impact.
- Unit: user
- Primary metric: session CTR; Guardrails: latency p95, bounce rate
- MDE: 1% relative CTR lift; Duration: compute via baseline CTR and traffic
- Analysis: difference-in-means on user-level session CTR; heterogeneity by device and country
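As an illustration of the analysis step for Example 1, here is a hedged Python sketch of the user-level difference-in-means test; the DataFrame schema (user_id, variant, clicks, impressions) and the function name are assumptions, not a fixed logging format.

```python
# Sketch of a user-level difference-in-means test for session CTR.
import pandas as pd
from scipy import stats

def user_level_ctr_test(df: pd.DataFrame):
    # Aggregate to the randomization unit (the user) to avoid pseudo-replication.
    per_user = (
        df.groupby(["variant", "user_id"], as_index=False)[["clicks", "impressions"]]
          .sum()
          .assign(ctr=lambda x: x["clicks"] / x["impressions"])
    )
    control = per_user.loc[per_user["variant"] == "control", "ctr"]
    treatment = per_user.loc[per_user["variant"] == "treatment", "ctr"]
    # Welch's t-test on user-level CTR (does not assume equal variances).
    result = stats.ttest_ind(treatment, control, equal_var=False)
    lift = treatment.mean() - control.mean()
    return lift, result.statistic, result.pvalue
```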
Example 2: Fraud model threshold change
- Hypothesis: Higher threshold reduces false declines without raising fraud rate.
- Design: Shadow mode for a week to collect decisions and outcomes, then an A/B test
- Unit: transaction
- Primary metric: false decline rate; Guardrails: chargeback rate, manual review load
- MDE: absolute −0.2 pp in false declines
- Analysis: cost-weighted impact = savings from reduced false declines − added fraud losses; sequential monitoring is discouraged unless pre-specified
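To make the cost-weighted impact formula concrete, here is a back-of-the-envelope sketch; every rate and dollar figure below is a made-up placeholder to replace with your own estimates.

```python
# Illustrative cost-weighted impact calculation for the threshold change.
avg_order_value = 80.0           # average value of a legitimate order (placeholder)
false_decline_reduction = 0.002  # -0.2 pp absolute, matching the MDE above
extra_fraud_rate = 0.0003        # assumed increase in approved fraud (placeholder)
avg_fraud_loss = 250.0           # average loss per fraudulent transaction (placeholder)
n_transactions = 1_000_000       # transactions in the analysis window (placeholder)

savings = n_transactions * false_decline_reduction * avg_order_value
losses = n_transactions * extra_fraud_rate * avg_fraud_loss
net_impact = savings - losses
print(f"savings={savings:,.0f}, losses={losses:,.0f}, net={net_impact:,.0f}")
```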
Example 3: Onboarding recommendations
- Hypothesis: Personalized onboarding increases 7-day retention.
- Design: A/B test with a holdback; consider a switchback design if day-of-week effects are strong
- Unit: new user
- Primary metric: 7-day retention; Guardrails: time-to-first-value, support contacts per 1k users
- MDE: 3% relative lift; Duration: several weeks, since each user needs the full 7-day retention window after enrollment
- Analysis: stratify by acquisition channel; check SRM and event logging completeness
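The SRM check mentioned above (and in the ramp plan) can be a simple chi-square goodness-of-fit test of observed counts against the intended split. A minimal sketch; the counts and the 0.001 alert threshold are illustrative choices, not fixed rules.

```python
# Minimal SRM (sample ratio mismatch) check against an intended 50/50 split.
from scipy import stats

def srm_check(n_control, n_treatment, expected_ratio=0.5, alert_p=0.001):
    """Chi-square goodness-of-fit of observed variant counts vs the intended split."""
    total = n_control + n_treatment
    expected = [total * expected_ratio, total * (1 - expected_ratio)]
    chi2, p_value = stats.chisquare([n_control, n_treatment], f_exp=expected)
    return p_value, p_value < alert_p  # True means: pause the experiment and debug

print(srm_check(100_480, 99_520))  # p ~ 0.03: borderline, keep an eye on it
print(srm_check(101_500, 98_500))  # p << 0.001: assignment or logging is likely broken
```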
Exercises
Try these. Compare your work with the solutions, then use the checklist to self-review.
Exercise 1: Sample size for a conversion test
Baseline add-to-cart rate is 8%. You need to detect a 5% relative lift (absolute +0.4 pp). Two-sided alpha 0.05, power 0.8. Estimate the per-group sample size using the rule-of-thumb for proportions.
Hints
- Absolute MDE Δ = 0.004
- Use n ≈ 16 · p(1−p) / Δ²
Solution
p = 0.08, Δ = 0.004. Compute p(1−p) = 0.0736. n ≈ 16 · 0.0736 / 0.000016 = 1.1776 / 0.000016 ≈ 73,600 users per group (≈147,200 total).
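As an optional cross-check, a power calculation with statsmodels lands in the same range as the rule of thumb (the arcsine effect size it uses makes the result differ slightly from the shortcut).

```python
# Optional cross-check of the rule-of-thumb answer with statsmodels.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

p_baseline, p_treatment = 0.08, 0.084  # +0.4 pp absolute lift
effect_size = proportion_effectsize(p_treatment, p_baseline)
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.8, alternative="two-sided"
)
print(round(n_per_group))  # roughly 73,000-74,000 users per group
```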
Exercise 2: Choose the right design
You are testing a pricing algorithm across 50 cities. Users travel between nearby cities, and weekdays vs weekends differ a lot. Propose an experiment design, unit of randomization, guardrails, and a reasonable duration.
Hints
- Avoid spillovers across users who might cross city boundaries
- Control for day-of-week effects
Solution
Use cluster randomization at the city level with a switchback schedule (city-week or city-day). For example, half the cities start as treatment and half as control, then swap weekly. Guardrails: order cancellation rate, ETA accuracy, customer support contacts per 1k trips, and driver acceptance rate. Duration: 4–6 weeks to cover multiple weekly cycles and reduce variance.
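One way the city-week switchback assignment could be generated is sketched below; the city labels, the weekly flip, and the seed are illustrative choices, not the only valid design.

```python
# Illustrative city-week switchback schedule: half the cities start in treatment,
# then every city flips arm each week so each city sees both conditions.
import random

def switchback_schedule(cities, n_weeks, seed=7):
    rng = random.Random(seed)
    shuffled = list(cities)
    rng.shuffle(shuffled)
    start_treated = set(shuffled[: len(shuffled) // 2])
    schedule = {}
    for week in range(n_weeks):
        for city in cities:
            treated = (city in start_treated) == (week % 2 == 0)
            schedule[(city, week)] = "treatment" if treated else "control"
    return schedule

cities = [f"city_{i}" for i in range(50)]
plan = switchback_schedule(cities, n_weeks=6)
print(plan[("city_0", 0)], plan[("city_0", 1)])  # the arm flips from week to week
```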
Self-checklist
- Did you state a clear hypothesis and the exact primary metric?
- Is the unit of randomization chosen to minimize interference?
- Are guardrails defined with formulas and thresholds?
- Is MDE realistic given traffic and business value?
- Is the analysis plan fixed before looking at results?
- Do you have a ramp and rollback plan?
Common mistakes and how to self-check
- Vague metrics: Fix by writing exact numerators/denominators and time windows.
- Ignoring interference: Choose clusters or switchbacks when users interact.
- No pre-specified stopping rule: Pre-register duration and decision thresholds.
- Overly small MDE: Align with business value and traffic; run power analysis.
- Multiple peeks without correction: Limit interim looks or use a pre-specified alpha-spending plan.
- SRM overlooked: Monitor variant counts; if SRM is significant, pause and debug.
- Logging gaps: Add health checks and a plan to exclude affected windows.
Practical projects
- Offline to Online Pipeline
  - Pick a public dataset or your historical logs
  - Train a baseline and a new model; compare with cross-validation
  - Write a one-page experiment plan: hypotheses, metrics, MDE, analysis
  - Simulate an A/B test by bootstrapping users to estimate duration (a simulation sketch follows this list)
- Guardrail Dashboard
  - Define 3 guardrails relevant to your product
  - Create daily trend charts with alert thresholds
  - Document rollback triggers and who to page
- Interleaving Sandbox (Ranking)
  - Implement balanced interleaving for two rankers (a starter sketch follows this list)
  - Simulate clicks and measure preference
  - Compare sensitivity vs A/B on the same synthetic data
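For the bootstrap-based duration estimate in the Offline to Online Pipeline project, here is a rough simulation sketch; the synthetic history, traffic numbers, lift-injection trick, and 80% power target are all assumptions to adapt to your own data.

```python
# Rough duration estimate: resample historical users at expected daily traffic,
# inject the hypothesized lift, and find the first weekly checkpoint where
# roughly 80% of simulated tests come out significant.
import numpy as np
from scipy import stats

def estimate_days(user_conversions, daily_users, relative_lift,
                  max_days=56, alpha=0.05, n_sims=200, seed=0):
    rng = np.random.default_rng(seed)
    conversions = np.asarray(user_conversions)
    for days in range(7, max_days + 1, 7):
        n = days * daily_users // 2  # users per arm
        hits = 0
        for _ in range(n_sims):
            control = rng.choice(conversions, size=n, replace=True)
            treatment = rng.choice(conversions, size=n, replace=True)
            # Approximate the lift by flipping a few extra users to converted.
            flips = rng.random(n) < conversions.mean() * relative_lift
            treatment = np.maximum(treatment, flips.astype(treatment.dtype))
            if stats.ttest_ind(treatment, control, equal_var=False).pvalue < alpha:
                hits += 1
        if hits / n_sims >= 0.8:
            return days
    return None  # not adequately powered within max_days

# Example with synthetic history: 5% baseline conversion, 20k users/day, +5% relative lift.
history = np.random.default_rng(1).binomial(1, 0.05, size=200_000)
print(estimate_days(history, daily_users=20_000, relative_lift=0.05))
```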
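For the Interleaving Sandbox, here is a starter sketch of team-draft interleaving, a simpler and widely used relative of balanced interleaving; swap in the balanced variant if you want to match it exactly. The document IDs and clicks are placeholders.

```python
# Team-draft interleaving of two rankings, plus per-impression click credit.
import random

def team_draft_interleave(ranking_a, ranking_b, seed=None):
    """Merge two rankings; return the interleaved list and which ranker 'owns' each doc."""
    rng = random.Random(seed)
    interleaved, team = [], {}
    ia = ib = picks_a = picks_b = 0
    while ia < len(ranking_a) or ib < len(ranking_b):
        # The ranker with fewer picks drafts next; ties are broken by a coin flip.
        a_turn = picks_a < picks_b or (picks_a == picks_b and rng.random() < 0.5)
        if (a_turn and ia < len(ranking_a)) or ib >= len(ranking_b):
            doc, ia, owner = ranking_a[ia], ia + 1, "A"
        else:
            doc, ib, owner = ranking_b[ib], ib + 1, "B"
        if doc not in team:  # skip documents already placed by the other ranker
            interleaved.append(doc)
            team[doc] = owner
            picks_a += owner == "A"
            picks_b += owner == "B"
    return interleaved, team

def credit_clicks(team, clicked_docs):
    """The ranker whose drafted documents received more clicks wins this impression."""
    a = sum(team.get(d) == "A" for d in clicked_docs)
    b = sum(team.get(d) == "B" for d in clicked_docs)
    return "A" if a > b else "B" if b > a else "tie"

ranking_a = ["d1", "d2", "d3", "d4"]
ranking_b = ["d2", "d5", "d1", "d6"]
merged, team = team_draft_interleave(ranking_a, ranking_b, seed=42)
print(merged, credit_clicks(team, clicked_docs=["d5"]))
```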
Learning path
- Day 1–2: Principles of hypotheses, metrics, guardrails
- Day 3–4: Power/MDE, duration estimates, SRM checks
- Day 5–6: Designs (A/B, interleaving, switchback, clusters)
- Day 7: Write a complete experiment plan and review with a peer
Mini challenge
You plan to test a new recommendation diversity boost.
- Write a one-sentence hypothesis
- Pick a primary metric and two guardrails
- Choose a design and unit
- Set a realistic MDE
- State a rollout and rollback plan
Example answer
Hypothesis: Diversified ranking increases session GMV by 2% without hurting CTR or latency. Design: A/B at user level. Primary metric: session GMV/user. Guardrails: CTR (no worse than −1%), p95 latency (no worse than +20 ms). MDE: 1.5–2% relative GMV. Ramp: 1% → 10% → 50% → 100%, stop if guardrails breached.
Next steps
- Draft a one-page plan for your next model change and review it with your PM/engineer
- Set up automatic SRM and logging health checks
- Run the quick test below to confirm understanding