Who this is for
This lesson is for Applied Scientists and ML practitioners who need to turn ideas into safe, measurable experiments. You will plan offline evaluations, online A/B tests, and hybrid rollouts. It is especially useful for:
- Early-career Applied Scientists designing their first product experiments
- Data Scientists moving from analysis to decision-focused experimentation
- Engineers and PMs collaborating on ML model launches
Prerequisites
- Basic statistics: probability, confidence intervals, p-values
- Familiarity with classification/ranking metrics (e.g., precision, AUC, NDCG)
- Comfort with Python/R or SQL for analysis
Why this matters
As an Applied Scientist, your models must improve real outcomes, not just offline metrics. Strong experiment planning helps you:
- Turn vague ideas into testable hypotheses
- Pick the right design: offline, online A/B, interleaving, switchback, or quasi-experiments
- Define success metrics and guardrails to protect user experience and revenue
- Estimate sample size and duration before writing code
- Make confident decisions and avoid costly false launches
Concept explained simply
An experiment is a structured way to answer one question: Did the change cause an improvement? Planning ensures you know what to measure, how to measure it, and when to stop.
Mental model
- Idea to Decision pipeline: Idea → Hypothesis → Variables → Design → Analysis Plan → Decision
- Five Ws + H: Why (goal), What (treatment), Who (unit), Where (scope), When (duration), How (metrics and analysis)
- Risk-first thinking: Define guardrails so you know when to stop or rollback
Key components of an experiment plan
- Problem statement: What decision will this experiment inform?
- Hypotheses: Null H0 (no change) and Alternative H1 (improvement). Example H1: the new model increases add-to-cart rate.
- Design:
  - Offline: holdout set, cross-validation, replay/shadow modes
  - Online: A/B test, interleaving (ranking), switchback (time-based), cluster randomization (stores, cities)
- Unit of randomization: user, session, device, store, city. Avoid interference across units.
- Primary success metric and guardrails:
  - Primary: the one metric used for the launch decision
  - Guardrails: safety metrics (e.g., bounce rate, latency, support tickets)
  - Define precisely: numerator, denominator, window, inclusion rules
- MDE and sample size: Choose a minimal detectable effect that matters. Estimate per-group n and expected duration.
- Assignment and ramp: 1% → 10% → 50% → 100% with SRM (sample ratio mismatch) checks.
- Data collection plan: exposures, variants, timestamps, identifiers, metrics, logging health checks.
- Analysis plan: statistical test, outlier policy, heterogeneity cuts, stopping rules, multiple testing policy.
- Risk, ethics, and rollback criteria: define before launch.
Power/MDE quick guide
- Two-sided alpha = 0.05, power = 0.8 is a common default
- Rule-of-thumb for proportions: n per group ≈ 16 · p(1−p) / Δ², where p is the baseline rate and Δ is the absolute MDE (the 16 comes from 2 · (Zα/2 + Zβ)² ≈ 15.7 at the defaults above)
- For continuous metrics, variance matters: n per group ≈ 2 · (Zα/2 + Zβ)² · σ² / Δ²
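These rules of thumb are easy to script. Below is a minimal Python sketch, assuming the default two-sided alpha = 0.05 and power = 0.8; the function names and example inputs are illustrative.

```python
# Per-group sample size from the rules of thumb above.
# Assumes two-sided alpha = 0.05 and power = 0.8 (z-values hardcoded).
from math import ceil

Z_ALPHA_2 = 1.96  # two-sided alpha = 0.05
Z_BETA = 0.84     # power = 0.8

def n_per_group_proportion(p, delta):
    """Detect an absolute lift `delta` on a baseline rate `p`."""
    return ceil(2 * (Z_ALPHA_2 + Z_BETA) ** 2 * p * (1 - p) / delta ** 2)

def n_per_group_continuous(sigma, delta):
    """Detect an absolute shift `delta` on a metric with standard deviation `sigma`."""
    return ceil(2 * (Z_ALPHA_2 + Z_BETA) ** 2 * sigma ** 2 / delta ** 2)

# Example: 8% baseline rate, +0.4 pp absolute MDE gives roughly 72,000 per group
# (the "16" shortcut gives about 73,600 because it rounds 2 * (1.96 + 0.84)^2 up to 16).
print(n_per_group_proportion(0.08, 0.004))
print(n_per_group_continuous(sigma=25.0, delta=0.5))  # ~39,200 per group
```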
Worked examples
Example 1: Search ranking model
- Hypothesis: New ranker increases session CTR.
- Design: Offline evaluation with NDCG@10 and error analysis, then online interleaving for a fast, sensitive comparison, followed by an A/B test to measure business impact.
- Unit: user
- Primary metric: session CTR; Guardrails: latency p95, bounce rate
- MDE: 1% relative CTR lift; Duration: compute via baseline CTR and traffic
- Analysis: difference-in-means on user-level session CTR; heterogeneity by device and country
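As an illustration of the analysis step for Example 1, here is a hedged Python sketch of the user-level difference-in-means test; the DataFrame schema (user_id, variant, clicks, impressions) and the function name are assumptions, not a fixed logging format.

```python
# Sketch of a user-level difference-in-means test for session CTR.
import pandas as pd
from scipy import stats

def user_level_ctr_test(df: pd.DataFrame):
    # Aggregate to the randomization unit (the user) to avoid pseudo-replication.
    per_user = (
        df.groupby(["variant", "user_id"], as_index=False)[["clicks", "impressions"]]
          .sum()
          .assign(ctr=lambda x: x["clicks"] / x["impressions"])
    )
    control = per_user.loc[per_user["variant"] == "control", "ctr"]
    treatment = per_user.loc[per_user["variant"] == "treatment", "ctr"]
    # Welch's t-test on user-level CTR (does not assume equal variances).
    result = stats.ttest_ind(treatment, control, equal_var=False)
    lift = treatment.mean() - control.mean()
    return lift, result.statistic, result.pvalue
```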
Example 2: Fraud model threshold change
- Hypothesis: Higher threshold reduces false declines without raising fraud rate.
- Design: Shadow mode for a week to collect decisions and outcomes, then an A/B test
- Unit: transaction
- Primary metric: false decline rate; Guardrails: chargeback rate, manual review load
- MDE: absolute −0.2 pp in false declines
- Analysis: cost-weighted impact = savings from reduced false declines − added fraud losses; sequential monitoring is discouraged unless pre-specified
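To make the cost-weighted impact formula concrete, here is a back-of-the-envelope sketch; every rate and dollar figure below is a made-up placeholder to replace with your own estimates.

```python
# Illustrative cost-weighted impact calculation for the threshold change.
avg_order_value = 80.0           # average value of a legitimate order (placeholder)
false_decline_reduction = 0.002  # -0.2 pp absolute, matching the MDE above
extra_fraud_rate = 0.0003        # assumed increase in approved fraud (placeholder)
avg_fraud_loss = 250.0           # average loss per fraudulent transaction (placeholder)
n_transactions = 1_000_000       # transactions in the analysis window (placeholder)

savings = n_transactions * false_decline_reduction * avg_order_value
losses = n_transactions * extra_fraud_rate * avg_fraud_loss
net_impact = savings - losses
print(f"savings={savings:,.0f}, losses={losses:,.0f}, net={net_impact:,.0f}")
```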
Example 3: Onboarding recommendations
- Hypothesis: Personalized onboarding increases 7-day retention.
- Design: A/B test with a holdback; consider a switchback design if day-of-week effects are strong
- Unit: new user
- Primary metric: 7-day retention; Guardrails: time-to-first-value, support contacts per 1k users
- MDE: 3% relative lift; Duration: several weeks, since each user needs the full 7-day retention window after enrollment
- Analysis: stratify by acquisition channel; check SRM and event logging completeness
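The SRM check mentioned above (and in the ramp plan) can be a simple chi-square goodness-of-fit test of observed counts against the intended split. A minimal sketch; the counts and the 0.001 alert threshold are illustrative choices, not fixed rules.

```python
# Minimal SRM (sample ratio mismatch) check against an intended 50/50 split.
from scipy import stats

def srm_check(n_control, n_treatment, expected_ratio=0.5, alert_p=0.001):
    """Chi-square goodness-of-fit of observed variant counts vs the intended split."""
    total = n_control + n_treatment
    expected = [total * expected_ratio, total * (1 - expected_ratio)]
    chi2, p_value = stats.chisquare([n_control, n_treatment], f_exp=expected)
    return p_value, p_value < alert_p  # True means: pause the experiment and debug

print(srm_check(100_480, 99_520))  # p ~ 0.03: borderline, keep an eye on it
print(srm_check(101_500, 98_500))  # p << 0.001: assignment or logging is likely broken
```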
Exercises
Try these. Compare your work with the solutions, then use the checklist to self-review.
Exercise 1: Sample size for a conversion test
Baseline add-to-cart rate is 8%. You need to detect a 5% relative lift (absolute +0.4 pp). Two-sided alpha 0.05, power 0.8. Estimate the per-group sample size using the rule-of-thumb for proportions.
Hints
- Absolute MDE Δ = 0.004
- Use n ≈ 16 · p(1−p) / Δ²
Solution
p = 0.08, Δ = 0.004. Compute p(1−p) = 0.0736. n ≈ 16 · 0.0736 / 0.000016 = 1.1776 / 0.000016 ≈ 73,600 users per group (≈147,200 total).
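As an optional cross-check, a power calculation with statsmodels lands in the same range as the rule of thumb (the arcsine effect size it uses makes the result differ slightly from the shortcut).

```python
# Optional cross-check of the rule-of-thumb answer with statsmodels.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

p_baseline, p_treatment = 0.08, 0.084  # +0.4 pp absolute lift
effect_size = proportion_effectsize(p_treatment, p_baseline)
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.8, alternative="two-sided"
)
print(round(n_per_group))  # roughly 73,000-74,000 users per group
```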
Exercise 2: Choose the right design
You are testing a pricing algorithm across 50 cities. Users travel between nearby cities, and weekdays vs weekends differ a lot. Propose an experiment design, unit of randomization, guardrails, and a reasonable duration.
Hints
- Avoid spillovers across users who might cross city boundaries
- Control for day-of-week effects
Solution
Use cluster randomization at the city level with a switchback schedule (city-week or city-day). For example, half the cities start as treatment and half as control, then swap weekly. Guardrails: order cancellation rate, ETA accuracy, customer support contacts per 1k trips, and driver acceptance rate. Duration: 4–6 weeks to cover multiple weekly cycles and reduce variance.
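One way the city-week switchback assignment could be generated is sketched below; the city labels, the weekly flip, and the seed are illustrative choices, not the only valid design.

```python
# Illustrative city-week switchback schedule: half the cities start in treatment,
# then every city flips arm each week so each city sees both conditions.
import random

def switchback_schedule(cities, n_weeks, seed=7):
    rng = random.Random(seed)
    shuffled = list(cities)
    rng.shuffle(shuffled)
    start_treated = set(shuffled[: len(shuffled) // 2])
    schedule = {}
    for week in range(n_weeks):
        for city in cities:
            treated = (city in start_treated) == (week % 2 == 0)
            schedule[(city, week)] = "treatment" if treated else "control"
    return schedule

cities = [f"city_{i}" for i in range(50)]
plan = switchback_schedule(cities, n_weeks=6)
print(plan[("city_0", 0)], plan[("city_0", 1)])  # the arm flips from week to week
```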
Self-checklist
- Did you state a clear hypothesis and the exact primary metric?
- Is the unit of randomization chosen to minimize interference?
- Are guardrails defined with formulas and thresholds?
- Is MDE realistic given traffic and business value?
- Is the analysis plan fixed before looking at results?
- Do you have a ramp and rollback plan?
Common mistakes and how to self-check
- Vague metrics: Fix by writing exact numerators/denominators and time windows.
- Ignoring interference: Choose clusters or switchbacks when users interact.
- No pre-specified stopping rule: Pre-register duration and decision thresholds.
- Overly small MDE: Align with business value and traffic; run power analysis.
- Multiple peeks without correction: Limit interim looks or use a pre-specified alpha-spending plan.
- SRM overlooked: Monitor variant counts; if SRM is significant, pause and debug.
- Logging gaps: Add health checks and a plan to exclude affected windows.
Practical projects
- Offline to Online Pipeline
  - Pick a public dataset or your historical logs
  - Train a baseline and a new model; compare with cross-validation
  - Write a one-page experiment plan: hypotheses, metrics, MDE, analysis
  - Simulate an A/B test by bootstrapping users to estimate duration (a simulation sketch follows this list)
- Guardrail Dashboard
  - Define 3 guardrails relevant to your product
  - Create daily trend charts with alert thresholds
  - Document rollback triggers and who to page
- Interleaving Sandbox (Ranking)
  - Implement balanced interleaving for two rankers (a starter sketch follows this list)
  - Simulate clicks and measure preference
  - Compare sensitivity vs A/B on the same synthetic data
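For the bootstrap-based duration estimate in the Offline to Online Pipeline project, here is a rough simulation sketch; the synthetic history, traffic numbers, lift-injection trick, and 80% power target are all assumptions to adapt to your own data.

```python
# Rough duration estimate: resample historical users at expected daily traffic,
# inject the hypothesized lift, and find the first weekly checkpoint where
# roughly 80% of simulated tests come out significant.
import numpy as np
from scipy import stats

def estimate_days(user_conversions, daily_users, relative_lift,
                  max_days=56, alpha=0.05, n_sims=200, seed=0):
    rng = np.random.default_rng(seed)
    conversions = np.asarray(user_conversions)
    for days in range(7, max_days + 1, 7):
        n = days * daily_users // 2  # users per arm
        hits = 0
        for _ in range(n_sims):
            control = rng.choice(conversions, size=n, replace=True)
            treatment = rng.choice(conversions, size=n, replace=True)
            # Approximate the lift by flipping a few extra users to converted.
            flips = rng.random(n) < conversions.mean() * relative_lift
            treatment = np.maximum(treatment, flips.astype(treatment.dtype))
            if stats.ttest_ind(treatment, control, equal_var=False).pvalue < alpha:
                hits += 1
        if hits / n_sims >= 0.8:
            return days
    return None  # not adequately powered within max_days

# Example with synthetic history: 5% baseline conversion, 20k users/day, +5% relative lift.
history = np.random.default_rng(1).binomial(1, 0.05, size=200_000)
print(estimate_days(history, daily_users=20_000, relative_lift=0.05))
```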
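For the Interleaving Sandbox, here is a starter sketch of team-draft interleaving, a simpler and widely used relative of balanced interleaving; swap in the balanced variant if you want to match it exactly. The document IDs and clicks are placeholders.

```python
# Team-draft interleaving of two rankings, plus per-impression click credit.
import random

def team_draft_interleave(ranking_a, ranking_b, seed=None):
    """Merge two rankings; return the interleaved list and which ranker 'owns' each doc."""
    rng = random.Random(seed)
    interleaved, team = [], {}
    ia = ib = picks_a = picks_b = 0
    while ia < len(ranking_a) or ib < len(ranking_b):
        # The ranker with fewer picks drafts next; ties are broken by a coin flip.
        a_turn = picks_a < picks_b or (picks_a == picks_b and rng.random() < 0.5)
        if (a_turn and ia < len(ranking_a)) or ib >= len(ranking_b):
            doc, ia, owner = ranking_a[ia], ia + 1, "A"
        else:
            doc, ib, owner = ranking_b[ib], ib + 1, "B"
        if doc not in team:  # skip documents already placed by the other ranker
            interleaved.append(doc)
            team[doc] = owner
            picks_a += owner == "A"
            picks_b += owner == "B"
    return interleaved, team

def credit_clicks(team, clicked_docs):
    """The ranker whose drafted documents received more clicks wins this impression."""
    a = sum(team.get(d) == "A" for d in clicked_docs)
    b = sum(team.get(d) == "B" for d in clicked_docs)
    return "A" if a > b else "B" if b > a else "tie"

ranking_a = ["d1", "d2", "d3", "d4"]
ranking_b = ["d2", "d5", "d1", "d6"]
merged, team = team_draft_interleave(ranking_a, ranking_b, seed=42)
print(merged, credit_clicks(team, clicked_docs=["d5"]))
```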
Learning path
- Day 1–2: Principles of hypotheses, metrics, guardrails
- Day 3–4: Power/MDE, duration estimates, SRM checks
- Day 5–6: Designs (A/B, interleaving, switchback, clusters)
- Day 7: Write a complete experiment plan and review with a peer
Mini challenge
You plan to test a new recommendation diversity boost.
- Write a one-sentence hypothesis
- Pick a primary metric and two guardrails
- Choose a design and unit
- Set a realistic MDE
- State a rollout and rollback plan
Example answer
Hypothesis: Diversified ranking increases session GMV by 2% without hurting CTR or latency. Design: A/B at user level. Primary metric: session GMV/user. Guardrails: CTR (no worse than −1%), p95 latency (no worse than +20 ms). MDE: 1.5–2% relative GMV. Ramp: 1% → 10% → 50% → 100%, stop if guardrails breached.
Next steps
- Draft a one-page plan for your next model change and review it with your PM/engineer
- Set up automatic SRM and logging health checks
- Run the quick test below to confirm understanding