Null vs alternative: H0: No change in 28-day retention. H1: Retention increases by ≥2%.
Decision rule: Ship if the point estimate is ≥2%, the 95% CI excludes 0, and guardrails are stable.
Assumptions/risks: Email deliverability; seasonality; user annoyance.
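A minimal sketch of this decision rule, assuming the 2% target is an absolute lift in 28-day retention and using a normal-approximation CI for the difference of two proportions; the user counts are hypothetical.

```python
from math import sqrt

def retention_decision(x_c, n_c, x_t, n_t, mde=0.02, z=1.96):
    """Ship if the point estimate is >= mde and the 95% CI excludes 0."""
    p_c, p_t = x_c / n_c, x_t / n_t         # control / treatment retention
    diff = p_t - p_c                        # absolute lift
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    lo, hi = diff - z * se, diff + z * se   # 95% CI for the difference
    # lo > 0 is "CI excludes 0" for a positive lift
    return diff >= mde and lo > 0, (diff, lo, hi)

# Hypothetical counts: 20k users per arm, 21.00% vs 23.25% retained
ship, (diff, lo, hi) = retention_decision(4200, 20000, 4650, 20000)
print(f"lift={diff:.3f}, 95% CI=({lo:.3f}, {hi:.3f}), ship={ship}")
```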
Example 2 — Improve recommender CTR with ranker change
Hypothesis: Replacing the ranker with Model B will increase feed CTR by 1–2% over 14 days for signed-in mobile users due to better context modeling. Metric: CTR; guardrails: session length, crash rate; unit: session.
H0: No CTR change. H1: CTR increases by ≥1%.
Decision rule: Roll out if CTR improves by ≥1% with statistical significance and guardrails remain unaffected.
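One way to encode this rule, assuming the ≥1% target is a relative CTR lift and using a pooled two-proportion z-test at the session level; the click and session counts are hypothetical.

```python
from math import sqrt, erf

def ctr_rollout(clicks_a, sess_a, clicks_b, sess_b, min_rel_lift=0.01, alpha=0.05):
    p_a, p_b = clicks_a / sess_a, clicks_b / sess_b
    rel_lift = (p_b - p_a) / p_a
    p = (clicks_a + clicks_b) / (sess_a + sess_b)   # pooled CTR under H0
    se = sqrt(p * (1 - p) * (1 / sess_a + 1 / sess_b))
    z = (p_b - p_a) / se
    p_one_sided = 1 - 0.5 * (1 + erf(z / sqrt(2)))  # H1: CTR increases
    return rel_lift >= min_rel_lift and p_one_sided < alpha, rel_lift, p_one_sided

roll_out, lift, pval = ctr_rollout(52_000, 1_000_000, 53_100, 1_000_000)
print(f"relative lift={lift:.2%}, p={pval:.4f}, roll out={roll_out}")
```

Note that this treats sessions as independent; if the same user contributes many sessions, a clustered or user-level analysis is safer.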
Example 3 — Fraud model threshold
Hypothesis: Increasing the fraud score threshold by +0.1 will reduce the chargeback rate by 8–12% with less than a 0.5% drop in approval rate over 30 days, because more suspicious transactions are blocked. Primary metric: chargeback rate; constraint: approval rate drop < 0.5%.
H0: No chargeback reduction. H1: Chargeback rate falls by ≥8% with the approval-rate constraint met.
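The constraint makes this a two-part rule. A sketch, reading the 8% reduction as relative and the 0.5% approval drop as absolute percentage points (all rates hypothetical):

```python
def threshold_decision(cb_before, cb_after, appr_before, appr_after,
                       min_reduction=0.08, max_appr_drop=0.005):
    """Adopt the new threshold only if both the win and the constraint hold."""
    reduction = (cb_before - cb_after) / cb_before   # relative CB reduction
    appr_drop = appr_before - appr_after             # absolute approval drop
    return reduction >= min_reduction and appr_drop < max_appr_drop

# e.g. chargebacks 0.90% -> 0.80% (-11% rel.), approvals 94.0% -> 93.7% (-0.3pp)
print(threshold_decision(0.009, 0.008, 0.940, 0.937))  # True
```

Writing the constraint into the decision function, rather than checking it by eye afterward, keeps a "win" from shipping when the guardrail quietly fails.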
Why these are good
- They specify lever, unit, metric, effect size, time, mechanism, guardrails, and decision rule.
- They set expectations and reduce analysis ambiguity.
Quick checks before running
- Is the primary metric sensitive and aligned with the goal?
- Is the population clearly defined and large enough?
- Is the MDE realistic given power and duration? (See the back-of-envelope check after this list.)
- Are guardrail metrics defined to prevent harm?
- Is the analysis plan pre-committed (avoid metric shopping)?
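For the MDE question, a rough sample-size check is often enough to catch an unrealistic plan. This sketch uses the standard two-proportion formula at alpha=0.05 (two-sided) and 80% power; the baseline rate and MDE are illustrative, not from the examples above.

```python
from math import sqrt, ceil

def n_per_arm(p_base, mde_abs, z_alpha=1.96, z_beta=0.84):
    """Users per arm to detect an absolute lift of mde_abs over p_base."""
    p1, p2 = p_base, p_base + mde_abs
    p_bar = (p1 + p2) / 2
    num = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
           + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(num / mde_abs ** 2)

# Detecting a 2pp absolute lift on a 21% baseline:
print(n_per_arm(0.21, 0.02))  # about 6,700 users per arm
```

If the required sample is larger than the traffic the duration allows, either the MDE, the duration, or the population needs to change before the experiment starts.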
Exercises
Work through these first, then open the matching exercise cards below for guidance and solutions.
Exercise 1 — Metric and decision rule
Scenario: You plan to add a toxicity filter to a community forum. Write a hypothesis with primary metric, guardrails, and a clear decision rule.
- Metric must reflect community health.
- Include a constraint to protect engagement.
- Set a minimal effect size and time horizon.
Exercise 2 — Mechanism-first hypothesis
Scenario: A demand forecasting model adds a weather feature. Formulate a mechanism-driven hypothesis and define the segment where it should matter most.
- Identify the unit and segment.
- State why weather should help and by how much.
- Choose evaluation window.
Common mistakes and self-checks
- Vague metrics: Self-check — Can a teammate compute it without asking you?
- No mechanism: Self-check — Can you explain why the effect should exist?
- Missing guardrails: Self-check — What harm could a “win” still cause?
- Overfitting the story to data: Self-check — Is the hypothesis written before peeking? (A pre-registration sketch follows this list.)
- Wrong unit of analysis: Self-check — Does the metric align with the unit (user/session/request)?
- Unrealistic effect sizes: Self-check — Compare against historical variance and prior wins.
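One pattern that makes several of these self-checks concrete at once is pinning the hypothesis down as a plain record committed to version control before launch: a teammate can compute the metric from it, and peeking can't quietly rewrite it. The fields below are illustrative (using the forum scenario from Exercise 1), not a required schema.

```python
# Hypothetical pre-registration record; commit before the experiment starts.
PREREGISTRATION = {
    "lever": "toxicity filter on new forum posts",
    "population": "all public community forums",
    "unit": "user",
    "primary_metric": "weekly active posters",   # computable without asking
    "mde": "+1.5% relative over 28 days",
    "mechanism": "less toxicity, so more newcomers keep posting",
    "guardrails": {"posts_per_user": "drop < 2%", "crash_rate": "stable"},
    "decision_rule": "ship if MDE met, CI excludes 0, guardrails within bounds",
}
```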
Practical projects
- Write three hypotheses for the same goal using different levers (model change, UI tweak, policy). Compare their testability and risk.
- Retrospective rewrite: Take one past “win” and write the hypothesis you wish you had. Evaluate whether the decision would be the same.
- Guardrail audit: For your team’s top 3 KPIs, define guardrails and thresholds you would monitor in any experiment.
Learning path
- Start here: Hypothesis structure, metrics, guardrails.
- Next: Experiment design and power analysis.
- Then: Causal inference for observational data.
- Later: Metric design and counter-metrics for robust optimization.
Mini challenge
Fill this template for a project you care about: If we do [lever] for [population/segment], [primary metric] will [increase/decrease] by [A% or amount] within [T], because [mechanism]. Decision rule: ship if [criteria], while [guardrail] stays within [bounds].