Who this is for
Product Analysts, Product Managers, and Growth Analysts who run experiments and need to turn results into clear, confident product decisions.
Prerequisites
- Basic A/B testing concepts (control vs. variant, p-value or credible interval, MDE, power).
- Comfort reading experiment dashboards (conversion, revenue, guardrails).
- Familiarity with your product’s north-star and critical guardrail metrics.
Why this matters
In a real product role, decisions—not just numbers—drive outcomes. You’ll often need to recommend: ship, iterate, hold, or stop. You’ll justify tradeoffs (impact vs. risk), plan rollouts, and align with strategy. Getting this right saves time, avoids harmful launches, and accelerates wins.
Concept explained simply
Making decisions from A/B results means answering four questions:
- Is the test valid? (No sample bias, no tracking bugs, sample ratio mismatch (SRM) check passed.)
- Is the effect real? (Statistical significance or sufficient evidence.)
- Is it worth it? (Practical significance vs. cost, risk, and complexity.)
- How do we act? (Rollout plan, monitoring, follow-up experiments.)
Mental model
Use a 3-way decision tree: Ship, Iterate, or Stop.
- Ship: Valid test, clear benefit, guardrails okay, aligned with strategy.
- Iterate: Promising signal but inconclusive, or clear benefit with manageable risk that needs mitigation.
- Stop: Invalid, harmful, or misaligned. Learnings recorded; move on.
Decision checklist (open when deciding)
- Validity: SRM check (see the sketch after this checklist), tracking sanity, stable traffic mix.
- Evidence: Effect size with interval; power/MDE met; pre-specified test type (two-sided, one-sided, non-inferiority).
- Practicality: Incremental revenue/users; engineering/ops cost; complexity.
- Risk: Guardrails (support, latency, cancellations, retention); error budgets.
- Strategy: Moves the metric that matters; doesn’t contradict roadmap goals.
- Rollout: Who, how fast, monitoring, fallback, follow-up experiment.
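One way to operationalize the SRM check is a chi-square goodness-of-fit test on the observed assignment counts. This is a minimal sketch, assuming a 50/50 intended split and hypothetical counts; adapt the expected split to your experiment.

```python
# Sample ratio mismatch (SRM) check: compare observed assignment counts
# against the intended split with a chi-square goodness-of-fit test.
from scipy.stats import chisquare

control_n, variant_n = 600_400, 599_600      # hypothetical observed counts
total = control_n + variant_n
expected = [total * 0.5, total * 0.5]        # intended 50/50 split

stat, p_value = chisquare([control_n, variant_n], f_exp=expected)
print(f"SRM p-value: {p_value:.4f}")
# A very small p-value (commonly < 0.001) suggests the split is broken;
# investigate assignment and tracking before trusting any metric.
```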
Step-by-step decision framework
- Verify validity: Check SRM, missing events, outliers, environment changes. If any check fails, stop and fix before interpreting results.
- Quantify impact: Report absolute and relative effects with intervals. Translate to weekly/monthly impact.
- Check guardrails: Look for harm in support contacts, latency, retention, refund rate, or other safety metrics.
- Assess practicality: Account for build/ops cost, maintenance, and complexity. Small wins that are cheap often beat big wins that are expensive.
- Choose decision (see the sketch after this list):
  - Ship: Evidence strong, guardrails pass.
  - Iterate: Inconclusive or mixed with manageable risk; adjust design, increase power, or mitigate risks.
  - Stop: Invalid or harmful relative to thresholds.
- Rollout plan: Staged rollout (10% → 50% → 100%), alerting, success thresholds, and rollback criteria.
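The Ship / Iterate / Stop logic above can be captured in a small helper so reviews stay consistent. This is only a sketch under assumptions: the thresholds, argument names, and messages are illustrative, and the final call still requires judgment about cost, risk, and strategy.

```python
def recommend_decision(valid: bool, effect: float, ci_low: float, ci_high: float,
                       practical_min: float, guardrails_ok: bool) -> str:
    """Map A/B results to Ship / Iterate / Stop (illustrative thresholds only).

    effect, ci_low, ci_high: primary-metric lift and its interval (e.g. in pp).
    practical_min: smallest lift worth shipping given cost and complexity.
    """
    if not valid:
        return "Stop: fix validity issues (SRM, tracking) before interpreting."
    if guardrails_ok and ci_low > 0 and effect >= practical_min:
        return "Ship: clear, practically meaningful benefit and guardrails pass."
    if ci_high <= 0:
        return "Stop: no plausible benefit; record learnings and move on."
    return "Iterate: promising or mixed signal; increase power, refine, or mitigate risk."

# Example: +0.24 pp lift with CI [+0.04, +0.44] pp; 0.10 pp is worth shipping
print(recommend_decision(True, 0.24, 0.04, 0.44, 0.10, guardrails_ok=True))
```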
Interpreting A/B results reliably
- Look at absolute (percentage points) and relative (%) changes; the sketch after this list computes both.
- Use confidence/credible intervals to express uncertainty.
- Mind heterogeneity: pre-specified segments only; avoid post-hoc p-hacking.
- Consider novelty and seasonality: watch for temporary spikes/dips.
- Prefer pre-registered rules: stop rules, primary metric, and guardrails decided upfront.
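As a concrete illustration of reporting an absolute and relative effect with an interval, here is a sketch using a normal-approximation (Wald) interval for the difference of two proportions. The counts are made up; for small samples or very low rates, prefer an exact or bootstrap interval.

```python
import math

# Hypothetical counts: conversions / sessions per arm
conv_a, n_a = 2_000, 50_000
conv_b, n_b = 2_180, 50_000

p_a, p_b = conv_a / n_a, conv_b / n_b
diff = p_b - p_a                      # absolute effect (as a proportion)
rel = diff / p_a                      # relative effect

# Wald standard error for the difference in proportions, ~95% two-sided
se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
lo, hi = diff - 1.96 * se, diff + 1.96 * se

print(f"Absolute: {diff*100:+.2f} pp (95% CI {lo*100:+.2f} to {hi*100:+.2f} pp)")
print(f"Relative: {rel*100:+.1f}%")
```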
Worked examples
Example 1: Clear win, low risk → Ship
Experiment: New pricing layout.
- Primary: Paid conversion +3.1% relative (CI: +1.2% to +5.0%), p=0.004.
- AOV (average order value): +0.2% (ns).
- Refund rate: +0.03 pp (ns).
- Guardrails: Support tickets +0.4% (ns); Latency unchanged.
Decision: Ship to 100% with 24–48h monitoring. Add a follow-up test to refine price copy.
Example 2: Benefit with risk → Iterate
Experiment: Fewer onboarding steps.
- Activation +5.5% (CI: +1.0% to +10.1%).
- 7-day retention −1.2% (CI: −2.2% to −0.2%), a statistically significant harm.
Decision: Iterate. Test a variant that preserves the key removed step for high-risk users; consider staged rollout with retention monitoring.
Example 3: Inconclusive but promising → Extend or refine
Experiment: New recommendations widget.
- Revenue/session +1.0% (CI: −0.3% to +2.2%).
- Underpowered: the test was sized for an MDE of 1.5%, and the observed +1.0% lift falls below that threshold (see the sample-size sketch after this example).
Decision: Extend to reach power or refine design to aim for larger effect. No ship yet.
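To see why the options are "extend" or "aim for a larger effect", a rough two-proportion sample-size calculation is useful. This sketch uses the standard normal-approximation formula; the 5% baseline rate is an assumption for illustration.

```python
from math import ceil
from scipy.stats import norm

def sessions_per_arm(p_base: float, mde_rel: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sessions per arm to detect a relative lift of mde_rel
    on a baseline rate p_base (two-sided test, normal approximation)."""
    p_alt = p_base * (1 + mde_rel)
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p_base * (1 - p_base) + p_alt * (1 - p_alt)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_alt - p_base) ** 2)

# Hypothetical 5% baseline conversion
print(sessions_per_arm(0.05, 0.015))  # sized for a 1.5% relative MDE
print(sessions_per_arm(0.05, 0.010))  # detecting a 1.0% lift needs roughly 2.25x more traffic
```

Because the required sample size grows roughly with the inverse square of the effect, a smaller-than-planned effect can make the remaining runtime impractical, which is when refining the design becomes the better option.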
Practical projects
- Write a 1-page decision memo for a past A/B test using the checklist.
- Build a simple decision dashboard: primary effect, guardrails, CI, and rollout status.
- Create a rollout playbook template: thresholds, alerting, and rollback steps.
Exercises
Note: Everyone can do the exercises and quick test for free; only logged-in users will have their progress saved.
Exercise 1 — Classify decision and rollout
You ran an experiment on checkout copy.
Traffic: 1,200,000 sessions (A=600k, B=600k), SRM passed.
- Primary: Checkout conversion A=3.00%, B=3.24% (diff +0.24 pp, +8.0% rel), CI [+0.04, +0.44] pp, p=0.02.
- AOV: A=$52.10, B=$52.05 (ns).
- Refund rate: A=2.2%, B=2.3% (diff +0.1 pp), p=0.15.
- Support tickets/order: +0.5% (ns).
- Latency (p95): −20ms, p=0.01 (improved).
- Task: Choose Ship / Iterate / Stop and outline a 3-step rollout plan.
Exercise 2 — Non-inferiority decision memo
Goal: Replace SMS OTP with an email magic link if login success is not worse by more than 0.3 pp (non-inferiority margin).
- Login success: B −0.12 pp vs A, 95% CI [−0.28, +0.04] pp.
- Security incidents: no change.
- Cost: saves ~$40k/month.
- User complaints: −6% (ns).
- Task: Write a 5-sentence decision memo: context, evidence, decision, rollout, monitoring.
Sample answers for the exercises
Exercise 1 — Sample answer
Decision: Ship. Evidence is significant, effect is practical, guardrails pass, latency improved.
Rollout plan: 1) 10% for 24h with alerting on conversion and refunds; 2) 50% for 48h; 3) 100% if metrics stable. Add a follow-up ticket to monitor refund trend weekly.
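One way to make a rollout plan like this operational is to write it down as a small config that dashboards and alerting can read. The stage durations, metric names, and pause thresholds below are assumptions to adapt, not a standard.

```python
# Hypothetical staged-rollout plan with pause thresholds (values are illustrative).
ROLLOUT_PLAN = [
    {"stage": "10%",  "min_hours": 24, "pause_if": {"conversion_drop_pp": 0.10, "refund_increase_pp": 0.20}},
    {"stage": "50%",  "min_hours": 48, "pause_if": {"conversion_drop_pp": 0.10, "refund_increase_pp": 0.20}},
    {"stage": "100%", "min_hours": 0,  "pause_if": {"conversion_drop_pp": 0.10, "refund_increase_pp": 0.20}},
]

def should_pause(stage: dict, observed: dict) -> bool:
    """Pause (and consider rollback) if any observed regression exceeds its threshold."""
    return any(observed.get(metric, 0.0) > limit
               for metric, limit in stage["pause_if"].items())

print(should_pause(ROLLOUT_PLAN[0], {"refund_increase_pp": 0.35}))  # True -> pause at 10%
```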
Exercise 2 — Sample memo
Context: We aimed to switch to magic link if it is not worse than SMS by more than 0.3 pp in login success. Evidence: Observed −0.12 pp with CI [−0.28, +0.04], which meets our non-inferiority criterion. Decision: Proceed to replace SMS with magic link. Rollout: 25% → 100% over one week, with on-call coverage during peak hours. Monitoring: Login success (lower bound −0.3 pp), security incidents (must be unchanged), cost savings, and user feedback volume.
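To make the non-inferiority criterion in the memo concrete: the check passes when the lower bound of the confidence interval for the difference stays above the negative margin. A minimal sketch with the exercise's numbers; note that non-inferiority is usually pre-registered as a one-sided test, which corresponds to using the lower bound of the two-sided 95% CI.

```python
def non_inferior(ci_low_pp: float, margin_pp: float) -> bool:
    """Non-inferiority holds if the worst plausible loss is smaller than the margin."""
    return ci_low_pp > -margin_pp

# Exercise 2: login success diff -0.12 pp, 95% CI [-0.28, +0.04] pp, margin 0.3 pp
print(non_inferior(ci_low_pp=-0.28, margin_pp=0.30))  # True -> criterion met
```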
Common mistakes and self-check
- Stopping early without a pre-specified rule. Self-check: Do we have a documented stop rule?
- Ignoring guardrails. Self-check: Are safety metrics within thresholds?
- Overweighting small, significant effects that are not practical. Self-check: Is the impact worth the cost/complexity?
- Chasing post-hoc segments. Self-check: Were segments pre-specified?
- Confusing non-inferiority with superiority. Self-check: Is the margin and hypothesis correct?
Mini challenge
Given an experiment with +1.4% revenue/session (CI: +0.1% to +2.7%) but a +5% increase in support tickets (CI: +1% to +9%), propose a rollout strategy that maximizes upside while managing risk. Include: thresholds to pause, mitigation steps, and success criteria after 7 days.
Learning path
- Before this: Experiment design, metrics selection, powering/MDE.
- This lesson: Decision-making from results.
- Next: Rollout execution, monitoring, and post-launch validation.
Next steps
- Turn one historic experiment into a 1-page decision memo.
- Define your team’s default rollout tiers and guardrail thresholds.
- Prepare templates for non-inferiority and holdout validations.
Quick Test
Take the quick test to check your understanding. Available to everyone; only logged-in users get saved progress.