Why this matters
A/B testing is how AI Product Managers ship smarter AI features without hurting users or the business. With AI, small prompt or model changes can impact conversion, cost, latency, and safety. You’ll use A/B tests to:
- Decide between prompts or models (e.g., GPT-4 vs. distilled model)
- Validate guardrails (e.g., toxicity filter) without hurting task success
- Tune ranking or recommendations powered by embeddings
- Control rollout risk for cost and latency spikes
Real tasks you’ll do:
- Write a test plan with hypothesis, metrics, and sample size
- Choose the right unit of randomization (user/session/conversation)
- Monitor SRM (sample-ratio mismatch) and guardrail violations
- Make decisions with confidence intervals and practical significance
Concept explained simply
An A/B test randomly assigns users (or sessions) to Variant A (control) or Variant B (treatment). If the treatment improves the primary metric and doesn’t violate guardrails, you ship it. Randomization makes the groups comparable; you compare outcomes to estimate the treatment effect.
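To make randomization concrete, here is a minimal Python sketch of deterministic assignment by hashing a user ID with a per-experiment salt. The function and experiment names are illustrative, and most experimentation platforms handle this for you.

```python
# Minimal sketch of deterministic variant assignment (illustrative names).
# Hashing the user ID with a per-experiment salt keeps assignment stable
# across sessions and splits traffic roughly 50/50.
import hashlib

def assign_variant(user_id: str, experiment: str = "prompt_test_v1") -> str:
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100           # deterministic bucket in [0, 99]
    return "B" if bucket < 50 else "A"       # 50/50 split; lower the cutoff to ramp gradually

print(assign_variant("user_12345"))  # the same user always gets the same variant
```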
Mental model: Four-layer decision cake
- Decision: What will you do if B wins or fails?
- Metrics: One primary success metric; a few guardrails (safety, cost, latency, user satisfaction).
- Power: Enough traffic to detect the change you care about (MDE).
- Integrity: Clean randomization, no leaks, monitor SRM, avoid peeking mistakes.
Common AI metrics you can use
- Business: conversion rate, activation, retention, revenue per user
- User experience: task success rate, time-to-completion, CSAT
- Model quality: helpfulness score, hallucination rate, safety flag rate
- Cost & latency: cost per 1k queries, p95 latency
Key steps for A/B testing AI features
- Define the decision & hypothesis. Example: “If the new prompt reduces time-to-resolution by 10% without raising hallucination flags, roll to 100%.”
- Pick metrics. Primary: one metric tied to your goal. Guardrails: 2–4 that must not degrade (e.g., p95 latency > 600ms is a fail).
- Estimate sample size (MDE). Rough rule for proportions (e.g., CTR): n per variant ≈ 16 × p × (1 − p) ÷ Δ², where p is the baseline rate and Δ is the absolute lift you care about. For averages (e.g., minutes): n per variant ≈ 16 × σ² ÷ Δ², where σ is the standard deviation and Δ is the absolute change you want to detect. (See the sketch after this list.)
- Choose the randomization unit. Prefer user-level for persistent experiences; use session- or conversation-level if users can have multiple independent trials. Avoid cross-over contamination.
- Run safely. Start small (e.g., 5–10% of traffic), monitor SRM and guardrails, then ramp. Use an A/A test or shadow mode first for high-risk features.
- Analyze & decide. Compare A vs. B and compute the difference with a 95% CI. If the CI excludes zero and the result meets your practical lift and guardrails, ship. If inconclusive, consider a longer run or a larger target effect.
- Roll out & watch. Canary the release and monitor post-launch metrics for novelty effects.
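If you would rather compute the rough sample sizes than do the arithmetic by hand, here is a small Python sketch of the 16× rule above (an approximation of the standard formula at roughly 80% power and a 5% significance level). The inputs are whatever baseline rate, standard deviation, and Δ you supply.

```python
# Rough sample-size helpers for the 16x rule. Inputs are assumptions you supply:
# baseline rate p or standard deviation sigma, and the absolute change delta
# you want to be able to detect.

def n_per_variant_proportion(p: float, delta: float) -> int:
    """n ≈ 16 · p · (1 − p) ÷ Δ² for a baseline rate p and absolute lift Δ."""
    return round(16 * p * (1 - p) / delta**2)

def n_per_variant_mean(sigma: float, delta: float) -> int:
    """n ≈ 16 · σ² ÷ Δ² for a metric with standard deviation σ."""
    return round(16 * sigma**2 / delta**2)

print(n_per_variant_proportion(0.03, 0.006))  # ≈ 12,933 per variant (Example 2 below)
print(n_per_variant_mean(6, 1))               # = 576 per variant (Example 1 below)
```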
How to spot SRM (sample-ratio mismatch)
- Planned split: 50/50. Observed: 55/45 for many hours → suspicious.
- Causes: caching, bot filtering, region imbalance, late assignment.
- Fix: assign at user-id boundary, ensure assignment before expensive logic, exclude bots consistently.
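A quick way to automate the SRM check is a chi-square goodness-of-fit test on the observed assignment counts. The sketch below assumes scipy is available; the alert threshold (for example, p < 0.001) is a common convention rather than a hard rule.

```python
# Minimal SRM check: compare the observed split to the planned split with a
# chi-square goodness-of-fit test. A very small p-value suggests the assignment
# pipeline is broken rather than randomly unlucky.
from scipy.stats import chisquare

def srm_p_value(observed_a: int, observed_b: int, planned_ratio: float = 0.5) -> float:
    total = observed_a + observed_b
    expected = [total * planned_ratio, total * (1 - planned_ratio)]
    stat, p_value = chisquare([observed_a, observed_b], f_exp=expected)
    return p_value

# Example: planned 50/50 but observed 55,000 vs. 45,000 assignments
print(srm_p_value(55_000, 45_000))  # tiny p-value -> investigate before trusting results
```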
Worked examples
Example 1: AI support assistant prompt change
Goal: Reduce time-to-resolution (TTR). Baseline mean 10 min, σ ≈ 6 min. Target Δ = −1 min (10%).
- Primary: mean TTR
- Guardrails: hallucination flag rate (≤ baseline), CSAT (no drop), p95 latency ≤ 2s, cost per ticket ≤ baseline
- Sample size (rough): n ≈ 16 × 6² ÷ 1² = 16 × 36 = 576 per variant
- Decision: If mean TTR drops ≥ 1 min and guardrails hold, roll out
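A sketch of the analysis step for this example: a 95% confidence interval for the difference in mean TTR, using a normal approximation on summary statistics. The counts and results below are made up for illustration.

```python
# 95% CI for the difference in mean TTR (B − A) from summary statistics.
import math

def diff_in_means_ci(mean_a, sd_a, n_a, mean_b, sd_b, n_b, z=1.96):
    diff = mean_b - mean_a
    se = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    return diff, (diff - z * se, diff + z * se)

# Illustrative numbers: baseline 10.0 min (σ 6.0), treatment 8.9 min (σ 5.8), 600 users each
diff, (lo, hi) = diff_in_means_ci(10.0, 6.0, 600, 8.9, 5.8, 600)
print(f"Δ TTR = {diff:.2f} min, 95% CI [{lo:.2f}, {hi:.2f}]")
# CI excludes zero and the point estimate beats the 1-minute target,
# so check guardrails before deciding to roll out.
```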
Example 2: LLM-generated product descriptions
Goal: Improve CTR from 3.0% to 3.6% (Δ = 0.006).
- Primary: CTR
- Guardrails: conversion rate (no drop), p95 latency ≤ 600ms, cost per 1k tokens ≤ $X
- Sample size (rough): n ≈ 16 × 0.03 × 0.97 ÷ 0.006² ≈ 16 × 0.0291 ÷ 0.000036 ≈ 12,933 per variant
- Decision: If CTR increases and conversion holds, ramp to 50% then 100%
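The same analysis for a proportion metric like CTR: the difference in rates with a 95% normal-approximation confidence interval. The click and impression counts below are illustrative.

```python
# 95% CI for the difference in CTR (B − A) using the two-proportion normal approximation.
import math

def diff_in_proportions_ci(clicks_a, n_a, clicks_b, n_b, z=1.96):
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    diff = p_b - p_a
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return diff, (diff - z * se, diff + z * se)

# Illustrative counts: 420/13,000 clicks for A vs. 500/13,000 for B
diff, (lo, hi) = diff_in_proportions_ci(420, 13_000, 500, 13_000)
print(f"Δ CTR = {diff:.4f}, 95% CI [{lo:.4f}, {hi:.4f}]")
```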
Example 3: RAG-based ranking tweaks
Goal: Increase purchase conversion with new retrieval settings.
- Primary: purchase conversion
- Guardrails: p95 latency ≤ 800ms, cost per query ≤ baseline + 5%, harmful content rate ≤ baseline
- Unit: user-level (avoid mixed experiences)
- Run: 10% → 25% → 50%, monitor SRM and guardrails each step
- Decision: Ship if conversion lift is significant and guardrails pass
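One way to make the “guardrails pass” part of each ramp decision mechanical is a small check that compares every observed guardrail to its threshold before traffic moves to the next step. The metric names and limits below are illustrative, not a standard.

```python
# Sketch of a guardrail check run at each ramp step (illustrative names and thresholds).
GUARDRAILS = {
    "p95_latency_ms": 800,          # must stay at or below this
    "cost_per_query_delta_pct": 5,  # at most +5% vs. baseline
    "harmful_content_rate_delta": 0,
}

def guardrails_pass(observed: dict) -> bool:
    for metric, limit in GUARDRAILS.items():
        value = observed[metric]
        if value > limit:
            print(f"FAIL {metric}: {value} exceeds limit {limit}")
            return False
    return True

print(guardrails_pass({"p95_latency_ms": 760,
                       "cost_per_query_delta_pct": 3.2,
                       "harmful_content_rate_delta": 0.0}))  # True -> continue the ramp
```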
Who this is for
- AI Product Managers and PMs adding AI features
- Data-minded Designers and Engineers partnering on experiments
- Founders validating AI product decisions
Prerequisites
- Basic product metrics (conversion, retention, CSAT)
- Intro stats intuition (averages, variability, confidence interval)
- Familiarity with your product’s event and ID structure
Learning path
- Define success metrics and guardrails for AI
- Design and run A/B tests (this subskill)
- Analyze results and make decisions
- Plan safe rollouts and post-launch monitoring
Practical steps to design your test
- State the decision rule in one sentence
- Choose 1 primary metric and 2–4 guardrails
- Pick the unit (user/session/conversation)
- Estimate MDE and rough n with the rules above
- Plan a ramp (e.g., 10% → 25% → 50%)
- Write stop conditions (SRM, safety spike, latency breach)
Copy-ready test plan template
- Hypothesis: [treatment] will [impact] by [Δ], given [context]
- Primary metric: [exact definition]
- Guardrails: [list and thresholds]
- Unit & attribution window: [e.g., user, 7 days]
- Traffic & duration: [%, days], target n ≈ [calc]
- Stop conditions: [SRM, safety, latency, cost]
- Decision rule: [ship/ramp/rollback criteria]
Common mistakes (and how to self-check)
- Peeking too often → Inflates false positives. Self-check: predefine looks or wait for the full sample size; use a consistent decision rule.
- Wrong unit of randomization → Leakage/carryover. Self-check: can a user touch both variants in one journey? If yes, move to user-level.
- Sample-ratio mismatch (SRM) → Biased results. Self-check: monitor split hourly; investigate caching/filters.
- Too many metrics → Confused decisions. Self-check: one primary, few guardrails.
- Underpowered test → Inconclusive results. Self-check: does the MDE match the business value? If not, increase traffic or run longer.
- Ignoring cost/latency → Unsustainable wins. Self-check: include them as guardrails.
- Seasonality or novelty effects → Temporary spikes. Self-check: run long enough, monitor post-ship.
Exercises
Do these now. They match the graded exercises below.
Exercise 1: Draft a test plan for AI reply auto-draft
Context: An AI feature drafts replies for customer emails. Baseline TTR is 12 minutes (σ ≈ 7). You want at least a 1.5-minute reduction. Define:
- Hypothesis and decision rule
- Primary metric and 3 guardrails
- Unit of randomization
- Rough sample size per variant
Hint: n ≈ 16 × σ² ÷ Δ².
Exercise 2: SRM and unit pitfalls
Scenario: You planned 50/50 traffic but observe 57/43 after 24 hours. Also, returning users see different variants across sessions.
- List likely SRM causes and fixes
- Choose the right unit of randomization for a chat assistant and explain why
Pre-launch checklist
- Decision rule written in one sentence
- Primary metric and guardrails with thresholds
- Randomization unit chosen and justified
- Estimated n and planned duration
- SRM, safety, latency monitors set up
- Ramp plan and stop conditions documented
Mini challenge
You have two prompts: A is safer but slightly verbose; B is concise but sometimes incomplete. Design a 2-week test where safety violations must not increase and resolution rate must not drop. What’s your decision rule? Write it in a single sentence.
Practical projects
- Project 1: Compare two prompts for support replies. Deliver a one-page test plan, a ramp schedule, and a post-mortem template.
- Project 2: Test a cheaper model vs. premium. Include cost per 1k queries and p95 latency as guardrails. Decide based on total contribution margin.
- Project 3: Recommendation ranking tweak. Measure conversion and long-click depth, with a toxicity guardrail for generated snippets.
Next steps
- Apply this to one real feature this week
- Automate a basic SRM alert
- Standardize your test plan template across the team