Who this is for
- AI Product Managers who make decisions about launching AI features and need confidence those changes really help users.
- PMs, data analysts, and UX leads collaborating on metrics, shipping experiments, and interpreting results.
Prerequisites
- Comfort with product metrics (conversion rate, retention, CTR).
- Basic statistics vocabulary (mean, variance, confidence interval) at a conceptual level.
- Ability to work with spreadsheets for simple calculations.
Why this matters
As an AI PM, you will often:
- Decide whether a new model or prompt improves user outcomes.
- Choose a primary metric (your Overall Evaluation Criterion, or OEC) and guardrails for safety, quality, and performance.
- Plan experiment ramp, sample size, and runtime to avoid misleading results.
- Explain trade-offs and make launch calls with clarity and evidence.
Concept explained simply
An online experiment (often an A/B test) compares a control (the current experience) to a treatment (the new experience) using randomized assignment. You pick a unit of randomization (usually the user) and a primary metric that reflects value (your OEC). You add guardrail metrics to ensure you do no harm (e.g., crash rate, latency, abuse reports). Then you run the experiment until you have enough data to estimate the difference with acceptable uncertainty.
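As a concrete, simplified illustration of what "estimate the difference with acceptable uncertainty" looks like for a binary conversion metric, here is a minimal sketch; the counts are made up:

```python
# Compare control vs treatment on a conversion metric and report the lift with
# a 95% confidence interval. The counts below are illustrative, not real data.
import math

control_conversions, control_users = 1_150, 24_000
treatment_conversions, treatment_users = 1_265, 24_100

p_c = control_conversions / control_users
p_t = treatment_conversions / treatment_users
diff = p_t - p_c

# Standard error of the difference between two independent proportions
se = math.sqrt(p_c * (1 - p_c) / control_users + p_t * (1 - p_t) / treatment_users)
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se

print(f"control {p_c:.2%}, treatment {p_t:.2%}, lift {diff:+.2%}")
print(f"95% CI for the difference: [{ci_low:+.2%}, {ci_high:+.2%}]")
```

If the whole interval sits above zero and clears your minimum practical effect, the evidence favors the treatment; if it straddles zero, you do not yet have enough data to make the call.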
Mental model
Think of experiments as a 5-step loop:
- Hypothesis: Specific, directional, testable.
- Metrics: Primary (OEC), secondary, guardrails.
- Design: Unit, randomization, segments, sample size, runtime, ramps.
- Run: Monitor data quality, guardrails, novelty effects.
- Decide: Interpret results; ship, iterate, or stop.
Core design choices (quick reference)
1) Hypothesis
Write it as: "If we [change], then [user/product behavior] will [direction] because [mechanism]." Example: "If we switch to the new ranking model, session purchases per user will increase because improved relevance surfaces higher-converting items."
2) Experiment unit and exposure
Common units: user, session, request. Prefer user-level to avoid cross-contamination. Ensure each unit sees only one variant during the test.
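One common way to guarantee that a user never switches variants is deterministic bucketing on a hash of the user ID; a minimal sketch, where the experiment name used as a hash salt is hypothetical:

```python
# Deterministic user-level assignment: the same user ID always maps to the same
# variant, so a user cannot switch variants mid-test.
# "new_ranking_model_v1" is a hypothetical experiment name used as a hash salt.
import hashlib

def assign_variant(user_id: str, experiment: str = "new_ranking_model_v1",
                   treatment_share: float = 0.5) -> str:
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform value in [0, 1]
    return "treatment" if bucket < treatment_share else "control"

print(assign_variant("user_12345"))  # same output on every call, device, or service
```

Because the assignment is a pure function of the user ID and experiment name, it stays stable across sessions, restarts, and services without any lookup table.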
3) Randomization
Split traffic equally (50/50) unless you need asymmetric ramps for risk mitigation. Consider stratified or blocked randomization if key segments are imbalanced (e.g., new vs returning users).
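If you do stratify, a rough sketch of a stratified 50/50 split (assuming you can pull a list of users with a segment label) looks like this:

```python
# Stratified 50/50 randomization: shuffle and split within each segment so both
# arms end up with the same mix of, e.g., new vs returning users.
import random
from collections import defaultdict

def stratified_split(users, seed=42):
    """users: list of (user_id, segment) tuples; returns {user_id: variant}."""
    rng = random.Random(seed)
    by_segment = defaultdict(list)
    for user_id, segment in users:
        by_segment[segment].append(user_id)

    assignment = {}
    for ids in by_segment.values():
        rng.shuffle(ids)
        half = len(ids) // 2
        assignment.update({uid: "treatment" for uid in ids[:half]})
        assignment.update({uid: "control" for uid in ids[half:]})
    return assignment

# Illustrative input; in practice this comes from your user table.
print(stratified_split([("u1", "new"), ("u2", "new"), ("u3", "returning"), ("u4", "returning")]))
```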
4) Metrics
The primary metric (OEC) aligns with business and user value; secondary metrics explore mechanisms; guardrails protect reliability and safety (e.g., latency, error rate, abuse flags). Define exact formulas and event windows up front.
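One lightweight way to do that is to capture the metric spec in a single reviewable structure before launch; every name, formula, and threshold below is an illustrative placeholder:

```python
# Illustrative metric spec: exact formula, event window, and pause thresholds
# agreed before launch. All names and numbers here are placeholders.
experiment_metrics = {
    "primary": {
        "name": "add_to_cart_rate_per_user",
        "formula": "users with >= 1 add_to_cart event / users exposed",
        "window": "first 7 days after first exposure",
    },
    "secondary": [
        {"name": "ctr", "formula": "item clicks / item impressions", "window": "per session"},
    ],
    "guardrails": [
        {"name": "p95_latency_ms", "pause_if": "worse than control by more than 10%"},
        {"name": "crash_rate", "pause_if": "worse than control by more than 5%"},
    ],
}
```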
5) Variants
Keep differences minimal to isolate causal impact. If testing multiple changes, prefer a factorial design or a sequence of tests over a single bundle.
6) Sample size and runtime
Choose a minimum detectable effect (MDE) that is practically meaningful. Estimate the sample size from the baseline rate, variability, and MDE. Run for whole business cycles (typically at least one to two full weeks) even if you reach the sample size earlier.
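For a conversion-rate metric, here is a sketch of that estimate using the standard two-proportion sample-size formula; the baseline, MDE, power, and traffic numbers are all illustrative:

```python
# Sample-size estimate for a conversion-rate metric via the two-proportion
# formula n_per_arm = (z_alpha + z_beta)^2 * (p1*(1-p1) + p2*(1-p2)) / MDE^2.
# Baseline, MDE, and traffic below are illustrative placeholders.
from math import ceil
from scipy.stats import norm

baseline = 0.05        # current conversion rate (5%)
mde = 0.005            # minimum detectable absolute lift (0.5 points, 10% relative)
alpha, power = 0.05, 0.80

z_alpha = norm.ppf(1 - alpha / 2)   # ~1.96 for a two-sided 5% test
z_beta = norm.ppf(power)            # ~0.84 for 80% power
p_treat = baseline + mde

variance = baseline * (1 - baseline) + p_treat * (1 - p_treat)
n_per_arm = ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)

daily_eligible_users = 12_000       # assumed traffic you can expose per day
days = 2 * n_per_arm / daily_eligible_users

print(f"~{n_per_arm:,} users per arm; at {daily_eligible_users:,}/day that is "
      f"about {days:.1f} days; round up to whole weeks")
```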
7) Novelty and ramp
New experiences can cause short-term spikes or drops. Use gradual ramps (e.g., 5% → 25% → 50%) with monitoring. Watch for stabilization before making decisions.
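One way to make the ramp explicit is to write it down as a gated schedule; the percentages, hold times, and gates below are illustrative, not a recommendation:

```python
# Illustrative ramp schedule with explicit gates; advance only when the gate holds.
ramp_plan = [
    {"traffic_pct": 5,  "hold_days": 2, "gate": "no guardrail breaches; data quality checks pass"},
    {"traffic_pct": 25, "hold_days": 3, "gate": "guardrails stable; metrics moving in the expected direction"},
    {"traffic_pct": 50, "hold_days": 7, "gate": "primary metric stable after the novelty period"},
]
```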
8) Data quality and A/A tests
Run an A/A test (control vs control) to validate randomization and instrumentation when introducing new metrics or platforms.
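A mock A/A check is easy to run on historical data: randomly split past users into two groups and confirm the measured "difference" is statistically indistinguishable from zero. A minimal sketch, with simulated outcomes standing in for data you would pull from your warehouse:

```python
# Mock A/A check: split historical per-user outcomes into two random groups and
# verify the difference is consistent with zero. Simulated 0/1 outcomes stand in
# for the real per-user data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
outcomes = rng.binomial(1, 0.05, size=50_000)   # placeholder historical conversions

split = rng.random(outcomes.size) < 0.5
group_a, group_b = outcomes[split], outcomes[~split]

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"A mean {group_a.mean():.4f}, B mean {group_b.mean():.4f}, p-value {p_value:.3f}")
# With healthy randomization and instrumentation, p < 0.05 should occur only ~5% of the time.
```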
9) Interference and spillover
Users can affect each other (e.g., through messaging). Choose a randomization unit that contains the interference (e.g., team or workspace), or avoid experimenting where interference is severe.
10) Ethics and risk
Define stop conditions for harm (e.g., spikes in error rate or abuse reports). For AI features, include misuse potential and bias checks in the guardrails.
Worked examples
Example 1: Ranking model for recommendations
- Hypothesis: The new model increases add-to-cart rate per user by 3%+ due to better relevance.
- Unit: User; 50/50 split.
- Primary metric: Add-to-cart rate per user. Secondary: CTR, revenue per user. Guardrails: latency p95, crash rate.
- Design: 1-week minimum to cover weekday/weekend patterns. Ramp 10% → 50% → 100% of experiment traffic.
- Decision: If add-to-cart rate improves and guardrails stay within limits, proceed to a staged rollout; otherwise iterate on features or training data.
Example 2: AI chat assistant temperature change
- Hypothesis: Lower temperature reduces hallucinations, increasing conversation resolution rate without hurting satisfaction.
- Unit: Conversation session.
- Primary metric: Resolution rate (issue marked solved). Secondary: CSAT, time to first useful reply. Guardrails: content safety flags, escalation rate to human.
- Design: Monitor content safety daily; predefine thresholds to pause if flags increase.
- Decision: Launch if resolution improves and safety is stable or better.
Example 3: Signup funnel with AI fraud screening
- Hypothesis: The AI fraud screen reduces fake accounts while minimally impacting legitimate signup conversion.
- Unit: User.
- Primary metric: Verified legitimate signups per 1000 visitors (quality-adjusted conversion). Guardrails: false-positive rate on known-good traffic, page load time.
- Design: Stratify by traffic source (ads vs organic). Run 2 weeks to capture source variability.
- Decision: Ship if quality-adjusted conversion rises and false positives stay below threshold.
Design checklist
- Hypothesis is specific, directional, and mechanism-based.
- Unit of randomization avoids contamination.
- Primary, secondary, and guardrail metrics are precisely defined.
- Target MDE is practically meaningful.
- Sample size and runtime cover at least one full cycle.
- Ramp plan and stop conditions are documented.
- Interference, seasonality, and novelty considered.
- Data quality checks (including an optional A/A test) planned.
- Ethics, safety, and abuse guardrails in place.
Exercises
Complete the exercises right on this page in the Exercises section below. Then take the quick test.
- Exercise 1: Draft a one-page experiment brief for AI reply suggestions.
- Exercise 2: Estimate sample size and runtime with a simple rule-of-thumb.
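For Exercise 2, one widely used rule of thumb (roughly 80% power at a 5% significance level) is n per arm ≈ 16 × p × (1 − p) / MDE², where p is the baseline conversion rate. A sketch with placeholder numbers:

```python
# Rule-of-thumb sample size (~80% power, 5% significance) for Exercise 2.
# Baseline and MDE are placeholders; substitute your own numbers.
baseline = 0.20   # e.g., share of conversations where a suggested reply is used
mde = 0.02        # smallest absolute lift worth detecting (2 points)

n_per_arm = 16 * baseline * (1 - baseline) / mde ** 2
print(f"~{n_per_arm:,.0f} users per arm")
# Runtime: 2 * n_per_arm / daily eligible users, rounded up to whole weeks.
```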
Common mistakes and self-check
- Mistake: Picking a vague primary metric. Fix: Write the exact formula and unit (e.g., purchases per user per week).
- Mistake: Stopping early on a good day. Fix: Pre-commit runtime; review weekly patterns.
- Mistake: Ignoring guardrails. Fix: Define thresholds that trigger pause.
- Mistake: Over-segmentation fishing for wins. Fix: Pre-register key segments; treat others as exploratory.
- Mistake: Cross-contamination between variants. Fix: Ensure users cannot switch variants mid-test.
Quick self-check
- Can you explain why your primary metric is the right proxy for value?
- Do you know your MDE and why it is practical?
- If results are null, what is your iteration plan?
Practical projects
- Design and document an experiment for switching your search ranking model. Include hypothesis, metrics, MDE, ramp, and stop conditions.
- Create a monitoring dashboard plan: daily guardrail checks and decision readiness criteria.
- Run a mock A/A test using past data split into two groups to practice data quality validation.
Learning path
- Before this: Metrics fundamentals, event instrumentation basics.
- This subskill: Hypotheses, metrics, units, sample size/runtime, guardrails, and decision-making.
- Next: Advanced experimentation (variance reduction, sequential testing), causal inference for non-experimental data, and multi-armed bandits.
Next steps
- Use the checklist to review an experiment you or your team recently ran.
- Complete the exercises below and take the quick test to confirm understanding.
- Apply these basics to your next AI feature proposal.
Mini challenge
You have 20% of traffic and one week to test a new AI summarization feature. Your primary metric improves, but p95 latency is 20% worse and abuse flags are slightly up. In 5–7 sentences, outline your decision and immediate follow-ups.