Who this is for
- AI Product Managers who make decisions about launching AI features and need confidence those changes really help users.
- PMs, data analysts, and UX leads collaborating on metrics, shipping experiments, and interpreting results.
Prerequisites
- Comfort with product metrics (conversion rate, retention, CTR).
- Basic statistics vocabulary (mean, variance, confidence interval) at a conceptual level.
- Ability to work with spreadsheets for simple calculations.
Why this matters
As an AI PM, you will often:
- Decide whether a new model or prompt improves user outcomes.
- Choose a primary metric (your Overall Evaluation Criterion, or OEC) and guardrails for safety, quality, and performance.
- Plan experiment ramp, sample size, and runtime to avoid misleading results.
- Explain trade-offs and make launch calls with clarity and evidence.
Concept explained simply
An online experiment (often an A/B test) compares a control (the current experience) to a treatment (the new experience) using randomized assignment. You pick a unit of randomization (usually the user) and a primary metric that reflects value (your OEC). You add guardrail metrics to ensure you do no harm (e.g., crash rate, latency, abuse reports). Then you run the experiment until you have enough data to estimate the difference with acceptable uncertainty.
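As a concrete, simplified illustration of what "estimate the difference with acceptable uncertainty" looks like for a binary conversion metric, here is a minimal sketch; the counts are made up:

```python
# Compare control vs treatment on a conversion metric and report the lift with
# a 95% confidence interval. The counts below are illustrative, not real data.
import math

control_conversions, control_users = 1_150, 24_000
treatment_conversions, treatment_users = 1_265, 24_100

p_c = control_conversions / control_users
p_t = treatment_conversions / treatment_users
diff = p_t - p_c

# Standard error of the difference between two independent proportions
se = math.sqrt(p_c * (1 - p_c) / control_users + p_t * (1 - p_t) / treatment_users)
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se

print(f"control {p_c:.2%}, treatment {p_t:.2%}, lift {diff:+.2%}")
print(f"95% CI for the difference: [{ci_low:+.2%}, {ci_high:+.2%}]")
```

If the whole interval sits above zero and clears your minimum practical effect, the evidence favors the treatment; if it straddles zero, you do not yet have enough data to make the call.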
Mental model
Think of experiments as a 5-step loop:
- Hypothesis: Specific, directional, testable.
- Metrics: Primary (OEC), secondary, guardrails.
- Design: Unit, randomization, segments, sample size, runtime, ramps.
- Run: Monitor data quality, guardrails, novelty effects.
- Decide: Interpret results; ship, iterate, or stop.
Core design choices (quick reference)
1) Hypothesis
Write it as: "If we [change], then [user/product behavior] will [direction] because [mechanism]." Example: "If we switch to the new ranking model, session purchases per user will increase because improved relevance surfaces higher-converting items."
2) Experiment unit and exposure
Common units: user, session, request. Prefer user-level to avoid cross-contamination. Ensure each unit sees only one variant during the test.
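One common way to guarantee that a user never switches variants is deterministic bucketing on a hash of the user ID; a minimal sketch, where the experiment name used as a hash salt is hypothetical:

```python
# Deterministic user-level assignment: the same user ID always maps to the same
# variant, so a user cannot switch variants mid-test.
# "new_ranking_model_v1" is a hypothetical experiment name used as a hash salt.
import hashlib

def assign_variant(user_id: str, experiment: str = "new_ranking_model_v1",
                   treatment_share: float = 0.5) -> str:
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform value in [0, 1]
    return "treatment" if bucket < treatment_share else "control"

print(assign_variant("user_12345"))  # same output on every call, device, or service
```

Because the assignment is a pure function of the user ID and experiment name, it stays stable across sessions, restarts, and services without any lookup table.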
3) Randomization
Split traffic equally (50/50) unless you need asymmetric ramps for risk mitigation. Consider stratified or blocked randomization if key segments are imbalanced (e.g., new vs returning users).
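If you do stratify, a rough sketch of a stratified 50/50 split (assuming you can pull a list of users with a segment label) looks like this:

```python
# Stratified 50/50 randomization: shuffle and split within each segment so both
# arms end up with the same mix of, e.g., new vs returning users.
import random
from collections import defaultdict

def stratified_split(users, seed=42):
    """users: list of (user_id, segment) tuples; returns {user_id: variant}."""
    rng = random.Random(seed)
    by_segment = defaultdict(list)
    for user_id, segment in users:
        by_segment[segment].append(user_id)

    assignment = {}
    for ids in by_segment.values():
        rng.shuffle(ids)
        half = len(ids) // 2
        assignment.update({uid: "treatment" for uid in ids[:half]})
        assignment.update({uid: "control" for uid in ids[half:]})
    return assignment

# Illustrative input; in practice this comes from your user table.
print(stratified_split([("u1", "new"), ("u2", "new"), ("u3", "returning"), ("u4", "returning")]))
```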
4) Metrics
The primary metric (OEC) aligns with business and user value; secondary metrics explore mechanisms; guardrails protect reliability and safety (e.g., latency, error rate, abuse flags). Define exact formulas and event windows up front.
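One lightweight way to do that is to capture the metric spec in a single reviewable structure before launch; every name, formula, and threshold below is an illustrative placeholder:

```python
# Illustrative metric spec: exact formula, event window, and pause thresholds
# agreed before launch. All names and numbers here are placeholders.
experiment_metrics = {
    "primary": {
        "name": "add_to_cart_rate_per_user",
        "formula": "users with >= 1 add_to_cart event / users exposed",
        "window": "first 7 days after first exposure",
    },
    "secondary": [
        {"name": "ctr", "formula": "item clicks / item impressions", "window": "per session"},
    ],
    "guardrails": [
        {"name": "p95_latency_ms", "pause_if": "worse than control by more than 10%"},
        {"name": "crash_rate", "pause_if": "worse than control by more than 5%"},
    ],
}
```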
5) Variants
Keep differences minimal to isolate causal impact. If testing multiple changes, prefer a factorial design or a sequence of tests over a single bundle.
6) Sample size and runtime
Choose a minimum detectable effect (MDE) that is practically meaningful. Estimate the sample size from the baseline rate, variability, and MDE. Run for whole business cycles (typically at least one to two full weeks) even if you reach the sample size earlier.
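For a conversion-rate metric, here is a sketch of that estimate using the standard two-proportion sample-size formula; the baseline, MDE, power, and traffic numbers are all illustrative:

```python
# Sample-size estimate for a conversion-rate metric via the two-proportion
# formula n_per_arm = (z_alpha + z_beta)^2 * (p1*(1-p1) + p2*(1-p2)) / MDE^2.
# Baseline, MDE, and traffic below are illustrative placeholders.
from math import ceil
from scipy.stats import norm

baseline = 0.05        # current conversion rate (5%)
mde = 0.005            # minimum detectable absolute lift (0.5 points, 10% relative)
alpha, power = 0.05, 0.80

z_alpha = norm.ppf(1 - alpha / 2)   # ~1.96 for a two-sided 5% test
z_beta = norm.ppf(power)            # ~0.84 for 80% power
p_treat = baseline + mde

variance = baseline * (1 - baseline) + p_treat * (1 - p_treat)
n_per_arm = ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)

daily_eligible_users = 12_000       # assumed traffic you can expose per day
days = 2 * n_per_arm / daily_eligible_users

print(f"~{n_per_arm:,} users per arm; at {daily_eligible_users:,}/day that is "
      f"about {days:.1f} days; round up to whole weeks")
```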
7) Novelty and ramp
New experiences can cause short-term spikes or drops. Use gradual ramps (e.g., 5% → 25% → 50%) with monitoring. Watch for stabilization before making decisions.
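One way to make the ramp explicit is to write it down as a gated schedule; the percentages, hold times, and gates below are illustrative, not a recommendation:

```python
# Illustrative ramp schedule with explicit gates; advance only when the gate holds.
ramp_plan = [
    {"traffic_pct": 5,  "hold_days": 2, "gate": "no guardrail breaches; data quality checks pass"},
    {"traffic_pct": 25, "hold_days": 3, "gate": "guardrails stable; metrics moving in the expected direction"},
    {"traffic_pct": 50, "hold_days": 7, "gate": "primary metric stable after the novelty period"},
]
```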
8) Data quality and A/A tests
Run an A/A test (control vs control) to validate randomization and instrumentation when introducing new metrics or platforms.
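A mock A/A check is easy to run on historical data: randomly split past users into two groups and confirm the measured "difference" is statistically indistinguishable from zero. A minimal sketch, with simulated outcomes standing in for data you would pull from your warehouse:

```python
# Mock A/A check: split historical per-user outcomes into two random groups and
# verify the difference is consistent with zero. Simulated 0/1 outcomes stand in
# for the real per-user data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
outcomes = rng.binomial(1, 0.05, size=50_000)   # placeholder historical conversions

split = rng.random(outcomes.size) < 0.5
group_a, group_b = outcomes[split], outcomes[~split]

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"A mean {group_a.mean():.4f}, B mean {group_b.mean():.4f}, p-value {p_value:.3f}")
# With healthy randomization and instrumentation, p < 0.05 should occur only ~5% of the time.
```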
9) Interference and spillover
Users can affect each other (e.g., through messaging). Choose a randomization unit that contains the interference (e.g., team or workspace), or avoid experimenting where interference is severe.
10) Ethics and risk
Define stop conditions for harm (e.g., spikes in error rate or abuse reports). For AI features, include misuse potential and bias checks in the guardrails.
Worked examples
Example 1: Ranking model for recommendations
- Hypothesis: The new model increases add-to-cart rate per user by 3%+ due to better relevance.
- Unit: User; 50/50 split.
- Primary metric: Add-to-cart rate per user. Secondary: CTR, revenue per user. Guardrails: latency p95, crash rate.
- Design: 1-week minimum to cover weekday/weekend patterns. Ramp 10% → 50% → 100% of experiment traffic.
- Decision: If add-to-cart rate improves and guardrails stay within limits, proceed to a staged rollout; otherwise iterate on features or training data.
Example 2: AI chat assistant temperature change
- Hypothesis: Lower temperature reduces hallucinations, increasing conversation resolution rate without hurting satisfaction.
- Unit: Conversation session.
- Primary metric: Resolution rate (issue marked solved). Secondary: CSAT, time to first useful reply. Guardrails: content safety flags, escalation rate to human.
- Design: Monitor content safety daily; predefine thresholds to pause if flags increase.
- Decision: Launch if resolution improves and safety is stable or better.
Example 3: Signup funnel with AI fraud screening
- Hypothesis: The AI fraud screen reduces fake accounts while minimally impacting legitimate signup conversion.
- Unit: User.
- Primary metric: Verified legitimate signups per 1000 visitors (quality-adjusted conversion). Guardrails: false-positive rate on known-good traffic, page load time.
- Design: Stratify by traffic source (ads vs organic). Run 2 weeks to capture source variability.
- Decision: Ship if quality-adjusted conversion rises and false positives stay below threshold.
Design checklist
- Hypothesis is specific, directional, and mechanism-based.
- Unit of randomization avoids contamination.
- Primary, secondary, and guardrail metrics are precisely defined.
- Target MDE is practically meaningful.
- Sample size and runtime cover at least one full cycle.
- Ramp plan and stop conditions are documented.
- Interference, seasonality, and novelty considered.
- Data quality checks (including an optional A/A test) planned.
- Ethics, safety, and abuse guardrails in place.
Exercises
Complete the exercises right on this page in the Exercises section below. Then take the quick test.
- Exercise 1: Draft a one-page experiment brief for AI reply suggestions.
- Exercise 2: Estimate sample size and runtime with a simple rule-of-thumb.
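For Exercise 2, one widely used rule of thumb (roughly 80% power at a 5% significance level) is n per arm ≈ 16 × p × (1 − p) / MDE², where p is the baseline conversion rate. A sketch with placeholder numbers:

```python
# Rule-of-thumb sample size (~80% power, 5% significance) for Exercise 2.
# Baseline and MDE are placeholders; substitute your own numbers.
baseline = 0.20   # e.g., share of conversations where a suggested reply is used
mde = 0.02        # smallest absolute lift worth detecting (2 points)

n_per_arm = 16 * baseline * (1 - baseline) / mde ** 2
print(f"~{n_per_arm:,.0f} users per arm")
# Runtime: 2 * n_per_arm / daily eligible users, rounded up to whole weeks.
```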
Common mistakes and self-check
- Mistake: Picking a vague primary metric. Fix: Write the exact formula and unit (e.g., purchases per user per week).
- Mistake: Stopping early on a good day. Fix: Pre-commit runtime; review weekly patterns.
- Mistake: Ignoring guardrails. Fix: Define thresholds that trigger pause.
- Mistake: Over-segmentation fishing for wins. Fix: Pre-register key segments; treat others as exploratory.
- Mistake: Cross-contamination between variants. Fix: Ensure users cannot switch variants mid-test.
Quick self-check
- Can you explain why your primary metric is the right proxy for value?
- Do you know your MDE and why it is practical?
- If results are null, what is your iteration plan?
Practical projects
- Design and document an experiment for switching your search ranking model. Include hypothesis, metrics, MDE, ramp, and stop conditions.
- Create a monitoring dashboard plan: daily guardrail checks and decision readiness criteria.
- Run a mock A/A test using past data split into two groups to practice data quality validation.
Learning path
- Before this: Metrics fundamentals, event instrumentation basics.
- This subskill: Hypotheses, metrics, units, sample size/runtime, guardrails, and decision-making.
- Next: Advanced experimentation (variance reduction, sequential testing), causal inference for non-experimental data, and multi-armed bandits.
Next steps
- Use the checklist to review an experiment you or your team recently ran.
- Complete the exercises below and take the quick test to confirm understanding.
- Apply these basics to your next AI feature proposal.
Mini challenge
You have 20% of traffic and one week to test a new AI summarization feature. Your primary metric improves, but p95 latency is 20% worse and abuse flags are slightly up. In 5–7 sentences, outline your decision and immediate follow-ups.