Why this matters
As a Data Analyst, you will be asked to design experiments that the team can trust. Good design lets stakeholders make decisions with confidence, avoid false wins, and detect real improvements quickly. You’ll translate product ideas into testable hypotheses, pick the right metrics, choose sample sizes, and spot risks that could invalidate results.
- Prioritize features: Decide if a new design should roll out.
- Quantify impact: Estimate how much a change improves conversion or revenue.
- Reduce risk: Catch negative side effects with guardrail metrics.
- Move fast with rigor: Avoid re-running tests due to preventable flaws.
Who this is for
- Data Analysts supporting product, growth, or marketing teams.
- PMs and Designers who collaborate on A/B tests.
- Engineers implementing experiments who want analytic clarity.
Prerequisites
- Basic statistics (mean, proportion, variance, confidence interval).
- Comfort with SQL/spreadsheets to compute metrics and segments.
- Understanding of your product’s events and user identifiers.
Concept explained simply
An experiment compares two worlds: control (current) and treatment (new). We randomly assign units (e.g., users) to each world so that the only systematic difference is the change we’re testing. After enough data has accumulated, we measure whether the treatment moved the metric in the desired direction beyond normal noise.
Key pieces:
- Hypothesis: A clear, directional statement (e.g., “New checkout reduces drop-off by 5%”).
- Unit of randomization: Who or what is assigned (user, session, store, email send).
- Primary metric: The single, decision-making metric tied to the goal.
- Guardrail metrics: Safety checks to catch harm (e.g., latency, refund rate).
- Minimum detectable effect (MDE): Smallest change worth detecting.
- Sample size & duration: How many units and how long to run to reach power.
- Exclusions & segmentation: Rules to keep the test fair and insights deeper.
Mental model
- Fair coin split: Randomization is a fair coin sending units to A or B, balancing known and unknown factors on average.
- Recipe card: Your design doc is a repeatable recipe—same inputs, same steps, same checks—so peers can reproduce it.
- One lever at a time: Change one meaningful thing per test to attribute outcomes confidently.
Core steps to design a clean experiment
- Define the decision. What will you do if the metric goes up/down/unchanged?
- Write the hypothesis. Be specific about direction and audience.
- Choose unit of randomization. Prefer user-level; switch to a higher level (e.g., store, geography) if there’s cross-user interference.
- Pick metrics. One primary; a few secondaries; guardrails for safety.
- Set MDE and power. What change matters to the business? Typical: 80% power, 5% alpha.
- Estimate sample size & duration. Use historical baselines and traffic to plan realistically.
- Plan exclusions & pre-checks. Bots, staff, test users; ensure balanced covariates at start.
- Freeze plan. Avoid mid-test changes; predefine stop criteria.
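One way to make the “freeze plan” step concrete is to capture the design as a structured record before launch. Below is a minimal sketch in Python; the class, field names, and example values are illustrative, not a prescribed schema.

```python
# Illustrative design-doc template as code; names and values are hypothetical.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ExperimentPlan:
    decision_rule: str           # what you will do at each possible outcome
    hypothesis: str              # directional, audience-specific statement
    randomization_unit: str      # e.g., "user", "store", "geo"
    primary_metric: str          # the single decision metric
    secondary_metrics: List[str] = field(default_factory=list)
    guardrail_metrics: List[str] = field(default_factory=list)
    mde_absolute: float = 0.0    # smallest effect worth detecting
    power: float = 0.80
    alpha: float = 0.05
    exclusions: List[str] = field(default_factory=list)
    stop_rule: str = ""          # pre-registered time + sample criteria

plan = ExperimentPlan(
    decision_rule="Roll out if purchase conversion improves with no guardrail regressions",
    hypothesis="Higher-contrast checkout button increases purchase conversion by 3 points",
    randomization_unit="user",
    primary_metric="purchase_conversion_7d",
    secondary_metrics=["add_to_cart_rate"],
    guardrail_metrics=["refund_rate", "checkout_latency_p95"],
    mde_absolute=0.03,
    exclusions=["bots", "staff", "test_users"],
    stop_rule="14 days or target sample per group, whichever comes later",
)
print(plan)
```

Sharing a record like this with PM and Engineering before launch makes mid-test changes visible and easy to flag.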
Tip: a quick sample size rule-of-thumb
For a binary metric with baseline rate p and desired absolute lift d (MDE), per-group sample size n ≈ 16 × p × (1 − p) / d^2 for ~80% power and 5% alpha. This is a rough guide—treat as directional.
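A minimal sketch of this rule of thumb in Python (the function name is ours; treat the output as directional, per the note above):

```python
import math

def rough_n_per_group(p: float, d: float) -> int:
    """Rule-of-thumb per-group sample size for a binary metric.

    p: baseline rate, d: absolute MDE. Approximates ~80% power, 5% alpha.
    """
    return math.ceil(16 * p * (1 - p) / d ** 2)

print(rough_n_per_group(p=0.12, d=0.03))  # checkout example below: ~1878 per group
```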
Worked examples
Example 1 — Checkout button contrast
- Decision: Roll out if purchase conversion improves without raising refund rate.
- Hypothesis: A higher-contrast button increases purchase conversion by 3% (absolute).
- Unit: User (avoid session-level so one user cannot see both variants).
- Primary metric: Purchase conversion per user within 7 days.
- Guardrails: Refund rate, checkout latency, error rate.
- Baseline: 12% conversion; MDE: 3% absolute (0.12 → 0.15).
- Rough n per group: 16×0.12×0.88/0.03^2 = 1.6896/0.0009 ≈ 1878.
- Duration: If you get ~800 new unique users per day per group, you reach the target in ~3 days after ramp-up (see the sketch after this example). Add a buffer for day-of-week effects: target 1–2 weeks.
- Notes: Keep other UI changes off; log exposures and conversions consistently.
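A quick way to turn the per-group sample size into a runtime estimate, using the ~800 users/day/group assumed in this example:

```python
import math

n_per_group = 1878           # from the rule of thumb above
daily_users_per_group = 800  # assumed new unique users per day per group

days_to_target = math.ceil(n_per_group / daily_users_per_group)
print(days_to_target)  # ~3 days; still plan 1-2 weeks to cover weekly cycles
```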
Example 2 — Email subject line
- Decision: Adopt subject B if open rate improves without hurting unsubscribe rate.
- Unit: Email address (recipient level). Do not randomize per send; a recipient who receives multiple sends could see both variants (contamination).
- Primary metric: Open rate within 48 hours of send.
- Guardrails: Unsubscribe rate, spam complaint rate.
- Baseline: 25% open; MDE: 2% absolute.
- Rough n per group: 16×0.25×0.75/0.02^2 ≈ 16×0.1875/0.0004 ≈ 7500 recipients.
- Notes: Send both variants at the same time of day to avoid time biases.
Example 3 — Pricing page layout
- Risk: Users share links; some arrive via ads. Interference across users is possible.
- Unit: User if login-based; if many anonymous users share devices, consider cookie + device as the unit and enforce sticky assignment (see the sketch after this example).
- Primary metric: Paid conversion within 14 days; Secondary: Plan mix; Guardrails: Support tickets, refund rate.
- Baseline: 5% paid conversion; MDE: 1% absolute.
- Rough n per group: 16×0.05×0.95/0.01^2 ≈ 16×0.0475/0.0001 ≈ 7600.
- Notes: Freeze any ad targeting changes during the test window to avoid traffic shifts.
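A common way to enforce sticky assignment is to hash a stable unit id together with an experiment-specific name and map the hash to a variant. A minimal sketch; the ids and experiment name below are made up for illustration.

```python
import hashlib

def assign_variant(unit_id: str, experiment: str, variants=("control", "treatment")) -> str:
    """Deterministic, sticky assignment: the same unit id and experiment name
    always map to the same variant, with no assignment table to maintain."""
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# The same cookie+device id lands in the same group on every visit.
print(assign_variant("cookie123-deviceA", "pricing_page_layout_v1"))
```

Salting the hash with the experiment name keeps assignments independent across experiments.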
Choosing metrics that lead to decisions
- Primary metric: One metric that aligns to your goal and is sensitive enough (e.g., conversion, activation rate, revenue/user).
- Secondary metrics: Diagnostics to explain “why” (e.g., CTR, add-to-cart rate).
- Guardrails: Safety-first (e.g., latency, crash rate, unsubscribe, refunds).
- Time window: Define a fixed observation window (e.g., 7-day conversion) to avoid peeking bias (see the sketch after this list).
- MDE: Choose the smallest effect worth shipping. Smaller MDE → larger sample.
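To keep the observation window fixed, count an outcome only if it falls within N days of the unit’s first exposure. A minimal pandas sketch; the table names, column names, and sample rows are assumptions for illustration.

```python
import pandas as pd

# Hypothetical event tables; column names and rows are illustrative.
exposures = pd.DataFrame({
    "user_id": [1, 2, 3],
    "exposed_at": pd.to_datetime(["2024-05-01", "2024-05-02", "2024-05-03"]),
    "variant": ["control", "treatment", "treatment"],
})
conversions = pd.DataFrame({
    "user_id": [1, 3],
    "converted_at": pd.to_datetime(["2024-05-05", "2024-05-20"]),
})

WINDOW = pd.Timedelta(days=7)

# Each user's window starts at their first exposure.
first_exposure = exposures.groupby("user_id", as_index=False).agg(
    exposed_at=("exposed_at", "min"), variant=("variant", "first")
)

# Keep only the first conversion per user, then check it falls inside the window.
first_conversion = conversions.groupby("user_id", as_index=False)["converted_at"].min()
merged = first_exposure.merge(first_conversion, on="user_id", how="left")
merged["converted_7d"] = (
    merged["converted_at"].notna()
    & (merged["converted_at"] <= merged["exposed_at"] + WINDOW)
)

# 7-day conversion rate per variant; user 3's late conversion does not count.
print(merged.groupby("variant")["converted_7d"].mean())
```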
Mini check: is your primary metric good?
- Would this metric alone justify the decision?
- Is it clearly defined and reproducible?
- If it moves the wrong way, do your guardrails catch any harm the primary metric would miss?
Randomization and sample size basics
- Randomize at a level that contains interference. If treatment of one unit can affect another unit’s outcome, randomize at a higher level (household/store/city).
- Balance check: After launch, compare baseline covariates across groups (traffic source mix, device, geo); randomization should make them similar by chance (see the sketch after this list).
- Runtime: Cover at least one full business cycle (usually 1–2 weeks) to average day-of-week patterns.
- Stopping: Use your pre-registered stop rule (time + sample) to avoid peeking and false positives.
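A minimal sketch of a launch balance check using a chi-square test on one covariate (device mix). The counts below are made up for illustration; in practice you would repeat the check for traffic source, geo, and other covariates.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical covariate snapshot at launch: user counts by device and group.
counts = pd.DataFrame(
    {"control": [5100, 3050, 910], "treatment": [5020, 3110, 880]},
    index=["mobile", "desktop", "tablet"],
)

# Chi-square test of independence: a large p-value is consistent with the
# balance randomization should produce; a tiny one means check assignment
# and logging before trusting results.
chi2, p_value, dof, _ = chi2_contingency(counts)
print(f"chi2={chi2:.2f}, p={p_value:.3f}")
```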
Back-of-the-envelope sample size (binary metrics)
Per group: n ≈ 16 × p × (1 − p) / d^2, where p is the baseline rate and d is the absolute MDE. This is rough; use a proper power calculator for production-critical calls.
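For production-critical calls, a standard power calculation is a small step up from the rule of thumb. A sketch using statsmodels (assuming it is available in your environment): proportion_effectsize converts the two rates into Cohen's h, and NormalIndPower solves for the per-group sample size.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Checkout example: baseline 12% vs. target 15% (3-point absolute MDE).
effect = proportion_effectsize(0.15, 0.12)  # Cohen's h for two proportions

n_per_group = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, ratio=1.0, alternative="two-sided"
)
# Roughly 2,000 per group here, somewhat above the ~1878 rule-of-thumb figure.
print(round(n_per_group))
```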
Common mistakes and how to self-check
- Wrong unit of randomization → contamination. Self-check: Can a treated unit affect a control unit? If yes, move up a level.
- Changing anything mid-test (code, targeting, attribution) → bias. Self-check: Freeze plan; if something changes, document and consider re-start.
- Multiple concurrent changes → attribution unclear. Self-check: Limit to one meaningful change per test.
- Peeking and early stopping → inflated false positives. Self-check: Report only after reaching planned sample/time.
- Unstable metric definitions → non-reproducible results. Self-check: Version metrics; log calculation SQL.
- Seasonality and event spikes → misleading lifts. Self-check: Ensure test spans full cycle; avoid launches near major events.
- Unbalanced traffic sources → confounding. Self-check: Compare source/device/geo distribution at start and during test.
Self-check checklist you can copy
- Hypothesis written, directional, and tied to a decision.
- Unit of randomization chosen to avoid interference.
- Primary, secondary, and guardrail metrics clearly defined.
- MDE, power, alpha documented.
- Sample size and duration estimated from baseline.
- Exclusions listed (bots, staff, test devices).
- Stop rule and analysis plan frozen.
- Instrumentation validated before ramp-up.
Exercises
Work through these in a doc or spreadsheet. Use the rule-of-thumb sample size formula where needed.
Exercise 1 — Pick unit, metrics, and sample
Scenario: Your mobile app shows a nudge to enable push notifications on the home screen. You want to test a new nudge design.
- Choose the unit of randomization and justify it.
- Define a primary metric, 2 guardrails, and a 7-day observation window.
- Baseline enable rate is 20%. You care about a 2% absolute lift. Estimate per-group sample size and a rough duration if you get 2,000 unique users/day total.
Hints:
- Interference: Could one user’s treatment affect another? Probably not.
- Decision metric: Notification enable within 7 days is directly tied to the goal.
- Rule-of-thumb: n ≈ 16×p×(1−p)/d^2.
Solution:
- Unit: User-level; ensure sticky assignment.
- Primary: Notification enable within 7 days.
- Guardrails: Uninstall rate within 7 days; app crash rate; optional: session length.
- Sample: p=0.20, d=0.02 → n ≈ 16×0.20×0.80/0.0004 ≈ 16×0.16/0.0004 ≈ 6400 per group.
- Duration: Need ≈12,800 users in total. With ~2,000/day total (≈1,000/day/group), that is ≈7 days; cover a full weekly cycle → plan ~1–2 weeks.
Exercise 2 — Spot threats to validity
Scenario: A restaurant marketplace tests a new listing layout. Traffic is randomized at user-level. During the test: marketing paused a weekend promo, iOS app was updated only for treatment, and a new city launched mid-test.
- List at least 4 threats to validity and propose fixes or mitigations.
Hints:
- Think traffic mix, platform parity, geography, mid-test changes.
- Consider re-start vs. segmenting analysis.
Solution:
- Traffic mix shift: Paused weekend promo changes source mix → Fix: re-run or segment by source/day; ensure consistent marketing next time.
- Platform imbalance: iOS update only in treatment → Fix: align app versions across groups or exclude the affected window.
- Geo expansion: New city adds new-user-heavy traffic → Fix: exclude the city during its ramp-up or start the test after launch; segment by how long each city has been live.
- Mid-test changes: Violates frozen plan → Fix: pause and re-start with stable conditions; document deviations.
- Optional: Vendor bot traffic spike → Add bot filters and re-check balance.
Mini challenge
Write a one-page design for testing a new free-shipping threshold on an e-commerce site. Include: decision rule, hypothesis, unit, metrics (primary/guardrails), MDE, sample/duration, exclusions, and stop criteria.
Sample structure:
- Decision: Ship if AOV per user improves by ≥ $1 with no conversion drop beyond 0.3 percentage points and no increase in refund rate or support tickets.
- Hypothesis: Raising the threshold from $35 to $39 increases AOV per user by ≥ $1 without reducing conversion by more than 0.3 percentage points.
- Unit: User; sticky across sessions.
- Metrics: Primary: AOV per user (7-day window); Secondary: conversion; Guardrails: refund rate, support ticket rate.
- MDEs: +$1 AOV; 0.3-percentage-point non-inferiority margin on conversion.
- Sample: Use the baseline AOV variance and conversion baseline to size each metric (see the sketch after this list); cover 2 full weeks.
- Exclusions: Staff, bots, extreme outliers > P99.9 order value (winsorize).
- Stop: 14 days or target sample reached; analyze as pre-registered.
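For a continuous metric like AOV, the analogous rule of thumb uses the metric’s variance: per-group n ≈ 16 × σ^2 / d^2 for ~80% power and 5% alpha. A minimal sketch, assuming a hypothetical baseline standard deviation; estimate σ from your own order data.

```python
import math

def rough_n_continuous(sigma: float, d: float) -> int:
    """Rule-of-thumb per-group sample size for a continuous metric.

    sigma: baseline standard deviation of the metric (e.g., AOV per user),
    d: absolute MDE. Approximates ~80% power, 5% alpha, two-sided.
    """
    return math.ceil(16 * sigma ** 2 / d ** 2)

# Hypothetical baseline: AOV standard deviation of $30, MDE of $1.
print(rough_n_continuous(sigma=30.0, d=1.0))  # 14,400 per group
```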
Practical projects
- Build a reusable experiment design template (doc or notebook) with auto-filled baselines from your data warehouse.
- Create a “metric dictionary” for your primary, secondary, and guardrail metrics with precise definitions and SQL logic.
- Run a sandbox A/B test on a non-critical UI text change to practice end-to-end: design, instrumentation check, analysis, and write-up.
Learning path
- Next: Sample size calculators, power, and variance reduction (stratification, CUPED).
- Then: Sequential testing and peeking control; false discovery correction.
- Also: Experimentation at scale—holdouts, long-term tests, and novelty/backfire effects.
- Finally: Quasi-experiments when randomization isn’t possible.
Next steps
- Pick one upcoming feature and draft a one-page design using this structure.
- Review with PM and Engineer; align on unit, metrics, and stop rule before implementation.
- Set up a pre-launch instrumentation check (exposure and outcome events).
- Schedule the analysis review date now to prevent peeking.