Why this matters
As a Data Analyst, you will regularly evaluate experiments with many metrics, segments, and repeated looks over time. Each extra comparison raises the chance of a false win. Without multiple testing awareness, teams may launch changes that actually hurt users because a random fluctuation looked significant.
- Real task: Decide if a test is a win when you tracked 8 KPIs and sliced by device and country.
- Real task: Answer stakeholder questions after peeking at results daily.
- Real task: Present results in a weekly meeting without inflating false positives.
Concept explained simply
Every statistical test has a false-positive rate (e.g., 5%). When you run many tests, the chance of at least one false positive grows. This is the multiple testing problem.
Mental model: Fishing in a big lake. One cast (one test) rarely hooks a boot (a false positive). But if you cast many times, the odds of catching a boot rise fast.
- Family-wise error rate (FWER): Probability of at least one false positive in a family of tests. Typical control methods: Bonferroni, Holm-Bonferroni.
- False discovery rate (FDR): Expected proportion of false positives among the discoveries. Typical control: Benjamini–Hochberg (BH).
- Sequential looks (peeking): Checking results repeatedly inflates false positives, much like running many tests. Use fixed-horizon testing, alpha-spending designs, or always-valid (sequential) methods to control error.
Quick math intuition
If you run m independent tests at alpha = 0.05, the chance of at least one false positive is approximately 1 - (1 - 0.05)^m. For m=10, that’s ~40%.
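A minimal Python sketch of this calculation (the alpha level and the test counts are only illustrative):

```python
# Chance of at least one false positive across m independent tests,
# each run at significance level alpha.
alpha = 0.05
for m in (1, 5, 10, 20, 50):
    fwer = 1 - (1 - alpha) ** m
    print(f"m = {m:>2}: FWER ≈ {fwer:.1%}")
# m = 10 prints ≈ 40.1%, matching the intuition above.
```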
Quick rules of thumb
- Pre-define your test family (e.g., "primary KPIs for Experiment X").
- Few, important metrics: control FWER (Bonferroni/Holm).
- Many exploratory metrics: control FDR (BH).
- Plan your stopping rule. Avoid ad-hoc peeking.
- Report both raw and adjusted results.
Worked examples
Example 1 — Many metrics in one test
Situation: One A/B test tracks 10 independent KPIs at alpha = 0.05.
- FWER ≈ 1 - (1 - 0.05)^10 ≈ 40%.
- Bonferroni control: test each at alpha' = 0.05/10 = 0.005.
- Tradeoff: Fewer false wins but lower power; consider prioritizing a small set of primary KPIs.
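A minimal sketch of this Bonferroni decision in Python; the p-values are invented purely to show the mechanics:

```python
# Bonferroni control across 10 KPIs: compare each raw p-value to alpha / m.
p_values = [0.004, 0.011, 0.030, 0.048, 0.090,
            0.150, 0.220, 0.400, 0.610, 0.850]  # illustrative values
alpha = 0.05
m = len(p_values)
per_test_alpha = alpha / m  # 0.005

for i, p in enumerate(p_values, start=1):
    decision = "win" if p <= per_test_alpha else "not significant"
    print(f"KPI {i:>2}: p = {p:.3f} -> {decision} (threshold {per_test_alpha:.3f})")
# Only KPI 1 (p = 0.004) clears the adjusted threshold, even though four KPIs
# were below the unadjusted 0.05.
```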
Example 2 — Segments × metrics
You checked 3 metrics across 4 segments (device × country), so m = 12 tests. Under the global null, expected false positives ≈ m × 0.05 = 0.6.
Approach: Use FDR (BH) at q=0.10 to allow discovery while controlling the proportion of false discoveries.
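A minimal sketch of the BH step in Python; the 12 p-values are invented for illustration:

```python
# Benjamini–Hochberg at q = 0.10 across m = 12 metric-by-segment tests.
q = 0.10
p_values = sorted([0.001, 0.008, 0.012, 0.030, 0.045, 0.060,
                   0.110, 0.200, 0.320, 0.450, 0.600, 0.810])  # illustrative
m = len(p_values)

# Find the largest rank i with p_(i) <= i * q / m, then reject hypotheses 1..i.
cutoff = 0
for i, p in enumerate(p_values, start=1):
    if p <= i * q / m:
        cutoff = i

print(f"Reject the {cutoff} smallest p-values:", p_values[:cutoff])
# With these numbers, the four smallest p-values (up to 0.030) are discoveries.
```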
Example 3 — Daily peeking
You peek at significance daily for 14 days. Even if there is no effect, repeated looks inflate false positives. Roughly treating the 14 looks as 14 independent tests gives FWER ≈ 1 - 0.95^14 ≈ 51%; the true inflation is smaller because the looks overlap, but it is still far above 5%.
Safer choices: fix the analysis day, or use a sequential design with built-in error control; at a minimum, document the plan and avoid ad-hoc early stops.
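For intuition, here is a rough simulation sketch of daily peeking under a no-effect (A/A) scenario; the daily sample size, 14-day horizon, and normal outcome are illustrative assumptions:

```python
import math
import random
import statistics

random.seed(7)
DAILY_N = 100    # new users per arm per day (illustrative)
DAYS = 14        # number of daily looks
SIMS = 1000      # simulated A/A experiments
Z_CRIT = 1.96    # two-sided 5% critical value

flagged = 0
for _ in range(SIMS):
    a, b = [], []
    for _ in range(DAYS):
        a += [random.gauss(0, 1) for _ in range(DAILY_N)]
        b += [random.gauss(0, 1) for _ in range(DAILY_N)]
        se = math.sqrt(statistics.variance(a) / len(a) +
                       statistics.variance(b) / len(b))
        z = (statistics.mean(a) - statistics.mean(b)) / se
        if abs(z) > Z_CRIT:   # "significant" at this daily peek
            flagged += 1
            break

print(f"A/A runs flagged significant at least once: {flagged / SIMS:.1%}")
# Expect well above the nominal 5%, though below the ~51% independent-test
# approximation, because looks at accumulating data are correlated.
```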
How to choose a correction
- Are you making a launch/no-launch decision on a small set of critical KPIs? Use FWER control (Holm-Bonferroni is preferred over plain Bonferroni because it gives the same guarantee with more power; a short sketch follows this list).
- Are you screening many metrics/features to find promising signals? Use FDR (BH) for a good power–error tradeoff.
- Correlated metrics? Bonferroni and Holm remain valid under any dependence, and BH holds under independence or positive dependence; still, interpret results conservatively and reduce redundant metrics when possible.
- Hierarchical goals? Use a hierarchy: test primary first; only test secondary if primary passes.
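Because Holm-Bonferroni is the preferred FWER method for small primary families, here is a minimal sketch of its step-down rule; the metric names and p-values are invented for illustration:

```python
# Holm step-down: sort p-values ascending and compare the i-th smallest
# to alpha / (m - i + 1); stop at the first failure.
alpha = 0.05
p_values = {"conversion": 0.004, "revenue_per_user": 0.013,
            "retention_d7": 0.030, "support_tickets": 0.260}  # illustrative

ordered = sorted(p_values.items(), key=lambda kv: kv[1])
m = len(ordered)
rejected = []
for i, (metric, p) in enumerate(ordered, start=1):
    if p <= alpha / (m - i + 1):
        rejected.append(metric)
    else:
        break  # step-down stops at the first non-rejection

print("Significant after Holm:", rejected)
# Plain Bonferroni (threshold 0.05 / 4 = 0.0125) would keep only "conversion";
# Holm also rejects "revenue_per_user" (0.013 <= 0.05 / 3), showing its extra power.
```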
Step-by-step workflow
- Define the test family up front: primary KPIs, secondary metrics, and any segments.
- Choose what to control: FWER (Bonferroni/Holm) for launch decisions, FDR (BH) for exploration.
- Pre-specify the stopping rule and analysis date before the experiment starts.
- Run the tests, apply the chosen adjustment, and keep both raw and adjusted p-values.
- Report every planned test with the method used, labeling exploratory findings as exploratory.
Exercises
Complete the exercises below. Then check your work using the solutions.
Exercise 1 — FWER and Bonferroni
You test 12 independent metrics at alpha = 0.05.
- Approximate the probability of at least one false positive.
- Compute the Bonferroni per-test alpha.
- Explain the tradeoff in one sentence.
See the Exercise 1 block below for the solution.
Exercise 2 — Apply BH (FDR)
Given p-values: [0.002, 0.013, 0.021, 0.041, 0.055, 0.12, 0.18, 0.33], m = 8, target q = 0.10. Use Benjamini–Hochberg to decide which to reject.
See the Exercise 2 block below for the solution.
Common mistakes and self-check
- Mistake: Calling a win after scanning many segments without adjustment. Self-check: Did I define the test family and adjust?
- Mistake: Peeking every day and stopping on the first significant result. Self-check: Is my stopping rule pre-specified?
- Mistake: Using Bonferroni on dozens of exploratory metrics, losing power. Self-check: Would FDR be more suitable?
- Mistake: Hiding nonsignificant metrics. Self-check: Are all planned tests reported with method used?
Practical projects
Project 1 — Metric family plan
- Pick a past or hypothetical A/B test.
- List primary and secondary metrics; define segments.
- Choose FWER or FDR and justify in one paragraph.
- Create a one-page plan with stopping rule and report template.
Project 2 — Adjustment workbook
- Create a spreadsheet with input p-values in column A.
- Compute Bonferroni-adjusted p-values (p_adj = min(1, p × m)) and Holm step-down decisions.
- Compute BH: sort p, find thresholds i·q/m, mark largest i with p_i ≤ threshold, reject 1..i.
- Compare decisions across methods and document any differences (a Python sketch of these computations follows this list).
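One way to sanity-check the workbook is to prototype the same logic in Python; the sketch below assumes statsmodels is installed and uses placeholder p-values:

```python
# Compare Bonferroni, Holm, and Benjamini–Hochberg decisions on one set of
# p-values (replace with the values from column A of your workbook).
from statsmodels.stats.multitest import multipletests

p_values = [0.003, 0.019, 0.026, 0.049, 0.070, 0.140, 0.200, 0.410]  # placeholders
alpha_fwer = 0.05   # target for Bonferroni / Holm
q_fdr = 0.10        # target for BH

results = {
    "bonferroni": multipletests(p_values, alpha=alpha_fwer, method="bonferroni")[0],
    "holm":       multipletests(p_values, alpha=alpha_fwer, method="holm")[0],
    "fdr_bh":     multipletests(p_values, alpha=q_fdr, method="fdr_bh")[0],
}

for i, p in enumerate(p_values):
    decisions = "  ".join(f"{name}={bool(rej[i])}" for name, rej in results.items())
    print(f"p = {p:.3f}  {decisions}")
# With these placeholders, BH flags the four smallest p-values while
# Bonferroni and Holm flag only the smallest one.
```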
Project 3 — Peek impact demo
- Simulate or conceptually outline a fixed-horizon vs. 14 daily peeks scenario.
- Calculate approximate FWER using 1 - (1 - 0.05)^m for m peeks.
- Write a short note recommending a stopping rule for your team.
Mini challenge
Your experiment tracks 4 primary KPIs and 20 exploratory metrics. You also plan to check 3 user segments.
- Which families do you define?
- Which method for each family?
- What do you put in the release note so stakeholders don’t over-interpret exploratory wins?
Hints
- Primary KPIs likely one family with FWER control; exploratory metrics another with FDR.
- Segments can be within-family multipliers—decide if they belong to the same family or separate exploratory family.
- State raw vs. adjusted results and the method used.
Who this is for
- Data Analysts who evaluate experiments with multiple metrics or segments.
- Product analysts and growth analysts making launch decisions.
Prerequisites
- Understanding of p-values, confidence intervals, and statistical significance.
- Basic A/B testing workflow (control vs. treatment, metrics, sample sizing).
Learning path
- Before: Hypothesis framing and metric selection.
- Now: Multiple testing awareness (this lesson).
- Next: Sequential testing basics and power analysis.
Next steps
- Take the quick test below to check understanding.
- Build the adjustment workbook and apply it to a recent experiment.
- Note: The test is available to everyone; only logged-in users will have their progress saved.