Multiple Testing Awareness

Learn Multiple Testing Awareness for free with explanations, exercises, and a quick test (for Data Analysts).

Published: December 20, 2025 | Updated: December 20, 2025

Why this matters

As a Data Analyst, you will regularly evaluate experiments with many metrics, segments, and repeated looks over time. Each extra comparison raises the chance of a false win. Without multiple testing awareness, teams may launch changes that actually hurt users because a random fluctuation looked significant.

  • Real task: Decide if a test is a win when you tracked 8 KPIs and sliced by device and country.
  • Real task: Answer stakeholder questions after peeking at results daily.
  • Real task: Present results in a weekly meeting without inflating false positives.

Concept explained simply

Every statistical test has a false-positive rate (e.g., 5%). When you run many tests, the chance of at least one false positive grows. This is the multiple testing problem.

Mental model: Fishing in a big lake. One cast (one test) rarely hooks a boot (a false positive). But if you cast many times, the odds of catching a boot rise fast.

  • Family-wise error rate (FWER): Probability of at least one false positive in a family of tests. Typical control methods: Bonferroni, Holm-Bonferroni.
  • False discovery rate (FDR): Expected proportion of false positives among the discoveries. Typical control: Benjamini–Hochberg (BH).
  • Sequential looks (peeking): Checking results repeatedly increases false positives like running many tests. Use fixed-horizon testing or spending/always-valid approaches to control error.

Quick math intuition

If you run m independent tests at alpha = 0.05, the chance of at least one false positive is approximately 1 - (1 - 0.05)^m. For m=10, that’s ~40%.
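
A two-line calculation makes this concrete. The Python sketch below (illustrative only) evaluates the same formula for a few family sizes:

  # Approximate FWER for m independent tests, each at alpha = 0.05
  alpha = 0.05
  for m in (1, 5, 10, 14, 20):
      fwer = 1 - (1 - alpha) ** m
      print(f"m={m:2d}  FWER ~ {fwer:.0%}")   # m=10 -> ~40%, m=20 -> ~64%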

Quick rules of thumb

  • Pre-define your test family (e.g., "primary KPIs for Experiment X").
  • Few, important metrics: control FWER (Bonferroni/Holm).
  • Many exploratory metrics: control FDR (BH).
  • Plan your stopping rule. Avoid ad-hoc peeking.
  • Report both raw and adjusted results.

Worked examples

Example 1 — Many metrics in one test

Situation: One A/B test tracks 10 independent KPIs at alpha = 0.05.

  • FWER ≈ 1 - (1 - 0.05)^10 ≈ 40%.
  • Bonferroni control: test each at alpha' = 0.05/10 = 0.005.
  • Tradeoff: Fewer false wins but lower power; consider prioritizing a small set of primary KPIs.
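
To make Example 1 concrete, here is a minimal Python sketch; the p-values are hypothetical and only illustrate how the Bonferroni threshold changes the call:

  # Bonferroni control for 10 KPIs with family-wise alpha = 0.05
  alpha, m = 0.05, 10
  alpha_per_test = alpha / m                              # 0.005
  p_values = [0.004, 0.012, 0.030, 0.047, 0.08,
              0.15, 0.22, 0.41, 0.63, 0.90]               # hypothetical raw p-values
  print("Unadjusted wins:", [p for p in p_values if p < alpha])                 # four metrics
  print("Wins after Bonferroni:", [p for p in p_values if p < alpha_per_test])  # only 0.004
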
Example 2 — Segments × metrics

You checked 3 metrics across 4 segments (device × country), so m = 12 tests. Under the global null, expected false positives ≈ m × 0.05 = 0.6.

Approach: Use FDR (BH) at q=0.10 to allow discovery while controlling the proportion of false discoveries.
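
A minimal sketch of the BH step-up rule for the 12 tests; the p-values are made up for illustration:

  # Benjamini-Hochberg step-up at q = 0.10 for m = 12 segment-by-metric tests
  q = 0.10
  p_values = sorted([0.001, 0.004, 0.010, 0.020, 0.034, 0.048,
                     0.070, 0.11, 0.20, 0.35, 0.55, 0.80])   # hypothetical
  m = len(p_values)
  # Largest rank i (1-based) with p_(i) <= i * q / m; reject ranks 1..i
  k = max((i for i, p in enumerate(p_values, 1) if p <= i * q / m), default=0)
  print(f"Discoveries: {k} of {m} ->", p_values[:k])   # here the 6 smallest p-values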

Example 3 — Daily peeking

You peek at significance daily for 14 days. Even if there is no effect, repeated looks inflate false positives. Treating the 14 looks as if they were m = 14 independent tests gives FWER ≈ 1 - 0.95^14 ≈ 51%; in practice the looks are correlated, so the true inflation is smaller, but it still far exceeds the nominal 5%.

Safer choices: fix the analysis day, or use a sequential design with error control; minimally, document the plan and avoid ad-hoc early stops.
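
To see the inflation from peeking directly rather than by formula, one can simulate a null comparison (no true effect) and test after each daily batch. This is a rough sketch assuming numpy and scipy are available; the sample sizes and number of simulations are arbitrary:

  # Simulate daily peeking under the null: two identical groups, a t-test after
  # each of 14 daily batches, counting how often ANY look is "significant".
  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(0)
  n_sims, days, n_per_day = 2000, 14, 100
  false_wins = 0
  for _ in range(n_sims):
      a = rng.normal(size=days * n_per_day)
      b = rng.normal(size=days * n_per_day)
      for d in range(1, days + 1):
          n = d * n_per_day
          if stats.ttest_ind(a[:n], b[:n]).pvalue < 0.05:
              false_wins += 1
              break
  print(f"Null experiments declared a win at some peek: {false_wins / n_sims:.0%}")
  # Typically well above 5%, though below the 51% independent-test figure,
  # because looks at accumulating data are correlated.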

How to choose a correction

  • Are you making a launch/no-launch decision on a small set of critical KPIs? Use FWER control (Holm-Bonferroni preferred over plain Bonferroni for power; see the sketch after this list).
  • Are you screening many metrics/features to find promising signals? Use FDR (BH) for a good power–error tradeoff.
  • Correlated metrics? Methods still work; just be conservative in interpretation and reduce redundant metrics when possible.
  • Hierarchical goals? Use a hierarchy: test primary first; only test secondary if primary passes.
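
The first bullet above prefers Holm-Bonferroni for power; the minimal step-down sketch below (hypothetical p-values) shows where the extra power comes from:

  # Holm step-down: compare the i-th smallest p-value to alpha / (m - i + 1)
  alpha = 0.05
  p_values = sorted([0.003, 0.011, 0.020, 0.24])   # hypothetical
  m = len(p_values)
  holm_rejects = []
  for i, p in enumerate(p_values, start=1):
      if p <= alpha / (m - i + 1):
          holm_rejects.append(p)
      else:
          break            # first failure: retain this and all larger p-values
  print("Holm rejects:", holm_rejects)                                           # three of four
  print("Plain Bonferroni rejects:", [p for p in p_values if p <= alpha / m])    # two of four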

Step-by-step workflow

Step 1: Define families upfront (primary vs. exploratory; segments included).
Step 2: Pick the error control (FWER for confirmatory, FDR for exploratory) and alpha/q.
Step 3: Set the stopping rule (fixed horizon or pre-specified sequential plan).
Step 4: Run the test and compute raw p-values for all planned comparisons.
Step 5: Apply the chosen adjustment (Holm or BH) and mark discoveries.
Step 6: Report: raw p, adjusted p or adjusted alpha, method used, and decision (a scripted sketch follows this list).
Step 7: Reflect: if many surprises, consider replication or holdout tests.
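
One possible way to script Steps 4-6 is shown below. It uses the multipletests helper from statsmodels (assumed to be installed); the metric names and p-values are hypothetical:

  # Steps 4-6: collect raw p-values, apply the chosen adjustment, report both
  from statsmodels.stats.multitest import multipletests

  planned = {                       # hypothetical raw p-values for planned comparisons
      "checkout_rate": 0.004,
      "revenue_per_user": 0.030,
      "retention_d7": 0.048,
      "support_tickets": 0.26,
  }
  names, raw_p = list(planned), list(planned.values())

  # Confirmatory family -> FWER control with Holm; use method="fdr_bh" for exploratory families
  reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="holm")

  for name, p, p_adj, win in zip(names, raw_p, adj_p, reject):
      print(f"{name:18s} raw p={p:.3f}  adjusted p={p_adj:.3f}  {'win' if win else 'no call'}")
  # The written report should also name the family, the method (Holm), and the decision rule.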

Exercises

Complete the exercises below. Then check your work using the solutions.

Exercise 1 — FWER and Bonferroni

You test 12 independent metrics at alpha = 0.05.

  1. Approximate the probability of at least one false positive.
  2. Compute the Bonferroni per-test alpha.
  3. Explain the tradeoff in one sentence.

See the Exercise 1 block below for the solution.

Exercise 2 — Apply BH (FDR)

Given p-values: [0.002, 0.013, 0.021, 0.041, 0.055, 0.12, 0.18, 0.33], m = 8, target q = 0.10. Use Benjamini–Hochberg to decide which to reject.

See the Exercise 2 block below for the solution.

Common mistakes and self-check

  • Mistake: Calling a win after scanning many segments without adjustment. Self-check: Did I define the test family and adjust?
  • Mistake: Peeking every day and stopping on the first significant result. Self-check: Is my stopping rule pre-specified?
  • Mistake: Using Bonferroni on dozens of exploratory metrics, losing power. Self-check: Would FDR be more suitable?
  • Mistake: Hiding nonsignificant metrics. Self-check: Are all planned tests reported with method used?

Practical projects

Project 1 — Metric family plan
  1. Pick a past or hypothetical A/B test.
  2. List primary and secondary metrics; define segments.
  3. Choose FWER or FDR and justify in one paragraph.
  4. Create a one-page plan with stopping rule and report template.

Project 2 — Adjustment workbook
  1. Create a spreadsheet with input p-values in column A.
  2. Compute Bonferroni-adjusted p (p_adj = min(1, p × m)) and Holm step-down decisions.
  3. Compute BH: sort p, find thresholds i·q/m, mark largest i with p_i ≤ threshold, reject 1..i.
  4. Compare decisions across methods and document differences.

Project 3 — Peek impact demo
  1. Simulate or conceptually outline a fixed-horizon vs. 14 daily peeks scenario.
  2. Calculate approximate FWER using 1 - (1 - 0.05)^m for m peeks.
  3. Write a short note recommending a stopping rule for your team.

Mini challenge

Your experiment tracks 4 primary KPIs and 20 exploratory metrics. You also plan to check 3 user segments.

  • Which families do you define?
  • Which method for each family?
  • What do you put in the release note so stakeholders don’t over-interpret exploratory wins?

Hints
  • Primary KPIs likely one family with FWER control; exploratory metrics another with FDR.
  • Segments can be within-family multipliers—decide if they belong to the same family or separate exploratory family.
  • State raw vs. adjusted results and the method used.

Who this is for

  • Data Analysts who evaluate experiments with multiple metrics or segments.
  • Product analysts and growth analysts making launch decisions.

Prerequisites

  • Understanding of p-values, confidence intervals, and statistical significance.
  • Basic A/B testing workflow (control vs. treatment, metrics, sample sizing).

Learning path

  • Before: Hypothesis framing and metric selection.
  • Now: Multiple testing awareness (this lesson).
  • Next: Sequential testing basics and power analysis.

Next steps

  • Take the quick test below to check understanding.
  • Build the adjustment workbook and apply it to a recent experiment.
  • Note: The test is available to everyone; only logged-in users will have their progress saved.

Practice Exercises

2 exercises to complete

Instructions (Exercise 1)

You test 12 independent metrics at alpha = 0.05.

  1. Approximate the probability of at least one false positive (FWER).
  2. Compute the Bonferroni per-test alpha.
  3. Write one-sentence guidance you would tell stakeholders about the power tradeoff.
Expected Output
FWER ≈ 46% (1 - 0.95^12). Bonferroni per-test alpha = 0.05/12 ≈ 0.0042. A clear note that adjustment reduces false wins but lowers per-metric power.

Multiple Testing Awareness — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
