
Multiple Testing And False Discovery Awareness

Learn Multiple Testing And False Discovery Awareness for free with explanations, exercises, and a quick test (for Data Scientists).

Published: January 1, 2026 | Updated: January 1, 2026

Why this matters

As a Data Scientist, you often run many statistical tests at once: multiple A/B variants, many features in a model, dozens of metrics, or hundreds of biomarkers. Without correcting for multiple comparisons, you will "discover" patterns that are just noise. This wastes time, misguides product decisions, and can harm users.

  • Experimentation: Compare many variants or segments without inflating false positives.
  • Feature selection: Pick signals while controlling how many false picks you accept.
  • Monitoring: Scan many metrics for alerts with disciplined error control.

Concept explained simply

Each hypothesis test run at alpha = 0.05 has a 5% chance of a false positive when the null hypothesis is true. Run many tests and those chances compound: across 20 independent true-null tests, the probability of at least one false positive is 1 − 0.95^20 ≈ 64%.
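
To see the compounding concretely, here is a minimal Python sketch (plain arithmetic, no libraries) that computes the family-wise false-positive probability for a few values of m, assuming independent tests:

    # Probability of at least one false positive among m independent
    # true-null tests, each run at significance level alpha.
    alpha = 0.05

    for m in (1, 10, 20, 100):
        fwer = 1 - (1 - alpha) ** m
        print(f"m = {m:3d}: P(at least one false positive) = {fwer:.3f}")

    # m =   1: 0.050
    # m =  10: 0.401
    # m =  20: 0.642
    # m = 100: 0.994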

Two key targets to control:

  • FWER (Family-Wise Error Rate): Probability of making at least one false discovery.
  • FDR (False Discovery Rate): Expected proportion of false discoveries among all discoveries.

Mental model

Imagine fishing in a lake with 100 lines. With one line (one test), a rare fish (false positive) is unlikely. With 100 lines, catching at least one rare fish becomes likely. FWER control says "I want to almost never catch any rare fish at all" (very strict). FDR says "If I show a bucket of fish, I want only a small fraction to be rare fish" (more discoveries allowed, but keep the false fraction low).

Key tools you will use

  • Bonferroni: Divide alpha by the number of tests m. Very conservative. Controls FWER.
  • Holm (step-down): Sequentially compares sorted p-values to alpha/(m - i + 1). More powerful than Bonferroni. Controls FWER.
  • Benjamini–Hochberg (BH): Sort p-values and find largest k with p(k) ≤ (k/m)·q. Controls FDR under independence or positive dependence.
  • Benjamini–Yekutieli (BY): Like BH but more conservative; works under arbitrary dependence.

Bonferroni – how to apply
  1. Choose per-family alpha (e.g., 0.05) and count tests m.
  2. Use threshold alpha/m. A test is significant if p ≤ alpha/m.
  3. Report which tests pass and note strict FWER control.
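
A minimal sketch of the Bonferroni rule in Python (function and variable names are illustrative, not from any particular library):

    def bonferroni(p_values, alpha=0.05):
        """Indices of tests significant under the Bonferroni FWER rule."""
        m = len(p_values)
        threshold = alpha / m
        return [i for i, p in enumerate(p_values) if p <= threshold]

    # Seven tests at alpha = 0.05: threshold = 0.05/7 ≈ 0.00714.
    print(bonferroni([0.004, 0.011, 0.018, 0.021, 0.033, 0.041, 0.060]))
    # -> [0]  (only the smallest p-value passes)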

Holm (step-down) – how to apply
  1. Sort p-values ascending: p(1) ≤ p(2) ≤ ... ≤ p(m).
  2. For i = 1..m, compare p(i) to alpha/(m - i + 1).
  3. Find first i where p(i) > alpha/(m - i + 1); stop. Reject all hypotheses for j < i.
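
A sketch of Holm's step-down rule in the same style (illustrative names; returns indices into the original list):

    def holm(p_values, alpha=0.05):
        """Indices of hypotheses rejected by the Holm step-down procedure."""
        m = len(p_values)
        order = sorted(range(m), key=lambda i: p_values[i])  # ascending p
        rejected = []
        for step, idx in enumerate(order):            # step = i - 1
            if p_values[idx] > alpha / (m - step):    # alpha / (m - i + 1)
                break                                 # first failure: stop
            rejected.append(idx)
        return rejected

    print(holm([0.004, 0.011, 0.018, 0.021, 0.033, 0.041, 0.060]))
    # -> [0]  (0.011 > 0.05/6 ≈ 0.00833, so the procedure stops at i = 2)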

Benjamini–Hochberg (BH) – how to apply
  1. Choose FDR level q (e.g., 0.05) and sort p-values ascending.
  2. Compute thresholds T(i) = (i/m)·q for i = 1..m.
  3. Find largest k with p(k) ≤ T(k). Reject the k smallest hypotheses.
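
And a sketch of BH (again with illustrative names; note that, unlike Holm, the scan must continue past failures to find the largest passing k):

    def benjamini_hochberg(p_values, q=0.05):
        """Indices of hypotheses rejected by the BH FDR procedure."""
        m = len(p_values)
        order = sorted(range(m), key=lambda i: p_values[i])  # ascending p
        k = 0
        for rank, idx in enumerate(order, start=1):
            if p_values[idx] <= rank / m * q:  # p(k) ≤ (k/m)·q
                k = rank                       # keep the largest passing k
        return order[:k]  # reject the k smallest p-values

    # The smallest p-value fails its own threshold (0.013 > 0.0125), but
    # k = 3 still passes (0.015 <= 0.0375), so the first three are rejected.
    print(benjamini_hochberg([0.013, 0.014, 0.015, 0.2]))  # -> [0, 1, 2]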

Worked examples

Example 1: 20 A/B variants

m = 20 tests, alpha = 0.05, target either FWER or FDR. Suppose the five smallest p-values are: 0.0009, 0.0018, 0.0030, 0.0060, 0.0120 (others are larger).

  • Bonferroni (FWER): threshold = 0.05/20 = 0.0025 → significant: 0.0009, 0.0018 (2 variants).
  • BH (FDR q = 0.05): thresholds: 0.0025, 0.005, 0.0075, 0.010, 0.0125 → the first five p-values are each ≤ their thresholds → 5 variants significant.
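
If statsmodels is available, its multipletests function can confirm these counts. The 15 p-values beyond the five listed are hypothetical placeholders here, since the text only says they are larger:

    import numpy as np
    from statsmodels.stats.multitest import multipletests

    # Five listed p-values plus 15 hypothetical large ones (m = 20).
    p = np.array([0.0009, 0.0018, 0.0030, 0.0060, 0.0120] + [0.5] * 15)

    reject_bonf, *_ = multipletests(p, alpha=0.05, method="bonferroni")
    reject_bh, *_ = multipletests(p, alpha=0.05, method="fdr_bh")

    print(reject_bonf.sum())  # 2 variants (Bonferroni, FWER 0.05)
    print(reject_bh.sum())    # 5 variants (BH, FDR 0.05)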

Interpretation: If you want very few false alarms overall, use Bonferroni/Holm. If you want more power with an acceptable fraction of false positives, use BH.

Example 2: 100 genes

m = 100, choose FDR q = 0.10. Smallest p-values: 0.0002, 0.0009, 0.0010, 0.0030, 0.0060, 0.0090, 0.0120, 0.0200, 0.0300, 0.0490.

BH thresholds for k = 1..10: 0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.007, 0.008, 0.009, 0.010.

  • Compare: 0.0002 ≤ 0.001 (pass), 0.0009 ≤ 0.002 (pass), 0.0010 ≤ 0.003 (pass), 0.0030 ≤ 0.004 (pass), 0.0060 > 0.005 (fail). The remaining p-values also exceed their thresholds, so the largest k that passes is 4 → 4 discoveries at FDR 10%.
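
The same statsmodels check works here too (the 90 unlisted p-values are again hypothetical placeholders above all thresholds):

    import numpy as np
    from statsmodels.stats.multitest import multipletests

    p = np.array([0.0002, 0.0009, 0.0010, 0.0030, 0.0060,
                  0.0090, 0.0120, 0.0200, 0.0300, 0.0490] + [0.5] * 90)

    reject, *_ = multipletests(p, alpha=0.10, method="fdr_bh")
    print(reject.sum())  # 4 discoveries at FDR q = 0.10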

Example 3: Day-of-week effect (7 tests) with Holm

p-values (sorted): 0.004, 0.011, 0.018, 0.021, 0.033, 0.041, 0.060; alpha = 0.05, m = 7.

  • i=1: threshold 0.05/7 ≈ 0.00714 → 0.004 ≤ 0.00714 (reject H1)
  • i=2: threshold 0.05/6 ≈ 0.00833 → 0.011 > 0.00833 (stop)

Result: Only the smallest p-value is significant under Holm with FWER control.
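
A short script reproducing the walk-through step by step (plain arithmetic):

    # Holm step-down on the seven day-of-week p-values.
    p_sorted = [0.004, 0.011, 0.018, 0.021, 0.033, 0.041, 0.060]
    alpha, m = 0.05, len(p_sorted)

    for i, p in enumerate(p_sorted, start=1):
        threshold = alpha / (m - i + 1)
        if p > threshold:
            print(f"i={i}: p={p} > {threshold:.5f} -> stop")
            break
        print(f"i={i}: p={p} <= {threshold:.5f} -> reject")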

Choosing a method quickly

  • Small m, very high stakes (e.g., medical risk): prefer FWER (Holm or Bonferroni).
  • Large m, discovery-driven (e.g., features/genes): prefer FDR (BH; BY if dependence is complex and strong).
  • Correlated tests: BH still works under positive dependence; if unsure and risk-averse, consider BY or hierarchical modeling.

Practical workflow checklist

  • Define the family of tests (what counts together?).
  • Choose error target: FWER (alpha) or FDR (q).
  • Select method: Bonferroni/Holm for FWER; BH/BY for FDR.
  • Sort p-values and apply the chosen procedure.
  • Report: method, thresholds, number of discoveries, and limitations.
  • Sensitivity check: try another q/alpha and compare stability.

Your turn: exercises

Do the two exercises below. They mirror the auto-graded quick test style. Tip: write down sorted p-values and thresholds before deciding.

Exercise 1 (mirrors ex1)

Compute BH rejections for 12 tests at q = 0.05 with p-values: 0.001, 0.004, 0.009, 0.013, 0.017, 0.021, 0.028, 0.031, 0.041, 0.052, 0.074, 0.12.

Exercise 2 (mirrors ex2)

You ran 10 independent A/B tests with p-values: 0.0007, 0.002, 0.006, 0.012, 0.019, 0.026, 0.033, 0.04, 0.071, 0.2. Compare Bonferroni at alpha=0.05 vs BH at q=0.05. How many discoveries under each?

Common mistakes and self-check

  • Mistake: Running many tests, then reporting only the smallest p-value without correction. Self-check: Did you define the test family and apply a correction?
  • Mistake: Mixing families (e.g., combining metrics from different experiments). Self-check: Is the family definition consistent with the decision you’re making?
  • Mistake: Using Bonferroni with very large m when discoveries are important. Self-check: Would FDR control (BH) be more appropriate?
  • Mistake: Ignoring dependence. Self-check: If correlations are strong and complex, consider BY or modeling.
  • Mistake: Changing q/alpha after seeing results (p-hacking). Self-check: Predefine thresholds; document any deviations and rerun sensitivity checks.

Practical projects

  • Experiment dashboard: Build a small report that takes a list of p-values and outputs Bonferroni, Holm, and BH decisions, including thresholds and counts.
  • Feature screening: Simulate 100 features with known true/false effects. Compare precision/recall when using no correction, Bonferroni, Holm, BH at q=0.05.
  • Alert triage: Given daily p-values for 50 metrics, apply BH to each day and track FDR-controlled alerts. Summarize stability across days.

Who this is for

  • Data Scientists and Analysts running multiple experiments or scanning many metrics.
  • ML practitioners doing feature selection or model diagnostics at scale.

Prerequisites

  • Understanding of p-values, null/alternative hypotheses, and Type I/II errors.
  • Basic probability and interpretation of alpha levels.

Learning path

  • Before this: Hypothesis testing basics → p-values and confidence intervals.
  • This lesson: Why multiple testing matters; FWER vs FDR; Bonferroni, Holm, BH; choosing a method.
  • After this: Power analysis under multiple testing; sequential testing; hierarchical modeling; Bayesian FDR analogs.

Next steps

  • Apply BH at q=0.05 on your next multi-metric analysis; report discoveries and expected false fraction.
  • Repeat with Holm and compare conclusions; document trade-offs.
  • Automate a small helper to compute sorted p-values and thresholds.

Mini challenge

You have 30 segment-level lift tests with 6 small p-values: 0.0004, 0.001, 0.004, 0.007, 0.011, 0.019 (others > 0.05). You need to propose segments for a pilot rollout, accepting some false leads but keeping them limited. Which method and threshold do you choose, and which segments would you flag?

Show a possible approach

Use BH at q = 0.05 (discovery-focused with a controlled false fraction). m = 30. Thresholds: (1..6)/30*0.05 = 0.0017, 0.0033, 0.005, 0.0067, 0.0083, 0.01. Compare: 0.0004 ≤ 0.0017 (pass), 0.001 ≤ 0.0033 (pass), 0.004 ≤ 0.005 (pass), 0.007 > 0.0067 (fail); the remaining two also exceed their thresholds. Largest k = 3 → flag the three segments with the smallest p-values for the pilot.
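
A quick numeric check of this approach (only the six listed p-values matter: the other 24 exceed 0.05, which is already the largest BH threshold at m = 30):

    # BH thresholds T(i) = (i/m)·q for the six smallest p-values, m = 30.
    q, m = 0.05, 30
    p_small = [0.0004, 0.001, 0.004, 0.007, 0.011, 0.019]

    k = 0
    for i, p in enumerate(p_small, start=1):
        t = i / m * q
        print(f"k={i}: p={p} vs T={t:.4f} -> {'pass' if p <= t else 'fail'}")
        if p <= t:
            k = i
    print(f"largest passing k = {k}")  # 3 -> flag the 3 smallest p-values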

Ready for the Quick Test?

The quick test below is available to everyone for free. If you log in, your progress and score will be saved.

Practice Exercises

2 exercises to complete

Instructions

Compute BH rejections for 12 tests at q = 0.05. p-values: 0.001, 0.004, 0.009, 0.013, 0.017, 0.021, 0.028, 0.031, 0.041, 0.052, 0.074, 0.12.

  • Sort the p-values.
  • Compute thresholds T(i) = (i/12)·0.05.
  • Find the largest k with p(k) ≤ T(k) and report how many rejections.

Expected Output

Number of BH discoveries at q=0.05 and which p-values are rejected.

Multiple Testing And False Discovery Awareness — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
