Why this matters
As a Data Analyst, you will regularly evaluate experiments with many metrics, segments, and repeated looks over time. Each extra comparison raises the chance of a false win. Without multiple testing awareness, teams may launch changes that actually hurt users because a random fluctuation looked significant.
- Real task: Decide if a test is a win when you tracked 8 KPIs and sliced by device and country.
- Real task: Answer stakeholder questions after peeking at results daily.
- Real task: Present results in a weekly meeting without inflating false positives.
Concept explained simply
Every statistical test has a false-positive rate (e.g., 5%). When you run many tests, the chance of at least one false positive grows. This is the multiple testing problem.
Mental model: Fishing in a big lake. One cast (one test) rarely hooks a boot (a false positive). But if you cast many times, the odds of catching a boot rise fast.
- Family-wise error rate (FWER): Probability of at least one false positive in a family of tests. Typical control methods: Bonferroni, Holm-Bonferroni.
- False discovery rate (FDR): Expected proportion of false positives among the discoveries. Typical control: Benjamini–Hochberg (BH).
- Sequential looks (peeking): Checking results repeatedly inflates false positives, much like running many tests. Use fixed-horizon testing, alpha-spending designs, or always-valid (sequential) methods to control error.
Quick math intuition
If you run m independent tests at alpha = 0.05, the chance of at least one false positive is approximately 1 - (1 - 0.05)^m. For m=10, that’s ~40%.
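A minimal Python sketch of this calculation (the alpha level and the test counts are only illustrative):

```python
# Chance of at least one false positive across m independent tests,
# each run at significance level alpha.
alpha = 0.05
for m in (1, 5, 10, 20, 50):
    fwer = 1 - (1 - alpha) ** m
    print(f"m = {m:>2}: FWER ≈ {fwer:.1%}")
# m = 10 prints ≈ 40.1%, matching the intuition above.
```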
Quick rules of thumb
- Pre-define your test family (e.g., "primary KPIs for Experiment X").
- Few, important metrics: control FWER (Bonferroni/Holm).
- Many exploratory metrics: control FDR (BH).
- Plan your stopping rule. Avoid ad-hoc peeking.
- Report both raw and adjusted results.
Worked examples
Example 1 — Many metrics in one test
Situation: One A/B test tracks 10 independent KPIs at alpha = 0.05.
- FWER ≈ 1 - (1 - 0.05)^10 ≈ 40%.
- Bonferroni control: test each at alpha' = 0.05/10 = 0.005.
- Tradeoff: Fewer false wins but lower power; consider prioritizing a small set of primary KPIs.
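A minimal sketch of this Bonferroni decision in Python; the p-values are invented purely to show the mechanics:

```python
# Bonferroni control across 10 KPIs: compare each raw p-value to alpha / m.
p_values = [0.004, 0.011, 0.030, 0.048, 0.090,
            0.150, 0.220, 0.400, 0.610, 0.850]  # illustrative values
alpha = 0.05
m = len(p_values)
per_test_alpha = alpha / m  # 0.005

for i, p in enumerate(p_values, start=1):
    decision = "win" if p <= per_test_alpha else "not significant"
    print(f"KPI {i:>2}: p = {p:.3f} -> {decision} (threshold {per_test_alpha:.3f})")
# Only KPI 1 (p = 0.004) clears the adjusted threshold, even though four KPIs
# were below the unadjusted 0.05.
```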
Example 2 — Segments × metrics
You checked 3 metrics across 4 segments (device × country), so m = 12 tests. Under the global null, expected false positives ≈ m × 0.05 = 0.6.
Approach: Use FDR (BH) at q=0.10 to allow discovery while controlling the proportion of false discoveries.
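A minimal sketch of the BH step in Python; the 12 p-values are invented for illustration:

```python
# Benjamini–Hochberg at q = 0.10 across m = 12 metric-by-segment tests.
q = 0.10
p_values = sorted([0.001, 0.008, 0.012, 0.030, 0.045, 0.060,
                   0.110, 0.200, 0.320, 0.450, 0.600, 0.810])  # illustrative
m = len(p_values)

# Find the largest rank i with p_(i) <= i * q / m, then reject hypotheses 1..i.
cutoff = 0
for i, p in enumerate(p_values, start=1):
    if p <= i * q / m:
        cutoff = i

print(f"Reject the {cutoff} smallest p-values:", p_values[:cutoff])
# With these numbers, the four smallest p-values (up to 0.030) are discoveries.
```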
Example 3 — Daily peeking
You peek at significance daily for 14 days. Even if there is no effect, repeated looks inflate false positives. Roughly treating the 14 looks as 14 independent tests gives FWER ≈ 1 - 0.95^14 ≈ 51%; the true inflation is smaller because the looks overlap, but it is still far above 5%.
Safer choices: fix the analysis day, or use a sequential design with built-in error control; at a minimum, document the plan and avoid ad-hoc early stops.
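For intuition, here is a rough simulation sketch of daily peeking under a no-effect (A/A) scenario; the daily sample size, 14-day horizon, and normal outcome are illustrative assumptions:

```python
import math
import random
import statistics

random.seed(7)
DAILY_N = 100    # new users per arm per day (illustrative)
DAYS = 14        # number of daily looks
SIMS = 1000      # simulated A/A experiments
Z_CRIT = 1.96    # two-sided 5% critical value

flagged = 0
for _ in range(SIMS):
    a, b = [], []
    for _ in range(DAYS):
        a += [random.gauss(0, 1) for _ in range(DAILY_N)]
        b += [random.gauss(0, 1) for _ in range(DAILY_N)]
        se = math.sqrt(statistics.variance(a) / len(a) +
                       statistics.variance(b) / len(b))
        z = (statistics.mean(a) - statistics.mean(b)) / se
        if abs(z) > Z_CRIT:   # "significant" at this daily peek
            flagged += 1
            break

print(f"A/A runs flagged significant at least once: {flagged / SIMS:.1%}")
# Expect well above the nominal 5%, though below the ~51% independent-test
# approximation, because looks at accumulating data are correlated.
```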
How to choose a correction
- Are you making a launch/no-launch decision on a small set of critical KPIs? Use FWER control (Holm-Bonferroni is preferred over plain Bonferroni because it gives the same guarantee with more power; a short sketch follows this list).
- Are you screening many metrics/features to find promising signals? Use FDR (BH) for a good power–error tradeoff.
- Correlated metrics? Bonferroni and Holm remain valid under any dependence, and BH holds under independence or positive dependence; still, interpret results conservatively and reduce redundant metrics when possible.
- Hierarchical goals? Use a hierarchy: test primary first; only test secondary if primary passes.
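Because Holm-Bonferroni is the preferred FWER method for small primary families, here is a minimal sketch of its step-down rule; the metric names and p-values are invented for illustration:

```python
# Holm step-down: sort p-values ascending and compare the i-th smallest
# to alpha / (m - i + 1); stop at the first failure.
alpha = 0.05
p_values = {"conversion": 0.004, "revenue_per_user": 0.013,
            "retention_d7": 0.030, "support_tickets": 0.260}  # illustrative

ordered = sorted(p_values.items(), key=lambda kv: kv[1])
m = len(ordered)
rejected = []
for i, (metric, p) in enumerate(ordered, start=1):
    if p <= alpha / (m - i + 1):
        rejected.append(metric)
    else:
        break  # step-down stops at the first non-rejection

print("Significant after Holm:", rejected)
# Plain Bonferroni (threshold 0.05 / 4 = 0.0125) would keep only "conversion";
# Holm also rejects "revenue_per_user" (0.013 <= 0.05 / 3), showing its extra power.
```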
Step-by-step workflow
- Define the test family up front: primary KPIs, secondary metrics, and any segments.
- Choose what to control: FWER (Bonferroni/Holm) for launch decisions, FDR (BH) for exploration.
- Pre-specify the stopping rule and analysis date before the experiment starts.
- Run the tests, apply the chosen adjustment, and keep both raw and adjusted p-values.
- Report every planned test with the method used, labeling exploratory findings as exploratory.
Exercises
Complete the exercises below. Then check your work using the solutions.
Exercise 1 — FWER and Bonferroni
You test 12 independent metrics at alpha = 0.05.
- Approximate the probability of at least one false positive.
- Compute the Bonferroni per-test alpha.
- Explain the tradeoff in one sentence.
See the Exercise 1 block below for the solution.
Exercise 2 — Apply BH (FDR)
Given p-values: [0.002, 0.013, 0.021, 0.041, 0.055, 0.12, 0.18, 0.33], m = 8, target q = 0.10. Use Benjamini–Hochberg to decide which to reject.
See the Exercise 2 block below for the solution.
Common mistakes and self-check
- Mistake: Calling a win after scanning many segments without adjustment. Self-check: Did I define the test family and adjust?
- Mistake: Peeking every day and stopping on the first significant result. Self-check: Is my stopping rule pre-specified?
- Mistake: Using Bonferroni on dozens of exploratory metrics, losing power. Self-check: Would FDR be more suitable?
- Mistake: Hiding nonsignificant metrics. Self-check: Are all planned tests reported with method used?
Practical projects
Project 1 — Metric family plan
- Pick a past or hypothetical A/B test.
- List primary and secondary metrics; define segments.
- Choose FWER or FDR and justify in one paragraph.
- Create a one-page plan with stopping rule and report template.
Project 2 — Adjustment workbook
- Create a spreadsheet with input p-values in column A.
- Compute Bonferroni-adjusted p-values (p_adj = min(1, p × m)) and Holm step-down decisions.
- Compute BH: sort p, find thresholds i·q/m, mark largest i with p_i ≤ threshold, reject 1..i.
- Compare decisions across methods and document any differences (a Python sketch of these computations follows this list).
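One way to sanity-check the workbook is to prototype the same logic in Python; the sketch below assumes statsmodels is installed and uses placeholder p-values:

```python
# Compare Bonferroni, Holm, and Benjamini–Hochberg decisions on one set of
# p-values (replace with the values from column A of your workbook).
from statsmodels.stats.multitest import multipletests

p_values = [0.003, 0.019, 0.026, 0.049, 0.070, 0.140, 0.200, 0.410]  # placeholders
alpha_fwer = 0.05   # target for Bonferroni / Holm
q_fdr = 0.10        # target for BH

results = {
    "bonferroni": multipletests(p_values, alpha=alpha_fwer, method="bonferroni")[0],
    "holm":       multipletests(p_values, alpha=alpha_fwer, method="holm")[0],
    "fdr_bh":     multipletests(p_values, alpha=q_fdr, method="fdr_bh")[0],
}

for i, p in enumerate(p_values):
    decisions = "  ".join(f"{name}={bool(rej[i])}" for name, rej in results.items())
    print(f"p = {p:.3f}  {decisions}")
# With these placeholders, BH flags the four smallest p-values while
# Bonferroni and Holm flag only the smallest one.
```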
Project 3 — Peek impact demo
- Simulate or conceptually outline a fixed-horizon vs. 14 daily peeks scenario.
- Calculate approximate FWER using 1 - (1 - 0.05)^m for m peeks.
- Write a short note recommending a stopping rule for your team.
Mini challenge
Your experiment tracks 4 primary KPIs and 20 exploratory metrics. You also plan to check 3 user segments.
- Which families do you define?
- Which method for each family?
- What do you put in the release note so stakeholders don’t over-interpret exploratory wins?
Hints
- Primary KPIs likely one family with FWER control; exploratory metrics another with FDR.
- Segments can be within-family multipliers—decide if they belong to the same family or separate exploratory family.
- State raw vs. adjusted results and the method used.
Who this is for
- Data Analysts who evaluate experiments with multiple metrics or segments.
- Product analysts and growth analysts making launch decisions.
Prerequisites
- Understanding of p-values, confidence intervals, and statistical significance.
- Basic A/B testing workflow (control vs. treatment, metrics, sample sizing).
Learning path
- Before: Hypothesis framing and metric selection.
- Now: Multiple testing awareness (this lesson).
- Next: Sequential testing basics and power analysis.
Next steps
- Take the quick test below to check understanding.
- Build the adjustment workbook and apply it to a recent experiment.
- Note: The test is available to everyone; only logged-in users will have their progress saved.