
Hypothesis Testing

Learn Hypothesis Testing for free with explanations, worked examples, exercises, and a quick test (for Data Scientists).

Published: January 1, 2026 | Updated: January 1, 2026

Who this is for

  • Data Scientists and Analysts running A/B tests, experiments, or validating insights.
  • ML Engineers comparing models or checking data shifts.
  • Anyone who needs evidence-based decisions, not gut feeling.

Prerequisites

  • Basic probability and distributions (normal, binomial).
  • Mean, variance, standard error, sampling basics.
  • Comfort with a calculator or spreadsheet. Python/R helpful but not required.

Learning path

  1. Learn the roles of H0 (null) and H1 (alternative).
  2. Understand alpha, p-value, and Type I/II errors.
  3. Pick the right test (t-test, z-test for proportions, chi-square, etc.).
  4. Check assumptions (independence, normality, equal variances when needed).
  5. Compute the test statistic and p-value; interpret clearly.
  6. Report effect size and confidence interval.
  7. Plan power and sample size for future tests.

Why this matters

  • A/B testing: Decide if Variant B truly increases conversion.
  • Product analytics: Validate if a metric changed after a rollout.
  • ML monitoring: Detect data drift or performance regressions.
  • Operations: Compare defect rates or response times across processes.

Concept explained simply

Hypothesis testing asks: If nothing changed (H0), how surprising is our data? If it is very surprising (p-value below alpha), we reject H0 and favor H1.

Mental model: The courtroom

H0 is "innocent by default." Data is the evidence. Alpha (e.g., 0.05) is the threshold for how much surprise we need to convict (reject H0). A small p-value means the evidence is strong against H0.

Core workflow (step-by-step)

  1. Translate the question into H0 and H1. Choose one- or two-tailed.
  2. Select a test that matches data type and design (paired vs independent, means vs proportions).
  3. Check assumptions (randomization, independence; normality or large-sample; equal variances if needed).
  4. Compute the test statistic and p-value.
  5. Decide using alpha (commonly 0.05). Do not over-interpret tiny effects.
  6. Report: effect size, CI, p-value, test used, assumptions, and practical impact.

Worked examples

Example 1: One-sample t-test (mean)

Scenario: Average session duration target is 10. Sample of n=30 has mean=9.5, sd=2. H0: mean=10 (two-tailed), alpha=0.05.

  • SE = 2 / sqrt(30) ≈ 0.365
  • t = (9.5 - 10) / 0.365 ≈ -1.37, df=29
  • p ≈ 0.18 > 0.05 → Fail to reject H0. Insufficient evidence that the mean differs from 10.
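
These numbers can be reproduced in a few lines of Python (a sketch using only the standard library; the exact p-value needs the t distribution with df=29, e.g. from scipy.stats, so a normal approximation is shown, which comes out slightly below the t-based p ≈ 0.18):

```python
from math import sqrt
from statistics import NormalDist

# One-sample t-test: H0: mean = 10, two-tailed, alpha = 0.05
n, xbar, s, mu0 = 30, 9.5, 2.0, 10.0

se = s / sqrt(n)             # standard error ≈ 0.365
t = (xbar - mu0) / se        # test statistic ≈ -1.37, df = n - 1 = 29

# Normal approximation to the two-tailed p-value (≈ 0.17 here;
# the exact t-distribution value with df=29 is ≈ 0.18)
p_approx = 2 * NormalDist().cdf(-abs(t))
```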

Example 2: Two-proportion z-test (A/B conversion)

A: n=5000, conv=250 (5%). B: n=4900, conv=294 (6%). H0: pA = pB (two-tailed), alpha=0.05.

  • Pooled p = (250+294)/(5000+4900) ≈ 0.055
  • SE ≈ sqrt(0.055*0.945*(1/5000+1/4900)) ≈ 0.00458
  • z = (0.06 - 0.05)/0.00458 ≈ 2.18 → two-tailed p ≈ 0.029
  • Decision: Reject H0. B's conversion rate is statistically significantly higher than A's.
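
The same calculation in Python (standard library only; NormalDist supplies the normal tail area, so no table lookup is needed):

```python
from math import sqrt
from statistics import NormalDist

# Two-proportion z-test for A/B conversion
xa, na = 250, 5000   # A: 5.0% conversion
xb, nb = 294, 4900   # B: 6.0% conversion

p_pool = (xa + xb) / (na + nb)                     # pooled p ≈ 0.055
se = sqrt(p_pool * (1 - p_pool) * (1/na + 1/nb))   # ≈ 0.00458
z = (xb/nb - xa/na) / se                           # ≈ 2.18
p_two_tailed = 2 * NormalDist().cdf(-abs(z))       # ≈ 0.029
```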

Example 3: Chi-square test of independence (categories)

Device vs churn (2x2): Mobile Yes=60, No=140; Desktop Yes=40, No=160 (total=400). H0: device and churn are independent.

  • Expected counts (Yes=100 total): Mobile Yes=50, Desktop Yes=50; Mobile No=150, Desktop No=150
  • Chi-square ≈ 2 + 0.667 + 2 + 0.667 = 5.33; df=1 → p ≈ 0.021
  • Decision: Reject H0. Churn is associated with device.
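
A quick way to verify the chi-square arithmetic in Python; for df=1 the p-value can come from the normal distribution, because a chi-square variable with one degree of freedom is a squared standard normal (a sketch; scipy.stats.chi2_contingency does all of this in one call):

```python
from math import sqrt
from statistics import NormalDist

# 2x2 table: rows = device (mobile, desktop), cols = churn (yes, no)
obs = [[60, 140],
       [40, 160]]

total = sum(sum(row) for row in obs)
row_sums = [sum(row) for row in obs]
col_sums = [sum(col) for col in zip(*obs)]

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi2 = sum((obs[i][j] - row_sums[i] * col_sums[j] / total) ** 2
           / (row_sums[i] * col_sums[j] / total)
           for i in range(2) for j in range(2))      # ≈ 5.33, df = 1

# df=1: chi-square is a squared standard normal, so p = 2 * P(Z > sqrt(chi2))
p = 2 * NormalDist().cdf(-sqrt(chi2))                # ≈ 0.021
```
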

Choosing a test: quick guide
  • Mean, 1 sample: one-sample t-test (if sd unknown)
  • Mean, 2 independent groups: two-sample t-test (Welch if variances differ)
  • Paired measurements: paired t-test
  • Proportions, 2 groups: two-proportion z-test
  • Categories: chi-square (or Fisher if small counts)
  • More than 2 means: ANOVA (or Kruskal–Wallis nonparametric)
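
The guide above can be mirrored as a small lookup table (illustrative sketch only; the keys and labels here are made up for this example, not a library API):

```python
# Minimal test-chooser sketch mirroring the quick guide above.
TEST_GUIDE = {
    ("mean", 1, "independent"): "one-sample t-test",
    ("mean", 2, "independent"): "two-sample t-test (Welch if variances differ)",
    ("mean", 2, "paired"):      "paired t-test",
    ("proportion", 2, "independent"): "two-proportion z-test",
    ("category", 2, "independent"):   "chi-square (Fisher if small counts)",
    ("mean", 3, "independent"): "ANOVA (Kruskal-Wallis nonparametric)",
}

def choose_test(outcome, groups, design="independent"):
    key = (outcome, min(groups, 3), design)  # 3 stands for "3 or more"
    return TEST_GUIDE.get(key, "no match -- revisit the guide")

print(choose_test("mean", 2, "paired"))  # paired t-test
```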

Assumptions and checks

  • Random, independent observations.
  • t-tests: approximately normal means; robust if n is moderate (CLT). For small n, check normality/outliers.
  • Equal variances only if using pooled t-test; Welch t-test avoids this.
  • Proportion tests: adequate counts (np and n(1-p) typically ≥ 5–10).
  • Chi-square: expected cell counts usually ≥ 5 (else use Fisher).
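
The two count rules of thumb are easy to automate before running a test (a sketch; the thresholds are the conventional ones listed above and can be tightened for caution):

```python
def proportion_counts_ok(x, n, threshold=10):
    """Check np and n(1-p) for a z-test on proportions (rule of thumb)."""
    p = x / n
    return n * p >= threshold and n * (1 - p) >= threshold

def chisq_expected_ok(table_2x2, threshold=5):
    """Check that all expected counts in a 2x2 table reach the usual minimum."""
    (a, b), (c, d) = table_2x2
    total = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    return all(r * k / total >= threshold for r in rows for k in cols)

ok_prop = proportion_counts_ok(250, 5000)           # Example 2 counts -> True
ok_chi = chisq_expected_ok([[60, 140], [40, 160]])  # Example 3 table -> True
```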

Exercises you can do now

These match the exercises below. Try them here, then compare with the solutions.

  1. Exercise 1 (one-sample t-test): A sample of n=40 users has mean daily active time 52.4 minutes with sd=8. Is it different from 50 minutes at alpha=0.05 (two-tailed)? State H0/H1, compute t, p, and a decision.
  2. Exercise 2 (two-proportion z-test): A: 270/6000 converted. B: 342/6100 converted. Test H0: pA = pB (two-tailed, alpha=0.05). Compute z, p, and decision. Also report the absolute difference in percentage points.

Need hints?
  • For t-tests: SE = s / sqrt(n) and t = (xbar - mu0) / SE.
  • For two-proportion tests: pooled p = (x1+x2)/(n1+n2); SE = sqrt(p*(1-p)*(1/n1+1/n2)).
  • Two-tailed p is about twice the one-tailed area for |z| or |t|.
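
The three hints translate directly into reusable helpers (standard library only; the function names are ours, not from any package):

```python
from math import sqrt
from statistics import NormalDist

def one_sample_t(xbar, mu0, s, n):
    """Hint 1: SE = s/sqrt(n), t = (xbar - mu0)/SE. Returns (t, df)."""
    return (xbar - mu0) / (s / sqrt(n)), n - 1

def two_prop_z(x1, n1, x2, n2):
    """Hint 2: pooled p and SE, then z on the difference in proportions."""
    p = (x1 + x2) / (n1 + n2)
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (x2 / n2 - x1 / n1) / se

def two_tailed_p_normal(stat):
    """Hint 3: two-tailed p is twice the one-tailed normal area beyond |stat|."""
    return 2 * NormalDist().cdf(-abs(stat))
```

Plug the numbers from Exercises 1 and 2 into these helpers and compare against your hand calculations.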

Checklist before checking solutions

  • Clear H0/H1 and alpha defined.
  • Right test chosen for data type and design.
  • Test statistic and df (if applicable) computed.
  • Approximate p-value and a decision stated.
  • Effect size or raw difference reported.
  • Assumptions mentioned briefly.

Common mistakes and self-check

  • Misreading p-value: It is the probability of data at least this extreme assuming H0 is true, not the probability that H0 is true given the data. Self-check: Rephrase the result properly.
  • P-hacking/peeking: Stopping early when p dips below 0.05. Self-check: Predefine sample size and analysis plan.
  • Wrong test tail: Using one-tailed after seeing data. Self-check: Tail is set before looking.
  • Ignoring power: Small samples miss real effects. Self-check: Run a power or sample-size check.
  • Assumption violations: Non-independence or tiny expected counts. Self-check: Review design and counts.
  • Confusing statistical with practical significance. Self-check: Always include effect size and impact.
  • Multiple comparisons without correction. Self-check: Control FDR (e.g., Benjamini–Hochberg) or adjust alpha.
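
The last self-check mentions Benjamini–Hochberg; the procedure fits in a few lines (a sketch: sort the p-values, find the largest rank k with p_(k) ≤ (k/m)·q, and reject the k smallest):

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return indices of hypotheses rejected at FDR level q (BH procedure)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Largest rank k whose sorted p-value clears k/m * q; reject those k.
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * q:
            k = rank
    return sorted(order[:k])

# Five variant comparisons: only the clearly small p-values survive
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.20]))  # -> [0, 1]
```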

Practical projects

  • Run a simulated A/B test in a spreadsheet: generate conversions, apply a two-proportion z-test, and write a one-paragraph decision with effect size.
  • Compare model error rates across two datasets (before/after a data change) using a two-proportion test; include a minimal power analysis.
  • Audit a metric change: one-sample or paired t-test on pre/post values, plus a 95% confidence interval and a brief assumption check.
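
The first project can also be done without a spreadsheet; here is a seeded simulation sketch (the true rates of 5% and 6% are assumptions chosen for the demo):

```python
import random
from math import sqrt
from statistics import NormalDist

# Simulated A/B test: draw Bernoulli conversions under assumed true
# rates, then run a two-proportion z-test on the simulated data.
random.seed(7)
n_a = n_b = 5000
rate_a, rate_b = 0.05, 0.06    # assumed true conversion rates (demo values)
conv_a = sum(random.random() < rate_a for _ in range(n_a))
conv_b = sum(random.random() < rate_b for _ in range(n_b))

p_pool = (conv_a + conv_b) / (n_a + n_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (conv_b / n_b - conv_a / n_a) / se
p = 2 * NormalDist().cdf(-abs(z))
lift_pp = (conv_b / n_b - conv_a / n_a) * 100   # effect size in percentage points
print(f"z={z:.2f}, p={p:.3f}, lift={lift_pp:+.2f} pp")
```

Rerun with different seeds to see how often a real 1-point lift reaches significance at this sample size, which is exactly what a power analysis quantifies.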

Mini challenge

You release a new onboarding flow. In a week, 600 of 10,000 new users activated in the old flow; 720 of 10,500 activated in the new flow. Pick an appropriate test, justify one- vs two-tailed, compute the decision at alpha=0.05, and state the practical implication for product rollout.

Tip

Decide whether direction matters for your business goal; include an effect size (percentage-point difference).

Next steps

  • Practice with your own data (or small simulations) and write full test reports (hypotheses, test, p-value, CI, effect size, assumptions).
  • Learn power and sample-size planning to design stronger experiments.
  • Explore multiple testing control (FDR) for experiments with many variants or metrics.

Quick Test

Take the Quick Test below to check your understanding. Available to everyone; only logged-in users get saved progress.

Practice Exercises

2 exercises to complete

Exercise 1: Instructions

A sample of n=40 users has mean daily active time x̄=52.4 minutes with s=8. Test H0: mu=50 versus H1: mu≠50 at alpha=0.05. Compute t, df, approximate p-value, and state your decision. Report the effect size as the raw difference (minutes) and a brief interpretation.

Expected Output
Clear H0/H1; t statistic around 1.90; df=39; p ≈ 0.06–0.07; fail to reject H0; effect size ~+2.4 minutes.
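
The expected numbers can be reproduced directly (standard library sketch; the normal approximation to the p-value lands just below the stated 0.06–0.07 range, while the exact t-based value with df=39 is ≈ 0.065):

```python
from math import sqrt
from statistics import NormalDist

# Exercise 1 check: H0: mu = 50 vs H1: mu != 50, alpha = 0.05
n, xbar, s, mu0 = 40, 52.4, 8.0, 50.0

t = (xbar - mu0) / (s / sqrt(n))          # ≈ 1.90, df = 39
p_approx = 2 * NormalDist().cdf(-abs(t))  # ≈ 0.058 (t-based p with df=39 ≈ 0.065)
effect = xbar - mu0                       # ≈ +2.4 minutes
```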

Hypothesis Testing — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

