
Hypothesis Testing

Learn Hypothesis Testing for free with explanations, worked examples, exercises, and a quick test (for Data Scientists).

Published: January 1, 2026 | Updated: January 1, 2026

Who this is for

  • Data Scientists and Analysts running A/B tests, experiments, or validating insights.
  • ML Engineers comparing models or checking data shifts.
  • Anyone who needs evidence-based decisions, not gut feeling.

Prerequisites

  • Basic probability and distributions (normal, binomial).
  • Mean, variance, standard error, sampling basics.
  • Comfort with a calculator or spreadsheet. Python/R helpful but not required.

Learning path

  1. Learn the roles of H0 (null) and H1 (alternative).
  2. Understand alpha, p-value, and Type I/II errors.
  3. Pick the right test (t-test, z-test for proportions, chi-square, etc.).
  4. Check assumptions (independence, normality, equal variances when needed).
  5. Compute the test statistic and p-value; interpret clearly.
  6. Report effect size and confidence interval.
  7. Plan power and sample size for future tests.

Why this matters

  • A/B testing: Decide if Variant B truly increases conversion.
  • Product analytics: Validate if a metric changed after a rollout.
  • ML monitoring: Detect data drift or performance regressions.
  • Operations: Compare defect rates or response times across processes.

Concept explained simply

Hypothesis testing asks: If nothing changed (H0), how surprising is our data? If it is very surprising (p-value below alpha), we reject H0 and favor H1.

Mental model: The courtroom

H0 is "innocent by default." Data is the evidence. Alpha (e.g., 0.05) is the threshold for how much surprise we need to convict (reject H0). A small p-value means the evidence is strong against H0.

Core workflow (step-by-step)

  1. Translate the question into H0 and H1. Choose one- or two-tailed.
  2. Select a test that matches data type and design (paired vs independent, means vs proportions).
  3. Check assumptions (randomization, independence; normality or large-sample; equal variances if needed).
  4. Compute the test statistic and p-value.
  5. Decide using alpha (commonly 0.05). Do not over-interpret tiny effects.
  6. Report: effect size, CI, p-value, test used, assumptions, and practical impact.

Worked examples

Example 1: One-sample t-test (mean)

Scenario: Average session duration target is 10. Sample of n=30 has mean=9.5, sd=2. H0: mean=10 (two-tailed), alpha=0.05.

  • SE = 2 / sqrt(30) ≈ 0.365
  • t = (9.5 - 10) / 0.365 ≈ -1.37, df=29
  • p ≈ 0.18 > 0.05 → Fail to reject H0. Insufficient evidence that the mean differs from 10.
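
These numbers can be reproduced in a few lines of Python (a sketch using only the standard library; the exact p-value needs the t distribution with df=29, e.g. from scipy.stats, so a normal approximation is shown, which comes out slightly below the t-based p ≈ 0.18):

```python
from math import sqrt
from statistics import NormalDist

# One-sample t-test: H0: mean = 10, two-tailed, alpha = 0.05
n, xbar, s, mu0 = 30, 9.5, 2.0, 10.0

se = s / sqrt(n)             # standard error ≈ 0.365
t = (xbar - mu0) / se        # test statistic ≈ -1.37, df = n - 1 = 29

# Normal approximation to the two-tailed p-value (≈ 0.17 here;
# the exact t-distribution value with df=29 is ≈ 0.18)
p_approx = 2 * NormalDist().cdf(-abs(t))
```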

Example 2: Two-proportion z-test (A/B conversion)

A: n=5000, conv=250 (5%). B: n=4900, conv=294 (6%). H0: pA = pB (two-tailed), alpha=0.05.

  • Pooled p = (250+294)/(5000+4900) ≈ 0.055
  • SE ≈ sqrt(0.055*0.945*(1/5000+1/4900)) ≈ 0.00458
  • z = (0.06 - 0.05)/0.00458 ≈ 2.18 → two-tailed p ≈ 0.029
  • Decision: Reject H0. B's conversion rate is statistically significantly higher than A's.
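
The same calculation in Python (standard library only; NormalDist supplies the normal tail area, so no table lookup is needed):

```python
from math import sqrt
from statistics import NormalDist

# Two-proportion z-test for A/B conversion
xa, na = 250, 5000   # A: 5.0% conversion
xb, nb = 294, 4900   # B: 6.0% conversion

p_pool = (xa + xb) / (na + nb)                     # pooled p ≈ 0.055
se = sqrt(p_pool * (1 - p_pool) * (1/na + 1/nb))   # ≈ 0.00458
z = (xb/nb - xa/na) / se                           # ≈ 2.18
p_two_tailed = 2 * NormalDist().cdf(-abs(z))       # ≈ 0.029
```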

Example 3: Chi-square test of independence (categories)

Device vs churn (2x2): Mobile Yes=60, No=140; Desktop Yes=40, No=160 (total=400). H0: device and churn are independent.

  • Expected counts (Yes=100 total): Mobile Yes=50, Desktop Yes=50; Mobile No=150, Desktop No=150
  • Chi-square ≈ 2 + 0.667 + 2 + 0.667 = 5.33; df=1 → p ≈ 0.021
  • Decision: Reject H0. Churn is associated with device.
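
A quick way to verify the chi-square arithmetic in Python; for df=1 the p-value can come from the normal distribution, because a chi-square variable with one degree of freedom is a squared standard normal (a sketch; scipy.stats.chi2_contingency does all of this in one call):

```python
from math import sqrt
from statistics import NormalDist

# 2x2 table: rows = device (mobile, desktop), cols = churn (yes, no)
obs = [[60, 140],
       [40, 160]]

total = sum(sum(row) for row in obs)
row_sums = [sum(row) for row in obs]
col_sums = [sum(col) for col in zip(*obs)]

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi2 = sum((obs[i][j] - row_sums[i] * col_sums[j] / total) ** 2
           / (row_sums[i] * col_sums[j] / total)
           for i in range(2) for j in range(2))      # ≈ 5.33, df = 1

# df=1: chi-square is a squared standard normal, so p = 2 * P(Z > sqrt(chi2))
p = 2 * NormalDist().cdf(-sqrt(chi2))                # ≈ 0.021
```
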

Choosing a test: quick guide
  • Mean, 1 sample: one-sample t-test (if sd unknown)
  • Mean, 2 independent groups: two-sample t-test (Welch if variances differ)
  • Paired measurements: paired t-test
  • Proportions, 2 groups: two-proportion z-test
  • Categories: chi-square (or Fisher if small counts)
  • More than 2 means: ANOVA (or Kruskal–Wallis nonparametric)
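
The guide above can be mirrored as a small lookup table (illustrative sketch only; the keys and labels here are made up for this example, not a library API):

```python
# Minimal test-chooser sketch mirroring the quick guide above.
TEST_GUIDE = {
    ("mean", 1, "independent"): "one-sample t-test",
    ("mean", 2, "independent"): "two-sample t-test (Welch if variances differ)",
    ("mean", 2, "paired"):      "paired t-test",
    ("proportion", 2, "independent"): "two-proportion z-test",
    ("category", 2, "independent"):   "chi-square (Fisher if small counts)",
    ("mean", 3, "independent"): "ANOVA (Kruskal-Wallis nonparametric)",
}

def choose_test(outcome, groups, design="independent"):
    key = (outcome, min(groups, 3), design)  # 3 stands for "3 or more"
    return TEST_GUIDE.get(key, "no match -- revisit the guide")

print(choose_test("mean", 2, "paired"))  # paired t-test
```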

Assumptions and checks

  • Random, independent observations.
  • t-tests: approximately normal means; robust if n is moderate (CLT). For small n, check normality/outliers.
  • Equal variances only if using pooled t-test; Welch t-test avoids this.
  • Proportion tests: adequate counts (np and n(1-p) typically ≥ 5–10).
  • Chi-square: expected cell counts usually ≥ 5 (else use Fisher).
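
The two count rules of thumb are easy to automate before running a test (a sketch; the thresholds are the conventional ones listed above and can be tightened for caution):

```python
def proportion_counts_ok(x, n, threshold=10):
    """Check np and n(1-p) for a z-test on proportions (rule of thumb)."""
    p = x / n
    return n * p >= threshold and n * (1 - p) >= threshold

def chisq_expected_ok(table_2x2, threshold=5):
    """Check that all expected counts in a 2x2 table reach the usual minimum."""
    (a, b), (c, d) = table_2x2
    total = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    return all(r * k / total >= threshold for r in rows for k in cols)

ok_prop = proportion_counts_ok(250, 5000)           # Example 2 counts -> True
ok_chi = chisq_expected_ok([[60, 140], [40, 160]])  # Example 3 table -> True
```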

Exercises you can do now

These match the exercises below. Try them here, then compare with the solutions.

  1. Exercise 1 (one-sample t-test): A sample of n=40 users has mean daily active time 52.4 minutes with sd=8. Is it different from 50 minutes at alpha=0.05 (two-tailed)? State H0/H1, compute t, p, and a decision.
  2. Exercise 2 (two-proportion z-test): A: 270/6000 converted. B: 342/6100 converted. Test H0: pA = pB (two-tailed, alpha=0.05). Compute z, p, and decision. Also report the absolute difference in percentage points.

Need hints?
  • For t-tests: SE = s / sqrt(n) and t = (xbar - mu0) / SE.
  • For two-proportion tests: pooled p = (x1+x2)/(n1+n2); SE = sqrt(p*(1-p)*(1/n1+1/n2)).
  • Two-tailed p is about twice the one-tailed area for |z| or |t|.
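
The three hints translate directly into reusable helpers (standard library only; the function names are ours, not from any package):

```python
from math import sqrt
from statistics import NormalDist

def one_sample_t(xbar, mu0, s, n):
    """Hint 1: SE = s/sqrt(n), t = (xbar - mu0)/SE. Returns (t, df)."""
    return (xbar - mu0) / (s / sqrt(n)), n - 1

def two_prop_z(x1, n1, x2, n2):
    """Hint 2: pooled p and SE, then z on the difference in proportions."""
    p = (x1 + x2) / (n1 + n2)
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (x2 / n2 - x1 / n1) / se

def two_tailed_p_normal(stat):
    """Hint 3: two-tailed p is twice the one-tailed normal area beyond |stat|."""
    return 2 * NormalDist().cdf(-abs(stat))
```

Plug the numbers from Exercises 1 and 2 into these helpers and compare against your hand calculations.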

Checklist before checking solutions

  • Clear H0/H1 and alpha defined.
  • Right test chosen for data type and design.
  • Test statistic and df (if applicable) computed.
  • Approximate p-value and a decision stated.
  • Effect size or raw difference reported.
  • Assumptions mentioned briefly.

Common mistakes and self-check

  • Misreading p-value: It is the probability of data at least this extreme assuming H0 is true, not the probability that H0 is true given the data. Self-check: Rephrase the result properly.
  • P-hacking/peeking: Stopping early when p dips below 0.05. Self-check: Predefine sample size and analysis plan.
  • Wrong test tail: Using one-tailed after seeing data. Self-check: Tail is set before looking.
  • Ignoring power: Small samples miss real effects. Self-check: Run a power or sample-size check.
  • Assumption violations: Non-independence or tiny expected counts. Self-check: Review design and counts.
  • Confusing statistical with practical significance. Self-check: Always include effect size and impact.
  • Multiple comparisons without correction. Self-check: Control FDR (e.g., Benjamini–Hochberg) or adjust alpha.
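
The last self-check mentions Benjamini–Hochberg; the procedure fits in a few lines (a sketch: sort the p-values, find the largest rank k with p_(k) ≤ (k/m)·q, and reject the k smallest):

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return indices of hypotheses rejected at FDR level q (BH procedure)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Largest rank k whose sorted p-value clears k/m * q; reject those k.
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * q:
            k = rank
    return sorted(order[:k])

# Five variant comparisons: only the clearly small p-values survive
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.20]))  # -> [0, 1]
```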

Practical projects

  • Run a simulated A/B test in a spreadsheet: generate conversions, apply a two-proportion z-test, and write a one-paragraph decision with effect size.
  • Compare model error rates across two datasets (before/after a data change) using a two-proportion test; include a minimal power analysis.
  • Audit a metric change: one-sample or paired t-test on pre/post values, plus a 95% confidence interval and a brief assumption check.
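
The first project can also be done without a spreadsheet; here is a seeded simulation sketch (the true rates of 5% and 6% are assumptions chosen for the demo):

```python
import random
from math import sqrt
from statistics import NormalDist

# Simulated A/B test: draw Bernoulli conversions under assumed true
# rates, then run a two-proportion z-test on the simulated data.
random.seed(7)
n_a = n_b = 5000
rate_a, rate_b = 0.05, 0.06    # assumed true conversion rates (demo values)
conv_a = sum(random.random() < rate_a for _ in range(n_a))
conv_b = sum(random.random() < rate_b for _ in range(n_b))

p_pool = (conv_a + conv_b) / (n_a + n_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (conv_b / n_b - conv_a / n_a) / se
p = 2 * NormalDist().cdf(-abs(z))
lift_pp = (conv_b / n_b - conv_a / n_a) * 100   # effect size in percentage points
print(f"z={z:.2f}, p={p:.3f}, lift={lift_pp:+.2f} pp")
```

Rerun with different seeds to see how often a real 1-point lift reaches significance at this sample size, which is exactly what a power analysis quantifies.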

Mini challenge

You release a new onboarding flow. In a week, 600 of 10,000 new users activated in the old flow; 720 of 10,500 activated in the new flow. Pick an appropriate test, justify one- vs two-tailed, compute the decision at alpha=0.05, and state the practical implication for product rollout.

Tip

Decide whether direction matters for your business goal; include an effect size (percentage-point difference).

Next steps

  • Practice with your own data (or small simulations) and write full test reports (hypotheses, test, p-value, CI, effect size, assumptions).
  • Learn power and sample-size planning to design stronger experiments.
  • Explore multiple testing control (FDR) for experiments with many variants or metrics.

Quick Test

Take the Quick Test below to check your understanding. Available to everyone; only logged-in users get saved progress.

Practice Exercises

2 exercises to complete

Exercise 1: Instructions

A sample of n=40 users has mean daily active time x̄=52.4 minutes with s=8. Test H0: mu=50 versus H1: mu≠50 at alpha=0.05. Compute t, df, approximate p-value, and state your decision. Report the effect size as the raw difference (minutes) and a brief interpretation.

Expected Output
Clear H0/H1; t statistic around 1.90; df=39; p ≈ 0.06–0.07; fail to reject H0; effect size ~+2.4 minutes.
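
The expected numbers can be reproduced directly (standard library sketch; the normal approximation to the p-value lands just below the stated 0.06–0.07 range, while the exact t-based value with df=39 is ≈ 0.065):

```python
from math import sqrt
from statistics import NormalDist

# Exercise 1 check: H0: mu = 50 vs H1: mu != 50, alpha = 0.05
n, xbar, s, mu0 = 40, 52.4, 8.0, 50.0

t = (xbar - mu0) / (s / sqrt(n))          # ≈ 1.90, df = 39
p_approx = 2 * NormalDist().cdf(-abs(t))  # ≈ 0.058 (t-based p with df=39 ≈ 0.065)
effect = xbar - mu0                       # ≈ +2.4 minutes
```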

Hypothesis Testing — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

