Who this is for
- Data Scientists and Analysts running A/B tests, experiments, or validating insights.
- ML Engineers comparing models or checking data shifts.
- Anyone who needs evidence-based decisions, not gut feeling.
Prerequisites
- Basic probability and distributions (normal, binomial).
- Mean, variance, standard error, sampling basics.
- Comfort with a calculator or spreadsheet. Python/R helpful but not required.
Learning path
- Learn the roles of H0 (null) and H1 (alternative).
- Understand alpha, p-value, and Type I/II errors.
- Pick the right test (t-test, z-test for proportions, chi-square, etc.).
- Check assumptions (independence, normality, equal variances when needed).
- Compute the test statistic and p-value; interpret clearly.
- Report effect size and confidence interval.
- Plan power and sample size for future tests.
Why this matters
- A/B testing: Decide if Variant B truly increases conversion.
- Product analytics: Validate if a metric changed after a rollout.
- ML monitoring: Detect data drift or performance regressions.
- Operations: Compare defect rates or response times across processes.
Concept explained simply
Hypothesis testing asks: if nothing changed (H0 is true), how surprising is data at least as extreme as ours? If it is very surprising (p-value below alpha), we reject H0 in favor of H1.
Mental model: The courtroom
H0 is "innocent by default." Data is the evidence. Alpha (e.g., 0.05) is the threshold for how much surprise we need to convict (reject H0). A small p-value means the evidence is strong against H0.
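To make "how surprising is our data under H0" concrete, here is a minimal sketch using a hypothetical coin-flip experiment (60 heads in 100 flips; these numbers are illustrative and not from the examples in this guide). It sums the exact binomial probability, under H0: p = 0.5, of every outcome that is no more likely than the observed one.

```python
from math import comb

def binomial_two_sided_p(successes: int, n: int, p0: float = 0.5) -> float:
    """Exact two-sided p-value under H0: P(success) = p0.

    Adds up the probability of every outcome that is no more likely
    than the observed one (one common two-sided convention).
    """
    pmf = [comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(n + 1)]
    observed = pmf[successes]
    # Small relative tolerance so floating-point ties are counted.
    total = sum(q for q in pmf if q <= observed * (1 + 1e-9))
    return min(1.0, total)

alpha = 0.05                              # threshold chosen before looking at data
p_value = binomial_two_sided_p(60, 100)   # hypothetical data: 60 heads in 100 flips
reject_h0 = p_value < alpha
print(f"p = {p_value:.4f}, reject H0: {reject_h0}")
```

In courtroom terms, `alpha` is the bar set in advance and `p_value` measures the strength of the evidence; the verdict is just the comparison on the last lines.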
Core workflow (step-by-step)
- Translate the question into H0 and H1. Choose one- or two-tailed.
- Select a test that matches data type and design (paired vs independent, means vs proportions).
- Check assumptions (randomization, independence; normality or large-sample; equal variances if needed).
- Compute the test statistic and p-value.
- Decide using alpha (commonly 0.05). Do not over-interpret tiny effects.
- Report: effect size, CI, p-value, test used, assumptions, and practical impact.
Worked examples
Example 1: One-sample t-test (mean)
Scenario: Average session duration target is 10. Sample of n=30 has mean=9.5, sd=2. H0: mean=10 (two-tailed), alpha=0.05.
- SE = 2 / sqrt(30) ≈ 0.365
- t = (9.5 - 10) / 0.365 ≈ -1.37, df=29
- p ≈ 0.18 > 0.05 → Fail to reject H0. The data do not provide sufficient evidence that the mean differs from 10 (this is not proof that the mean equals 10).
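The arithmetic in Example 1 can be reproduced with a short sketch. Python's standard library has no t-distribution CDF, so this version integrates the t density numerically with Simpson's rule; that is an assumption made to keep the example dependency-free, and in practice a statistics library such as SciPy would normally handle this step. The helper names `t_pdf` and `t_two_sided_p` are hypothetical.

```python
import math

def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_two_sided_p(t: float, df: int, steps: int = 2000) -> float:
    """Two-sided p-value: 1 minus the area between -|t| and |t|.

    Uses Simpson's rule on [0, |t|] (steps must be even) and doubles
    the result because the t density is symmetric around zero.
    """
    upper = abs(t)
    h = upper / steps
    area = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * area * h / 3

# Numbers from Example 1
n, sample_mean, sample_sd, mu0 = 30, 9.5, 2.0, 10.0
se = sample_sd / math.sqrt(n)
t_stat = (sample_mean - mu0) / se
p_value = t_two_sided_p(t_stat, df=n - 1)
print(f"SE = {se:.3f}, t = {t_stat:.2f}, p = {p_value:.2f}")
```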
Example 2: Two-proportion z-test (A/B conversion)
A: n=5000, conv=250 (5%). B: n=4900, conv=294 (6%). H0: pA = pB (two-tailed), alpha=0.05.
- Pooled p = (250+294)/(5000+4900) ≈ 0.055
- SE ≈ sqrt(0.055*0.945*(1/5000+1/4900)) ≈ 0.00458
- z = (0.06 - 0.05)/0.00458 ≈ 2.18 → two-tailed p ≈ 0.029
- Decision: Reject H0. B's conversion rate is statistically significantly higher than A's (a difference of 1 percentage point).
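Example 2 can be checked with the standard library alone. The sketch below also adds a 95% confidence interval for the difference, which is not part of the original calculation: it uses the unpooled standard error, a common convention for the interval (the pooled version is used only for the test under H0). The function name `two_proportion_ztest` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(x1: int, n1: int, x2: int, n2: int, conf: float = 0.95):
    """Two-tailed z-test for H0: p1 = p2, plus a CI for (p2 - p1)."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p2 - p1

    # Test statistic: pooled proportion, because H0 says both groups share one rate.
    pooled = (x1 + x2) / (n1 + n2)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Confidence interval: unpooled SE, since the interval does not assume H0.
    se_unpooled = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_crit = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)
    return z, p_value, ci

# Numbers from Example 2: A = 250/5000, B = 294/4900
z, p_value, (ci_low, ci_high) = two_proportion_ztest(250, 5000, 294, 4900)
print(f"z = {z:.2f}, p = {p_value:.3f}, 95% CI = ({ci_low:.4f}, {ci_high:.4f})")
```

An interval that excludes zero agrees with rejecting H0 at the matching alpha, and its width shows how precisely the lift is estimated.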
Example 3: Chi-square test of independence (categories)
Device vs churn (2x2): Mobile Yes=60, No=140; Desktop Yes=40, No=160 (total=400). H0: device and churn are independent.
- Expected counts (Yes=100 total): Mobile Yes=50, Desktop Yes=50; Mobile No=150, Desktop No=150
- Chi-square ≈ 2 + 0.667 + 2 + 0.667 = 5.33; df=1 → p ≈ 0.021
- Decision: Reject H0. Churn is associated with device (30% churn on mobile vs 20% on desktop); this shows association, not causation.
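The expected counts and the chi-square statistic in Example 3 can be reproduced as follows. This is a sketch for 2x2 tables only: with df = 1 the chi-square variable is the square of a standard normal, so the p-value can be obtained from `math.erfc` without a statistics library. No continuity (Yates) correction is applied, matching the hand calculation above.

```python
import math

def chi_square_2x2(table):
    """Chi-square test of independence for a 2x2 table (df = 1, no correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected

    # For df = 1: P(chi2 > stat) = P(|Z| > sqrt(stat)) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Rows: Mobile, Desktop. Columns: churn Yes, churn No (Example 3).
observed_table = [[60, 140], [40, 160]]
chi2_stat, p_value = chi_square_2x2(observed_table)
print(f"chi-square = {chi2_stat:.2f}, p = {p_value:.3f}")
```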
Choosing a test quick guide
- Mean, 1 sample: one-sample t-test (if sd unknown)
- Mean, 2 independent groups: two-sample t-test (Welch if variances differ)
- Paired measurements: paired t-test
- Proportions, 2 groups: two-proportion z-test
- Categories: chi-square (or Fisher if small counts)
- More than 2 means: ANOVA (or Kruskal–Wallis nonparametric)
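For the two-independent-groups row of the guide above, here is a minimal sketch of the Welch statistic and its Welch–Satterthwaite degrees of freedom. The two samples are hypothetical values invented for illustration, and the function name `welch_t` is an assumption. The p-value step is omitted because it needs a t-distribution CDF, which the standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic and approximate degrees of freedom.

    Does not assume equal variances; each group contributes its own
    variance-of-the-mean term (s^2 / n).
    """
    na, nb = len(sample_a), len(sample_b)
    va = variance(sample_a) / na  # sample variance (n - 1 denominator) over n
    vb = variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va**2 / (na - 1) + vb**2 / (nb - 1))
    return t, df

# Hypothetical measurements for two independent groups
group_a = [12.1, 11.8, 12.6, 12.0, 11.5, 12.3]
group_b = [11.2, 11.9, 10.8, 11.4, 12.5, 10.9, 11.1]
t_stat, df = welch_t(group_a, group_b)
print(f"t = {t_stat:.2f}, df = {df:.1f}")
```

The Welch df is generally not an integer and falls between the smaller group's n - 1 and the pooled n1 + n2 - 2.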
Assumptions and checks
- Random, independent observations.
- t-tests: the sampling distribution of the mean should be approximately normal; with moderate-to-large n the CLT usually ensures this. For small n, check normality/outliers.
- Equal variances only if using pooled t-test; Welch t-test avoids this.
- Proportion tests: adequate counts (np and n(1-p) typically ≥ 5–10).
- Chi-square: expected cell counts usually ≥ 5 (else use Fisher).
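The count-based checks in the list above are easy to automate. The sketch below uses a minimum of 5, the lower end of the rule of thumb quoted above; the helper names are hypothetical and the threshold is a convention, not a hard rule.

```python
def proportion_counts_ok(n: int, p: float, minimum: float = 5) -> bool:
    """Rule of thumb for proportion z-tests: n*p and n*(1-p) both large enough."""
    return n * p >= minimum and n * (1 - p) >= minimum

def expected_counts(table):
    """Expected cell counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

def chi_square_counts_ok(table, minimum: float = 5) -> bool:
    """True when every expected cell count reaches the minimum (else consider Fisher)."""
    return all(cell >= minimum for row in expected_counts(table) for cell in row)

# Usage with illustrative inputs
print(proportion_counts_ok(5000, 0.055))            # large sample, counts are adequate
print(proportion_counts_ok(40, 0.05))               # n*p = 2, too few expected successes
print(chi_square_counts_ok([[60, 140], [40, 160]]))
print(chi_square_counts_ok([[3, 1], [2, 4]]))       # tiny table, expected counts below 5
```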
Exercises you can do now
These match the exercises below. Try them here, then compare with the solutions.
- Exercise 1 (one-sample t-test): A sample of n=40 users has mean daily active time 52.4 minutes with sd=8. Is it different from 50 minutes at alpha=0.05 (two-tailed)? State H0/H1, compute t, p, and a decision.
- Exercise 2 (two-proportion z-test): A: 270/6000 converted. B: 342/6100 converted. Test H0: pA = pB (two-tailed, alpha=0.05). Compute z, p, and decision. Also report the absolute difference in percentage points.
Need hints?
- For t-tests: SE = s / sqrt(n) and t = (xbar - mu0) / SE.
- For two-proportion tests: pooled p = (x1+x2)/(n1+n2); SE = sqrt(p*(1-p)*(1/n1+1/n2)).
- Two-tailed p is twice the one-tailed area beyond |z| or |t| (the normal and t distributions are symmetric).
Checklist before checking solutions
- Clear H0/H1 and alpha defined.
- Right test chosen for data type and design.
- Test statistic and df (if applicable) computed.
- Approximate p-value and a decision stated.
- Effect size or raw difference reported.
- Assumptions mentioned briefly.
Common mistakes and self-check
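- Misreading the p-value: It is the probability of data at least as extreme as observed, assuming H0 is true; it is not P(H0 | data). Self-check: Rephrase the result properly.
- P-hacking/peeking: Stopping early when p dips below 0.05 inflates the Type I error rate. Self-check: Predefine sample size and analysis plan.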
- Wrong test tail: Using one-tailed after seeing data. Self-check: Tail is set before looking.
- Ignoring power: Small samples miss real effects. Self-check: Run a power or sample-size check.
- Assumption violations: Non-independence or tiny expected counts. Self-check: Review design and counts.
- Confusing statistical with practical significance. Self-check: Always include effect size and impact.
- Multiple comparisons without correction. Self-check: Control FDR (e.g., Benjamini–Hochberg) or adjust alpha.
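For the multiple-comparisons item above, the Benjamini–Hochberg step-up procedure can be sketched in a few lines. The p-values in the usage example are hypothetical, and the function name is an assumption. The procedure sorts the p-values, finds the largest rank k whose p-value is at most (k / m) * FDR, and rejects every hypothesis up to that rank.

```python
def benjamini_hochberg(p_values, fdr: float = 0.05):
    """Return a list of booleans (input order): True where H0 is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first

    # Step-up rule: keep the largest rank whose p-value clears its threshold.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank

    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected

# Hypothetical p-values from eight metrics in one experiment
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
decisions = benjamini_hochberg(p_values, fdr=0.05)
print(decisions)
```

Note that five of these p-values are below 0.05 on their own, but only the two smallest survive the correction.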
Practical projects
- Run a simulated A/B test in a spreadsheet: generate conversions, apply a two-proportion z-test, and write a one-paragraph decision with effect size.
- Compare model error rates across two datasets (before/after a data change) using a two-proportion test; include a minimal power analysis.
- Audit a metric change: one-sample or paired t-test on pre/post values, plus a 95% confidence interval and a brief assumption check.
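For the minimal power analysis mentioned in the projects above, one common normal-approximation formula gives the sample size per group for a two-tailed two-proportion test. This is a sketch under stated assumptions: equal group sizes, illustrative rates of 5% and 6%, and 80% power. The function name is hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_two_proportions(p1: float, p2: float,
                                alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate n per group to detect p1 vs p2 (two-tailed, equal group sizes)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    p_bar = (p1 + p2) / 2

    # Null term uses the pooled rate; alternative term uses each group's own rate.
    null_term = z_alpha * sqrt(2 * p_bar * (1 - p_bar))
    alt_term = z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return ceil((null_term + alt_term) ** 2 / (p1 - p2) ** 2)

# Illustrative planning question: detect a lift from 5% to 6%
n_per_group = sample_size_two_proportions(0.05, 0.06)
print(f"about {n_per_group} users per group")
```

Halving the detectable difference roughly quadruples the required sample, which is why the minimum effect worth detecting should be fixed before the test starts.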
Mini challenge
You release a new onboarding flow. In a week, 600 of 10,000 new users activated in the old flow; 720 of 10,500 activated in the new flow. Pick an appropriate test, justify one- vs two-tailed, compute the decision at alpha=0.05, and state the practical implication for product rollout.
Tip
Decide whether direction matters for your business goal; include an effect size (percentage-point difference).
Next steps
- Practice with your own data (or small simulations) and write full test reports (hypotheses, test, p-value, CI, effect size, assumptions).
- Learn power and sample-size planning to design stronger experiments.
- Explore multiple testing control (FDR) for experiments with many variants or metrics.