Why this matters
As a Data Analyst, you are often asked: "Did the new variant actually improve the metric, or is it just random noise?" Statistical significance helps you make that call with confidence. You will use it to:
- Decide if a new design increases conversion rate.
- Verify whether average order value changed after a pricing tweak.
- Communicate results clearly with confidence intervals and error rates.
Concept explained simply
Think of experimentation as listening for a faint signal (the true effect) in a noisy room (random variation). Statistical significance sets a rule for when the signal is strong enough to act on.
Mental model
- Null hypothesis (H0): There is no real difference; any observed change is noise.
- Alternative (H1): There is a real difference.
- Alpha (α): The false alarm tolerance (commonly 0.05). If p-value < α, you reject H0.
- p-value: The probability of seeing a result at least as extreme as yours if H0 were true. Smaller p = stronger evidence against H0.
- Confidence interval (CI): A range of plausible true effects. If the CI excludes 0, it’s significant at that level.
- Power (1−β): Chance to detect a real effect. Plan sample size to get adequate power for your Minimum Detectable Effect (MDE).
- Type I error: False positive. Type II error: False negative.
Formula mini-cheat sheet (practical)
- Two-proportion z-test (conversion): compare pA=xA/nA vs pB=xB/nB. Pooled p=(xA+xB)/(nA+nB). SE = sqrt(p(1−p)(1/nA+1/nB)). z = (pB−pA)/SE.
- Two-sample t-test (means): diff = meanB−meanA; SE = sqrt(sA^2/nA + sB^2/nB). t = diff/SE (Welch approximation is robust).
- 95% CI (approx): diff ± 1.96×SE. (Strictly, the CI for a difference in proportions uses the unpooled SE; when the group rates are similar, pooled and unpooled SEs are nearly identical.)
- Sample size (rough, proportions): n per group ≈ 16×p×(1−p)/MDE^2 for 80% power at α≈0.05.
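These formulas translate directly into code. Below is a minimal sketch in plain Python (standard library only; the function names are my own, not from any particular library):

```python
from math import erf, sqrt

def p_two_sided(stat):
    # Two-sided p-value from a standard normal, via the error function.
    return 2 * (1 - 0.5 * (1 + erf(abs(stat) / sqrt(2))))

def two_prop_z(xa, na, xb, nb):
    # Two-proportion z-test with pooled SE, as in the cheat sheet.
    pa, pb = xa / na, xb / nb
    pooled = (xa + xb) / (na + nb)
    se = sqrt(pooled * (1 - pooled) * (1 / na + 1 / nb))
    z = (pb - pa) / se
    return pb - pa, se, z, p_two_sided(z)

def two_sample_t(mean_a, sd_a, na, mean_b, sd_b, nb):
    # Welch-style t statistic; normal approximation for p is fine at large n.
    diff = mean_b - mean_a
    se = sqrt(sd_a**2 / na + sd_b**2 / nb)
    t = diff / se
    return diff, se, t, p_two_sided(t)

def n_per_group(p, mde):
    # Rough sample size: ~80% power at alpha = 0.05 (Lehr's rule of thumb).
    return 16 * p * (1 - p) / mde**2
```

For small samples, swap the normal approximation for a proper t distribution (e.g. via SciPy) rather than `p_two_sided`.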
Quick reference
- Use a two-proportion z-test for conversion rates; a two-sample t-test for averages (AOV, time, revenue/user).
- Default to two-sided tests unless you have a strong, pre-registered one-sided hypothesis.
- Avoid “peeking” repeatedly at the p-value without correction; it inflates false positives. If you must, use sequential designs or adjust α.
- Report effect size and CI, not just the p-value.
- Plan MDE and sample size before launching. Underpowered tests waste time.
Worked examples
Example 1: Conversion rate (two-proportion z-test)
Variant A: 460 conversions / 10,000 users (4.6%). Variant B: 520 / 10,000 (5.2%).
Calculation:
- pA = 0.046, pB = 0.052, diff = 0.006 (0.6 pp).
- Pooled p = (460+520)/(10000+10000) = 980/20000 = 0.049.
- SE = sqrt(0.049×0.951×(1/10000+1/10000)) ≈ 0.00305.
- z = 0.006 / 0.00305 ≈ 1.97 → two-sided p ≈ 0.049.
- 95% CI ≈ 0.006 ± 1.96×0.00305 ≈ [0.000, 0.012].
Interpretation: Borderline significant at α=0.05; effect is small but likely positive.
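The same arithmetic, line by line, in a short Python sketch (standard library only; the p-value uses the normal approximation):

```python
from math import erf, sqrt

xa, na, xb, nb = 460, 10_000, 520, 10_000
pa, pb = xa / na, xb / nb                      # 0.046, 0.052
diff = pb - pa                                 # 0.006
pooled = (xa + xb) / (na + nb)                 # 0.049
se = sqrt(pooled * (1 - pooled) * (1 / na + 1 / nb))   # ~0.00305
z = diff / se                                  # ~1.97
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # ~0.049
ci = (diff - 1.96 * se, diff + 1.96 * se)      # ~(0.000, 0.012)
print(f"z={z:.2f}, p={p_value:.3f}, 95% CI=({ci[0]:.3f}, {ci[1]:.3f})")
```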
Example 2: Average order value (two-sample t-test)
A: n=600, mean=52, sd=20. B: n=620, mean=55, sd=21.
Calculation:
- diff = 55−52 = 3.
- SE = sqrt(20^2/600 + 21^2/620) = sqrt(400/600 + 441/620) ≈ sqrt(0.6667 + 0.7113) ≈ 1.17.
- t ≈ 3 / 1.17 ≈ 2.56 → two-sided p ≈ 0.01.
- 95% CI ≈ 3 ± 1.96×1.17 ≈ [0.7, 5.3].
Interpretation: Significant increase in AOV, with a plausible lift between ~0.7 and ~5.3 currency units.
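The same numbers, verified in Python (a sketch; the p-value uses the normal approximation, which is accurate at these sample sizes):

```python
from math import erf, sqrt

na, mean_a, sd_a = 600, 52, 20
nb, mean_b, sd_b = 620, 55, 21
diff = mean_b - mean_a                         # 3
se = sqrt(sd_a**2 / na + sd_b**2 / nb)         # ~1.17
t = diff / se                                  # ~2.56
p_value = 2 * (1 - 0.5 * (1 + erf(abs(t) / sqrt(2))))  # ~0.01
ci = (diff - 1.96 * se, diff + 1.96 * se)      # ~(0.7, 5.3)
print(f"t={t:.2f}, p={p_value:.3f}, 95% CI=({ci[0]:.1f}, {ci[1]:.1f})")
```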
Example 3: Rough sample size for a conversion lift
Baseline p=5%, desired MDE=+0.5 pp, α=0.05, power≈80%.
Calculation:
- n per group ≈ 16×p×(1−p)/MDE^2.
- n ≈ 16×0.05×0.95 / 0.005^2 ≈ 30,400 per variant (rough planning figure).
Interpretation: You need large samples to detect small lifts. Use this as a ballpark, then refine with a proper calculator.
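As a quick check, the same rule of thumb in a couple of lines of Python (variable names are mine):

```python
p, mde = 0.05, 0.005            # baseline rate, minimum detectable effect (5.0% -> 5.5%)
n = 16 * p * (1 - p) / mde**2   # rough rule: ~80% power at alpha = 0.05
print(round(n))                 # about 30,400 per variant
```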
How to do it (step-by-step)
- Define primary metric and direction of interest (two-sided by default).
- Predefine α (commonly 0.05) and MDE. Estimate sample size and duration.
- Choose test: two-proportion z-test for conversion, t-test for means.
- Collect clean, randomized data with stable tracking.
- Compute effect size and SE; get p-value and CI.
- Decide: if p < α (or CI excludes 0), reject H0. Also check if the effect is practically meaningful.
- Document: metric, α, test type, effect size, CI, p-value, duration, traffic, caveats.
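The analysis steps above could be wrapped into one reusable function, sketched here for a conversion test (standard library only; the function name and return shape are my own):

```python
from math import erf, sqrt

def analyze_conversion_test(xa, na, xb, nb, alpha=0.05):
    """Two-proportion z-test: effect size, p-value, 95% CI, and a decision."""
    pa, pb = xa / na, xb / nb
    pooled = (xa + xb) / (na + nb)
    se = sqrt(pooled * (1 - pooled) * (1 / na + 1 / nb))
    z = (pb - pa) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return {
        "diff": pb - pa,
        "z": z,
        "p_value": p_value,
        "ci_95": (pb - pa - 1.96 * se, pb - pa + 1.96 * se),
        "reject_h0": p_value < alpha,
    }
```

Running it on Example 1 (`analyze_conversion_test(460, 10000, 520, 10000)`) reproduces the hand calculation, with `reject_h0` True just under α=0.05; the statistical decision should still be paired with a judgment about practical significance.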
Pre-launch checklist
- [ ] Clear primary metric and variant naming.
- [ ] α and test type documented.
- [ ] MDE and sample size estimate.
- [ ] Randomization and data quality checks.
- [ ] Decision rules (stop, extend, or iterate).
Exercises (hands-on)
Try these before peeking at solutions.
Exercise 1: Two-proportion z-test (conversion)
A: 300 conversions / 8,000 users. B: 360 / 8,000. Two-sided, α=0.05. Is B significantly better?
- Compute pA, pB, pooled p, SE, z, p-value, CI, and decision.
Exercise 2: Two-sample t-test (mean)
A: n=500, mean=24.0, sd=9.5. B: n=480, mean=25.2, sd=9.2. Two-sided, α=0.05. Is the mean higher in B?
- Compute diff, SE, t, p-value, CI, and decision.
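After working the exercises by hand, a short script like this can check your arithmetic (a sketch; the p-values use the normal approximation, which is fine at these sample sizes):

```python
from math import erf, sqrt

def p_two_sided(stat):
    # Two-sided p-value from a standard normal.
    return 2 * (1 - 0.5 * (1 + erf(abs(stat) / sqrt(2))))

# Exercise 1: two-proportion z-test
pa, pb = 300 / 8000, 360 / 8000
pooled = (300 + 360) / 16000
se1 = sqrt(pooled * (1 - pooled) * (1 / 8000 + 1 / 8000))
z = (pb - pa) / se1
print(f"Ex1: z={z:.2f}, p={p_two_sided(z):.3f}")

# Exercise 2: two-sample t-test
diff = 25.2 - 24.0
se2 = sqrt(9.5**2 / 500 + 9.2**2 / 480)
t = diff / se2
print(f"Ex2: t={t:.2f}, p={p_two_sided(t):.3f}")
```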
Common mistakes and self-checks
- Peeking too often: Repeated looks inflate Type I error. Self-check: Did you predefine looks or adjust α?
- Ignoring effect size: A tiny but significant effect may be useless. Self-check: Did you report CI and practical impact?
- Underpowered tests: Inconclusive results waste time. Self-check: Did you plan MDE and sample size?
- Multiple comparisons: Testing many metrics/segments without correction increases false positives. Self-check: Limit primary metrics or adjust for multiplicity.
- Mismatched test: Using a t-test for rates or z-test for skewed means without care. Self-check: Choose tests matched to metric type.
- Dirty data: Bot traffic, tracking bugs, or non-random exposure bias results. Self-check: Run data quality checks.
Practical projects
- Build a simple AB significance calculator in a spreadsheet: inputs (xA, nA, xB, nB), outputs (diff, SE, z, p, CI).
- Simulate experiments: Generate fake conversions with a known true lift; verify how often you detect significance at α=0.05.
- Experiment audit: Take a past test, recompute p-value and CI, and write a one-page results summary with decision and caveats.
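The simulation project can be sketched like this with only the standard library; the parameters mirror Example 3, and the seed and run count are arbitrary choices:

```python
import random
from math import erf, sqrt

def one_experiment(p_base, lift, n, rng):
    # Simulate one A/B test with a known true lift; return True if p < 0.05.
    xa = sum(rng.random() < p_base for _ in range(n))
    xb = sum(rng.random() < p_base + lift for _ in range(n))
    pooled = (xa + xb) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * (2 / n))
    z = (xb - xa) / n / se
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return p < 0.05

rng = random.Random(42)   # fixed seed for reproducibility
runs = 100
hits = sum(one_experiment(0.05, 0.005, 30_400, rng) for _ in range(runs))
print(f"Significant in {hits}/{runs} simulated tests (roughly 80% expected)")
```

Because the sample size came from the 80%-power rule of thumb, roughly four out of five simulated tests should reach significance; try shrinking `n` to see the detection rate fall.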
Who this is for
- Data Analysts and growth practitioners running or validating A/B tests.
- Anyone interpreting experiment results for product decisions.
Prerequisites
- Comfort with basic arithmetic and percentages.
- Understanding of mean, standard deviation, and proportions.
- Basic familiarity with spreadsheets or analytical tools.
Learning path
- Before: Experiment design, randomization, metric selection.
- Now: Statistical significance, p-values, confidence intervals.
- Next: Power analysis, MDE tuning, sequential testing and multiple comparisons.
Mini challenge
Baseline conversion is 7%. You want to detect a +0.7 pp lift with 80% power at α=0.05. Approximately how many users per variant do you need?
Answer:
n ≈ 16×0.07×0.93 / 0.007^2 = 1.0416 / 0.000049 ≈ 21,257 per variant (rough planning figure).
Next steps
- Turn these steps into a repeatable checklist and template.
- Implement a lightweight review process: pre-analysis plan, post-mortem, and dashboarding.
- Advance to power analysis and sequential methods to run faster, safer experiments.
Quick Test
Take the short quiz to check your understanding.