Why Statistics matters for a Data Scientist
Statistics turns raw data into trustworthy decisions. As a Data Scientist, you will design experiments, estimate uncertainty, test hypotheses, build predictive models, and communicate risk. Statistics helps you avoid false wins, quantify impact, and select models that generalize—critical for A/B tests, product metrics, forecasting, and machine learning validation.
Typical tasks this skill unlocks
- Design and analyze A/B/n tests with power and sample size.
- Estimate metrics and build confidence intervals that stakeholders can trust.
- Choose and validate models (regression, time series) with correct assumptions.
- Handle small samples with resampling (bootstrap) or Bayesian methods.
- Control false discoveries when testing many metrics or segments.
What you will be able to do
- Summarize data with robust descriptive statistics and visual checks.
- Use sampling, distributions, and the central limit theorem to reason about uncertainty.
- Build confidence intervals and interpret p-values correctly.
- Run t-tests, proportion tests, chi-square tests, and understand power and effect size.
- Fit and diagnose basic regression models.
- Use simple Bayesian updates for rates and proportions.
- Work with time series trends, seasonality, and stationarity.
- Check statistical assumptions and avoid false discoveries across many comparisons.
Practical roadmap
- Describe: Understand variables, distributions, mean/median/variance, quantiles, outliers; plot histograms/boxplots; compute z-scores (see the first sketch after this list).
- Sample & reason: Random vs biased sampling; law of large numbers; central limit theorem; common distributions (Normal, t, Binomial, Poisson).
- Estimate: Standard error; confidence intervals for means and proportions; bootstrap for non-normal data.
- Test: Formulate H0/H1; choose tests (t, z for proportions, chi-square); interpret p-values; understand power, Type I/II errors (a chi-square sketch follows this list).
- Model: Linear regression for prediction and inference; residual diagnostics; basic regularization awareness.
- Bayes basics: Priors for proportions (Beta); posterior update; credible intervals; compare to frequentist CI.
- Time series: Trend/seasonality decomposition; stationarity checks; simple forecasting baselines; autocorrelation intuition.
- Multiple testing: Why it inflates false positives; control FDR with Benjamini–Hochberg; preregister metrics mindset.
- Communicate: Report estimates with uncertainty, assumptions made, and practical conclusions.
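To make the "Describe" step concrete, here is a minimal sketch of robust summaries plus z-score outlier flagging; the toy sample and the |z| > 3 cutoff are illustrative choices, not fixed rules:
import numpy as np
rng = np.random.default_rng(42)
x = np.append(rng.normal(50, 10, 200), [120, 130])  # toy sample with two planted outliers
print('mean:', x.mean(), 'median:', np.median(x))
print('std:', x.std(ddof=1), 'IQR:', np.percentile(x, 75) - np.percentile(x, 25))
z = (x - x.mean()) / x.std(ddof=1)  # z-scores
print('flagged outliers (|z| > 3):', x[np.abs(z) > 3])
Note that the median and IQR barely move when the outliers are added, while the mean and standard deviation shift; that is why robust summaries matter for skewed data.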
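For the "Sample & reason" step, a quick central limit theorem simulation: means of samples from a skewed population become tighter and more symmetric as n grows. The population and sample sizes below are arbitrary choices for illustration:
import numpy as np
rng = np.random.default_rng(0)
pop = rng.exponential(scale=2.0, size=100_000)  # deliberately skewed population
for n in (2, 30, 200):
    means = rng.choice(pop, size=(5000, n)).mean(axis=1)
    # Spread of sample means shrinks like sigma/sqrt(n); shape approaches Normal
    print(f'n={n}: mean of means={means.mean():.3f}, sd of means={means.std(ddof=1):.3f}')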
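For the "Test" step, the chi-square test is the one not shown in the worked examples below; this sketch applies scipy's independence test to the same (illustrative) A/B counts used in worked example 1:
from scipy.stats import chi2_contingency
import numpy as np
# 2x2 table: rows = variants A/B, columns = converted / not converted
table = np.array([[260, 4740],
                  [300, 4700]])
chi2, pval, dof, expected = chi2_contingency(table)
print('chi-square p-value:', pval)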
Worked examples
1) A/B test on conversion rate (proportions z-test)
Scenario: Variant B shows 6.0% conversion vs 5.2% for control A. Is the difference significant at the 5% level?
import numpy as np
from statsmodels.stats.proportion import proportions_ztest
conv_A, n_A = 260, 5000 # 5.2%
conv_B, n_B = 300, 5000 # 6.0%
count = np.array([conv_B, conv_A])
nobs = np.array([n_B, n_A])
z_stat, p_val = proportions_ztest(count, nobs, alternative='larger')
print(z_stat, p_val)
# 95% CI for difference using normal approx
pA, pB = conv_A/n_A, conv_B/n_B
diff = pB - pA
se = np.sqrt(pA*(1-pA)/n_A + pB*(1-pB)/n_B)
ci_low, ci_high = diff - 1.96*se, diff + 1.96*se
print(diff, (ci_low, ci_high))
Interpretation: If the p-value is below 0.05 and the CI for the difference excludes 0, you have evidence that B outperforms A, with the CI giving the plausible size of the lift.
Try it: compute required sample size
Target detectable lift: +0.6pp (5.2% to 5.8%), alpha=0.05, power=0.8. Use a power calculator or approximate with the standard normal formula:
from math import sqrt
from scipy.stats import norm
p1 = 0.052  # baseline conversion
p2 = 0.058  # target conversion (+0.6pp)
alpha = 0.05
power = 0.80
z_alpha = norm.ppf(1 - alpha/2)
z_beta = norm.ppf(power)
pooled = (p1 + p2) / 2
# Pooled variance under H0, per-group variances under H1
n_per_group = (z_alpha*sqrt(2*pooled*(1-pooled))
               + z_beta*sqrt(p1*(1-p1) + p2*(1-p2)))**2 / (p2 - p1)**2
print(round(n_per_group))
2) Linear regression for pricing
Predict price from size and location rating; check assumptions and interpret coefficients.
import numpy as np
import pandas as pd
import statsmodels.api as sm
# Fake data
df = pd.DataFrame({
    'price': [210, 220, 250, 260, 275, 300, 320, 350, 360, 390],
    'size_m2': [45, 48, 52, 55, 58, 60, 65, 70, 72, 80],
    'loc_rating': [3.1, 3.0, 3.2, 3.4, 3.5, 3.6, 3.8, 4.0, 4.1, 4.3]
})
X = sm.add_constant(df[['size_m2', 'loc_rating']])
model = sm.OLS(df['price'], X).fit()
print(model.summary())
# Residual diagnostics
resid = model.resid
print('Mean residual ~ 0:', resid.mean())
# Homoskedasticity proxy: correlation between |residuals| and fitted values
print(np.corrcoef(np.abs(resid), model.fittedvalues)[0, 1])
Interpret: Coefficients show the marginal effect of each feature holding the others fixed. Check linearity (residuals vs fitted), normality (QQ plot), and influential points (Cook's distance) before drawing inferences.
Try it: add an interaction
Add size_m2 * loc_rating and see if fit improves (lower AIC, significant coefficient).
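One way to try this, a sketch using statsmodels' formula API (in the formula syntax, size_m2 * loc_rating expands to both main effects plus their interaction; df is the frame from the example above):
import statsmodels.formula.api as smf
# Compare fit with and without the interaction term
base = smf.ols('price ~ size_m2 + loc_rating', data=df).fit()
inter = smf.ols('price ~ size_m2 * loc_rating', data=df).fit()
print('AIC without interaction:', base.aic)
print('AIC with interaction:', inter.aic)
print(inter.params)  # look for a meaningful size_m2:loc_rating coefficient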
3) Bootstrap CI for the median
When data are skewed or heavy-tailed, bootstrap the median's CI.
import numpy as np
rng = np.random.default_rng(7)
data = rng.lognormal(mean=1.5, sigma=0.8, size=100)
B = 5000
boot_meds = []
for _ in range(B):
    sample = rng.choice(data, size=len(data), replace=True)
    boot_meds.append(np.median(sample))
ci = (np.percentile(boot_meds, 2.5), np.percentile(boot_meds, 97.5))
print('Median:', np.median(data), '95% CI:', ci)
Report the median and its 95% bootstrap CI, and state that the interval comes from resampling.
4) Bayesian update for a conversion rate
Prior belief: Beta(1,1) (uniform). Observe 60 conversions out of 1000. Posterior is Beta(1+60, 1+940).
from scipy.stats import beta
alpha_post, beta_post = 1+60, 1+940
mean = alpha_post / (alpha_post + beta_post)
ci = beta.ppf([0.025, 0.975], alpha_post, beta_post)
print('Posterior mean:', mean, '95% credible interval:', ci)
Compare to the frequentist CI for a proportion; they will be similar for weak priors and moderate n.
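For that comparison, a quick frequentist interval via statsmodels (Wilson is one common method choice; 'normal' gives the plain Wald approximation):
from statsmodels.stats.proportion import proportion_confint
# Wilson interval for 60 conversions out of 1000
low, high = proportion_confint(60, 1000, alpha=0.05, method='wilson')
print('95% frequentist CI:', (low, high))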
5) Time series quickstart: trend, seasonality, stationarity
import pandas as pd
import numpy as np
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller
# Simulated monthly data
dates = pd.date_range('2022-01-01', periods=36, freq='M')
trend = np.linspace(100, 140, 36)
season = 10*np.sin(2*np.pi*np.arange(36)/12)
noise = np.random.default_rng(0).normal(0, 3, 36)
series = pd.Series(trend + season + noise, index=dates)
result = seasonal_decompose(series, model='additive', period=12)
# result.trend, result.seasonal, result.resid are available
adf_stat, pvalue, *_ = adfuller(series.dropna())
print('ADF p-value:', pvalue)
If the p-value is high, the series may be non-stationary; difference the series and test again.
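A sketch of that follow-up, continuing from the series and the adfuller import above:
# First difference removes the trend; drop the leading NaN, then re-test
diffed = series.diff().dropna()
adf_stat_d, pvalue_d, *_ = adfuller(diffed)
print('ADF p-value after differencing:', pvalue_d)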
6) Benjamini–Hochberg (FDR) for many metrics
Suppose you ran 10 significance tests and got these p-values:
import numpy as np
p = np.array([0.001, 0.004, 0.012, 0.019, 0.041, 0.052, 0.12, 0.23, 0.31, 0.77])
alpha = 0.05
m = len(p)
order = np.argsort(p)
p_sorted = p[order]
thresholds = alpha * (np.arange(1, m+1)/m)
# Find largest k with p_(k) <= threshold_k
k = np.where(p_sorted <= thresholds)[0]
cut_index = k.max() if len(k) else -1
significant_mask_sorted = np.zeros(m, dtype=bool)
if cut_index >= 0:
    significant_mask_sorted[:cut_index+1] = True
# Map back to original order
significant_mask = np.zeros(m, dtype=bool)
significant_mask[order] = significant_mask_sorted
print('Significant flags:', significant_mask)
This controls expected false discovery rate at 5% across all tests.
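In practice you can let statsmodels do the bookkeeping; reusing the p array from above, multipletests with method='fdr_bh' should reproduce the same flags:
from statsmodels.stats.multitest import multipletests
# method='fdr_bh' is the Benjamini-Hochberg procedure
reject, p_adj, _, _ = multipletests(p, alpha=0.05, method='fdr_bh')
print('Significant flags:', reject)
print('BH-adjusted p-values:', p_adj)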
Common mistakes and how to debug
- Peeking in A/B tests: Stopping early on a significant result inflates false positives. Fix: use a fixed sample plan or a sequential method designed for peeking.
- Misreading p-values: p=0.04 does not mean a 96% chance the effect is real. Correct: it is the probability of observing data at least as extreme as yours, assuming the null hypothesis (no effect) is true.
- Ignoring assumptions: t-tests and OLS need approximate normality of errors and independence. Fix: inspect residuals; use non-parametric or robust methods if violated.
- Multiple comparisons: Testing many segments inflates false positives. Fix: pre-register metrics; control FDR with BH.
- Overfitting regression: Too many features relative to n. Fix: cross-validate; simplify model; regularize.
- Confusing correlation with causation: Observational differences are not causal. Fix: use experiments or causal methods.
Mini project: Ship a trustworthy A/B test report
- Define metrics: Primary conversion rate; secondary metrics (e.g., revenue per user, click-through). State hypotheses and alpha. Decide on FDR control for secondary metrics.
- Plan sample size: Pick a minimum detectable effect and 80% power; compute per-group n.
- Collect data: Ensure random assignment, logging of exposure, conversions, and timestamps.
- Analyze: For the primary metric, run a two-proportion z-test and compute a 95% CI. For secondary metrics, apply BH FDR.
- Diagnostics: Check balance (user counts, baseline rate). Look for novelty effects or time trends.
- Report: Summarize effect sizes with uncertainty, decisions, limitations, and recommendations.
Deliverables
- Notebook/script with computations and plots.
- One-page summary: objective, method, main result with CI, guardrail metrics, decision, next steps.
Subskills
- Descriptive Statistics: Summaries (mean, median, variance, quantiles), outliers, and shape of distributions.
- Sampling And Distributions: Random sampling, CLT, and common distributions (Normal, t, Binomial, Poisson).
- Estimation And Confidence Intervals: Standard error, CIs for means/proportions, bootstrap.
- Hypothesis Testing: t-tests, z-tests for proportions, chi-square tests, p-values, power.
- Regression Basics: Linear regression, interpretation, diagnostics, simple regularization awareness.
- Bayesian Basics: Priors, likelihood, Beta-Binomial updates, credible intervals.
- Time Series Basics: Trend, seasonality, stationarity, simple forecasting baselines.
- Statistical Assumptions And Diagnostics: Residual checks, influence, robustness.
- Multiple Testing And False Discovery Awareness: FDR control with Benjamini–Hochberg.
Who this is for
- Aspiring and junior Data Scientists who need solid inference skills for experiments and modeling.
- Analysts and ML engineers who want to quantify uncertainty and make defensible decisions.
Prerequisites
- Comfort with basic algebra and functions.
- Python basics (lists, arrays) or R basics; ability to run notebooks or scripts.
- Familiarity with data frames (pandas or similar) is helpful.
Learning path
- Start with Descriptive Statistics and Sampling And Distributions.
- Move to Estimation And Confidence Intervals and Hypothesis Testing.
- Practice Regression Basics and Statistical Assumptions And Diagnostics.
- Add Bayesian Basics and Time Series Basics.
- Finish with Multiple Testing And False Discovery Awareness and a capstone A/B analysis.
Practical projects
- Analyze a funnel: compute stage-wise rates with CIs; identify the biggest drop with uncertainty.
- Marketing uplift: test email versions; size the test; run BH over multiple segments.
- Retention forecast: decompose weekly active users; build a naive seasonal forecast; evaluate error.
- Pricing model: regress price on features; validate assumptions; communicate elasticities.
Next steps
- Complete the subskills below in order.
- Do the mini project and share your one-page report with a peer.
- Take the skill exam to check readiness. Anyone can take it; logged-in users get saved progress.