Why Statistics matters for a Data Scientist
Statistics turns raw data into trustworthy decisions. As a Data Scientist, you will design experiments, estimate uncertainty, test hypotheses, build predictive models, and communicate risk. Statistics helps you avoid false wins, quantify impact, and select models that generalize—critical for A/B tests, product metrics, forecasting, and machine learning validation.
Typical tasks this skill unlocks
- Design and analyze A/B/n tests with power and sample size.
- Estimate metrics and build confidence intervals that stakeholders can trust.
- Choose and validate models (regression, time series) with correct assumptions.
- Handle small samples with resampling (bootstrap) or Bayesian methods.
- Control false discoveries when testing many metrics or segments.
What you will be able to do
- Summarize data with robust descriptive statistics and visual checks.
- Use sampling, distributions, and the central limit theorem to reason about uncertainty.
- Build confidence intervals and interpret p-values correctly.
- Run t-tests, proportion tests, chi-square tests, and understand power and effect size.
- Fit and diagnose basic regression models.
- Use simple Bayesian updates for rates and proportions.
- Work with time series trends, seasonality, and stationarity.
- Check statistical assumptions and avoid false discoveries across many comparisons.
Practical roadmap
- Describe: Understand variables, distributions, mean/median/variance, quantiles, outliers; plot histograms/boxplots; compute z-scores (see the first sketch after this list).
- Sample & reason: Random vs biased sampling; law of large numbers; central limit theorem; common distributions (Normal, t, Binomial, Poisson).
- Estimate: Standard error; confidence intervals for means and proportions; bootstrap for non-normal data.
- Test: Formulate H0/H1; choose tests (t, z for proportions, chi-square); interpret p-values; understand power, Type I/II errors (a chi-square sketch follows this list).
- Model: Linear regression for prediction and inference; residual diagnostics; basic regularization awareness.
- Bayes basics: Priors for proportions (Beta); posterior update; credible intervals; compare to frequentist CI.
- Time series: Trend/seasonality decomposition; stationarity checks; simple forecasting baselines; autocorrelation intuition.
- Multiple testing: Why it inflates false positives; control FDR with Benjamini–Hochberg; preregister metrics mindset.
- Communicate: Report estimates with uncertainty, assumptions made, and practical conclusions.
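To make the "Describe" step concrete, here is a minimal sketch of robust summaries plus z-score outlier flagging; the toy sample and the |z| > 3 cutoff are illustrative choices, not fixed rules:
import numpy as np
rng = np.random.default_rng(42)
x = np.append(rng.normal(50, 10, 200), [120, 130])  # toy sample with two planted outliers
print('mean:', x.mean(), 'median:', np.median(x))
print('std:', x.std(ddof=1), 'IQR:', np.percentile(x, 75) - np.percentile(x, 25))
z = (x - x.mean()) / x.std(ddof=1)  # z-scores
print('flagged outliers (|z| > 3):', x[np.abs(z) > 3])
Note that the median and IQR barely move when the outliers are added, while the mean and standard deviation shift; that is why robust summaries matter for skewed data.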
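For the "Sample & reason" step, a quick central limit theorem simulation: means of samples from a skewed population become tighter and more symmetric as n grows. The population and sample sizes below are arbitrary choices for illustration:
import numpy as np
rng = np.random.default_rng(0)
pop = rng.exponential(scale=2.0, size=100_000)  # deliberately skewed population
for n in (2, 30, 200):
    means = rng.choice(pop, size=(5000, n)).mean(axis=1)
    # Spread of sample means shrinks like sigma/sqrt(n); shape approaches Normal
    print(f'n={n}: mean of means={means.mean():.3f}, sd of means={means.std(ddof=1):.3f}')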
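For the "Test" step, the chi-square test is the one not shown in the worked examples below; this sketch applies scipy's independence test to the same (illustrative) A/B counts used in worked example 1:
from scipy.stats import chi2_contingency
import numpy as np
# 2x2 table: rows = variants A/B, columns = converted / not converted
table = np.array([[260, 4740],
                  [300, 4700]])
chi2, pval, dof, expected = chi2_contingency(table)
print('chi-square p-value:', pval)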
Worked examples
1) A/B test on conversion rate (proportions z-test)
Scenario: Variant B shows 6.0% conversion vs 5.2% for control A. Is the difference significant at the 5% level?
import numpy as np
from statsmodels.stats.proportion import proportions_ztest
conv_A, n_A = 260, 5000 # 5.2%
conv_B, n_B = 300, 5000 # 6.0%
count = np.array([conv_B, conv_A])
nobs = np.array([n_B, n_A])
z_stat, p_val = proportions_ztest(count, nobs, alternative='larger')
print(z_stat, p_val)
# 95% CI for difference using normal approx
pA, pB = conv_A/n_A, conv_B/n_B
diff = pB - pA
se = np.sqrt(pA*(1-pA)/n_A + pB*(1-pB)/n_B)
ci_low, ci_high = diff - 1.96*se, diff + 1.96*se
print(diff, (ci_low, ci_high))
Interpretation: If the p-value is below 0.05 and the CI for the difference excludes 0, you have evidence that B outperforms A, with the CI giving the plausible size of the lift.
Try it: compute required sample size
Target detectable lift: +0.6pp (5.2% to 5.8%), alpha=0.05, power=0.8. Use a power calculator or approximate with the standard normal formula:
from math import sqrt
from scipy.stats import norm
p1 = 0.052  # baseline conversion
p2 = 0.058  # target conversion (+0.6pp)
alpha = 0.05
power = 0.80
z_alpha = norm.ppf(1 - alpha/2)
z_beta = norm.ppf(power)
pooled = (p1 + p2) / 2
# Pooled variance under H0, per-group variances under H1
n_per_group = (z_alpha*sqrt(2*pooled*(1-pooled))
               + z_beta*sqrt(p1*(1-p1) + p2*(1-p2)))**2 / (p2 - p1)**2
print(round(n_per_group))
2) Linear regression for pricing
Predict price from size and location rating; check assumptions and interpret coefficients.
import numpy as np
import pandas as pd
import statsmodels.api as sm
# Fake data
df = pd.DataFrame({
    'price': [210, 220, 250, 260, 275, 300, 320, 350, 360, 390],
    'size_m2': [45, 48, 52, 55, 58, 60, 65, 70, 72, 80],
    'loc_rating': [3.1, 3.0, 3.2, 3.4, 3.5, 3.6, 3.8, 4.0, 4.1, 4.3]
})
X = sm.add_constant(df[['size_m2', 'loc_rating']])
model = sm.OLS(df['price'], X).fit()
print(model.summary())
# Residual diagnostics
resid = model.resid
print('Mean residual ~ 0:', resid.mean())
# Homoskedasticity proxy: correlation between |residuals| and fitted values
print(np.corrcoef(np.abs(resid), model.fittedvalues)[0, 1])
Interpret: Coefficients show the marginal effect of each feature holding the others fixed. Check linearity (residuals vs fitted), normality (QQ plot), and influential points (Cook's distance) before drawing inferences.
Try it: add an interaction
Add size_m2 * loc_rating and see if fit improves (lower AIC, significant coefficient).
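One way to try this, a sketch using statsmodels' formula API (in the formula syntax, size_m2 * loc_rating expands to both main effects plus their interaction; df is the frame from the example above):
import statsmodels.formula.api as smf
# Compare fit with and without the interaction term
base = smf.ols('price ~ size_m2 + loc_rating', data=df).fit()
inter = smf.ols('price ~ size_m2 * loc_rating', data=df).fit()
print('AIC without interaction:', base.aic)
print('AIC with interaction:', inter.aic)
print(inter.params)  # look for a meaningful size_m2:loc_rating coefficient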
3) Bootstrap CI for the median
When data are skewed or heavy-tailed, bootstrap the median's CI.
import numpy as np
rng = np.random.default_rng(7)
data = rng.lognormal(mean=1.5, sigma=0.8, size=100)
B = 5000
boot_meds = []
for _ in range(B):
    sample = rng.choice(data, size=len(data), replace=True)
    boot_meds.append(np.median(sample))
ci = (np.percentile(boot_meds, 2.5), np.percentile(boot_meds, 97.5))
print('Median:', np.median(data), '95% CI:', ci)
Report the median and its 95% bootstrap CI, and state that the interval comes from resampling.
4) Bayesian update for a conversion rate
Prior belief: Beta(1,1) (uniform). Observe 60 conversions out of 1000. Posterior is Beta(1+60, 1+940).
from scipy.stats import beta
alpha_post, beta_post = 1+60, 1+940
mean = alpha_post / (alpha_post + beta_post)
ci = beta.ppf([0.025, 0.975], alpha_post, beta_post)
print('Posterior mean:', mean, '95% credible interval:', ci)
Compare to the frequentist CI for a proportion; they will be similar for weak priors and moderate n.
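For that comparison, a quick frequentist interval via statsmodels (Wilson is one common method choice; 'normal' gives the plain Wald approximation):
from statsmodels.stats.proportion import proportion_confint
# Wilson interval for 60 conversions out of 1000
low, high = proportion_confint(60, 1000, alpha=0.05, method='wilson')
print('95% frequentist CI:', (low, high))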
5) Time series quickstart: trend, seasonality, stationarity
import pandas as pd
import numpy as np
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller
# Simulated monthly data
dates = pd.date_range('2022-01-01', periods=36, freq='M')
trend = np.linspace(100, 140, 36)
season = 10*np.sin(2*np.pi*np.arange(36)/12)
noise = np.random.default_rng(0).normal(0, 3, 36)
series = pd.Series(trend + season + noise, index=dates)
result = seasonal_decompose(series, model='additive', period=12)
# result.trend, result.seasonal, result.resid are available
adf_stat, pvalue, *_ = adfuller(series.dropna())
print('ADF p-value:', pvalue)
If the p-value is high, the series may be non-stationary; difference the series and test again.
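A sketch of that follow-up, continuing from the series and the adfuller import above:
# First difference removes the trend; drop the leading NaN, then re-test
diffed = series.diff().dropna()
adf_stat_d, pvalue_d, *_ = adfuller(diffed)
print('ADF p-value after differencing:', pvalue_d)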
6) Benjamini–Hochberg (FDR) for many metrics
Suppose you ran 10 significance tests and got these p-values:
import numpy as np
p = np.array([0.001, 0.004, 0.012, 0.019, 0.041, 0.052, 0.12, 0.23, 0.31, 0.77])
alpha = 0.05
m = len(p)
order = np.argsort(p)
p_sorted = p[order]
thresholds = alpha * (np.arange(1, m+1)/m)
# Find largest k with p_(k) <= threshold_k
k = np.where(p_sorted <= thresholds)[0]
cut_index = k.max() if len(k) else -1
significant_mask_sorted = np.zeros(m, dtype=bool)
if cut_index >= 0:
    significant_mask_sorted[:cut_index+1] = True
# Map back to original order
significant_mask = np.zeros(m, dtype=bool)
significant_mask[order] = significant_mask_sorted
print('Significant flags:', significant_mask)
This controls expected false discovery rate at 5% across all tests.
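In practice you can let statsmodels do the bookkeeping; reusing the p array from above, multipletests with method='fdr_bh' should reproduce the same flags:
from statsmodels.stats.multitest import multipletests
# method='fdr_bh' is the Benjamini-Hochberg procedure
reject, p_adj, _, _ = multipletests(p, alpha=0.05, method='fdr_bh')
print('Significant flags:', reject)
print('BH-adjusted p-values:', p_adj)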
Common mistakes and how to debug
- Peeking in A/B tests: Stopping early on a significant result inflates false positives. Fix: use a fixed sample plan or a sequential method designed for peeking.
- Misreading p-values: p=0.04 does not mean a 96% chance the effect is real. Correct: it is the probability of observing data at least as extreme as yours, assuming the null hypothesis (no effect) is true.
- Ignoring assumptions: t-tests and OLS need approximate normality of errors and independence. Fix: inspect residuals; use non-parametric or robust methods if violated.
- Multiple comparisons: Testing many segments inflates false positives. Fix: pre-register metrics; control FDR with BH.
- Overfitting regression: Too many features relative to n. Fix: cross-validate; simplify model; regularize.
- Confusing correlation with causation: Observational differences are not causal. Fix: use experiments or causal methods.
Mini project: Ship a trustworthy A/B test report
- Define metrics: Primary conversion rate; secondary metrics (e.g., revenue per user, click-through). State hypotheses and alpha. Decide on FDR control for secondary metrics.
- Plan sample size: Pick a minimum detectable effect and 80% power; compute per-group n.
- Collect data: Ensure random assignment, logging of exposure, conversions, and timestamps.
- Analyze: For the primary metric, run a two-proportion z-test and compute a 95% CI. For secondary metrics, apply BH FDR.
- Diagnostics: Check balance (user counts, baseline rate). Look for novelty effects or time trends.
- Report: Summarize effect sizes with uncertainty, decisions, limitations, and recommendations.
Deliverables
- Notebook/script with computations and plots.
- One-page summary: objective, method, main result with CI, guardrail metrics, decision, next steps.
Subskills
- Descriptive Statistics: Summaries (mean, median, variance, quantiles), outliers, and shape of distributions.
- Sampling And Distributions: Random sampling, CLT, and common distributions (Normal, t, Binomial, Poisson).
- Estimation And Confidence Intervals: Standard error, CIs for means/proportions, bootstrap.
- Hypothesis Testing: t-tests, z-tests for proportions, chi-square tests, p-values, power.
- Regression Basics: Linear regression, interpretation, diagnostics, simple regularization awareness.
- Bayesian Basics: Priors, likelihood, Beta-Binomial updates, credible intervals.
- Time Series Basics: Trend, seasonality, stationarity, simple forecasting baselines.
- Statistical Assumptions And Diagnostics: Residual checks, influence, robustness.
- Multiple Testing And False Discovery Awareness: FDR control with Benjamini–Hochberg.
Who this is for
- Aspiring and junior Data Scientists who need solid inference skills for experiments and modeling.
- Analysts and ML engineers who want to quantify uncertainty and make defensible decisions.
Prerequisites
- Comfort with basic algebra and functions.
- Python basics (lists, arrays) or R basics; ability to run notebooks or scripts.
- Familiarity with data frames (pandas or similar) is helpful.
Learning path
- Start with Descriptive Statistics and Sampling And Distributions.
- Move to Estimation And Confidence Intervals and Hypothesis Testing.
- Practice Regression Basics and Statistical Assumptions And Diagnostics.
- Add Bayesian Basics and Time Series Basics.
- Finish with Multiple Testing And False Discovery Awareness and a capstone A/B analysis.
Practical projects
- Analyze a funnel: compute stage-wise rates with CIs; identify the biggest drop with uncertainty.
- Marketing uplift: test email versions; size the test; run BH over multiple segments.
- Retention forecast: decompose weekly active users; build a naive seasonal forecast; evaluate error.
- Pricing model: regress price on features; validate assumptions; communicate elasticities.
Next steps
- Complete the subskills below in order.
- Do the mini project and share your one-page report with a peer.
- Take the skill exam to check readiness. Anyone can take it; logged-in users get saved progress.