
Probability

Learn Probability for Data Scientists for free: roadmap, examples, subskills, and a skill exam.

Published: January 1, 2026 | Updated: January 1, 2026

Why Probability matters for Data Scientists

Probability is the language of uncertainty. As a Data Scientist, you use it to reason about noisy data, assess model risk, design A/B tests, build Bayesian models, simulate outcomes, and communicate confidence. Mastering probability helps you make sound decisions under uncertainty and avoid common analytical traps.

  • Experimentation: compute p-values, power, and credible intervals.
  • Modeling: choose and fit appropriate distributions for data (Poisson, Binomial, Gaussian, etc.).
  • Inference: apply Bayes’ rule and understand priors/posteriors.
  • Simulation: validate assumptions and estimate quantities you cannot solve analytically.
  • Sequential behavior: represent processes with Markov chains (e.g., user states in a funnel).

Who this is for

  • Career-switchers into Data Science wanting a strong statistical foundation.
  • Analysts/engineers who run experiments or build predictive models.
  • Students reinforcing theory with practical, code-first examples.

Prerequisites

  • Comfort with basic algebra and functions.
  • Familiarity with Python and NumPy/Pandas is helpful for examples (not strictly required).
  • Basic descriptive statistics (mean, variance, percentiles).

Learning path (practical roadmap)

  1. Core probability rules — events, complements, independence, conditional probability, Bayes’ rule.
  2. Random variables & distributions — Bernoulli, Binomial, Poisson, Normal, Exponential; PMF/PDF/CDF.
  3. Moments — expectation, variance, covariance, correlation; linearity of expectation.
  4. Asymptotics — Law of Large Numbers (LLN) and Central Limit Theorem (CLT) for confidence intervals.
  5. Inequalities — Markov and Chebyshev for conservative bounds.
  6. Markov chains — state transitions, powers of transition matrices, stationary distributions.
  7. Simulation/Monte Carlo — random sampling, estimators, variance reduction, experiment simulation.
  8. Probabilistic thinking for modeling — choosing distributions, priors, assumptions, and validation.
Milestone tips
  • Pair every concept with at least one coded example (even a small one-liner).
  • Simulate to build intuition when formulas feel abstract.
  • Keep a personal “assumptions checklist” for every analysis.
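As one small illustration of the "simulate to build intuition" tip, the sketch below tracks the running proportion of heads in a sequence of fair coin flips and shows it settling toward 0.5, which is the Law of Large Numbers at work. The function name `running_mean_of_flips`, the seed, and the checkpoints are illustrative choices, not part of the roadmap.

```python
import random

def running_mean_of_flips(n_flips, p=0.5, seed=0):
    """Return the running proportion of heads after each flip."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    heads = 0
    means = []
    for i in range(1, n_flips + 1):
        heads += rng.random() < p  # True counts as 1
        means.append(heads / i)
    return means

means = running_mean_of_flips(100_000)
for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"n={n:>6}: proportion of heads = {means[n - 1]:.4f}")
```

Early checkpoints can be far from 0.5; the later ones should sit much closer, and changing the seed shows how much the early values vary.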

Worked examples (with code)

1) Bayes for simple spam filtering

Compute P(Spam | contains “win”).

See solution
# Suppose: P(Spam)=0.2, P(Word="win"|Spam)=0.4, P(Word="win"|Ham)=0.05
p_spam = 0.2
p_win_given_spam = 0.4
p_win_given_ham = 0.05
p_ham = 1 - p_spam
# Law of total probability: P(win) = P(win|Spam)P(Spam) + P(win|Ham)P(Ham) = 0.12
p_win = p_spam*p_win_given_spam + p_ham*p_win_given_ham
# Bayes' rule: P(Spam|win) = P(win|Spam)P(Spam) / P(win) = 0.08 / 0.12
posterior = (p_spam*p_win_given_spam) / p_win
print(posterior)  # ≈ 0.667
# Interpretation: if posterior > threshold (like 0.5 or cost-weighted), flag as spam.

2) Binomial probability for an A/B test

What is P(X ≥ 60) for X ~ Binomial(n=200, p=0.25)?

See code
from math import comb
n, p = 200, 0.25
prob = sum(comb(n, k) * (p**k) * ((1-p)**(n-k)) for k in range(60, n+1))
prob

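Use a Normal approximation if you need speed: mean = np = 50, variance = np(1−p) = 37.5 (sd ≈ 6.12). Apply a continuity correction, so that P(X ≥ 60) ≈ P(Z ≥ (59.5 − 50)/6.12).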
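To see how close the approximation is for these numbers, the following sketch compares the exact Binomial tail with a continuity-corrected Normal tail. The helper names `binom_upper_tail` and `normal_upper_tail` are made up for this illustration; `statistics.NormalDist` is from the Python standard library.

```python
from math import comb, sqrt
from statistics import NormalDist

def binom_upper_tail(n, p, k):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def normal_upper_tail(n, p, k):
    """Normal approximation to P(X >= k), using a continuity correction."""
    mean = n * p
    sd = sqrt(n * p * (1 - p))
    # P(X >= k) for an integer-valued X is approximated by P(Normal >= k - 0.5)
    return 1 - NormalDist(mean, sd).cdf(k - 0.5)

exact = binom_upper_tail(200, 0.25, 60)
approx = normal_upper_tail(200, 0.25, 60)
print(f"exact = {exact:.4f}, normal approx = {approx:.4f}")
```

The two values should agree to within a few thousandths here; the gap tends to widen as p moves toward 0 or 1 or as n shrinks.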

3) Expected value and variance of revenue

Let daily revenue R = 5X - 2Y, where X and Y are independent counts with E[X]=12, Var[X]=9; E[Y]=4, Var[Y]=4.

See solution

E[R] = 5E[X] - 2E[Y] = 5*12 - 2*4 = 60 - 8 = 52.

Var[R] = 25 Var[X] + 4 Var[Y] (no cross term since independent) = 25*9 + 4*4 = 225 + 16 = 241.
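A quick simulation can sanity-check both results. The sketch below is only a check under a loud assumption: the example does not say how X and Y are distributed, so Normal draws with the stated means and variances stand in for the counts. Only the first two moments matter for this comparison.

```python
import random
from statistics import fmean

rng = random.Random(1)  # seeded for reproducibility
n = 100_000

# Assumption: Normal stand-ins with the stated moments (real counts would be discrete).
xs = [rng.gauss(12, 3) for _ in range(n)]  # E[X]=12, Var[X]=9  -> sd=3
ys = [rng.gauss(4, 2) for _ in range(n)]   # E[Y]=4,  Var[Y]=4  -> sd=2
rs = [5 * x - 2 * y for x, y in zip(xs, ys)]

mean_r = fmean(rs)
var_r = fmean((r - mean_r) ** 2 for r in rs)  # population-style variance
print(f"simulated E[R]   = {mean_r:.2f} (theory: 52)")
print(f"simulated Var[R] = {var_r:.1f} (theory: 241)")
```

Note that the minus sign in R = 5X - 2Y lowers the mean but still adds to the variance, because the coefficient is squared.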

4) CLT-based confidence interval for a mean

Sample of n=400 sessions, sample mean = 5.4 min, sample sd = 2.0 min. Approximate 95% CI for the true mean.

See solution

SE = 2 / sqrt(400) = 0.1, so 95% CI ≈ 5.4 ± 1.96*0.1 = (5.204, 5.596).
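The same calculation as a reusable helper, offered as a minimal sketch: `mean_ci` is a hypothetical name, and the interval relies on the CLT, so it assumes n is large enough for the Normal approximation to be reasonable.

```python
from math import sqrt
from statistics import NormalDist

def mean_ci(sample_mean, sample_sd, n, level=0.95):
    """CLT-based confidence interval for a population mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% level
    se = sample_sd / sqrt(n)                   # standard error of the mean
    return sample_mean - z * se, sample_mean + z * se

lo, hi = mean_ci(5.4, 2.0, 400)
print(f"95% CI: ({lo:.3f}, {hi:.3f})")  # (5.204, 5.596)
```

For small samples, a t-based interval is the more usual choice, since the Normal quantile understates the uncertainty in the estimated standard deviation.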

5) Markov chain: predicting user state

States: {New, Active, Churn}. Transition matrix rows sum to 1:

import numpy as np
P = np.array([
  [0.1, 0.8, 0.1],  # New → New,Active,Churn
  [0.0, 0.9, 0.1],  # Active → New,Active,Churn
  [0.0, 0.0, 1.0],  # Churn → absorbing
])
pi0 = np.array([1.0, 0.0, 0.0])  # start with all users New
pi2 = pi0 @ np.linalg.matrix_power(P, 2)
pi2
Interpretation

pi2 shows the distribution over states after 2 periods. Use for forecasting churn and planning re-engagement.
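To forecast further out, the distribution can be stepped forward one period at a time. The sketch below repeats the transition matrix from the example and uses plain Python lists so it runs without NumPy; the helper names `step` and `forecast` and the 12-period horizon are illustrative choices.

```python
# Same transition matrix as above; state order: New, Active, Churn
P = [
    [0.1, 0.8, 0.1],  # New
    [0.0, 0.9, 0.1],  # Active
    [0.0, 0.0, 1.0],  # Churn (absorbing)
]

def step(dist, P):
    """Advance one period: row vector `dist` times transition matrix `P`."""
    k = len(P)
    return [sum(dist[i] * P[i][j] for i in range(k)) for j in range(k)]

def forecast(dist, P, periods):
    """Return the list of state distributions for t = 0..periods."""
    history = [dist]
    for _ in range(periods):
        dist = step(dist, P)
        history.append(dist)
    return history

history = forecast([1.0, 0.0, 0.0], P, periods=12)
for t in (1, 2, 6, 12):
    new, active, churn = history[t]
    print(f"t={t:>2}: New={new:.3f} Active={active:.3f} Churn={churn:.3f}")
```

Because Churn is absorbing in this matrix, the churn share can only grow over time; a model with win-back would need a nonzero entry out of the Churn row.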

Drills and quick exercises

  • ☐ Compute P(A∪B) given P(A), P(B), and P(A∩B).
  • ☐ For X ~ Poisson(λ=3), calculate P(X ≤ 2).
  • ☐ Show that E[aX + b] = aE[X] + b for any constants a, b.
  • ☐ Simulate 10,000 runs of 100 fair coin flips each and estimate P(≥ 60 heads in 100 flips).
  • ☐ Use CLT to build a 95% CI for a sample mean of your choosing.
  • ☐ Construct a 2-state Markov chain and find its stationary distribution.
  • ☐ Apply Bayes’ rule to a medical test with any plausible parameters you pick.

Common mistakes and debugging tips

  • Confusing independence with disjointness: disjoint events cannot both occur, whereas independent events can. Disjoint events with nonzero probabilities are in fact always dependent. Check P(A∩B) = P(A)P(B) for independence.
  • Forgetting base rates in Bayes: a highly accurate test can still yield many false positives when prevalence is low. Always compute P(+) correctly.
  • Using Normal approximations too casually: check n·p and n·(1−p) ≥ ~10 for Binomial; otherwise consider exact methods or continuity corrections.
  • Ignoring variance in decision-making: compare expected value and uncertainty. Report intervals, not just point estimates.
  • Misusing CLT with heavy tails: large outliers slow convergence. Consider robust estimators or transformations.
  • Markov chain misuse: ensure each row sums to 1 and entries are non-negative. Validate with small power checks: the rows of P² and P³ should also sum to 1.
  • Simulation bugs: seed randomness for reproducibility; verify simple moments (mean/variance) match theory before complex metrics.
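The base-rate point above is easy to underestimate, so here is a minimal numeric sketch. The prevalence, sensitivity, and specificity values are invented for illustration, and `positive_predictive_value` is a hypothetical helper name.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(condition | positive test) via Bayes' rule."""
    # P(+) = P(+|condition)P(condition) + P(+|no condition)P(no condition)
    p_pos = prevalence * sensitivity + (1 - prevalence) * (1 - specificity)
    return prevalence * sensitivity / p_pos

# Illustrative numbers: 1% prevalence, 99% sensitivity, 95% specificity
ppv = positive_predictive_value(prevalence=0.01, sensitivity=0.99, specificity=0.95)
print(f"P(condition | positive) = {ppv:.3f}")  # about 0.167
```

Even with a test this accurate, only about one in six positives is a true positive at 1% prevalence, because false positives from the large healthy group outnumber the true positives.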

Mini project: A/B Test Outcome Simulator

Build a tool that simulates an A/B test end-to-end and compares frequentist and Bayesian conclusions.

  1. Define true conversion rates pA and pB and choose sample sizes.
  2. Simulate outcomes with Binomial sampling for each variant.
  3. Compute: (a) z-test and 95% CI for the difference; (b) Bayesian posterior with Beta priors and the probability that B > A.
  4. Repeat many times (Monte Carlo) to estimate power and false positive rate.
  5. Visualize distributions and intervals; log assumptions and decisions.
Starter code
import numpy as np
from scipy.stats import beta, norm
rng = np.random.default_rng(42)

pA, pB = 0.10, 0.12
nA, nB = 1000, 1000
sims = 5000

z_wins = 0
bayes_wins = 0

for _ in range(sims):
    xA = rng.binomial(nA, pA)
    xB = rng.binomial(nB, pB)
    pA_hat, pB_hat = xA/nA, xB/nB

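    # z-test for difference in proportions (unpooled standard error)
    se = np.sqrt(pA_hat*(1-pA_hat)/nA + pB_hat*(1-pB_hat)/nB)
    z = (pB_hat - pA_hat) / (se + 1e-12)  # tiny constant guards against se == 0
    pval = 2*norm.sf(abs(z))              # two-sided p-value; sf(x) = 1 - cdf(x)
    # Count a win only when the result is significant and B is ahead
    if pval < 0.05 and pB_hat > pA_hat:
        z_wins += 1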

    # Bayesian with Beta(1,1) priors
    postA = beta(xA+1, nA-xA+1)
    postB = beta(xB+1, nB-xB+1)
    # Monte Carlo posterior comparison
    drawA = postA.rvs(2000, random_state=rng)
    drawB = postB.rvs(2000, random_state=rng)
    prob_B_better = np.mean(drawB > drawA)
    if prob_B_better > 0.95:
        bayes_wins += 1

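# Share of simulated tests in which each method declares B the winner.
# With pB > pA this estimates power; set pA == pB above to estimate
# the false positive rate instead.
z_power = z_wins / sims
bayes_power = bayes_wins / sims
z_power, bayes_power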

Deliverables: (1) notebook or script, (2) chart of power vs. sample size, (3) a short write-up of assumptions and recommendations.
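For the power vs. sample size deliverable, an analytic approximation is a useful cross-check on the simulated numbers. The sketch below uses the standard Normal approximation for a two-sided z-test on a difference in proportions and counts only wins in the direction of the true effect (the opposite tail is negligible here). The function name `approx_power` and the grid of sample sizes are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(p_a, p_b, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided z-test for p_b vs. p_a (equal arms)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    se = sqrt(p_a * (1 - p_a) / n_per_arm + p_b * (1 - p_b) / n_per_arm)
    # Probability that the observed z-statistic exceeds the critical value
    return std_normal.cdf(abs(p_b - p_a) / se - z_crit)

for n in (500, 1000, 2000, 4000, 8000):
    print(f"n per arm = {n:>5}: approx power = {approx_power(0.10, 0.12, n):.2f}")
```

With the starter settings (pA = 0.10, pB = 0.12, 1,000 users per arm) this gives a power of roughly 0.3, so the simulated z-test win rate should land in that neighborhood; a large mismatch would suggest a bug in the simulation.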

Practical projects

  • Churn Markov Model: define states (Active, Passive, Churn), estimate transition matrix from data, forecast retention.
  • Demand Modeling: fit Poisson/Negative Binomial to daily orders, simulate inventory risk and stockout probabilities.
  • Risk Scoring: build a simple Bayesian spam/fraud score using word/feature likelihoods and a tunable prior.

Subskills

  • Random Variables and Distributions — Understand PMF/PDF/CDF and when to use Bernoulli, Binomial, Poisson, Normal, Exponential.
  • Conditional Probability and Bayes Rule — Compute posteriors and reason with base rates in practical settings.
  • Expectation, Variance, Covariance — Calculate and combine moments; interpret correlation vs. causation carefully.
  • Law of Large Numbers and CLT — Use sampling distributions to form confidence intervals and sanity-check estimates.
  • Probability Inequalities Basics — Apply Markov and Chebyshev for conservative bounds when assumptions are weak.
  • Markov Chains Basics — Model sequential user states and long-run behavior.
  • Simulation and Monte Carlo — Estimate complex probabilities, validate models, and plan experiments.
  • Probabilistic Thinking for Modeling — Map business questions to probabilistic structures and test assumptions.

Next steps

  • Re-implement every example with your own numbers and validate via simulation.
  • Apply probability to one real dataset (experimentation, funnel, or demand).
  • Move on to statistical inference and causal analysis after you are comfortable with CLT and Bayesian basics.

Probability — Skill Exam

This exam checks your grasp of core probability for Data Science. Answer all questions in one sitting. Everyone can take the exam for free. Only logged-in users will have their progress and scores saved.

Tips: Show work on scratch paper for numeric items. For approximate numerics, stay within the stated tolerance.

12 questions | 70% to pass
