Why this matters for Data Scientists
Random variables and distributions let you quantify uncertainty. You will use them to:
- Design and interpret A/B tests (Binomial, Normal approximation).
- Model event counts like errors, tickets, or clicks per minute (Poisson).
- Estimate SLAs and risk using tail probabilities (Normal, Exponential).
- Calibrate model outputs and set decision thresholds (distributions and quantiles).
- Simulate outcomes to compare product or policy choices (Monte Carlo with known distributions).
Concept explained simply
A random variable is a rule that turns uncertain outcomes into numbers. A distribution tells you how likely each number (or range) is.
- Discrete random variable: takes separate values (e.g., number of signups). Described by a PMF P(X = x).
- Continuous random variable: takes any value in a range (e.g., time in seconds). Described by a PDF f(x). Probabilities come from areas: P(a ≤ X ≤ b) = ∫_a^b f(x) dx.
- CDF F(x): probability that X ≤ x. It works for both discrete and continuous cases.
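These definitions can be checked numerically. A minimal sketch in Python (standard library only), using a Binomial PMF for the discrete case and the standard Normal CDF, via the error function, for the continuous case:

```python
import math

def binom_pmf(k, n, p):
    """PMF: P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF: P(X <= x) for X ~ Normal(mu, sigma^2)."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# Discrete: the PMF bars sum to 1 over all possible values.
total_mass = sum(binom_pmf(k, 10, 0.3) for k in range(11))

# Continuous: probability is area, so use CDF differences.
p_within_one_sigma = normal_cdf(1) - normal_cdf(-1)  # about 0.683
```

Note how the continuous probability comes from a difference of CDF values, never from evaluating the density at a point.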
Cheat sheet: quantities you will use often
- Expectation (mean): E[X]. Linearity: E[aX + b] = aE[X] + b.
- Variance: Var(X) = E[(X − E[X])^2]. For constants: Var(aX + b) = a^2 Var(X).
- Bernoulli(p): E[X] = p, Var(X) = p(1−p).
- Binomial(n, p): E[X] = np, Var(X) = np(1−p).
- Poisson(λ): E[X] = Var(X) = λ.
- Normal(μ, σ^2): Z = (X − μ)/σ ~ Normal(0,1).
- Exponential(λ): mean 1/λ, memoryless.
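You can sanity-check these cheat-sheet values by simulation. A quick sketch for two of the entries, using only the standard library:

```python
import random
import statistics

random.seed(42)
N = 200_000  # enough samples for the sample moments to settle near theory

# Bernoulli(p): E[X] = p, Var(X) = p(1 - p)
p = 0.3
bernoulli = [1 if random.random() < p else 0 for _ in range(N)]
bern_mean = statistics.fmean(bernoulli)     # ~0.3
bern_var = statistics.pvariance(bernoulli)  # ~0.21

# Exponential(lam): mean 1/lam
lam = 2.0
expo = [random.expovariate(lam) for _ in range(N)]
expo_mean = statistics.fmean(expo)          # ~0.5
```

The same pattern (simulate, then compare sample mean and variance to the formulas) works for the Binomial, Poisson, and Normal rows.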
Mental model
Think of a distribution as a landscape of likelihood. For discrete variables, it is like a row of bars (heights = probabilities). For continuous variables, it is a smooth hill (height = density). The exact probability is the bar height (discrete) or the area under the curve (continuous) over a region.
Core formulas you will actually use
- Binomial probability: P(X = k) = C(n,k) p^k (1−p)^{n−k}.
- Poisson probability: P(X = k) = e^{−λ} λ^k / k!.
- Normal standardization: Z = (X − μ)/σ; use Z to find probabilities by areas.
- Normal approximation to Binomial: if np ≥ 10 and n(1−p) ≥ 10, X ≈ Normal(np, np(1−p)).
- Law of total expectation: E[X] = E[E[X | Y]].
- Scaling Poisson: if rate is λ per unit time, then over t units, use λt.
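These formulas translate directly to code. A sketch of the Poisson PMF, the Normal approximation to a Binomial (here n = 400, p = 0.5 as an arbitrary example that meets the np ≥ 10 rule), and Poisson rate scaling:

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def normal_cdf(x, mu, sigma):
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# Normal approximation to Binomial(n=400, p=0.5): np = n(1-p) = 200 >= 10.
n, p = 400, 0.5
mu, sigma = n * p, math.sqrt(n * p * (1 - p))
exact = sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(191))  # P(X <= 190)
approx = normal_cdf(190.5, mu, sigma)  # the extra 0.5 is a continuity correction

# Scaling Poisson: rate 2.5 per minute -> lam = 2.5 * t over t minutes.
p_zero_in_two_minutes = poisson_pmf(0, 2.5 * 2)
```

With n this large, the exact Binomial tail and the Normal approximation agree to within about a hundredth.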
Worked examples
1) Binomial: at least 3 signups out of 20 with p = 0.1
Let X ~ Binomial(n=20, p=0.1). We want P(X ≥ 3) = 1 − [P(0) + P(1) + P(2)].
- P(0) = 0.9^{20} ≈ 0.1216
- P(1) = 20·0.1·0.9^{19} ≈ 0.2702
- P(2) = C(20,2)·0.1^2·0.9^{18} ≈ 0.2852
Sum ≈ 0.6769, so P(X ≥ 3) ≈ 0.3231.
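The arithmetic above can be verified in a few lines:

```python
import math

n, p = 20, 0.1

def pmf(k):
    # Binomial probability: C(n, k) p^k (1 - p)^(n - k)
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

p_at_most_2 = pmf(0) + pmf(1) + pmf(2)  # ~0.6769
p_at_least_3 = 1 - p_at_most_2          # ~0.3231
```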
2) Normal: late deliveries beyond 40 minutes
Assume delivery time T ~ Normal(μ=30, σ=5). Probability of being late (T > 40): Z = (40 − 30)/5 = 2, so P(T > 40) ≈ 0.0228 (about 2.3%).
95th percentile: 30 + 1.645·5 ≈ 38.2 minutes.
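The same numbers in code, using the error-function form of the Normal CDF:

```python
import math

def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ Normal(mu, sigma^2)."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

mu, sigma = 30, 5
p_late = 1 - normal_cdf(40, mu, sigma)  # P(T > 40), ~0.0228
q95 = mu + 1.645 * sigma                # 95th percentile via z = 1.645, ~38.2
```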
3) Poisson: chance of at least one defect per batch
Defects per batch D ~ Poisson(λ=2.5). P(D ≥ 1) = 1 − P(0) = 1 − e^{−2.5} ≈ 0.918.
Two independent batches: P(no defects in both) = e^{−5} ≈ 0.0067, so P(at least one defect across two) ≈ 0.9933.
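A quick check of both batch calculations, using the fact that independent Poisson counts add:

```python
import math

lam = 2.5  # defects per batch
p_at_least_one = 1 - math.exp(-lam)  # ~0.918

# Total defects across two independent batches ~ Poisson(2 * lam).
p_none_in_two = math.exp(-2 * lam)   # e^-5, ~0.0067
p_any_in_two = 1 - p_none_in_two     # ~0.9933
```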
4) Mixtures: conversion rate across platforms
40% mobile with p=0.04, 60% desktop with p=0.06. Expected conversions per 100 users: 100 · (0.4·0.04 + 0.6·0.06) = 5.2.
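This is the law of total expectation in miniature; in code:

```python
traffic_share = {"mobile": 0.4, "desktop": 0.6}
conversion = {"mobile": 0.04, "desktop": 0.06}

# Overall rate is the share-weighted average of per-platform rates.
overall_rate = sum(traffic_share[k] * conversion[k] for k in traffic_share)
expected_per_100 = 100 * overall_rate  # 5.2
```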
Exercises you can try now
These mirror the interactive exercises below. Try them here first, then submit in the exercise panel.
- CTR estimation (Binomial/Normal approximation): You observe 20 clicks out of 200 impressions. Compute p-hat, a 95% normal-approximation confidence interval, and the expected number of clicks in the next 1000 impressions.
- Ticket times (Normal): T ~ Normal(μ=30, σ=8). Find P(T > 45) and the 90th percentile time.
- Calls (Poisson): Calls arrive at 2.5 per minute on average. Compute P(X ≥ 5) in one minute, and P(0 calls) in 30 seconds.
Hints (if you need a nudge)
- For a proportion, p-hat = x/n and SE ≈ sqrt(p-hat(1 − p-hat)/n); 95% CI ≈ p-hat ± 1.96·SE.
- For Normal, standardize with Z = (x − μ)/σ; use common Z values (1.28, 1.645, 1.96, 2.33).
- For Poisson, scale λ by time window: λnew = λ·t.
- Compute each answer symbolically before plugging in numbers.
- Check units (minutes vs. seconds; impressions vs. clicks).
- Validate that every probability lands between 0 and 1.
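The proportion hint maps directly to code. A minimal sketch using the CTR exercise's numbers (try it by hand first):

```python
import math

def proportion_ci(x, n, z=1.96):
    """Normal-approximation CI for a proportion: p_hat +/- z * sqrt(p_hat(1 - p_hat)/n)."""
    p_hat = x / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, p_hat - z * se, p_hat + z * se

p_hat, lo, hi = proportion_ci(20, 200)   # 20 clicks out of 200 impressions
expected_next_1000 = 1000 * p_hat        # expected clicks in the next 1000 impressions
```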
Common mistakes and how to self-check
- Mixing PDF with probability: For continuous X, P(X = a) = 0. Use areas under the curve, not f(a).
- Using Normal approximation when np or n(1−p) is too small: Check both are at least around 10.
- Forgetting to scale Poisson rates with time or area: Always adjust λ by the interval length.
- Confusing variance and standard deviation: SD is the square root of variance.
- Ignoring independence assumptions: Binomial needs independent trials with constant p. If not, consider alternative models.
Self-check mini-list
- Did I write down the distribution and its parameters before calculating?
- Did I verify assumptions (independence, identical p, rate stability)?
- Are my results sensible in magnitude and units?
Practical projects
- A/B test simulator: Simulate Binomial outcomes for control and treatment, compute lift distributions, and visualize overlap.
- Queue risk dashboard: Model ticket arrivals with Poisson; compute the probability of exceeding capacity in each 15-minute window.
- Anomaly thresholding: Fit a Normal distribution to a stable metric and set dynamic alert thresholds using quantiles (e.g., 99.5th percentile).
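For the queue risk dashboard, the core calculation is a Poisson tail probability. A sketch with placeholder numbers (a rate of 12 tickets per 15-minute window and a capacity of 20; both are hypothetical assumptions, not values from any real system):

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam_per_window = 12.0  # assumed average arrivals per 15-minute window
capacity = 20          # assumed tickets the team can handle per window

# P(arrivals > capacity) = 1 - P(X <= capacity)
p_exceed = 1 - sum(poisson_pmf(k, lam_per_window) for k in range(capacity + 1))
```

A dashboard would recompute this per window, with λ estimated from recent arrival data.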
Suggested steps for the A/B simulator
- Choose n and p for control and treatment.
- Simulate outcomes 10,000 times for each arm.
- Compute lift and the proportion lift > 0.
- Plot histograms and report the 95% interval of lift.
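The steps above can be sketched as follows (standard library only; the sample sizes and conversion rates are placeholder assumptions, and plotting is omitted):

```python
import random
import statistics

random.seed(7)

def simulate_binomial(n, p, sims):
    """Draw Binomial(n, p) counts by summing Bernoulli trials."""
    return [sum(1 for _ in range(n) if random.random() < p) for _ in range(sims)]

n, sims = 500, 2000                  # users per arm, simulated experiments
p_control, p_treatment = 0.05, 0.06  # assumed conversion rates

control = simulate_binomial(n, p_control, sims)
treatment = simulate_binomial(n, p_treatment, sims)

# Lift in conversion rate per simulated experiment.
lifts = sorted((t - c) / n for c, t in zip(control, treatment))
p_lift_positive = sum(l > 0 for l in lifts) / sims
ci_95 = (lifts[int(0.025 * sims)], lifts[int(0.975 * sims)])
```

Plot a histogram of `lifts` to see the overlap between arms; `p_lift_positive` and `ci_95` summarize how often and by how much treatment wins under these assumptions.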
Learning path
- Discrete basics: Bernoulli, Binomial, Geometric; expectation and variance.
- Counts: Poisson and Poisson processes; scaling and sums.
- Continuous basics: Uniform, Normal, Exponential; PDFs vs. CDFs.
- Approximations: Normal approx to Binomial; Central Limit Theorem intuition.
- Intervals and quantiles: Using Z-scores; interpreting tail risk.
- Mixtures and conditioning: Law of total expectation/variance.
- Simulation: Monte Carlo to validate analytic results.
Who this is for
- Aspiring and junior Data Scientists preparing for product analytics, experimentation, or modeling roles.
- Analysts and engineers who want reliable uncertainty estimates for decisions.
Prerequisites
- Comfort with basic algebra and percentages.
- Familiarity with mean, variance, and standard deviation.
- A calculator or spreadsheet for simple computations (optional: Python/R for simulation).
Next steps
- Practice by analyzing a small A/B test with Binomial confidence intervals.
- Estimate the probability of breaching an SLA using Normal tails.
- Move on to Sampling, CLT, and Hypothesis Testing to connect distributions with inference.
Mini challenge
A daily active user (DAU) session length S is modeled as a mixture: 70% short users with Exponential(λ = 1/10 per minute, mean 10 min) and 30% power users with Exponential(λ = 1/40 per minute, mean 40 min). Compute:
- The expected session length E[S].
- P(S > 30 min).
Show reasoning
- E[S] = 0.7·10 + 0.3·40 = 7 + 12 = 19 minutes.
- P(S > 30) = 0.7·e^{−30/10} + 0.3·e^{−30/40} ≈ 0.7·e^{−3} + 0.3·e^{−0.75} ≈ 0.7·0.0498 + 0.3·0.4724 ≈ 0.0349 + 0.1417 ≈ 0.1766.
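A quick numeric check of both mixture answers:

```python
import math

w_short, w_power = 0.7, 0.3
mean_short, mean_power = 10.0, 40.0  # Exponential means in minutes

# Mixture mean is the weighted average of component means.
expected_length = w_short * mean_short + w_power * mean_power  # 19 minutes

# Exponential tail: P(S > t) = exp(-t / mean); mix the tails with the same weights.
p_over_30 = w_short * math.exp(-30 / mean_short) + w_power * math.exp(-30 / mean_power)
```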
Quick Test
Everyone can take the Quick Test below. If you log in, your progress will be saved.