Why this matters for Data Scientists
Random variables and distributions let you quantify uncertainty. You will use them to:
- Design and interpret A/B tests (Binomial, Normal approximation).
- Model event counts like errors, tickets, or clicks per minute (Poisson).
- Estimate SLAs and risk using tail probabilities (Normal, Exponential).
- Calibrate model outputs and set decision thresholds (distributions and quantiles).
- Simulate outcomes to compare product or policy choices (Monte Carlo with known distributions).
Concept explained simply
A random variable is a rule that turns uncertain outcomes into numbers. A distribution tells you how likely each number (or range) is.
- Discrete random variable: takes separate values (e.g., number of signups). Described by a PMF P(X = x).
- Continuous random variable: takes any value in a range (e.g., time in seconds). Described by a PDF f(x). Probabilities come from areas: P(a ≤ X ≤ b) = ∫_a^b f(x) dx.
- CDF F(x): probability that X ≤ x. It works for both discrete and continuous cases.
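These definitions can be checked numerically. A minimal sketch in Python (standard library only), using a Binomial PMF for the discrete case and the standard Normal CDF, via the error function, for the continuous case:

```python
import math

def binom_pmf(k, n, p):
    """PMF: P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF: P(X <= x) for X ~ Normal(mu, sigma^2)."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# Discrete: the PMF bars sum to 1 over all possible values.
total_mass = sum(binom_pmf(k, 10, 0.3) for k in range(11))

# Continuous: probability is area, so use CDF differences.
p_within_one_sigma = normal_cdf(1) - normal_cdf(-1)  # about 0.683
```

Note how the continuous probability comes from a difference of CDF values, never from evaluating the density at a point.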
Cheat sheet: quantities you will use often
- Expectation (mean): E[X]. Linearity: E[aX + b] = aE[X] + b.
- Variance: Var(X) = E[(X − E[X])^2]. For constants: Var(aX + b) = a^2 Var(X).
- Bernoulli(p): E[X] = p, Var(X) = p(1−p).
- Binomial(n, p): E[X] = np, Var(X) = np(1−p).
- Poisson(λ): E[X] = Var(X) = λ.
- Normal(μ, σ^2): Z = (X − μ)/σ ~ Normal(0,1).
- Exponential(λ): mean 1/λ, memoryless.
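You can sanity-check these cheat-sheet values by simulation. A quick sketch for two of the entries, using only the standard library:

```python
import random
import statistics

random.seed(42)
N = 200_000  # enough samples for the sample moments to settle near theory

# Bernoulli(p): E[X] = p, Var(X) = p(1 - p)
p = 0.3
bernoulli = [1 if random.random() < p else 0 for _ in range(N)]
bern_mean = statistics.fmean(bernoulli)     # ~0.3
bern_var = statistics.pvariance(bernoulli)  # ~0.21

# Exponential(lam): mean 1/lam
lam = 2.0
expo = [random.expovariate(lam) for _ in range(N)]
expo_mean = statistics.fmean(expo)          # ~0.5
```

The same pattern (simulate, then compare sample mean and variance to the formulas) works for the Binomial, Poisson, and Normal rows.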
Mental model
Think of a distribution as a landscape of likelihood. For discrete variables, it is like a row of bars (heights = probabilities). For continuous variables, it is a smooth hill (height = density). The exact probability is the bar height (discrete) or the area under the curve (continuous) over a region.
Core formulas you will actually use
- Binomial probability: P(X = k) = C(n,k) p^k (1−p)^{n−k}.
- Poisson probability: P(X = k) = e^{−λ} λ^k / k!.
- Normal standardization: Z = (X − μ)/σ; use Z to find probabilities by areas.
- Normal approximation to Binomial: if np ≥ 10 and n(1−p) ≥ 10, X ≈ Normal(np, np(1−p)).
- Law of total expectation: E[X] = E[E[X | Y]].
- Scaling Poisson: if rate is λ per unit time, then over t units, use λt.
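These formulas translate directly to code. A sketch of the Poisson PMF, the Normal approximation to a Binomial (here n = 400, p = 0.5 as an arbitrary example that meets the np ≥ 10 rule), and Poisson rate scaling:

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def normal_cdf(x, mu, sigma):
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# Normal approximation to Binomial(n=400, p=0.5): np = n(1-p) = 200 >= 10.
n, p = 400, 0.5
mu, sigma = n * p, math.sqrt(n * p * (1 - p))
exact = sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(191))  # P(X <= 190)
approx = normal_cdf(190.5, mu, sigma)  # the extra 0.5 is a continuity correction

# Scaling Poisson: rate 2.5 per minute -> lam = 2.5 * t over t minutes.
p_zero_in_two_minutes = poisson_pmf(0, 2.5 * 2)
```

With n this large, the exact Binomial tail and the Normal approximation agree to within about a hundredth.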
Worked examples
1) Binomial: at least 3 signups out of 20 with p = 0.1
Let X ~ Binomial(n=20, p=0.1). We want P(X ≥ 3) = 1 − [P(0) + P(1) + P(2)].
- P(0) = 0.9^{20} ≈ 0.1216
- P(1) = 20·0.1·0.9^{19} ≈ 0.2702
- P(2) = C(20,2)·0.1^2·0.9^{18} ≈ 0.2852
Sum ≈ 0.6769, so P(X ≥ 3) ≈ 0.3231.
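The arithmetic above can be verified in a few lines:

```python
import math

n, p = 20, 0.1

def pmf(k):
    # Binomial probability: C(n, k) p^k (1 - p)^(n - k)
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

p_at_most_2 = pmf(0) + pmf(1) + pmf(2)  # ~0.6769
p_at_least_3 = 1 - p_at_most_2          # ~0.3231
```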
2) Normal: late deliveries beyond 40 minutes
Assume delivery time T ~ Normal(μ=30, σ=5). Probability of being late (T > 40): Z = (40 − 30)/5 = 2, so P(T > 40) ≈ 0.0228 (about 2.3%).
95th percentile: 30 + 1.645·5 ≈ 38.2 minutes.
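The same numbers in code, using the error-function form of the Normal CDF:

```python
import math

def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ Normal(mu, sigma^2)."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

mu, sigma = 30, 5
p_late = 1 - normal_cdf(40, mu, sigma)  # P(T > 40), ~0.0228
q95 = mu + 1.645 * sigma                # 95th percentile via z = 1.645, ~38.2
```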
3) Poisson: chance of at least one defect per batch
Defects per batch D ~ Poisson(λ=2.5). P(D ≥ 1) = 1 − P(0) = 1 − e^{−2.5} ≈ 0.918.
Two independent batches: P(no defects in both) = e^{−5} ≈ 0.0067, so P(at least one defect across two) ≈ 0.9933.
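A quick check of both batch calculations, using the fact that independent Poisson counts add:

```python
import math

lam = 2.5  # defects per batch
p_at_least_one = 1 - math.exp(-lam)  # ~0.918

# Total defects across two independent batches ~ Poisson(2 * lam).
p_none_in_two = math.exp(-2 * lam)   # e^-5, ~0.0067
p_any_in_two = 1 - p_none_in_two     # ~0.9933
```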
4) Mixtures: conversion rate across platforms
40% mobile with p=0.04, 60% desktop with p=0.06. Expected conversions per 100 users: 100 · (0.4·0.04 + 0.6·0.06) = 5.2.
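This is the law of total expectation in miniature; in code:

```python
traffic_share = {"mobile": 0.4, "desktop": 0.6}
conversion = {"mobile": 0.04, "desktop": 0.06}

# Overall rate is the share-weighted average of per-platform rates.
overall_rate = sum(traffic_share[k] * conversion[k] for k in traffic_share)
expected_per_100 = 100 * overall_rate  # 5.2
```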
Exercises you can try now
These mirror the interactive exercises below. Try them here first, then submit in the exercise panel.
- CTR estimation (Binomial/Normal approximation): You observe 20 clicks out of 200 impressions. Compute p-hat, a 95% normal-approximation confidence interval, and the expected number of clicks in the next 1000 impressions.
- Ticket times (Normal): T ~ Normal(μ=30, σ=8). Find P(T > 45) and the 90th percentile time.
- Calls (Poisson): Calls arrive at 2.5 per minute on average. Compute P(X ≥ 5) in one minute, and P(0 calls) in 30 seconds.
Hints (if you need a nudge)
- For a proportion, p-hat = x/n and SE ≈ sqrt(p-hat(1 − p-hat)/n); 95% CI ≈ p-hat ± 1.96·SE.
- For Normal, standardize with Z = (x − μ)/σ; use common Z values (1.28, 1.645, 1.96, 2.33).
- For Poisson, scale λ by time window: λnew = λ·t.
- Compute each answer symbolically before plugging in numbers.
- Check units (minutes vs. seconds; impressions vs. clicks).
- Validate that every probability lands between 0 and 1.
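The proportion hint maps directly to code. A minimal sketch using the CTR exercise's numbers (try it by hand first):

```python
import math

def proportion_ci(x, n, z=1.96):
    """Normal-approximation CI for a proportion: p_hat +/- z * sqrt(p_hat(1 - p_hat)/n)."""
    p_hat = x / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, p_hat - z * se, p_hat + z * se

p_hat, lo, hi = proportion_ci(20, 200)   # 20 clicks out of 200 impressions
expected_next_1000 = 1000 * p_hat        # expected clicks in the next 1000 impressions
```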
Common mistakes and how to self-check
- Mixing PDF with probability: For continuous X, P(X = a) = 0. Use areas under the curve, not f(a).
- Using Normal approximation when np or n(1−p) is too small: Check both are at least around 10.
- Forgetting to scale Poisson rates with time or area: Always adjust λ by the interval length.
- Confusing variance and standard deviation: SD is the square root of variance.
- Ignoring independence assumptions: Binomial needs independent trials with constant p. If not, consider alternative models.
Self-check mini-list
- Did I write down the distribution and its parameters before calculating?
- Did I verify assumptions (independence, identical p, rate stability)?
- Are my results sensible in magnitude and units?
Practical projects
- A/B test simulator: Simulate Binomial outcomes for control and treatment, compute lift distributions, and visualize overlap.
- Queue risk dashboard: Model ticket arrivals with Poisson; compute the probability of exceeding capacity in each 15-minute window.
- Anomaly thresholding: Fit a Normal distribution to a stable metric and set dynamic alert thresholds using quantiles (e.g., 99.5th percentile).
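For the queue risk dashboard, the core calculation is a Poisson tail probability. A sketch with placeholder numbers (a rate of 12 tickets per 15-minute window and a capacity of 20; both are hypothetical assumptions, not values from any real system):

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam_per_window = 12.0  # assumed average arrivals per 15-minute window
capacity = 20          # assumed tickets the team can handle per window

# P(arrivals > capacity) = 1 - P(X <= capacity)
p_exceed = 1 - sum(poisson_pmf(k, lam_per_window) for k in range(capacity + 1))
```

A dashboard would recompute this per window, with λ estimated from recent arrival data.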
Suggested steps for the A/B simulator
- Choose n and p for control and treatment.
- Simulate outcomes 10,000 times for each arm.
- Compute lift and the proportion lift > 0.
- Plot histograms and report the 95% interval of lift.
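The steps above can be sketched as follows (standard library only; the sample sizes and conversion rates are placeholder assumptions, and plotting is omitted):

```python
import random
import statistics

random.seed(7)

def simulate_binomial(n, p, sims):
    """Draw Binomial(n, p) counts by summing Bernoulli trials."""
    return [sum(1 for _ in range(n) if random.random() < p) for _ in range(sims)]

n, sims = 500, 2000                  # users per arm, simulated experiments
p_control, p_treatment = 0.05, 0.06  # assumed conversion rates

control = simulate_binomial(n, p_control, sims)
treatment = simulate_binomial(n, p_treatment, sims)

# Lift in conversion rate per simulated experiment.
lifts = sorted((t - c) / n for c, t in zip(control, treatment))
p_lift_positive = sum(l > 0 for l in lifts) / sims
ci_95 = (lifts[int(0.025 * sims)], lifts[int(0.975 * sims)])
```

Plot a histogram of `lifts` to see the overlap between arms; `p_lift_positive` and `ci_95` summarize how often and by how much treatment wins under these assumptions.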
Learning path
- Discrete basics: Bernoulli, Binomial, Geometric; expectation and variance.
- Counts: Poisson and Poisson processes; scaling and sums.
- Continuous basics: Uniform, Normal, Exponential; PDFs vs. CDFs.
- Approximations: Normal approx to Binomial; Central Limit Theorem intuition.
- Intervals and quantiles: Using Z-scores; interpreting tail risk.
- Mixtures and conditioning: Law of total expectation/variance.
- Simulation: Monte Carlo to validate analytic results.
Who this is for
- Aspiring and junior Data Scientists preparing for product analytics, experimentation, or modeling roles.
- Analysts and engineers who want reliable uncertainty estimates for decisions.
Prerequisites
- Comfort with basic algebra and percentages.
- Familiarity with mean, variance, and standard deviation.
- A calculator or spreadsheet for simple computations (optional: Python/R for simulation).
Next steps
- Practice by analyzing a small A/B test with Binomial confidence intervals.
- Estimate the probability of breaching an SLA using Normal tails.
- Move on to Sampling, CLT, and Hypothesis Testing to connect distributions with inference.
Mini challenge
A daily active user (DAU) session length S is modeled as a mixture: 70% short users with Exponential(λ = 1/10 per minute, mean 10 min) and 30% power users with Exponential(λ = 1/40 per minute, mean 40 min). Compute:
- The expected session length E[S].
- P(S > 30 min).
Show reasoning
- E[S] = 0.7·10 + 0.3·40 = 7 + 12 = 19 minutes.
- P(S > 30) = 0.7·e^{−30/10} + 0.3·e^{−30/40} ≈ 0.7·e^{−3} + 0.3·e^{−0.75} ≈ 0.7·0.0498 + 0.3·0.4724 ≈ 0.0349 + 0.1417 ≈ 0.1766.
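A quick numeric check of both mixture answers:

```python
import math

w_short, w_power = 0.7, 0.3
mean_short, mean_power = 10.0, 40.0  # Exponential means in minutes

# Mixture mean is the weighted average of component means.
expected_length = w_short * mean_short + w_power * mean_power  # 19 minutes

# Exponential tail: P(S > t) = exp(-t / mean); mix the tails with the same weights.
p_over_30 = w_short * math.exp(-30 / mean_short) + w_power * math.exp(-30 / mean_power)
```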
Quick Test
Everyone can take the Quick Test below. If you log in, your progress will be saved.