
Sampling And Distributions

Learn Sampling and Distributions for free with explanations, exercises, and a quick test (for Data Scientists).

Published: January 1, 2026 | Updated: January 1, 2026

Why this matters

Sampling and distributions are the backbone of evidence-based decisions. As a Data Scientist, you will:

  • Evaluate A/B tests: estimate conversion lifts and uncertainty.
  • Build dashboards that show reliable metrics with margins of error.
  • Forecast demand and quantify risk with probabilistic models.
  • Detect anomalies while controlling false alarms.
  • Run quick studies when full population data is unavailable.

How this page works

You can read, try examples, and take the quick test. The test is available to everyone; only logged-in users get saved progress.

Concept explained simply

Population is everyone you care about. A sample is the smaller set you actually measure. Sampling lets you learn about the population when measuring everyone is too expensive or slow.

Distributions describe how values vary. Some are for counts (Binomial, Poisson); some are for continuous measurements (Normal, t, Exponential). The sampling distribution tells you how a statistic (like the sample mean) varies from sample to sample.

Mental model

Think of your statistic (mean, proportion) as a dart thrown at the true value. Each new sample throws a slightly different dart. The distribution of those darts around the true value is the sampling distribution. Its spread is the standard error; more data means tighter grouping.
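The dart analogy is easy to simulate. A minimal stdlib-only sketch (the exponential population and the sample sizes are illustrative choices, not part of the analogy):

```python
import random
import statistics

def sampling_distribution(n, draws=2000, seed=0):
    """Throw `draws` darts: each dart is the mean of a fresh sample of size n."""
    rng = random.Random(seed)
    # Deliberately skewed population: Exponential with true mean 1.0.
    means = [statistics.fmean(rng.expovariate(1.0) for _ in range(n))
             for _ in range(draws)]
    return statistics.fmean(means), statistics.stdev(means)

center_small, spread_small = sampling_distribution(n=10)
center_large, spread_large = sampling_distribution(n=160)
# Both centers land near the true mean 1.0; the n=160 darts group about
# 4x tighter, matching SE ~ 1/sqrt(n) (sqrt(160/10) = 4).
```

Running this shows the spread (the standard error) shrinking as n grows, even though the population itself is far from bell-shaped.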

Core ideas you will use on the job

Populations, samples, and error

  • Sampling error: random variation because you used a sample.
  • Bias: systematic error (e.g., surveying only power users).
  • Standard error (SE): spread of a statistic across many samples.

Sampling methods

  • Simple random: every unit has equal chance.
  • Stratified: split into meaningful groups (strata), sample each. Great for rare but important segments.
  • Cluster: sample groups (clusters) first, then units. Good when lists are hard to build.
  • Systematic: pick every k-th unit after a random start (beware hidden periodicity in the list).
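Stratified sampling is the one you will code most often. A minimal sketch (the user records, the "plan" field, and the per-stratum size are made up for illustration):

```python
import random

def stratified_sample(units, strata_key, per_stratum, seed=0):
    """Split units into strata, then draw a simple random sample from each."""
    rng = random.Random(seed)
    groups = {}
    for u in units:
        groups.setdefault(strata_key(u), []).append(u)
    return {name: rng.sample(members, min(per_stratum, len(members)))
            for name, members in groups.items()}

# Toy population: 5% of users are on the "pro" plan.
users = [{"id": i, "plan": "pro" if i % 20 == 0 else "free"} for i in range(1000)]
sample = stratified_sample(users, strata_key=lambda u: u["plan"], per_stratum=25)
# Guarantees 25 "pro" users even though they are a rare segment.
```

A simple random sample of 50 from this population would often contain only one or two "pro" users; stratifying guarantees the rare segment is represented.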

Distributions you’ll meet often

  • Bernoulli/Binomial: success/failure counts (e.g., conversions).
  • Poisson: counts in time/space (e.g., requests per minute).
  • Normal: bell curve for many aggregated effects; z-scores.
  • t-distribution: like Normal but with heavier tails; use for means when sigma is unknown.
  • Exponential: waiting times between Poisson events.

Sampling distributions and the CLT

Central Limit Theorem (CLT): with a large enough n and finite variance, the sampling distribution of the sample mean is approximately Normal, regardless of the population’s shape.

  • SE(mean) ≈ s/√n (use sample s).
  • SE(proportion) ≈ √[p(1−p)/n] (use p̂ from data).
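Both formulas translate directly to code; a quick sketch using the numbers from the worked examples further down:

```python
import math

def se_mean(s, n):
    """Standard error of a sample mean, using the sample sd s."""
    return s / math.sqrt(n)

def se_proportion(p_hat, n):
    """Standard error of a sample proportion."""
    return math.sqrt(p_hat * (1 - p_hat) / n)

se_mean(12, 36)           # 2.0
se_proportion(0.12, 800)  # ~0.0115
```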

Confidence intervals (CI)

  • Mean (unknown sigma): mean ± t* × s/√n.
  • Proportion: p̂ ± z* × √[p̂(1−p̂)/n].
  • Interpretation: If you repeated the study many times, about 95% of the resulting 95% CIs would include the true value.

When to use Normal vs t

  • Use z (Normal) when population sigma is known or n is very large and population isn’t heavy-tailed.
  • Use t when estimating sigma with sample s and n is moderate or small. As n grows, t approaches Normal.
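You can see the "t approaches Normal" behavior directly. The z* value comes from the stdlib; the t* values below are typed in from a standard t table (there is no t distribution in the stdlib):

```python
import statistics

# z* for a 95% two-sided interval is fixed, whatever the n.
z_95 = statistics.NormalDist().inv_cdf(0.975)   # ~1.96

# t* depends on df = n - 1; values from a standard t table.
t_95 = {4: 2.776, 9: 2.262, 29: 2.045, 99: 1.984}
# As df grows, t* -> z*: the penalty for estimating sigma fades away.
```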

Worked examples

Example 1 — Standard error and CLT

A support team samples 36 response times. Sample mean = 42 min, sample sd s = 12 min.

  1. SE(mean) = s/√n = 12/6 = 2 min.
  2. Approx 95% CI (t≈2.03 for df=35, close to 2): 42 ± 2.03×2 = 42 ± 4.06 → (37.94, 46.06) min.
  3. Even if raw times are skewed, by CLT the mean is roughly Normal here.
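The same arithmetic as a short Python sketch (the t* value is typed in from a t table):

```python
import math

n, mean, s = 36, 42.0, 12.0
se = s / math.sqrt(n)          # 2.0 min
t_star = 2.03                  # t table, df = 35
lo, hi = mean - t_star * se, mean + t_star * se
print(f"95% CI: ({lo:.2f}, {hi:.2f}) min")  # 95% CI: (37.94, 46.06) min
```
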

Example 2 — Proportion CI

Out of 800 sessions, 96 convert. p̂ = 96/800 = 0.12.

  1. SE(p̂) = √[0.12×0.88/800] ≈ √(0.1056/800) ≈ √0.000132 ≈ 0.0115.
  2. 95% CI: 0.12 ± 1.96×0.0115 ≈ 0.12 ± 0.0225 → (0.0975, 0.1425).
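In code, the whole calculation is three lines:

```python
import math

conversions, n = 96, 800
p_hat = conversions / n                    # 0.12
se = math.sqrt(p_hat * (1 - p_hat) / n)   # ~0.0115
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
# lo ~ 0.0975, hi ~ 0.1425
```
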

Example 3 — Binomial vs Poisson

Bug reports: average 3 per day. Assume independent arrivals.

  1. Use Poisson(λ=3) for counts per day.
  2. P(X ≥ 5) = 1 − P(X ≤ 4). Compute from Poisson pmf/cdf: ≈ 1 − 0.8153 = 0.1847 (about 18%).

Rule of thumb: Poisson fits counts in fixed intervals when events are independent, arrive at a roughly constant average rate, and each individual event is rare.
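The tail probability is just a short sum over the Poisson pmf; a stdlib-only sketch:

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 3.0
p_at_most_4 = sum(poisson_pmf(k, lam) for k in range(5))  # ~0.8153
p_at_least_5 = 1 - p_at_most_4                            # ~0.1847
```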

Example 4 — t-based CI for mean latency

n=40 page loads. mean=510 ms, s=120 ms. 95% CI.

  1. SE = 120/√40 ≈ 18.97 ms.
  2. t* (df=39) ≈ 2.02.
  3. ME = 2.02×18.97 ≈ 38.3 ms.
  4. CI: 510 ± 38.3 → (471.7, 548.3) ms.

Practice — do it now

Try these without a calculator first, then compute precisely.

  • Exercise 1 (full version in the Exercises section): Compute a 95% CI for a mean with n=40, mean=510 ms, s=120 ms.
  • Exercise 2 (full version in the Exercises section): Find the minimum n for a ±3% margin of error at 95% confidence for an unknown proportion.

Quick checklist — before you report a result

  • Is the sample representative? If not, note possible bias.
  • State the sampling method (random/stratified/cluster).
  • Which distribution model did you assume? Why is it reasonable?
  • Report the estimate, SE or CI, and level (e.g., 95%).
  • Mention limitations (e.g., small n, nonresponse, seasonality).

Common mistakes and self-checks

  • Confusing standard deviation with standard error. Self-check: SE should shrink like 1/√n.
  • Using Normal when sigma is unknown and n is small. Self-check: prefer t unless n is large and data not heavy-tailed.
  • Convenience sampling presented as if random. Self-check: describe how each unit could be selected.
  • Ignoring dependence (time series). Self-check: plot over time; consider autocorrelation.
  • Misinterpreting CI. Self-check: avoid “95% chance the true value is in our CI”; instead use long-run frequency wording.
  • Forgetting rare segments in experiments. Self-check: stratify or ensure minimum per segment.

Practical projects

  • Bootstrap a mean latency CI: resample your dataset 1,000 times, compute means, take the 2.5th–97.5th percentiles. Compare with t-based CI.
  • Conversion uplift: design a stratified sampling plan for new vs returning users; run an A/B test and report CI for difference in proportions.
  • Arrival modeling: fit a Poisson model to hourly events; check goodness by comparing mean and variance and inspecting interarrival times (Exponential).
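The first project, as a stdlib-only sketch (the synthetic latency data is made up; plug in your own measurements):

```python
import random
import statistics

def bootstrap_ci(data, draws=1000, seed=0):
    """Percentile bootstrap: resample with replacement, then take the
    2.5th-97.5th percentiles of the resampled means."""
    rng = random.Random(seed)
    means = sorted(statistics.fmean(rng.choices(data, k=len(data)))
                   for _ in range(draws))
    return means[int(0.025 * draws)], means[int(0.975 * draws) - 1]

gen = random.Random(1)
latencies = [gen.expovariate(1 / 500) for _ in range(200)]  # synthetic, mean ~500 ms
lo, hi = bootstrap_ci(latencies)
# The sample mean sits inside (lo, hi); compare the width to a t-based CI.
```

The bootstrap makes no Normality assumption, so on skewed data like these latencies it is a useful cross-check on the t-based interval.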

Mini challenge

You have 4 user segments (A/B/C/D). Segment D is only 5% of users but critical. You have budget for 1,000 observations.

  • Propose a stratified sample allocation that ensures at least 100 observations in D.
  • Pick appropriate distributions for (a) number of signups per hour, (b) time-to-first-action, (c) average order value across days, and justify.

Hint

Start by allocating minimums to small but important segments, then distribute the rest proportionally. Poisson for counts in time, Exponential for waiting times, Normal (by CLT) for daily averages.
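The allocation in the hint is a few lines of arithmetic; the segment shares below are assumed for illustration (only D's 5% is given in the challenge):

```python
shares = {"A": 0.40, "B": 0.35, "C": 0.20, "D": 0.05}  # assumed segment mix
budget, d_min = 1000, 100

# Reserve D's minimum, then split the rest proportionally among A/B/C.
rest = budget - d_min
rest_share = sum(v for k, v in shares.items() if k != "D")
plan = {k: round(rest * v / rest_share) for k, v in shares.items() if k != "D"}
plan["D"] = d_min
# plan == {"A": 379, "B": 332, "C": 189, "D": 100}
```

Note that D gets 100 observations (10% of the budget) despite being 5% of users; that oversampling is the point of the minimum.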

Who this is for

  • Early-career Data Scientists and Analysts.
  • Engineers running experiments or dashboards.
  • Researchers preparing quick studies with limited data.

Prerequisites

  • Basic arithmetic and algebra.
  • Understanding of mean, median, variance.
  • Comfort with percentages and logs (helpful).

Learning path

  • Next: Hypothesis testing (p-values, power, sample size).
  • Then: Effect sizes and experimental design.
  • Then: Generalized linear models (Binomial/Poisson regressions).
  • Optional: Bootstrap and Bayesian estimation.

Next steps

  • Apply these ideas to a recent dataset you use at work.
  • Write a 3–5 sentence “assumptions and limitations” section for your next report.
  • Practice: complete the exercises below, then take the quick test.

Exercises

Exercise 1 — 95% CI for a mean

You sampled n=40 page loads: mean=510 ms, s=120 ms. Compute a 95% CI using the appropriate distribution.

Exercise 2 — Sample size for a proportion

You want ±3% margin of error at 95% confidence for an unknown conversion rate. What minimum n do you need?

Ready? Take the Quick Test

Answer a few questions to check understanding. Everyone can take it; only logged-in users get saved progress.

Expected output for Exercise 1

Approximate 95% CI: (472 ms, 548 ms).

Sampling And Distributions — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.

