Why this matters
Distributions describe the shape of your data. As a Data Analyst, understanding them helps you:
- Pick the right summary stats (e.g., median instead of mean for skewed data).
- Estimate probabilities (e.g., how often a queue will be empty or overflow).
- Choose appropriate visualizations (histogram vs. bar chart) and transformations (log for right-skew).
- Validate assumptions for A/B tests and forecasting.
Concept explained simply
A distribution tells you how likely different values are. Think of it as the "shape" a random process tends to produce over many repeats.
- Discrete vs. continuous: Counts (0,1,2,...) vs. measurements (any real value).
- Parameters: Small set of numbers that summarize the shape (e.g., mean, variance, rate).
- Skew and tails: Right-skewed means long tail to the right; heavy tails mean more extreme values than Normal.
Common distributions at a glance
- Bernoulli(p): Single yes/no outcome (clicked vs. not).
- Binomial(n, p): Number of successes out of n independent trials.
- Poisson(lambda): Count of events in a fixed interval when events happen independently at a constant average rate.
- Uniform(a, b): All values in [a, b] equally likely.
- Normal(mu, sigma): Bell curve; describes many natural aggregates; the Central Limit Theorem (CLT) pushes sums and means of many independent values toward Normal.
- Log-normal(mu_log, sigma_log): Positive, right-skewed amounts; log of data is roughly Normal.
- Exponential(rate): Time between Poisson events; memoryless.
- t-distribution(df): Like Normal but with heavier tails; used when estimating means with small samples and unknown variance.
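If you work in Python, the minimal sketch below (assuming NumPy and SciPy are installed; all parameter values are hypothetical) draws samples from each distribution above and prints the sample mean and variance, which helps connect parameters to shapes:

```python
# Sampling from the distributions above with scipy.stats (illustrative sketch;
# all parameter values are hypothetical).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # seeded so results are reproducible

samples = {
    "Bernoulli(p=0.3)":       stats.bernoulli(p=0.3).rvs(1000, random_state=rng),
    "Binomial(n=20, p=0.3)":  stats.binom(n=20, p=0.3).rvs(1000, random_state=rng),
    "Poisson(lambda=3.2)":    stats.poisson(mu=3.2).rvs(1000, random_state=rng),
    "Uniform(0, 10)":         stats.uniform(loc=0, scale=10).rvs(1000, random_state=rng),
    "Normal(0, 1)":           stats.norm(loc=0, scale=1).rvs(1000, random_state=rng),
    "Log-normal(0, 0.5)":     stats.lognorm(s=0.5, scale=np.exp(0)).rvs(1000, random_state=rng),
    "Exponential(rate=0.5)":  stats.expon(scale=1 / 0.5).rvs(1000, random_state=rng),
    "t(df=5)":                stats.t(df=5).rvs(1000, random_state=rng),
}

for name, x in samples.items():
    print(f"{name:24s} mean={x.mean():6.2f}  var={x.var(ddof=1):6.2f}")
```

Notice how the Poisson sample's variance sits close to its mean, while t(df=5) shows more spread than a standard Normal.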
Mental model: Generative stories
- Bernoulli: Flip a biased coin once. Heads probability = p.
- Binomial: Flip the same coin n times; count heads.
- Poisson: Events pop up randomly at average rate lambda per interval; count how many occur in an interval.
- Exponential: Wait time until the next random event from a Poisson process.
- Normal: Many small independent effects add up (measurement error, heights of people, average of many samples).
- Log-normal: Multiply small independent effects (e.g., price growth factors) — taking logs turns products into sums.
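These stories are easy to act out in code. A small simulation (hypothetical parameters, assuming NumPy) replays the Binomial story as repeated coin flips and the Log-normal story as a product of growth factors:

```python
# Replaying two generative stories with NumPy (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
reps = 100_000

# Binomial story: flip a biased coin n times and count heads.
n, p = 20, 0.3
heads = rng.random((reps, n)) < p           # each row is one run of n flips
counts = heads.sum(axis=1)                  # heads per run ~ Binomial(n, p)
print("simulated mean:", counts.mean().round(2), "(theory n*p = 6.0)")

# Log-normal story: multiply many small independent growth factors.
log_factors = rng.normal(loc=0.01, scale=0.05, size=(reps, 50))
amounts = np.exp(log_factors.sum(axis=1))   # product of exp(f) = exp(sum of f)
print("right skew on the raw scale: mean", amounts.mean().round(3),
      "> median", np.median(amounts).round(3))
```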
Worked examples
Example 1: Support emails per hour
Scenario: Over many hours, you record counts of support emails: 0,1,2,... You suspect a Poisson process.
- Estimate lambda: the average count per hour (mean of the data). Suppose the mean is 3.2 emails/hour.
- Quick check: For a Poisson distribution, the variance equals the mean. If the sample variance is near 3.2, that supports the Poisson assumption.
- Probability of zero emails in an hour: P(X=0) = exp(-lambda) = exp(-3.2) ≈ 0.041.
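A short sketch to reproduce these numbers, assuming the hourly counts sit in an array (the values below are hypothetical):

```python
# Fitting and checking a Poisson model for hourly email counts
# (sketch; `counts` stands in for your recorded data).
import numpy as np
from scipy import stats

counts = np.array([1, 5, 3, 0, 4, 6, 2, 3, 4, 4])  # hypothetical hourly counts

lam = counts.mean()                                # lambda-hat = sample mean
print("lambda-hat     :", lam)                     # 3.2
print("sample variance:", counts.var(ddof=1))      # ~3.3, near the mean
print("P(X = 0)       :", stats.poisson(mu=lam).pmf(0))  # exp(-3.2) ~ 0.041
```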
Example 2: Transaction amounts (skewed)
Scenario: Purchase amounts are positive and right-skewed; a few very large orders exist.
- Hypothesis: Log-normal. Take log(amount) and plot a histogram.
- If log(amount) looks roughly bell-shaped, use Normal summaries on the log scale.
- Median on the original scale equals exp(mean of log(amount)). If mean log = 2.1, median ≈ exp(2.1) ≈ 8.17.
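A sketch of the log-scale check, using simulated amounts as a stand-in for real purchase data:

```python
# Checking the log-normal hypothesis on purchase amounts (sketch;
# the simulated `amounts` stand in for real data).
import numpy as np

rng = np.random.default_rng(1)
amounts = rng.lognormal(mean=2.1, sigma=0.8, size=500)  # hypothetical orders

logs = np.log(amounts)
print("mean of log(amount):", logs.mean().round(2))          # ~2.1
print("exp(mean of logs)  :", np.exp(logs.mean()).round(2))  # ~ median...
print("sample median      :", np.median(amounts).round(2))   # ...should agree
# If a histogram of `logs` looks roughly bell-shaped, summarize on the
# log scale and report exp(mean of logs) as the typical (median) order.
```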
Example 3: Email campaign conversion
Scenario: Each of 1,000 recipients independently converts with probability p.
- Model: Binomial(n=1000, p).
- Expected conversions: n * p. If p = 0.03, expected = 30.
- Standard deviation: sqrt(n p (1-p)) ≈ sqrt(1000 * 0.03 * 0.97) ≈ 5.4.
- Rough 95% range: expected ± 2*sd → 30 ± 10.8 → about 19 to 41.
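The same numbers, reproduced with scipy.stats; the last line pulls an exact central 95% interval from the distribution for comparison with the ±2 SD rule:

```python
# Reproducing the campaign calculations with scipy.stats.
from scipy import stats

n, p = 1000, 0.03
dist = stats.binom(n=n, p=p)

mean, sd = dist.mean(), dist.std()
print("expected conversions:", mean)                  # n*p = 30.0
print("standard deviation  :", round(sd, 2))          # sqrt(n*p*(1-p)) ~ 5.39
print("rough 95% range     :", round(mean - 2 * sd, 1), "to", round(mean + 2 * sd, 1))
print("exact central 95%   :", dist.ppf(0.025), "to", dist.ppf(0.975))
```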
Exercises
Complete the tasks below; solutions are provided for each exercise.
Exercise 1: Match scenarios to distributions
For each scenario, select the most suitable distribution and estimate its key parameter(s):
- A. Number of app crashes per day (rare, independent events).
- B. Whether a user clicks a button in a single app session.
- C. Number of clicks out of 200 ad impressions with stable probability.
- D. Time between consecutive signups on a landing page.
- E. Daily revenue values that are positive and heavily right-skewed.
Estimate parameters using these summaries (hypothetical):
- Average crashes/day: 1.4
- Click probability per impression: 0.05
- Impressions in C: n = 200
- Median time between signups: about 2 minutes
- Mean of log(daily revenue): 3.2; SD of log(daily revenue): 0.6
Checklist:
- I picked one distribution per scenario.
- I estimated parameters using the given numbers.
- My choices match the data type (count, yes/no, time, positive skew).
Exercise 2: Compute simple probabilities
Use the distribution formulas or standard approximations:
- Poisson with lambda = 3 per hour: probability of zero events in the next hour?
- Normal with mean = 48 and SD = 12 hours: probability a ticket takes more than 72 hours to resolve?
- Log-normal where log(amount) ~ Normal(mean = 2.0, SD = 0.5): what is the median amount?
- Small sample mean: n = 25, sample mean = 68, sample SD = 10. 95% CI for the population mean?
Checklist:
- I wrote the formula used for each calculation.
- I showed a rounded numeric answer.
- For the confidence interval, I used the t-multiplier (not the Normal z) since the SD is estimated from the sample.
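After working the answers by hand, a sketch like this (assuming NumPy and SciPy) lets you verify them numerically:

```python
# Numerical checks for Exercise 2 (sketch, assuming SciPy and NumPy).
import numpy as np
from scipy import stats

# Poisson(lambda=3): probability of zero events in the next hour.
print("P(X = 0)     :", stats.poisson(mu=3).pmf(0))          # = exp(-3)

# Normal(mean=48, SD=12): probability a ticket takes more than 72 hours.
print("P(T > 72)    :", stats.norm(loc=48, scale=12).sf(72)) # upper tail

# Log-normal with log(amount) ~ Normal(2.0, 0.5): the median amount.
print("median amount:", stats.lognorm(s=0.5, scale=np.exp(2.0)).median())

# t-based 95% CI for the mean: n=25, mean=68, sample SD=10.
n, xbar, s = 25, 68.0, 10.0
half = stats.t(df=n - 1).ppf(0.975) * s / np.sqrt(n)         # t-multiplier, not z
print("95% CI       :", (round(xbar - half, 1), round(xbar + half, 1)))
```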
Common mistakes and self-check
- Using Normal on raw, right-skewed amounts. Self-check: Plot histogram and try log; if log looks bell-shaped, prefer log-normal summaries.
- Treating counts with mean near 0 as Normal. Self-check: If mean is small and variance ≈ mean, Poisson often fits better.
- Forgetting independence. Self-check: If events cluster (dependence), Poisson may understate variance (overdispersion).
- Using z-interval with small n and unknown sigma. Self-check: Use t with df = n-1 when sigma is unknown and n is small.
- Confusing median and mean under log-normal. Self-check: The median is exp(mean of log data); the raw-scale mean is larger, exp(mean of logs + variance of logs / 2).
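The mean-versus-variance self-check for counts takes only a few lines; the data here are hypothetical:

```python
# Overdispersion self-check for count data (sketch; `counts` is hypothetical).
import numpy as np

counts = np.array([0, 2, 1, 0, 5, 1, 0, 7, 2, 1])  # clustered-looking counts

ratio = counts.var(ddof=1) / counts.mean()  # ~1 if the data are Poisson-like
print("variance/mean ratio:", round(ratio, 2))
if ratio > 1.5:  # rough, informal threshold
    print("Noticeably overdispersed: a plain Poisson will understate variance.")
```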
Practical projects
- Helpdesk volume model: Collect hourly ticket counts for two weeks. Fit a Poisson (estimate lambda). Compare the observed variance to lambda; if the variance clearly exceeds the mean, note the overdispersion.
- Revenue shape check: Take 3 months of order amounts. Plot raw and log histograms. Report median, IQR on raw; and mean, SD on log. Present one slide with recommendation.
- CTR stability: For daily impressions and clicks, model clicks ~ Binomial(n, p). Compute daily p-hats and a 95% interval for p. Flag days outside the interval and explain practical reasons.
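For the CTR project, a minimal sketch of the daily-interval idea, using a rough Normal-approximation interval around a pooled baseline rate; all numbers are hypothetical:

```python
# Flagging unusual daily CTRs against a pooled baseline (sketch;
# impressions and clicks are hypothetical).
import numpy as np

impressions = np.array([1200, 1100, 1300, 1250, 1180])
clicks      = np.array([  60,   52,   70,   58,   90])  # last day looks high

p_hat = clicks.sum() / impressions.sum()          # pooled baseline CTR
for n, k in zip(impressions, clicks):
    se = np.sqrt(p_hat * (1 - p_hat) / n)         # SE under the baseline rate
    lo, hi = p_hat - 2 * se, p_hat + 2 * se
    flag = "" if lo <= k / n <= hi else "  <-- outside interval"
    print(f"CTR {k / n:.3f}  expected [{lo:.3f}, {hi:.3f}]{flag}")
```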
Mini challenge
You observe that 60% of minutes have 0 signups, 30% have 1, 10% have 2, and almost none have 3+. Suggest a distribution and estimate its main parameter. Then propose a quick check to validate your choice.
Hint
Rates per minute with many zeros often suggest a Poisson with lambda around the average count per minute; check mean vs variance.
Who this is for
- Entry-level and aspiring Data Analysts who want to interpret data distributions clearly.
- Professionals switching from reporting to analytical modeling.
Prerequisites
- Basic arithmetic and percentages.
- Comfort with averages, variance, and standard deviation.
- Ability to read histograms and bar charts.
Learning path
- Before: Descriptive statistics (mean, median, variance), data types (categorical vs numeric).
- This subskill: Recognize and use basic distributions in EDA.
- After: Sampling and the Central Limit Theorem, hypothesis testing, regression assumptions.
Next steps
- Re-check your last project: Which distributions did you assume implicitly? Were they appropriate?
- Build a small notebook/template to test Poisson vs. overdispersion and to visualize log-normal candidates.
- Take the Quick Test below to check your understanding.