Why this matters
Distributions describe the shape of your data. As a Data Analyst, understanding them helps you:
- Pick the right summary stats (e.g., median instead of mean for skewed data).
- Estimate probabilities (e.g., how often a queue will be empty or overflow).
- Choose appropriate visualizations (histogram vs. bar chart) and transformations (log for right-skew).
- Validate assumptions for A/B tests and forecasting.
Concept explained simply
A distribution tells you how likely different values are. Think of it as the "shape" a random process tends to produce over many repeats.
- Discrete vs. continuous: Counts (0,1,2,...) vs. measurements (any real value).
- Parameters: Small set of numbers that summarize the shape (e.g., mean, variance, rate).
- Skew and tails: Right-skewed means long tail to the right; heavy tails mean more extreme values than Normal.
Common distributions at a glance
- Bernoulli(p): Single yes/no outcome (clicked vs. not).
- Binomial(n, p): Number of successes out of n independent trials.
- Poisson(lambda): Count of events in a fixed interval when events happen independently at a constant average rate.
- Uniform(a, b): All values in [a, b] equally likely.
- Normal(mu, sigma): Bell curve; describes many natural aggregates; the Central Limit Theorem (CLT) pushes sums and means of many independent values toward Normal.
- Log-normal(mu_log, sigma_log): Positive, right-skewed amounts; log of data is roughly Normal.
- Exponential(rate): Time between Poisson events; memoryless.
- t-distribution(df): Like Normal but with heavier tails; used when estimating means with small samples and unknown variance.
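If you work in Python, the minimal sketch below (assuming NumPy and SciPy are installed; all parameter values are hypothetical) draws samples from each distribution above and prints the sample mean and variance, which helps connect parameters to shapes:

```python
# Sampling from the distributions above with scipy.stats (illustrative sketch;
# all parameter values are hypothetical).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # seeded so results are reproducible

samples = {
    "Bernoulli(p=0.3)":       stats.bernoulli(p=0.3).rvs(1000, random_state=rng),
    "Binomial(n=20, p=0.3)":  stats.binom(n=20, p=0.3).rvs(1000, random_state=rng),
    "Poisson(lambda=3.2)":    stats.poisson(mu=3.2).rvs(1000, random_state=rng),
    "Uniform(0, 10)":         stats.uniform(loc=0, scale=10).rvs(1000, random_state=rng),
    "Normal(0, 1)":           stats.norm(loc=0, scale=1).rvs(1000, random_state=rng),
    "Log-normal(0, 0.5)":     stats.lognorm(s=0.5, scale=np.exp(0)).rvs(1000, random_state=rng),
    "Exponential(rate=0.5)":  stats.expon(scale=1 / 0.5).rvs(1000, random_state=rng),
    "t(df=5)":                stats.t(df=5).rvs(1000, random_state=rng),
}

for name, x in samples.items():
    print(f"{name:24s} mean={x.mean():6.2f}  var={x.var(ddof=1):6.2f}")
```

Notice how the Poisson sample's variance sits close to its mean, while t(df=5) shows more spread than a standard Normal.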
Mental model: Generative stories
- Bernoulli: Flip a biased coin once. Heads probability = p.
- Binomial: Flip the same coin n times; count heads.
- Poisson: Events pop up randomly at average rate lambda per interval; count how many occur in an interval.
- Exponential: Wait time until the next random event from a Poisson process.
- Normal: Many small independent effects add up (measurement error, heights of people, average of many samples).
- Log-normal: Multiply small independent effects (e.g., price growth factors) — taking logs turns products into sums.
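These stories are easy to act out in code. A small simulation (hypothetical parameters, assuming NumPy) replays the Binomial story as repeated coin flips and the Log-normal story as a product of growth factors:

```python
# Replaying two generative stories with NumPy (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
reps = 100_000

# Binomial story: flip a biased coin n times and count heads.
n, p = 20, 0.3
heads = rng.random((reps, n)) < p           # each row is one run of n flips
counts = heads.sum(axis=1)                  # heads per run ~ Binomial(n, p)
print("simulated mean:", counts.mean().round(2), "(theory n*p = 6.0)")

# Log-normal story: multiply many small independent growth factors.
log_factors = rng.normal(loc=0.01, scale=0.05, size=(reps, 50))
amounts = np.exp(log_factors.sum(axis=1))   # product of exp(f) = exp(sum of f)
print("right skew on the raw scale: mean", amounts.mean().round(3),
      "> median", np.median(amounts).round(3))
```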
Worked examples
Example 1: Support emails per hour
Scenario: Over many hours, you record counts of support emails: 0,1,2,... You suspect a Poisson process.
- Estimate lambda: the average count per hour (mean of the data). Suppose the mean is 3.2 emails/hour.
- Quick check: For a Poisson distribution, the variance equals the mean. If the sample variance is near 3.2, that supports the Poisson assumption.
- Probability of zero emails in an hour: P(X=0) = exp(-lambda) = exp(-3.2) ≈ 0.041.
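A short sketch to reproduce these numbers, assuming the hourly counts sit in an array (the values below are hypothetical):

```python
# Fitting and checking a Poisson model for hourly email counts
# (sketch; `counts` stands in for your recorded data).
import numpy as np
from scipy import stats

counts = np.array([1, 5, 3, 0, 4, 6, 2, 3, 4, 4])  # hypothetical hourly counts

lam = counts.mean()                                # lambda-hat = sample mean
print("lambda-hat     :", lam)                     # 3.2
print("sample variance:", counts.var(ddof=1))      # ~3.3, near the mean
print("P(X = 0)       :", stats.poisson(mu=lam).pmf(0))  # exp(-3.2) ~ 0.041
```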
Example 2: Transaction amounts (skewed)
Scenario: Purchase amounts are positive and right-skewed; a few very large orders exist.
- Hypothesis: Log-normal. Take log(amount) and plot a histogram.
- If log(amount) looks roughly bell-shaped, use Normal summaries on the log scale.
- Median on the original scale equals exp(mean of log(amount)). If mean log = 2.1, median ≈ exp(2.1) ≈ 8.17.
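A sketch of the log-scale check, using simulated amounts as a stand-in for real purchase data:

```python
# Checking the log-normal hypothesis on purchase amounts (sketch;
# the simulated `amounts` stand in for real data).
import numpy as np

rng = np.random.default_rng(1)
amounts = rng.lognormal(mean=2.1, sigma=0.8, size=500)  # hypothetical orders

logs = np.log(amounts)
print("mean of log(amount):", logs.mean().round(2))          # ~2.1
print("exp(mean of logs)  :", np.exp(logs.mean()).round(2))  # ~ median...
print("sample median      :", np.median(amounts).round(2))   # ...should agree
# If a histogram of `logs` looks roughly bell-shaped, summarize on the
# log scale and report exp(mean of logs) as the typical (median) order.
```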
Example 3: Email campaign conversion
Scenario: Each of 1,000 recipients independently converts with probability p.
- Model: Binomial(n=1000, p).
- Expected conversions: n * p. If p = 0.03, expected = 30.
- Standard deviation: sqrt(n p (1-p)) ≈ sqrt(1000 * 0.03 * 0.97) ≈ 5.4.
- Rough 95% range: expected ± 2*sd → 30 ± 10.8 → about 19 to 41.
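The same numbers, reproduced with scipy.stats; the last line pulls an exact central 95% interval from the distribution for comparison with the ±2 SD rule:

```python
# Reproducing the campaign calculations with scipy.stats.
from scipy import stats

n, p = 1000, 0.03
dist = stats.binom(n=n, p=p)

mean, sd = dist.mean(), dist.std()
print("expected conversions:", mean)                  # n*p = 30.0
print("standard deviation  :", round(sd, 2))          # sqrt(n*p*(1-p)) ~ 5.39
print("rough 95% range     :", round(mean - 2 * sd, 1), "to", round(mean + 2 * sd, 1))
print("exact central 95%   :", dist.ppf(0.025), "to", dist.ppf(0.975))
```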
Exercises
Complete the tasks below; solutions are provided for each exercise.
Exercise 1: Match scenarios to distributions
For each scenario, select the most suitable distribution and estimate its key parameter(s):
- A. Number of app crashes per day (rare, independent events).
- B. Whether a user clicks a button in a single app session.
- C. Number of clicks out of 200 ad impressions with stable probability.
- D. Time between consecutive signups on a landing page.
- E. Daily revenue values that are positive and heavily right-skewed.
Estimate parameters using these summaries (hypothetical):
- Average crashes/day: 1.4
- Click probability per impression: 0.05
- Impressions in C: n = 200
- Median time between signups: about 2 minutes
- Mean of log(daily revenue): 3.2; SD of log(daily revenue): 0.6
Checklist:
- I picked one distribution per scenario.
- I estimated parameters using the given numbers.
- My choices match the data type (count, yes/no, time, positive skew).
Exercise 2: Compute simple probabilities
Use the distribution formulas or standard approximations:
- Poisson with lambda = 3 per hour: probability of zero events in the next hour?
- Normal with mean = 48 and SD = 12 hours: probability a ticket takes more than 72 hours to resolve?
- Log-normal where log(amount) ~ Normal(mean = 2.0, SD = 0.5): what is the median amount?
- Small sample mean: n = 25, sample mean = 68, sample SD = 10. 95% CI for the population mean?
Checklist:
- I wrote the formula used for each calculation.
- I showed a rounded numeric answer.
- For the confidence interval, I used the t-multiplier (not the Normal z) since the SD is estimated from the sample.
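After working the answers by hand, a sketch like this (assuming NumPy and SciPy) lets you verify them numerically:

```python
# Numerical checks for Exercise 2 (sketch, assuming SciPy and NumPy).
import numpy as np
from scipy import stats

# Poisson(lambda=3): probability of zero events in the next hour.
print("P(X = 0)     :", stats.poisson(mu=3).pmf(0))          # = exp(-3)

# Normal(mean=48, SD=12): probability a ticket takes more than 72 hours.
print("P(T > 72)    :", stats.norm(loc=48, scale=12).sf(72)) # upper tail

# Log-normal with log(amount) ~ Normal(2.0, 0.5): the median amount.
print("median amount:", stats.lognorm(s=0.5, scale=np.exp(2.0)).median())

# t-based 95% CI for the mean: n=25, mean=68, sample SD=10.
n, xbar, s = 25, 68.0, 10.0
half = stats.t(df=n - 1).ppf(0.975) * s / np.sqrt(n)         # t-multiplier, not z
print("95% CI       :", (round(xbar - half, 1), round(xbar + half, 1)))
```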
Common mistakes and self-check
- Using Normal on raw, right-skewed amounts. Self-check: Plot histogram and try log; if log looks bell-shaped, prefer log-normal summaries.
- Treating counts with mean near 0 as Normal. Self-check: If mean is small and variance ≈ mean, Poisson often fits better.
- Forgetting independence. Self-check: If events cluster (dependence), Poisson may understate variance (overdispersion).
- Using z-interval with small n and unknown sigma. Self-check: Use t with df = n-1 when sigma is unknown and n is small.
- Confusing median and mean under log-normal. Self-check: The median is exp(mean of log data); the raw-scale mean is larger, exp(mean of logs + variance of logs / 2).
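The mean-versus-variance self-check for counts takes only a few lines; the data here are hypothetical:

```python
# Overdispersion self-check for count data (sketch; `counts` is hypothetical).
import numpy as np

counts = np.array([0, 2, 1, 0, 5, 1, 0, 7, 2, 1])  # clustered-looking counts

ratio = counts.var(ddof=1) / counts.mean()  # ~1 if the data are Poisson-like
print("variance/mean ratio:", round(ratio, 2))
if ratio > 1.5:  # rough, informal threshold
    print("Noticeably overdispersed: a plain Poisson will understate variance.")
```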
Practical projects
- Helpdesk volume model: Collect hourly ticket counts for two weeks. Fit a Poisson (estimate lambda). Compare the observed variance to lambda; if the variance clearly exceeds the mean, note the overdispersion.
- Revenue shape check: Take 3 months of order amounts. Plot raw and log histograms. Report median, IQR on raw; and mean, SD on log. Present one slide with recommendation.
- CTR stability: For daily impressions and clicks, model clicks ~ Binomial(n, p). Compute daily p-hats and a 95% interval for p. Flag days outside the interval and explain practical reasons.
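For the CTR project, a minimal sketch of the daily-interval idea, using a rough Normal-approximation interval around a pooled baseline rate; all numbers are hypothetical:

```python
# Flagging unusual daily CTRs against a pooled baseline (sketch;
# impressions and clicks are hypothetical).
import numpy as np

impressions = np.array([1200, 1100, 1300, 1250, 1180])
clicks      = np.array([  60,   52,   70,   58,   90])  # last day looks high

p_hat = clicks.sum() / impressions.sum()          # pooled baseline CTR
for n, k in zip(impressions, clicks):
    se = np.sqrt(p_hat * (1 - p_hat) / n)         # SE under the baseline rate
    lo, hi = p_hat - 2 * se, p_hat + 2 * se
    flag = "" if lo <= k / n <= hi else "  <-- outside interval"
    print(f"CTR {k / n:.3f}  expected [{lo:.3f}, {hi:.3f}]{flag}")
```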
Mini challenge
You observe that 60% of minutes have 0 signups, 30% have 1, 10% have 2, and almost none have 3+. Suggest a distribution and estimate its main parameter. Then propose a quick check to validate your choice.
Hint
Rates per minute with many zeros often suggest a Poisson with lambda around the average count per minute; check mean vs variance.
Who this is for
- Entry-level and aspiring Data Analysts who want to interpret data distributions clearly.
- Professionals switching from reporting to analytical modeling.
Prerequisites
- Basic arithmetic and percentages.
- Comfort with averages, variance, and standard deviation.
- Ability to read histograms and bar charts.
Learning path
- Before: Descriptive statistics (mean, median, variance), data types (categorical vs numeric).
- This subskill: Recognize and use basic distributions in EDA.
- After: Sampling and the Central Limit Theorem, hypothesis testing, regression assumptions.
Next steps
- Re-check your last project: Which distributions did you assume implicitly? Were they appropriate?
- Build a small notebook/template to test Poisson vs. overdispersion and to visualize log-normal candidates.
- Take the Quick Test below to check your understanding.