Why this matters
Understanding distributions shows how your data is actually shaped, not just its average. As a Data Analyst, you will:
- Check if a metric is skewed before reporting a mean (e.g., order values, session duration).
- Pick the right summary (median vs mean) and detect outliers quickly.
- Compare groups (A/B variants, regions) to see real shifts, not noise.
- Communicate uncertainty and typical ranges clearly to stakeholders.
Who this is for
Beginner to intermediate analysts who need to visualize data distributions and interpret them for real decisions.
Prerequisites
- Basic descriptive statistics: mean, median, percentiles.
- Comfort loading data into a tool (a spreadsheet, Python, or a SQL client).
- Basic plotting knowledge helps but is not required.
Concept explained simply
A histogram counts how many values fall into each numeric interval (bin). A density plot is a smooth curve estimating how common values are across the range. Both show shape: center, spread, skew, and multimodality (multiple peaks).
Mental model
Imagine lining up all your data points on a number line and grouping them into equal-width boxes. The height of each box shows how many fall there. That’s a histogram. If you drape a flexible ribbon over the tops of those boxes and smooth it, you get a density curve. The ribbon highlights the general pattern without sharp bin edges.
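The box-and-ribbon picture can be sketched in a few lines of plain Python. This is a minimal illustration of the idea, not how plotting libraries implement it. The function names `histogram_counts` and `gaussian_kde`, the sample values, and the bandwidth of 1.0 are all assumptions made for the example.

```python
import math

def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width boxes."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value belongs to the last box, not a box past the end.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts, lo, width

def gaussian_kde(values, bandwidth):
    """Return a function that gives the smoothed density estimate at x."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
    return density

# Hypothetical sample: a handful of session durations in minutes.
sample = [1.0, 2.0, 2.5, 3.0, 3.0, 3.5, 4.0, 4.5, 6.0, 9.0]
counts, lo, width = histogram_counts(sample, n_bins=4)
# Dividing by (n * bin width) rescales the bars so their total area is 1.
bar_density = [c / (len(sample) * width) for c in counts]
ribbon = gaussian_kde(sample, bandwidth=1.0)

print(counts)
print([round(d, 3) for d in bar_density])
print(round(ribbon(3.0), 3), round(ribbon(9.0), 3))
```

The ribbon is highest where the boxes are tallest and falls off toward the isolated value at 9, which is the same story the bars tell without the hard bin edges.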
Key choices: bins and bandwidth
- Histogram bins: too few bins hide detail; too many bins add noise. Try 8–30 bins depending on sample size. For n ~ 200–2000, start with 15–25 bins.
- Density bandwidth: too small a bandwidth shows spiky noise; too large a bandwidth oversmooths and can hide real peaks. Start with the tool's default, then adjust until the main shape is clear without jaggedness.
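If you want a data-driven starting point for the bin count, two common rules of thumb can serve as a cross-check. Sturges' rule and the Freedman-Diaconis rule are not part of this lesson's guidance, and the synthetic data and function names below are assumptions. Treat the output as a first guess to adjust by eye.

```python
import math
import random
import statistics

def sturges_bins(n):
    """Sturges' rule: ceil(log2(n)) + 1 bins; depends only on sample size."""
    return math.ceil(math.log2(n)) + 1

def freedman_diaconis_bins(values):
    """Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3)."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    width = 2 * (q3 - q1) / len(values) ** (1 / 3)
    if width == 0:
        raise ValueError("IQR is zero; pick a bin count by hand")
    return math.ceil((max(values) - min(values)) / width)

random.seed(0)  # fixed seed so the sketch is repeatable
# Assumed synthetic data: 500 right-skewed values.
values = [random.lognormvariate(3.0, 0.6) for _ in range(500)]

print("Sturges:", sturges_bins(len(values)))
print("Freedman-Diaconis:", freedman_diaconis_bins(values))
```

On heavy-tailed data the Freedman-Diaconis count can come out large because a few extreme values stretch the range, so it is worth comparing against the 15–25 bin starting range before committing.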
Practical defaults
- Histogram: start with 20 bins; adjust up/down and pick the clearest story.
- Density: start with the “Scott” or “Silverman” rule (the default in most tools) and tweak slightly.
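To get a feel for what these rules compute, here is a rough sketch of the one-dimensional formulas as SciPy's `gaussian_kde` defines them (sample standard deviation times a factor that shrinks with sample size). Other references attach slightly different constants to the same rule names, so treat the exact numbers as convention-dependent. The data and the `adjust` multiplier are assumptions for illustration.

```python
import random
import statistics

def scott_bandwidth(values):
    """Sample std * n^(-1/5), the 1-D form of SciPy's 'scott' factor."""
    return statistics.stdev(values) * len(values) ** (-1 / 5)

def silverman_bandwidth(values):
    """Sample std * (3n/4)^(-1/5), the 1-D form of SciPy's 'silverman' factor."""
    return statistics.stdev(values) * (len(values) * 3 / 4) ** (-1 / 5)

random.seed(1)  # fixed seed so the sketch is repeatable
# Assumed synthetic data: 400 right-skewed values.
values = [random.lognormvariate(1.5, 0.5) for _ in range(400)]

base = scott_bandwidth(values)
# A multiplier below 1 gives a spikier curve; above 1 gives a smoother one.
for adjust in (0.5, 1.0, 2.0):
    print(f"adjust={adjust}: bandwidth = {base * adjust:.3f}")
print(f"Silverman: {silverman_bandwidth(values):.3f}")
```

Under this convention the two rules differ by only about 6%, which is why "tweak slightly" from either default usually matters more than which rule you start from.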
How to read histograms vs density plots
- Center: where most values cluster (peak area). The median splits the area under the curve in half; in a skewed distribution the mean is pulled away from the peak toward the long tail.
- Spread: how wide the mass of data stretches (IQR for middle 50%).
- Skew: long tail to the right (right-skew) or left (left-skew).
- Peaks: one peak (unimodal) or several peaks (multimodal) may indicate subgroups.
- Outliers: isolated bars/long tails that may need investigation.
Worked examples
Example 1: E-commerce order values
Scenario: 5,000 orders; most are small, a few are very large.
- Plot: Histogram with density overlay.
- Observation: Strong right-skew. Median around $32; mean around $58 due to a tail of high spenders.
- Decision: Report median for “typical order” and show 90th percentile to acknowledge big spenders.
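The gap between median and mean in this example can be reproduced with synthetic data. The sketch below assumes order values are roughly lognormal and picks parameters so the theoretical median is 32 and the theoretical mean is 58; real order data will not follow this shape exactly, and the seed and variable names are arbitrary.

```python
import math
import random
import statistics

random.seed(11)  # fixed seed so the sketch is repeatable
# Assumption: lognormal order values with theoretical median 32 and mean 58.
mu = math.log(32)
sigma = math.sqrt(2 * math.log(58 / 32))
orders = [random.lognormvariate(mu, sigma) for _ in range(5000)]

median = statistics.median(orders)
mean = statistics.fmean(orders)
p90 = statistics.quantiles(orders, n=10)[-1]  # last of the 9 decile cut points

print(f"median = {median:.2f}, mean = {mean:.2f}, p90 = {p90:.2f}")
# In a right-skewed metric most observations sit below the mean.
share_below_mean = sum(1 for v in orders if v < mean) / len(orders)
print(f"share of orders below the mean: {share_below_mean:.0%}")
```

Seeing that well over half of the orders fall below the mean is often the most persuasive argument for reporting the median as the "typical order".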
Example 2: Session duration (minutes)
Scenario: 2,000 sessions.
- Plot: Histogram with 20 bins, density on top.
- Observation: Peak at 3–5 minutes, gradual right tail up to ~40 minutes.
- Decision: Use median duration in dashboards; show distribution to UX to understand typical vs power users.
Example 3: A/B test spend lift (user-level deltas)
Scenario: Distribution of per-user spend change vs baseline.
- Plot: Overlaid densities for Variant A and B.
- Observation: Both right-skewed. B shows a slight overall shift right but larger variance.
- Decision: Report median lift and compare quantiles; caution stakeholders about volatile tail behavior.
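Comparing quantiles side by side is one way to back up the overlaid densities with numbers. The following sketch uses assumed synthetic data (both variants right-skewed, B shifted right with more spread); the distributions, seed, and helper name `key_quantiles` are illustrative choices, not values from a real test.

```python
import random
import statistics

random.seed(7)  # fixed seed so the sketch is repeatable
# Assumed synthetic per-user values for each variant.
variant_a = [random.lognormvariate(1.0, 0.5) for _ in range(2000)]
variant_b = [random.lognormvariate(1.2, 0.7) for _ in range(2000)]

labels = ["p10", "p25", "p50", "p75", "p90"]

def key_quantiles(values):
    """Return p10, p25, p50, p75, p90 from 19 equal-probability cut points."""
    cuts = statistics.quantiles(values, n=20)  # cuts[i] is percentile 5 * (i + 1)
    return [cuts[1], cuts[4], cuts[9], cuts[14], cuts[17]]

qa, qb = key_quantiles(variant_a), key_quantiles(variant_b)
for label, a, b in zip(labels, qa, qb):
    print(f"{label}: A = {a:6.2f}  B = {b:6.2f}  B - A = {b - a:+.2f}")

median_lift = qb[2] - qa[2]
iqr_a, iqr_b = qa[3] - qa[1], qb[3] - qb[1]
print(f"median lift = {median_lift:+.2f}, IQR A = {iqr_a:.2f}, IQR B = {iqr_b:.2f}")
```

When the difference grows from the lower quantiles to the upper ones, the lift is concentrated in the tail, which is the volatile behavior worth flagging to stakeholders.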
How to make these plots in common tools
Spreadsheets (Excel/Google Sheets)
- Select your numeric column.
- Insert → Histogram (or use FREQUENCY/BIN ranges to build manually).
- Adjust the bins in the chart's axis or histogram settings (set either the bin width or the number of bins; Google Sheets calls this the bucket size).
- To approximate density, add a smoothed line on top by computing a moving average of the normalized counts.
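The moving-average approximation from the last step can be prototyped outside the spreadsheet to check the logic. This is a sketch with made-up bin counts and a window of 3 bins; the helper names are hypothetical, and the result is only a rough stand-in for a proper density estimate.

```python
def normalized_counts(counts):
    """Convert raw bin counts to shares that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]

def moving_average(series, window=3):
    """Centered moving average; edge bins average only the neighbors that exist."""
    half = window // 2
    smoothed = []
    for i in range(len(series)):
        chunk = series[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

counts = [2, 9, 14, 7, 4, 1, 0, 1]  # hypothetical bin counts
share = normalized_counts(counts)
smooth = moving_average(share, window=3)

for c, s, m in zip(counts, share, smooth):
    print(f"count={c:>2}  share={s:.3f}  smoothed={m:.3f}")
```

In a spreadsheet the same thing is an AVERAGE over each bin's share and its two neighbors. A wider window behaves like a larger bandwidth: smoother, but more likely to flatten a real peak.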
Python (pandas + seaborn/matplotlib)
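import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# df['value'] is your numeric series; 'your_data.csv' is a placeholder file name
df = pd.read_csv('your_data.csv')
# kde=True overlays the density curve on the histogram bars
ax = sns.histplot(df['value'], bins=20, stat='density', kde=True, color='#4e79a7')
ax.set_xlabel('Value')
ax.set_ylabel('Density')
plt.show()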
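Notes: stat='density' scales the histogram so its total area is 1, which puts it on the same scale as the density curve. Change the bar detail with bins; change the curve's smoothness with bw_adjust, passed as kde_kws={'bw_adjust': 1.2} in sns.histplot or directly in sns.kdeplot(df['value'], bw_adjust=1.2).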
SQL (binning counts)
-- Generic binning by width (example: 5-unit bins)
-- Divide by 5.0 so integer columns are not truncated by integer division
SELECT FLOOR(value / 5.0) * 5 AS bin_start,
COUNT(*) AS n
FROM your_table
WHERE value IS NOT NULL
GROUP BY 1
ORDER BY 1;
Export the results and plot them as a column chart. To overlay two groups, compute the counts for each group, then divide each bin's count by that group's total so you compare shapes rather than sample sizes.
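The same binning and per-group normalization can be sketched in Python, which is handy for checking the query's output on a small extract. The two groups below are made-up values, and `bin_start` and `relative_frequencies` are hypothetical helper names that mirror the SQL expression.

```python
import math
from collections import Counter

def bin_start(value, width=5):
    """Mirror FLOOR(value / width) * width from the SQL query."""
    return math.floor(value / width) * width

def relative_frequencies(values, width=5):
    """Share of non-null values per bin, keyed by bin start in ascending order."""
    counts = Counter(bin_start(v, width) for v in values if v is not None)
    total = sum(counts.values())
    return {b: counts[b] / total for b in sorted(counts)}

# Hypothetical groups with different sample sizes; None stands in for NULL.
group_a = [1, 3, 4, 6, 7, 7, 8, 12, None]
group_b = [2, 6, 8, 9, 11, 13, 14, 14, 16, 18, 21, 22]

freq_a = relative_frequencies(group_a)
freq_b = relative_frequencies(group_b)
for b in sorted(set(freq_a) | set(freq_b)):
    print(f"bin {b:>2}: A = {freq_a.get(b, 0):.3f}  B = {freq_b.get(b, 0):.3f}")
```

Because each group's shares sum to 1, the two columns can be charted on one axis even though group B has more rows.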
Common mistakes and self-check
- Mistake: Using mean only on a skewed metric. Fix: Show distribution; add median and percentiles.
- Mistake: Too few or too many bins. Fix: Try several; pick the one that makes the primary shape clear without jagged noise.
- Mistake: Comparing groups of different sizes using raw counts. Fix: Use density/relative frequency to compare shapes fairly.
- Mistake: Ignoring outliers. Fix: Plot with a capped x-axis so the bulk of the data stays readable, and report the number of values beyond the cap separately.
- Mistake: Over-interpreting tiny bumps. Fix: Confirm with sample size; smooth or aggregate if needed.
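For the outlier fix, one possible approach is to cap the plotted range at a high percentile and report how many values fall beyond it. The 99th percentile, the synthetic data, and the helper name `cap_for_plot` are assumptions for this sketch; choose the cap that suits your metric.

```python
import random
import statistics

def cap_for_plot(values, percentile=99):
    """Split values at a percentile cap; return kept values, the cap, and the count beyond it."""
    cuts = statistics.quantiles(values, n=100)  # 99 cut points: p1 .. p99
    cap = cuts[percentile - 1]
    kept = [v for v in values if v <= cap]
    return kept, cap, len(values) - len(kept)

random.seed(3)  # fixed seed so the sketch is repeatable
# Assumed heavy-tailed synthetic data.
values = [random.lognormvariate(3.0, 1.0) for _ in range(1000)]

kept, cap, n_beyond = cap_for_plot(values, percentile=99)
print(f"x-axis capped at {cap:.1f}; {n_beyond} of {len(values)} values lie beyond the cap")
print(f"largest value overall: {max(values):.1f}")
```

Plot `kept` and put the second sentence of the output in the chart caption, so readers know the tail exists even though it is not drawn.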
Self-check checklist
- Did I label axes with units?
- Is the typical value (median) clear?
- Is the bin width/bandwidth reasonable after trying alternatives?
- Did I annotate any notable peaks, tails, or outliers?
- If comparing groups, did I normalize and use the same axis?
Exercises
Do these hands-on tasks. Then take the Quick Test.
Exercise 1: Build a histogram + density for session duration
Dataset (synthetic): 120 values in minutes. Approx pattern: 10 values near 0.5–1.5, 70 values in 2–8, 35 values in 8–20, 5 values in 20–45.
- Task: Plot a histogram with 15–25 bins and overlay a density curve. Report median and 90th percentile.
- Deliverable: A plot image and 2 numbers (median, p90).
Need a hint?
- Ensure stat='density' when overlaying density in Python.
- In spreadsheets, adjust bin width until the main peak is clear.
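If you need concrete numbers to work with, the approximate pattern above can be turned into a dataset along these lines. Drawing uniformly within each range is an assumption of this sketch (the exercise only gives the rough counts per range), and the seed is arbitrary.

```python
import random
import statistics

random.seed(42)  # fixed seed so everyone gets the same 120 values
# (count, low, high) for each range in the exercise's approximate pattern.
# Uniform draws within each range are an assumption of this sketch.
segments = [(10, 0.5, 1.5), (70, 2.0, 8.0), (35, 8.0, 20.0), (5, 20.0, 45.0)]
durations = [
    round(random.uniform(low, high), 1)
    for count, low, high in segments
    for _ in range(count)
]
random.shuffle(durations)  # mix the ranges so row order carries no information

print(len(durations), min(durations), max(durations))
```

From here, `statistics.median(durations)` and `statistics.quantiles(durations, n=10)[-1]` give the two numbers for the deliverable, and the list can be pasted into a spreadsheet or loaded into a pandas column for plotting.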
Exercise 2: Compare two distributions (pre vs post)
Dataset (synthetic): Pre: mean ~7, right-skew; Post: slight shift right, more spread. Sample sizes: 400 each.
- Task: Plot overlaid densities or side-by-side histograms with the same bins. State whether the typical value improved and whether variance changed.
- Deliverable: A comparative plot and a 2–3 sentence interpretation.
Need a hint?
- Normalize to densities to compare shapes.
- Report median and IQR for robustness.
Checklist to submit
- Axes labeled with units.
- Bin/bandwidth choices justified.
- Median/IQR or percentiles reported.
- Clear conclusion: shift, no shift, or inconclusive.
Mini challenge
You have two product categories with revenue per order. Category A shows two peaks; Category B shows a single peak with a long right tail. In 3 bullet points, explain what this suggests about customer segments and how you’d present typical revenue to stakeholders for each category.
Tip
Multimodality often signals subgroups (e.g., low-cost vs premium). Right-skew suggests median and upper percentiles tell a clearer story than the mean.
Practical projects
- Customer spend distribution dashboard: histogram + density, median, p90, and outlier notes.
- Engagement shape report: compare session durations by acquisition channel with normalized densities.
- A/B distribution shift study: visualize control vs variant; report median shift and changes in IQR.
Learning path
- Distributions (this lesson): shapes, skew, peaks, outliers.
- Boxplots and percentiles: fast comparisons across many groups.
- Robust summaries: median, IQR, trimmed mean, Winsorization.
- Transformation tactics: log scale for heavy tails; when and why.
- Inference basics: how distribution shape affects tests and confidence intervals.
Next steps
- Recreate two plots from recent work data and present 1-slide insights.
- Adopt a standard: always pair a key metric with its distribution in reports.
- Move on to boxplots and outlier detection to compare many groups quickly.