Distributions, Histograms, and Density Plots for Data Analysts

Why this matters

A distribution shows how your data is actually shaped, not just its average. As a Data Analyst, you will:

Check if a metric is skewed before reporting a mean (e.g., order values, session duration).
Pick the right summary (median vs mean) and detect outliers quickly.
Compare groups (A/B variants, regions) to see real shifts, not noise.
Communicate uncertainty and typical ranges clearly to stakeholders.

Who this is for

Beginner to intermediate analysts who need to visualize data distributions and interpret them for real decisions.

Prerequisites

Basic descriptive statistics: mean, median, percentiles.
Comfort loading data into a tool (spreadsheets, Python, or SQL client).
Basic plotting knowledge helps but is not required.

Concept explained simply

A histogram counts how many values fall into each numeric interval (bin). A density plot is a smooth curve estimating how common values are across the range. Both show shape: center, spread, skew, and multimodality (multiple peaks).
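
To make the two definitions concrete, here is a minimal NumPy sketch that counts values per bin and then builds a density curve by averaging small Gaussian bumps centred on each data point. The sample is synthetic, and the names values and kde_curve, as well as the rule-of-thumb bandwidth, are illustrative assumptions rather than part of any particular tool.

```python
import numpy as np

# Synthetic right-skewed sample, for illustration only
rng = np.random.default_rng(0)
values = rng.lognormal(mean=3.0, sigma=0.6, size=500)

# Histogram: count how many values fall into each of 20 equal-width bins
counts, edges = np.histogram(values, bins=20)

# Rescale counts so the bar areas sum to 1 (a density-scaled histogram)
widths = np.diff(edges)
heights = counts / (counts.sum() * widths)


def kde_curve(grid, data, bandwidth):
    """Gaussian kernel density estimate evaluated at each grid point."""
    z = (grid[:, None] - data[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return bumps.sum(axis=1) / (len(data) * bandwidth)


# Rule-of-thumb starting bandwidth: sample std times n^(-1/5)
bandwidth = values.std(ddof=1) * len(values) ** (-1 / 5)
grid = np.linspace(values.min(), values.max(), 200)
density = kde_curve(grid, values, bandwidth)

print(counts.sum(), round(float((heights * widths).sum()), 6))
```

The histogram heights and the density curve end up on the same vertical scale, which is why the two can be overlaid on one chart.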

Mental model

Imagine lining up all your data points on a number line and grouping them into equal-width boxes. The height of each box shows how many fall there. That’s a histogram. If you drape a flexible ribbon over the tops of those boxes and smooth it, you get a density curve. The ribbon highlights the general pattern without sharp bin edges.

Key choices: bins and bandwidth

Histogram bins: too few bins hide detail; too many bins add noise. Try 8–30 bins depending on sample size. For n ~ 200–2000, start with 15–25 bins.
Density bandwidth: too small a bandwidth shows spiky noise; too large a bandwidth oversmooths. Start with the default, then adjust until the main shape is clear without jaggedness.

Practical defaults

Histogram: start with 20 bins; adjust up or down and keep the setting that shows the main shape most clearly.
Density: start with Scott's or Silverman's rule (the default in most tools) and tweak slightly.
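
As a rough illustration of where such defaults come from, the sketch below asks NumPy how many bins three built-in rules would suggest, and computes two rule-of-thumb bandwidths in the form SciPy's gaussian_kde uses for one-dimensional data. The sample is synthetic, and the formulas are starting points, not recommendations for any specific dataset.

```python
import numpy as np

rng = np.random.default_rng(1)
values = rng.lognormal(mean=1.5, sigma=0.7, size=1000)  # synthetic sample

# Number of bins suggested by three rules built into NumPy
bin_counts = {
    rule: len(np.histogram_bin_edges(values, bins=rule)) - 1
    for rule in ("sturges", "scott", "fd")
}

# Rule-of-thumb KDE bandwidths for 1-D data (std times a factor of n)
n = len(values)
sigma = values.std(ddof=1)
bandwidths = {
    "scott": sigma * n ** (-1 / 5),
    "silverman": sigma * (3 * n / 4) ** (-1 / 5),
}

print(bin_counts)
print({name: round(float(bw), 3) for name, bw in bandwidths.items()})
```

Note that NumPy's "scott" bin rule and the Scott bandwidth rule are different formulas that share a name. On heavy-tailed data the rules can disagree widely on bin count, which is one reason to try a few settings by eye.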

How to read histograms vs density plots

Center: where most values cluster (peak area). The median splits the area in half; the mean is pulled toward the longer tail.
Spread: how wide the mass of data stretches (the IQR covers the middle 50%).
Skew: long tail to the right (right-skew) or left (left-skew).
Peaks: one peak (unimodal) or several peaks (multimodal) may indicate subgroups.
Outliers: isolated bars/long tails that may need investigation.
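
The reading guide above can be backed with a few numbers. The following sketch computes center, spread, and a skew direction for a sample; describe_shape is a hypothetical helper, the data is synthetic, and the 0.5 skewness cut-off is an assumed convention rather than a fixed rule.

```python
import numpy as np


def describe_shape(values):
    """Return centre, spread, and skew direction for a 1-D numeric sample."""
    x = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    mean = x.mean()
    # Sample skewness: third central moment over the cubed standard deviation
    centred = x - mean
    skewness = (centred**3).mean() / (centred**2).mean() ** 1.5
    if skewness > 0.5:  # assumed threshold
        direction = "right"
    elif skewness < -0.5:
        direction = "left"
    else:
        direction = "roughly symmetric"
    return {
        "mean": mean,
        "median": median,
        "iqr": q3 - q1,
        "skewness": skewness,
        "direction": direction,
    }


rng = np.random.default_rng(2)
order_values = rng.lognormal(mean=3.5, sigma=0.9, size=5000)  # synthetic
print(describe_shape(order_values))
```

For a right-skewed sample like this one, the mean lands above the median, which matches what the long right tail on the plot suggests.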

Worked examples

Example 1: E-commerce order values

Scenario: 5,000 orders; most are small, a few are very large.

Plot: Histogram with density overlay.
Observation: Strong right-skew. Median around $32; mean around $58 due to a tail of high spenders.
Decision: Report median for “typical order” and show 90th percentile to acknowledge big spenders.

Example 2: Session duration (minutes)

Scenario: 2,000 sessions.

Plot: Histogram with 20 bins, density on top.
Observation: Peak at 3–5 minutes, gradual right tail up to ~40 minutes.
Decision: Use median duration in dashboards; show distribution to UX to understand typical vs power users.

Example 3: A/B test spend lift (user-level deltas)

Scenario: Distribution of per-user spend change vs baseline.

Plot: Overlaid densities for Variant A and B.
Observation: Both right-skewed. B shows a slight overall shift right but larger variance.
Decision: Report median lift and compare quantiles; caution stakeholders about volatile tail behavior.
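
Comparing quantiles, as in the decision above, could be sketched like this. The two samples are synthetic stand-ins (gamma draws, not real A/B data), and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic stand-ins: B is shifted slightly right and has more spread
variant_a = rng.gamma(shape=2.0, scale=3.0, size=2000)
variant_b = rng.gamma(shape=2.0, scale=3.6, size=2000)

# Compare the same percentiles in both groups, not just the means
probs = [10, 25, 50, 75, 90]
q_a = np.percentile(variant_a, probs)
q_b = np.percentile(variant_b, probs)

for p, a, b in zip(probs, q_a, q_b):
    print(f"p{p}: A={a:.2f}  B={b:.2f}  diff={b - a:+.2f}")

# IQR = 75th percentile minus 25th percentile
iqr_a = q_a[3] - q_a[1]
iqr_b = q_b[3] - q_b[1]
print(f"IQR: A={iqr_a:.2f}  B={iqr_b:.2f}")
```

If the differences grow toward the upper percentiles, most of the apparent lift sits in the tail, which is the volatile behaviour worth flagging to stakeholders.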

How to make these plots in common tools

Spreadsheets (Excel/Google Sheets)

Select your numeric column.
Insert → Histogram (or use FREQUENCY/BIN ranges to build manually).
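Adjust bins: in Excel, right-click the horizontal axis → Format Axis → Bins (set bin width or number of bins); in Google Sheets, use the chart editor's Customize → Histogram → Bucket size.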
To approximate density, add a smoothed line on top by computing a moving average of the normalized counts.

Python (pandas + seaborn/matplotlib)

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# df is assumed to be a pandas DataFrame with a numeric 'value' column
ax = sns.histplot(df['value'], bins=20, stat='density', kde=True, color='#4e79a7')
ax.set_xlabel('Value')
ax.set_ylabel('Density')
plt.show()

Notes: stat='density' scales the histogram to area 1 so it matches the density curve. Adjust bins as needed. To change the KDE bandwidth, pass kde_kws={'bw_adjust': 1.2} to histplot, or draw the curve separately with sns.kdeplot(df['value'], bw_adjust=1.2).

SQL (binning counts)

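-- Generic binning by width (example: 5-unit bins)
-- Dividing by 5.0 avoids integer division, which mis-bins negative integers in some dialects
SELECT FLOOR(value / 5.0) * 5 AS bin_start,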
       COUNT(*) AS n
FROM your_table
WHERE value IS NOT NULL
GROUP BY 1
ORDER BY 1;

Export the results and plot them as a column chart. To overlay two groups, compute counts per group, then divide each count by that group's total so the shapes are comparable.
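
For analysts who pull raw rows into Python instead, the same floor-based binning and normalization might look like the sketch below. It uses NumPy only; binned_shares is a hypothetical helper name, and the 5-unit width mirrors the SQL example.

```python
import numpy as np


def binned_shares(values, width=5):
    """Map each bin start to the share of non-missing values in that bin."""
    x = np.asarray(values, dtype=float)
    x = x[~np.isnan(x)]  # same role as WHERE value IS NOT NULL
    bin_start = np.floor(x / width) * width
    starts, counts = np.unique(bin_start, return_counts=True)
    shares = counts / counts.sum()  # normalize by the group's total
    return dict(zip(starts.tolist(), shares.tolist()))


group_a = [1, 2, 6, 7, 8, 12, float("nan")]
print(binned_shares(group_a))
```

Calling the helper once per group gives shares that can be plotted on the same axis even when the groups differ in size.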

Common mistakes and self-check

Mistake: Using mean only on a skewed metric. Fix: Show distribution; add median and percentiles.
Mistake: Too few or too many bins. Fix: Try several; pick the setting that makes the primary shape clear without jagged noise.
Mistake: Comparing groups of different sizes using raw counts. Fix: Use density/relative frequency to compare shapes fairly.
Mistake: Ignoring outliers. Fix: Plot with a capped x-axis and report the number of values beyond the cap separately.
Mistake: Over-interpreting tiny bumps. Fix: Confirm with sample size; smooth or aggregate if needed.
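
One way to handle the outlier fix is to split the sample before plotting. This is a minimal sketch: cap_for_plot is a hypothetical helper, and the 99th-percentile cap is an assumed default to adjust per metric.

```python
import numpy as np


def cap_for_plot(values, upper_quantile=0.99):
    """Split a sample into values to plot and a count of values beyond the cap."""
    x = np.asarray(values, dtype=float)
    cap = np.quantile(x, upper_quantile)
    kept = x[x <= cap]
    n_beyond = int((x > cap).sum())
    return kept, cap, n_beyond


kept, cap, n_beyond = cap_for_plot(np.arange(1, 101))
print(f"plotting {len(kept)} values up to {cap:.2f}; {n_beyond} beyond the cap")
```

Putting the count of excluded values in the chart caption keeps the outliers visible to readers even though they are off the axis.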

Self-check checklist

Did I label axes with units?
Is the typical value (median) clear?
Is the bin width/bandwidth reasonable after trying alternatives?
Did I annotate any notable peaks, tails, or outliers?
If comparing groups, did I normalize and use the same axis?

Exercises

Do these hands-on tasks.

Exercise 1: Build a histogram + density for session duration

Dataset (synthetic): 120 values in minutes. Approx pattern: 10 values near 0.5–1.5, 70 values in 2–8, 35 values in 8–20, 5 values in 20–45.

Task: Plot a histogram with 15–25 bins and overlay a density curve. Report median and 90th percentile.
Deliverable: A plot image and 2 numbers (median, p90).

Need a hint?

Ensure stat='density' when overlaying density in Python.
In spreadsheets, adjust bin width until the main peak is clear.
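
If you need data to work with, one possible way to generate a sample matching the described pattern is shown below. Drawing uniformly within each range is an assumption; the exercise only specifies how many values fall in each range.

```python
import numpy as np

rng = np.random.default_rng(42)
# Uniform draws within each described range (assumed shape inside each range)
durations = np.concatenate([
    rng.uniform(0.5, 1.5, size=10),
    rng.uniform(2, 8, size=70),
    rng.uniform(8, 20, size=35),
    rng.uniform(20, 45, size=5),
])
rng.shuffle(durations)  # mix the groups so row order carries no information

print(len(durations), durations.min().round(2), durations.max().round(2))
```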

Exercise 2: Compare two distributions (pre vs post)

Dataset (synthetic): Pre: mean ~7, right-skew; Post: slight shift right, more spread. Sample sizes: 400 each.

Task: Plot overlaid densities or side-by-side histograms with the same bins. State whether the typical value improved and whether variance changed.
Deliverable: A comparative plot and a 2–3 sentence interpretation.

Need a hint?

Normalize to densities to compare shapes.
Report median and IQR for robustness.
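
A possible starting dataset for this exercise, with shared bin edges already computed, is sketched below. The gamma distribution and its parameters are assumptions chosen to roughly match the description (right-skew, mean near 7, slightly shifted and wider afterwards).

```python
import numpy as np

rng = np.random.default_rng(7)
# Gamma draws are right-skewed; mean = shape * scale
pre = rng.gamma(shape=2.0, scale=3.5, size=400)   # mean about 7
post = rng.gamma(shape=2.0, scale=4.0, size=400)  # mean about 8, wider spread

# Shared bin edges so side-by-side histograms are directly comparable
edges = np.histogram_bin_edges(np.concatenate([pre, post]), bins=20)
pre_density, _ = np.histogram(pre, bins=edges, density=True)
post_density, _ = np.histogram(post, bins=edges, density=True)

print(len(pre), len(post), len(edges) - 1)
```

Computing the edges from the combined sample guarantees both histograms use identical bins, so differences in bar height reflect the data and not the binning.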

Checklist to submit

Axes labeled with units.
Bin/bandwidth choices justified.
Median/IQR or percentiles reported.
Clear conclusion: shift, no shift, or inconclusive.

Mini challenge

You have two product categories with revenue per order. Category A shows two peaks; Category B shows a single peak with a long right tail. In 3 bullet points, explain what this suggests about customer segments and how you’d present typical revenue to stakeholders for each category.

Tip

Multimodality often signals subgroups (e.g., low-cost vs premium). Right-skew suggests median and upper percentiles tell a clearer story than the mean.

Practical projects

Customer spend distribution dashboard: histogram + density, median, p90, and outlier notes.
Engagement shape report: compare session durations by acquisition channel with normalized densities.
A/B distribution shift study: visualize control vs variant; report median shift and changes in IQR.

Learning path

Distributions (this lesson): shapes, skew, peaks, outliers.
Boxplots and percentiles: fast comparisons across many groups.
Robust summaries: median, IQR, trimmed mean, Winsorization.
Transformation tactics: log scale for heavy tails; when and why.
Inference basics: how distribution shape affects tests and confidence intervals.

Next steps

Recreate two plots from recent work data and present the insights on one slide.
Adopt a standard: always pair a key metric with its distribution in reports.
Move on to boxplots and outlier detection to compare many groups quickly.
