EDA Visualizations: Distributions and Relationships

Why this matters

As a Data Scientist, you constantly answer questions like: Is the target skewed? Which features correlate with each other? Are there segments behaving differently? Fast, clear visual explorations of distributions and relationships help you catch data issues early, find signals for modeling, and explain insights to teammates and stakeholders.

Spot data quality issues: impossible values, heavy tails, multimodal shapes.
Guide feature engineering: transformations (log), binning, interaction terms.
De-risk models: find leakage, confounders, or dominant segments.
Communicate findings: visuals are memorable and actionable.

Who this is for

Current and aspiring Data Scientists who need reliable EDA habits.
Analysts and ML engineers validating features and datasets.
Anyone preparing data stories for product, ops, or leadership.

Prerequisites

Comfort with basic statistics: mean, median, variance, percentiles.
Basic plotting in Python (pandas, seaborn, matplotlib) or a similar tool.
Ability to load and clean data (missing values, types).

Learning path

Master univariate distributions (histogram, KDE, ECDF, box/violin).
Explore relationships between two variables (scatter, hexbin, 2D KDE).
Compare groups (box/violin/ridge, small multiples, faceting).
Handle scale, skew, and overplotting (log scales, bin width, alpha).
Build a repeatable EDA workflow with notes and saved figures.

Concept explained simply

A distribution shows how the values of one variable are spread. A relationship shows how two variables move together. We visualize distributions to see shape, center, spread, and outliers. We visualize relationships to detect trends, clusters, and interactions.

Mental model: S-S-S (Shape, Spread, Segments)

Shape: Is it symmetric, skewed, multi-peaked?
Spread: How wide is the range? Are tails heavy?
Segments: Do subgroups differ (by region, plan, cohort)?

Core plot types you will use 90% of the time

Univariate (numeric)

Histogram: count of values in bins. Control bin width for clarity.
Density (KDE, kernel density estimate): smoothed curve of the distribution; good for comparisons.
ECDF (empirical cumulative distribution function): cumulative proportion up to each value; great for tails and medians.
Box/Violin: compact summary across categories; violin shows shape.
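
To make the ECDF concrete, here is a minimal sketch that computes one by hand with NumPy. The `ecdf` helper and the synthetic lognormal sample are assumptions made for illustration; they are not part of any dataset used in this lesson.

```python
import numpy as np

def ecdf(values):
    """Return sorted values and the cumulative proportion at each one."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Hypothetical right-skewed sample (synthetic, for illustration only).
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=3.0, sigma=1.0, size=1_000)

x, y = ecdf(sample)

# Median: the first value at which the ECDF reaches 0.5.
median_from_ecdf = x[np.searchsorted(y, 0.5)]

# Tail share: proportion of observations above an arbitrary threshold.
share_above_100 = 1.0 - np.searchsorted(x, 100.0, side="right") / len(x)

print(f"median ~ {median_from_ecdf:.1f}, share above 100 = {share_above_100:.3f}")
```

Reading a percentile or a tail share is a single lookup on the curve, which is why the ECDF is handy for skewed data where a histogram hides the tail.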

Categorical counts

Bar chart: counts or proportions per category. Prefer proportions when group sizes differ widely.
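
A small sketch of why proportions matter when group sizes differ. The data frame below is made up for illustration (two hypothetical regions, one nine times larger than the other).

```python
import pandas as pd

# Hypothetical data: North has 900 orders, South has 100.
df = pd.DataFrame({
    "region": ["North"] * 900 + ["South"] * 100,
    "is_expedited": [True] * 90 + [False] * 810 + [True] * 40 + [False] * 60,
})

# Raw counts favor the larger group.
counts = df.groupby("region")["is_expedited"].sum()

# Proportions within each region make the groups comparable.
rates = df.groupby("region")["is_expedited"].mean()

print(counts)
print(rates)
```

Here North has more expedited orders in absolute terms, yet South has the higher expedited rate; a bar chart of `rates` tells the comparable story.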

Two numeric variables

Scatter plot: the default; add transparency (alpha) and jitter if overlapping.
Hexbin/2D density: for large datasets with heavy overlap.
Line plot: if x is ordered (time or sequence).
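
Jitter is easiest to understand as small random noise added to a discrete variable before plotting. The sketch below assumes a hypothetical integer `items_count` column and a helper named `jitter`; both are illustrative.

```python
import numpy as np

def jitter(values, width=0.2, rng=None):
    """Add uniform noise in [-width, width] so stacked points spread out."""
    if rng is None:
        rng = np.random.default_rng()
    values = np.asarray(values, dtype=float)
    return values + rng.uniform(-width, width, size=len(values))

rng = np.random.default_rng(11)
items_count = rng.integers(1, 6, size=2_000)   # synthetic: only 5 distinct values
items_jittered = jitter(items_count, width=0.2, rng=rng)

# Plot items_jittered on the x-axis instead of items_count;
# keep the original column for any calculations.
print(len(np.unique(items_count)), "distinct x positions before jitter")
print(len(np.unique(items_jittered)), "distinct x positions after jitter")
```

Keep the jitter width well below the spacing between neighboring values so points never appear to belong to the wrong value.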

Numeric vs categorical

Box, violin, ridge plots: compare distributions across categories.
Strip/Swarm: show individual points for small datasets.

Multivariate

Colored/Faceted scatter: segment by hue or small multiples.
Pair plot: quick overview of many numeric relationships at once.
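
A pair plot can be complemented with a numeric screen. The sketch below uses a Spearman rank correlation table from pandas, which is a swapped-in technique rather than something this lesson prescribes; the column names and synthetic data are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
n = 2_000
x = rng.normal(size=n)
df = pd.DataFrame({
    "x": x,
    "y_monotone": np.exp(x) + rng.normal(scale=0.1, size=n),  # nonlinear but monotone
    "noise": rng.normal(size=n),                               # unrelated column
})

# Spearman works on ranks, so it picks up monotone nonlinear relationships.
corr = df.corr(method="spearman")

# List each pair once, strongest absolute correlation first.
cols = list(corr.columns)
pairs = sorted(
    ((a, b, corr.loc[a, b]) for i, a in enumerate(cols) for b in cols[i + 1:]),
    key=lambda t: abs(t[2]),
    reverse=True,
)
for a, b, r in pairs:
    print(f"{a} vs {b}: {r:+.2f}")
```

Treat the ranking as a shortlist of pairs to plot, not as a replacement for looking at them: a correlation near zero can still hide a clear non-monotone pattern.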

Quick defaults that work well

Scatter with alpha=0.3 for n > 5,000; switch to hexbin above ~50,000 points.
Histogram: try 30 bins; adjust so the shape is clear without spikiness.
Use a log scale for heavy right skew (positive values spanning several orders of magnitude).
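
The 30-bin default is only a starting point. As one possible cross-check, NumPy can suggest a bin count from the data using the Freedman-Diaconis rule; the synthetic sample below is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
values = rng.normal(loc=50.0, scale=10.0, size=5_000)  # synthetic sample

# Fixed default: 30 equal-width bins across the data range.
fixed_edges = np.histogram_bin_edges(values, bins=30)

# Data-driven alternative: bin width based on the IQR and sample size.
fd_edges = np.histogram_bin_edges(values, bins="fd")

print(len(fixed_edges) - 1, "bins (fixed)")
print(len(fd_edges) - 1, "bins (Freedman-Diaconis)")
```

If the two bin counts differ a lot, plot both: a shape that only appears at one setting deserves a second look before you interpret it.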

Choosing the right plot (decision guide)

One variable? Use histogram/KDE/ECDF. Need compact summary across groups? Box/violin.
Two numeric variables? Start scatter; if overplotting, use alpha or hexbin.
One numeric, one categorical? Box/violin; add swarm for sample-size context.
Many variables? Pair plot for quick screening, then focus on key pairs.
Skewed or wide-range data? Consider log scales before interpreting.
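
The decision guide above can be sketched as a small lookup function. The name `suggest_plot`, its arguments, and the returned labels are hypothetical; the 50,000-point threshold is the rule of thumb from the quick defaults, not a hard limit.

```python
def suggest_plot(x_kind, y_kind=None, n_points=0, hexbin_threshold=50_000):
    """Suggest a starting plot from variable kinds: 'numeric' or 'categorical'."""
    valid = {"numeric", "categorical"}
    if x_kind not in valid or (y_kind is not None and y_kind not in valid):
        raise ValueError("kinds must be 'numeric' or 'categorical'")

    # One variable
    if y_kind is None:
        return "histogram/KDE/ECDF" if x_kind == "numeric" else "bar chart"

    kinds = {x_kind, y_kind}
    # Two numeric variables: scatter first, hexbin when overplotting is likely
    if kinds == {"numeric"}:
        return "hexbin" if n_points > hexbin_threshold else "scatter with alpha"
    # One numeric, one categorical
    if kinds == {"numeric", "categorical"}:
        return "box/violin"
    # Two categorical variables are outside the guide above
    raise ValueError("two categorical variables are not covered by this guide")

print(suggest_plot("numeric"))
print(suggest_plot("numeric", "numeric", n_points=80_000))
print(suggest_plot("categorical", "numeric"))
```

The function only picks a first plot; checking skew and switching to log scales still happens after you see the result.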

Special cases and fixes

Too many categories: aggregate, rank top N + "Other", or facet.
Unequal sample sizes: prefer box/violin or ECDF over raw-count histograms.
Nonlinear trend: add a LOWESS smoother instead of a linear fit.
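
For the "top N + Other" fix, one minimal approach with pandas is sketched below. The helper name `lump_top_n` and the toy category labels are assumptions for illustration.

```python
import pandas as pd

def lump_top_n(series, n=5, other_label="Other"):
    """Keep the n most frequent categories and relabel the rest."""
    top = series.value_counts().nlargest(n).index
    return series.where(series.isin(top), other_label)

# Toy data: five categories with very different frequencies.
cats = pd.Series(["a"] * 50 + ["b"] * 30 + ["c"] * 10 + ["d"] * 6 + ["e"] * 4)

lumped = lump_top_n(cats, n=2)
print(lumped.value_counts())
```

Plot `lumped` instead of the raw column, and report how much of the data ended up in "Other" so readers can judge what was hidden.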

Worked examples

Example 1: Customer spend is skewed

Goal: understand the distribution of monthly_spend to decide on transformation.

# Python (pandas + seaborn)
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style="whitegrid")
df = pd.read_csv("customers.csv")  # expects a numeric monthly_spend column
spend = df["monthly_spend"].dropna()

fig, axes = plt.subplots(1, 3, figsize=(13, 4))
# Histogram on the raw scale
sns.histplot(spend, bins=30, ax=axes[0])
axes[0].set_title("Histogram: monthly_spend")
# ECDF: read the median and tail percentiles directly
sns.ecdfplot(spend, ax=axes[1])
axes[1].set_title("ECDF: monthly_spend")
# Log scale: log_scale=True computes the bins in log space (positive values only)
sns.histplot(spend[spend > 0], bins=30, log_scale=True, ax=axes[2])
axes[2].set_title("Histogram (log x)")
plt.tight_layout()

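Interpretation: heavy right tail. If the top 10% of customers appear to contribute most of the revenue, confirm it by computing their share of total spend rather than reading it off the plot.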
Action: use log(monthly_spend + 1) for modeling to stabilize variance.
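
Before committing to the transformation, it helps to quantify what the plots suggest. The sketch below uses a synthetic lognormal `monthly_spend` series (an assumption, since `customers.csv` is not available here) to compare skewness before and after `log1p` and to compute the revenue share of the top 10% of customers.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df["monthly_spend"]; replace with the real column.
rng = np.random.default_rng(42)
monthly_spend = pd.Series(rng.lognormal(mean=4.0, sigma=1.0, size=10_000))

# Skewness near 0 means roughly symmetric; large positive means a long right tail.
raw_skew = monthly_spend.skew()
log_skew = np.log1p(monthly_spend).skew()   # log1p(x) = log(x + 1), safe at x = 0

# Revenue share of the top 10% of customers by spend.
threshold = monthly_spend.quantile(0.9)
top10_share = monthly_spend[monthly_spend >= threshold].sum() / monthly_spend.sum()

print(f"skew raw = {raw_skew:.2f}, skew log1p = {log_skew:.2f}")
print(f"top 10% revenue share = {top10_share:.1%}")
```

On real data the numbers will differ; the point is to back each visual impression with one number you can write next to the plot.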

Example 2: Relationship with overplotting

Goal: visualize price vs. size for housing; handle 80k points.

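# df: housing DataFrame with size_sqft and price columns (about 80k rows)
fig, axes = plt.subplots(1, 2, figsize=(12,4))
# Scatter with alpha
sns.scatterplot(data=df, x="size_sqft", y="price", alpha=0.2, ax=axes[0])
axes[0].set_title("Scatter (alpha=0.2)")
# Hexbin (matplotlib): color encodes the number of points per hexagon
hb = axes[1].hexbin(df["size_sqft"], df["price"], gridsize=40, cmap="viridis", mincnt=1)
fig.colorbar(hb, ax=axes[1], label="count")
axes[1].set_xlabel("size_sqft"); axes[1].set_ylabel("price")
axes[1].set_title("Hexbin")
plt.tight_layout()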

Interpretation: positive association; hexbin reveals dense band.
Action: consider log scales; add region as hue or facet to check confounding.
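
One way to follow up on the log-scale suggestion is a hexbin drawn directly on log-log axes. This is a sketch on synthetic housing-like data (the generated `size_sqft` and `price` arrays are assumptions); it relies on the `xscale`, `yscale`, `bins`, and `mincnt` arguments of matplotlib's `hexbin`.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for 80k housing records.
rng = np.random.default_rng(7)
size_sqft = rng.lognormal(mean=7.3, sigma=0.4, size=80_000)
price = 150.0 * size_sqft * rng.lognormal(mean=0.0, sigma=0.3, size=80_000)

fig, ax = plt.subplots(figsize=(6, 4))
hb = ax.hexbin(
    size_sqft, price,
    gridsize=40,
    xscale="log", yscale="log",  # hexagons are laid out in log space
    bins="log",                  # color by log of the count so sparse areas stay visible
    mincnt=1,                    # leave empty hexagons blank
    cmap="viridis",
)
fig.colorbar(hb, ax=ax, label="points per hexagon (log color scale)")
ax.set_xlabel("size_sqft")
ax.set_ylabel("price")
ax.set_title("Hexbin (log-log)")
fig.tight_layout()
```

Both variables must be strictly positive for this to work; filter or offset zeros before plotting.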

Example 3: Numeric vs categorical (delivery time by region)

plt.figure(figsize=(9,4))
sns.violinplot(data=df, x="region", y="delivery_days", inner=None)
sns.boxplot(data=df, x="region", y="delivery_days", width=0.15, color="white")
sns.stripplot(data=df, x="region", y="delivery_days", color="black", alpha=0.3, size=2)
plt.title("Delivery days by region")

Interpretation: the South region has a wider spread and a higher median.
Action: Investigate carriers in South; consider segmenting models by region.
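
A per-group summary table is a useful companion to the violin plot, because it states the sample size behind each shape. The sketch below builds a synthetic deliveries table (region names, sizes, and gamma-distributed delays are assumptions) and reports n, median, and IQR per region.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the deliveries data.
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "region": np.repeat(["North", "South", "West"], [400, 150, 300]),
    "delivery_days": np.concatenate([
        rng.gamma(shape=4.0, scale=0.8, size=400),
        rng.gamma(shape=4.0, scale=1.5, size=150),
        rng.gamma(shape=4.0, scale=0.9, size=300),
    ]),
})

def iqr(s):
    """Interquartile range: a spread measure that is robust to outliers."""
    return s.quantile(0.75) - s.quantile(0.25)

summary = df.groupby("region")["delivery_days"].agg(n="count", median="median", iqr=iqr)
print(summary.round(2))
```

If a region with a striking shape also has a small `n`, treat the difference as a lead to investigate rather than a finding.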

Practical EDA workflow you can reuse

Load and preview data: types, missingness, quick describe().
Plot univariate distributions for key numeric and categorical variables.
Check relationships: target vs features; feature vs feature for leakage.
Segment by important categories (hue or facet) to reveal hidden patterns.
Adjust scales and reduce clutter (log axes, alpha, bin widths); annotate observations.
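
The first two workflow steps can be wrapped in a small reusable function that also saves figures. This is a minimal sketch, not a full EDA tool: the function name `quick_eda`, the report keys, and the file naming scheme are all assumptions.

```python
import tempfile
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

def quick_eda(df, out_dir, bins=30):
    """Summarize types and missingness, then save one histogram per numeric column."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    report = {
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing_share": df.isna().mean().to_dict(),
        "describe": df.describe(),
        "figures": [],
    }
    for col in df.select_dtypes(include="number").columns:
        fig, ax = plt.subplots(figsize=(5, 3))
        ax.hist(df[col].dropna(), bins=bins)
        ax.set_title(f"Histogram: {col}")
        ax.set_xlabel(col)
        ax.set_ylabel("count")
        path = out / f"hist_{col}.png"
        fig.savefig(path, dpi=100)
        plt.close(fig)  # free memory when looping over many columns
        report["figures"].append(path)
    return report

# Tiny made-up frame to show the call; replace with your own data.
demo = pd.DataFrame({
    "order_value": [10.0, 12.5, None, 300.0],
    "region": ["N", "S", "S", "N"],
})
report = quick_eda(demo, tempfile.mkdtemp())
print(report["missing_share"])
print(report["figures"])
```

Saving figures to a folder keeps the run repeatable: rerun the function after each cleaning step and compare the images side by side.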

Checklist: is your EDA decision-ready?

Have you validated data ranges and units?
Did you examine tails (ECDF or log scale)?
Did you mitigate overplotting (alpha, hexbin)?
Did you test key segments (region, plan, cohort)?
Are main takeaways written as 1–3 bullet points per plot?

Exercises (hands-on)

Do these now, then compare your work with the reference solutions.

Exercise 1: Retail orders EDA

Dataset columns: order_id, customer_id, order_value, items_count, is_expedited, region, order_date.

Plot the distribution of order_value. Identify skew and suggest a transformation if needed.
Visualize the relationship between items_count and order_value. Mitigate overplotting if necessary.
Compare order_value across regions using an appropriate plot.

Hints

Use ECDF to understand tails quickly.
Try alpha on scatter; switch to hexbin if needed.
Use box/violin for region comparisons; add strip for sample context.

Reference solution (Exercise 1)

import seaborn as sns, matplotlib.pyplot as plt
# df: retail orders DataFrame with the columns listed above
# 1) Distribution
sns.histplot(df["order_value"], bins=30)
plt.figure(); sns.ecdfplot(df["order_value"])  # check tails
# If heavy right skew, bin in log space (positive values only)
plt.figure(); sns.histplot(df.loc[df["order_value"] > 0, "order_value"], bins=30, log_scale=True)
# 2) Relationship (alpha reduces overplotting; switch to hexbin for very large n)
plt.figure(figsize=(6,4))
sns.scatterplot(data=df, x="items_count", y="order_value", alpha=0.4)
# 3) Across regions (narrow white box inside each violin shows median and quartiles)
plt.figure(figsize=(8,4))
sns.violinplot(data=df, x="region", y="order_value", inner=None)
sns.boxplot(data=df, x="region", y="order_value", width=0.15, color="white")

Exercise 2: SaaS metrics EDA

Dataset columns: account_id, monthly_active_users, sessions_per_user, plan_tier, churned, revenue_mrr.

Plot revenue_mrr distribution and compare by plan_tier.
Check relationship: monthly_active_users vs revenue_mrr; handle outliers and scale issues.
Segment the scatter by churned to see if patterns differ.

Hints

Use log scale if values span orders of magnitude.
Facet or color by plan_tier and churned.
Consider a LOWESS smoother for nonlinear trends.

Reference solution (Exercise 2)

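# df: SaaS accounts DataFrame with the columns listed above
# Distribution by plan
sns.violinplot(data=df, x="plan_tier", y="revenue_mrr", inner=None)
sns.boxplot(data=df, x="plan_tier", y="revenue_mrr", width=0.15, color="white")
# Scatter with log scales (zero or negative values cannot be shown on a log axis)
plt.figure(figsize=(6,4))
sns.scatterplot(data=df, x="monthly_active_users", y="revenue_mrr", alpha=0.3, hue="churned")
plt.xscale('log'); plt.yscale('log')
# Optional smoother: lmplot creates its own figure; lowess=True requires statsmodels
g = sns.lmplot(data=df, x="monthly_active_users", y="revenue_mrr", hue="churned", lowess=True, scatter_kws={"alpha": 0.3})
g.set(xscale="log", yscale="log")  # match the scatter's axes; the fit itself is on raw values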

Checklist to self-grade:
- You chose plots that match the variable types.
- You addressed skew/scale and overplotting.
- You wrote 1–2 takeaways for each plot.

Common mistakes and how to self-check

Too-wide or too-narrow histogram bins. Fix: tune bins until shape is readable.
Overplotting hides structure. Fix: alpha/jitter or switch to hexbin/2D density.
Comparing raw counts across unequal groups. Fix: use proportions, box/violin, or ECDF.
Ignoring scale/skew. Fix: try log axes; compare both raw and transformed.
Trend line misuse. Fix: use LOWESS for nonlinear; don’t force linear fits.
Bar charts not starting at zero. Fix: start the value axis at zero so bar lengths are comparable.
Cherry-picking views. Fix: show at least one univariate and one bivariate view per key variable.

Self-check mini list

Did you inspect tails with ECDF or log?
Did you assess group differences without biasing by sample size?
Are annotations clear and axes labeled with units?

Practical projects

Product A/B results audit: visualize metric distributions pre/post, use ECDF for uplift tails, and segment by cohort.
Credit risk profile: compare income and utilization distributions by default status; use hexbin for joint patterns.
Sensor reliability: plot time-to-failure distributions by device model; investigate temperature vs failure with LOWESS.

Quick test

Take the quick test to check your mastery of this material.

How to use the test effectively

Answer without notes first to simulate real decision-making.
Review any missed items, then redo related exercises.
Re-take after a day to confirm retention.

Next steps

Practice on a dataset you know well; summarize 3–5 insights with plots.
Create a reusable notebook template for univariate, bivariate, and segmented EDA.
Move on to multivariate and time-based visualizations; integrate results into dashboards.

Mini challenge

Pick one numeric target and three candidate features from any dataset. In one notebook page: (1) show the target distribution (raw and transformed if needed), (2) show each feature’s relationship to the target (choose the right plot), (3) write 5 bullet insights. Keep it under 15 minutes.
