Why Exploratory Analysis matters for Data Analysts
Exploratory Data Analysis (EDA) is how Data Analysts quickly understand new data, uncover patterns, and spot issues before modeling or decision-making. Strong EDA lets you: find data quality problems early, summarize performance clearly, segment users to reveal differences, identify relationships for feature ideas, and generate testable hypotheses. In a business setting, EDA turns raw tables into insights for product, marketing, and operations.
Who this is for
- Aspiring and working Data Analysts who want a reliable, repeatable approach to understanding datasets fast.
- Professionals transitioning from business roles who need to communicate data findings clearly.
- Students preparing for analytics interviews and case challenges.
Prerequisites
- Basic SQL (SELECT, WHERE, GROUP BY, ORDER BY).
- Comfort with spreadsheets or Python (pandas) for summaries and charts.
- Basic statistics: mean, median, percentiles, variance, correlation.
Quick refresher: core stats you’ll use
- Center: mean, median
- Spread: standard deviation, IQR
- Shape: skew, outliers
- Relationships: correlation (Pearson/Spearman); see the sketch below
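A minimal pandas sketch of these summaries, using a small illustrative DataFrame (column names and values are hypothetical):
import pandas as pd

# Hypothetical metric values (illustrative only).
df = pd.DataFrame({"order_value": [12, 18, 22, 25, 31, 40, 55, 250],
                   "sessions": [1, 2, 2, 3, 3, 4, 6, 9]})

s = df["order_value"]
print("mean:", s.mean(), "median:", s.median())                                 # center
print("std:", round(s.std(), 2), "IQR:", s.quantile(0.75) - s.quantile(0.25))   # spread
print("skew:", round(s.skew(), 2))                                              # shape (positive = right tail)
print("Pearson:", round(s.corr(df["sessions"]), 2),
      "Spearman:", round(s.corr(df["sessions"], method="spearman"), 2))         # relationships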
Learning path
- Set goals and questions: define who/what/when/why before touching data.
- Audit the data: row counts, date ranges, missing values, duplicates, key uniqueness (see the audit sketch after this list).
- Profile distributions: numeric and categorical summaries; identify skew and outliers.
- Segment and compare: by channel, cohort, plan, region, device.
- Explore relationships: correlations, pairwise plots, simple aggregations.
- Behavioral flows: funnels and cohorts to understand conversion and retention.
- Generate hypotheses: write crisp, testable statements with expected direction.
- Report findings: concise narrative, visuals, assumptions, and next actions.
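A minimal pandas sketch of the audit step, assuming an orders DataFrame with order_id, user_id, order_date, and order_value columns (names and values are illustrative):
import pandas as pd

# Hypothetical orders extract (illustrative values, including a duplicate row and missing values).
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "user_id": [10, 11, 11, 12],
    "order_date": pd.to_datetime(["2024-01-05", "2024-01-09", "2024-01-09", "2024-02-01"]),
    "order_value": [25.0, None, None, 120.0],
})

print("rows:", len(orders))
print("date range:", orders["order_date"].min(), "to", orders["order_date"].max())
print("missing per column:\n", orders.isna().sum())
print("duplicate rows:", orders.duplicated().sum())
# Key uniqueness: order_id should identify exactly one row.
print("order_id unique:", orders["order_id"].is_unique)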
Milestone checklist
- I can compute robust summaries (median, IQR) and explain skew.
- I can segment metrics and interpret differences with context.
- I can spot and handle outliers without losing signal.
- I can explain correlation vs. causation and avoid traps.
- I can build a clear EDA report with recommendations.
Worked examples
1) Distribution profiling (SQL)
Goal: Understand order_value distribution and spot outliers.
-- Bucket order values into ranges to see shape quickly
WITH b AS (
SELECT
CASE
WHEN order_value < 10 THEN '0-10'
WHEN order_value < 50 THEN '10-50'
WHEN order_value < 100 THEN '50-100'
ELSE '100+'
END AS bucket
FROM orders
WHERE order_date >= DATE '2024-01-01'
)
SELECT bucket, COUNT(*) AS n, ROUND(100.0*COUNT(*)/SUM(COUNT(*)) OVER(), 1) AS pct
FROM b
GROUP BY bucket
ORDER BY CASE bucket WHEN '0-10' THEN 1 WHEN '10-50' THEN 2 WHEN '50-100' THEN 3 ELSE 4 END;
Interpretation tips: A large share in the 100+ bucket suggests a heavy right tail. Consider a log transform or robust stats (median, IQR).
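A quick pandas illustration of that advice, using hypothetical heavy-tailed order values: the mean chases the tail while the median barely moves, and a log1p transform makes the shape easier to inspect.
import numpy as np
import pandas as pd

# Hypothetical heavy-tailed order values (illustrative only).
order_value = pd.Series([8, 12, 15, 22, 30, 45, 60, 90, 800, 2500])

print("mean:", order_value.mean(), "median:", order_value.median())
print(np.log1p(order_value).describe().round(2))  # log1p compresses the long right tail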
2) Summary table with segments (SQL)
Goal: Compare metrics by acquisition_channel.
SELECT acquisition_channel,
COUNT(*) AS users,
AVG(first_order_value) AS avg_first_order,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY first_order_value) AS median_first_order
FROM users
GROUP BY acquisition_channel
ORDER BY users DESC;
If your SQL dialect lacks PERCENTILE_CONT, compute median in Python after extracting grouped data.
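A minimal pandas fallback for that case, assuming the extracted rows sit in a DataFrame users_df with acquisition_channel and first_order_value columns (the column names match the SQL above; the sample values are illustrative):
import pandas as pd

# Hypothetical extract of the users table (illustrative values).
users_df = pd.DataFrame({
    "acquisition_channel": ["ads", "ads", "organic", "organic", "referral"],
    "first_order_value": [20.0, 60.0, 35.0, 45.0, 80.0],
})

summary = (users_df
    .groupby("acquisition_channel")["first_order_value"]
    .agg(users="count", avg_first_order="mean", median_first_order="median")
    .sort_values("users", ascending=False))
print(summary)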
3) Correlation matrix and pair check (Python/pandas)
import pandas as pd
# df includes numerical columns: sessions, orders, revenue, days_active
corr = df[["sessions","orders","revenue","days_active"]].corr(method="pearson")
print(corr.round(2))
# Quick pair check to avoid spurious correlation due to size
size = df["sessions"]
# Guard against division by zero: treat sessions == 0 as missing, then fill the rate with 0.
rate = (df["orders"] / df["sessions"].where(df["sessions"] > 0)).fillna(0)
print("corr(sessions, orders)=", round(size.corr(df["orders"]), 2))
print("corr(sessions, conversion_rate)=", round(size.corr(rate), 2))
Interpretation: A high sessions-orders correlation may just reflect scale. Check rate metrics too.
4) Outlier detection with IQR (Python/pandas)
import numpy as np
q1 = df["order_value"].quantile(0.25)
q3 = df["order_value"].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5*iqr
upper = q3 + 1.5*iqr
outliers = df[(df["order_value"] < lower) | (df["order_value"] > upper)]
print(len(outliers), "outliers")
Action: Investigate top outliers for data errors vs. legitimate big orders before trimming or winsorizing.
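If the extreme values turn out to be legitimate, one option is to cap rather than drop them. A minimal sketch continuing from the df, lower, and upper variables above:
# Winsorize: cap order_value at the IQR fences instead of removing rows.
df["order_value_capped"] = df["order_value"].clip(lower=lower, upper=upper)
print(df[["order_value", "order_value_capped"]].describe().round(2))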
5) Funnel step conversion (SQL)
Goal: Compute step-through rates for signup → verify → purchase within 14 days.
WITH s AS (
SELECT user_id, MIN(event_time) AS signup_ts
FROM events WHERE event = 'signup'
GROUP BY user_id
), v AS (
SELECT user_id, MIN(event_time) AS verify_ts
FROM events WHERE event = 'verify_email'
GROUP BY user_id
), p AS (
SELECT user_id, MIN(event_time) AS purchase_ts
FROM events WHERE event = 'purchase'
GROUP BY user_id
)
SELECT
COUNT(s.user_id) AS signups,
COUNT(v.user_id) AS verified,
COUNT(p.user_id) AS purchasers,
ROUND(COUNT(v.user_id)*1.0/COUNT(s.user_id), 3) AS verify_rate,
ROUND(COUNT(p.user_id)*1.0/COUNT(v.user_id), 3) AS purchase_given_verify
FROM s
LEFT JOIN v ON v.user_id = s.user_id AND v.verify_ts <= s.signup_ts + INTERVAL '14 day'
LEFT JOIN p ON p.user_id = s.user_id AND p.purchase_ts <= s.signup_ts + INTERVAL '14 day';
Note: Use time windows to keep comparisons fair.
6) Cohort retention (Python/pandas)
import pandas as pd

# df has columns: user_id, first_month (YYYY-MM), activity_month (YYYY-MM)
pivot = (df
    .assign(cohort_age=lambda d:
        (pd.PeriodIndex(d["activity_month"], freq="M") -
         pd.PeriodIndex(d["first_month"], freq="M"))
        .map(lambda offset: offset.n))  # whole months between first and activity month
    .groupby(["first_month", "cohort_age"])["user_id"]
    .nunique()
    .unstack(fill_value=0))
sizes = df.groupby('first_month')['user_id'].nunique()
retention = (pivot.div(sizes, axis=0).round(3))
print(retention.head())
Interpretation: Compare early-month retention across cohorts to see if quality is improving.
Skill drills
- Compute mean, median, and IQR for three key metrics; explain differences in 2 sentences.
- Segment your main KPI by two categorical features and highlight the top/bottom 2 segments with reasons.
- Identify 5 largest outliers; label each as “data error”, “edge case”, or “valid”.
- Produce a 3x3 correlation matrix; flag any potential confounding.
- Build a 3-step funnel and find the biggest drop; propose two experiments.
- Draft three exploratory hypotheses and list what data would invalidate each.
Common mistakes
- Jumping to conclusions from correlation: correlation indicates association, not causation. Check segments and time.
- Over-trimming outliers: you might remove real, high-value behavior. Investigate first.
- Ignoring data ranges and time windows: mixing different exposure times biases rates.
- Leaky segments: defining segments using future information inflates performance.
- Cherry-picking: report the full picture (what improved and what didn’t).
Debugging tips
- Recalculate metrics with a tiny sample by hand to verify logic.
- Cross-check SQL vs. Python results on the same subset.
- Use COUNT(DISTINCT ...) carefully; confirm key uniqueness first.
- Plot or bucket numeric fields; shape often reveals data issues quickly.
Mini project: New-user activation EDA
Scenario: A product team wants to improve new-user activation within 14 days.
- Define success: activated if user completes event X within 14 days (see the sketch after this list).
- Audit data: date ranges, event uniqueness, missing users.
- Distributions: time-to-activation, session counts, order value (if applicable).
- Segments: acquisition channel, device, plan, region.
- Relationships: correlation of early actions with activation.
- Funnel: signup → verify → first key action → activation.
- Cohorts: by signup month; compare activation by cohort age.
- Hypotheses: write 3 with expected direction and test ideas.
- Report: 6–10 slides: context, method, 3–5 key findings, risks, next steps.
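A minimal pandas sketch of the success definition from the first step, assuming a signups DataFrame (user_id, signup_ts) and an events DataFrame for the key event X (user_id, event_ts); the names and sample values are hypothetical:
import pandas as pd

# Hypothetical signup and key-event timestamps (illustrative values).
signups = pd.DataFrame({"user_id": [1, 2, 3],
                        "signup_ts": pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-10"])})
event_x = pd.DataFrame({"user_id": [1, 2],
                        "event_ts": pd.to_datetime(["2024-01-05", "2024-02-20"])})

# Earliest event X per user, joined onto signups.
first_x = event_x.groupby("user_id", as_index=False)["event_ts"].min()
merged = signups.merge(first_x, on="user_id", how="left")

# Activated = completed event X within 0-14 days of signup; users with no event stay False.
merged["activated"] = (merged["event_ts"] - merged["signup_ts"]).between(pd.Timedelta(0), pd.Timedelta(days=14))
print("activation rate:", round(merged["activated"].mean(), 3))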
Acceptance criteria
- All metrics have clear definitions and time windows.
- At least one outlier investigation included.
- Findings are actionable (suggested experiments or product changes).
- Assumptions and data limitations are stated.
Practical projects
- Pricing EDA: Analyze distribution of discounts vs. conversion; recommend a guardrail.
- Churn signals: Explore last-30-day behaviors correlated with 60-day inactivity; propose retention nudges.
- Marketing mix snapshot: Segment CAC and LTV by channel and region; identify underperforming combinations.
Subskills
- Basic Distributions
- Summary Tables
- Segment Analysis
- Correlation Exploration
- Univariate Analysis
- Bivariate Analysis
- Multivariate Analysis
- Feature Relationships
- Anomaly Spotting
- Cohort Exploration
- Funnel Exploration
- Exploratory Hypothesis Generation
- EDA Reporting
Next steps
- Re-run EDA monthly on a stable KPI to build intuition for seasonality and anomalies.
- Create a reusable EDA notebook/template for your team.
- Translate each finding into a decision or experiment proposal.
Reusable EDA template outline
- Context & questions
- Data audit
- Distributions & segments
- Relationships
- Funnels & cohorts
- Hypotheses
- Takeaways & actions
Skill exam
Test your understanding with a short exam. Everyone can take it for free; logged-in users get saved progress and a record of attempts.