Why this matters
Univariate analysis is the fastest way to understand a single column: its typical values, spread, shape, and issues. Data Analysts use it to catch data quality problems early and to pick the right transformations before modeling or reporting.
- Real tasks it powers: sanity-check new datasets, define bins for dashboards, choose outlier rules, set thresholds for alerts, and decide how to impute missing values.
- Outcome: cleaner data, clearer visuals, and fewer surprises later in analysis.
Example day-to-day decisions
- Marketing: Is cost-per-click too skewed for an average? Use median instead.
- Product: Are daily active users stable? Check variability and outliers.
- Ops: Which error code occurs most often? Use counts and proportions.
Concept explained simply
Univariate analysis looks at one variable at a time. You summarize, visualize, and judge quality for that one column.
- Numeric (continuous/discrete): mean/median/mode, min–max, range, variance/std, IQR, percentiles, skewness; visuals: histogram, boxplot.
- Categorical: counts, proportions, mode, unique count (cardinality); visuals: bar chart.
- Date/time: counts by period (day/week/month), min/max date, gaps; visuals: line or bar by time bucket.
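As a quick sketch, the date/time summaries above might look like this in pandas (the event dates are invented for illustration):

```python
# Sketch of date/time univariate summaries: coverage window, counts per
# month, and gap detection. The dates are made up for illustration.
import pandas as pd

events = pd.Series(pd.to_datetime([
    "2024-01-03", "2024-01-15", "2024-02-02", "2024-02-20", "2024-04-05",
]))

window = (events.min(), events.max())                       # coverage window
per_month = events.dt.to_period("M").value_counts().sort_index()
expected = pd.period_range(window[0], window[1], freq="M")
gaps = expected.difference(per_month.index)                 # months with no events
# March 2024 has no events -> a gap worth flagging
```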
Mental model
Imagine a bright flashlight on a single column. Your job: identify its type, clean it, summarize it, visualize it, interpret it, then decide the next action.
- Type → Clean → Summarize → Visualize → Interpret → Decide
Standard workflow you can reuse
- Identify variable type: numeric, categorical, date/time, or ID-like.
- Quality check: missing %, invalid values (e.g., negatives for ages), duplicates (if ID), mixed types.
- Summaries:
- Numeric: count, missing %, mean, median, std, IQR, p1/p5/p95/p99, min/max, skew.
- Categorical: unique count, top categories, proportions, rare-category tail.
- Date/time: coverage window, gaps, counts by period.
- Simple visuals (mentally or with tools): histogram or bar chart; boxplot for outliers.
- Decide actions: impute, drop, winsorize, transform (log/sqrt), bin, group rare categories, or mark as ID.
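One way to sketch the numeric branch of this workflow, assuming pandas is available (the function name and rounding choices are illustrative, not from any specific library):

```python
# A minimal sketch of the numeric summaries listed above:
# count, missing %, centre, spread, tails, skew.
import pandas as pd

def numeric_summary(s: pd.Series) -> dict:
    """Summaries for one numeric column; missing values are excluded
    from the statistics but reported as a percentage."""
    clean = s.dropna()
    return {
        "count": len(s),
        "missing_pct": round(s.isna().mean() * 100, 1),
        "mean": round(clean.mean(), 2),
        "median": clean.median(),
        "std": round(clean.std(), 2),   # sample std (ddof=1)
        "p1": clean.quantile(0.01),
        "p99": clean.quantile(0.99),
        "min": clean.min(),
        "max": clean.max(),
        "skew": round(clean.skew(), 2),
    }

ages = pd.Series([18, 19, 20, 22, 23, 23, 24, 24, 24, 25, 26, 27, 28, 41])
summary = numeric_summary(ages)
```

The same function can be applied column by column to audit a whole dataset.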
Quick variable-type check
- All unique long strings with no business meaning → likely ID (don’t compute mean).
- Counts, money, time durations → numeric (often right-skewed; consider median/log).
- Country, device, plan → categorical; focus on counts and tail.
- Timestamps → derive period features for summaries.
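These heuristics could be sketched roughly as follows; the 95% uniqueness cut-off for "ID-like" is an illustrative assumption, not a standard:

```python
# Rough sketch of the variable-type heuristics above.
import pandas as pd

def guess_type(s: pd.Series) -> str:
    if pd.api.types.is_datetime64_any_dtype(s):
        return "datetime"
    # nearly all values unique -> likely an identifier; don't compute means
    if s.nunique() >= 0.95 * s.notna().sum():
        return "id-like"
    if pd.api.types.is_numeric_dtype(s):
        return "numeric"
    return "categorical"
```

Treat the result as a starting point and confirm against business meaning.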
Worked examples
Example 1: Numeric (Customer Age)
Sample ages: 18, 19, 20, 22, 23, 23, 24, 24, 24, 25, 26, 27, 28, 41
- Median: sort and take middle → median = 24
- IQR: Q1 ≈ 22, Q3 ≈ 26 → IQR = 4
- Outlier bounds: [Q1 − 1.5×IQR, Q3 + 1.5×IQR] = [16, 32]
- 41 is above 32 → potential outlier. Decide: verify first (an age of 41 is plausible, so keep it); winsorize later only if it distorts a model.
More stats
- Mean ≈ 24.6 (pulled up slightly by the 41)
- Std dev ≈ 5.5 (sample)
- Skew: slightly right due to 41
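The Example 1 numbers can be checked in plain Python. Note that Q1/Q3 here are computed as medians of the lower and upper halves (the convention used above); other quartile methods give slightly different values:

```python
# Checking Example 1 by hand: median, quartiles, and the 1.5*IQR rule.
import statistics

ages = [18, 19, 20, 22, 23, 23, 24, 24, 24, 25, 26, 27, 28, 41]

def half_quartiles(values):
    """Q1/Q3 as medians of the lower/upper halves; for odd n the overall
    median is excluded from both halves."""
    s = sorted(values)
    half = len(s) // 2
    lower = s[:half]
    upper = s[half:] if len(s) % 2 == 0 else s[half + 1:]
    return statistics.median(lower), statistics.median(upper)

q1, q3 = half_quartiles(ages)
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in ages if x < lo or x > hi]
print(q1, q3, iqr, (lo, hi), outliers)  # 22 26 4 (16.0, 32.0) [41]
```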
Example 2: Categorical (Device Type)
Counts: Mobile=610, Desktop=290, Tablet=80, Other=20 (total=1000)
- Proportions: 61%, 29%, 8%, 2%
- Decision: Group categories < 3% into "Other" to declutter charts. Keep original raw values in data.
- Mode: Mobile
Quality checks
- Unique count = 4 → manageable.
- Check spelling consistency (e.g., "mobile" vs "Mobile").
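A minimal sketch of the proportion and grouping steps from Example 2 (the 3% threshold is the judgement call made above, not a fixed rule):

```python
# Proportions, mode, and rare-category grouping for Device Type.
counts = {"Mobile": 610, "Desktop": 290, "Tablet": 80, "Other": 20}
total = sum(counts.values())
props = {k: v / total for k, v in counts.items()}   # Mobile 0.61, ...

threshold = 0.03                                    # group categories below 3%
grouped = {}
for cat, p in props.items():
    key = cat if p >= threshold else "Other"
    grouped[key] = grouped.get(key, 0) + p
```

Grouping is for display only; keep the raw category values in the data, as noted above.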
Example 3: Numeric with skew (Order Value in $)
Values (n=10): 8, 10, 11, 12, 12, 13, 14, 16, 90, 120
- Mean = 30.6; Median = 12.5 → heavy right skew.
- IQR (Q1≈11, Q3≈16) → IQR=5; upper bound= Q3+1.5×IQR = 23.5 → 90 and 120 are outliers.
- Action options:
- Report typical value as median (12.5) with IQR.
- Use log transform for modeling: log1p(x) to stabilize variance.
- Winsorize at p99 for robust aggregates if business-approved.
Log intuition
Log compresses large values so 90 and 120 don’t dominate variance, improving model stability and visual interpretability.
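A quick way to see the compression, using the Example 3 order values:

```python
# log1p shrinks the spread between small and large values, so the
# two big orders no longer dominate.
import math

values = [8, 10, 11, 12, 12, 13, 14, 16, 90, 120]
logged = [math.log1p(x) for x in values]

ratio_raw = max(values) / min(values)   # 15.0: the largest is 15x the smallest
ratio_log = max(logged) / min(logged)   # ~2.2 after the transform
```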
Hands-on exercises
Do these quick tasks. Compare with solutions below. Then take the quick test at the end.
Exercise 1 (ex1): Numeric — summarize and flag outliers
Data (Revenue $): 12, 13, 15, 16, 16, 18, 19, 20, 22, 25, 100, null, 14
- Tasks:
- Compute count (incl. missing), missing %, mean, median, Q1, Q3, IQR, min, max.
- Flag outliers using 1.5×IQR rule.
- Recommend: median/IQR or mean/std for reporting? Briefly justify.
Show solution for ex1
Count=13, Missing=1 → missing%=7.7%. Numeric count (non-missing)=12. Sorted (non-missing): 12,13,14,15,16,16,18,19,20,22,25,100.
- Median = (16+18)/2 = 17
- Q1 = 14.5 (median of the lower half, between 14 and 15), Q3 = 21 (median of the upper half, between 20 and 22) → IQR = 6.5
- Bounds: lower = Q1−1.5×IQR ≈ 14.5−9.75=4.75; upper ≈ 21+9.75=30.75 → 100 is an outlier.
- Mean ≈ (12+13+14+15+16+16+18+19+20+22+25+100)/12 = 24.2
- Min=12, Max=100
- Recommendation: Use median (17) and IQR (≈6.5) due to strong right-skew and outlier.
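A sketch that re-checks this solution, using NaN to stand in for the null and half-medians for Q1/Q3 as in the hand calculation:

```python
# Reproducing the ex1 solution with pandas; NaN represents the null.
import statistics
import numpy as np
import pandas as pd

rev = pd.Series([12, 13, 15, 16, 16, 18, 19, 20, 22, 25, 100, np.nan, 14])
missing_pct = round(rev.isna().mean() * 100, 1)     # 7.7
clean = sorted(rev.dropna())                        # 12 non-missing values

half = len(clean) // 2
q1 = statistics.median(clean[:half])                # 14.5
q3 = statistics.median(clean[half:])                # 21.0
iqr = q3 - q1                                       # 6.5
upper = q3 + 1.5 * iqr                              # 30.75
outliers = [x for x in clean if x > upper]          # [100.0]
```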
Exercise 2 (ex2): Categorical — proportions and grouping
Data (Signup Source counts, total 1,200): SEO=480, Paid=300, Direct=210, Referral=90, Social=60, Other=60
- Tasks:
- Compute proportions (%) for each category.
- Apply a 3% threshold to group rare categories into "Other". Which categories get grouped?
- Report the new distribution after grouping.
Show solution for ex2
- Proportions: SEO 40%, Paid 25%, Direct 17.5%, Referral 7.5%, Social 5%, Other 5%.
- 3% threshold = 36 users. No category falls at or below 3%, so nothing changes under the 3% rule.
- With a 5% threshold instead (grouping categories at or below 5%), Social and Other would be combined; the new "Other" ≈ 10%.
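This solution can be re-checked with a small helper; it reads the threshold as "at or below", which matches the 5% outcome described above:

```python
# Proportions and threshold-based grouping for Signup Source.
counts = {"SEO": 480, "Paid": 300, "Direct": 210,
          "Referral": 90, "Social": 60, "Other": 60}
total = sum(counts.values())                       # 1200
pct = {k: 100 * v / total for k, v in counts.items()}

def group_rare(pct, threshold):
    """Fold categories at or below the threshold (in %) into 'Other'."""
    out = {}
    for cat, p in pct.items():
        key = cat if p > threshold else "Other"
        out[key] = out.get(key, 0) + p
    return out

after_3 = group_rare(pct, 3)   # unchanged: every category is above 3%
after_5 = group_rare(pct, 5)   # Social and Other fold together -> Other = 10%
```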
Checklist — fast univariate QA
- Variable type identified correctly
- Missing % computed; plan to handle NA
- For numeric: median, IQR, p1/p99 checked; outliers flagged
- For categorical: top categories listed; tail reviewed
- Date range and gaps assessed (if applicable)
- Transform or binning decision recorded
- Business sanity check performed
Common mistakes and how to self-check
- Using mean/std for skewed data → Check mean vs median; if far apart, prefer median/IQR.
- Treating IDs as numeric → If unique count ≈ row count, don’t compute numeric stats.
- Ignoring missingness → Always compute missing % and decide: drop, impute, or keep as signal.
- Over-pruning categories → Keep business-critical categories even if rare.
- One-size-fits-all outlier rules → Validate with domain context (e.g., $0 sales days on holidays).
Self-check prompts
- Is your chosen summary robust to outliers?
- Did you verify data units and valid ranges?
- Did you capture decisions (imputation, grouping) in notes/code for reproducibility?
Practical projects
- E-commerce single-column audit: For 10 key columns (price, quantity, discount, device, country, date), produce a one-pager per column with summaries, plots (if available), and decisions.
- Churn dataset triage: Evaluate each variable univariately to decide: keep, transform, or drop. Document rationale.
- Automated univariate report: Build a script/notebook that outputs summaries (including missing %, percentiles, top categories) for any dataset.
Who this is for, prerequisites, learning path
Who this is for
- Data Analysts, Product Analysts, and anyone preparing datasets for reporting or modeling.
Prerequisites
- Basic statistics: mean, median, percentiles
- Comfort with spreadsheets or a scripting language (Python/R) is helpful but not required
Learning path
- Start: Univariate analysis (this page)
- Next: Bivariate analysis (relationships)
- Then: Multivariate exploration and feature engineering
Mini challenge
You receive a new column: session_duration_sec with 15% missing, min=0, p50=45, p95=900, max=7200. In 3 bullets, state your reporting metric, outlier handling, and any transform. Keep it crisp.
Next steps
- Apply this workflow to two columns in your current dataset: one numeric and one categorical.
- Document one transformation decision and its reason.
- Proceed to bivariate analysis once you can complete the checklist in under 5 minutes per column.
Quick test
Anyone can take the test. If you are logged in, your progress will be saved.