Univariate Analysis

Learn univariate analysis for free, with explanations, exercises, and a quick test for Data Analysts.

Published: December 19, 2025 | Updated: December 19, 2025

Why this matters

Univariate analysis is the fastest way to understand a single column: its typical values, spread, shape, and issues. Data Analysts use it to catch data quality problems early and to pick the right transformations before modeling or reporting.

  • Real tasks it powers: sanity-check new datasets, define bins for dashboards, choose outlier rules, set thresholds for alerts, and decide how to impute missing values.
  • Outcome: cleaner data, clearer visuals, and fewer surprises later in analysis.
Example day-to-day decisions
  • Marketing: Is cost-per-click too skewed for an average? Use median instead.
  • Product: Are daily active users stable? Check variability and outliers.
  • Ops: Which error code occurs most often? Use counts and proportions.

Concept explained simply

Univariate analysis looks at one variable at a time. You summarize, visualize, and judge quality for that one column.

  • Numeric (continuous/discrete): mean/median/mode, min–max, range, variance/std, IQR, percentiles, skewness; visuals: histogram, boxplot.
  • Categorical: counts, proportions, mode, unique count (cardinality); visuals: bar chart.
  • Date/time: counts by period (day/week/month), min/max date, gaps; visuals: line or bar by time bucket.
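For the date/time case, a minimal sketch of the "coverage window and gaps" check could look like the following. The function name `daily_coverage` and the sample dates are made up for illustration, and it assumes one calendar day is the right bucket size.

```python
from collections import Counter
from datetime import date, timedelta

def daily_coverage(dates):
    """Return first day, last day, per-day counts, and days with no rows.

    Hypothetical helper: assumes `dates` is a non-empty list of datetime.date.
    """
    counts = Counter(dates)
    first, last = min(counts), max(counts)
    n_days = (last - first).days + 1
    all_days = [first + timedelta(days=i) for i in range(n_days)]
    missing_days = [d for d in all_days if d not in counts]
    return first, last, counts, missing_days

# Made-up sample: two events on Jan 1, one on Jan 2, one on Jan 5.
events = [date(2025, 1, 1), date(2025, 1, 1), date(2025, 1, 2), date(2025, 1, 5)]
first, last, counts, missing_days = daily_coverage(events)

print("coverage:", first, "to", last)
print("gaps:", missing_days)
```

For weekly or monthly buckets the same idea applies after mapping each date to its period start.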

Mental model

Imagine a bright flashlight on a single column. Your job: identify its type, clean it, summarize it, visualize it, interpret it, then decide the next action.

  • Type → Clean → Summarize → Visualize → Interpret → Decide

Standard workflow you can reuse

  1. Identify variable type: numeric, categorical, date/time, or ID-like.
  2. Quality check: missing %, invalid values (e.g., negative ages), duplicates (if ID), mixed types.
  3. Summaries:
    • Numeric: count, missing %, mean, median, std, IQR, p1/p5/p95/p99, min/max, skew.
    • Categorical: unique count, top categories, proportions, rare-category tail.
    • Date/time: coverage window, gaps, counts by period.
  4. Simple visuals (mentally or with tools): histogram or bar chart; boxplot for outliers.
  5. Decide actions: impute, drop, winsorize, transform (log/sqrt), bin, group rare categories, or mark as ID.
Quick variable-type check
  • All unique long strings with no business meaning → likely ID (don’t compute mean).
  • Counts, money, time durations → numeric (often right-skewed; consider median/log).
  • Country, device, plan → categorical; focus on counts and tail.
  • Timestamps → derive period features for summaries.
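The quick type check above can be roughed out in code. This is only a sketch: `guess_variable_type` and the 0.95 uniqueness cutoff are illustrative assumptions, not a standard rule, and the result should always be confirmed against what the column means in the business.

```python
from datetime import datetime

def guess_variable_type(values, id_unique_ratio=0.95):
    """Rough first guess at a column's type (hypothetical helper).

    Caveat: numeric-looking IDs (e.g., order numbers) still come back as
    "numeric" here, so ID columns need a manual check as well.
    """
    present = [v for v in values if v is not None]
    if not present:
        return "unknown"
    if all(isinstance(v, datetime) for v in present):
        return "datetime"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present):
        return "numeric"
    # Non-numeric values: nearly all unique suggests an identifier.
    unique_ratio = len(set(present)) / len(present)
    if unique_ratio >= id_unique_ratio:
        return "id-like"
    return "categorical"

print(guess_variable_type([12.5, 8, None, 30]))
print(guess_variable_type(["a1f9", "b7c2", "c3d8", "d4e1"]))
print(guess_variable_type(["Mobile", "Desktop", "Mobile", "Mobile"]))
```

On very short columns the uniqueness ratio is unreliable (four distinct country names would look ID-like), which is one reason to treat the output as a prompt for review rather than a decision.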

Worked examples

Example 1: Numeric (Customer Age)

Sample ages: 18, 19, 20, 22, 23, 23, 24, 24, 24, 25, 26, 27, 28, 41

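  • Median: sort and average the two middle values (n = 14, and both are 24) → median = 24
  • IQR: taking quartiles as the medians of the lower and upper halves, Q1 = 22 and Q3 = 26 → IQR = 4. Other quartile conventions (spreadsheet or library defaults) can give slightly different values.
  • Outlier bounds: [Q1 − 1.5×IQR, Q3 + 1.5×IQR] = [16, 32]
  • 41 is above 32 → potential outlier. Verify it first: an age of 41 is plausible, so keep it; winsorize only if a model turns out to be sensitive to it.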
More stats
  • Mean ≈ 24.6
  • Std dev ≈ 5.5 (sample)
  • Skew: right, driven by the single value 41
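To check the numbers in this example, here is a small sketch using only the Python standard library. It assumes quartiles are taken as the medians of the lower and upper halves of the sorted data; the helper names `quartiles_by_halves` and `iqr_fences` are made up for this illustration. Library defaults that interpolate between values would report slightly different quartiles.

```python
from statistics import mean, median, stdev

def quartiles_by_halves(values):
    """Q1 and Q3 as the medians of the lower and upper halves of the sorted data."""
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two values")
    half = len(data) // 2
    lower = data[:half]
    upper = data[len(data) - half:]  # for odd n, the middle value is left out of both halves
    return median(lower), median(upper)

def iqr_fences(values, k=1.5):
    """Lower and upper outlier bounds from the k x IQR rule."""
    q1, q3 = quartiles_by_halves(values)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

ages = [18, 19, 20, 22, 23, 23, 24, 24, 24, 25, 26, 27, 28, 41]

q1, q3 = quartiles_by_halves(ages)
low, high = iqr_fences(ages)
outliers = [a for a in ages if a < low or a > high]

print("median:", median(ages))
print("Q1, Q3:", q1, q3)
print("bounds:", low, high)
print("outliers:", outliers)  # [41]
print("mean:", round(mean(ages), 1), "sample std:", round(stdev(ages), 1))
```

The flag is only a prompt to investigate; whether 41 stays in the data is still a business decision.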

Example 2: Categorical (Device Type)

Counts: Mobile=610, Desktop=290, Tablet=80, Other=20 (total=1000)

  • Proportions: 61%, 29%, 8%, 2%
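  • Decision: group categories below 3% into "Other" to declutter charts. Here only the existing "Other" (2%) falls below the threshold, so the chart keeps all four bars. Keep the original raw values in the data.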
  • Mode: Mobile
Quality checks
  • Unique count = 4 → manageable.
  • Check spelling consistency (e.g., "mobile" vs "Mobile").
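A minimal sketch of the proportion and rare-category grouping steps, using the device counts from this example. The function name `group_rare` is hypothetical, and it assumes "rare" means a share strictly below the threshold.

```python
def group_rare(counts, threshold=0.03, other_label="Other"):
    """Fold categories whose share is strictly below `threshold` into `other_label`.

    Works on a copy of the counts, so the raw values stay untouched.
    """
    total = sum(counts.values())
    grouped = {}
    for category, n in counts.items():
        key = other_label if n / total < threshold else category
        grouped[key] = grouped.get(key, 0) + n
    return grouped

device_counts = {"Mobile": 610, "Desktop": 290, "Tablet": 80, "Other": 20}
total = sum(device_counts.values())
proportions = {k: v / total for k, v in device_counts.items()}

print(proportions)
print(group_rare(device_counts))                  # 3% rule: nothing new is merged
print(group_rare(device_counts, threshold=0.10))  # 10% rule: Tablet joins Other
```

Passing the threshold as a parameter makes it easy to compare a few cutoffs before settling on one for a dashboard.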

Example 3: Numeric with skew (Order Value in $)

Values (n=10): 8, 10, 11, 12, 12, 13, 14, 16, 90, 120

  • Mean = 30.6; Median = 12.5 → heavy right skew.
  • IQR: Q1 ≈ 11, Q3 ≈ 16 → IQR = 5; upper bound = Q3 + 1.5×IQR = 23.5 → 90 and 120 are outliers.
  • Action options:
    • Report typical value as median (12.5) with IQR.
    • Use log transform for modeling: log1p(x) to stabilize variance.
    • Winsorize at p99 for robust aggregates if business-approved.
Log intuition

Log compresses large values so 90 and 120 don’t dominate variance, improving model stability and visual interpretability.
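As a rough illustration of that compression, the sketch below applies `math.log1p` to the order values from Example 3 and compares how far the mean sits from the median before and after. The mean-to-median ratio is just one convenient indicator chosen here, not a formal skewness measure.

```python
import math
from statistics import mean, median

order_values = [8, 10, 11, 12, 12, 13, 14, 16, 90, 120]

# log1p(x) = log(1 + x); it is defined at x = 0, which plain log is not.
logged = [math.log1p(x) for x in order_values]

raw_ratio = mean(order_values) / median(order_values)
log_ratio = mean(logged) / median(logged)

print("raw mean/median:", round(raw_ratio, 2))
print("log mean/median:", round(log_ratio, 2))

# expm1 reverses log1p, so results can be reported back in dollars.
recovered = [math.expm1(v) for v in logged]
```

After the transform the mean and median sit much closer together, which is the sense in which 90 and 120 stop dominating.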

Hands-on exercises

Do these quick tasks. Compare with solutions below. Then take the quick test at the end.

Exercise 1 (ex1): Numeric — summarize and flag outliers

Data (Revenue $): 12, 13, 15, 16, 16, 18, 19, 20, 22, 25, 100, null, 14

  • Tasks:
    • Compute count (incl. missing), missing %, mean, median, Q1, Q3, IQR, min, max.
    • Flag outliers using 1.5×IQR rule.
    • Recommend: median/IQR or mean/std for reporting? Briefly justify.
Solution for ex1

Count=13, Missing=1 → missing%=7.7%. Numeric count (non-missing)=12. Sorted (non-missing): 12,13,14,15,16,16,18,19,20,22,25,100.

  • Median = (16+18)/2 = 17
  • Q1 ≈ 14.5 (between 14 and 15), Q3 ≈ 21 (between 20 and 22) → IQR ≈ 6.5
  • Bounds: lower = Q1−1.5×IQR ≈ 14.5−9.75=4.75; upper ≈ 21+9.75=30.75 → 100 is an outlier.
  • Mean = (12+13+14+15+16+16+18+19+20+22+25+100)/12 = 290/12 ≈ 24.2
  • Min=12, Max=100
  • Recommendation: Use median (17) and IQR (≈6.5) due to strong right-skew and outlier.
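The missing-value bookkeeping in this solution can be sketched as follows, assuming the null is represented as Python's `None`. The point is that the missing % uses the full row count while the statistics use only the non-missing values.

```python
from statistics import mean, median

revenue = [12, 13, 15, 16, 16, 18, 19, 20, 22, 25, 100, None, 14]

present = [x for x in revenue if x is not None]
n_missing = len(revenue) - len(present)
missing_pct = 100 * n_missing / len(revenue)

print("rows:", len(revenue), "non-missing:", len(present))
print("missing %:", round(missing_pct, 1))
print("mean:", round(mean(present), 1), "median:", median(present))
print("min:", min(present), "max:", max(present))
```

A mean of about 24.2 against a median of 17 is the gap that points to reporting the median and IQR.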

Exercise 2 (ex2): Categorical — proportions and grouping

Data (Signup Source counts, total 1,200): SEO=480, Paid=300, Direct=210, Referral=90, Social=60, Other=60

  • Tasks:
    • Compute proportions (%) for each category.
    • Apply a 3% threshold to group rare categories into "Other". Which categories get grouped?
    • Report the new distribution after grouping.
Solution for ex2
  • Proportions: SEO 40%, Paid 25%, Direct 17.5%, Referral 7.5%, Social 5%, Other 5%.
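  • 3% threshold = 36 users. No category falls below it, so the 3% rule changes nothing.
  • Social and Other each sit at exactly 5%, so a strict "below 5%" rule also leaves them alone. Grouping categories at or below 5% merges Social into Other, giving a new "Other" of 10%.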

Checklist — fast univariate QA

  • Variable type identified correctly
  • Missing % computed; plan to handle NA
  • For numeric: median, IQR, p1/p99 checked; outliers flagged
  • For categorical: top categories listed; tail reviewed
  • Date range and gaps assessed (if applicable)
  • Transform or binning decision recorded
  • Business sanity check performed

Common mistakes and how to self-check

  • Using mean/std for skewed data → Check mean vs median; if far apart, prefer median/IQR.
  • Treating IDs as numeric → If a column is an identifier (unique count ≈ row count and the values carry no quantity), don’t compute numeric stats.
  • Ignoring missingness → Always compute missing % and decide: drop, impute, or keep as signal.
  • Over-pruning categories → Keep business-critical categories even if rare.
  • One-size-fits-all outlier rules → Validate with domain context (e.g., $0 sales days on holidays).
Self-check prompts
  • Is your chosen summary robust to outliers?
  • Did you verify data units and valid ranges?
  • Did you capture decisions (imputation, grouping) in notes/code for reproducibility?

Practical projects

  • E-commerce single-column audit: For 10 key columns (e.g., price, quantity, discount, device, country, date), produce a one-pager per column with summaries, plots (if available), and decisions.
  • Churn dataset triage: Evaluate each variable univariately to decide: keep, transform, or drop. Document rationale.
  • Automated univariate report: Build a script/notebook that outputs summaries (including missing %, percentiles, top categories) for any dataset.

Who this is for, prerequisites, learning path

Who this is for

  • Data Analysts, Product Analysts, and anyone preparing datasets for reporting or modeling.

Prerequisites

  • Basic statistics: mean, median, percentiles
  • Comfort with spreadsheets or a scripting language (Python/R) is helpful but not required

Learning path

  • Start: Univariate analysis (this page)
  • Next: Bivariate analysis (relationships)
  • Then: Multivariate exploration and feature engineering

Mini challenge

You receive a new column: session_duration_sec with 15% missing, min=0, p50=45, p95=900, max=7200. In 3 bullets, state your reporting metric, outlier handling, and any transform. Keep it crisp.

Next steps

  • Apply this workflow to two columns in your current dataset: one numeric and one categorical.
  • Document one transformation decision and its reason.
  • Proceed to bivariate analysis once you can complete the checklist in under 5 minutes per column.

Quick test

Test your knowledge with 6 questions. Pass with 70% or higher.
