Why this matters
Univariate analysis is the fastest way to understand a single column: its typical values, spread, shape, and issues. Data Analysts use it to catch data quality problems early and to pick the right transformations before modeling or reporting.
- Real tasks it powers: sanity-check new datasets, define bins for dashboards, choose outlier rules, set thresholds for alerts, and decide how to impute missing values.
- Outcome: cleaner data, clearer visuals, and fewer surprises later in analysis.
Example day-to-day decisions
- Marketing: Is cost-per-click too skewed for an average? Use median instead.
- Product: Are daily active users stable? Check variability and outliers.
- Ops: Which error code occurs most often? Use counts and proportions.
Concept explained simply
Univariate analysis looks at one variable at a time. You summarize, visualize, and judge quality for that one column.
- Numeric (continuous/discrete): mean/median/mode, min–max, range, variance/std, IQR, percentiles, skewness; visuals: histogram, boxplot.
- Categorical: counts, proportions, mode, unique count (cardinality); visuals: bar chart.
- Date/time: counts by period (day/week/month), min/max date, gaps; visuals: line or bar by time bucket.
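As a quick sketch, the date/time summaries above might look like this in pandas (the event dates are invented for illustration):

```python
# Sketch of date/time univariate summaries: coverage window, counts per
# month, and gap detection. The dates are made up for illustration.
import pandas as pd

events = pd.Series(pd.to_datetime([
    "2024-01-03", "2024-01-15", "2024-02-02", "2024-02-20", "2024-04-05",
]))

window = (events.min(), events.max())                       # coverage window
per_month = events.dt.to_period("M").value_counts().sort_index()
expected = pd.period_range(window[0], window[1], freq="M")
gaps = expected.difference(per_month.index)                 # months with no events
# March 2024 has no events -> a gap worth flagging
```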
Mental model
Imagine a bright flashlight on a single column. Your job: identify its type, clean it, summarize it, visualize it, interpret it, then decide the next action.
- Type → Clean → Summarize → Visualize → Interpret → Decide
Standard workflow you can reuse
- Identify variable type: numeric, categorical, date/time, or ID-like.
- Quality check: missing %, invalid values (e.g., negatives for ages), duplicates (if ID), mixed types.
- Summaries:
- Numeric: count, missing %, mean, median, std, IQR, p1/p5/p95/p99, min/max, skew.
- Categorical: unique count, top categories, proportions, rare-category tail.
- Date/time: coverage window, gaps, counts by period.
- Simple visuals (mentally or with tools): histogram or bar chart; boxplot for outliers.
- Decide actions: impute, drop, winsorize, transform (log/sqrt), bin, group rare categories, or mark as ID.
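One way to sketch the numeric branch of this workflow, assuming pandas is available (the function name and rounding choices are illustrative, not from any specific library):

```python
# A minimal sketch of the numeric summaries listed above:
# count, missing %, centre, spread, tails, skew.
import pandas as pd

def numeric_summary(s: pd.Series) -> dict:
    """Summaries for one numeric column; missing values are excluded
    from the statistics but reported as a percentage."""
    clean = s.dropna()
    return {
        "count": len(s),
        "missing_pct": round(s.isna().mean() * 100, 1),
        "mean": round(clean.mean(), 2),
        "median": clean.median(),
        "std": round(clean.std(), 2),   # sample std (ddof=1)
        "p1": clean.quantile(0.01),
        "p99": clean.quantile(0.99),
        "min": clean.min(),
        "max": clean.max(),
        "skew": round(clean.skew(), 2),
    }

ages = pd.Series([18, 19, 20, 22, 23, 23, 24, 24, 24, 25, 26, 27, 28, 41])
summary = numeric_summary(ages)
```

The same function can be applied column by column to audit a whole dataset.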
Quick variable-type check
- All unique long strings with no business meaning → likely ID (don’t compute mean).
- Counts, money, time durations → numeric (often right-skewed; consider median/log).
- Country, device, plan → categorical; focus on counts and tail.
- Timestamps → derive period features for summaries.
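These heuristics could be sketched roughly as follows; the 95% uniqueness cut-off for "ID-like" is an illustrative assumption, not a standard:

```python
# Rough sketch of the variable-type heuristics above.
import pandas as pd

def guess_type(s: pd.Series) -> str:
    if pd.api.types.is_datetime64_any_dtype(s):
        return "datetime"
    # nearly all values unique -> likely an identifier; don't compute means
    if s.nunique() >= 0.95 * s.notna().sum():
        return "id-like"
    if pd.api.types.is_numeric_dtype(s):
        return "numeric"
    return "categorical"
```

Treat the result as a starting point and confirm against business meaning.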
Worked examples
Example 1: Numeric (Customer Age)
Sample ages: 18, 19, 20, 22, 23, 23, 24, 24, 24, 25, 26, 27, 28, 41
- Median: sort and take middle → median = 24
- IQR: Q1 ≈ 22, Q3 ≈ 26 → IQR = 4
- Outlier bounds: [Q1 − 1.5×IQR, Q3 + 1.5×IQR] = [16, 32]
- 41 is above 32 → potential outlier. Decide: verify first (an age of 41 is plausible, so keep it); winsorize later only if it distorts a model.
More stats
- Mean ≈ 24.6 (pulled up slightly by the 41)
- Std dev ≈ 5.5 (sample)
- Skew: slightly right due to 41
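The Example 1 numbers can be checked in plain Python. Note that Q1/Q3 here are computed as medians of the lower and upper halves (the convention used above); other quartile methods give slightly different values:

```python
# Checking Example 1 by hand: median, quartiles, and the 1.5*IQR rule.
import statistics

ages = [18, 19, 20, 22, 23, 23, 24, 24, 24, 25, 26, 27, 28, 41]

def half_quartiles(values):
    """Q1/Q3 as medians of the lower/upper halves; for odd n the overall
    median is excluded from both halves."""
    s = sorted(values)
    half = len(s) // 2
    lower = s[:half]
    upper = s[half:] if len(s) % 2 == 0 else s[half + 1:]
    return statistics.median(lower), statistics.median(upper)

q1, q3 = half_quartiles(ages)
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in ages if x < lo or x > hi]
print(q1, q3, iqr, (lo, hi), outliers)  # 22 26 4 (16.0, 32.0) [41]
```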
Example 2: Categorical (Device Type)
Counts: Mobile=610, Desktop=290, Tablet=80, Other=20 (total=1000)
- Proportions: 61%, 29%, 8%, 2%
- Decision: Group categories < 3% into "Other" to declutter charts. Keep original raw values in data.
- Mode: Mobile
Quality checks
- Unique count = 4 → manageable.
- Check spelling consistency (e.g., "mobile" vs "Mobile").
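A minimal sketch of the proportion and grouping steps from Example 2 (the 3% threshold is the judgement call made above, not a fixed rule):

```python
# Proportions, mode, and rare-category grouping for Device Type.
counts = {"Mobile": 610, "Desktop": 290, "Tablet": 80, "Other": 20}
total = sum(counts.values())
props = {k: v / total for k, v in counts.items()}   # Mobile 0.61, ...

threshold = 0.03                                    # group categories below 3%
grouped = {}
for cat, p in props.items():
    key = cat if p >= threshold else "Other"
    grouped[key] = grouped.get(key, 0) + p
```

Grouping is for display only; keep the raw category values in the data, as noted above.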
Example 3: Numeric with skew (Order Value in $)
Values (n=10): 8, 10, 11, 12, 12, 13, 14, 16, 90, 120
- Mean = 30.6; Median = 12.5 → heavy right skew.
- IQR (Q1≈11, Q3≈16) → IQR=5; upper bound= Q3+1.5×IQR = 23.5 → 90 and 120 are outliers.
- Action options:
- Report typical value as median (12.5) with IQR.
- Use log transform for modeling: log1p(x) to stabilize variance.
- Winsorize at p99 for robust aggregates if business-approved.
Log intuition
Log compresses large values so 90 and 120 don’t dominate variance, improving model stability and visual interpretability.
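A quick way to see the compression, using the Example 3 order values:

```python
# log1p shrinks the spread between small and large values, so the
# two big orders no longer dominate.
import math

values = [8, 10, 11, 12, 12, 13, 14, 16, 90, 120]
logged = [math.log1p(x) for x in values]

ratio_raw = max(values) / min(values)   # 15.0: the largest is 15x the smallest
ratio_log = max(logged) / min(logged)   # ~2.2 after the transform
```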
Hands-on exercises
Do these quick tasks. Compare with solutions below. Then take the quick test at the end.
Exercise 1 (ex1): Numeric — summarize and flag outliers
Data (Revenue $): 12, 13, 15, 16, 16, 18, 19, 20, 22, 25, 100, null, 14
- Tasks:
- Compute count (incl. missing), missing %, mean, median, Q1, Q3, IQR, min, max.
- Flag outliers using 1.5×IQR rule.
- Recommend: median/IQR or mean/std for reporting? Briefly justify.
Show solution for ex1
Count=13, Missing=1 → missing%=7.7%. Numeric count (non-missing)=12. Sorted (non-missing): 12,13,14,15,16,16,18,19,20,22,25,100.
- Median = (16+18)/2 = 17
- Q1 = 14.5 (median of the lower half, between 14 and 15), Q3 = 21 (median of the upper half, between 20 and 22) → IQR = 6.5
- Bounds: lower = Q1−1.5×IQR ≈ 14.5−9.75=4.75; upper ≈ 21+9.75=30.75 → 100 is an outlier.
- Mean ≈ (12+13+14+15+16+16+18+19+20+22+25+100)/12 = 24.2
- Min=12, Max=100
- Recommendation: Use median (17) and IQR (≈6.5) due to strong right-skew and outlier.
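A sketch that re-checks this solution, using NaN to stand in for the null and half-medians for Q1/Q3 as in the hand calculation:

```python
# Reproducing the ex1 solution with pandas; NaN represents the null.
import statistics
import numpy as np
import pandas as pd

rev = pd.Series([12, 13, 15, 16, 16, 18, 19, 20, 22, 25, 100, np.nan, 14])
missing_pct = round(rev.isna().mean() * 100, 1)     # 7.7
clean = sorted(rev.dropna())                        # 12 non-missing values

half = len(clean) // 2
q1 = statistics.median(clean[:half])                # 14.5
q3 = statistics.median(clean[half:])                # 21.0
iqr = q3 - q1                                       # 6.5
upper = q3 + 1.5 * iqr                              # 30.75
outliers = [x for x in clean if x > upper]          # [100.0]
```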
Exercise 2 (ex2): Categorical — proportions and grouping
Data (Signup Source counts, total 1,200): SEO=480, Paid=300, Direct=210, Referral=90, Social=60, Other=60
- Tasks:
- Compute proportions (%) for each category.
- Apply a 3% threshold to group rare categories into "Other". Which categories get grouped?
- Report the new distribution after grouping.
Show solution for ex2
- Proportions: SEO 40%, Paid 25%, Direct 17.5%, Referral 7.5%, Social 5%, Other 5%.
- 3% threshold = 36 users. No category falls at or below 3%, so nothing changes under the 3% rule.
- With a 5% threshold instead (grouping categories at or below 5%), Social and Other would be combined; the new "Other" ≈ 10%.
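This solution can be re-checked with a small helper; it reads the threshold as "at or below", which matches the 5% outcome described above:

```python
# Proportions and threshold-based grouping for Signup Source.
counts = {"SEO": 480, "Paid": 300, "Direct": 210,
          "Referral": 90, "Social": 60, "Other": 60}
total = sum(counts.values())                       # 1200
pct = {k: 100 * v / total for k, v in counts.items()}

def group_rare(pct, threshold):
    """Fold categories at or below the threshold (in %) into 'Other'."""
    out = {}
    for cat, p in pct.items():
        key = cat if p > threshold else "Other"
        out[key] = out.get(key, 0) + p
    return out

after_3 = group_rare(pct, 3)   # unchanged: every category is above 3%
after_5 = group_rare(pct, 5)   # Social and Other fold together -> Other = 10%
```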
Checklist — fast univariate QA
- Variable type identified correctly
- Missing % computed; plan to handle NA
- For numeric: median, IQR, p1/p99 checked; outliers flagged
- For categorical: top categories listed; tail reviewed
- Date range and gaps assessed (if applicable)
- Transform or binning decision recorded
- Business sanity check performed
Common mistakes and how to self-check
- Using mean/std for skewed data → Check mean vs median; if far apart, prefer median/IQR.
- Treating IDs as numeric → If unique count ≈ row count, don’t compute numeric stats.
- Ignoring missingness → Always compute missing % and decide: drop, impute, or keep as signal.
- Over-pruning categories → Keep business-critical categories even if rare.
- One-size-fits-all outlier rules → Validate with domain context (e.g., $0 sales days on holidays).
Self-check prompts
- Is your chosen summary robust to outliers?
- Did you verify data units and valid ranges?
- Did you capture decisions (imputation, grouping) in notes/code for reproducibility?
Practical projects
- E-commerce single-column audit: For 10 key columns (price, quantity, discount, device, country, date), produce a one-pager per column with summaries, plots (if available), and decisions.
- Churn dataset triage: Evaluate each variable univariately to decide: keep, transform, or drop. Document rationale.
- Automated univariate report: Build a script/notebook that outputs summaries (including missing %, percentiles, top categories) for any dataset.
Who this is for, prerequisites, learning path
Who this is for
- Data Analysts, Product Analysts, and anyone preparing datasets for reporting or modeling.
Prerequisites
- Basic statistics: mean, median, percentiles
- Comfort with spreadsheets or a scripting language (Python/R) is helpful but not required
Learning path
- Start: Univariate analysis (this page)
- Next: Bivariate analysis (relationships)
- Then: Multivariate exploration and feature engineering
Mini challenge
You receive a new column: session_duration_sec with 15% missing, min=0, p50=45, p95=900, max=7200. In 3 bullets, state your reporting metric, outlier handling, and any transform. Keep it crisp.
Next steps
- Apply this workflow to two columns in your current dataset: one numeric and one categorical.
- Document one transformation decision and its reason.
- Proceed to bivariate analysis once you can complete the checklist in under 5 minutes per column.
Quick test
Anyone can take the test. If you are logged in, your progress will be saved.