Why this matters
As a Data Scientist, you will constantly summarize data before modeling or making decisions. Descriptive statistics help you check data quality, choose suitable models, communicate insights clearly, and set baselines for experiments.
- Real tasks: sanity-check feature distributions, compare A/B variant metrics, summarize user behavior, detect outliers before training.
- Hiring screens often include quick questions on mean, median, variance, quartiles, z-scores, and interpreting plots.
Concept explained simply
Descriptive statistics turn raw numbers into a quick story about center, spread, and shape.
- Center: where the data tends to be (mean, median, mode).
- Spread: how variable it is (range, interquartile range, variance, standard deviation).
- Shape: is the data symmetric or skewed, and are there outliers? Check with percentiles, box plots, and histograms.
Mental model
Imagine your data as a line of pebbles on a ruler:
- Median is the middle pebble.
- Mean is the balance point if the ruler were a seesaw.
- IQR (Q3 − Q1) is the width of the span that holds the middle 50% of the pebbles.
- Standard deviation measures the typical distance from the mean (strictly, the square root of the average squared distance).
- Outliers are pebbles far from the rest, beyond the “fences.”
Core concepts and quick rules
- Mean: average. Sensitive to outliers.
- Median: middle value. Robust to outliers/skew.
- Mode: most frequent value. Useful for discrete/categorical-like numeric data.
- Range: max − min. Very sensitive to outliers.
- Variance (sample): sum of squared distances from the mean, divided by n−1 (roughly the average squared distance).
- Standard deviation (sample): sqrt(variance). Same units as data.
- Quartiles (Q1, Q3): 25th and 75th percentiles. IQR = Q3 − Q1.
- Outlier rule of thumb: values < Q1 − 1.5×IQR or > Q3 + 1.5×IQR.
- Z-score: (x − mean) / sd. Tells how many standard deviations x is from mean.
- Report pairs: symmetric data → mean & sd; skewed/outliers → median & IQR.
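The variance, standard deviation, and z-score rules above can be written out directly from their definitions. This is a minimal sketch, not a reference implementation: the function names and the sample values are made up for illustration.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two values")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_sd(values):
    """Square root of the sample variance; same units as the data."""
    return math.sqrt(sample_variance(values))

def z_score(x, values):
    """How many sample standard deviations x lies from the mean."""
    mean = sum(values) / len(values)
    return (x - mean) / sample_sd(values)

# Illustrative values only (not from the examples on this page).
data = [2, 4, 4, 4, 5, 5, 7, 9]
print("variance:", round(sample_variance(data), 3))
print("sd:", round(sample_sd(data), 3))
print("z-score of 9:", round(z_score(9, data), 3))
```

In practice the standard library's statistics.variance and statistics.stdev compute the same n−1 versions; writing them by hand once is mainly a way to see where the n−1 goes.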
Worked examples
Example 1 — Core summaries
Data: 9, 10, 10, 11, 12, 14, 15, 20
- Mean = (sum)/8 = 101/8 = 12.625
- Median = average of the 4th and 5th sorted values = (11+12)/2 = 11.5
- Mode = 10
- Range = 20 − 9 = 11
- Q1 = median of the lower half (9, 10, 10, 11) = (10+10)/2 = 10
- Q3 = median of the upper half (12, 14, 15, 20) = (14+15)/2 = 14.5
- IQR = 14.5 − 10 = 4.5
- Sample sd ≈ 3.62
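If you want to check the hand calculations in Example 1, a short script with Python's standard statistics module is enough. The helper median_of_halves_quartiles is a hypothetical name. It follows the median-of-halves convention used in the example and assumes that, for an odd count, the middle value is left out of both halves (one common choice among several).

```python
import statistics

def median_of_halves_quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumption: with an odd number of values, the middle value is
    excluded from both halves.
    """
    if len(values) < 2:
        raise ValueError("need at least two values")
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[len(ordered) - half:]
    return statistics.median(lower), statistics.median(upper)

data = [9, 10, 10, 11, 12, 14, 15, 20]
q1, q3 = median_of_halves_quartiles(data)

summary = {
    "mean": statistics.mean(data),
    "median": statistics.median(data),
    "mode": statistics.mode(data),
    "range": max(data) - min(data),
    "q1": q1,
    "q3": q3,
    "iqr": q3 - q1,
    "sample_sd": round(statistics.stdev(data), 2),  # stdev uses n - 1
}
for name, value in summary.items():
    print(f"{name}: {value}")
```

The printed values should match the bullets above: mean 12.625, median 11.5, mode 10, range 11, Q1 10, Q3 14.5, IQR 4.5, sample sd 3.62.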
Example 2 — Outlier detection with IQR
Data: 5, 6, 6, 7, 7, 8, 12, 30
- Q1 = median of the lower half (5, 6, 6, 7) = 6; Q3 = median of the upper half (7, 8, 12, 30) = (8+12)/2 = 10 → IQR = 4
- Fences: lower = 6 − 1.5×4 = 0; upper = 10 + 1.5×4 = 16
- Outliers: values < 0 or > 16 → 30 is an outlier
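The same fence calculation can be sketched in a few lines of Python. The function names iqr_fences and iqr_outliers are illustrative, and the quartiles again use the median-of-halves convention from this page, so other conventions may place the fences slightly differently.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[len(ordered) - half:])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_outliers(values, k=1.5):
    """Return the values that fall outside the fences."""
    lower, upper = iqr_fences(values, k)
    return [x for x in values if x < lower or x > upper]

data = [5, 6, 6, 7, 7, 8, 12, 30]
print(iqr_fences(data))    # (0.0, 16.0)
print(iqr_outliers(data))  # [30]
```

The multiplier k is left as a parameter because 1.5 is a rule of thumb, not a law; a flagged value still deserves a look with domain knowledge before you drop it.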
Example 3 — Pick the right summaries
Scenario: Monthly incomes in a city are right-skewed with a few very high earners. Mean is pulled up by outliers.
- Use: median & IQR for typical value and spread.
- A box plot will highlight skew and outliers clearly.
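To see the pull of a few high earners in numbers, here is a small sketch. The income figures are invented for illustration and do not come from any real city.

```python
import statistics

# Hypothetical monthly incomes: most are modest, two are very high.
incomes = [2100, 2300, 2400, 2500, 2600, 2800, 3000, 3200, 15000, 40000]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

print(f"mean:   {mean_income:.0f}")
print(f"median: {median_income:.0f}")

# A large gap between mean and median is a quick hint of skew or outliers.
print(f"mean / median: {mean_income / median_income:.2f}")
```

In this made-up sample the mean is 7590 while the median is 2700, so the mean describes almost nobody: eight of the ten people earn less than half of it.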
How to compute quickly
- Sort your data once. Many stats (median, quartiles, IQR, outliers) follow directly from the sorted list.
- Pick center and spread based on shape: skewed → median & IQR; symmetric → mean & sd.
- Use IQR fences for a quick outlier check before modeling.
- Standardize with z-scores to compare across different scales.
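As a rough illustration of the last point, the sketch below standardizes two features measured in different units so they can be compared on one scale. The feature names and values are hypothetical.

```python
import statistics

def z_scores(values):
    """Standardize values: subtract the mean, divide by the sample sd."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample sd (n - 1)
    if sd == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(x - mean) / sd for x in values]

# Hypothetical features on very different scales.
session_minutes = [3, 5, 8, 12, 22]
purchase_dollars = [10, 40, 45, 50, 55]

for name, feature in [("minutes", session_minutes), ("dollars", purchase_dollars)]:
    standardized = [round(z, 2) for z in z_scores(feature)]
    print(name, standardized)
```

After standardizing, each feature has mean 0 and sample sd 1, so a z-score of 1.5 means the same thing in minutes as it does in dollars: one and a half standard deviations above that feature's mean.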
Exercises (practice here, then open solutions)
These mirror the tasks in the Exercises section below (ex1–ex3). Try them first; then open solutions.
- ex1: Compute mean, median, mode, range, Q1, Q3, IQR, and sample sd.
- ex2: Use IQR to find outliers and state the fences.
- ex3: Choose the best center and spread for a skewed scenario, and explain why.
Self-check checklist
- You can decide between mean/sd vs median/IQR based on distribution shape.
- You can compute quartiles and IQR from a sorted list.
- You can apply the 1.5×IQR rule to flag outliers.
- You can compute and interpret a z-score.
Common mistakes and how to self-check
- Using mean with strong skew/outliers. Self-check: compare mean vs median; if far apart, prefer median & IQR.
- Forgetting n−1 in sample variance/sd. Self-check: confirm denominator for sample-based estimates.
- Mixing quartile conventions. Self-check: state which convention you use (for example, median-of-halves or Tukey's hinges) and stay consistent within an analysis.
- Calling any extreme value an “outlier” without context. Self-check: compute IQR fences and also consider domain knowledge.
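Two of these mistakes are easy to demonstrate on the Example 1 data. The sketch below assumes Python 3.8 or later, where statistics.quantiles is available. It compares the n−1 and n denominators, then three ways of computing Q3.

```python
import statistics

data = [9, 10, 10, 11, 12, 14, 15, 20]  # Example 1 data

# Sample vs population standard deviation.
sample_sd = statistics.stdev(data)       # divides by n - 1
population_sd = statistics.pstdev(data)  # divides by n
print(f"sample sd: {sample_sd:.2f}, population sd: {population_sd:.2f}")

# Quartile conventions: three reasonable answers for Q3.
ordered = sorted(data)
half = len(ordered) // 2
halves_q3 = statistics.median(ordered[len(ordered) - half:])  # median-of-halves
exclusive_q3 = statistics.quantiles(data, n=4, method="exclusive")[2]
inclusive_q3 = statistics.quantiles(data, n=4, method="inclusive")[2]
print("Q3 median-of-halves:", halves_q3)
print("Q3 quantiles exclusive:", exclusive_q3)
print("Q3 quantiles inclusive:", inclusive_q3)
```

None of the three Q3 values is wrong; they come from different interpolation rules. What matters is naming the rule you used, especially when comparing your numbers against a tool that may default to another one.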
Practical projects
- Product metrics snapshot: summarize daily active users for the last 30 days (median, IQR, outliers) and write 3 bullet insights.
- Experiment pre-check: for an A/B dataset, compute baseline mean & sd and check for skew/outliers; suggest data transformations if needed.
- Feature audit: pick 5 numeric features from a public dataset; for each, report center, spread, outliers, and which summary pair is appropriate.
Who this is for
- Aspiring and early-career Data Scientists who need strong data summarization skills.
- Analysts and ML engineers validating data before modeling.
Prerequisites
- Comfort with basic arithmetic and order of operations.
- Knowing how to sort data and count observations.
Learning path
- Descriptive Statistics (this page)
- Probability basics (random variables, distributions)
- Sampling and Central Limit Theorem
- Confidence intervals and Hypothesis testing
- Effect sizes and Power
Exercises — detailed prompts and solutions
Exercise ex1 — Compute core summaries
Data: 12, 15, 14, 10, 9, 10, 11, 20
Tasks: mean, median, mode(s), range, Q1, Q3, IQR, sample sd (2 decimals).
Try it, then compare with the solution below.
Exercise ex2 — Detect outliers with IQR
Data: 5, 6, 6, 7, 7, 8, 12, 30. Find Q1, Q3, IQR, outlier fences, and list outliers.
Exercise ex3 — Pick the right summary
Scenario: A highly right-skewed distribution of household incomes. What center and spread would you report, and why?
Mini challenge
You receive 1, 1, 2, 2, 2, 3, 12 as a feature vector. Without a calculator, decide quickly: should you report mean & sd or median & IQR to describe it to stakeholders? Justify in 1–2 sentences.
Next steps
- Apply these summaries to a dataset you care about (product, sports, finance).
- Move to Probability basics to understand uncertainty behind these summaries.