What is Descriptive Statistics for Data Analysts?
Descriptive statistics summarize data so you can understand what is typical, how spread out values are, and whether there are unusual patterns. As a Data Analyst, you will use them to profile datasets, validate assumptions, and communicate insights clearly to stakeholders before any modeling.
- Central tendency: mean, median, mode
- Dispersion: variance, standard deviation, range, IQR
- Distribution shape: skewness, kurtosis
- Position: percentiles and quantiles
- Tabulations: frequency tables, cross tabs
- Estimation basics: sampling, standard error, confidence intervals, effect size
Typical analyst tasks enabled
- Quality-check a new dataset (spot outliers, input errors)
- Describe user behavior (e.g., median session length vs. average)
- Segment and compare groups (e.g., conversion by channel)
- Provide ranges and uncertainty (e.g., 95% CI for average order value)
Who this is for
- Aspiring or junior Data Analysts who need a strong foundation
- Business analysts transitioning to quantitative work
- Data-savvy PMs who need reliable summaries and comparisons
Prerequisites
- Comfort with arithmetic and ratios
- Basic spreadsheets (sum, average, sort/filter)
- Optional but helpful: Intro SQL or Python
Learning path
- Describe the center: mean, median, mode; when to use each.
- Measure spread: variance, standard deviation, IQR; outlier impact.
- Understand position: percentiles, quantiles, ranks.
- Tabulate: frequency tables and cross tabs for categories.
- Shape: skewness and kurtosis to assess tails and asymmetry.
- Sampling and uncertainty: sampling basics, standard error, confidence intervals.
- Effect size: quantify practical differences beyond p-values.
- Interpret and communicate: write clear, decision-ready summaries.
Milestones checklist
- Compute mean/median/mode and explain when to prefer median
- Calculate SD and IQR; identify outliers using IQR rule
- Read and build a frequency table and cross tab with row/column percentages
- Explain right vs left skew and what it implies
- Construct a 95% CI for a mean and interpret it plainly
- Report a simple effect size (e.g., Cohen’s d) with context
Worked examples
Use this small dataset of daily orders for examples:
orders = [2, 3, 3, 4, 6, 8, 50]
Example 1 — Mean vs. median with an outlier
Mean = (2+3+3+4+6+8+50)/7 = 76/7 ≈ 10.86; Median = 4. The outlier (50) inflates the mean, so the median better represents a typical day.
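As a quick check of Example 1, here is a minimal sketch using Python's standard `statistics` module. The variable names are illustrative and not part of the original example.

```python
from statistics import mean, median, mode

orders = [2, 3, 3, 4, 6, 8, 50]

avg = mean(orders)          # pulled upward by the single 50-order day
mid = median(orders)        # middle value of the sorted list
most_common = mode(orders)  # most frequent value

print(f"mean={avg:.2f}, median={mid}, mode={most_common}")

# Dropping the outlier shows how sensitive the mean is compared with the median
without_outlier = [x for x in orders if x != 50]
print(f"without 50: mean={mean(without_outlier):.2f}, median={median(without_outlier)}")
```

Removing one value moves the mean by more than six orders while the median shifts by only half an order, which is why the median is the safer "typical day" figure here.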
Example 2 — Variance, SD, and IQR
- Sorted: [2,3,3,4,6,8,50]
- Q1=3, Q2=4, Q3=8 → IQR=5 (quartiles taken as medians of the lower and upper halves, excluding the overall median)
- Sample variance (s²) ≈ 302.14; SD (s) ≈ 17.38
IQR shows the middle spread (robust to outliers); SD shows large overall spread due to the outlier.
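The spread measures in Example 2 can be reproduced with the sketch below. It assumes the median-of-halves convention for quartiles (the one that yields Q1=3 and Q3=8 for this dataset); other conventions, including spreadsheet functions, may return slightly different quartiles.

```python
from statistics import median, pvariance, stdev, variance

orders = sorted([2, 3, 3, 4, 6, 8, 50])

s2 = variance(orders)       # sample variance: divides by n - 1
s = stdev(orders)           # sample standard deviation
sigma2 = pvariance(orders)  # population variance: divides by n

# Median-of-halves quartiles: drop the overall median when n is odd
n = len(orders)
half = n // 2
lower_half = orders[:half]
upper_half = orders[half + 1:] if n % 2 else orders[half:]
q1, q3 = median(lower_half), median(upper_half)
iqr = q3 - q1

print(f"sample variance={s2:.2f}, sample SD={s:.2f}, population variance={sigma2:.2f}")
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
```

Printing the sample and population variances side by side makes the n−1 versus n distinction visible on a small dataset.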
Example 3 — 90th percentile (P90)
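For 7 values, the inclusive (linear-interpolation) method puts P90 at position 1 + 0.9 × (7 − 1) = 6.4, so it lies 40% of the way from the 6th value (8) to the 7th (50): 8 + 0.4 × 42 = 24.8. Interpretation: roughly 90% of days have about 25 orders or fewer. With only 7 values this estimate leans heavily on the single outlier, so treat it as rough.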
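To see the interpolation step by step, the sketch below implements an inclusive, linear-interpolation percentile by hand and cross-checks it against `statistics.quantiles`. The helper name `percentile_inc` is hypothetical, chosen to echo the spreadsheet function with the same convention.

```python
import math
from statistics import quantiles

def percentile_inc(values, p):
    """Linear-interpolation percentile on the inclusive definition (0 <= p <= 1)."""
    data = sorted(values)
    pos = p * (len(data) - 1)        # 0-based fractional position
    lo = math.floor(pos)
    hi = min(lo + 1, len(data) - 1)  # stay inside the list when p == 1
    return data[lo] + (pos - lo) * (data[hi] - data[lo])

orders = [2, 3, 3, 4, 6, 8, 50]
p90 = percentile_inc(orders, 0.90)

# Standard-library cross-check: the last of the nine decile cut points is P90
p90_stdlib = quantiles(orders, n=10, method="inclusive")[-1]

print(f"P90 (manual)={p90:.1f}, P90 (statistics.quantiles)={p90_stdlib:.1f}")
```

Other percentile conventions (for example, exclusive methods) can give noticeably different answers on a dataset this small, so it helps to state which one a report uses.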
Example 4 — Frequency table (categorizing order sizes)
Bins: Small (≤3), Medium (4–8), Large (>8)
Small: 3 values (2,3,3)
Medium: 3 values (4,6,8)
Large: 1 value (50)
Percentages: Small 42.9%, Medium 42.9%, Large 14.3%.
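The binning in Example 4 could be automated along these lines. This is a minimal sketch; the function name `size_bin` is illustrative, and the thresholds are the ones stated in the example.

```python
from collections import Counter

orders = [2, 3, 3, 4, 6, 8, 50]

def size_bin(x):
    """Map an order count to the bins from Example 4."""
    if x <= 3:
        return "Small"
    if x <= 8:
        return "Medium"
    return "Large"

counts = Counter(size_bin(x) for x in orders)
total = sum(counts.values())

# Print count and share of total for each bin, in a fixed order
for label in ("Small", "Medium", "Large"):
    share = counts[label] / total
    print(f"{label:<7}{counts[label]:>3}{share:>8.1%}")
```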
Example 5 — Cross tab and conditional percentages
Channel vs Purchase (1=yes,0=no)
Rows: Channel [Email, Ads]
Cols: Purchase [0,1]
Email: [60, 40]
Ads: [75, 25]
Row percentages:
Email: No 60%, Yes 40%
Ads: No 75%, Yes 25%
Email outperforms Ads in conversion (relative to each channel’s traffic).
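To make the row-versus-column distinction concrete, here is a small sketch that computes both from the Example 5 counts. The dictionary layout and function names are assumptions made for illustration.

```python
# Counts from Example 5: rows are channels, columns are purchase outcomes
crosstab = {
    "Email": {"No": 60, "Yes": 40},
    "Ads": {"No": 75, "Yes": 25},
}

def row_percentages(table):
    """Each cell as a percentage of its row total."""
    result = {}
    for row, cells in table.items():
        total = sum(cells.values())
        result[row] = {col: 100 * n / total for col, n in cells.items()}
    return result

def column_percentages(table):
    """Each cell as a percentage of its column total."""
    col_totals = {}
    for cells in table.values():
        for col, n in cells.items():
            col_totals[col] = col_totals.get(col, 0) + n
    return {
        row: {col: 100 * n / col_totals[col] for col, n in cells.items()}
        for row, cells in table.items()
    }

row_pct = row_percentages(crosstab)
col_pct = column_percentages(crosstab)

print(f"Email conversion rate (row %): {row_pct['Email']['Yes']:.1f}")
print(f"Share of purchasers from Email (column %): {col_pct['Email']['Yes']:.1f}")
```

Row percentages answer "what share of each channel's visitors purchased?", while column percentages answer "what share of all purchasers came from each channel?". The two can tell quite different stories from the same counts.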
Drills and exercises
- [ ] Compute mean, median, and mode for three recent metrics (e.g., daily signups). Note differences.
- [ ] Calculate IQR and flag outliers using 1.5×IQR rule.
- [ ] Build a frequency table for a categorical field (e.g., device type).
- [ ] Create a 2×2 cross tab (e.g., new vs. returning by converted vs. not) with row and column percentages.
- [ ] Identify skew direction in a numeric field by comparing mean vs. median and using a histogram.
- [ ] Construct a 95% confidence interval for a mean from a sample and interpret it plainly.
- [ ] Compute a simple effect size (Cohen’s d) between two groups.
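For the outlier drill, the 1.5×IQR rule might be sketched as follows. This version takes quartiles from `statistics.quantiles` with the inclusive method, which gives Q3 = 7 for the orders dataset rather than the 8 from the median-of-halves calculation in Example 2; the same value is flagged either way. The function name `iqr_outliers` is hypothetical.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the k x IQR rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [x for x in values if x < lower or x > upper]
    return lower, upper, outliers

orders = [2, 3, 3, 4, 6, 8, 50]
lower, upper, flagged = iqr_outliers(orders)
print(f"fences=({lower}, {upper}), flagged={flagged}")
```

A flagged value is a prompt to investigate, not an automatic deletion: it may be a data-entry error or a genuine busy day.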
Mini tasks you can do in a spreadsheet
- Use QUARTILE.INC for Q1/Q3 and compute IQR (it interpolates between values, so on small datasets it can differ from hand-computed quartiles)
- Use PERCENTILE.INC for P90
- Use COUNTIF/COUNTIFS to build frequency tables
- Use STDEV.S for sample standard deviation
Common mistakes and how to fix them
- Using the mean with skewed data: Prefer the median (or a winsorized mean), and report it alongside the mean.
- Confusing population vs. sample formulas: Use sample SD/variance (n−1) for samples.
- Ignoring outliers: Always check IQR and visualize; investigate data quality or report robust stats.
- Reading cross tabs without conditioning: Always specify row or column percentages and why.
- Misinterpreting CIs: A 95% CI means the procedure covers the true mean 95% of the time, not that there is a 95% chance the true mean lies in your specific interval.
- Overstating small differences: Add effect size and context (practical significance) alongside p-values.
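As an illustration of the effect-size advice in the last item, here is a minimal sketch of Cohen's d with a pooled sample standard deviation. The two groups are hypothetical values made up purely for illustration, not real data.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Cohen's d: mean difference divided by the pooled sample SD."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)

# Hypothetical order values, invented for this example only
returning_users = [30, 35, 40, 45, 50]
new_users = [20, 25, 30, 35, 40]

d = cohens_d(returning_users, new_users)
print(f"Cohen's d = {d:.2f}")
```

The commonly quoted benchmarks (about 0.2 small, 0.5 medium, 0.8 large) are rough conventions; whether a difference matters in practice still depends on business context.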
Debugging tips
- If SD is huge, check for unit mix-ups or extreme outliers.
- If percentiles look off, confirm the data is sorted as expected and check whether you used the inclusive or exclusive function version (e.g., PERCENTILE.INC vs. PERCENTILE.EXC).
- If cross-tab totals don’t match, verify filters and how missing values are handled.
Mini project: Customer Order Insights
Goal: Summarize orders to guide operations and marketing.
- Clean: Remove obvious errors (negative orders). Document any removals.
- Center and spread: Report mean, median, SD, and IQR for daily orders.
- Percentiles: Provide P50, P75, P90, P95 to inform staffing thresholds.
- Cross tab: By device (Desktop/Mobile) and conversion (Yes/No), give row percentages.
- Uncertainty: 95% CI for average order value from a 30-day sample.
- Effect size: Compare average order value between new vs. returning users (Cohen’s d) and discuss practical impact.
- Deliverable: A one-page brief with a chart (histogram or box plot) and plain-language insights.
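The uncertainty step in the list above could be sketched like this. It is a t-based interval computed on the small daily-orders dataset from the worked examples; the critical values are assumptions copied from a standard t table, and the helper name `mean_ci` is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def mean_ci(values, t_crit):
    """Return (mean, standard_error, lower, upper) for a t-based interval."""
    n = len(values)
    m = mean(values)
    se = stdev(values) / sqrt(n)  # standard error of the mean
    margin = t_crit * se
    return m, se, m - margin, m + margin

# Assumption: two-sided 95% critical values taken from a t table
T_CRIT_DF6 = 2.447   # df = 6  (n = 7, used below)
T_CRIT_DF29 = 2.045  # df = 29 (n = 30, e.g. a 30-day sample)

orders = [2, 3, 3, 4, 6, 8, 50]
m, se, lower, upper = mean_ci(orders, T_CRIT_DF6)
print(f"mean={m:.2f}, SE={se:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

On this dataset the lower bound falls below zero, which is impossible for order counts. That is a sign that with n = 7 and one extreme outlier the t-based interval is only a rough guide; a larger sample or a more robust summary would be more trustworthy.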
What good looks like
- Clear distinction between robust and non-robust metrics
- Explicit handling of outliers and missing data
- Row/column percentage labels on cross tabs
- Precise CI interpretation and a concise recommendation
More practical project ideas
- E-commerce: Daily revenue distribution with staffing recommendations from percentiles
- Marketing: Channel × device cross tab with conversion rates and effect sizes vs. baseline
- Product: Feature usage percentiles and skewness to identify power-user features
Subskills
This skill includes the following subskills. Explore each to practice focused capabilities:
- Central Tendency: Mean, Median, Mode
- Dispersion: Variance, Standard Deviation, IQR
- Percentiles and Quantiles
- Frequency Tables
- Cross Tabulation
- Skewness and Kurtosis
- Confidence Intervals Basics
- Sampling Basics
- Standard Error Basics
- Effect Size Basics
- Practical Interpretation
Next steps
- Re-run descriptive stats on 2–3 different datasets to build speed and intuition.
- Practice clear, one-paragraph interpretations for non-technical readers.
- When comfortable, move to inferential techniques (hypothesis tests, regression) while keeping robust descriptive summaries in your workflow.