Who this is for
Beginner and early-career data analysts who need to collect representative samples to estimate means, proportions, or build dashboards and reports from large datasets.
Prerequisites
- Comfort with averages, percentages, and basic variance/standard deviation.
- Ability to use spreadsheets or SQL to filter and randomly sample rows.
- Basic understanding of confidence level and margin of error (conceptually).
Why this matters
- Estimating customer satisfaction without surveying everyone.
- Auditing transactions efficiently when full review is too costly.
- Measuring defect rates or conversion rates with reliable precision.
- Creating quick, representative datasets for prototyping dashboards and A/B test checks.
Concept explained simply
- Population: the full set you care about (all orders this quarter).
- Sampling frame: the list you can actually sample from (your database table).
- Sample: the subset you select.
- Parameter: a true but unknown population value (real average order value).
- Statistic: a value computed from your sample (your sample mean) used to estimate the parameter.
- Sampling error: the natural difference between your sample statistic and the true parameter.
- Bias: systematic error that pushes estimates consistently high or low (e.g., surveying only power users).
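To make the difference between sampling error and bias concrete, here is a minimal simulation sketch. The population (values 1 to 1,000), the "power user" cutoff, and the seed are all made up for illustration; only the pattern matters: the random sample misses the parameter by chance, while the power-user sample misses it systematically.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical population: one value per user, 1..1000 (illustrative only).
population = list(range(1, 1001))
true_mean = statistics.mean(population)  # the parameter: 500.5

# Simple random sample: differs from the parameter only by chance.
srs = random.sample(population, 100)
srs_mean = statistics.mean(srs)

# Biased sample: drawn only from the top 20% ("power users").
power_users = [x for x in population if x > 800]
biased = random.sample(power_users, 100)
biased_mean = statistics.mean(biased)

print(f"parameter:            {true_mean}")
print(f"SRS statistic:        {srs_mean:.1f}")
print(f"power-user statistic: {biased_mean:.1f}")
```

Rerunning with other seeds moves the SRS statistic around the parameter, but the power-user statistic stays far above it every time.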
Core probability sampling methods
- Simple Random Sample (SRS): each unit has an equal chance of being selected. Default choice if you have a clean frame.
- Systematic: pick every k-th unit after a random start. Great for ordered lists when there’s no periodic pattern.
- Stratified: split into strata (e.g., device type), sample from each. Use when subgroups differ and must be represented.
- Cluster: sample clusters (e.g., stores), then all or some items within. Use when listing all units is hard but clusters are easy.
Mental model
Think of sampling as a funnel: define the target population → confirm your frame → choose a method that keeps the sample representative → size the sample to control error → document the process so others can replicate.
Quick sizing heuristics
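- For a proportion at 95% confidence and a ±5 percentage point margin of error: about 385, regardless of population size, as long as the population is large.
- For small populations, apply the finite population correction (FPC), which reduces the needed n: n = n0 / (1 + (n0 - 1)/N), where n0 is the large-population size. The related factor sqrt((N - n) / (N - 1)) adjusts the standard error, not the sample size.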
- Reserve at least 30 observations per key subgroup when estimating subgroup metrics.
- Independence rule of thumb: if sampling without replacement, keep n ≤ 10% of N to approximate independence.
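The 385 figure comes from the standard large-population formula for a proportion, n0 = z² · p(1 − p) / e², with z = 1.96 for 95% confidence, worst-case p = 0.5, and margin e = 0.05. The sketch below is one way to compute it; the function name `base_sample_size` is illustrative, not taken from any library.

```python
import math

Z_95 = 1.96  # z-score for 95% confidence

def base_sample_size(margin, p=0.5, z=Z_95):
    """Large-population sample size for a proportion, rounded up."""
    return math.ceil(z**2 * p * (1 - p) / margin**2)

# A tighter margin of error raises the required n quickly.
for margin in (0.10, 0.05, 0.03):
    print(f"±{margin:.0%}: n = {base_sample_size(margin)}")
```

Halving the margin of error roughly quadruples the required sample, which is why ±5 points is a common compromise.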
Worked examples
Example 1 — Sample size for a proportion with FPC
Goal: Estimate the share of satisfied users with 95% confidence and ±5% margin of error. Assume worst case p = 0.5 (max variance). Population N = 12,000.
- Base size (large population): n0 ≈ 384.16 → round up to 385.
- With FPC: n ≈ n0 / (1 + (n0 - 1)/N) ≈ 384.16 / (1 + 383.16/12000) ≈ 372.3. Round up to 373.
Action: Aim for at least 373 completed responses. If you expect 60% response rate, invite ≈ 622 people (373 / 0.6).
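The arithmetic in Example 1 can be checked with a few lines of Python. This is a sketch of the calculation only; the variable names are illustrative.

```python
import math

z, p, margin = 1.96, 0.5, 0.05   # 95% confidence, worst-case p, ±5 points
N = 12_000                       # population size
response_rate = 0.60             # expected share of invitees who respond

n0 = z**2 * p * (1 - p) / margin**2      # large-population size, about 384.16
n_fpc = n0 / (1 + (n0 - 1) / N)          # with FPC, about 372.3

completes = math.ceil(n_fpc)                      # completed responses needed
invites = math.ceil(completes / response_rate)    # invitations to send

print(f"n0 = {n0:.2f}, with FPC = {n_fpc:.2f}")
print(f"completes needed: {completes}, invites: {invites}")
```

Rounding up at each step keeps the plan on the safe side of the target margin.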
Example 2 — Systematic sampling
Population: 5,000 orders. Need n = 200. Compute k = floor(N / n) = floor(5000 / 200) = 25. Choose a random start between 1 and 25 (say 7). Take rows 7, 32, 57, 82, ... until you collect 200.
Check: Ensure there’s no 25-row periodicity (e.g., batch processing every 25th row) that could bias the sample.
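A minimal sketch of the same selection in Python, using 1-based row numbers as in the example. The function name `systematic_sample` is hypothetical; pass `start` to reproduce a specific draw, or omit it to pick the start at random.

```python
import random

def systematic_sample(N, n, start=None, rng=random):
    """Return n 1-based row numbers: a random start in 1..k, then every k-th row."""
    k = N // n                      # sampling interval
    if start is None:
        start = rng.randint(1, k)   # random start, inclusive of both ends
    return [start + i * k for i in range(n)]

rows = systematic_sample(5000, 200, start=7)
print(rows[:4], "...", rows[-1])
```

With N = 5,000, n = 200, and start 7, the last selected row is 4,982, so the sample never runs past the end of the list.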
Example 3 — Stratified proportional allocation
Population by device: Mobile 60%, Web 40%. Need n = 200.
- Mobile: 200 × 0.60 = 120
- Web: 200 × 0.40 = 80
Draw an SRS within each device group. If you also care about OS (iOS vs Android), either stratify Mobile further by OS or set minimum quotas within the Mobile allocation (e.g., 60 iOS, 60 Android), keeping selection random within each quota.
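Proportional allocation can be sketched as below. Two things here are assumptions for illustration: the ID ranges standing in for the Mobile and Web frames, and the rounding rule (largest remainder), which is one common way to make rounded allocations sum exactly to n. Shares are given as whole percentages that must total 100.

```python
import random

def proportional_allocation(n, shares_pct):
    """Split n across strata by percentage share; extras go to the largest remainders."""
    alloc = {g: n * pct // 100 for g, pct in shares_pct.items()}
    remainders = {g: n * pct % 100 for g, pct in shares_pct.items()}
    leftover = n - sum(alloc.values())
    for g in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        alloc[g] += 1
    return alloc

allocation = proportional_allocation(200, {"Mobile": 60, "Web": 40})

# Hypothetical frame: user IDs per device group (made up for this sketch).
frame = {"Mobile": list(range(1, 6001)), "Web": list(range(6001, 10001))}

rng = random.Random(2024)  # seeded so the draw can be repeated
sample = {g: rng.sample(frame[g], allocation[g]) for g in frame}

print(allocation)
```

Because each stratum is sampled independently, a different sampling rate per stratum is also possible; proportional allocation is simply the default that mirrors the population mix.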
Example 4 — Cluster sampling to cut costs
Auditing retail receipts across 1,000 stores. Listing all receipts is hard, but store lists are easy.
- Stage 1: Randomly select 40 stores (clusters).
- Stage 2: Within each selected store, SRS 25 receipts → target n = 1,000.
Note: Cluster sampling often needs a larger n than SRS for the same precision if receipts within a store are similar. Balance cost vs precision.
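The two stages can be sketched as follows. The helper `receipts_for` is a hypothetical stand-in for whatever lookup returns a store's receipts, and the 200 receipts per store are made up; the key point is that receipts are only listed for the 40 selected stores.

```python
import random

rng = random.Random(7)            # seeded so the audit draw is repeatable
store_ids = list(range(1, 1001))  # the easy-to-get list of 1,000 stores

def receipts_for(store_id, count=200):
    """Hypothetical lookup: receipt IDs for one store (fabricated for this sketch)."""
    return [f"S{store_id:04d}-R{i:03d}" for i in range(1, count + 1)]

# Stage 1: randomly select 40 stores (clusters).
selected_stores = rng.sample(store_ids, 40)

# Stage 2: SRS of 25 receipts within each selected store.
audit_sample = {store: rng.sample(receipts_for(store), 25) for store in selected_stores}

total_receipts = sum(len(r) for r in audit_sample.values())
print(f"{len(selected_stores)} stores, {total_receipts} receipts")
```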
How to choose a sampling method (step-by-step)
- Define target population precisely (time window, geography, platform).
- Validate your sampling frame (are any groups missing or duplicated?).
- Pick a method:
  - Use SRS by default when the frame is clean.
  - Use Systematic if data is in a list/stream with no periodicity.
  - Use Stratified if subgroups differ or must be represented.
  - Use Cluster when listing all units is costly but clusters are available.
- Size your sample based on margin of error and confidence. Adjust for expected nonresponse.
- Document every step so results are reproducible.
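One way to make the frame check and the reproducibility step tangible is a seeded SRS that first removes duplicate frame entries. This is a sketch under assumptions: the function name `draw_srs`, the ID range, and the seed value are all illustrative.

```python
import random

def draw_srs(frame_ids, n, seed):
    """Deduplicate the frame, then draw a simple random sample with a recorded seed."""
    unique_ids = sorted(set(frame_ids))   # duplicates would get extra selection chances
    rng = random.Random(seed)             # record this seed in your documentation
    return rng.sample(unique_ids, n)

# Hypothetical frame: 5,000 order IDs plus three accidentally duplicated rows.
frame_ids = list(range(1, 5001)) + [10, 20, 30]

sample_a = draw_srs(frame_ids, 200, seed=20240101)
sample_b = draw_srs(frame_ids, 200, seed=20240101)

print(sample_a == sample_b)  # same frame + same seed gives the same sample
```

Sorting before sampling matters: it fixes the order of the frame, so the same seed selects the same units on every run.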
Practical projects
- Project 1: Build a sampling playbook. Create reusable steps and spreadsheet templates for SRS, systematic, and stratified allocation.
- Project 2: Dashboard smoke test. Draw a 400-row SRS from the latest month and check KPIs vs full month to evaluate sampling error.
- Project 3: Audit mini-study. Use stratified sampling by product category to estimate return rate with a margin of error of ±4 to 6 percentage points.
Exercises
Work through these, then open each solution to self-check.
- Exercise 1 (ex1): Sample size for a proportion with FPC.
Company N = 12,000 customers. You want ±5% margin of error at 95% confidence for a satisfaction proportion. Assume p = 0.5. Compute sample size with and without FPC. Round up to whole numbers.
- Exercise 2 (ex2): Stratified allocation.
Employee survey across three regions: APAC 50%, EMEA 30%, Americas 20%. Target n = 250 completed surveys. How many from each region? Round to whole numbers that sum to 250.
- Exercise 3 (ex3): Spot the bias.
You plan to estimate average session length by surveying newsletter subscribers only. Identify the bias and propose a better sampling approach.
- Checklist before checking solutions:
  - You wrote the formula or steps used.
  - You showed intermediate numbers and rounding decisions.
  - You stated any assumptions (e.g., p = 0.5, expected response rate).
Common mistakes and self-check
- Using a bad frame: sampling from a filtered table that excludes churned users. Self-check: does every target unit have a chance to be selected?
- Ignoring nonresponse: calculating n for completes, but inviting the same number. Self-check: inflate invites by 1/response rate.
- Overusing convenience samples: fast but biased. Self-check: can you describe the selection chance for each unit? If not, it’s likely biased.
- Systematic sampling with periodic data: k aligns with a hidden cycle. Self-check: plot or scan for repeating patterns before using systematic sampling.
- Too-small strata: allocating 3 people to a subgroup and trying to compare means. Self-check: ensure meaningful minimums (e.g., 30+ per subgroup if doing subgroup estimates).
Mini challenge
You need a sample of support tickets to estimate the proportion that need escalation. Tickets arrive in time order, and weekdays are busier than weekends. You want quick selection with minimal bias and n ≈ 400. Choose a method and outline 3 concrete steps to draw the sample. Hint: think about periodicity and weekday/weekend balance.
One good answer
Use stratified systematic sampling by day-of-week:
- Step 1: Split the frame into 7 strata (Mon–Sun).
- Step 2: Allocate n proportionally to each day’s ticket volume (ensure at least ~30 each day).
- Step 3: Within each day, sort by time and take a systematic sample with a random start and k ≈ (day volume / day n). Verify no intra-day periodicity.
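The three steps can be sketched in Python as below. The daily ticket volumes are invented for illustration, and the allocation again uses largest-remainder rounding (an assumption, not the only option). Positions are 1-based ranks within each day's time-sorted tickets.

```python
import random

def allocate(n, volumes):
    """Step 2: split n across days in proportion to volume, summing exactly to n."""
    total = sum(volumes.values())
    alloc = {d: n * v // total for d, v in volumes.items()}
    remainders = {d: n * v % total for d, v in volumes.items()}
    leftover = n - sum(alloc.values())
    for d in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        alloc[d] += 1
    return alloc

def systematic_positions(volume, n_day, rng):
    """Step 3: random start in 1..k, then every k-th ticket in time order."""
    k = volume // n_day
    start = rng.randint(1, k)
    return [start + i * k for i in range(n_day)]

# Step 1: one stratum per day of week (hypothetical ticket volumes).
volumes = {"Mon": 1500, "Tue": 1500, "Wed": 1500, "Thu": 1500,
           "Fri": 1500, "Sat": 750, "Sun": 750}

rng = random.Random(11)  # seeded so the draw is repeatable
allocation = allocate(400, volumes)
picks = {day: systematic_positions(volumes[day], allocation[day], rng)
         for day in volumes}

print(allocation)
```

If a day's proportional share fell below about 30, you would raise it to that minimum and accept a slightly larger total n.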
Learning path
- After Sampling Basics: Confidence intervals for proportions and means.
- Then: Hypothesis testing basics (z-test for proportions, t-test for means).
- Next: Power analysis and sample size planning for experiments.
- Finally: Experimental design and A/B testing.
Next steps
- Take the quick test to confirm your understanding.
- Revisit any weak spots and redo the exercises.