Why this matters
Data Scientists often need representative samples to prototype models quickly, run A/B tests safely, and build cohorts to understand retention, funnels, and lifecycle value. Good sampling avoids bias and speeds iteration. Well-defined cohorts make your metrics comparable over time.
- Prototype faster: run analyses on a small, representative subset.
- Reliable experiments: deterministic, reproducible splits for A/B testing.
- Product insights: cohort retention, activation, and revenue over time.
Who this is for
- Data Scientists and Analysts building product metrics, experiments, or lifecycle dashboards.
- Engineers needing consistent data slices for model training or backfills.
Prerequisites
- Comfort with SELECT, WHERE, GROUP BY, JOIN, window functions (ROW_NUMBER, MIN() OVER), and aggregates such as COUNT(DISTINCT).
- Basic date functions (DATE_TRUNC, DATE_DIFF or equivalents).
- Familiarity with your warehouse functions for hashing or randomization (or ability to use numeric IDs).
Concept explained simply
Sampling is choosing a smaller subset of rows that still behaves like the whole. Cohort building is grouping users by a meaningful milestone (e.g., signup month or first purchase month) so you can compare how groups behave over time.
Mental model
- Sampling: imagine shuffling all rows and taking a slice (random), or taking the same slice every time based on a rule (deterministic). For fairness across groups, take the same proportion from each group (stratified).
- Cohorts: pick a start line (e.g., first purchase date), label everyone by that start line (e.g., 2025-03 cohort), then track what happens in weeks or months after the start.
Deep dive: Deterministic vs random sampling
Random sampling (ORDER BY RANDOM()) is fine for quick looks at small tables, but it forces a full sort on large tables and returns different rows on every run. Deterministic sampling uses a stable function of an identifier (e.g., MOD(user_id, 10) = 0) so the same users are selected every time. For experiments and reusable dashboards, prefer deterministic sampling.
Core patterns
- Random sample: ORDER BY RANDOM() LIMIT N or TABLESAMPLE (if supported) for quick exploration.
- Deterministic sample: use a stable function on an ID, e.g., MOD(user_id, 100) < 5 for ~5%.
- Stratified sample: apply the deterministic rule separately within each group (country, channel) to keep proportions.
- Signup cohort: DATE_TRUNC('month', signup_date) per user (argument order varies by warehouse; see the sketch after this list).
- First-purchase cohort: DATE_TRUNC('month', MIN(order_date)) per user.
- Retention: join cohort to later events, compute periods since cohort start (e.g., month_number).
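A minimal sketch of the signup-cohort labeling pattern (PostgreSQL-style DATE_TRUNC; adjust the argument order for your warehouse):
-- Label each user with a signup-month cohort
SELECT user_id,
       DATE_TRUNC('month', signup_date) AS cohort_month
FROM users;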
Worked examples
Example 1: Quick 1% random sample of users
-- Assumes a users table with user_id, country, signup_date
SELECT *
FROM users
ORDER BY RANDOM()
LIMIT (SELECT CEIL(COUNT(*) * 0.01) FROM users);
Good for quick checks on small-to-medium tables; note that some engines do not allow a subquery in LIMIT, in which case hard-code the row count or use TABLESAMPLE (sketched below). For reproducibility or very large tables, prefer deterministic sampling.
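If your warehouse supports TABLESAMPLE, it avoids the full sort entirely. Syntax varies by engine, so treat these lines as sketches (BERNOULLI is PostgreSQL-style and samples rows; SYSTEM ... PERCENT is BigQuery-style and samples storage blocks, which is faster but coarser):
-- PostgreSQL-style: roughly 1% of rows
SELECT * FROM users TABLESAMPLE BERNOULLI (1);
-- BigQuery-style: roughly 1% of storage blocks
SELECT * FROM users TABLESAMPLE SYSTEM (1 PERCENT);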
Example 2: Deterministic 5% sample (portable pattern)
-- Assumes user_id is numeric.
-- 5% sample: users whose user_id modulo 20 equals 0.
SELECT *
FROM users
WHERE MOD(user_id, 20) = 0;
This yields roughly 5% and is stable across runs. If your IDs are not numeric (or not assigned uniformly), hash them first and take MOD of the hash, as sketched below.
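A minimal sketch of the hash approach, assuming BigQuery's FARM_FINGERPRINT (Snowflake's HASH or your warehouse's equivalent works the same way):
-- ~5% deterministic sample when user_id is not numeric (BigQuery syntax)
SELECT *
FROM users
WHERE MOD(ABS(FARM_FINGERPRINT(CAST(user_id AS STRING))), 20) = 0;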
Example 3: Stratified 10% per country
-- Keep ~10% from every country based on a deterministic rule.
-- Using modulo on user_id ensures each country keeps ~the same proportion.
SELECT *
FROM users
WHERE MOD(user_id, 10) = 0; -- ~10%
Because the rule does not depend on country, it keeps roughly the same proportion in each country (assuming user_id values are assigned independently of country). If some groups are tiny, switch to a fixed N per group using ROW_NUMBER per partition, as in the next snippet.
-- Fixed N per group (e.g., 200 per country)
WITH ranked AS (
  SELECT u.*,
         ROW_NUMBER() OVER (PARTITION BY country ORDER BY user_id) AS rn
  FROM users u
)
SELECT *
FROM ranked
WHERE rn <= 200;
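Note that ORDER BY user_id keeps the earliest users in each country. If you want the fixed-N pick to look random while staying reproducible, order by a stable hash of the ID instead; a sketch assuming BigQuery's FARM_FINGERPRINT:
-- Fixed 200 per country, chosen in a pseudo-random but stable order
WITH ranked AS (
  SELECT u.*,
         ROW_NUMBER() OVER (
           PARTITION BY country
           ORDER BY FARM_FINGERPRINT(CAST(user_id AS STRING))
         ) AS rn
  FROM users u
)
SELECT *
FROM ranked
WHERE rn <= 200;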
Example 4: Build first-purchase monthly cohorts and 3-month retention
-- Tables: users(user_id, signup_date, country), orders(order_id, user_id, order_date, amount)
-- 1) First purchase date per user
WITH first_purchase AS (
  SELECT user_id, MIN(order_date) AS first_order_date
  FROM orders
  GROUP BY user_id
),
cohorts AS (
  SELECT fp.user_id,
         DATE_TRUNC('month', fp.first_order_date) AS cohort_month
  FROM first_purchase fp
),
events_months AS (
  -- activity months for users with any orders
  SELECT o.user_id,
         DATE_TRUNC('month', o.order_date) AS activity_month
  FROM orders o
),
joined AS (
  SELECT c.cohort_month,
         em.activity_month,
         (EXTRACT(YEAR FROM em.activity_month) * 12 + EXTRACT(MONTH FROM em.activity_month))
           - (EXTRACT(YEAR FROM c.cohort_month) * 12 + EXTRACT(MONTH FROM c.cohort_month)) AS month_number,
         em.user_id
  FROM cohorts c
  JOIN events_months em USING (user_id)
)
SELECT cohort_month,
       month_number,
       COUNT(DISTINCT user_id) AS active_users
FROM joined
WHERE month_number BETWEEN 0 AND 2 -- months 0, 1, 2
GROUP BY cohort_month, month_number
ORDER BY cohort_month, month_number;
Interpretation: month_number = 0 is the first-purchase month; 1 and 2 are subsequent months.
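To report retention rates rather than raw counts, divide each cell by the size of its cohort. A sketch that extends Example 4 with a cohort_sizes CTE (the ::numeric cast is PostgreSQL-style; use your warehouse's equivalent):
-- Retention rate per cohort month and month_number
WITH first_purchase AS (
  SELECT user_id, MIN(order_date) AS first_order_date
  FROM orders
  GROUP BY user_id
),
cohorts AS (
  SELECT user_id, DATE_TRUNC('month', first_order_date) AS cohort_month
  FROM first_purchase
),
joined AS (
  SELECT c.cohort_month,
         (EXTRACT(YEAR FROM o.order_date) * 12 + EXTRACT(MONTH FROM o.order_date))
           - (EXTRACT(YEAR FROM c.cohort_month) * 12 + EXTRACT(MONTH FROM c.cohort_month)) AS month_number,
         o.user_id
  FROM cohorts c
  JOIN orders o USING (user_id)
),
cohort_sizes AS (
  SELECT cohort_month, COUNT(*) AS cohort_users
  FROM cohorts
  GROUP BY cohort_month
)
SELECT j.cohort_month,
       j.month_number,
       COUNT(DISTINCT j.user_id) AS active_users,
       ROUND(COUNT(DISTINCT j.user_id)::numeric / cs.cohort_users, 3) AS retention_rate
FROM joined j
JOIN cohort_sizes cs ON cs.cohort_month = j.cohort_month
WHERE j.month_number BETWEEN 0 AND 2
GROUP BY j.cohort_month, j.month_number, cs.cohort_users
ORDER BY j.cohort_month, j.month_number;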
How to choose a method
- Need speed only: random sample on small tables.
- Need reproducibility: deterministic sampling by ID.
- Need fairness across groups: stratified or fixed-N per group.
- Need lifecycle insights: build cohorts by a meaningful milestone (signup or first purchase) and measure forward.
Exercises
Use these shared schemas in your mind or local sandbox:
- users(user_id INT, country TEXT, signup_date DATE, marketing_channel TEXT)
- orders(order_id INT, user_id INT, order_date DATE, amount NUMERIC)
- events(user_id INT, event_name TEXT, event_time TIMESTAMP)
- Exercise 1 (ex1): Create a reproducible ~5% user sample that preserves proportions by marketing_channel. Use a deterministic rule on user_id. Return user_id, marketing_channel.
- Exercise 2 (ex2): Build signup cohorts by month and compute how many users had at least one order in cohort month (month 0) and the next two months (months 1–2). Output: cohort_month, month_number, active_users.
- Exercise 3 (ex3): Produce a deterministic 50/50 A/B split on users (groups 'A' and 'B'). Then show, for each group, the number of distinct users who generated an 'app_open' event within 7 days of signup. Output: group_label, d7_app_open_users.
Checklist before you run
- Sampling uses a stable rule (e.g., MOD(user_id, 20)).
- No leakage: post-cohort filters do not affect who enters the cohort.
- Date math uses the cohort start as the reference, not calendar month only.
- Stratification handled by equal rule across groups or fixed-N per group.
Common mistakes and self-check
- Mistake: Using ORDER BY RANDOM() on huge tables. Fix: Use deterministic modulo or warehouse-native TABLESAMPLE.
- Mistake: Filtering first (e.g., to active users only) and then sampling, which biases the sample toward the filtered group when you mean to represent everyone. Fix: Sample from the full population, then apply analysis filters.
- Mistake: Cohort by calendar month, not by milestone. Fix: Use DATE_TRUNC on the milestone (signup or first order).
- Mistake: Misaligned retention windows. Fix: Compute month_number or day_number relative to cohort start.
- Mistake: Non-reproducible A/B splits. Fix: Use a deterministic split (e.g., MOD(user_id, 2)); see the sketch below.
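A minimal sketch of a reproducible split (group_label is an illustrative column name):
-- Deterministic 50/50 assignment: the same user always lands in the same group
SELECT user_id,
       CASE WHEN MOD(user_id, 2) = 0 THEN 'A' ELSE 'B' END AS group_label
FROM users;
If assignment should not correlate with signup order, split on a hash of user_id instead of the raw ID.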
Self-check mini list
- Re-run your sampling query twice—same rows?
- Pick three users and manually verify their cohort assignment.
- Validate counts against full data for one small cohort month.
Practical projects
- Build a monthly signup cohort dashboard with 0–6 month retention and revenue.
- Create a reproducible 10% user sample and use it to prototype a churn model end-to-end.
- Design a deterministic A/B split and compare conversion by marketing_channel for 14 days.
Learning path
- Before: Aggregations, window functions, date functions.
- Now: Sampling strategies, cohort definitions, retention math.
- Next: Experiment design (AA checks), power analysis, causal inference basics.
Next steps
- Parameterize your sampling percent and cohort windows for reuse.
- Document cohort definitions so dashboards stay consistent across teams.
- Add quality checks: cohort size thresholds and data freshness alerts.
Mini challenge
Create a first-purchase cohort table and compute 1-, 3-, and 6-month revenue per user. Then re-run on a deterministic 20% sample. Do results align proportionally?
Quick Test
Take the short test below to check your understanding.