Why this matters
Data Scientists often need representative samples to prototype models quickly, run A/B tests safely, and build cohorts to understand retention, funnels, and lifecycle value. Good sampling avoids bias and speeds iteration. Well-defined cohorts make your metrics comparable over time.
- Prototype faster: run analyses on a small, representative subset.
- Reliable experiments: deterministic, reproducible splits for A/B testing.
- Product insights: cohort retention, activation, and revenue over time.
Who this is for
- Data Scientists and Analysts building product metrics, experiments, or lifecycle dashboards.
- Engineers needing consistent data slices for model training or backfills.
Prerequisites
- Comfort with SELECT, WHERE, GROUP BY, JOIN, window functions (ROW_NUMBER, MIN() OVER), and aggregates such as COUNT(DISTINCT).
- Basic date functions (DATE_TRUNC, DATE_DIFF or equivalents).
- Familiarity with your warehouse functions for hashing or randomization (or ability to use numeric IDs).
Concept explained simply
Sampling is choosing a smaller subset of rows that still behaves like the whole. Cohort building is grouping users by a meaningful milestone (e.g., signup month or first purchase month) so you can compare how groups behave over time.
Mental model
- Sampling: imagine shuffling all rows and taking a slice (random), or taking the same slice every time based on a rule (deterministic). For fairness across groups, take the same proportion from each group (stratified).
- Cohorts: pick a start line (e.g., first purchase date), label everyone by that start line (e.g., 2025-03 cohort), then track what happens in weeks or months after the start.
Deep dive: Deterministic vs random sampling
Random sampling (ORDER BY RANDOM()) is fine for quick looks at small tables, but it forces a full sort on large tables and returns different rows on every run. Deterministic sampling uses a stable function of an identifier (e.g., MOD(user_id, 10) = 0) so the same users are selected every time. For experiments and reusable dashboards, prefer deterministic sampling.
Core patterns
- Random sample: ORDER BY RANDOM() LIMIT N or TABLESAMPLE (if supported) for quick exploration.
- Deterministic sample: use a stable function on an ID, e.g., MOD(user_id, 100) < 5 for ~5%.
- Stratified sample: apply the deterministic rule separately within each group (country, channel) to keep proportions.
- Signup cohort: DATE_TRUNC('month', signup_date) per user (argument order varies by warehouse; see the sketch after this list).
- First-purchase cohort: DATE_TRUNC('month', MIN(order_date)) per user.
- Retention: join cohort to later events, compute periods since cohort start (e.g., month_number).
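A minimal sketch of the signup-cohort labeling pattern (PostgreSQL-style DATE_TRUNC; adjust the argument order for your warehouse):
-- Label each user with a signup-month cohort
SELECT user_id,
       DATE_TRUNC('month', signup_date) AS cohort_month
FROM users;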
Worked examples
Example 1: Quick 1% random sample of users
-- Assumes a users table with user_id, country, signup_date
SELECT *
FROM users
ORDER BY RANDOM()
LIMIT (SELECT CEIL(COUNT(*) * 0.01) FROM users);
Good for quick checks on small-to-medium tables; note that some engines do not allow a subquery in LIMIT, in which case hard-code the row count or use TABLESAMPLE (sketched below). For reproducibility or very large tables, prefer deterministic sampling.
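If your warehouse supports TABLESAMPLE, it avoids the full sort entirely. Syntax varies by engine, so treat these lines as sketches (BERNOULLI is PostgreSQL-style and samples rows; SYSTEM ... PERCENT is BigQuery-style and samples storage blocks, which is faster but coarser):
-- PostgreSQL-style: roughly 1% of rows
SELECT * FROM users TABLESAMPLE BERNOULLI (1);
-- BigQuery-style: roughly 1% of storage blocks
SELECT * FROM users TABLESAMPLE SYSTEM (1 PERCENT);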
Example 2: Deterministic 5% sample (portable pattern)
-- Assumes user_id is numeric.
-- 5% sample: users whose user_id modulo 20 equals 0.
SELECT *
FROM users
WHERE MOD(user_id, 20) = 0;
This yields roughly 5% and is stable across runs. If your IDs are not numeric (or not assigned uniformly), hash them first and take MOD of the hash, as sketched below.
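A minimal sketch of the hash approach, assuming BigQuery's FARM_FINGERPRINT (Snowflake's HASH or your warehouse's equivalent works the same way):
-- ~5% deterministic sample when user_id is not numeric (BigQuery syntax)
SELECT *
FROM users
WHERE MOD(ABS(FARM_FINGERPRINT(CAST(user_id AS STRING))), 20) = 0;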
Example 3: Stratified 10% per country
-- Keep ~10% from every country based on a deterministic rule.
-- Using modulo on user_id ensures each country keeps ~the same proportion.
SELECT *
FROM users
WHERE MOD(user_id, 10) = 0; -- ~10%
Because the rule does not depend on country, it keeps roughly the same proportion in each country (assuming user_id values are assigned independently of country). If some groups are tiny, switch to a fixed N per group using ROW_NUMBER per partition, as in the next snippet.
-- Fixed N per group (e.g., 200 per country)
WITH ranked AS (
  SELECT u.*,
         ROW_NUMBER() OVER (PARTITION BY country ORDER BY user_id) AS rn
  FROM users u
)
SELECT *
FROM ranked
WHERE rn <= 200;
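Note that ORDER BY user_id keeps the earliest users in each country. If you want the fixed-N pick to look random while staying reproducible, order by a stable hash of the ID instead; a sketch assuming BigQuery's FARM_FINGERPRINT:
-- Fixed 200 per country, chosen in a pseudo-random but stable order
WITH ranked AS (
  SELECT u.*,
         ROW_NUMBER() OVER (
           PARTITION BY country
           ORDER BY FARM_FINGERPRINT(CAST(user_id AS STRING))
         ) AS rn
  FROM users u
)
SELECT *
FROM ranked
WHERE rn <= 200;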
Example 4: Build first-purchase monthly cohorts and 3-month retention
-- Tables: users(user_id, signup_date, country), orders(order_id, user_id, order_date, amount)
-- 1) First purchase date per user
WITH first_purchase AS (
  SELECT user_id, MIN(order_date) AS first_order_date
  FROM orders
  GROUP BY user_id
),
cohorts AS (
  SELECT fp.user_id,
         DATE_TRUNC('month', fp.first_order_date) AS cohort_month
  FROM first_purchase fp
),
events_months AS (
  -- activity months for users with any orders
  SELECT o.user_id,
         DATE_TRUNC('month', o.order_date) AS activity_month
  FROM orders o
),
joined AS (
  SELECT c.cohort_month,
         em.activity_month,
         (EXTRACT(YEAR FROM em.activity_month) * 12 + EXTRACT(MONTH FROM em.activity_month))
           - (EXTRACT(YEAR FROM c.cohort_month) * 12 + EXTRACT(MONTH FROM c.cohort_month)) AS month_number,
         em.user_id
  FROM cohorts c
  JOIN events_months em USING (user_id)
)
SELECT cohort_month,
       month_number,
       COUNT(DISTINCT user_id) AS active_users
FROM joined
WHERE month_number BETWEEN 0 AND 2 -- months 0, 1, 2
GROUP BY cohort_month, month_number
ORDER BY cohort_month, month_number;
Interpretation: month_number = 0 is the first-purchase month; 1 and 2 are subsequent months.
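To report retention rates rather than raw counts, divide each cell by the size of its cohort. A sketch that extends Example 4 with a cohort_sizes CTE (the ::numeric cast is PostgreSQL-style; use your warehouse's equivalent):
-- Retention rate per cohort month and month_number
WITH first_purchase AS (
  SELECT user_id, MIN(order_date) AS first_order_date
  FROM orders
  GROUP BY user_id
),
cohorts AS (
  SELECT user_id, DATE_TRUNC('month', first_order_date) AS cohort_month
  FROM first_purchase
),
joined AS (
  SELECT c.cohort_month,
         (EXTRACT(YEAR FROM o.order_date) * 12 + EXTRACT(MONTH FROM o.order_date))
           - (EXTRACT(YEAR FROM c.cohort_month) * 12 + EXTRACT(MONTH FROM c.cohort_month)) AS month_number,
         o.user_id
  FROM cohorts c
  JOIN orders o USING (user_id)
),
cohort_sizes AS (
  SELECT cohort_month, COUNT(*) AS cohort_users
  FROM cohorts
  GROUP BY cohort_month
)
SELECT j.cohort_month,
       j.month_number,
       COUNT(DISTINCT j.user_id) AS active_users,
       ROUND(COUNT(DISTINCT j.user_id)::numeric / cs.cohort_users, 3) AS retention_rate
FROM joined j
JOIN cohort_sizes cs ON cs.cohort_month = j.cohort_month
WHERE j.month_number BETWEEN 0 AND 2
GROUP BY j.cohort_month, j.month_number, cs.cohort_users
ORDER BY j.cohort_month, j.month_number;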
How to choose a method
- Need speed only: random sample on small tables.
- Need reproducibility: deterministic sampling by ID.
- Need fairness across groups: stratified or fixed-N per group.
- Need lifecycle insights: build cohorts by a meaningful milestone (signup or first purchase) and measure forward.
Exercises
Use these shared schemas in your mind or local sandbox:
- users(user_id INT, country TEXT, signup_date DATE, marketing_channel TEXT)
- orders(order_id INT, user_id INT, order_date DATE, amount NUMERIC)
- events(user_id INT, event_name TEXT, event_time TIMESTAMP)
- Exercise 1 (ex1): Create a reproducible ~5% user sample that preserves proportions by marketing_channel. Use a deterministic rule on user_id. Return user_id, marketing_channel.
- Exercise 2 (ex2): Build signup cohorts by month and compute how many users had at least one order in cohort month (month 0) and the next two months (months 1–2). Output: cohort_month, month_number, active_users.
- Exercise 3 (ex3): Produce a deterministic 50/50 A/B split on users (groups 'A' and 'B'). Then show, for each group, the number of distinct users who generated an 'app_open' event within 7 days of signup. Output: group_label, d7_app_open_users.
Checklist before you run
- Sampling uses a stable rule (e.g., MOD(user_id, 20)).
- No leakage: post-cohort filters do not affect who enters the cohort.
- Date math uses the cohort start as the reference, not calendar month only.
- Stratification handled by equal rule across groups or fixed-N per group.
Common mistakes and self-check
- Mistake: Using ORDER BY RANDOM() on huge tables. Fix: Use deterministic modulo or warehouse-native TABLESAMPLE.
- Mistake: Filtering first (e.g., to active users only) and then sampling, which biases the sample toward the filtered group when you mean to represent everyone. Fix: Sample from the full population, then apply analysis filters.
- Mistake: Cohort by calendar month, not by milestone. Fix: Use DATE_TRUNC on the milestone (signup or first order).
- Mistake: Misaligned retention windows. Fix: Compute month_number or day_number relative to cohort start.
- Mistake: Non-reproducible A/B splits. Fix: Use a deterministic split (e.g., MOD(user_id, 2)); see the sketch below.
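A minimal sketch of a reproducible split (group_label is an illustrative column name):
-- Deterministic 50/50 assignment: the same user always lands in the same group
SELECT user_id,
       CASE WHEN MOD(user_id, 2) = 0 THEN 'A' ELSE 'B' END AS group_label
FROM users;
If assignment should not correlate with signup order, split on a hash of user_id instead of the raw ID.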
Self-check mini list
- Re-run your sampling query twice—same rows?
- Pick three users and manually verify their cohort assignment.
- Validate counts against full data for one small cohort month.
Practical projects
- Build a monthly signup cohort dashboard with 0–6 month retention and revenue.
- Create a reproducible 10% user sample and use it to prototype a churn model end-to-end.
- Design a deterministic A/B split and compare conversion by marketing_channel for 14 days.
Learning path
- Before: Aggregations, window functions, date functions.
- Now: Sampling strategies, cohort definitions, retention math.
- Next: Experiment design (AA checks), power analysis, causal inference basics.
Next steps
- Parameterize your sampling percent and cohort windows for reuse.
- Document cohort definitions so dashboards stay consistent across teams.
- Add quality checks: cohort size thresholds and data freshness alerts.
Mini challenge
Create a first-purchase cohort table and compute 1-, 3-, and 6-month revenue per user. Then re-run on a deterministic 20% sample. Do results align proportionally?
Quick Test
Take the short test below to check your understanding.