luvv to helpDiscover the Best Free Online Tools
Topic 6 of 8

Sampling And Cohort Building

Learn Sampling And Cohort Building for free with explanations, exercises, and a quick test (for Data Scientist).

Published: January 1, 2026 | Updated: January 1, 2026

Why this matters

Data Scientists often need representative samples to prototype models quickly, run A/B tests safely, and build cohorts to understand retention, funnels, and lifecycle value. Good sampling avoids bias and speeds iteration. Well-defined cohorts make your metrics comparable over time.

  • Prototype faster: run analyses on a small, representative subset.
  • Reliable experiments: deterministic, reproducible splits for A/B testing.
  • Product insights: cohort retention, activation, and revenue over time.

Who this is for

  • Data Scientists and Analysts building product metrics, experiments, or lifecycle dashboards.
  • Engineers needing consistent data slices for model training or backfills.

Prerequisites

  • Comfort with SELECT, WHERE, GROUP BY, JOIN, and window functions (ROW_NUMBER, MIN OVER, COUNT DISTINCT).
  • Basic date functions (DATE_TRUNC, DATE_DIFF or equivalents).
  • Familiarity with your warehouse functions for hashing or randomization (or ability to use numeric IDs).

Concept explained simply

Sampling is choosing a smaller subset of rows that still behaves like the whole. Cohort building is grouping users by a meaningful milestone (e.g., signup month or first purchase month) so you can compare how groups behave over time.

Mental model

  • Sampling: imagine shuffling all rows and taking a slice (random), or taking the same slice every time based on a rule (deterministic). For fairness across groups, take the same proportion from each group (stratified).
  • Cohorts: pick a start line (e.g., first purchase date), label everyone by that start line (e.g., 2025-03 cohort), then track what happens in weeks or months after the start.
Deep dive: Deterministic vs random sampling

Random sampling (ORDER BY RANDOM()) is quick for small tables but can be slow and non-reproducible. Deterministic sampling uses a stable function of an identifier (e.g., user_id % 10 = 0) so the same users always appear. For experiments and reusable dashboards, prefer deterministic sampling.

Core patterns

  • Random sample: ORDER BY RANDOM() LIMIT N or TABLESAMPLE (if supported) for quick exploration.
  • Deterministic sample: use a stable function on an ID, e.g., MOD(user_id, 100) < 5 for ~5%.
  • Stratified sample: apply the deterministic rule separately within each group (country, channel) to keep proportions.
  • Signup cohort: DATE_TRUNC(signup_date, 'month').
  • First-purchase cohort: DATE_TRUNC(MIN(order_date), 'month') per user.
  • Retention: join cohort to later events, compute periods since cohort start (e.g., month_number).

Worked examples

Example 1: Quick 1% random sample of users
-- Assumes a users table with user_id, country, signup_date
SELECT *
FROM users
ORDER BY RANDOM()
LIMIT (SELECT CEIL(COUNT(*) * 0.01) FROM users);

Good for quick checks on small-to-medium tables. For reproducibility or very large tables, prefer deterministic sampling.

Example 2: Deterministic 5% sample (portable pattern)
-- Assumes user_id is numeric.
-- 5% sample: users whose user_id modulo 20 equals 0.
SELECT *
FROM users
WHERE MOD(user_id, 20) = 0;

This yields roughly 5% and is stable across runs. If your IDs are not numeric, use your warehouse's hash function and take MOD on it.

Example 3: Stratified 10% per country
-- Keep ~10% from every country based on a deterministic rule.
-- Using modulo on user_id ensures each country keeps ~the same proportion.
SELECT *
FROM users
WHERE MOD(user_id, 10) = 0;  -- ~10%

Because the rule does not depend on country, it preserves proportions in each country. If some groups are tiny, switch to a fixed N per group using ROW_NUMBER per partition.

-- Fixed N per group (e.g., 200 per country)
WITH ranked AS (
  SELECT u.*, ROW_NUMBER() OVER (PARTITION BY country ORDER BY user_id) AS rn
  FROM users u
)
SELECT *
FROM ranked
WHERE rn <= 200;
Example 4: Build first-purchase monthly cohorts and 3-month retention
-- Tables: users(user_id, signup_date, country), orders(order_id, user_id, order_date, amount)
-- 1) First purchase date per user
WITH first_purchase AS (
  SELECT user_id, MIN(order_date) AS first_order_date
  FROM orders
  GROUP BY user_id
),
cohorts AS (
  SELECT fp.user_id,
         DATE_TRUNC('month', fp.first_order_date) AS cohort_month
  FROM first_purchase fp
),
events_months AS (
  -- activity months for users with any orders
  SELECT o.user_id,
         DATE_TRUNC('month', o.order_date) AS activity_month
  FROM orders o
),
joined AS (
  SELECT c.cohort_month,
         em.activity_month,
         EXTRACT(YEAR FROM em.activity_month) * 12 + EXTRACT(MONTH FROM em.activity_month)
         - (EXTRACT(YEAR FROM c.cohort_month) * 12 + EXTRACT(MONTH FROM c.cohort_month)) AS month_number,
         em.user_id
  FROM cohorts c
  JOIN events_months em USING (user_id)
)
SELECT cohort_month,
       month_number,
       COUNT(DISTINCT user_id) AS active_users
FROM joined
WHERE month_number BETWEEN 0 AND 2  -- months 0,1,2
GROUP BY cohort_month, month_number
ORDER BY cohort_month, month_number;

Interpretation: month_number = 0 is the first-purchase month; 1 and 2 are subsequent months.

How to choose a method

  • Need speed only: random sample on small tables.
  • Need reproducibility: deterministic sampling by ID.
  • Need fairness across groups: stratified or fixed-N per group.
  • Need lifecycle insights: build cohorts by a meaningful milestone (signup or first purchase) and measure forward.

Exercises

Use these shared schemas in your mind or local sandbox:

  • users(user_id INT, country TEXT, signup_date DATE, marketing_channel TEXT)
  • orders(order_id INT, user_id INT, order_date DATE, amount NUMERIC)
  • events(user_id INT, event_name TEXT, event_time TIMESTAMP)
  1. Exercise 1 (ex1): Create a reproducible ~5% user sample that preserves proportions by marketing_channel. Use a deterministic rule on user_id. Return user_id, marketing_channel.
  2. Exercise 2 (ex2): Build signup cohorts by month and compute how many users had at least one order in cohort month (month 0) and the next two months (months 1–2). Output: cohort_month, month_number, active_users.
  3. Exercise 3 (ex3): Produce a deterministic 50/50 A/B split on users (groups 'A' and 'B'). Then show, for each group, the number of distinct users who generated an 'app_open' event within 7 days of signup. Output: group_label, d7_app_open_users.
Checklist before you run
  • Sampling uses a stable rule (e.g., MOD(user_id, 20)).
  • No leakage: post-cohort filters do not affect who enters the cohort.
  • Date math uses the cohort start as the reference, not calendar month only.
  • Stratification handled by equal rule across groups or fixed-N per group.

Common mistakes and self-check

  • Mistake: Using ORDER BY RANDOM() on huge tables. Fix: Use deterministic modulo or warehouse-native TABLESAMPLE.
  • Mistake: Biased sampling by filtering first (e.g., only active users), then sampling. Fix: Sample first, then analyze.
  • Mistake: Cohort by calendar month, not by milestone. Fix: Use DATE_TRUNC on the milestone (signup or first order).
  • Mistake: Misaligned retention windows. Fix: Compute month_number or day_number relative to cohort start.
  • Mistake: Non-reproducible A/B splits. Fix: Use deterministic split (e.g., MOD(user_id, 2)).
Self-check mini list
  • Re-run your sampling query twice—same rows?
  • Pick three users and manually verify their cohort assignment.
  • Validate counts against full data for one small cohort month.

Practical projects

  • Build a monthly signup cohort dashboard with 0–6 month retention and revenue.
  • Create a reproducible 10% user sample and use it to prototype a churn model end-to-end.
  • Design a deterministic A/B split and compare conversion by marketing_channel for 14 days.

Learning path

  • Before: Aggregations, window functions, date functions.
  • Now: Sampling strategies, cohort definitions, retention math.
  • Next: Experiment design (AA checks), power analysis, causal inference basics.

Next steps

  • Parameterize your sampling percent and cohort windows for reuse.
  • Document cohort definitions so dashboards stay consistent across teams.
  • Add quality checks: cohort size thresholds and data freshness alerts.

Mini challenge

Create a first-purchase cohort table and compute 1-, 3-, and 6-month revenue per user. Then re-run on a deterministic 20% sample. Do results align proportionally?

Quick Test

Take the short test below to check understanding. Available to everyone; only logged-in users get saved progress.

Practice Exercises

3 exercises to complete

Instructions

Create a reproducible ~5% user sample using a deterministic rule on user_id that preserves proportions by marketing_channel. Return user_id and marketing_channel.

Hint: Use MOD on user_id with a divisor that approximates 5%.

Expected Output
A result set with ~5% of users across all marketing channels, including columns: user_id, marketing_channel.

Sampling And Cohort Building — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

8 questions70% to pass

Have questions about Sampling And Cohort Building?

AI Assistant

Ask questions about this tool