Why this matters
As a Data Scientist, you will find that a model is only as good as the dataset behind it. SQL turns raw events and transactions into clean, leakage-free training rows. Typical tasks include:
- Define the prediction moment (prediction_date) and build labels correctly (e.g., purchase_next_30d).
- Aggregate features in fixed windows before prediction_date (e.g., last 30 days) to avoid leakage.
- Join multiple sources (events, orders, subscriptions) efficiently and safely.
- Create rolling training datasets (many prediction dates per user).
- Balance classes and split train/validation by time.
Concept explained simply
Think of each training row as a snapshot taken at a specific time: what we knew then (features) and what happened after (label).
Minimal recipe
- Define cohort: which entities and which prediction_date(s).
- Features: aggregate behavior in [prediction_date - window, prediction_date).
- Label: check outcomes in [prediction_date, prediction_date + label_window).
- Ensure one row per entity per prediction_date. No future data allowed in features (a query template sketch follows below).
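Put together, the recipe above can be sketched as one generic query. This is a sketch only, assuming a PostgreSQL-style dialect; entity, facts, fact_time, and the 30-day windows are placeholders to replace with your own tables and horizons.
-- Template: one training row per entity per prediction_date
WITH cohort AS (
SELECT entity_id, DATE '2024-01-01' AS prediction_date  -- who we predict for, and when
FROM entity
),
features AS (
SELECT c.entity_id, c.prediction_date, COUNT(*) AS facts_30d
FROM cohort c
JOIN facts f
ON f.entity_id = c.entity_id
AND f.fact_time >= c.prediction_date - INTERVAL '30 days'  -- feature window ends strictly before prediction_date
AND f.fact_time < c.prediction_date
GROUP BY c.entity_id, c.prediction_date
),
labels AS (
SELECT c.entity_id, c.prediction_date,
MAX(CASE WHEN f.fact_time IS NOT NULL THEN 1 ELSE 0 END) AS outcome_next_30d  -- label window starts at prediction_date
FROM cohort c
LEFT JOIN facts f
ON f.entity_id = c.entity_id
AND f.fact_time >= c.prediction_date
AND f.fact_time < c.prediction_date + INTERVAL '30 days'
GROUP BY c.entity_id, c.prediction_date
)
SELECT c.entity_id, c.prediction_date,
COALESCE(fe.facts_30d, 0) AS facts_30d,
l.outcome_next_30d
FROM cohort c
LEFT JOIN features fe USING (entity_id, prediction_date)
JOIN labels l USING (entity_id, prediction_date);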
Data requirements checklist
- Entity key identified (e.g., user_id).
- Timestamped facts: events.event_time, orders.order_time, etc.
- Consistency: timestamps in the same timezone (see the normalization sketch after this checklist).
- Windows chosen: feature window (e.g., 30 days), label window (e.g., 30 days).
- Deduplication rules: how to pick latest state when needed.
- Class definition: what counts as positive vs negative.
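For the timezone item, here is a minimal normalization sketch, assuming PostgreSQL and that events.event_time is stored as a naive local timestamp; the source timezone name is an assumption.
-- Interpret naive timestamps as local time, then convert to UTC (PostgreSQL AT TIME ZONE)
SELECT user_id,
event_type,
event_time AT TIME ZONE 'America/New_York' AT TIME ZONE 'UTC' AS event_time_utc
FROM events;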
Worked examples
Example 1: Build a 7‑day purchase label
Goal: For each user in the cohort as of 2023-12-01, label whether they purchase in the next 7 days.
-- Cohort as of a fixed prediction date
WITH cohort AS (
SELECT DISTINCT u.user_id
FROM users u
WHERE u.created_at < DATE '2023-12-01'
),
labels AS (
SELECT c.user_id,
-- COUNT(o.order_id) ignores the NULL row a LEFT JOIN produces when a user has no orders
CASE WHEN COUNT(o.order_id) > 0 THEN 1 ELSE 0 END AS purchase_next_7d
FROM cohort c
LEFT JOIN orders o
ON o.user_id = c.user_id
AND o.order_time >= DATE '2023-12-01'
AND o.order_time < DATE '2023-12-08'
GROUP BY c.user_id
)
SELECT * FROM labels;
Key idea: the label window starts at prediction_date and never overlaps the feature window.
Example 2: Leakage‑safe 30‑day features
Goal: Aggregate behavior in the 30 days before 2023-12-01.
WITH cohort AS (
SELECT DISTINCT u.user_id
FROM users u
WHERE u.created_at < DATE '2023-12-01'
),
-- Aggregate each fact table separately; joining events and orders row-by-row would multiply counts
event_features AS (
SELECT e.user_id,
COUNT(*) FILTER (WHERE e.event_type = 'view') AS views_30d
FROM events e
WHERE e.event_time >= DATE '2023-11-01' AND e.event_time < DATE '2023-12-01'
GROUP BY e.user_id
),
order_features AS (
SELECT o.user_id,
COUNT(*) AS orders_30d,
SUM(o.amount) AS spend_30d
FROM orders o
WHERE o.order_time >= DATE '2023-11-01' AND o.order_time < DATE '2023-12-01'
GROUP BY o.user_id
),
features AS (
SELECT c.user_id,
COALESCE(ef.views_30d, 0) AS views_30d,
COALESCE(ofe.orders_30d, 0) AS orders_30d,
COALESCE(ofe.spend_30d, 0) AS spend_30d
FROM cohort c
LEFT JOIN event_features ef ON ef.user_id = c.user_id
LEFT JOIN order_features ofe ON ofe.user_id = c.user_id
)
SELECT * FROM features;
Key idea: the feature window ends strictly before prediction_date, and each fact table is aggregated on its own before the final join so counts are not multiplied.
Example 3: One training table row per user
Goal: Combine features and label for 2023-12-01 into one dataset.
WITH cohort AS (
SELECT DISTINCT u.user_id
FROM users u
WHERE u.created_at < DATE '2023-12-01'
),
-- Pre-aggregate each fact table to one row per user before the final join
event_features AS (
SELECT e.user_id,
COUNT(*) FILTER (WHERE e.event_type = 'view') AS views_30d
FROM events e
WHERE e.event_time >= DATE '2023-11-01' AND e.event_time < DATE '2023-12-01'
GROUP BY e.user_id
),
order_features AS (
SELECT o.user_id,
COUNT(*) AS orders_30d,
SUM(o.amount) AS spend_30d
FROM orders o
WHERE o.order_time >= DATE '2023-11-01' AND o.order_time < DATE '2023-12-01'
GROUP BY o.user_id
),
labels AS (
SELECT c.user_id,
-- COUNT(o.order_id) ignores the NULL row a LEFT JOIN produces when a user has no orders
CASE WHEN COUNT(o.order_id) > 0 THEN 1 ELSE 0 END AS purchase_next_30d
FROM cohort c
LEFT JOIN orders o
ON o.user_id = c.user_id
AND o.order_time >= DATE '2023-12-01'
AND o.order_time < DATE '2023-12-31'
GROUP BY c.user_id
)
SELECT c.user_id, DATE '2023-12-01' AS prediction_date,
COALESCE(ef.views_30d, 0) AS views_30d,
COALESCE(ofe.orders_30d, 0) AS orders_30d,
COALESCE(ofe.spend_30d, 0) AS spend_30d,
l.purchase_next_30d
FROM cohort c
LEFT JOIN event_features ef ON ef.user_id = c.user_id
LEFT JOIN order_features ofe ON ofe.user_id = c.user_id
JOIN labels l USING (user_id);
Key idea: ensure exactly one row per user for the prediction_date.
Quality checks
- Duplicate rows: ensure one row per (user_id, prediction_date).
SELECT user_id, prediction_date, COUNT(*) FROM training GROUP BY 1,2 HAVING COUNT(*) > 1;
- Leakage scan: features must be strictly < prediction_date.
SELECT * FROM raw_feature_events r JOIN training t ON t.user_id = r.user_id WHERE r.event_time >= t.prediction_date; -- should return 0 rows
- Label prevalence: quick sanity check.
SELECT AVG(purchase_next_30d::numeric) AS positive_rate FROM training;
Exercises
Use this simplified schema (types and names can be adapted to your SQL dialect):
- users(user_id, created_at, country)
- events(user_id, event_type, event_time)
- orders(order_id, user_id, amount, order_time, status)
Exercise 1 — Snapshot dataset at 2023-12-01
Build one row per user with:
- prediction_date = 2023-12-01
- Features from 30 days before prediction_date: views_30d, orders_30d, spend_30d
- Label purchase_next_30d from [2023-12-01, 2023-12-31)
Expected columns: user_id, prediction_date, views_30d, orders_30d, spend_30d, purchase_next_30d
Sample expected output (illustrative):
user_001 | 2023-12-01 | 12 | 1 | 45.50 | 0
user_002 | 2023-12-01 | 5 | 0 | 0.00 | 1
user_003 | 2023-12-01 | 0 | 0 | 0.00 | 0
Hints:
- Join events and orders but filter feature windows strictly before prediction_date.
- Use LEFT JOIN so users without activity still appear.
- Label uses order_time >= prediction_date and < prediction_date + interval 30 days.
Solution
WITH cohort AS (
SELECT DISTINCT u.user_id
FROM users u
WHERE u.created_at < DATE '2023-12-01'
),
-- Pre-aggregate each fact table to one row per user before the final join
event_features AS (
SELECT e.user_id,
COUNT(*) FILTER (WHERE e.event_type = 'view') AS views_30d
FROM events e
WHERE e.event_time >= DATE '2023-11-01' AND e.event_time < DATE '2023-12-01'
GROUP BY e.user_id
),
order_features AS (
SELECT o.user_id,
COUNT(*) AS orders_30d,
SUM(o.amount) AS spend_30d
FROM orders o
WHERE o.order_time >= DATE '2023-11-01' AND o.order_time < DATE '2023-12-01'
GROUP BY o.user_id
),
labels AS (
SELECT c.user_id,
-- COUNT(o.order_id) stays 0 when the LEFT JOIN finds no orders in the label window
CASE WHEN COUNT(o.order_id) > 0 THEN 1 ELSE 0 END AS purchase_next_30d
FROM cohort c
LEFT JOIN orders o
ON o.user_id = c.user_id
AND o.order_time >= DATE '2023-12-01'
AND o.order_time < DATE '2023-12-31'
GROUP BY c.user_id
)
SELECT c.user_id,
DATE '2023-12-01' AS prediction_date,
COALESCE(ef.views_30d, 0) AS views_30d,
COALESCE(ofe.orders_30d, 0) AS orders_30d,
COALESCE(ofe.spend_30d, 0) AS spend_30d,
l.purchase_next_30d
FROM cohort c
LEFT JOIN event_features ef ON ef.user_id = c.user_id
LEFT JOIN order_features ofe ON ofe.user_id = c.user_id
JOIN labels l USING (user_id);
Exercise 2 — Rolling monthly training rows
Create rows for each user for prediction dates 2023-10-01, 2023-11-01, 2023-12-01. For each (user_id, prediction_date):
- Features from the 30 days before prediction_date: views_30d, orders_30d, spend_30d
- Label purchase_next_30d from [prediction_date, prediction_date + 30d)
- Ensure one row per (user_id, prediction_date)
Expected columns: user_id, prediction_date, views_30d, orders_30d, spend_30d, purchase_next_30d
Hints:
- Build a small dates CTE and join it to users with a created_at cutoff.
- Filter features with event_time/order_time < prediction_date.
- Aggregate by user_id and prediction_date.
Solution
WITH dates AS (
SELECT DATE '2023-10-01' AS prediction_date
UNION ALL SELECT DATE '2023-11-01'
UNION ALL SELECT DATE '2023-12-01'
),
cohort AS (
SELECT u.user_id, d.prediction_date
FROM users u
CROSS JOIN dates d
WHERE u.created_at < d.prediction_date
),
-- Aggregate each fact table per (user_id, prediction_date) before combining, to avoid row multiplication
event_features AS (
SELECT c.user_id, c.prediction_date,
COUNT(*) FILTER (WHERE e.event_type = 'view') AS views_30d
FROM cohort c
JOIN events e
ON e.user_id = c.user_id
AND e.event_time >= c.prediction_date - INTERVAL '30 days'
AND e.event_time < c.prediction_date
GROUP BY c.user_id, c.prediction_date
),
order_features AS (
SELECT c.user_id, c.prediction_date,
COUNT(*) AS orders_30d,
SUM(o.amount) AS spend_30d
FROM cohort c
JOIN orders o
ON o.user_id = c.user_id
AND o.order_time >= c.prediction_date - INTERVAL '30 days'
AND o.order_time < c.prediction_date
GROUP BY c.user_id, c.prediction_date
),
labels AS (
SELECT c.user_id, c.prediction_date,
-- COUNT(o.order_id) stays 0 when the LEFT JOIN finds no orders in the label window
CASE WHEN COUNT(o.order_id) > 0 THEN 1 ELSE 0 END AS purchase_next_30d
FROM cohort c
LEFT JOIN orders o
ON o.user_id = c.user_id
AND o.order_time >= c.prediction_date
AND o.order_time < c.prediction_date + INTERVAL '30 days'
GROUP BY c.user_id, c.prediction_date
)
SELECT c.user_id, c.prediction_date,
COALESCE(ef.views_30d, 0) AS views_30d,
COALESCE(ofe.orders_30d, 0) AS orders_30d,
COALESCE(ofe.spend_30d, 0) AS spend_30d,
l.purchase_next_30d
FROM cohort c
LEFT JOIN event_features ef USING (user_id, prediction_date)
LEFT JOIN order_features ofe USING (user_id, prediction_date)
JOIN labels l USING (user_id, prediction_date)
ORDER BY c.user_id, c.prediction_date;
Exercise self-check checklist
- Did you restrict features strictly before prediction_date?
- Did your label use a window starting at prediction_date, with no overlap with the feature window?
- Do you have exactly one row per (user_id, prediction_date)?
- Do users with no activity still appear (LEFT JOIN used)?
Common mistakes and how to self-check
- Leakage via windows: Using features computed without filtering by prediction_date. Self-check: ensure every feature has a time filter like event_time < prediction_date.
- Double counting from joins: Joining multiple fact tables at row level increases counts. Fix: pre-aggregate each fact to (user_id, prediction_date) before joining.
- Inconsistent timezones: Mixed UTC/local timestamps cause off-by-one-day windows. Fix: normalize timestamps first.
- Non-deterministic deduping: Not specifying which record wins. Fix: use ROW_NUMBER() OVER (...) = 1 with an explicit ORDER BY (see the sketches after this list).
- Random split leakage: Random train/validation splits that mix the same users or overlapping time periods between train and validation. Fix: split by time (earlier prediction dates for train, later ones for validation).
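As concrete illustrations of the last two fixes, here are minimal sketches, assuming PostgreSQL; the subscriptions table, updated_at column, and the 2023-12-01 cutoff are illustrative names and values.
-- Deterministic dedup: keep the latest subscription row per user
SELECT *
FROM (
SELECT s.*,
ROW_NUMBER() OVER (PARTITION BY s.user_id ORDER BY s.updated_at DESC) AS rn
FROM subscriptions s
) t
WHERE rn = 1;
-- Time-based split: earlier prediction dates train, later ones validate
SELECT *,
CASE WHEN prediction_date < DATE '2023-12-01' THEN 'train' ELSE 'validation' END AS split
FROM training;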
Practical projects
- Lead conversion: Build a dataset predicting if a lead converts in 14 days using web events and CRM updates.
- Next purchase: Dataset predicting a user purchase in 30 days from ecommerce orders and page views.
- Churn intent: Dataset predicting customer churn in 30 days from support tickets, logins, and payments.
Learning path
- SQL filtering and date arithmetic (NOW, INTERVAL, date truncation).
- GROUP BY and conditional aggregation (FILTER, CASE WHEN).
- JOIN strategies: pre-aggregation and dimensional joins.
- Window functions for dedup and rolling features.
- Time-based splits and dataset validation.
Next steps
- Automate your extraction with scheduled queries.
- Add feature stores or views for reusability (see the view sketch after this list).
- Document label and feature definitions with exact SQL snippets.
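For example, a label definition can be persisted as a view so every pipeline reuses the exact same SQL. This is a sketch; the view name is illustrative and a PostgreSQL-style dialect is assumed.
-- Persist the 30-day purchase label for the 2023-12-01 snapshot as a reusable view
CREATE OR REPLACE VIEW labels_2023_12_01 AS
SELECT u.user_id,
CASE WHEN COUNT(o.order_id) > 0 THEN 1 ELSE 0 END AS purchase_next_30d
FROM users u
LEFT JOIN orders o
ON o.user_id = u.user_id
AND o.order_time >= DATE '2023-12-01'
AND o.order_time < DATE '2023-12-31'
WHERE u.created_at < DATE '2023-12-01'
GROUP BY u.user_id;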
Mini challenge
Create a weekly rolling dataset for prediction dates every Monday in November 2023. Build features from the prior 14 days and a 14-day label window. Ensure one row per (user_id, prediction_date). Add a query that reports positive rate per week.
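A possible starting point, assuming PostgreSQL; training_weekly and purchase_next_14d are names you would define yourself when building the dataset.
-- Enumerate the Monday prediction dates in November 2023
SELECT d::date AS prediction_date
FROM generate_series(DATE '2023-11-01', DATE '2023-11-30', INTERVAL '1 day') AS d
WHERE EXTRACT(DOW FROM d) = 1; -- 1 = Monday
-- Positive rate per week, once the weekly dataset exists
SELECT prediction_date,
AVG(purchase_next_14d::numeric) AS positive_rate
FROM training_weekly
GROUP BY prediction_date
ORDER BY prediction_date;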
Quick Test