Why this matters
As a Data Scientist, you will find that a model is only as good as the dataset behind it. SQL turns raw events and transactions into clean, leakage-free training rows. Typical tasks include:
- Define the prediction moment (prediction_date) and build labels correctly (e.g., purchase_next_30d).
- Aggregate features in fixed windows before prediction_date (e.g., last 30 days) to avoid leakage.
- Join multiple sources (events, orders, subscriptions) efficiently and safely.
- Create rolling training datasets (many prediction dates per user).
- Balance classes and split train/validation by time.
Concept explained simply
Think of each training row as a snapshot taken at a specific time: what we knew then (features) and what happened after (label).
Minimal recipe
- Define cohort: which entities and which prediction_date(s).
- Features: aggregate behavior in [prediction_date - window, prediction_date).
- Label: check outcomes in [prediction_date, prediction_date + label_window).
- Ensure one row per entity per prediction_date. No future data allowed in features (a query template sketch follows below).
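Put together, the recipe above can be sketched as one generic query. This is a sketch only, assuming a PostgreSQL-style dialect; entity, facts, fact_time, and the 30-day windows are placeholders to replace with your own tables and horizons.
-- Template: one training row per entity per prediction_date
WITH cohort AS (
SELECT entity_id, DATE '2024-01-01' AS prediction_date  -- who we predict for, and when
FROM entity
),
features AS (
SELECT c.entity_id, c.prediction_date, COUNT(*) AS facts_30d
FROM cohort c
JOIN facts f
ON f.entity_id = c.entity_id
AND f.fact_time >= c.prediction_date - INTERVAL '30 days'  -- feature window ends strictly before prediction_date
AND f.fact_time < c.prediction_date
GROUP BY c.entity_id, c.prediction_date
),
labels AS (
SELECT c.entity_id, c.prediction_date,
MAX(CASE WHEN f.fact_time IS NOT NULL THEN 1 ELSE 0 END) AS outcome_next_30d  -- label window starts at prediction_date
FROM cohort c
LEFT JOIN facts f
ON f.entity_id = c.entity_id
AND f.fact_time >= c.prediction_date
AND f.fact_time < c.prediction_date + INTERVAL '30 days'
GROUP BY c.entity_id, c.prediction_date
)
SELECT c.entity_id, c.prediction_date,
COALESCE(fe.facts_30d, 0) AS facts_30d,
l.outcome_next_30d
FROM cohort c
LEFT JOIN features fe USING (entity_id, prediction_date)
JOIN labels l USING (entity_id, prediction_date);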
Data requirements checklist
- Entity key identified (e.g., user_id).
- Timestamped facts: events.event_time, orders.order_time, etc.
- Consistency: timestamps in the same timezone (see the normalization sketch after this checklist).
- Windows chosen: feature window (e.g., 30 days), label window (e.g., 30 days).
- Deduplication rules: how to pick latest state when needed.
- Class definition: what counts as positive vs negative.
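For the timezone item, here is a minimal normalization sketch, assuming PostgreSQL and that events.event_time is stored as a naive local timestamp; the source timezone name is an assumption.
-- Interpret naive timestamps as local time, then convert to UTC (PostgreSQL AT TIME ZONE)
SELECT user_id,
event_type,
event_time AT TIME ZONE 'America/New_York' AT TIME ZONE 'UTC' AS event_time_utc
FROM events;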
Worked examples
Example 1: Build a 7‑day purchase label
Goal: For each user in the cohort as of 2023-12-01, label whether they purchase in the next 7 days.
-- Cohort as of a fixed prediction date
WITH cohort AS (
SELECT DISTINCT u.user_id
FROM users u
WHERE u.created_at < DATE '2023-12-01'
),
labels AS (
SELECT c.user_id,
-- COUNT(o.order_id) ignores the NULL row a LEFT JOIN produces when a user has no orders
CASE WHEN COUNT(o.order_id) > 0 THEN 1 ELSE 0 END AS purchase_next_7d
FROM cohort c
LEFT JOIN orders o
ON o.user_id = c.user_id
AND o.order_time >= DATE '2023-12-01'
AND o.order_time < DATE '2023-12-08'
GROUP BY c.user_id
)
SELECT * FROM labels;
Key idea: the label window starts at prediction_date and never overlaps the feature window.
Example 2: Leakage‑safe 30‑day features
Goal: Aggregate behavior in the 30 days before 2023-12-01.
WITH cohort AS (
SELECT DISTINCT u.user_id
FROM users u
WHERE u.created_at < DATE '2023-12-01'
),
-- Aggregate each fact table separately; joining events and orders row-by-row would multiply counts
event_features AS (
SELECT e.user_id,
COUNT(*) FILTER (WHERE e.event_type = 'view') AS views_30d
FROM events e
WHERE e.event_time >= DATE '2023-11-01' AND e.event_time < DATE '2023-12-01'
GROUP BY e.user_id
),
order_features AS (
SELECT o.user_id,
COUNT(*) AS orders_30d,
SUM(o.amount) AS spend_30d
FROM orders o
WHERE o.order_time >= DATE '2023-11-01' AND o.order_time < DATE '2023-12-01'
GROUP BY o.user_id
),
features AS (
SELECT c.user_id,
COALESCE(ef.views_30d, 0) AS views_30d,
COALESCE(ofe.orders_30d, 0) AS orders_30d,
COALESCE(ofe.spend_30d, 0) AS spend_30d
FROM cohort c
LEFT JOIN event_features ef ON ef.user_id = c.user_id
LEFT JOIN order_features ofe ON ofe.user_id = c.user_id
)
SELECT * FROM features;
Key idea: the feature window ends strictly before prediction_date, and each fact table is aggregated on its own before the final join so counts are not multiplied.
Example 3: One training table row per user
Goal: Combine features and label for 2023-12-01 into one dataset.
WITH cohort AS (
SELECT DISTINCT u.user_id
FROM users u
WHERE u.created_at < DATE '2023-12-01'
),
-- Pre-aggregate each fact table to one row per user before the final join
event_features AS (
SELECT e.user_id,
COUNT(*) FILTER (WHERE e.event_type = 'view') AS views_30d
FROM events e
WHERE e.event_time >= DATE '2023-11-01' AND e.event_time < DATE '2023-12-01'
GROUP BY e.user_id
),
order_features AS (
SELECT o.user_id,
COUNT(*) AS orders_30d,
SUM(o.amount) AS spend_30d
FROM orders o
WHERE o.order_time >= DATE '2023-11-01' AND o.order_time < DATE '2023-12-01'
GROUP BY o.user_id
),
labels AS (
SELECT c.user_id,
-- COUNT(o.order_id) ignores the NULL row a LEFT JOIN produces when a user has no orders
CASE WHEN COUNT(o.order_id) > 0 THEN 1 ELSE 0 END AS purchase_next_30d
FROM cohort c
LEFT JOIN orders o
ON o.user_id = c.user_id
AND o.order_time >= DATE '2023-12-01'
AND o.order_time < DATE '2023-12-31'
GROUP BY c.user_id
)
SELECT c.user_id, DATE '2023-12-01' AS prediction_date,
COALESCE(ef.views_30d, 0) AS views_30d,
COALESCE(ofe.orders_30d, 0) AS orders_30d,
COALESCE(ofe.spend_30d, 0) AS spend_30d,
l.purchase_next_30d
FROM cohort c
LEFT JOIN event_features ef ON ef.user_id = c.user_id
LEFT JOIN order_features ofe ON ofe.user_id = c.user_id
JOIN labels l USING (user_id);
Key idea: ensure exactly one row per user for the prediction_date.
Quality checks
- Duplicate rows: ensure one row per (user_id, prediction_date).
SELECT user_id, prediction_date, COUNT(*) FROM training GROUP BY 1,2 HAVING COUNT(*) > 1;
- Leakage scan: features must be strictly < prediction_date.
SELECT * FROM raw_feature_events r JOIN training t ON t.user_id = r.user_id WHERE r.event_time >= t.prediction_date; -- should return 0 rows
- Label prevalence: quick sanity check.
SELECT AVG(purchase_next_30d::numeric) AS positive_rate FROM training;
Exercises
Use this simplified schema (types and names can be adapted to your SQL dialect):
- users(user_id, created_at, country)
- events(user_id, event_type, event_time)
- orders(order_id, user_id, amount, order_time, status)
Exercise 1 — Snapshot dataset at 2023-12-01
Build one row per user with:
- prediction_date = 2023-12-01
- Features from 30 days before prediction_date: views_30d, orders_30d, spend_30d
- Label purchase_next_30d from [2023-12-01, 2023-12-31)
Expected columns: user_id, prediction_date, views_30d, orders_30d, spend_30d, purchase_next_30d
Sample expected output (illustrative):
user_001 | 2023-12-01 | 12 | 1 | 45.50 | 0
user_002 | 2023-12-01 | 5 | 0 | 0.00 | 1
user_003 | 2023-12-01 | 0 | 0 | 0.00 | 0
Hints:
- Join events and orders but filter feature windows strictly before prediction_date.
- Use LEFT JOIN so users without activity still appear.
- Label uses order_time >= prediction_date and < prediction_date + interval 30 days.
Solution
WITH cohort AS (
SELECT DISTINCT u.user_id
FROM users u
WHERE u.created_at < DATE '2023-12-01'
),
-- Pre-aggregate each fact table to one row per user before the final join
event_features AS (
SELECT e.user_id,
COUNT(*) FILTER (WHERE e.event_type = 'view') AS views_30d
FROM events e
WHERE e.event_time >= DATE '2023-11-01' AND e.event_time < DATE '2023-12-01'
GROUP BY e.user_id
),
order_features AS (
SELECT o.user_id,
COUNT(*) AS orders_30d,
SUM(o.amount) AS spend_30d
FROM orders o
WHERE o.order_time >= DATE '2023-11-01' AND o.order_time < DATE '2023-12-01'
GROUP BY o.user_id
),
labels AS (
SELECT c.user_id,
-- COUNT(o.order_id) stays 0 when the LEFT JOIN finds no orders in the label window
CASE WHEN COUNT(o.order_id) > 0 THEN 1 ELSE 0 END AS purchase_next_30d
FROM cohort c
LEFT JOIN orders o
ON o.user_id = c.user_id
AND o.order_time >= DATE '2023-12-01'
AND o.order_time < DATE '2023-12-31'
GROUP BY c.user_id
)
SELECT c.user_id,
DATE '2023-12-01' AS prediction_date,
COALESCE(ef.views_30d, 0) AS views_30d,
COALESCE(ofe.orders_30d, 0) AS orders_30d,
COALESCE(ofe.spend_30d, 0) AS spend_30d,
l.purchase_next_30d
FROM cohort c
LEFT JOIN event_features ef ON ef.user_id = c.user_id
LEFT JOIN order_features ofe ON ofe.user_id = c.user_id
JOIN labels l USING (user_id);
Exercise 2 — Rolling monthly training rows
Create rows for each user for prediction dates 2023-10-01, 2023-11-01, 2023-12-01. For each (user_id, prediction_date):
- Features from the 30 days before prediction_date: views_30d, orders_30d, spend_30d
- Label purchase_next_30d from [prediction_date, prediction_date + 30d)
- Ensure one row per (user_id, prediction_date)
Expected columns: user_id, prediction_date, views_30d, orders_30d, spend_30d, purchase_next_30d
Hints:
- Build a small dates CTE and join it to users with a created_at cutoff.
- Filter features with event_time/order_time < prediction_date.
- Aggregate by user_id and prediction_date.
Solution
WITH dates AS (
SELECT DATE '2023-10-01' AS prediction_date
UNION ALL SELECT DATE '2023-11-01'
UNION ALL SELECT DATE '2023-12-01'
),
cohort AS (
SELECT u.user_id, d.prediction_date
FROM users u
CROSS JOIN dates d
WHERE u.created_at < d.prediction_date
),
-- Aggregate each fact table per (user_id, prediction_date) before combining, to avoid row multiplication
event_features AS (
SELECT c.user_id, c.prediction_date,
COUNT(*) FILTER (WHERE e.event_type = 'view') AS views_30d
FROM cohort c
JOIN events e
ON e.user_id = c.user_id
AND e.event_time >= c.prediction_date - INTERVAL '30 days'
AND e.event_time < c.prediction_date
GROUP BY c.user_id, c.prediction_date
),
order_features AS (
SELECT c.user_id, c.prediction_date,
COUNT(*) AS orders_30d,
SUM(o.amount) AS spend_30d
FROM cohort c
JOIN orders o
ON o.user_id = c.user_id
AND o.order_time >= c.prediction_date - INTERVAL '30 days'
AND o.order_time < c.prediction_date
GROUP BY c.user_id, c.prediction_date
),
labels AS (
SELECT c.user_id, c.prediction_date,
-- COUNT(o.order_id) stays 0 when the LEFT JOIN finds no orders in the label window
CASE WHEN COUNT(o.order_id) > 0 THEN 1 ELSE 0 END AS purchase_next_30d
FROM cohort c
LEFT JOIN orders o
ON o.user_id = c.user_id
AND o.order_time >= c.prediction_date
AND o.order_time < c.prediction_date + INTERVAL '30 days'
GROUP BY c.user_id, c.prediction_date
)
SELECT c.user_id, c.prediction_date,
COALESCE(ef.views_30d, 0) AS views_30d,
COALESCE(ofe.orders_30d, 0) AS orders_30d,
COALESCE(ofe.spend_30d, 0) AS spend_30d,
l.purchase_next_30d
FROM cohort c
LEFT JOIN event_features ef USING (user_id, prediction_date)
LEFT JOIN order_features ofe USING (user_id, prediction_date)
JOIN labels l USING (user_id, prediction_date)
ORDER BY c.user_id, c.prediction_date;
Exercise self-check checklist
- Did you restrict features strictly before prediction_date?
- Did your label use a window starting at prediction_date, with no overlap with the feature window?
- Do you have exactly one row per (user_id, prediction_date)?
- Do users with no activity still appear (LEFT JOIN used)?
Common mistakes and how to self-check
- Leakage via windows: Using features computed without filtering by prediction_date. Self-check: ensure every feature has a time filter like event_time < prediction_date.
- Double counting from joins: Joining multiple fact tables at row level increases counts. Fix: pre-aggregate each fact to (user_id, prediction_date) before joining.
- Inconsistent timezones: Mixed UTC/local timestamps cause off-by-one-day windows. Fix: normalize timestamps first.
- Non-deterministic deduping: Not specifying which record wins. Fix: use ROW_NUMBER() OVER (...) = 1 with an explicit ORDER BY (see the sketches after this list).
- Random split leakage: Random train/validation splits that mix the same users or overlapping time periods between train and validation. Fix: split by time (earlier prediction dates for train, later ones for validation).
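As concrete illustrations of the last two fixes, here are minimal sketches, assuming PostgreSQL; the subscriptions table, updated_at column, and the 2023-12-01 cutoff are illustrative names and values.
-- Deterministic dedup: keep the latest subscription row per user
SELECT *
FROM (
SELECT s.*,
ROW_NUMBER() OVER (PARTITION BY s.user_id ORDER BY s.updated_at DESC) AS rn
FROM subscriptions s
) t
WHERE rn = 1;
-- Time-based split: earlier prediction dates train, later ones validate
SELECT *,
CASE WHEN prediction_date < DATE '2023-12-01' THEN 'train' ELSE 'validation' END AS split
FROM training;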
Practical projects
- Lead conversion: Build a dataset predicting if a lead converts in 14 days using web events and CRM updates.
- Next purchase: Dataset predicting a user purchase in 30 days from ecommerce orders and page views.
- Churn intent: Dataset predicting customer churn in 30 days from support tickets, logins, and payments.
Learning path
- SQL filtering and date arithmetic (NOW, INTERVAL, date truncation).
- GROUP BY and conditional aggregation (FILTER, CASE WHEN).
- JOIN strategies: pre-aggregation and dimensional joins.
- Window functions for dedup and rolling features.
- Time-based splits and dataset validation.
Next steps
- Automate your extraction with scheduled queries.
- Add feature stores or views for reusability (see the view sketch after this list).
- Document label and feature definitions with exact SQL snippets.
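For example, a label definition can be persisted as a view so every pipeline reuses the exact same SQL. This is a sketch; the view name is illustrative and a PostgreSQL-style dialect is assumed.
-- Persist the 30-day purchase label for the 2023-12-01 snapshot as a reusable view
CREATE OR REPLACE VIEW labels_2023_12_01 AS
SELECT u.user_id,
CASE WHEN COUNT(o.order_id) > 0 THEN 1 ELSE 0 END AS purchase_next_30d
FROM users u
LEFT JOIN orders o
ON o.user_id = u.user_id
AND o.order_time >= DATE '2023-12-01'
AND o.order_time < DATE '2023-12-31'
WHERE u.created_at < DATE '2023-12-01'
GROUP BY u.user_id;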
Mini challenge
Create a weekly rolling dataset for prediction dates every Monday in November 2023. Build features from the prior 14 days and a 14-day label window. Ensure one row per (user_id, prediction_date). Add a query that reports positive rate per week.
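A possible starting point, assuming PostgreSQL; training_weekly and purchase_next_14d are names you would define yourself when building the dataset.
-- Enumerate the Monday prediction dates in November 2023
SELECT d::date AS prediction_date
FROM generate_series(DATE '2023-11-01', DATE '2023-11-30', INTERVAL '1 day') AS d
WHERE EXTRACT(DOW FROM d) = 1; -- 1 = Monday
-- Positive rate per week, once the weekly dataset exists
SELECT prediction_date,
AVG(purchase_next_14d::numeric) AS positive_rate
FROM training_weekly
GROUP BY prediction_date
ORDER BY prediction_date;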
Quick Test