
Data Extraction For Modeling

Learn Data Extraction For Modeling for free with explanations, exercises, and a quick test (for Data Scientists).

Published: January 1, 2026 | Updated: January 1, 2026

Why this matters

As a Data Scientist, your model is only as good as the dataset you extract. SQL turns raw events and transactions into clean, leakage-free training rows. Typical tasks you will do:

  • Define the prediction moment (prediction_date) and build labels correctly (e.g., purchase_next_30d).
  • Aggregate features in fixed windows before prediction_date (e.g., last 30 days) to avoid leakage.
  • Join multiple sources (events, orders, subscriptions) efficiently and safely.
  • Create rolling training datasets (many prediction dates per user).
  • Balance classes and split train/validation by time.

Concept explained simply

Think of each training row as a snapshot taken at a specific time: what we knew then (features) and what happened after (label).

Mental model: One entity (user_id) + one prediction_date = one row. Everything in features must be strictly before prediction_date. Everything in the label must come at or after prediction_date, within the label window.
Minimal recipe
  1. Define cohort: which entities and which prediction_date(s).
  2. Features: aggregate behavior in [prediction_date - window, prediction_date).
  3. Label: check outcomes in [prediction_date, prediction_date + label_window).
  4. Ensure one row per entity per prediction_date. No future data allowed in features.
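The four-step recipe can be exercised end to end on toy data. Here is a minimal sketch using Python's built-in sqlite3 module; the table and column names follow the lesson's schema, and the data is invented:

```python
import sqlite3

# Steps 1-4 of the recipe on a tiny in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users  (user_id TEXT, created_at TEXT);
CREATE TABLE orders (order_id INTEGER, user_id TEXT, amount REAL, order_time TEXT);
INSERT INTO users  VALUES ('u1', '2023-01-15'), ('u2', '2023-06-02');
INSERT INTO orders VALUES
  (1, 'u1', 20.0, '2023-11-20'),  -- inside the feature window
  (2, 'u1', 35.0, '2023-12-05'),  -- inside the label window
  (3, 'u2', 10.0, '2023-10-01');  -- before the feature window
""")

rows = conn.execute("""
WITH cohort AS (                                   -- step 1: define the cohort
  SELECT user_id FROM users WHERE created_at < '2023-12-01'
)
SELECT c.user_id,
       -- step 2: feature window [2023-11-01, 2023-12-01)
       COUNT(CASE WHEN o.order_time >= '2023-11-01'
                   AND o.order_time <  '2023-12-01' THEN 1 END) AS orders_30d,
       -- step 3: label window [2023-12-01, 2023-12-31)
       MAX(CASE WHEN o.order_time >= '2023-12-01'
                 AND o.order_time <  '2023-12-31' THEN 1 ELSE 0 END) AS purchase_next_30d
FROM cohort c
LEFT JOIN orders o ON o.user_id = c.user_id
GROUP BY c.user_id                                 -- step 4: one row per entity
ORDER BY c.user_id
""").fetchall()
print(rows)  # [('u1', 1, 1), ('u2', 0, 0)]
```

The half-open windows ([start, end)) are what keep the feature and label periods from overlapping.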

Data requirements checklist

  • Entity key identified (e.g., user_id).
  • Timestamped facts: events.event_time, orders.order_time, etc.
  • Consistency: timestamps in the same timezone.
  • Windows chosen: feature window (e.g., 30 days), label window (e.g., 30 days).
  • Deduplication rules: how to pick latest state when needed.
  • Class definition: what counts as positive vs negative.

Worked examples

Example 1: Build a 7‑day purchase label

Goal: For each user in the cohort as of 2023-12-01, label whether they purchase in the next 7 days.

-- Cohort as of a fixed prediction date
WITH cohort AS (
  SELECT DISTINCT u.user_id
  FROM users u
  WHERE u.created_at < DATE '2023-12-01'
),
labels AS (
  SELECT c.user_id,
         -- COUNT(o.order_id), not COUNT(*): a LEFT JOIN without a match
         -- still produces one NULL-extended row per user
         CASE WHEN COUNT(o.order_id) > 0 THEN 1 ELSE 0 END AS purchase_next_7d
  FROM cohort c
  LEFT JOIN orders o
    ON o.user_id = c.user_id
   AND o.order_time >= DATE '2023-12-01'
   AND o.order_time <  DATE '2023-12-08'
  GROUP BY c.user_id
)
SELECT * FROM labels;

Key idea: the label window starts at prediction_date and never overlaps the feature window.
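One subtle pitfall when building labels this way: with a LEFT JOIN, a user with no matching order still produces one NULL-extended row, so COUNT(*) would label everyone positive. Counting a column from the joined table, e.g. COUNT(o.order_id), is the safe form. A sqlite3 sketch with toy data:

```python
import sqlite3

# COUNT(*) vs COUNT(column) after a LEFT JOIN: for a user with no
# matching order the join still yields one row of NULLs, so COUNT(*)
# is 1 while COUNT(o.order_id) is 0.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users  (user_id TEXT);
CREATE TABLE orders (order_id INTEGER, user_id TEXT, order_time TEXT);
INSERT INTO users  VALUES ('buyer'), ('lurker');
INSERT INTO orders VALUES (1, 'buyer', '2023-12-03');
""")

rows = conn.execute("""
SELECT u.user_id,
       COUNT(*)          AS count_star,   -- 1 even with no match
       COUNT(o.order_id) AS count_orders  -- 0 when no match
FROM users u
LEFT JOIN orders o
  ON o.user_id = u.user_id
 AND o.order_time >= '2023-12-01' AND o.order_time < '2023-12-08'
GROUP BY u.user_id
ORDER BY u.user_id
""").fetchall()
print(rows)  # [('buyer', 1, 1), ('lurker', 1, 0)]
```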

Example 2: Leakage‑safe 30‑day features

Goal: Aggregate behavior in the 30 days before 2023-12-01.

WITH cohort AS (
  SELECT DISTINCT u.user_id
  FROM users u
  WHERE u.created_at < DATE '2023-12-01'
),
event_feats AS (
  SELECT e.user_id, COUNT(*) AS views_30d
  FROM events e
  WHERE e.event_type = 'view'
    AND e.event_time >= DATE '2023-11-01' AND e.event_time < DATE '2023-12-01'
  GROUP BY e.user_id
),
order_feats AS (
  SELECT o.user_id,
         COUNT(*) AS orders_30d,
         SUM(o.amount) AS spend_30d
  FROM orders o
  WHERE o.order_time >= DATE '2023-11-01' AND o.order_time < DATE '2023-12-01'
  GROUP BY o.user_id
),
-- Pre-aggregate each fact table before joining: a row-level join of
-- events and orders would multiply rows and inflate every count.
features AS (
  SELECT c.user_id,
         COALESCE(ef.views_30d, 0) AS views_30d,
         COALESCE(ordf.orders_30d, 0) AS orders_30d,
         COALESCE(ordf.spend_30d, 0) AS spend_30d
  FROM cohort c
  LEFT JOIN event_feats ef ON ef.user_id = c.user_id
  LEFT JOIN order_feats ordf ON ordf.user_id = c.user_id
)
SELECT * FROM features;

Key idea: feature window is strictly before prediction_date.
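The reason each fact table is aggregated on its own is join fan-out: joining two fact tables to the same user at row level multiplies rows and silently inflates counts. A sqlite3 sketch with toy data:

```python
import sqlite3

# Join fan-out in miniature: 2 events x 2 orders for one user become
# 4 rows in a row-level join, while per-fact pre-aggregation keeps the
# true counts.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (user_id TEXT, event_time TEXT);
CREATE TABLE orders (order_id INTEGER, user_id TEXT, order_time TEXT);
INSERT INTO events VALUES ('u1', '2023-11-10'), ('u1', '2023-11-12');
INSERT INTO orders VALUES (1, 'u1', '2023-11-15'), (2, 'u1', '2023-11-20');
""")

# Row-level join: every event row pairs with every order row.
fanout = conn.execute("""
SELECT COUNT(*) FROM events e JOIN orders o ON o.user_id = e.user_id
""").fetchone()[0]

# Pre-aggregated: each fact is reduced to one row per user before joining.
safe = conn.execute("""
WITH ev AS (SELECT user_id, COUNT(*) AS n_views  FROM events GROUP BY user_id),
     od AS (SELECT user_id, COUNT(*) AS n_orders FROM orders GROUP BY user_id)
SELECT ev.n_views, od.n_orders FROM ev JOIN od USING (user_id)
""").fetchone()

print(fanout, safe)  # 4 (2, 2)
```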

Example 3: One training table row per user

Goal: Combine features and label for 2023-12-01 into one dataset.

WITH cohort AS (
  SELECT DISTINCT u.user_id
  FROM users u
  WHERE u.created_at < DATE '2023-12-01'
),
event_feats AS (
  SELECT e.user_id, COUNT(*) AS views_30d
  FROM events e
  WHERE e.event_type = 'view'
    AND e.event_time >= DATE '2023-11-01' AND e.event_time < DATE '2023-12-01'
  GROUP BY e.user_id
),
order_feats AS (
  SELECT o.user_id,
         COUNT(*) AS orders_30d,
         SUM(o.amount) AS spend_30d
  FROM orders o
  WHERE o.order_time >= DATE '2023-11-01' AND o.order_time < DATE '2023-12-01'
  GROUP BY o.user_id
),
-- Pre-aggregate each fact table separately to avoid join fan-out
features AS (
  SELECT c.user_id,
         COALESCE(ef.views_30d, 0) AS views_30d,
         COALESCE(ordf.orders_30d, 0) AS orders_30d,
         COALESCE(ordf.spend_30d, 0) AS spend_30d
  FROM cohort c
  LEFT JOIN event_feats ef ON ef.user_id = c.user_id
  LEFT JOIN order_feats ordf ON ordf.user_id = c.user_id
),
labels AS (
  SELECT c.user_id,
         -- COUNT(o.order_id), not COUNT(*): a LEFT JOIN without a match
         -- still produces one NULL-extended row per user
         CASE WHEN COUNT(o.order_id) > 0 THEN 1 ELSE 0 END AS purchase_next_30d
  FROM cohort c
  LEFT JOIN orders o
    ON o.user_id = c.user_id
   AND o.order_time >= DATE '2023-12-01'
   AND o.order_time <  DATE '2023-12-31'
  GROUP BY c.user_id
)
SELECT f.user_id, DATE '2023-12-01' AS prediction_date, f.views_30d, f.orders_30d, f.spend_30d, l.purchase_next_30d
FROM features f
JOIN labels l USING (user_id);

Key idea: ensure exactly one row per user for the prediction_date.

Quality checks

  • Duplicate rows: ensure one row per (user_id, prediction_date).
    SELECT user_id, prediction_date, COUNT(*)
    FROM training
    GROUP BY 1,2
    HAVING COUNT(*) > 1;
    
  • Leakage scan: features must be strictly < prediction_date.
    SELECT *
    FROM raw_feature_events r
    JOIN training t ON t.user_id = r.user_id
    WHERE r.event_time >= t.prediction_date; -- should return 0 rows
    
  • Label prevalence: quick sanity check.
    SELECT AVG(purchase_next_30d::numeric) AS positive_rate FROM training;
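The duplicate and prevalence checks can be run with sqlite3 against a toy training table; the duplicate row below is planted on purpose so the first check fires. Note that the `::numeric` cast is PostgreSQL syntax, so the sketch multiplies by 1.0 instead:

```python
import sqlite3

# Quality checks on a toy training table: one deliberate duplicate row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE training (user_id TEXT, prediction_date TEXT, purchase_next_30d INTEGER);
INSERT INTO training VALUES
  ('u1', '2023-12-01', 1),
  ('u1', '2023-12-01', 1),
  ('u2', '2023-12-01', 0);
""")

# Duplicate check: any (user_id, prediction_date) appearing more than once.
dupes = conn.execute("""
SELECT user_id, prediction_date, COUNT(*)
FROM training GROUP BY 1, 2 HAVING COUNT(*) > 1
""").fetchall()

# Label prevalence: force a float average (SQLite lacks the ::numeric cast).
positive_rate = conn.execute(
    "SELECT AVG(purchase_next_30d * 1.0) FROM training"
).fetchone()[0]

print(dupes, round(positive_rate, 3))  # [('u1', '2023-12-01', 2)] 0.667
```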

Exercises

Use this simplified schema (types and names can be adapted to your SQL dialect):

  • users(user_id, created_at, country)
  • events(user_id, event_type, event_time)
  • orders(order_id, user_id, amount, order_time, status)
Exercise 1 — Snapshot dataset at 2023-12-01

Build one row per user with:

  • prediction_date = 2023-12-01
  • Features from 30 days before prediction_date: views_30d, orders_30d, spend_30d
  • Label purchase_next_30d from [2023-12-01, 2023-12-31)

Expected columns: user_id, prediction_date, views_30d, orders_30d, spend_30d, purchase_next_30d

Sample expected output (illustrative):

user_001 | 2023-12-01 | 12 | 1 | 45.50 | 0
user_002 | 2023-12-01 |  5 | 0 |  0.00 | 1
user_003 | 2023-12-01 |  0 | 0 |  0.00 | 0

Hints:

  • Aggregate events and orders separately (to avoid join fan-out), filtering feature windows strictly before prediction_date.
  • Use LEFT JOIN so users without activity still appear.
  • Label uses order_time >= prediction_date and < prediction_date + interval 30 days.
Solution:
WITH cohort AS (
  SELECT DISTINCT u.user_id
  FROM users u
  WHERE u.created_at < DATE '2023-12-01'
),
event_feats AS (
  SELECT e.user_id, COUNT(*) AS views_30d
  FROM events e
  WHERE e.event_type = 'view'
    AND e.event_time >= DATE '2023-11-01' AND e.event_time < DATE '2023-12-01'
  GROUP BY e.user_id
),
order_feats AS (
  SELECT o.user_id,
         COUNT(*) AS orders_30d,
         SUM(o.amount) AS spend_30d
  FROM orders o
  WHERE o.order_time >= DATE '2023-11-01' AND o.order_time < DATE '2023-12-01'
  GROUP BY o.user_id
),
-- Pre-aggregate each fact table separately to avoid join fan-out
features AS (
  SELECT c.user_id,
         COALESCE(ef.views_30d, 0) AS views_30d,
         COALESCE(ordf.orders_30d, 0) AS orders_30d,
         COALESCE(ordf.spend_30d, 0) AS spend_30d
  FROM cohort c
  LEFT JOIN event_feats ef ON ef.user_id = c.user_id
  LEFT JOIN order_feats ordf ON ordf.user_id = c.user_id
),
labels AS (
  SELECT c.user_id,
         -- COUNT(o.order_id), not COUNT(*): a LEFT JOIN without a match
         -- still produces one NULL-extended row per user
         CASE WHEN COUNT(o.order_id) > 0 THEN 1 ELSE 0 END AS purchase_next_30d
  FROM cohort c
  LEFT JOIN orders o
    ON o.user_id = c.user_id
   AND o.order_time >= DATE '2023-12-01'
   AND o.order_time <  DATE '2023-12-31'
  GROUP BY c.user_id
)
SELECT f.user_id,
       DATE '2023-12-01' AS prediction_date,
       f.views_30d, f.orders_30d, f.spend_30d,
       l.purchase_next_30d
FROM features f
JOIN labels l USING (user_id);
Exercise 2 — Rolling monthly training rows

Create rows for each user for prediction dates 2023-10-01, 2023-11-01, 2023-12-01. For each (user_id, prediction_date):

  • Features from the 30 days before prediction_date: views_30d, orders_30d, spend_30d
  • Label purchase_next_30d from [prediction_date, prediction_date + 30d)
  • Ensure one row per (user_id, prediction_date)

Expected columns: user_id, prediction_date, views_30d, orders_30d, spend_30d, purchase_next_30d

Hints:

  • Build a small dates CTE and join it to users with a created_at cutoff.
  • Filter features with event_time/order_time < prediction_date.
  • Aggregate each fact table by user_id and prediction_date before combining, to avoid join fan-out.
Solution:
WITH dates AS (
  SELECT DATE '2023-10-01' AS prediction_date
  UNION ALL SELECT DATE '2023-11-01'
  UNION ALL SELECT DATE '2023-12-01'
),
cohort AS (
  SELECT u.user_id, d.prediction_date
  FROM users u
  CROSS JOIN dates d
  WHERE u.created_at < d.prediction_date
),
event_feats AS (
  SELECT c.user_id, c.prediction_date,
         COUNT(e.user_id) AS views_30d
  FROM cohort c
  LEFT JOIN events e
    ON e.user_id = c.user_id
   AND e.event_type = 'view'
   AND e.event_time >= c.prediction_date - INTERVAL '30 days'
   AND e.event_time <  c.prediction_date
  GROUP BY c.user_id, c.prediction_date
),
order_feats AS (
  SELECT c.user_id, c.prediction_date,
         COUNT(o.order_id) AS orders_30d,
         COALESCE(SUM(o.amount), 0) AS spend_30d
  FROM cohort c
  LEFT JOIN orders o
    ON o.user_id = c.user_id
   AND o.order_time >= c.prediction_date - INTERVAL '30 days'
   AND o.order_time <  c.prediction_date
  GROUP BY c.user_id, c.prediction_date
),
-- Each fact table is aggregated on its own: joining events and orders
-- in one query at row level would multiply rows and inflate the counts.
features AS (
  SELECT ef.user_id, ef.prediction_date,
         ef.views_30d, ordf.orders_30d, ordf.spend_30d
  FROM event_feats ef
  JOIN order_feats ordf USING (user_id, prediction_date)
),
labels AS (
  SELECT c.user_id, c.prediction_date,
         -- COUNT(o.order_id), not COUNT(*): unmatched LEFT JOIN rows count as zero
         CASE WHEN COUNT(o.order_id) > 0 THEN 1 ELSE 0 END AS purchase_next_30d
  FROM cohort c
  LEFT JOIN orders o
    ON o.user_id = c.user_id
   AND o.order_time >= c.prediction_date
   AND o.order_time <  c.prediction_date + INTERVAL '30 days'
  GROUP BY c.user_id, c.prediction_date
)
SELECT f.user_id, f.prediction_date,
       f.views_30d, f.orders_30d, f.spend_30d,
       l.purchase_next_30d
FROM features f
JOIN labels l USING (user_id, prediction_date)
ORDER BY f.user_id, f.prediction_date;

Exercise self-check checklist

  • Did you restrict features strictly before prediction_date?
  • Did your label use a window starting at prediction_date (never earlier)?
  • Do you have exactly one row per (user_id, prediction_date)?
  • Do users with no activity still appear (LEFT JOIN used)?

Common mistakes and how to self-check

  • Leakage via windows: Using features computed without filtering by prediction_date. Self-check: ensure every feature has a time filter like event_time < prediction_date.
  • Double counting from joins: Joining multiple fact tables at row level increases counts. Fix: pre-aggregate each fact to (user_id, prediction_date) before joining.
  • Inconsistent timezones: Mixed UTC/local timestamps cause off-by-one-day windows. Fix: normalize timestamps first.
  • Non-deterministic deduping: Not specifying which record wins. Fix: use ROW_NUMBER() OVER (...) = 1 with explicit ORDER BY.
  • Random split leakage: Random train/val splits across the same user/time. Fix: split by time (earlier dates for train, later for validation).
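The non-deterministic dedup fix can be sketched with sqlite3; the user_state table below is a made-up example of a slowly changing attribute, and the explicit ORDER BY decides which record wins:

```python
import sqlite3

# Deterministic dedup with ROW_NUMBER(): keep the latest state per user,
# with an explicit tie-break column so reruns pick the same row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_state (user_id TEXT, plan TEXT, updated_at TEXT);
INSERT INTO user_state VALUES
  ('u1', 'free', '2023-11-01'),
  ('u1', 'pro',  '2023-11-20'),
  ('u2', 'free', '2023-10-05');
""")

rows = conn.execute("""
SELECT user_id, plan FROM (
  SELECT user_id, plan,
         ROW_NUMBER() OVER (
           PARTITION BY user_id
           ORDER BY updated_at DESC, plan   -- explicit order: latest state wins
         ) AS rn
  FROM user_state
)
WHERE rn = 1
ORDER BY user_id
""").fetchall()
print(rows)  # [('u1', 'pro'), ('u2', 'free')]
```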

Practical projects

  • Lead conversion: Build a dataset predicting if a lead converts in 14 days using web events and CRM updates.
  • Next purchase: Dataset predicting a user purchase in 30 days from ecommerce orders and page views.
  • Churn intent: Dataset predicting customer churn in 30 days from support tickets, logins, and payments.

Learning path

  1. SQL filtering and date arithmetic (NOW, INTERVAL, date truncation).
  2. GROUP BY and conditional aggregation (FILTER, CASE WHEN).
  3. JOIN strategies: pre-aggregation and dimensional joins.
  4. Window functions for dedup and rolling features.
  5. Time-based splits and dataset validation.

Next steps

  • Automate your extraction with scheduled queries.
  • Add feature stores or views for reusability.
  • Document label and feature definitions with exact SQL snippets.

Mini challenge

Create a weekly rolling dataset for prediction dates every Monday in November 2023. Build features from the prior 14 days and a 14-day label window. Ensure one row per (user_id, prediction_date). Add a query that reports positive rate per week.
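A small helper sketch for picking the mini challenge's prediction dates: enumerating every Monday in November 2023 with the standard library (weekday() == 0 means Monday):

```python
from datetime import date, timedelta

# Every Monday in November 2023 as candidate prediction dates.
day = date(2023, 11, 1)
day += timedelta(days=(7 - day.weekday()) % 7)  # roll forward to a Monday
mondays = []
while day.month == 11:
    mondays.append(day)
    day += timedelta(days=7)

print([d.isoformat() for d in mondays])
# ['2023-11-06', '2023-11-13', '2023-11-20', '2023-11-27']
```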


Data Extraction For Modeling — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
