Why this matters
As a Data Scientist, you often need to train models to predict something: conversion, churn, fraud, upsell, or revenue. In SQL, this means building a reliable label (outcome) for each entity and timestamp, such as user-on-day or order-at-time. Good labels make your model meaningful; bad labels cause leakage and misleading metrics.
- Product: Predict signup-to-purchase within 7 days to optimize onboarding.
- Growth: Predict which users will convert this week for targeted emails.
- Ops: Flag orders that become chargebacks within 60 days.
- Retention: Identify accounts likely to churn next month.
Why it’s tricky
- You must freeze features at an as_of_date and only look forward to compute the label.
- Window boundaries (inclusive/exclusive) change results.
- Data gaps and late events create false positives/negatives.
Concept explained simply
Label (Outcome): The thing you want to predict, computed in a defined future window relative to a specific as_of_date.
Mental model: Imagine a pause button. At each as_of_date, you freeze the world (features). Then you press play and peek only into the approved future window to decide if the outcome happened. Anything outside this allowed future window is invisible.
- Entity: user_id, account_id, order_id, etc.
- As_of_date: when features are known (snapshot time).
- Lookback window: how far back you compute features (e.g., past 30 days).
- Label window: how far forward you look to compute the outcome (e.g., next 7 days).
- Leakage: when label or features use information that was not available at as_of_date.
Label design patterns
- Binary conversion: label = 1 if event occurs in (as_of_date, as_of_date + N] days.
- Churn by inactivity: label = 1 if no activity in (as_of_date, as_of_date + N] days.
- Time-to-event: label = days until event; censored if event not observed by end of window.
- Regression: label = numeric outcome (e.g., revenue in next 30 days).
Boundary choices (inclusive/exclusive)
Common leakage-safe boundary for binary conversion: event_time > as_of_time AND event_time <= as_of_time + window. Excluding the exact as_of moment reduces the chance of accidentally including events recorded later the same day.
Worked examples
Example 1: Conversion within 7 days of signup (binary)
-- Entities at as_of_date = signup_date
SELECT u.user_id,
u.signup_date AS as_of_date,
CASE WHEN EXISTS (
SELECT 1 FROM events e
WHERE e.user_id = u.user_id
AND e.event_type = 'purchase'
AND e.event_date > u.signup_date
AND e.event_date <= u.signup_date + INTERVAL '7 days'
) THEN 1 ELSE 0 END AS label_converted_7d
FROM users u;
-- Adjust date functions to your SQL engine if needed.
Example 2: Churn in next 30 days (inactivity)
-- Label = 1 if user has NO activity in the next 30 days after as_of_date
WITH obs AS (
SELECT user_id, as_of_date
FROM user_daily_snapshots -- one row per user and day
)
SELECT o.user_id, o.as_of_date,
CASE WHEN NOT EXISTS (
SELECT 1 FROM events e
WHERE e.user_id = o.user_id
AND e.event_date > o.as_of_date
AND e.event_date <= o.as_of_date + INTERVAL '30 days'
) THEN 1 ELSE 0 END AS label_churn_30d
FROM obs o;
Example 3: Next 30-day revenue (regression)
SELECT a.account_id,
a.as_of_date,
COALESCE(SUM(CASE WHEN o.order_date > a.as_of_date
AND o.order_date <= a.as_of_date + INTERVAL '30 days'
THEN o.amount ELSE 0 END), 0) AS label_rev_30d
FROM account_as_of a
LEFT JOIN orders o
ON o.account_id = a.account_id
AND o.order_date > a.as_of_date
AND o.order_date <= a.as_of_date + INTERVAL '30 days'
GROUP BY a.account_id, a.as_of_date;
Edge cases to consider
- Multiple events: use EXISTS for binary labels or aggregate for regression.
- Time zones: align to a single timezone before computing windows.
- Late-arriving data: keep a data freshness date to avoid partial windows.
Step-by-step recipe
- Define the question: What outcome do we predict and over what future window?
- Choose the as_of_date: When features are valid/frozen.
- Specify exact boundaries: e.g., (as_of_date, as_of_date + 7 days].
- Select the cohort: Which entities and which as_of_dates are included.
- Write the label query: Use EXISTS/NOT EXISTS or aggregates with strict future filters.
- Validate: Spot-check a few entities to confirm boundary logic.
- Persist: Save labels with entity_id, as_of_date, label, and window configuration.
Exercises
These exercises mirror the tasks below. Complete them, then check the solution and expected output.
Exercise ex1: Build a 7-day conversion label from signup
Goal: For each user, produce a single row at as_of_date = signup_date with label_converted_7d = 1 if the user has a purchase in the next 7 days (excluding the same day), else 0.
Sample data (use as CTEs or temp tables)
-- Users
user_id | signup_date
--------+------------
1 | 2021-06-01
2 | 2021-06-03
3 | 2021-06-05
-- Events
user_id | event_type | event_date
--------+------------+-----------
1 | purchase | 2021-06-06
1 | purchase | 2021-06-20
2 | page_view | 2021-06-04
2 | purchase | 2021-06-12
3 | page_view | 2021-06-06
3 | purchase | 2021-06-20
Rules:
- Window: (signup_date, signup_date + 7 days]
- Only event_type = 'purchase' counts
- Return columns: user_id, as_of_date, label_converted_7d
Expected output (exact)
user_id | as_of_date | label_converted_7d
--------+------------+-------------------
1 | 2021-06-01 | 1
2 | 2021-06-03 | 0
3 | 2021-06-05 | 0
- Self-check: Verify user 1 has a purchase within 5 days; users 2 and 3 purchase after 7 days.
Checklist before moving on
- [ ] I used strict forward-looking filters (no same-day events).
- [ ] I can explain why the boundary is exclusive on the start and inclusive on the end.
- [ ] I validated at least two users by hand.
Common mistakes and how to self-check
- Leakage from same-day events: Using
>= as_of_dateinstead of> as_of_date. Fix: exclude same-day or define a clear cutoff timestamp. - Wrong window end: Using
<instead of<=unintentionally drops events on the last day. Confirm with spot checks on boundary dates. - Mixing features and labels: Building labels in the same query that computes features can sneak future info into features. Keep label logic separate and join later by (entity_id, as_of_date).
- Duplicate observations: Multiple rows per (entity_id, as_of_date). Enforce primary keys and aggregate carefully.
- Time zone drift: Events recorded in different time zones. Normalize time zones before windows.
- Partial windows (censoring): End-of-dataset lacks full future window. Either drop those rows or mark as censored and exclude from training.
Self-audit mini procedure
- Pick 5 entities near the window boundary and verify labels manually.
- Compute label rates by cohort and compare to business intuition.
- Shift the window by ±1 day and confirm labels change as expected.
Practical projects
- Churn labeling pipeline: Daily as_of snapshots, 30-day inactivity label, persisted to a labels table with metadata (window size, boundaries).
- Conversion labeling: Signup cohort, 7-day conversion label, A/B test group awareness (avoid leakage from experiment exposure recorded after as_of_date).
- Revenue regression label: Next-30-day revenue for accounts; handle partial windows at dataset end by censoring or excluding.
Who this is for
- Data Scientists and ML Engineers preparing supervised learning datasets.
- Analysts who need robust forward-looking metrics without leakage.
Prerequisites
- Intermediate SQL: joins, GROUP BY, CASE, and basic date functions.
- Familiarity with supervised learning goals (binary/regression).
- Comfort with time concepts (time zones, inclusive/exclusive bounds).
Learning path
- Master time filtering and window boundaries.
- Practice binary labels (conversion, fraud) using EXISTS.
- Practice inactivity labels (churn) using NOT EXISTS.
- Build regression labels (next-30-day revenue) using SUM filters.
- Create a reusable labels table keyed by (entity_id, as_of_date).
- Validate with manual checks and boundary tests.
Next steps
- Join labels to feature snapshots and build a train/test split by as_of_date.
- Track label distribution over time to detect drift.
- Document your window definitions and boundaries for reproducibility.
Mini challenge
Extend Example 1 to also compute a 14-day conversion label in the same query, then compare 7-day vs 14-day conversion rates. What changes in your boundary conditions and how does the positive rate shift?
Quick test
The quick test is available to everyone. Only logged-in users will have their results and progress saved.