How to learn Building Labels And Outcomes for SQL in Data Scientist for free

Why this matters

As a Data Scientist, you often need to train models to predict something: conversion, churn, fraud, upsell, or revenue. In SQL, this means building a reliable label (outcome) for each entity and timestamp, such as user-on-day or order-at-time. Good labels make your model meaningful; bad labels cause leakage and misleading metrics.

Product: Predict signup-to-purchase within 7 days to optimize onboarding.
Growth: Predict which users will convert this week for targeted emails.
Ops: Flag orders that become chargebacks within 60 days.
Retention: Identify accounts likely to churn next month.

Why it’s tricky

You must freeze features at an as_of_date and only look forward to compute the label.
Window boundaries (inclusive/exclusive) change results.
Data gaps and late events create false positives/negatives.

Concept explained simply

Label (Outcome): The thing you want to predict, computed in a defined future window relative to a specific as_of_date.

Mental model: Imagine a pause button. At each as_of_date, you freeze the world (features). Then you press play and peek only into the approved future window to decide if the outcome happened. Anything outside this allowed future window is invisible.

Entity: user_id, account_id, order_id, etc.
As_of_date: when features are known (snapshot time).
Lookback window: how far back you compute features (e.g., past 30 days).
Label window: how far forward you look to compute the outcome (e.g., next 7 days).
Leakage: when label or features use information that was not available at as_of_date.

Label design patterns

Binary conversion: label = 1 if event occurs in (as_of_date, as_of_date + N] days.
Churn by inactivity: label = 1 if no activity in (as_of_date, as_of_date + N] days.
Time-to-event: label = days until event; censored if event not observed by end of window.
Regression: label = numeric outcome (e.g., revenue in next 30 days).

Boundary choices (inclusive/exclusive)

Common leakage-safe boundary for binary conversion: event_time > as_of_time AND event_time <= as_of_time + window. Excluding the exact as_of moment reduces the chance of accidentally including events recorded later the same day.

Worked examples

-- Entities at as_of_date = signup_date
SELECT u.user_id,
       u.signup_date AS as_of_date,
       CASE WHEN EXISTS (
         SELECT 1 FROM events e
         WHERE e.user_id = u.user_id
           AND e.event_type = 'purchase'
           AND e.event_date > u.signup_date
           AND e.event_date <= u.signup_date + INTERVAL '7 days'
       ) THEN 1 ELSE 0 END AS label_converted_7d
FROM users u;
-- Adjust date functions to your SQL engine if needed.

Example 2: Churn in next 30 days (inactivity)

-- Label = 1 if user has NO activity in the next 30 days after as_of_date
WITH obs AS (
  SELECT user_id, as_of_date
  FROM user_daily_snapshots  -- one row per user and day
)
SELECT o.user_id, o.as_of_date,
       CASE WHEN NOT EXISTS (
         SELECT 1 FROM events e
         WHERE e.user_id = o.user_id
           AND e.event_date > o.as_of_date
           AND e.event_date <= o.as_of_date + INTERVAL '30 days'
       ) THEN 1 ELSE 0 END AS label_churn_30d
FROM obs o;

Example 3: Next 30-day revenue (regression)

SELECT a.account_id,
       a.as_of_date,
       COALESCE(SUM(CASE WHEN o.order_date > a.as_of_date
                         AND o.order_date <= a.as_of_date + INTERVAL '30 days'
                    THEN o.amount ELSE 0 END), 0) AS label_rev_30d
FROM account_as_of a
LEFT JOIN orders o
  ON o.account_id = a.account_id
  AND o.order_date > a.as_of_date
  AND o.order_date <= a.as_of_date + INTERVAL '30 days'
GROUP BY a.account_id, a.as_of_date;

Edge cases to consider

Multiple events: use EXISTS for binary labels or aggregate for regression.
Time zones: align to a single timezone before computing windows.
Late-arriving data: keep a data freshness date to avoid partial windows.

Step-by-step recipe

Define the question: What outcome do we predict and over what future window?
Choose the as_of_date: When features are valid/frozen.
Specify exact boundaries: e.g., (as_of_date, as_of_date + 7 days].
Select the cohort: Which entities and which as_of_dates are included.
Write the label query: Use EXISTS/NOT EXISTS or aggregates with strict future filters.
Validate: Spot-check a few entities to confirm boundary logic.
Persist: Save labels with entity_id, as_of_date, label, and window configuration.

Exercises

These exercises mirror the tasks below. Complete them, then check the solution and expected output.

Goal: For each user, produce a single row at as_of_date = signup_date with label_converted_7d = 1 if the user has a purchase in the next 7 days (excluding the same day), else 0.

Sample data (use as CTEs or temp tables)

-- Users
user_id | signup_date
--------+------------
1       | 2021-06-01
2       | 2021-06-03
3       | 2021-06-05

-- Events
user_id | event_type | event_date
--------+------------+-----------
1       | purchase   | 2021-06-06
1       | purchase   | 2021-06-20
2       | page_view  | 2021-06-04
2       | purchase   | 2021-06-12
3       | page_view  | 2021-06-06
3       | purchase   | 2021-06-20

Rules:

Window: (signup_date, signup_date + 7 days]
Only event_type = 'purchase' counts
Return columns: user_id, as_of_date, label_converted_7d

Expected output (exact)

user_id | as_of_date | label_converted_7d
--------+------------+-------------------
1       | 2021-06-01 | 1
2       | 2021-06-03 | 0
3       | 2021-06-05 | 0

Self-check: Verify user 1 has a purchase within 5 days; users 2 and 3 purchase after 7 days.

Checklist before moving on

[ ] I used strict forward-looking filters (no same-day events).
[ ] I can explain why the boundary is exclusive on the start and inclusive on the end.
[ ] I validated at least two users by hand.

Common mistakes and how to self-check

Leakage from same-day events: Using >= as_of_date instead of > as_of_date. Fix: exclude same-day or define a clear cutoff timestamp.
Wrong window end: Using < instead of <= unintentionally drops events on the last day. Confirm with spot checks on boundary dates.
Mixing features and labels: Building labels in the same query that computes features can sneak future info into features. Keep label logic separate and join later by (entity_id, as_of_date).
Duplicate observations: Multiple rows per (entity_id, as_of_date). Enforce primary keys and aggregate carefully.
Time zone drift: Events recorded in different time zones. Normalize time zones before windows.
Partial windows (censoring): End-of-dataset lacks full future window. Either drop those rows or mark as censored and exclude from training.

Self-audit mini procedure

Pick 5 entities near the window boundary and verify labels manually.
Compute label rates by cohort and compare to business intuition.
Shift the window by ±1 day and confirm labels change as expected.

Practical projects

Churn labeling pipeline: Daily as_of snapshots, 30-day inactivity label, persisted to a labels table with metadata (window size, boundaries).
Conversion labeling: Signup cohort, 7-day conversion label, A/B test group awareness (avoid leakage from experiment exposure recorded after as_of_date).
Revenue regression label: Next-30-day revenue for accounts; handle partial windows at dataset end by censoring or excluding.

Who this is for

Data Scientists and ML Engineers preparing supervised learning datasets.
Analysts who need robust forward-looking metrics without leakage.

Prerequisites

Intermediate SQL: joins, GROUP BY, CASE, and basic date functions.
Familiarity with supervised learning goals (binary/regression).
Comfort with time concepts (time zones, inclusive/exclusive bounds).

Learning path

Master time filtering and window boundaries.
Practice binary labels (conversion, fraud) using EXISTS.
Practice inactivity labels (churn) using NOT EXISTS.
Build regression labels (next-30-day revenue) using SUM filters.
Create a reusable labels table keyed by (entity_id, as_of_date).
Validate with manual checks and boundary tests.

Next steps

Join labels to feature snapshots and build a train/test split by as_of_date.
Track label distribution over time to detect drift.
Document your window definitions and boundaries for reproducibility.

Mini challenge

Extend Example 1 to also compute a 14-day conversion label in the same query, then compare 7-day vs 14-day conversion rates. What changes in your boundary conditions and how does the positive rate shift?

Quick test

The quick test is available to everyone. Only logged-in users will have their results and progress saved.

Instructions

Write a SQL query that outputs one row per user with:

as_of_date = signup_date
label_converted_7d = 1 if the user has a purchase in (signup_date, signup_date + 7 days], else 0

Use the sample data below (you can create CTEs or temp tables):

-- Users
user_id | signup_date
--------+------------
1       | 2021-06-01
2       | 2021-06-03
3       | 2021-06-05

-- Events
user_id | event_type | event_date
--------+------------+-----------
1       | purchase   | 2021-06-06
1       | purchase   | 2021-06-20
2       | page_view  | 2021-06-04
2       | purchase   | 2021-06-12
3       | page_view  | 2021-06-06
3       | purchase   | 2021-06-20

Columns required: user_id, as_of_date, label_converted_7d. Avoid leakage by excluding events on signup_date.

Menu

Building Labels And Outcomes

Table of Contents

Why this matters

Concept explained simply

Label design patterns

Worked examples

Example 2: Churn in next 30 days (inactivity)

Example 3: Next 30-day revenue (regression)

Step-by-step recipe

Exercises

Checklist before moving on

Common mistakes and how to self-check

Practical projects

Who this is for

Prerequisites

Learning path

Next steps

Mini challenge

Quick test

Practice Exercises

Build a 7-day conversion label from signup

Instructions

Expected Output

Building Labels And Outcomes — Quick Test

Have questions about Building Labels And Outcomes?

AI Assistant

Menu

Building Labels And Outcomes

Table of Contents

Why this matters

Concept explained simply

Label design patterns

Worked examples

Example 1: Conversion within 7 days of signup (binary)

Example 2: Churn in next 30 days (inactivity)

Example 3: Next 30-day revenue (regression)

Step-by-step recipe

Exercises

Exercise ex1: Build a 7-day conversion label from signup

Checklist before moving on

Common mistakes and how to self-check

Practical projects

Who this is for

Prerequisites

Learning path

Next steps

Mini challenge

Quick test

Practice Exercises

Build a 7-day conversion label from signup

Instructions

Expected Output

Building Labels And Outcomes — Quick Test

Have questions about Building Labels And Outcomes?

AI Assistant