Why SQL matters for Data Scientists
SQL is the fastest way to turn raw data into modeling-ready datasets. As a Data Scientist, you will use SQL to extract cohorts, join multiple sources, create time-aware features, build labels without leakage, and validate data quality before modeling. Mastering SQL lets you answer product questions quickly, iterate features safely, and ship reproducible experiments.
What this unlocks in your day-to-day
- Build a modeling base table (MBT) at the right grain (user-day, session, order, etc.).
- Create aggregations and rolling features directly in the warehouse.
- Prevent target leakage with snapshot dates and careful filters.
- Sample, balance, and audit datasets for fair evaluation.
- Write performance-aware queries on large tables.
Who this is for
- Data Scientists and ML Engineers who need reliable datasets for modeling and experimentation.
- Analysts transitioning into DS who want stronger feature engineering and labeling in SQL.
- Students building portfolios with reproducible, warehouse-first workflows.
Prerequisites
- Basic SQL: SELECT, WHERE, GROUP BY, ORDER BY.
- Comfort with joins (INNER/LEFT).
- Familiarity with data types, NULLs, and timestamps.
- Some understanding of model training data (features, labels, splits).
Learning path
- Querying & joining confidently (Milestone 1)
Goal: Extract clean slices from one or two tables.
Focus: WHERE, GROUP BY, DISTINCT, INNER/LEFT JOIN.
Output: A tidy dataset with a clear grain (e.g., user-level).
Time: 2–4 hours.
- Feature aggregations (Milestone 2)
Goal: Compute robust per-entity features.
Focus: SUM, COUNT DISTINCT, AVG, conditional aggregates.
Output: Feature table aligned to modeling grain.
Time: 2–4 hours.
- Window functions & time features (Milestone 3)
Goal: Rolling metrics, ranks, gaps, and trends.
Focus: PARTITION BY, ORDER BY, frames (ROWS/RANGE).
Output: Time-aware features without leakage.
Time: 3–5 hours.
- Labels and cohorts (Milestone 4)
Goal: Build outcomes with snapshot logic.
Focus: NOT EXISTS, anti-joins, time windows.
Output: Labeled dataset ready for training/validation splits.
Time: 3–5 hours.
- Data quality & performance (Milestone 5)
Goal: Trustworthy and efficient queries.
Focus: sanity checks, deduping, partitions, pruning, projection.
Output: Validated, performant SQL pipeline.
Time: 2–4 hours.
Milestone checklist
- I can state the grain of every result set I produce (a quick uniqueness check is sketched after this list).
- My features and labels are computed at the same grain.
- My label logic avoids peeking into the future.
- I run basic data quality checks on every dataset.
- My queries finish quickly on large tables by filtering early.
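A fast way to verify the first two items is a uniqueness check at the intended grain. A minimal sketch, assuming a hypothetical feature table features_user_month keyed by (user_id, month_end); substitute your own table and keys:
-- Should return zero rows if the grain holds
SELECT user_id, month_end, COUNT(*) AS n_rows
FROM features_user_month  -- hypothetical table name
GROUP BY user_id, month_end
HAVING COUNT(*) > 1;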
Worked examples
1) Joining sources to extract a modeling base table
Goal: Build a user-level table with signup info and last order metrics.
-- Tables:
-- users(user_id, signup_date, country)
-- orders(order_id, user_id, order_date, amount)
WITH last_order AS (
SELECT
o.user_id,
MAX(o.order_date) AS last_order_date,
MAX_BY(o.amount, o.order_date) AS last_order_amount -- use MAX_BY if available; else window
FROM orders o
GROUP BY o.user_id
)
SELECT
u.user_id,
u.signup_date,
u.country,
l.last_order_date,
l.last_order_amount
FROM users u
LEFT JOIN last_order l
ON u.user_id = l.user_id;
Notes
- If MAX_BY is unavailable, compute last_order_amount with a window function: keep the rows where ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_date DESC) equals 1 (sketched after these notes).
- Grain is one row per user.
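A minimal sketch of the window-function fallback mentioned in the notes, using the same users/orders schema:
WITH ranked_orders AS (
  SELECT
    o.user_id,
    o.order_date,
    o.amount,
    ROW_NUMBER() OVER (PARTITION BY o.user_id ORDER BY o.order_date DESC) AS rn
  FROM orders o
)
SELECT
  user_id,
  order_date AS last_order_date,
  amount AS last_order_amount
FROM ranked_orders
WHERE rn = 1;  -- rn = 1 keeps each user's most recent order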
2) Aggregations for feature prep
Goal: 30/90-day purchase features per user aligned to a reference date.
-- Assume ref_dates(user_id, ref_date) contains snapshot dates per user.
WITH orders_lookback AS (
SELECT r.user_id, r.ref_date, o.order_id, o.order_date, o.amount
FROM ref_dates r
LEFT JOIN orders o
ON o.user_id = r.user_id
AND o.order_date <= r.ref_date
AND o.order_date > r.ref_date - INTERVAL '90' DAY
)
SELECT
user_id,
ref_date,
COUNT(CASE WHEN order_date > ref_date - INTERVAL '30' DAY THEN 1 END) AS orders_30d,
SUM(CASE WHEN order_date > ref_date - INTERVAL '30' DAY THEN amount END) AS revenue_30d,
COUNT(order_id) AS orders_90d, -- COUNT(*) would count the NULL row from users with no orders
AVG(amount) AS avg_amount_90d
FROM orders_lookback
GROUP BY user_id, ref_date;
Notes
- Use conditional aggregates to target different windows.
- All features are computed using data on or before ref_date (no leakage).
3) Window functions for time features
Goal: Rolling 7-day order count per user per day.
-- dates(dt) is a calendar table; useful to avoid gaps.
WITH user_days AS (
SELECT u.user_id, d.dt
FROM users u
JOIN dates d ON d.dt BETWEEN u.signup_date AND CURRENT_DATE
), daily_orders AS (
SELECT o.user_id, CAST(o.order_date AS DATE) AS dt, COUNT(*) AS orders_day
FROM orders o
GROUP BY o.user_id, CAST(o.order_date AS DATE)
)
SELECT
ud.user_id,
ud.dt,
COALESCE(od.orders_day, 0) AS orders_day,
SUM(COALESCE(od.orders_day, 0)) OVER (
PARTITION BY ud.user_id
ORDER BY ud.dt
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS orders_7d_roll
FROM user_days ud
LEFT JOIN daily_orders od  -- "do" is a reserved word in some engines (e.g., PostgreSQL)
ON od.user_id = ud.user_id AND od.dt = ud.dt;
Notes
- ROWS BETWEEN 6 PRECEDING AND CURRENT ROW covers 7 rows; that equals 7 days only when there is one row per day.
- Calendar join ensures correct rolling windows even on days without orders.
4) Building labels and outcomes (no leakage)
Goal: Churn label per user-month: 1 if no orders in the 30 days after month_end.
-- user_months(user_id, month_end) provides monthly snapshots per user.
SELECT
um.user_id,
um.month_end,
CASE WHEN NOT EXISTS (
SELECT 1
FROM orders o
WHERE o.user_id = um.user_id
AND o.order_date > um.month_end
AND o.order_date <= um.month_end + INTERVAL '30' DAY
) THEN 1 ELSE 0 END AS is_churned_30d
FROM user_months um;
Notes
- NOT EXISTS with a future window relative to month_end defines the outcome cleanly; an equivalent anti-join form is sketched after these notes.
- Never use data after the labeling horizon to build features.
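The same label can also be written as a LEFT JOIN anti-join, which some engines plan differently. A sketch under the same table assumptions, aggregating back to the snapshot grain so the many-side join cannot duplicate rows:
SELECT
  um.user_id,
  um.month_end,
  CASE WHEN COUNT(o.order_id) = 0 THEN 1 ELSE 0 END AS is_churned_30d
FROM user_months um
LEFT JOIN orders o
  ON o.user_id = um.user_id
  AND o.order_date > um.month_end
  AND o.order_date <= um.month_end + INTERVAL '30' DAY
GROUP BY um.user_id, um.month_end;  -- restores one row per (user_id, month_end)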
5) Sampling and cohort building
Goal: 10% stratified sample per country for quick iteration.
WITH ranked AS (
SELECT
u.*,
NTILE(10) OVER (PARTITION BY country ORDER BY RANDOM()) AS bucket
FROM users u
)
SELECT *
FROM ranked
WHERE bucket = 1;
Notes
- NTILE(10) over a random order yields ~10% per country.
- For deterministic splits, replace RANDOM() with a hash of stable keys (a sketch follows).
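A minimal deterministic variant, assuming your engine exposes an MD5 function over strings (PostgreSQL, MySQL, and DuckDB do; adjust the function name otherwise). Ordering by a hash of the stable key reproduces the same buckets on every run:
WITH ranked AS (
  SELECT
    u.*,
    -- MD5 of the key gives a repeatable pseudo-random order
    NTILE(10) OVER (PARTITION BY country ORDER BY MD5(CAST(user_id AS VARCHAR))) AS bucket
  FROM users u
)
SELECT *
FROM ranked
WHERE bucket = 1;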
Drills and exercises
- Write a query that produces one row per user with their first and last order dates.
- Compute COUNT(DISTINCT session_id) per user for the past 14 days using a reference date table.
- Create a 7-day rolling average order amount per user using a window frame.
- Build a label: did the user purchase within 7 days after ref_date? No leakage.
- Produce a balanced sample with equal positive/negative labels using window functions.
- Detect duplicate primary keys in a dimension table and output offending keys.
- Write a query that flags NULL-suspicious columns (e.g., country missing for active users).
- Optimize a slow join by filtering early and selecting only needed columns.
- Create a cohort by signup month and compute 30/60/90-day retention rates.
- Convert event timestamps to dates safely across time zones (document your assumption).
Common mistakes and debugging tips
- Unstated grain: If you can’t name the grain, expect wrong joins and duplicates. Tip: compare COUNT(*) before and after a join to verify expected multiplicities (a sketch follows this list).
- Target leakage: Features computed with data after the snapshot date. Tip: Always filter source tables with order_date <= ref_date when building features.
- Wrong join type: INNER join drops users with no orders; usually you want LEFT to keep the full cohort.
- GROUP BY mismatch: Selecting columns not in GROUP BY without aggregates. Tip: Use window functions or aggregate properly.
- Window frames: Omitting frames gives cumulative values, not rolling windows. Tip: Specify ROWS BETWEEN ... PRECEDING AND CURRENT ROW.
- NULL mishandling: Aggregations skip NULLs; COALESCE to avoid surprises.
- Performance pitfalls: Selecting all columns, joining before filtering, no partition pruning. Tip: Project minimal columns, filter early, use date partitions when available.
- Duplicate keys in features: Many-to-many joins create multiple feature rows per entity. Tip: Pre-aggregate before joining to the main grain.
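A minimal sketch of the row-count audit from the first tip, assuming a hypothetical pre-aggregated feature table last_order_features with one row per user:
-- The two counts should match if the join preserves the user grain
SELECT
  (SELECT COUNT(*) FROM users) AS users_rows,
  (SELECT COUNT(*)
   FROM users u
   LEFT JOIN last_order_features f  -- hypothetical table name
     ON f.user_id = u.user_id) AS joined_rows;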
Quick debugging checklist
- Count rows before and after each join.
- Verify key uniqueness in both sides.
- Check min/max timestamps against your snapshot date (a leakage check is sketched after this list).
- Inspect 10 random entities end-to-end for correctness.
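A minimal sketch of the timestamp check, assuming the orders_lookback CTE from worked example 2 has been materialized as a table; any order date later than its ref_date signals leakage:
-- Should return zero rows: no feature input may postdate its snapshot
SELECT user_id, ref_date, MAX(order_date) AS max_order_date
FROM orders_lookback  -- assumed materialized from the example's CTE
GROUP BY user_id, ref_date
HAVING MAX(order_date) > ref_date;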
Mini project: End-to-end churn dataset
Build a reproducible SQL pipeline to predict 30-day churn.
- Define snapshots: Create user_months(user_id, month_end) for the last 12 months (a sketch follows this list).
- Features: For each snapshot, compute orders_30d, revenue_30d, orders_90d, avg_amount_90d, and a 7-day rolling count on the final 7 days before month_end.
- Labels: is_churned_30d using NOT EXISTS in the 30 days after month_end.
- Cohort: Keep users who signed up at least 60 days before month_end.
- Quality checks: Row counts per month; proportion positive labels; duplicates by (user_id, month_end) = 0.
- Performance: Use calendar joins and filter source tables by date ranges before heavy joins.
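A minimal sketch of the snapshot and cohort steps, reusing the calendar table dates(dt) from worked example 3; DATE_TRUNC is assumed available (PostgreSQL, DuckDB, Trino, and others):
CREATE TABLE user_months AS
SELECT
  u.user_id,
  m.month_end
FROM users u
JOIN (
  -- Last calendar day of each of the trailing 12 complete months
  SELECT MAX(dt) AS month_end
  FROM dates
  WHERE dt >= DATE_TRUNC('month', CURRENT_DATE) - INTERVAL '12' MONTH
    AND dt < DATE_TRUNC('month', CURRENT_DATE)
  GROUP BY DATE_TRUNC('month', dt)
) m
  ON u.signup_date <= m.month_end - INTERVAL '60' DAY;  -- 60-day eligibility rule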
Acceptance criteria
- Exactly one row per (user_id, month_end).
- No future data used for features.
- All numeric features filled (COALESCE where needed).
- Pipeline runs under a reasonable time on your sample.
Stretch goals
- Add session-based features (views_7d, add_to_cart_7d).
- Create train/validation/test splits by month to mimic production.
- Export final table for modeling and report feature importances.
Subskills
- Data Extraction For Modeling: Reproducible, grain-aware slices from raw tables.
- Joining Multiple Data Sources: Safe keys, correct join types, deduplication.
- Aggregations For Feature Prep: Conditional aggregates, distinct counts, pre-aggregation.
- Window Functions For Time Features: Rolling metrics, lags, ranks, trends.
- Building Labels And Outcomes: Snapshot dates, NOT EXISTS, leakage prevention.
- Sampling And Cohort Building: Stratified samples, eligibility filters, balancing.
- Data Quality Checks In SQL: Row-level tests, referential checks, null/dupe audits.
- Performance Aware Queries On Large Tables: Partition pruning, early filters, minimal projection.
Practical projects
- Conversion funnel features: From events, build per-user funnel rates and 14/30-day conversion labels.
- Time-to-next-order modeling: Create survival-style features and a hazard label using time windows.
- Content recommendation prep: User-item interactions with recency/decay features and a next-week click label.
Next steps
- Go deeper on window frames (RANGE vs ROWS) and handling sparse calendars; a short sketch follows this list.
- Adopt a consistent snapshot table pattern for any prediction task.
- Automate data quality checks as part of your SQL workflows.
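As a starting point on frames: over a sparse table with missing days, ROWS counts rows while RANGE counts time. A minimal sketch, assuming the daily_orders CTE from worked example 3 is materialized and your engine supports interval-based RANGE frames (PostgreSQL 11+ does):
SELECT
  user_id,
  dt,
  -- RANGE spans 7 days of time even when days are missing;
  -- ROWS BETWEEN 6 PRECEDING would span 7 rows, however far apart in time
  SUM(orders_day) OVER (
    PARTITION BY user_id
    ORDER BY dt
    RANGE BETWEEN INTERVAL '6' DAY PRECEDING AND CURRENT ROW
  ) AS orders_7d_roll
FROM daily_orders;  -- assumed materialized from the example's CTE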