Why SQL matters for Data Scientists
SQL is the fastest way to turn raw data into modeling-ready datasets. As a Data Scientist, you will use SQL to extract cohorts, join multiple sources, create time-aware features, build labels without leakage, and validate data quality before modeling. Mastering SQL lets you answer product questions quickly, iterate features safely, and ship reproducible experiments.
What this unlocks in your day-to-day
- Build a modeling base table (MBT) at the right grain (user-day, session, order, etc.).
- Create aggregations and rolling features directly in the warehouse.
- Prevent target leakage with snapshot dates and careful filters.
- Sample, balance, and audit datasets for fair evaluation.
- Write performance-aware queries on large tables.
Who this is for
- Data Scientists and ML Engineers who need reliable datasets for modeling and experimentation.
- Analysts transitioning into DS who want stronger feature engineering and labeling in SQL.
- Students building portfolios with reproducible, warehouse-first workflows.
Prerequisites
- Basic SQL: SELECT, WHERE, GROUP BY, ORDER BY.
- Comfort with joins (INNER/LEFT).
- Familiarity with data types, NULLs, and timestamps.
- Some understanding of model training data (features, labels, splits).
Learning path
- Querying & joining confidently (Milestone 1)
Goal: Extract clean slices from one or two tables.
Focus: WHERE, GROUP BY, DISTINCT, INNER/LEFT JOIN.
Output: A tidy dataset with a clear grain (e.g., user-level).
Time: 2–4 hours.
- Feature aggregations (Milestone 2)
Goal: Compute robust per-entity features.
Focus: SUM, COUNT DISTINCT, AVG, conditional aggregates.
Output: Feature table aligned to modeling grain.
Time: 2–4 hours.
- Window functions & time features (Milestone 3)
Goal: Rolling metrics, ranks, gaps, and trends.
Focus: PARTITION BY, ORDER BY, frames (ROWS/RANGE).
Output: Time-aware features without leakage.
Time: 3–5 hours.
- Labels and cohorts (Milestone 4)
Goal: Build outcomes with snapshot logic.
Focus: NOT EXISTS, anti-joins, time windows.
Output: Labeled dataset ready for training/validation splits.
Time: 3–5 hours.
- Data quality & performance (Milestone 5)
Goal: Trustworthy and efficient queries.
Focus: sanity checks, deduping, partitions, pruning, projection.
Output: Validated, performant SQL pipeline.
Time: 2–4 hours.
Milestone checklist
- I can state the grain of every result set I produce (a quick uniqueness check is sketched after this list).
- My features and labels are computed at the same grain.
- My label logic avoids peeking into the future.
- I run basic data quality checks on every dataset.
- My queries finish quickly on large tables by filtering early.
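A fast way to verify the first two items is a uniqueness check at the intended grain. A minimal sketch, assuming a hypothetical feature table features_user_month keyed by (user_id, month_end); substitute your own table and keys:
-- Should return zero rows if the grain holds
SELECT user_id, month_end, COUNT(*) AS n_rows
FROM features_user_month  -- hypothetical table name
GROUP BY user_id, month_end
HAVING COUNT(*) > 1;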
Worked examples
1) Joining sources to extract a modeling base table
Goal: Build a user-level table with signup info and last order metrics.
-- Tables:
-- users(user_id, signup_date, country)
-- orders(order_id, user_id, order_date, amount)
WITH last_order AS (
SELECT
o.user_id,
MAX(o.order_date) AS last_order_date,
MAX_BY(o.amount, o.order_date) AS last_order_amount -- use MAX_BY if available; else window
FROM orders o
GROUP BY o.user_id
)
SELECT
u.user_id,
u.signup_date,
u.country,
l.last_order_date,
l.last_order_amount
FROM users u
LEFT JOIN last_order l
ON u.user_id = l.user_id;
Notes
- If MAX_BY is unavailable, compute last_order_amount with a window function: keep the rows where ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_date DESC) equals 1 (sketched after these notes).
- Grain is one row per user.
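A minimal sketch of the window-function fallback mentioned in the notes, using the same users/orders schema:
WITH ranked_orders AS (
  SELECT
    o.user_id,
    o.order_date,
    o.amount,
    ROW_NUMBER() OVER (PARTITION BY o.user_id ORDER BY o.order_date DESC) AS rn
  FROM orders o
)
SELECT
  user_id,
  order_date AS last_order_date,
  amount AS last_order_amount
FROM ranked_orders
WHERE rn = 1;  -- rn = 1 keeps each user's most recent order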
2) Aggregations for feature prep
Goal: 30/90-day purchase features per user aligned to a reference date.
-- Assume ref_dates(user_id, ref_date) contains snapshot dates per user.
WITH orders_lookback AS (
SELECT r.user_id, r.ref_date, o.order_id, o.order_date, o.amount
FROM ref_dates r
LEFT JOIN orders o
ON o.user_id = r.user_id
AND o.order_date <= r.ref_date
AND o.order_date > r.ref_date - INTERVAL '90' DAY
)
SELECT
user_id,
ref_date,
COUNT(CASE WHEN order_date > ref_date - INTERVAL '30' DAY THEN 1 END) AS orders_30d,
SUM(CASE WHEN order_date > ref_date - INTERVAL '30' DAY THEN amount END) AS revenue_30d,
COUNT(order_id) AS orders_90d, -- COUNT(*) would count the NULL row from users with no orders
AVG(amount) AS avg_amount_90d
FROM orders_lookback
GROUP BY user_id, ref_date;
Notes
- Use conditional aggregates to target different windows.
- All features are computed using data on or before ref_date (no leakage).
3) Window functions for time features
Goal: Rolling 7-day order count per user per day.
-- dates(dt) is a calendar table; useful to avoid gaps.
WITH user_days AS (
SELECT u.user_id, d.dt
FROM users u
JOIN dates d ON d.dt BETWEEN u.signup_date AND CURRENT_DATE
), daily_orders AS (
SELECT o.user_id, CAST(o.order_date AS DATE) AS dt, COUNT(*) AS orders_day
FROM orders o
GROUP BY o.user_id, CAST(o.order_date AS DATE)
)
SELECT
ud.user_id,
ud.dt,
COALESCE(od.orders_day, 0) AS orders_day,
SUM(COALESCE(od.orders_day, 0)) OVER (
PARTITION BY ud.user_id
ORDER BY ud.dt
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS orders_7d_roll
FROM user_days ud
LEFT JOIN daily_orders od  -- "do" is a reserved word in some engines (e.g., PostgreSQL)
ON od.user_id = ud.user_id AND od.dt = ud.dt;
Notes
- ROWS BETWEEN 6 PRECEDING AND CURRENT ROW covers 7 rows; that equals 7 days only when there is one row per day.
- Calendar join ensures correct rolling windows even on days without orders.
4) Building labels and outcomes (no leakage)
Goal: Churn label per user-month: 1 if no orders in the 30 days after month_end.
-- user_months(user_id, month_end) provides monthly snapshots per user.
SELECT
um.user_id,
um.month_end,
CASE WHEN NOT EXISTS (
SELECT 1
FROM orders o
WHERE o.user_id = um.user_id
AND o.order_date > um.month_end
AND o.order_date <= um.month_end + INTERVAL '30' DAY
) THEN 1 ELSE 0 END AS is_churned_30d
FROM user_months um;
Notes
- NOT EXISTS with a future window relative to month_end defines the outcome cleanly; an equivalent anti-join form is sketched after these notes.
- Never use data after the labeling horizon to build features.
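The same label can also be written as a LEFT JOIN anti-join, which some engines plan differently. A sketch under the same table assumptions, aggregating back to the snapshot grain so the many-side join cannot duplicate rows:
SELECT
  um.user_id,
  um.month_end,
  CASE WHEN COUNT(o.order_id) = 0 THEN 1 ELSE 0 END AS is_churned_30d
FROM user_months um
LEFT JOIN orders o
  ON o.user_id = um.user_id
  AND o.order_date > um.month_end
  AND o.order_date <= um.month_end + INTERVAL '30' DAY
GROUP BY um.user_id, um.month_end;  -- restores one row per (user_id, month_end)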
5) Sampling and cohort building
Goal: 10% stratified sample per country for quick iteration.
WITH ranked AS (
SELECT
u.*,
NTILE(10) OVER (PARTITION BY country ORDER BY RANDOM()) AS bucket
FROM users u
)
SELECT *
FROM ranked
WHERE bucket = 1;
Notes
- NTILE(10) over a random order yields ~10% per country.
- For deterministic splits, replace RANDOM() with a hash of stable keys (a sketch follows).
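A minimal deterministic variant, assuming your engine exposes an MD5 function over strings (PostgreSQL, MySQL, and DuckDB do; adjust the function name otherwise). Ordering by a hash of the stable key reproduces the same buckets on every run:
WITH ranked AS (
  SELECT
    u.*,
    -- MD5 of the key gives a repeatable pseudo-random order
    NTILE(10) OVER (PARTITION BY country ORDER BY MD5(CAST(user_id AS VARCHAR))) AS bucket
  FROM users u
)
SELECT *
FROM ranked
WHERE bucket = 1;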
Drills and exercises
- Write a query that produces one row per user with their first and last order dates.
- Compute COUNT(DISTINCT session_id) per user for the past 14 days using a reference date table.
- Create a 7-day rolling average order amount per user using a window frame.
- Build a label: did the user purchase within 7 days after ref_date? No leakage.
- Produce a balanced sample with equal positive/negative labels using window functions.
- Detect duplicate primary keys in a dimension table and output offending keys.
- Write a query that flags NULL-suspicious columns (e.g., country missing for active users).
- Optimize a slow join by filtering early and selecting only needed columns.
- Create a cohort by signup month and compute 30/60/90-day retention rates.
- Convert event timestamps to dates safely across time zones (document your assumption).
Common mistakes and debugging tips
- Unstated grain: If you can’t name the grain, expect wrong joins and duplicates. Tip: compare COUNT(*) before and after a join to verify expected multiplicities (a sketch follows this list).
- Target leakage: Features computed with data after the snapshot date. Tip: Always filter source tables with order_date <= ref_date when building features.
- Wrong join type: INNER join drops users with no orders; usually you want LEFT to keep the full cohort.
- GROUP BY mismatch: Selecting columns not in GROUP BY without aggregates. Tip: Use window functions or aggregate properly.
- Window frames: Omitting frames gives cumulative values, not rolling windows. Tip: Specify ROWS BETWEEN ... PRECEDING AND CURRENT ROW.
- NULL mishandling: Aggregations skip NULLs; COALESCE to avoid surprises.
- Performance pitfalls: Selecting all columns, joining before filtering, no partition pruning. Tip: Project minimal columns, filter early, use date partitions when available.
- Duplicate keys in features: Many-to-many joins create multiple feature rows per entity. Tip: Pre-aggregate before joining to the main grain.
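A minimal sketch of the row-count audit from the first tip, assuming a hypothetical pre-aggregated feature table last_order_features with one row per user:
-- The two counts should match if the join preserves the user grain
SELECT
  (SELECT COUNT(*) FROM users) AS users_rows,
  (SELECT COUNT(*)
   FROM users u
   LEFT JOIN last_order_features f  -- hypothetical table name
     ON f.user_id = u.user_id) AS joined_rows;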
Quick debugging checklist
- Count rows before and after each join.
- Verify key uniqueness in both sides.
- Check min/max timestamps against your snapshot date (a leakage check is sketched after this list).
- Inspect 10 random entities end-to-end for correctness.
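A minimal sketch of the timestamp check, assuming the orders_lookback CTE from worked example 2 has been materialized as a table; any order date later than its ref_date signals leakage:
-- Should return zero rows: no feature input may postdate its snapshot
SELECT user_id, ref_date, MAX(order_date) AS max_order_date
FROM orders_lookback  -- assumed materialized from the example's CTE
GROUP BY user_id, ref_date
HAVING MAX(order_date) > ref_date;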
Mini project: End-to-end churn dataset
Build a reproducible SQL pipeline to predict 30-day churn.
- Define snapshots: Create user_months(user_id, month_end) for the last 12 months (a sketch follows this list).
- Features: For each snapshot, compute orders_30d, revenue_30d, orders_90d, avg_amount_90d, and a 7-day rolling count on the final 7 days before month_end.
- Labels: is_churned_30d using NOT EXISTS in the 30 days after month_end.
- Cohort: Keep users who signed up at least 60 days before month_end.
- Quality checks: Row counts per month; proportion positive labels; duplicates by (user_id, month_end) = 0.
- Performance: Use calendar joins and filter source tables by date ranges before heavy joins.
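A minimal sketch of the snapshot and cohort steps, reusing the calendar table dates(dt) from worked example 3; DATE_TRUNC is assumed available (PostgreSQL, DuckDB, Trino, and others):
CREATE TABLE user_months AS
SELECT
  u.user_id,
  m.month_end
FROM users u
JOIN (
  -- Last calendar day of each of the trailing 12 complete months
  SELECT MAX(dt) AS month_end
  FROM dates
  WHERE dt >= DATE_TRUNC('month', CURRENT_DATE) - INTERVAL '12' MONTH
    AND dt < DATE_TRUNC('month', CURRENT_DATE)
  GROUP BY DATE_TRUNC('month', dt)
) m
  ON u.signup_date <= m.month_end - INTERVAL '60' DAY;  -- 60-day eligibility rule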
Acceptance criteria
- Exactly one row per (user_id, month_end).
- No future data used for features.
- All numeric features filled (COALESCE where needed).
- Pipeline runs under a reasonable time on your sample.
Stretch goals
- Add session-based features (views_7d, add_to_cart_7d).
- Create train/validation/test splits by month to mimic production.
- Export final table for modeling and report feature importances.
Subskills
- Data Extraction For Modeling: Reproducible, grain-aware slices from raw tables.
- Joining Multiple Data Sources: Safe keys, correct join types, deduplication.
- Aggregations For Feature Prep: Conditional aggregates, distinct counts, pre-aggregation.
- Window Functions For Time Features: Rolling metrics, lags, ranks, trends.
- Building Labels And Outcomes: Snapshot dates, NOT EXISTS, leakage prevention.
- Sampling And Cohort Building: Stratified samples, eligibility filters, balancing.
- Data Quality Checks In SQL: Row-level tests, referential checks, null/dupe audits.
- Performance Aware Queries On Large Tables: Partition pruning, early filters, minimal projection.
Practical projects
- Conversion funnel features: From events, build per-user funnel rates and 14/30-day conversion labels.
- Time-to-next-order modeling: Create survival-style features and a hazard label using time windows.
- Content recommendation prep: User-item interactions with recency/decay features and a next-week click label.
Next steps
- Go deeper on window frames (RANGE vs ROWS) and handling sparse calendars; a short sketch follows this list.
- Adopt a consistent snapshot table pattern for any prediction task.
- Automate data quality checks as part of your SQL workflows.
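As a starting point on frames: over a sparse table with missing days, ROWS counts rows while RANGE counts time. A minimal sketch, assuming the daily_orders CTE from worked example 3 is materialized and your engine supports interval-based RANGE frames (PostgreSQL 11+ does):
SELECT
  user_id,
  dt,
  -- RANGE spans 7 days of time even when days are missing;
  -- ROWS BETWEEN 6 PRECEDING would span 7 rows, however far apart in time
  SUM(orders_day) OVER (
    PARTITION BY user_id
    ORDER BY dt
    RANGE BETWEEN INTERVAL '6' DAY PRECEDING AND CURRENT ROW
  ) AS orders_7d_roll
FROM daily_orders;  -- assumed materialized from the example's CTE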