Why this matters
Aggregations turn raw events into compact, model-ready features. As a Data Scientist, you will routinely convert clicks, orders, transactions, and logs into per-entity summaries, such as orders in the last 30 days, average spend, or time since last activity. This is essential for churn prediction, fraud detection, recommendations, and forecasting.
- Churn: Count user sessions last 14 days, recency since last login.
- Fraud: Sum of amounts over short windows, distinct payment methods used.
- Recommendations: Number of categories purchased, average rating given.
- Forecasting: Weekly sales per product, moving averages of demand.
Who this is for
- Data Scientists preparing features directly from SQL data sources.
- Analytics Engineers and SQL-savvy analysts supporting ML teams.
- MLOps practitioners validating feature logic in the warehouse.
Prerequisites
- Comfort with SELECT, WHERE, GROUP BY, and basic JOINs.
- Familiarity with timestamps/dates and simple CASE expressions.
- Basic understanding of what a feature table is (one row per entity/time).
Concept explained simply
An aggregation summarizes many rows into one row per entity (e.g., per user_id or product_id) using functions like COUNT, SUM, AVG, MIN, and MAX. For feature prep, you usually:
- Define a time window (e.g., last 30 or 90 days up to a reference date).
- Filter events into that window.
- GROUP BY the entity and compute aggregations.
- Handle nulls (use COALESCE) and safe divisions.
- Join features back to a main entity table (e.g., users). A minimal sketch of these steps follows.
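A minimal sketch, assuming a hypothetical orders(user_id, order_date, amount) table and PostgreSQL syntax:

-- Steps 1-2: define a 30-day window and filter events into it
-- Step 3: group by the entity and aggregate
-- Step 4: handle nulls with COALESCE
SELECT user_id,
       COUNT(*) AS orders_30d,
       COALESCE(SUM(amount), 0) AS spend_30d
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY user_id;
-- Step 5: LEFT JOIN this result back to your users table.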
Mental model
Think of a "+magnifying glass over time+": you slide a window up to a cutoff date, collect all events inside the window, squeeze them into numbers per entity, and attach those numbers as columns in a wide feature table. Repeat for each feature and each time window you need.
What about SQL dialects?
Intervals differ by engine. Example variants:
- PostgreSQL: CURRENT_DATE - INTERVAL '30 days'
- MySQL: order_date >= DATE_SUB(CURDATE(), INTERVAL 30 DAY)
- SQLite: date(order_date) >= date('now','-30 day')
The concept is identical: filter by a window, group, aggregate.
Common feature patterns
- Counts: COUNT(*) of events per entity and window.
- Totals and averages: SUM(amount), AVG(amount), MIN/MAX metrics.
- Distinct counts: COUNT(DISTINCT category_id) to capture variety.
- Conditional sums: SUM(CASE WHEN condition THEN 1 ELSE 0 END).
- Ratios: SUM(flag) / NULLIF(COUNT(*),0) for safe division.
- Recency: CURRENT_DATE - MAX(event_date) as days_since_last_event.
- Frequency: COUNT(*) divided by number of days/weeks in window.
- Last known value: MAX_BY(value, ts) where supported (e.g., DuckDB, Trino, Snowflake), or a conditional aggregate to capture the latest. Several of these patterns are combined in the sketch below.
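A hedged sketch, assuming a hypothetical events(user_id, event_date, amount, is_refund) table (PostgreSQL syntax; adapt names and date math to your engine):

SELECT user_id,
       COUNT(*) AS events_30d,                                       -- count
       COALESCE(SUM(amount), 0) AS total_amount_30d,                 -- total
       COUNT(DISTINCT event_date::date) AS active_days_30d,          -- distinct count
       SUM(CASE WHEN is_refund THEN 1 ELSE 0 END) AS refunds_30d,    -- conditional sum
       SUM(CASE WHEN is_refund THEN 1 ELSE 0 END)::numeric
         / NULLIF(COUNT(*), 0) AS refund_ratio_30d,                  -- safe ratio
       CURRENT_DATE - MAX(event_date)::date AS days_since_last_event -- recency
FROM events
WHERE event_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY user_id;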
Worked examples
Example 1: User order counts and spend in last 30/90 days
-- Assumptions: orders(user_id, order_id, order_date, amount)
WITH w30 AS (
  SELECT user_id,
         COUNT(*) AS orders_30d,
         COALESCE(SUM(amount), 0) AS spend_30d
  FROM orders
  WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
  GROUP BY user_id
),
w90 AS (
  SELECT user_id,
         COUNT(*) AS orders_90d,
         COALESCE(SUM(amount), 0) AS spend_90d,
         COALESCE(AVG(amount), 0) AS aov_90d
  FROM orders
  WHERE order_date >= CURRENT_DATE - INTERVAL '90 days'
  GROUP BY user_id
)
SELECT u.user_id,
       COALESCE(w30.orders_30d, 0) AS orders_30d,
       COALESCE(w30.spend_30d, 0) AS spend_30d,
       COALESCE(w90.orders_90d, 0) AS orders_90d,
       COALESCE(w90.spend_90d, 0) AS spend_90d,
       COALESCE(w90.aov_90d, 0) AS aov_90d
FROM (SELECT DISTINCT user_id FROM orders) u  -- driving entity list; u.user_id is never NULL here
LEFT JOIN w30 ON u.user_id = w30.user_id
LEFT JOIN w90 ON u.user_id = w90.user_id;
-- In practice, drive from your users table to also include users with no orders at all.
MySQL interval variant
WHERE order_date >= DATE_SUB(CURDATE(), INTERVAL 30 DAY)
-- and similarly for 90 days
Example 2: Days since last order (recency)
-- One row per user who has at least one order; recency in whole days
-- (PostgreSQL: subtracting two DATE values yields an integer day count)
SELECT o.user_id,
       CURRENT_DATE - MAX(o.order_date) AS days_since_last_order
FROM orders o
GROUP BY o.user_id;
-- Users with no orders never appear in this result. To include them,
-- LEFT JOIN from your users table and COALESCE the recency to a
-- sentinel such as 9999.
Dialect note
In some engines use DATEDIFF(CURRENT_DATE, MAX(order_date)) or julianday(CURRENT_DATE) - julianday(MAX(order_date)).
Example 3: Distinct categories and repeat-purchase ratio
-- orders(user_id, order_id, order_date, amount, product_category)
WITH base AS (
  SELECT *
  FROM orders
  WHERE order_date >= CURRENT_DATE - INTERVAL '90 days'
),
user_stats AS (
  SELECT user_id,
         COUNT(*) AS orders_90d,
         COUNT(DISTINCT product_category) AS distinct_cats_90d
  FROM base
  GROUP BY user_id
),
repeat_flag AS (
  -- Orders beyond a user's first in the window count as repeat orders
  SELECT user_id,
         SUM(CASE WHEN rn >= 2 THEN 1 ELSE 0 END) AS repeat_orders
  FROM (
    SELECT user_id, order_id,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_date) AS rn
    FROM base
  ) t
  GROUP BY user_id
)
SELECT s.user_id,
       s.orders_90d,
       s.distinct_cats_90d,
       COALESCE(r.repeat_orders, 0) / NULLIF(s.orders_90d, 0)::numeric AS repeat_ratio
FROM user_stats s
LEFT JOIN repeat_flag r USING (user_id);
Note: ::numeric is PostgreSQL shorthand; in other engines use CAST(... AS DECIMAL) or equivalent so the division does not truncate to an integer.
Example 4: Product-level weekly sales (for forecasting)
-- sales(product_id, sold_at, qty)
SELECT product_id,
       DATE_TRUNC('week', sold_at) AS week_start,
       SUM(qty) AS units_sold
FROM sales
WHERE sold_at < CURRENT_DATE  -- up to the reference date
GROUP BY product_id, DATE_TRUNC('week', sold_at);
These weekly aggregates can be lagged or joined to form model features.
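For instance, a minimal sketch of lagged and moving-average features, assuming the query above is wrapped in a CTE (PostgreSQL syntax):

WITH weekly_sales AS (
  SELECT product_id,
         DATE_TRUNC('week', sold_at) AS week_start,
         SUM(qty) AS units_sold
  FROM sales
  WHERE sold_at < CURRENT_DATE
  GROUP BY product_id, DATE_TRUNC('week', sold_at)
)
SELECT product_id,
       week_start,
       units_sold,
       LAG(units_sold) OVER (PARTITION BY product_id ORDER BY week_start) AS units_prev_week,
       AVG(units_sold) OVER (PARTITION BY product_id ORDER BY week_start
                             ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS units_ma3
FROM weekly_sales;
-- Caveat: ROWS frames count rows, not calendar weeks; fill gaps first if
-- some product-weeks have no sales.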
From raw to features: a repeatable 5-step flow
- Choose entity and cutoff: e.g., per user_id, up to today or a label date (see the template after this list).
- Select window(s): 7/30/90 days, or weeks/months.
- Aggregate: COUNT, SUM, AVG, distinct counts, conditional sums.
- Harden features: COALESCE nulls, safe divide with NULLIF, clamp outliers if needed.
- Validate: Spot-check a few entities manually; compare totals to raw data.
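A minimal leakage-safe template for the flow, assuming the same orders table; the cutoff is hard-coded to DATE '2024-01-01' purely for illustration and would normally be parameterized per run:

WITH params AS (
  SELECT DATE '2024-01-01' AS cutoff                  -- step 1: entity = user_id, cutoff
)
SELECT o.user_id,
       COUNT(*) AS orders_30d,                        -- step 3: aggregate per entity
       COALESCE(SUM(o.amount), 0) AS spend_30d,       -- step 4: null-safe total
       COALESCE(SUM(o.amount), 0)::numeric
         / NULLIF(COUNT(*), 0) AS aov_30d             -- step 4: safe division
FROM orders o
CROSS JOIN params p
WHERE o.order_date >= p.cutoff - INTERVAL '30 days'   -- step 2: 30-day window
  AND o.order_date <  p.cutoff                        -- never read events at/after the cutoff
GROUP BY o.user_id;
-- Step 5: spot-check a few user_ids against the raw rows.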
Learning path
- Start with simple GROUP BY per entity and a single window.
- Add conditional aggregations (CASE inside SUM/COUNT).
- Introduce distinct counts and ratios.
- Combine multiple windows (7/30/90 days) using CTEs.
- Compute recency and last-known values; practice joins back to entity tables.
Practice: Exercises
Exercise: Build a user feature table
Using an orders table with columns (user_id, order_id, order_date, amount, product_category), create a feature table per user with:
- orders_30d, spend_30d
- orders_90d, spend_90d, aov_90d
- days_since_last_order
- distinct_categories_90d
Hints
- Build 30/90-day aggregates in separate CTEs, then join (a starting skeleton follows the hints).
- Use COALESCE for users with no orders in a window.
- Compute aov_90d as spend_90d / NULLIF(orders_90d,0).
- Recency: CURRENT_DATE - MAX(order_date).
- Checklist: ☐ Separate windows; ☐ Safe division; ☐ COALESCE nulls; ☐ One row per user.
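A starting skeleton under the same assumptions (PostgreSQL syntax; fill in the TODOs):

WITH w30 AS (
  SELECT user_id,
         COUNT(*) AS orders_30d,
         COALESCE(SUM(amount), 0) AS spend_30d
  FROM orders
  WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
  GROUP BY user_id
),
w90 AS (
  -- TODO: orders_90d, spend_90d, aov_90d (divide with NULLIF), distinct_categories_90d
  SELECT user_id FROM orders GROUP BY user_id
),
rec AS (
  SELECT user_id, CURRENT_DATE - MAX(order_date) AS days_since_last_order
  FROM orders
  GROUP BY user_id
)
SELECT u.user_id  -- TODO: select each feature, COALESCE to sensible defaults
FROM (SELECT DISTINCT user_id FROM orders) u
LEFT JOIN w30 USING (user_id)
LEFT JOIN w90 USING (user_id)
LEFT JOIN rec USING (user_id);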
Common mistakes and self-check
- Data leakage: Using events after the label date. Self-check: Ensure your WHERE filter caps at the cutoff.
- Incorrect window math: Off-by-one days. Self-check: Test a known user and manually count.
- Division by zero or integer division. Self-check: Use NULLIF and casts to get decimals.
- Missing entities with zero events. Self-check: LEFT JOIN from the entity list and COALESCE.
- Duplicate rows per entity. Self-check: After joins, verify one row per entity_id.
- HAVING vs WHERE confusion. Self-check: WHERE filters rows before grouping; HAVING filters groups after aggregation (contrast shown below).
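To make the last point concrete, a small contrast on the same orders table:

-- WHERE filters rows before grouping: only recent orders are counted
SELECT user_id, COUNT(*) AS orders_30d
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY user_id;

-- HAVING filters groups after aggregation: only frequent buyers survive
SELECT user_id, COUNT(*) AS total_orders
FROM orders
GROUP BY user_id
HAVING COUNT(*) >= 5;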
Practical projects
- Churn mini-project: Build user-level 30/90-day features from a synthetic orders log; compute labels (churned in next 30 days) and export the feature table.
- Fraud signals: Aggregate payment attempts per user over 24- and 72-hour windows; create ratios like high_value_txns / total_txns.
- Product demand: Weekly product sales plus 3-week moving sums; derive features like growth rate and volatility (stddev if available).
Next steps
- Practice joining features to labels on the correct cutoff date.
- Add multiple time windows and compare feature stability over time.
- Explore window functions for rolling features when needed (a sketch follows).
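A hedged sketch of a rolling 7-day spend per user, assuming orders has a DATE-typed order_date (PostgreSQL 11+ supports RANGE frames with interval offsets over date ordering):

SELECT user_id,
       order_date,
       SUM(amount) OVER (
         PARTITION BY user_id
         ORDER BY order_date
         RANGE BETWEEN INTERVAL '6 days' PRECEDING AND CURRENT ROW  -- current day plus the 6 before it
       ) AS spend_rolling_7d
FROM orders;
-- This yields one row per order; for a feature table, keep each user's
-- latest row as of your cutoff.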
Mini challenge
Create three features per user for a 60-day window: (1) total_spend_60d, (2) distinct_merchants_60d, (3) share_high_value_txns_60d, the count of transactions with amount >= 100 divided by total transactions. Ensure safe division and one row per user. Validate with two hand-checked users.