
Aggregations For Feature Prep

Learn Aggregations For Feature Prep for free with explanations, exercises, and a quick test for Data Scientists.

Published: January 1, 2026 | Updated: January 1, 2026

Why this matters

Aggregations turn raw events into compact, model-ready features. As a Data Scientist, you will routinely convert clicks, orders, transactions, and logs into per-entity summaries, such as orders in the last 30 days, average spend, or time since last activity. This is essential for churn prediction, fraud detection, recommendations, and forecasting.

  • Churn: Count user sessions last 14 days, recency since last login.
  • Fraud: Sum of amounts over short windows, distinct payment methods used.
  • Recommendations: Number of categories purchased, average rating given.
  • Forecasting: Weekly sales per product, moving averages of demand.

Who this is for

  • Data Scientists preparing features directly from SQL data sources.
  • Analytics Engineers and SQL-savvy analysts supporting ML teams.
  • MLOps practitioners validating feature logic in the warehouse.

Prerequisites

  • Comfort with SELECT, WHERE, GROUP BY, and basic JOINs.
  • Familiarity with timestamps/dates and simple CASE expressions.
  • Basic understanding of what a feature table is (one row per entity/time).

Concept explained simply

An aggregation summarizes many rows into one row per entity (e.g., per user_id or product_id) using functions like COUNT, SUM, AVG, MIN, and MAX. For feature prep, you usually:

  1. Define a time window (e.g., last 30 or 90 days up to a reference date).
  2. Filter events into that window.
  3. GROUP BY the entity and compute aggregations.
  4. Handle nulls (use COALESCE) and safe divisions.
  5. Join features back to a main entity table (e.g., users).
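
The five steps above can be sketched end to end with an in-memory SQLite database (the table, columns, and values here are invented toy data, not from a real dataset):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (user_id INT, order_id INT, order_date TEXT, amount REAL);
INSERT INTO orders VALUES
  (1, 10, date('now', '-5 day'),  20.0),
  (1, 11, date('now', '-40 day'), 15.0),
  (2, 12, date('now', '-2 day'),  50.0);
""")

# Steps 1-4: define a 30-day window, filter events into it,
# group by the entity, and handle nulls with COALESCE.
rows = conn.execute("""
SELECT u.user_id,
       COALESCE(w.orders_30d, 0)  AS orders_30d,
       COALESCE(w.spend_30d, 0.0) AS spend_30d
FROM (SELECT DISTINCT user_id FROM orders) u
LEFT JOIN (
    SELECT user_id, COUNT(*) AS orders_30d, SUM(amount) AS spend_30d
    FROM orders
    WHERE date(order_date) >= date('now', '-30 day')
    GROUP BY user_id
) w ON u.user_id = w.user_id   -- step 5: join back to the entity list
ORDER BY u.user_id;
""").fetchall()

print(rows)   # one row per user: (user_id, orders_30d, spend_30d)
```

User 1's 40-day-old order falls outside the window, so only one of their two orders is counted.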

Mental model

Think of a "magnifying glass over time": you slide a window up to a cutoff date, collect all events inside the window, squeeze them into numbers per entity, and attach those numbers as columns in a wide feature table. Repeat for each feature and each time window you need.

What about SQL dialects?

Intervals differ by engine. Example variants:

  • PostgreSQL: CURRENT_DATE - INTERVAL '30 days'
  • MySQL: order_date >= DATE_SUB(CURDATE(), INTERVAL 30 DAY)
  • SQLite: date(order_date) >= date('now','-30 day')

The concept is identical: filter by a window, group, aggregate.

Common feature patterns

  • Counts: COUNT(*) of events per entity and window.
  • Totals and averages: SUM(amount), AVG(amount), MIN/MAX metrics.
  • Distinct counts: COUNT(DISTINCT category_id) to capture variety.
  • Conditional sums: SUM(CASE WHEN condition THEN 1 ELSE 0 END).
  • Ratios: SUM(flag) / NULLIF(COUNT(*),0) for safe division.
  • Recency: CURRENT_DATE - MAX(event_date) as days_since_last_event.
  • Frequency: COUNT(*) divided by number of days/weeks in window.
  • Last known value: MAX_BY(value, ts) or conditional agg to capture latest.

Worked examples

Example 1: User order counts and spend in last 30/90 days
-- Assumptions: orders(user_id, order_id, order_date, amount)
WITH w30 AS (
  SELECT user_id,
         COUNT(*) AS orders_30d,
         COALESCE(SUM(amount),0) AS spend_30d
  FROM orders
  WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
  GROUP BY user_id
),
w90 AS (
  SELECT user_id,
         COUNT(*) AS orders_90d,
         COALESCE(SUM(amount),0) AS spend_90d,
         COALESCE(AVG(amount),0) AS aov_90d
  FROM orders
  WHERE order_date >= CURRENT_DATE - INTERVAL '90 days'
  GROUP BY user_id
)
SELECT u.user_id,
       COALESCE(w30.orders_30d,0) AS orders_30d,
       COALESCE(w30.spend_30d,0)  AS spend_30d,
       COALESCE(w90.orders_90d,0) AS orders_90d,
       COALESCE(w90.spend_90d,0)  AS spend_90d,
       COALESCE(w90.aov_90d,0)    AS aov_90d
FROM (SELECT DISTINCT user_id FROM orders) u
LEFT JOIN w30 ON u.user_id = w30.user_id
LEFT JOIN w90 ON u.user_id = w90.user_id;
MySQL interval variant
WHERE order_date >= DATE_SUB(CURDATE(), INTERVAL 30 DAY)
-- and similarly for 90 days
Example 2: Days since last order (recency)
-- Return one row per user with recency in days
SELECT o.user_id,
       (CURRENT_DATE - MAX(o.order_date)) AS days_since_last_order -- PostgreSQL: DATE minus DATE yields days
FROM orders o
GROUP BY o.user_id;
Dialect note

In some engines use DATEDIFF(CURRENT_DATE, MAX(order_date)) or julianday(CURRENT_DATE) - julianday(MAX(order_date)).
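
The SQLite variant of the recency feature can be exercised directly (synthetic rows; the column names follow the article's orders table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (user_id INT, order_date TEXT);
INSERT INTO orders VALUES
  (1, date('now', '-8 day')),
  (1, date('now', '-30 day')),
  (2, date('now'));
""")

rows = conn.execute("""
SELECT user_id,
       -- julianday difference between today and the latest order date
       CAST(julianday(date('now')) - julianday(MAX(order_date)) AS INT)
         AS days_since_last_order
FROM orders
GROUP BY user_id
ORDER BY user_id;
""").fetchall()

print(rows)
```

MAX picks each user's latest order, so user 1's 30-day-old order is ignored in favor of the 8-day-old one.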

Example 3: Distinct categories and repeat-purchase ratio
-- orders(user_id, order_id, order_date, amount, product_category)
WITH base AS (
  SELECT *
  FROM orders
  WHERE order_date >= CURRENT_DATE - INTERVAL '90 days'
),
user_stats AS (
  SELECT user_id,
         COUNT(*) AS orders_90d,
         COUNT(DISTINCT product_category) AS distinct_cats_90d
  FROM base
  GROUP BY user_id
),
repeat_flag AS (
  -- If a user has >= 2 orders in the window, count them as repeat
  SELECT user_id,
         SUM(CASE WHEN rn >= 2 THEN 1 ELSE 0 END) AS repeat_orders
  FROM (
    SELECT user_id, order_id,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_date) AS rn
    FROM base
  ) t
  GROUP BY user_id
)
SELECT s.user_id,
       s.orders_90d,
       s.distinct_cats_90d,
       COALESCE(r.repeat_orders,0) / NULLIF(s.orders_90d,0)::numeric AS repeat_ratio
FROM user_stats s
LEFT JOIN repeat_flag r USING (user_id);

Note: Use CAST to numeric for non-integer division in engines that need it.
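
The repeat-ratio logic is easy to sanity-check against a hand count on toy data; this sketch uses SQLite's window functions (available in SQLite 3.25+):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (user_id INT, order_id INT, order_date TEXT);
INSERT INTO orders VALUES
  (1, 10, '2024-01-01'), (1, 11, '2024-01-05'), (1, 12, '2024-01-09'),
  (2, 13, '2024-01-02');
""")

rows = conn.execute("""
WITH ranked AS (
  SELECT user_id,
         ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_date) AS rn
  FROM orders
)
SELECT user_id,
       COUNT(*)                                  AS orders_total,
       SUM(CASE WHEN rn >= 2 THEN 1 ELSE 0 END)  AS repeat_orders,
       -- every order after a user's first one counts as a repeat
       SUM(CASE WHEN rn >= 2 THEN 1 ELSE 0 END) * 1.0
         / NULLIF(COUNT(*), 0)                   AS repeat_ratio
FROM ranked
GROUP BY user_id
ORDER BY user_id;
""").fetchall()

print(rows)
```

User 1 has 3 orders, of which 2 are repeats (ratio 2/3); user 2's single order yields a ratio of 0.0.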

Example 4: Product-level weekly sales (for forecasting)
-- sales(product_id, sold_at, qty)
SELECT product_id,
       DATE_TRUNC('week', sold_at) AS week_start,
       SUM(qty) AS units_sold
FROM sales
WHERE sold_at < CURRENT_DATE -- up to reference date
GROUP BY product_id, DATE_TRUNC('week', sold_at);

These weekly aggregates can be lagged or joined to form model features.
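
Turning weekly aggregates into lagged features can be sketched with SQLite's strftime week bucket and LAG (the sales rows below are invented; DATE_TRUNC from the example is PostgreSQL-specific):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (product_id INT, sold_at TEXT, qty INT);
INSERT INTO sales VALUES
  (1, '2024-01-01', 3), (1, '2024-01-02', 2),   -- week 1
  (1, '2024-01-08', 5),                         -- week 2
  (1, '2024-01-15', 4);                         -- week 3
""")

rows = conn.execute("""
WITH weekly AS (
  SELECT product_id,
         strftime('%Y-%W', sold_at) AS week,   -- year-week bucket
         SUM(qty) AS units_sold
  FROM sales
  GROUP BY product_id, week
)
SELECT product_id, week, units_sold,
       -- previous week's sales become a lag feature for forecasting
       LAG(units_sold) OVER (PARTITION BY product_id ORDER BY week)
         AS units_prev_week
FROM weekly
ORDER BY week;
""").fetchall()

print(rows)
```

The first week has no predecessor, so its lag is NULL; downstream you would COALESCE or drop such rows.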

From raw to features: a repeatable 5-step flow

  1. Choose entity and cutoff: e.g., per user_id, up to today or a label date.
  2. Select window(s): 7/30/90 days, or weeks/months.
  3. Aggregate: COUNT, SUM, AVG, distinct counts, conditional sums.
  4. Harden features: COALESCE nulls, safe divide with NULLIF, clamp outliers if needed.
  5. Validate: Spot-check a few entities manually; compare totals to raw data.
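
Step 5 can itself be partly automated as a reconciliation check: the feature-table total should match the raw-event total for the same window. A sketch with invented names and a fixed cutoff:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (user_id INT, order_date TEXT, amount REAL);
INSERT INTO orders VALUES
  (1, '2024-03-01', 10.0), (1, '2024-03-10', 20.0), (2, '2024-03-05', 7.5);

-- Feature table: per-user spend up to a fixed cutoff date.
CREATE TABLE user_features AS
SELECT user_id, SUM(amount) AS spend
FROM orders
WHERE order_date <= '2024-03-31'
GROUP BY user_id;
""")

# Reconciliation: feature-table total must equal the raw total for the
# same window; a mismatch means the feature logic dropped or duplicated rows.
feat_total = conn.execute("SELECT SUM(spend) FROM user_features").fetchone()[0]
raw_total = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE order_date <= '2024-03-31'"
).fetchone()[0]
print(feat_total, raw_total)
```

The same pattern works for counts: COUNT(*) in the raw table should equal the sum of per-user order counts in the feature table.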

Learning path

  • Start with simple GROUP BY per entity and a single window.
  • Add conditional aggregations (CASE inside SUM/COUNT).
  • Introduce distinct counts and ratios.
  • Combine multiple windows (7/30/90 days) using CTEs.
  • Compute recency and last-known values; practice joins back to entity tables.

Practice: Exercises


Exercise: Build a user feature table

Using an orders table with columns (user_id, order_id, order_date, amount, product_category), create a feature table per user with:

  • orders_30d, spend_30d
  • orders_90d, spend_90d, aov_90d
  • days_since_last_order
  • distinct_categories_90d
Hints
  • Build 30/90-day aggregates in separate CTEs, then join.
  • Use COALESCE for users with no orders in a window.
  • Compute aov_90d as spend_90d / NULLIF(orders_90d,0).
  • Recency: CURRENT_DATE - MAX(order_date).
  • Checklist: ☐ Separate windows; ☐ Safe division; ☐ COALESCE nulls; ☐ One row per user.

Common mistakes and self-check

  • Data leakage: Using events after the label date. Self-check: Ensure your WHERE filter caps at the cutoff.
  • Incorrect window math: Off-by-one days. Self-check: Test a known user and manually count.
  • Division by zero or integer division. Self-check: Use NULLIF and casts to get decimals.
  • Missing entities with zero events. Self-check: LEFT JOIN from the entity list and COALESCE.
  • Duplicate rows per entity. Self-check: After joins, verify one row per entity_id.
  • HAVING vs WHERE confusion. Self-check: WHERE filters rows before grouping; HAVING filters groups after aggregation.
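
The WHERE vs HAVING distinction from the last point is easy to verify on toy data (illustrative table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (user_id INT, amount REAL);
INSERT INTO orders VALUES (1, 5.0), (1, 50.0), (2, 50.0);
""")

# WHERE filters rows BEFORE grouping: only amounts >= 10 get counted.
where_rows = conn.execute("""
SELECT user_id, COUNT(*) FROM orders
WHERE amount >= 10
GROUP BY user_id ORDER BY user_id;
""").fetchall()

# HAVING filters groups AFTER aggregation: keeps users with >= 2 orders.
having_rows = conn.execute("""
SELECT user_id, COUNT(*) AS n FROM orders
GROUP BY user_id HAVING n >= 2;
""").fetchall()

print(where_rows, having_rows)
```

Note how user 1 survives the WHERE query with a count of 1 (the 5.0 order is filtered out before counting), but is the only user kept by the HAVING query (2 total orders).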

Practical projects

  • Churn mini-project: Build user-level 30/90-day features from a synthetic orders log; compute labels (churned in next 30 days) and export the feature table.
  • Fraud signals: Aggregate per-user payment attempts in 24/72-hour windows; create ratios like high_value_txns / total_txns.
  • Product demand: Weekly product sales plus 3-week moving sums; derive features like growth rate and volatility (stddev if available).

Next steps

  • Practice joining features to labels on the correct cutoff date.
  • Add multiple time windows and compare feature stability over time.
  • Explore window functions for rolling features when needed.

Mini challenge

Create three features per user for a 60-day window: (1) total_spend_60d, (2) distinct_merchants_60d, (3) share_high_value_txns_60d defined as txns >= 100 over total. Ensure safe division and one row per user. Validate with two hand-checked users.

Practice Exercises


Instructions

Table: orders(user_id, order_id, order_date, amount, product_category).

Task: Create one row per user with these columns:

  • orders_30d, spend_30d
  • orders_90d, spend_90d, aov_90d
  • days_since_last_order
  • distinct_categories_90d

Notes: Use COALESCE for users without orders in a window; safe divide for aov_90d.

Expected Output
Columns: user_id, orders_30d, spend_30d, orders_90d, spend_90d, aov_90d, days_since_last_order, distinct_categories_90d. Example row: 101 | 2 | 84.50 | 5 | 210.00 | 42.00 | 8 | 3

Aggregations For Feature Prep — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.

