Why this matters
Aggregations turn raw events into compact, model-ready features. As a Data Scientist, you will routinely convert clicks, orders, transactions, and logs into per-entity summaries, such as orders in the last 30 days, average spend, or time since last activity. This is essential for churn prediction, fraud detection, recommendations, and forecasting.
- Churn: Count user sessions last 14 days, recency since last login.
- Fraud: Sum of amounts over short windows, distinct payment methods used.
- Recommendations: Number of categories purchased, average rating given.
- Forecasting: Weekly sales per product, moving averages of demand.
Who this is for
- Data Scientists preparing features directly from SQL data sources.
- Analytics Engineers and SQL-savvy analysts supporting ML teams.
- MLOps practitioners validating feature logic in the warehouse.
Prerequisites
- Comfort with SELECT, WHERE, GROUP BY, and basic JOINs.
- Familiarity with timestamps/dates and simple CASE expressions.
- Basic understanding of what a feature table is (one row per entity/time).
Concept explained simply
An aggregation summarizes many rows into one row per entity (e.g., per user_id or product_id) using functions like COUNT, SUM, AVG, MIN, and MAX. For feature prep, you usually:
- Define a time window (e.g., last 30 or 90 days up to a reference date).
- Filter events into that window.
- GROUP BY the entity and compute aggregations.
- Handle nulls (use COALESCE) and safe divisions.
- Join features back to a main entity table (e.g., users). A minimal sketch of these steps follows.
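A minimal sketch, assuming a hypothetical orders(user_id, order_date, amount) table and PostgreSQL syntax:

-- Steps 1-2: define a 30-day window and filter events into it
-- Step 3: group by the entity and aggregate
-- Step 4: handle nulls with COALESCE
SELECT user_id,
       COUNT(*) AS orders_30d,
       COALESCE(SUM(amount), 0) AS spend_30d
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY user_id;
-- Step 5: LEFT JOIN this result back to your users table.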
Mental model
Think of a "+magnifying glass over time+": you slide a window up to a cutoff date, collect all events inside the window, squeeze them into numbers per entity, and attach those numbers as columns in a wide feature table. Repeat for each feature and each time window you need.
What about SQL dialects?
Intervals differ by engine. Example variants:
- PostgreSQL: CURRENT_DATE - INTERVAL '30 days'
- MySQL: order_date >= DATE_SUB(CURDATE(), INTERVAL 30 DAY)
- SQLite: date(order_date) >= date('now','-30 day')
The concept is identical: filter by a window, group, aggregate.
Common feature patterns
- Counts: COUNT(*) of events per entity and window.
- Totals and averages: SUM(amount), AVG(amount), MIN/MAX metrics.
- Distinct counts: COUNT(DISTINCT category_id) to capture variety.
- Conditional sums: SUM(CASE WHEN condition THEN 1 ELSE 0 END).
- Ratios: SUM(flag) / NULLIF(COUNT(*),0) for safe division.
- Recency: CURRENT_DATE - MAX(event_date) as days_since_last_event.
- Frequency: COUNT(*) divided by number of days/weeks in window.
- Last known value: MAX_BY(value, ts) where supported (e.g., DuckDB, Trino, Snowflake), or a conditional aggregate to capture the latest. Several of these patterns are combined in the sketch below.
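A hedged sketch, assuming a hypothetical events(user_id, event_date, amount, is_refund) table (PostgreSQL syntax; adapt names and date math to your engine):

SELECT user_id,
       COUNT(*) AS events_30d,                                       -- count
       COALESCE(SUM(amount), 0) AS total_amount_30d,                 -- total
       COUNT(DISTINCT event_date::date) AS active_days_30d,          -- distinct count
       SUM(CASE WHEN is_refund THEN 1 ELSE 0 END) AS refunds_30d,    -- conditional sum
       SUM(CASE WHEN is_refund THEN 1 ELSE 0 END)::numeric
         / NULLIF(COUNT(*), 0) AS refund_ratio_30d,                  -- safe ratio
       CURRENT_DATE - MAX(event_date)::date AS days_since_last_event -- recency
FROM events
WHERE event_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY user_id;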
Worked examples
Example 1: User order counts and spend in last 30/90 days
-- Assumptions: orders(user_id, order_id, order_date, amount)
WITH w30 AS (
  SELECT user_id,
         COUNT(*) AS orders_30d,
         COALESCE(SUM(amount), 0) AS spend_30d
  FROM orders
  WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
  GROUP BY user_id
),
w90 AS (
  SELECT user_id,
         COUNT(*) AS orders_90d,
         COALESCE(SUM(amount), 0) AS spend_90d,
         COALESCE(AVG(amount), 0) AS aov_90d
  FROM orders
  WHERE order_date >= CURRENT_DATE - INTERVAL '90 days'
  GROUP BY user_id
)
SELECT u.user_id,
       COALESCE(w30.orders_30d, 0) AS orders_30d,
       COALESCE(w30.spend_30d, 0) AS spend_30d,
       COALESCE(w90.orders_90d, 0) AS orders_90d,
       COALESCE(w90.spend_90d, 0) AS spend_90d,
       COALESCE(w90.aov_90d, 0) AS aov_90d
FROM (SELECT DISTINCT user_id FROM orders) u  -- driving entity list; u.user_id is never NULL here
LEFT JOIN w30 ON u.user_id = w30.user_id
LEFT JOIN w90 ON u.user_id = w90.user_id;
-- In practice, drive from your users table to also include users with no orders at all.
MySQL interval variant
WHERE order_date >= DATE_SUB(CURDATE(), INTERVAL 30 DAY)
-- and similarly for 90 days
Example 2: Days since last order (recency)
-- One row per user who has at least one order; recency in whole days
-- (PostgreSQL: subtracting two DATE values yields an integer day count)
SELECT o.user_id,
       CURRENT_DATE - MAX(o.order_date) AS days_since_last_order
FROM orders o
GROUP BY o.user_id;
-- Users with no orders never appear in this result. To include them,
-- LEFT JOIN from your users table and COALESCE the recency to a
-- sentinel such as 9999.
Dialect note
In some engines use DATEDIFF(CURRENT_DATE, MAX(order_date)) or julianday(CURRENT_DATE) - julianday(MAX(order_date)).
Example 3: Distinct categories and repeat-purchase ratio
-- orders(user_id, order_id, order_date, amount, product_category)
WITH base AS (
  SELECT *
  FROM orders
  WHERE order_date >= CURRENT_DATE - INTERVAL '90 days'
),
user_stats AS (
  SELECT user_id,
         COUNT(*) AS orders_90d,
         COUNT(DISTINCT product_category) AS distinct_cats_90d
  FROM base
  GROUP BY user_id
),
repeat_flag AS (
  -- Orders beyond a user's first in the window count as repeat orders
  SELECT user_id,
         SUM(CASE WHEN rn >= 2 THEN 1 ELSE 0 END) AS repeat_orders
  FROM (
    SELECT user_id, order_id,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_date) AS rn
    FROM base
  ) t
  GROUP BY user_id
)
SELECT s.user_id,
       s.orders_90d,
       s.distinct_cats_90d,
       COALESCE(r.repeat_orders, 0) / NULLIF(s.orders_90d, 0)::numeric AS repeat_ratio
FROM user_stats s
LEFT JOIN repeat_flag r USING (user_id);
Note: ::numeric is PostgreSQL shorthand; in other engines use CAST(... AS DECIMAL) or equivalent so the division does not truncate to an integer.
Example 4: Product-level weekly sales (for forecasting)
-- sales(product_id, sold_at, qty)
SELECT product_id,
       DATE_TRUNC('week', sold_at) AS week_start,
       SUM(qty) AS units_sold
FROM sales
WHERE sold_at < CURRENT_DATE  -- up to the reference date
GROUP BY product_id, DATE_TRUNC('week', sold_at);
These weekly aggregates can be lagged or joined to form model features.
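For instance, a minimal sketch of lagged and moving-average features, assuming the query above is wrapped in a CTE (PostgreSQL syntax):

WITH weekly_sales AS (
  SELECT product_id,
         DATE_TRUNC('week', sold_at) AS week_start,
         SUM(qty) AS units_sold
  FROM sales
  WHERE sold_at < CURRENT_DATE
  GROUP BY product_id, DATE_TRUNC('week', sold_at)
)
SELECT product_id,
       week_start,
       units_sold,
       LAG(units_sold) OVER (PARTITION BY product_id ORDER BY week_start) AS units_prev_week,
       AVG(units_sold) OVER (PARTITION BY product_id ORDER BY week_start
                             ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS units_ma3
FROM weekly_sales;
-- Caveat: ROWS frames count rows, not calendar weeks; fill gaps first if
-- some product-weeks have no sales.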
From raw to features: a repeatable 5-step flow
- Choose entity and cutoff: e.g., per user_id, up to today or a label date (see the template after this list).
- Select window(s): 7/30/90 days, or weeks/months.
- Aggregate: COUNT, SUM, AVG, distinct counts, conditional sums.
- Harden features: COALESCE nulls, safe divide with NULLIF, clamp outliers if needed.
- Validate: Spot-check a few entities manually; compare totals to raw data.
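A minimal leakage-safe template for the flow, assuming the same orders table; the cutoff is hard-coded to DATE '2024-01-01' purely for illustration and would normally be parameterized per run:

WITH params AS (
  SELECT DATE '2024-01-01' AS cutoff                  -- step 1: entity = user_id, cutoff
)
SELECT o.user_id,
       COUNT(*) AS orders_30d,                        -- step 3: aggregate per entity
       COALESCE(SUM(o.amount), 0) AS spend_30d,       -- step 4: null-safe total
       COALESCE(SUM(o.amount), 0)::numeric
         / NULLIF(COUNT(*), 0) AS aov_30d             -- step 4: safe division
FROM orders o
CROSS JOIN params p
WHERE o.order_date >= p.cutoff - INTERVAL '30 days'   -- step 2: 30-day window
  AND o.order_date <  p.cutoff                        -- never read events at/after the cutoff
GROUP BY o.user_id;
-- Step 5: spot-check a few user_ids against the raw rows.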
Learning path
- Start with simple GROUP BY per entity and a single window.
- Add conditional aggregations (CASE inside SUM/COUNT).
- Introduce distinct counts and ratios.
- Combine multiple windows (7/30/90 days) using CTEs.
- Compute recency and last-known values; practice joins back to entity tables.
Practice: Exercises
Exercise: Build a user feature table
Using an orders table with columns (user_id, order_id, order_date, amount, product_category), create a feature table per user with:
- orders_30d, spend_30d
- orders_90d, spend_90d, aov_90d
- days_since_last_order
- distinct_categories_90d
Hints
- Build 30/90-day aggregates in separate CTEs, then join (a starting skeleton follows the hints).
- Use COALESCE for users with no orders in a window.
- Compute aov_90d as spend_90d / NULLIF(orders_90d,0).
- Recency: CURRENT_DATE - MAX(order_date).
- Checklist: ☐ Separate windows; ☐ Safe division; ☐ COALESCE nulls; ☐ One row per user.
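A starting skeleton under the same assumptions (PostgreSQL syntax; fill in the TODOs):

WITH w30 AS (
  SELECT user_id,
         COUNT(*) AS orders_30d,
         COALESCE(SUM(amount), 0) AS spend_30d
  FROM orders
  WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
  GROUP BY user_id
),
w90 AS (
  -- TODO: orders_90d, spend_90d, aov_90d (divide with NULLIF), distinct_categories_90d
  SELECT user_id FROM orders GROUP BY user_id
),
rec AS (
  SELECT user_id, CURRENT_DATE - MAX(order_date) AS days_since_last_order
  FROM orders
  GROUP BY user_id
)
SELECT u.user_id  -- TODO: select each feature, COALESCE to sensible defaults
FROM (SELECT DISTINCT user_id FROM orders) u
LEFT JOIN w30 USING (user_id)
LEFT JOIN w90 USING (user_id)
LEFT JOIN rec USING (user_id);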
Common mistakes and self-check
- Data leakage: Using events after the label date. Self-check: Ensure your WHERE filter caps at the cutoff.
- Incorrect window math: Off-by-one days. Self-check: Test a known user and manually count.
- Division by zero or integer division. Self-check: Use NULLIF and casts to get decimals.
- Missing entities with zero events. Self-check: LEFT JOIN from the entity list and COALESCE.
- Duplicate rows per entity. Self-check: After joins, verify one row per entity_id.
- HAVING vs WHERE confusion. Self-check: WHERE filters rows before grouping; HAVING filters groups after aggregation (contrast shown below).
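To make the last point concrete, a small contrast on the same orders table:

-- WHERE filters rows before grouping: only recent orders are counted
SELECT user_id, COUNT(*) AS orders_30d
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY user_id;

-- HAVING filters groups after aggregation: only frequent buyers survive
SELECT user_id, COUNT(*) AS total_orders
FROM orders
GROUP BY user_id
HAVING COUNT(*) >= 5;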
Practical projects
- Churn mini-project: Build user-level 30/90-day features from a synthetic orders log; compute labels (churned in next 30 days) and export the feature table.
- Fraud signals: Aggregate payment attempts per user over 24- and 72-hour windows; create ratios like high_value_txns / total_txns.
- Product demand: Weekly product sales plus 3-week moving sums; derive features like growth rate and volatility (stddev if available).
Next steps
- Practice joining features to labels on the correct cutoff date.
- Add multiple time windows and compare feature stability over time.
- Explore window functions for rolling features when needed (a sketch follows).
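A hedged sketch of a rolling 7-day spend per user, assuming orders has a DATE-typed order_date (PostgreSQL 11+ supports RANGE frames with interval offsets over date ordering):

SELECT user_id,
       order_date,
       SUM(amount) OVER (
         PARTITION BY user_id
         ORDER BY order_date
         RANGE BETWEEN INTERVAL '6 days' PRECEDING AND CURRENT ROW  -- current day plus the 6 before it
       ) AS spend_rolling_7d
FROM orders;
-- This yields one row per order; for a feature table, keep each user's
-- latest row as of your cutoff.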
Mini challenge
Create three features per user for a 60-day window: (1) total_spend_60d, (2) distinct_merchants_60d, (3) share_high_value_txns_60d, the count of transactions with amount >= 100 divided by total transactions. Ensure safe division and one row per user. Validate with two hand-checked users.