Note: Quick Test and exercises are available to everyone; only logged-in users get saved progress.
Why this matters
Data Scientists frequently need time-based features for forecasting, anomaly detection, growth metrics, and user behavior modeling. Window functions let you compute rolling averages, running totals, time since last event, and percent changes efficiently in SQL—close to the raw data—before modeling.
- Forecasting: build rolling means and lags as model inputs.
- Product analytics: sessionize events and compute time-between-actions.
- Finance/RevOps: running revenue, week-over-week or month-over-month change.
- IoT/Monitoring: rolling minima/maxima and gaps to detect anomalies.
Who this is for
- Data Scientists and Analysts preparing time features for models or dashboards.
- Anyone comfortable with basic SQL who needs reliable time-based aggregations.
Prerequisites
- Basic SQL SELECT, WHERE, GROUP BY, ORDER BY.
- Comfort with timestamps/dates and simple aggregations (SUM, AVG, COUNT).
- Understanding that window functions do not reduce rows like GROUP BY does.
Concept explained simply
A window function computes a value for each row, looking at other rows defined by a window (partition and order). For time features, we usually:
- PARTITION BY an entity (user_id, product_id, store_id).
- ORDER BY event time (event_time, order_date).
- Pick a frame (e.g., last 6 rows plus current) to compute rolling stats.
Mental model
Imagine each row carries a small magnifying glass. The partition decides which rows are in the same group. The order aligns them by time. The frame says how far back/forward the magnifying glass can look. The function (AVG, SUM, LAG) inspects only the visible rows to compute the feature for the current row.
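To make this concrete, here is a minimal sketch, assuming a hypothetical events(user_id, event_time, value) table. Note that every input row stays in the output; nothing collapses the way it would with GROUP BY.
-- Hypothetical table: events(user_id, event_time, value)
SELECT
  user_id,
  event_time,
  value,
  AVG(value) OVER (
    PARTITION BY user_id                      -- which rows share the magnifying glass
    ORDER BY event_time                       -- align them by time
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW  -- how far back it may look
  ) AS value_3row_avg                         -- the function computes one value per row
FROM events;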
Key patterns you will use
- LAG/LEAD: grab a previous/next value to compute differences or gaps (a LEAD sketch follows this list; LAG appears in Example 3).
- Running totals: SUM(...) OVER (PARTITION BY ... ORDER BY ... ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW).
- Rolling windows: AVG(...) OVER (PARTITION BY ... ORDER BY ... ROWS BETWEEN k PRECEDING AND CURRENT ROW).
- Time bucketing: aggregate per day/week/month, then apply window logic at that grain.
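LAG is shown in Example 3 below; LEAD is its forward-looking twin. A minimal sketch, assuming a hypothetical events(user_id, event_time) table and an engine where timestamp subtraction yields an interval, computes the time until the next event:
-- Hypothetical table: events(user_id, event_time)
SELECT
  user_id,
  event_time,
  LEAD(event_time) OVER (
    PARTITION BY user_id
    ORDER BY event_time
  ) - event_time AS time_to_next_event  -- NULL on each user's last event
FROM events
ORDER BY user_id, event_time;
Because LEAD looks forward, treat a column like time_to_next_event as a label or target rather than a model feature, or you will leak future information.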
Worked examples
Example 1 — Running daily revenue per store
First aggregate to daily grain, then compute a running total.
-- Given table orders(order_id, store_id, order_time, revenue)
-- 1) Aggregate to daily
WITH daily AS (
  SELECT
    store_id,
    CAST(order_time AS DATE) AS order_date,
    SUM(revenue) AS daily_revenue
  FROM orders
  GROUP BY store_id, CAST(order_time AS DATE)
)
-- 2) Compute the running total over the daily grain
SELECT
  store_id,
  order_date,
  daily_revenue,
  SUM(daily_revenue) OVER (
    PARTITION BY store_id
    ORDER BY order_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS running_revenue
FROM daily
ORDER BY store_id, order_date;
Use this for cumulative KPIs and as a growth-trend feature for models.
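The same pattern extends to the week-over-week change mentioned under "Why this matters": aggregate to a weekly grain, then compare each week to the previous one with LAG. A sketch, assuming a PostgreSQL-style DATE_TRUNC (week truncation syntax varies by engine):
-- Weekly revenue per store, then week-over-week change.
-- Assumes PostgreSQL-style DATE_TRUNC; other engines spell week truncation differently.
WITH weekly AS (
  SELECT
    store_id,
    DATE_TRUNC('week', order_time) AS week_start,
    SUM(revenue) AS weekly_revenue
  FROM orders
  GROUP BY store_id, DATE_TRUNC('week', order_time)
)
SELECT
  store_id,
  week_start,
  weekly_revenue,
  (weekly_revenue - LAG(weekly_revenue) OVER (PARTITION BY store_id ORDER BY week_start))
    / NULLIF(LAG(weekly_revenue) OVER (PARTITION BY store_id ORDER BY week_start), 0) AS wow_change
FROM weekly
ORDER BY store_id, week_start;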
Example 2 — 7-day rolling average of daily orders
On daily data, a row-based frame is simple and robust.
-- Given table daily_orders(store_id, order_date, orders_count)
SELECT
  store_id,
  order_date,
  orders_count,
  AVG(orders_count) OVER (
    PARTITION BY store_id
    ORDER BY order_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS orders_7d_avg
FROM daily_orders
ORDER BY store_id, order_date;
This uses up to the last 7 rows (days), including the current day.
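One caveat: a row-based frame counts rows, so if some days are missing, the seven rows can span more than seven calendar days. If your engine supports interval bounds in RANGE frames (PostgreSQL 11+ does; support and syntax vary elsewhere), a calendar-based sketch looks like this:
-- 7-calendar-day average, robust to missing days (requires RANGE with interval bounds).
SELECT
  store_id,
  order_date,
  orders_count,
  AVG(orders_count) OVER (
    PARTITION BY store_id
    ORDER BY order_date
    RANGE BETWEEN INTERVAL '6 days' PRECEDING AND CURRENT ROW
  ) AS orders_7d_avg
FROM daily_orders
ORDER BY store_id, order_date;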
Example 3 — Percent change and time since last event
Compare current value to the previous value and measure gaps.
-- Given table prices(product_id, price_time, price)
SELECT
  product_id,
  price_time,
  price,
  LAG(price) OVER (
    PARTITION BY product_id
    ORDER BY price_time
  ) AS prev_price,
  -- The LAG expression is repeated below because most engines do not let you
  -- reference the prev_price alias within the same SELECT list.
  CASE
    WHEN LAG(price) OVER (PARTITION BY product_id ORDER BY price_time) IS NULL
      OR LAG(price) OVER (PARTITION BY product_id ORDER BY price_time) = 0
    THEN NULL
    ELSE (price - LAG(price) OVER (PARTITION BY product_id ORDER BY price_time))
         / NULLIF(LAG(price) OVER (PARTITION BY product_id ORDER BY price_time), 0)
  END AS pct_change,
  (price_time - LAG(price_time) OVER (
    PARTITION BY product_id
    ORDER BY price_time
  )) AS time_since_last
FROM prices
ORDER BY product_id, price_time;
pct_change is safe against NULL and zero previous prices. time_since_last is an interval; convert it to seconds or minutes as your SQL engine requires (a sketch follows).
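For example, a sketch assuming PostgreSQL, where EXTRACT(EPOCH FROM ...) turns an interval into seconds (other engines use functions such as DATEDIFF or TIMESTAMP_DIFF instead):
-- PostgreSQL-style conversion of the gap to seconds; adjust the function for your engine.
SELECT
  product_id,
  price_time,
  price,
  EXTRACT(EPOCH FROM price_time - LAG(price_time) OVER (
    PARTITION BY product_id
    ORDER BY price_time
  )) AS seconds_since_last
FROM prices
ORDER BY product_id, price_time;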
Practice exercises
Do these in your SQL environment. Keep results deterministic by aggregating to daily grain first when needed.
Exercise 1 — 7-day rolling average of daily orders per store
Table: daily_orders(store_id, order_date DATE, orders_count INT)
- Return store_id, order_date, orders_count.
- Add a column orders_7d_avg: average of orders_count over the last 7 days (including current day) within the same store_id.
- Order by store_id, order_date.
Hints
- Use PARTITION BY store_id and ORDER BY order_date.
- Frame: ROWS BETWEEN 6 PRECEDING AND CURRENT ROW.
Expected output
Columns: store_id, order_date, orders_count, orders_7d_avg. For the first few days per store, the average uses fewer than 7 rows.
Exercise 2 — Percent change and gap since last price
Table: prices(product_id, price_time TIMESTAMP, price NUMERIC)
- Compute prev_price using LAG(price) by product_id ordered by price_time.
- Compute pct_change = (price - prev_price) / NULLIF(prev_price, 0).
- Compute time_since_last = price_time - LAG(price_time) by the same window.
Hints
- Use LAG twice: once for price, once for price_time.
- Protect against division by zero with NULLIF.
Expected output
Columns: product_id, price_time, price, prev_price, pct_change, time_since_last. The first row per product_id has NULL for prev_price, pct_change, and time_since_last.
Checklist before you move on
- You used PARTITION BY correctly for each entity.
- You ordered by a proper time column.
- You chose a frame that does not peek into the future.
- Edge rows (first N) behave as expected with partial windows.
Common mistakes and self-check
- Forgetting ORDER BY in the window: results become non-deterministic. Self-check: remove ORDER BY and see if results change—if yes, fix the window.
- Using RANGE for row-count windows on daily data: duplicates per day can distort. Prefer ROWS for exact N-row windows; use RANGE only when you truly need a time interval frame and your warehouse supports it.
- Leaking future data: frames like ROWS BETWEEN CURRENT ROW AND 6 FOLLOWING look ahead. For modeling features, use PRECEDING-only frames (a sketch follows this list).
- Missing PARTITION BY entity: windows run across all entities and mix features. Always partition by user/product/store as needed.
- Not aggregating to the right grain first: compute rolling averages on daily data, not raw events, if the feature is daily.
- Division by zero in percent change: use NULLIF(prev, 0).
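To make the leakage point concrete, here is a side-by-side sketch on the daily_orders table; only the frame differs between the two columns:
SELECT
  store_id,
  order_date,
  -- Leaks future data: averages today plus the next 6 days. Fine for reporting, not for model features.
  AVG(orders_count) OVER (
    PARTITION BY store_id
    ORDER BY order_date
    ROWS BETWEEN CURRENT ROW AND 6 FOLLOWING
  ) AS leaky_forward_avg,
  -- Safe for modeling: only today and the previous 6 days are visible.
  AVG(orders_count) OVER (
    PARTITION BY store_id
    ORDER BY order_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS backward_avg
FROM daily_orders
ORDER BY store_id, order_date;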
Quick self-checks
- Do counts per entity remain the same before vs. after adding window columns? (They should.)
- Is the first row per partition producing NULLs where expected (LAG)?
- Does the rolling mean for a constant series equal the constant?
- Do running totals never decrease?
Practical projects to reinforce learning
- Build forecasting features: daily revenue rolling 7/14/28-day means, running totals, and lags for each store.
- User activity model: compute time since last session, gap-based session length (sketched after this list), and rolling 7-day active-days per user.
- Price monitoring: pct_change, rolling volatility (STDDEV over a window), and time to recovery after a drop.
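For the gap-based sessions in the second project, a common recipe is to flag a new session whenever the gap from the previous event exceeds a cutoff, then turn the flags into session ids with a running sum. A sketch, assuming a hypothetical events(user_id, event_time) table, a 30-minute cutoff, and PostgreSQL-style interval syntax:
-- Flag a session boundary when the gap since the previous event exceeds 30 minutes,
-- then convert the boundary flags into session ids with a running sum.
WITH flagged AS (
  SELECT
    user_id,
    event_time,
    CASE
      WHEN LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) IS NULL
        OR event_time - LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time)
           > INTERVAL '30 minutes'
      THEN 1 ELSE 0
    END AS new_session_flag
  FROM events
)
SELECT
  user_id,
  event_time,
  SUM(new_session_flag) OVER (
    PARTITION BY user_id
    ORDER BY event_time
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS session_id
FROM flagged
ORDER BY user_id, event_time;
Session length then falls out as MAX(event_time) - MIN(event_time) grouped by user_id and session_id.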
Learning path
- Before this: basic aggregations and date/time functions; grouping by day/week/month.
- This lesson: window frames (ROWS vs RANGE), LAG/LEAD, rolling averages, running totals, percent change, time gaps.
- After this: feature stores or ETL patterns for time features, windowed joins, and advanced time bucketing or sessionization at scale.
Next steps
- Convert features into reusable views (daily grain) to keep modeling datasets fresh.
- Add validation queries that check monotonicity and null rates for new features (a starting sketch follows).
- Benchmark frame sizes (7/14/28) to see which improves model performance.
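As a starting point for those validation queries, a sketch assuming the features above were materialized into a hypothetical table daily_features(store_id, order_date, running_revenue, revenue_pct_change):
-- 1) Running totals should never decrease within a store (barring negative revenue such as refunds).
SELECT COUNT(*) AS monotonicity_violations
FROM (
  SELECT
    running_revenue,
    LAG(running_revenue) OVER (PARTITION BY store_id ORDER BY order_date) AS prev_running_revenue
  FROM daily_features
) t
WHERE running_revenue < prev_running_revenue;

-- 2) Null rate of a lag-based feature: should be low and explainable (expected NULL on each store's first day).
SELECT AVG(CASE WHEN revenue_pct_change IS NULL THEN 1.0 ELSE 0.0 END) AS null_rate
FROM daily_features;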
Mini challenge
Given events(user_id, event_time, value), produce for each row: (a) value minus previous value, (b) 3-event rolling average of value, and (c) time_since_last. Ensure no future leakage. Keep per-user partitions.
Ready for the Quick Test?
When you feel confident, take the Quick Test below. You can retake it any time. Only logged-in users get saved progress.