Why this matters
Time-based features turn raw timestamps into predictive signal. As a Data Scientist, you will forecast demand, detect anomalies, predict churn, and plan capacity. Lags capture how the past influences the future; rolling windows summarize recent behavior. Getting these right avoids leakage, boosts model accuracy, and makes features interpretable for stakeholders.
- Real tasks: sales forecasting, energy load prediction, fraud spikes, user retention trends, SLA breach risk.
- Impact: more stable models, better early warning signals, and actionable operational insights.
Who this is for
- Data Scientists and Analysts building predictive or monitoring models on time-stamped data.
- MLOps/Analytics Engineers implementing feature pipelines.
Prerequisites
- Comfort with pandas or SQL window functions.
- Understanding of train/validation/test splits.
- Basic stats: mean, std, median, percent change.
Concept explained simply
- Lag: the value from k steps ago (e.g., yesterday's sales).
- Rolling window: a moving slice of the past (e.g., last 7 days) summarized by functions (mean, sum, std, min/max).
- Expanding window: from the beginning up to now (cumulative stats).
- Horizon: how far ahead you predict (e.g., t+7 days). Choose lags/rolls that strictly use data before the forecast timestamp.
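The four ideas above can be sketched in a few lines of pandas (a minimal example with made-up values):

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 15],
              index=pd.date_range("2023-01-01", periods=5, freq="D"))

lag_1 = s.shift(1)                         # value from 1 step ago
roll3 = s.shift(1).rolling(3).mean()       # mean of the previous 3 days
expand = s.shift(1).expanding().mean()     # mean of everything before today

print(lag_1.iloc[1])   # 10.0
print(roll3.iloc[4])   # (12 + 11 + 13) / 3 = 12.0
print(expand.iloc[4])  # (10 + 12 + 11 + 13) / 4 = 11.5
```

Note the `shift(1)` in every feature: it moves the window one step back so nothing at time t sees the value at t.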
Mental model
Imagine a conveyor belt of time. At each moment t, you are allowed to look behind you (lags, rolling windows) but never ahead. Your features are little gauges summarizing what happened just behind you.
Leakage rules
- Never compute features using future rows relative to the prediction timestamp.
- When doing cross-validation for time series, fit scalers/encoders only on training folds.
- For grouped data (e.g., per store/user), compute lags/rolls within each group.
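The per-group rule can be sketched on toy data (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "date": pd.to_datetime(["2023-01-01", "2023-01-02"] * 2),
    "sales": [10, 12, 100, 110],
}).sort_values(["store", "date"])

# Wrong: a plain shift leaks store A's last value into store B's first row
df["lag_1_leaky"] = df["sales"].shift(1)

# Right: shift within each store so groups never see each other's history
df["lag_1"] = df.groupby("store")["sales"].shift(1)

print(df["lag_1_leaky"].iloc[2])  # 12.0 -- store A's value leaked into store B
print(df["lag_1"].iloc[2])        # NaN  -- store B has no prior history
```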
Worked examples
Example 1: Daily sales forecasting
Goal: Predict sales for day t using past values.
# pandas example
# df columns: date, sales (daily)
import pandas as pd

df['date'] = pd.to_datetime(df['date'])
df = df.sort_values('date')
df['lag_1'] = df['sales'].shift(1)
df['lag_7'] = df['sales'].shift(7)
# Right-aligned: uses the previous 7 days, excludes today
df['roll7_mean'] = df['sales'].shift(1).rolling(window=7).mean()
df['roll7_std'] = df['sales'].shift(1).rolling(window=7).std()
# Split by date to avoid leakage
train = df[df['date'] < '2023-07-01']
valid = df[(df['date'] >= '2023-07-01') & (df['date'] < '2023-08-01')]
Why shift before rolling? It ensures today's features never peek at today's target.
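To see the difference, compare the two orderings on a toy series (illustrative numbers):

```python
import pandas as pd

sales = pd.Series([10, 12, 11, 13, 15, 14, 16, 20])

leaky = sales.rolling(7).mean()           # window INCLUDES today's value
safe = sales.shift(1).rolling(7).mean()   # window ends yesterday

# At the last row, the leaky mean contains today's 20; the safe one does not
print(leaky.iloc[-1])  # mean(12..20) = 101/7, about 14.43
print(safe.iloc[-1])   # mean(10..16) = 13.0
```

If "today's sales" is also your target, the leaky version hands the model part of the answer during training, and performance will collapse in production.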
Example 2: Customer churn risk
Goal: Predict if a user will churn in the next 30 days using behavioral features.
- days_since_last_event: difference between t and last event timestamp.
- events_last_7d: rolling 7-day count of actions per user.
- rolling_28d_amount_mean: average spend in last 28 days.
# df: user_id, ts, event_count, amount
import pandas as pd

# Ensure per-user time order
df = df.sort_values(['user_id', 'ts'])
# Grouped features: shift within each user so users never leak into each other
g = df.groupby('user_id')['event_count']
df['lag_events_1'] = g.shift(1)
# Time-based rolling needs a DatetimeIndex; closed='left' excludes the current row
df = df.set_index('ts')
df['events_last_7d'] = (df.groupby('user_id')['event_count']
                          .transform(lambda s: s.rolling('7D', closed='left').sum()))
With irregular timestamps, consider resampling to daily per user, then apply rolling windows on a regular grid.
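The resampling approach can be sketched like this (toy data with hypothetical columns user_id, ts, event_count):

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": ["U1"] * 4,
    "ts": pd.to_datetime(["2023-05-01", "2023-05-02", "2023-05-05", "2023-05-10"]),
    "event_count": [1, 2, 1, 3],
})

# Resample each user to a daily grid (days with no events become 0),
# then count events in the previous 7 days, excluding today
daily = (df.set_index("ts")
           .groupby("user_id")["event_count"]
           .resample("D").sum())
events_last_7d = (daily.groupby(level="user_id")
                       .transform(lambda s: s.shift(1)
                                             .rolling(7, min_periods=1).sum()))

print(events_last_7d.loc[("U1", "2023-05-10")])  # 1.0 (only the 05-05 event)
```

Once the grid is regular, integer windows (`rolling(7)`) and time-offset windows (`rolling('7D')`) agree, which makes the features easier to reason about.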
Example 3: Anomaly detection with rolling z-score
Goal: Flag unusual spikes in metrics.
# df: timestamp, metric (e.g., hourly)
m = df['metric'].shift(1)  # exclude the current point from its own baseline
mean = m.rolling(window=24).mean()
std = m.rolling(window=24).std()
df['zscore_24'] = (df['metric'] - mean) / std
# High |z| (e.g., > 3) suggests an anomaly
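Putting it together on synthetic data (the 3-sigma threshold is a common but arbitrary choice):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
metric = pd.Series(rng.normal(100, 5, 200))
metric.iloc[150] = 200  # inject a spike

m = metric.shift(1)  # baseline never includes the point being scored
z = (metric - m.rolling(24).mean()) / m.rolling(24).std()
anomalies = z[z.abs() > 3]
print(150 in anomalies.index)  # True: the injected spike is flagged
```

Because the spike itself is excluded from its own window, the baseline stays clean at the moment of the anomaly; only the windows after it are temporarily inflated.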
Choosing window sizes
- Match seasonality (7-day for weekly seasonality, 24-hour for daily hourly data).
- Use multiple windows (short + long) to capture short trends and baseline.
- Validate via time-based CV; do not trust in-sample gains.
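A short-plus-long window pair might look like this (daily data with weekly seasonality assumed; values are made up):

```python
import pandas as pd

sales = pd.Series(range(1, 61),
                  index=pd.date_range("2023-01-01", periods=60, freq="D"))

past = sales.shift(1)
feat = pd.DataFrame({
    "roll7_mean": past.rolling(7).mean(),    # recent trend (one season)
    "roll28_mean": past.rolling(28).mean(),  # longer-run baseline
})
# Their ratio is a simple "momentum" signal: recent level vs. baseline
feat["momentum"] = feat["roll7_mean"] / feat["roll28_mean"]
```

A momentum ratio above 1 says the last week is running hotter than the last month, which is exactly the kind of interpretable signal stakeholders can act on.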
Step-by-step workflow
- Sort data by time (and by entity for panel data).
- Define prediction horizon (e.g., t+1 day). All features must be strictly before t.
- Create lags (1, 7, 14, season length) per entity.
- Create rolling features (mean/sum/std/min/max) with right-aligned windows.
- Handle missing starts (NaNs) via careful imputation or allow model to learn "cold-start".
- Split using time-based folds; fit preprocessing only on training folds.
- Monitor feature drift over time.
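The time-based split step above can be sketched with scikit-learn's TimeSeriesSplit (assuming rows are already time-sorted):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(-1, 1)  # stand-in features, time-sorted
tscv = TimeSeriesSplit(n_splits=4)

for train_idx, valid_idx in tscv.split(X):
    # Training indices always strictly precede validation indices
    assert train_idx.max() < valid_idx.min()
    # Fit preprocessing on the training fold only, then apply to validation
    scaler = StandardScaler().fit(X[train_idx])
    X_valid = scaler.transform(X[valid_idx])
```

Unlike shuffled K-fold, each fold trains on the past and validates on the future, mirroring how the model will actually be used.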
Feature checklist
- Data sorted by time within each group
- Lags/rolls use shift before rolling
- Grouped by entity (store/user) where applicable
- Windows match natural seasonality
- No future leakage in any transform
- Validated with time-series CV
Exercises
Do these to solidify the concepts.
Exercise 1 — Daily sales lags and rolling
You have daily sales:
date sales
2023-01-01 10
2023-01-02 12
2023-01-03 11
2023-01-04 13
2023-01-05 15
2023-01-06 14
2023-01-07 16
2023-01-08 20
2023-01-09 18
Create features: lag_1, lag_7, roll3_mean (right-aligned, exclude today). Show the resulting table for the last three dates.
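One way to compute these features in pandas, to check your answers against:

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=9, freq="D"),
    "sales": [10, 12, 11, 13, 15, 14, 16, 20, 18],
})
df["lag_1"] = df["sales"].shift(1)
df["lag_7"] = df["sales"].shift(7)
# Right-aligned 3-day mean, excluding today
df["roll3_mean"] = df["sales"].shift(1).rolling(3).mean()
print(df.tail(3))
```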
Solution
For 2023-01-07:
- lag_1 = 14
- lag_7 = NaN (no 7-day history)
- roll3_mean = mean(2023-01-04..06) = (13+15+14)/3 = 14.0
For 2023-01-08:
- lag_1 = 16
- lag_7 = 10 (sales on 2023-01-01, 7 days earlier)
- roll3_mean = mean(2023-01-05..07) = (15+14+16)/3 = 15.0
For 2023-01-09:
- lag_1 = 20
- lag_7 = 12 (sales on 2023-01-02, 7 days earlier)
- roll3_mean = mean(2023-01-06..08) = (14+16+20)/3 ≈ 16.67
Exercise 2 — Per-user features
Events for user U1 (timestamps):
2023-05-01 1 event
2023-05-02 2 events
2023-05-05 1 event
2023-05-10 3 events
Compute for each event row: days_since_last_event and events_last_7d (count of events in the previous 7 days, excluding the current row). Show values for the last two rows.
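A pandas sketch to check your answers (a time-based rolling sum with closed='left' excludes the current row):

```python
import pandas as pd

df = pd.DataFrame({
    "ts": pd.to_datetime(["2023-05-01", "2023-05-02",
                          "2023-05-05", "2023-05-10"]),
    "events": [1, 2, 1, 3],
})
# Gap to the previous event, in whole days
df["days_since_last_event"] = df["ts"].diff().dt.days

# Events in the previous 7 days, excluding the current row:
# offset windows need a DatetimeIndex; closed='left' drops the right endpoint
df = df.set_index("ts")
df["events_last_7d"] = df["events"].rolling("7D", closed="left").sum()
print(df)
```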
Solution
Row 3 (2023-05-05):
- days_since_last_event = 3 days (from 2023-05-02)
- events_last_7d = events on 2023-05-01 and 2023-05-02 = 1 + 2 = 3
Row 4 (2023-05-10):
- days_since_last_event = 5 days (from 2023-05-05)
- events_last_7d = events between 2023-05-03 and 2023-05-09: only 2023-05-05 (1 event) => 1
Common mistakes and self-checks
Common mistakes
- Leakage from improper rolling: computing rolling without shift includes the target day.
- Mixing entities: forgetting groupby leads to cross-entity leakage.
- Misaligned windows: left vs right alignment confusion.
- Imputing with future knowledge: forward/back-filling across the split boundary.
- Resampling pitfalls: aggregating using full-range stats before splitting.
Self-check
- When you predict at time t, can any feature use data from t or later? Answer: No.
- Do features change if you shuffle rows? If yes, your method depends on order and is risky.
- Did you validate with time-based folds and freeze preprocessing to the training fold only?
Practical projects
- Retail demand forecast: Build lag_1, lag_7, roll7_mean/std, holiday flags, and evaluate with time-series CV.
- Energy load modeling: Hourly data with lag_1, lag_24, roll24/168 means; add temperature rolling stats.
- Support ticket volume: Weekly forecast with expanding mean and recent-2-week momentum features.
Learning path
- Time-based features (this lesson).
- Time-aware cross-validation and backtesting.
- Stationarity, differencing, and seasonal decomposition.
- Feature selection and importance over time.
- Deployment: reproducible feature pipelines with scheduled recomputation.
Next steps
- Apply lags/rolling to one of your datasets.
- Run a simple model baseline; add features incrementally and track uplift.
- Take the quick test to verify understanding.
Mini challenge
You must forecast daily volume 28 days ahead. Propose 6 features (mix of lags and rolling windows) that avoid leakage and capture weekly seasonality. Write one sentence explaining each feature's intent and why it respects the t+28 horizon.