Why this matters
Time-based features turn raw timestamps into predictive signal. As a Data Scientist, you will forecast demand, detect anomalies, predict churn, and plan capacity. Lags capture how the past influences the future; rolling windows summarize recent behavior. Getting these right avoids leakage, boosts model accuracy, and makes features interpretable for stakeholders.
- Real tasks: sales forecasting, energy load prediction, fraud spikes, user retention trends, SLA breach risk.
- Impact: more stable models, better early warning signals, and actionable operational insights.
Who this is for
- Data Scientists and Analysts building predictive or monitoring models on time-stamped data.
- MLOps/Analytics Engineers implementing feature pipelines.
Prerequisites
- Comfort with pandas or SQL window functions.
- Understanding of train/validation/test splits.
- Basic stats: mean, std, median, percent change.
Concept explained simply
- Lag: the value from k steps ago (e.g., yesterday's sales).
- Rolling window: a moving slice of the past (e.g., last 7 days) summarized by functions (mean, sum, std, min/max).
- Expanding window: from the beginning up to now (cumulative stats).
- Horizon: how far ahead you predict (e.g., t+7 days). Choose lags/rolls that strictly use data before the forecast timestamp.
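The four ideas above can be sketched in a few lines of pandas (a minimal example with made-up values):

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 15],
              index=pd.date_range("2023-01-01", periods=5, freq="D"))

lag_1 = s.shift(1)                         # value from 1 step ago
roll3 = s.shift(1).rolling(3).mean()       # mean of the previous 3 days
expand = s.shift(1).expanding().mean()     # mean of everything before today

print(lag_1.iloc[1])   # 10.0
print(roll3.iloc[4])   # (12 + 11 + 13) / 3 = 12.0
print(expand.iloc[4])  # (10 + 12 + 11 + 13) / 4 = 11.5
```

Note the `shift(1)` in every feature: it moves the window one step back so nothing at time t sees the value at t.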
Mental model
Imagine a conveyor belt of time. At each moment t, you are allowed to look behind you (lags, rolling windows) but never ahead. Your features are little gauges summarizing what happened just behind you.
Leakage rules
- Never compute features using future rows relative to the prediction timestamp.
- When doing cross-validation for time series, fit scalers/encoders only on training folds.
- For grouped data (e.g., per store/user), compute lags/rolls within each group.
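The per-group rule can be sketched on toy data (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "date": pd.to_datetime(["2023-01-01", "2023-01-02"] * 2),
    "sales": [10, 12, 100, 110],
}).sort_values(["store", "date"])

# Wrong: a plain shift leaks store A's last value into store B's first row
df["lag_1_leaky"] = df["sales"].shift(1)

# Right: shift within each store so groups never see each other's history
df["lag_1"] = df.groupby("store")["sales"].shift(1)

print(df["lag_1_leaky"].iloc[2])  # 12.0 -- store A's value leaked into store B
print(df["lag_1"].iloc[2])        # NaN  -- store B has no prior history
```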
Worked examples
Example 1: Daily sales forecasting
Goal: Predict sales for day t using past values.
# pandas example
# df columns: date, sales (daily)
import pandas as pd

df['date'] = pd.to_datetime(df['date'])
df = df.sort_values('date')
df['lag_1'] = df['sales'].shift(1)
df['lag_7'] = df['sales'].shift(7)
# Right-aligned: uses the previous 7 days, excludes today
df['roll7_mean'] = df['sales'].shift(1).rolling(window=7).mean()
df['roll7_std'] = df['sales'].shift(1).rolling(window=7).std()
# Split by date to avoid leakage
train = df[df['date'] < '2023-07-01']
valid = df[(df['date'] >= '2023-07-01') & (df['date'] < '2023-08-01')]
Why shift before rolling? It ensures today's features never peek at today's target.
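To see the difference, compare the two orderings on a toy series (illustrative numbers):

```python
import pandas as pd

sales = pd.Series([10, 12, 11, 13, 15, 14, 16, 20])

leaky = sales.rolling(7).mean()           # window INCLUDES today's value
safe = sales.shift(1).rolling(7).mean()   # window ends yesterday

# At the last row, the leaky mean contains today's 20; the safe one does not
print(leaky.iloc[-1])  # mean(12..20) = 101/7, about 14.43
print(safe.iloc[-1])   # mean(10..16) = 13.0
```

If "today's sales" is also your target, the leaky version hands the model part of the answer during training, and performance will collapse in production.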
Example 2: Customer churn risk
Goal: Predict if a user will churn in the next 30 days using behavioral features.
- days_since_last_event: difference between t and last event timestamp.
- events_last_7d: rolling 7-day count of actions per user.
- rolling_28d_amount_mean: average spend in last 28 days.
# df: user_id, ts, event_count, amount
import pandas as pd

# Ensure per-user time order
df = df.sort_values(['user_id', 'ts'])
# Grouped features: shift within each user so users never leak into each other
g = df.groupby('user_id')['event_count']
df['lag_events_1'] = g.shift(1)
# Time-based rolling needs a DatetimeIndex; closed='left' excludes the current row
df = df.set_index('ts')
df['events_last_7d'] = (df.groupby('user_id')['event_count']
                          .transform(lambda s: s.rolling('7D', closed='left').sum()))
With irregular timestamps, consider resampling to daily per user, then apply rolling windows on a regular grid.
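The resampling approach can be sketched like this (toy data with hypothetical columns user_id, ts, event_count):

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": ["U1"] * 4,
    "ts": pd.to_datetime(["2023-05-01", "2023-05-02", "2023-05-05", "2023-05-10"]),
    "event_count": [1, 2, 1, 3],
})

# Resample each user to a daily grid (days with no events become 0),
# then count events in the previous 7 days, excluding today
daily = (df.set_index("ts")
           .groupby("user_id")["event_count"]
           .resample("D").sum())
events_last_7d = (daily.groupby(level="user_id")
                       .transform(lambda s: s.shift(1)
                                             .rolling(7, min_periods=1).sum()))

print(events_last_7d.loc[("U1", "2023-05-10")])  # 1.0 (only the 05-05 event)
```

Once the grid is regular, integer windows (`rolling(7)`) and time-offset windows (`rolling('7D')`) agree, which makes the features easier to reason about.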
Example 3: Anomaly detection with rolling z-score
Goal: Flag unusual spikes in metrics.
# df: timestamp, metric (e.g., hourly)
m = df['metric'].shift(1)  # exclude the current point from its own baseline
mean = m.rolling(window=24).mean()
std = m.rolling(window=24).std()
df['zscore_24'] = (df['metric'] - mean) / std
# High |z| (e.g., > 3) suggests an anomaly
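Putting it together on synthetic data (the 3-sigma threshold is a common but arbitrary choice):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
metric = pd.Series(rng.normal(100, 5, 200))
metric.iloc[150] = 200  # inject a spike

m = metric.shift(1)  # baseline never includes the point being scored
z = (metric - m.rolling(24).mean()) / m.rolling(24).std()
anomalies = z[z.abs() > 3]
print(150 in anomalies.index)  # True: the injected spike is flagged
```

Because the spike itself is excluded from its own window, the baseline stays clean at the moment of the anomaly; only the windows after it are temporarily inflated.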
Choosing window sizes
- Match seasonality (7-day for weekly seasonality, 24-hour for daily hourly data).
- Use multiple windows (short + long) to capture short trends and baseline.
- Validate via time-based CV; do not trust in-sample gains.
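A short-plus-long window pair might look like this (daily data with weekly seasonality assumed; values are made up):

```python
import pandas as pd

sales = pd.Series(range(1, 61),
                  index=pd.date_range("2023-01-01", periods=60, freq="D"))

past = sales.shift(1)
feat = pd.DataFrame({
    "roll7_mean": past.rolling(7).mean(),    # recent trend (one season)
    "roll28_mean": past.rolling(28).mean(),  # longer-run baseline
})
# Their ratio is a simple "momentum" signal: recent level vs. baseline
feat["momentum"] = feat["roll7_mean"] / feat["roll28_mean"]
```

A momentum ratio above 1 says the last week is running hotter than the last month, which is exactly the kind of interpretable signal stakeholders can act on.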
Step-by-step workflow
- Sort data by time (and by entity for panel data).
- Define prediction horizon (e.g., t+1 day). All features must be strictly before t.
- Create lags (1, 7, 14, season length) per entity.
- Create rolling features (mean/sum/std/min/max) with right-aligned windows.
- Handle missing starts (NaNs) via careful imputation or allow model to learn "cold-start".
- Split using time-based folds; fit preprocessing only on training folds.
- Monitor feature drift over time.
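The time-based split step above can be sketched with scikit-learn's TimeSeriesSplit (assuming rows are already time-sorted):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(-1, 1)  # stand-in features, time-sorted
tscv = TimeSeriesSplit(n_splits=4)

for train_idx, valid_idx in tscv.split(X):
    # Training indices always strictly precede validation indices
    assert train_idx.max() < valid_idx.min()
    # Fit preprocessing on the training fold only, then apply to validation
    scaler = StandardScaler().fit(X[train_idx])
    X_valid = scaler.transform(X[valid_idx])
```

Unlike shuffled K-fold, each fold trains on the past and validates on the future, mirroring how the model will actually be used.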
Feature checklist
- Data sorted by time within each group
- Lags/rolls use shift before rolling
- Grouped by entity (store/user) where applicable
- Windows match natural seasonality
- No future leakage in any transform
- Validated with time-series CV
Exercises
Do these to solidify the concepts.
Exercise 1 — Daily sales lags and rolling
You have daily sales:
date sales
2023-01-01 10
2023-01-02 12
2023-01-03 11
2023-01-04 13
2023-01-05 15
2023-01-06 14
2023-01-07 16
2023-01-08 20
2023-01-09 18
Create features: lag_1, lag_7, roll3_mean (right-aligned, exclude today). Show the resulting table for the last three dates.
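One way to compute these features in pandas, to check your answers against:

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=9, freq="D"),
    "sales": [10, 12, 11, 13, 15, 14, 16, 20, 18],
})
df["lag_1"] = df["sales"].shift(1)
df["lag_7"] = df["sales"].shift(7)
# Right-aligned 3-day mean, excluding today
df["roll3_mean"] = df["sales"].shift(1).rolling(3).mean()
print(df.tail(3))
```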
Solution
For 2023-01-07:
- lag_1 = 14
- lag_7 = NaN (no 7-day history)
- roll3_mean = mean(2023-01-04..06) = (13+15+14)/3 = 14.0
For 2023-01-08:
- lag_1 = 16
- lag_7 = 10 (sales on 2023-01-01, 7 days earlier)
- roll3_mean = mean(2023-01-05..07) = (15+14+16)/3 = 15.0
For 2023-01-09:
- lag_1 = 20
- lag_7 = 12 (sales on 2023-01-02, 7 days earlier)
- roll3_mean = mean(2023-01-06..08) = (14+16+20)/3 ≈ 16.67
Exercise 2 — Per-user features
Events for user U1 (timestamps):
2023-05-01 1 event
2023-05-02 2 events
2023-05-05 1 event
2023-05-10 3 events
Compute for each event row: days_since_last_event and events_last_7d (count of events in the previous 7 days, excluding the current row). Show values for the last two rows.
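A pandas sketch to check your answers (a time-based rolling sum with closed='left' excludes the current row):

```python
import pandas as pd

df = pd.DataFrame({
    "ts": pd.to_datetime(["2023-05-01", "2023-05-02",
                          "2023-05-05", "2023-05-10"]),
    "events": [1, 2, 1, 3],
})
# Gap to the previous event, in whole days
df["days_since_last_event"] = df["ts"].diff().dt.days

# Events in the previous 7 days, excluding the current row:
# offset windows need a DatetimeIndex; closed='left' drops the right endpoint
df = df.set_index("ts")
df["events_last_7d"] = df["events"].rolling("7D", closed="left").sum()
print(df)
```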
Solution
Row 3 (2023-05-05):
- days_since_last_event = 3 days (from 2023-05-02)
- events_last_7d = events on 2023-05-01 and 2023-05-02 = 1 + 2 = 3
Row 4 (2023-05-10):
- days_since_last_event = 5 days (from 2023-05-05)
- events_last_7d = events between 2023-05-03 and 2023-05-09: only 2023-05-05 (1 event) => 1
Common mistakes and self-checks
Common mistakes
- Leakage from improper rolling: computing rolling without shift includes the target day.
- Mixing entities: forgetting groupby leads to cross-entity leakage.
- Misaligned windows: left vs right alignment confusion.
- Imputing with future knowledge: forward/back-filling across the split boundary.
- Resampling pitfalls: aggregating using full-range stats before splitting.
Self-check
- When you predict at time t, can any feature use data from t or later? Answer: No.
- Do features change if you shuffle rows? If yes, your method depends on order and is risky.
- Did you validate with time-based folds and freeze preprocessing to the training fold only?
Practical projects
- Retail demand forecast: Build lag_1, lag_7, roll7_mean/std, holiday flags, and evaluate with time-series CV.
- Energy load modeling: Hourly data with lag_1, lag_24, roll24/168 means; add temperature rolling stats.
- Support ticket volume: Weekly forecast with expanding mean and recent-2-week momentum features.
Learning path
- Time-based features (this lesson).
- Time-aware cross-validation and backtesting.
- Stationarity, differencing, and seasonal decomposition.
- Feature selection and importance over time.
- Deployment: reproducible feature pipelines with scheduled recomputation.
Next steps
- Apply lags/rolling to one of your datasets.
- Run a simple model baseline; add features incrementally and track uplift.
- Take the quick test to verify understanding.
Mini challenge
You must forecast daily volume 28 days ahead. Propose 6 features (mix of lags and rolling windows) that avoid leakage and capture weekly seasonality. Write one sentence explaining each feature's intent and why it respects the t+28 horizon.