Why this matters
Great models start with a reliable feature matrix X: rows are observations, columns are numeric features, and nothing leaks from the target. As a Data Scientist, you will:
- Turn raw tables (users, orders, logs) into model-ready X and y.
- Aggregate events (e.g., transactions) to customer-level features.
- Encode categories, handle dates, and prevent data leakage in time-based data.
- Guarantee consistent columns for train/validation/production.
Who this is for
- Beginner–intermediate data scientists who know basic pandas and NumPy.
- Anyone preparing data for classical ML (logistic regression, tree models, gradient boosting).
Prerequisites
- Python, pandas basics (DataFrame, index, groupby, merge).
- NumPy basics (arrays, vectorized operations).
- Basic ML vocabulary (features, target, train/test split).
Concept explained simply
A feature matrix is a 2D table (X) where each row is one training example and each column is a numeric descriptor of that example. The target y is a separate 1D array aligned by index.
- Rows = what you want to predict for (users, sessions, products).
- Columns = numeric features derived from raw data (counts, flags, one-hots, rolling means).
- No target info inside X. No future info when predicting the past.
Mental model
Think in layers:
- Define the prediction grain: one row per user/order/session at a specific time.
- Collect sources: static attributes, recent behavior, historical aggregates.
- Transform to numbers: impute, encode, aggregate, scale (if needed).
- Align indices and columns: X.index equals y.index, same feature columns across splits.
- Freeze logic so new data produces the same columns (stable schema).
Practical recipe
- Choose the row unit (e.g., one row per customer on a cutoff date).
- As-of time: decide the latest timestamp allowed per row to avoid leakage.
- Clean numerics: fill missing, cap outliers if needed.
- Encode categoricals: one-hot or ordinal mapping; avoid object dtype in X.
- Aggregate events: groupby, pivot, rolling windows.
- Assemble X: merge features; ensure consistent columns; fill NaNs.
- Validate: check shapes/dtypes, no leakage, X.index == y.index.
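Putting the recipe together, here is a minimal skeleton, assuming hypothetical events and users tables keyed by user_id (adapt the names and steps to your data):
import pandas as pd
def build_X(events, users, asof):
    """Illustrative skeleton of the recipe; table and column names are assumptions."""
    past = events[events['ts'] <= asof]                      # as-of slice: no future rows
    agg = past.groupby('user_id').agg(n_events=('ts', 'count'),
                                      total_amount=('amount', 'sum'))
    X = users.join(agg)                                      # assemble: static + behavioral
    X[['n_events', 'total_amount']] = X[['n_events', 'total_amount']].fillna(0)
    X['age'] = X['age'].fillna(X['age'].median())            # clean numerics
    X = pd.get_dummies(X, columns=['city'], dtype=int)       # encode categoricals
    assert X.isna().sum().sum() == 0                         # validate
    return X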
Tips that save hours
- Use pd.get_dummies(..., dtype=int) to get compact 0/1 integer one-hots.
- Reindex train/valid/test to the same columns, filling missing columns with 0 (see the sketch after this list).
- For time data, slice transactions up to the row's cutoff timestamp before aggregating.
- Prefer vectorized operations and groupby over Python loops.
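A minimal sketch of the reindex tip, assuming X_train, X_valid, and X_test already exist:
feature_columns = list(X_train.columns)  # freeze the training schema
X_valid = X_valid.reindex(columns=feature_columns, fill_value=0)  # add missing as 0, drop extras
X_test = X_test.reindex(columns=feature_columns, fill_value=0)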
Worked examples
Example 1 — Basic tabular features
import pandas as pd
import numpy as np
users = pd.DataFrame({
    'age': [25, 30, np.nan, 22],
    'income': [50000, 64000, 58000, np.nan],
    'city': ['SF', 'NY', 'SF', 'LA'],
    'signup_date': pd.to_datetime(['2023-01-05', '2022-11-10', '2023-03-22', '2023-03-01']),
    'churned': [0, 1, 0, 1],
}, index=[101, 102, 103, 104])
# Target
y = users['churned']
# Features
X = users.drop(columns=['churned']).copy()
X['age'] = X['age'].fillna(X['age'].median())          # median imputation
X['income'] = X['income'].fillna(X['income'].median())
X['signup_year'] = X['signup_date'].dt.year            # numeric date part
X = X.drop(columns=['signup_date'])                    # drop the raw datetime column
X = pd.get_dummies(X, columns=['city'], dtype=int)     # one-hot encode city
# Final checks
assert X.isna().sum().sum() == 0
assert X.index.equals(y.index)
print(X)
Takeaways: impute numerics, derive date parts, one-hot categoricals, keep index aligned.
Example 2 — Aggregating transactions to customer level
tx = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 3],
    'amount': [20, 35, 15, 50, 10, 5],
    'category': ['A', 'B', 'A', 'A', 'C', 'B'],
    'ts': pd.to_datetime(['2023-03-01', '2023-03-05', '2023-03-07', '2023-03-02', '2023-03-09', '2023-03-08']),
})
asof = pd.Timestamp('2023-03-08')
# Slice to avoid leakage
past = tx[tx['ts'] <= asof]
# 7-day window features
win = past[past['ts'] >= asof - pd.Timedelta(days=7)]
last7 = win.groupby('user_id').agg(total_spend_7d=('amount', 'sum'),
                                   n_tx_7d=('amount', 'count'))
# Lifetime features (up to asof)
lifetime = past.groupby('user_id').agg(avg_ticket_lifetime=('amount','mean'))
# Category spend pivot
cat = past.pivot_table(index='user_id', columns='category', values='amount', aggfunc='sum', fill_value=0)
cat.columns = [f"spend_cat_{c}" for c in cat.columns]
# Assemble X
X = last7.join(lifetime, how='outer').join(cat, how='outer').fillna(0)
print(X.sort_index())
Takeaways: filter by as-of time first, then aggregate; fill missing users/categories with zeros.
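If the targets live in a separate table, align y to X by index before modeling. A minimal sketch, assuming a hypothetical labels Series keyed by user_id:
labels = pd.Series({1: 0, 2: 1, 3: 0}, name='churned')  # hypothetical per-user targets
X = X.sort_index()
y = labels.loc[X.index]          # reorder y to match X's rows exactly
assert X.index.equals(y.index)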
Example 3 — Simple text/date-derived features
df = pd.DataFrame({
    'id': [1, 2, 3],
    'bio': ['Loves hiking and data', 'Coffee enthusiast', ''],
    'last_active': pd.to_datetime(['2023-03-10', '2023-03-01', '2023-02-25']),
}).set_index('id')
asof = pd.Timestamp('2023-03-10')
X = pd.DataFrame(index=df.index)                  # start from the target grain's index
X['bio_len'] = df['bio'].str.len().fillna(0)      # characters in bio
X['bio_word_count'] = df['bio'].str.split().str.len().fillna(0).astype(int)  # tokens in bio
X['days_since_active'] = (asof - df['last_active']).dt.days.clip(lower=0)    # recency, floored at 0
print(X)
Takeaways: derive numeric features from text length and dates; avoid raw text in X unless vectorized separately.
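If you do want token-level features rather than just lengths, vectorize the text in a separate step and join the result back. A minimal sketch using scikit-learn's CountVectorizer (the choice of vectorizer is an assumption; fit it on training rows only so the column schema stays stable):
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
bow = vec.fit_transform(df['bio'].fillna(''))    # sparse matrix, one column per token
bow_df = pd.DataFrame(bow.toarray(), index=df.index,
                      columns=[f'bio_tok_{t}' for t in vec.get_feature_names_out()])
X = X.join(bow_df)                               # numeric token counts, index-aligned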
Exercises
These mirror the graded exercises below. Work in your own notebook, then compare your answers using the solution revealers.
- Checklist before you run models (codified as asserts right after this list):
  - No NaNs in X
  - No object dtype columns in X
  - X.index equals y.index
  - Columns are consistent across splits
  - No future info used for any row
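One way to codify the checklist; validate_X is a hypothetical helper, and the no-future-info item still needs a human review of your as-of logic:
def validate_X(X, y, expected_columns):
    assert X.isna().sum().sum() == 0, 'NaNs in X'
    assert len(X.select_dtypes(include='object').columns) == 0, 'object dtype in X'
    assert X.index.equals(y.index), 'X and y indices differ'
    assert list(X.columns) == list(expected_columns), 'column schema drifted'
    # No automatic check for future leakage: audit the as-of slicing by hand.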
Exercise 1 — Build a clean X from mixed types
Use the small dataset to create X with: imputed numerics, signup_year, and one-hot city.
import pandas as pd
import numpy as np
users = pd.DataFrame({
    'age': [25, 30, np.nan, 22],
    'income': [50000, 64000, 58000, np.nan],
    'city': ['SF', 'NY', 'SF', 'LA'],
    'signup_date': pd.to_datetime(['2023-01-05', '2022-11-10', '2023-03-22', '2023-03-01']),
    'churned': [0, 1, 0, 1],
}, index=[101, 102, 103, 104])
# Task: produce X with columns [age, income, signup_year, city_LA, city_NY, city_SF]
# with missing imputed (median for numerics), and no NaNs.
Exercise 2 — Aggregate transactions with a 7-day window
Create customer features as of 2023-03-08: total_spend_7d, n_tx_7d, avg_ticket_lifetime, and spend per category A/B/C.
import pandas as pd
tx = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 3],
    'amount': [20, 35, 15, 50, 10, 5],
    'category': ['A', 'B', 'A', 'A', 'C', 'B'],
    'ts': pd.to_datetime(['2023-03-01', '2023-03-05', '2023-03-07', '2023-03-02', '2023-03-09', '2023-03-08']),
})
# Task: filter by as-of date, build the requested features, return a DataFrame indexed by user_id.
Common mistakes and self-check
Leakage from the future
Fix: define an as-of timestamp per row; slice data to ts <= asof BEFORE aggregations.
Mismatched indices between X and y
Fix: align via X = X.loc[y.index]; assert X.index.equals(y.index).
Object dtype sneaking into X
Fix: one-hot or map to numbers; check X.select_dtypes(include='object').columns is empty.
Inconsistent columns across splits
Fix: keep a master list of columns; reindex each split to it, filling missing columns with 0 and dropping extras.
Practical projects
- Churn baseline: customer features from usage logs (counts, recency, distinct actions) and a simple logistic regression.
- Basket value prediction: aggregate last-14-day spend, item diversity, and category one-hots; predict next order value.
- Fraud signals: rolling sums, velocity features, and cross-features (amount × hour-of-day; sketched below) for a gradient boosting model.
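For the cross-feature idea, a minimal sketch, assuming a tx frame with amount and ts columns:
tx['hour'] = tx['ts'].dt.hour                    # hour of day, 0-23
tx['amount_x_hour'] = tx['amount'] * tx['hour']  # simple multiplicative cross-feature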
Mini challenge
You have sessions with columns [user_id, duration_sec, device, ts]. Build one row per user as of a given date with: total_duration_7d, sessions_7d, avg_duration_lifetime, and device one-hots (for devices seen up to the as-of date). Ensure there are no NaNs and that columns stay consistent even if a device category is missing.
Learning path
- Before: Data cleaning, joins, and indexing in pandas.
- Now: Feature Matrix Construction (this lesson).
- Next: Train/validation splits and evaluation; then feature scaling/selection; finally, modeling pipelines.
Next steps
- Complete the exercises and reveal the solutions to compare.
- Open the Quick Test at the end of this page to check mastery. The test is available to everyone; only logged-in users get saved progress.
- Package your transformations as reusable functions for repeatability.