
Feature Matrix Construction

Learn Feature Matrix Construction for free with explanations, exercises, and a quick test, aimed at Data Scientists.

Published: January 1, 2026 | Updated: January 1, 2026

Why this matters

Great models start with a reliable feature matrix X: rows are observations, columns are numeric features, and no target leakage. As a Data Scientist, you will:

  • Turn raw tables (users, orders, logs) into model-ready X and y.
  • Aggregate events (e.g., transactions) to customer-level features.
  • Encode categories, handle dates, and prevent data leakage in time-based data.
  • Guarantee consistent columns for train/validation/production.

Who this is for

  • Beginner–intermediate data scientists who know basic pandas and NumPy.
  • Anyone preparing data for classical ML (logistic regression, tree models, gradient boosting).

Prerequisites

  • Python, pandas basics (DataFrame, index, groupby, merge).
  • NumPy basics (arrays, vectorized operations).
  • Basic ML vocabulary (features, target, train/test split).

Concept explained simply

A feature matrix is a 2D table (X) where each row is one training example and each column is a numeric descriptor of that example. The target y is a separate 1D array aligned by index.

  • Rows = what you want to predict for (users, sessions, products).
  • Columns = numeric features derived from raw data (counts, flags, one-hots, rolling means).
  • No target info inside X. No future info when predicting the past.
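To make that contract concrete, here is a minimal sketch with toy data (column names like n_orders are invented for illustration):

import pandas as pd

# Toy feature matrix: one row per user, all-numeric columns
X = pd.DataFrame({'n_orders': [3, 1, 7], 'is_active': [1, 0, 1]}, index=[10, 11, 12])
# Target: one label per row, sharing the same index
y = pd.Series([0, 1, 0], index=[10, 11, 12], name='churned')

assert X.shape[0] == y.shape[0]                  # one label per row
assert X.index.equals(y.index)                   # rows and labels line up
assert X.select_dtypes(include='object').empty   # no raw strings in X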

Mental model

Think in layers:

  1. Define the prediction grain: one row per user/order/session at a specific time.
  2. Collect sources: static attributes, recent behavior, historical aggregates.
  3. Transform to numbers: impute, encode, aggregate, scale (if needed).
  4. Align indices and columns: X.index equals y.index, same feature columns across splits.
  5. Freeze logic so new data produces the same columns (stable schema).

Practical recipe

  1. Choose the row unit (e.g., one row per customer on a cutoff date).
  2. As-of time: decide the latest timestamp allowed per row to avoid leakage.
  3. Clean numerics: fill missing, cap outliers if needed.
  4. Encode categoricals: one-hot or ordinal mapping; avoid object dtype in X.
  5. Aggregate events: groupby, pivot, rolling windows.
  6. Assemble X: merge features; ensure consistent columns; fill NaNs.
  7. Validate: check shapes/dtypes, no leakage, X.index == y.index.
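One way to string these steps together is a single builder function. The sketch below assumes hypothetical tables (users indexed by user_id; events with user_id and ts columns) and invented column names; treat it as a template, not a drop-in implementation:

import pandas as pd

def build_X(users: pd.DataFrame, events: pd.DataFrame, asof: pd.Timestamp) -> pd.DataFrame:
    # Step 2: as-of slice; never look past the cutoff
    past = events[events['ts'] <= asof]
    # Step 5: aggregate events to the row unit (one row per user)
    agg = past.groupby('user_id').agg(n_events=('ts', 'count'))
    # Steps 3-4: clean numerics and encode categoricals on the static table
    X = users[['age', 'plan']].copy()
    X['age'] = X['age'].fillna(X['age'].median())
    X = pd.get_dummies(X, columns=['plan'], dtype=int)
    # Step 6: assemble; users with no events get zero counts
    X = X.join(agg).fillna({'n_events': 0})
    # Step 7: validate
    assert X.isna().sum().sum() == 0
    return X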

Tips that save hours

  • Use pd.get_dummies(..., dtype=int) for compact one-hot columns.
  • Reindex train/valid/test to the same column list, filling missing columns with 0 (see the sketch after this list).
  • For time data, slice transactions up to the row's cutoff timestamp before aggregating.
  • Prefer vectorized operations and groupby over Python loops.
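
A sketch of the first two tips together (hypothetical splits; the point is freezing the column list from train):

import pandas as pd

train = pd.DataFrame({'city': ['SF', 'NY']})
test = pd.DataFrame({'city': ['LA']})  # category never seen in train

X_train = pd.get_dummies(train, columns=['city'], dtype=int)
feature_columns = X_train.columns.tolist()   # freeze the schema from train

X_test = pd.get_dummies(test, columns=['city'], dtype=int)
# Columns missing in test are filled with 0; test-only columns are dropped
X_test = X_test.reindex(columns=feature_columns, fill_value=0)

assert list(X_test.columns) == feature_columns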

Worked examples

Example 1 — Basic tabular features

import pandas as pd
import numpy as np

users = pd.DataFrame({
    'age': [25, 30, np.nan, 22],
    'income': [50000, 64000, 58000, np.nan],
    'city': ['SF', 'NY', 'SF', 'LA'],
    'signup_date': pd.to_datetime(['2023-01-05','2022-11-10','2023-03-22','2023-03-01']),
    'churned': [0,1,0,1]
}, index=[101,102,103,104])

# Target
y = users['churned']

# Features
X = users.drop(columns=['churned']).copy()
X['age'] = X['age'].fillna(X['age'].median())
X['income'] = X['income'].fillna(X['income'].median())
X['signup_year'] = X['signup_date'].dt.year
X = X.drop(columns=['signup_date'])
X = pd.get_dummies(X, columns=['city'], dtype=int)

# Final checks
assert X.isna().sum().sum() == 0
assert X.index.equals(y.index)
print(X)

Takeaways: impute numerics, derive date parts, one-hot categoricals, keep index aligned.

Example 2 — Aggregating transactions to customer level

tx = pd.DataFrame({
    'user_id':[1,1,1,2,2,3],
    'amount':[20,35,15,50,10,5],
    'category':['A','B','A','A','C','B'],
    'ts': pd.to_datetime(['2023-03-01','2023-03-05','2023-03-07','2023-03-02','2023-03-09','2023-03-08'])
})

asof = pd.Timestamp('2023-03-08')
# Slice to avoid leakage
past = tx[tx['ts'] <= asof]

# 7-day window features
win = past[past['ts'] >= asof - pd.Timedelta(days=7)]
last7 = win.groupby('user_id').agg(total_spend_7d=('amount','sum'),
                                   n_tx_7d=('amount','count'))

# Lifetime features (up to asof)
lifetime = past.groupby('user_id').agg(avg_ticket_lifetime=('amount','mean'))

# Category spend pivot
cat = past.pivot_table(index='user_id', columns='category', values='amount', aggfunc='sum', fill_value=0)
cat.columns = [f"spend_cat_{c}" for c in cat.columns]

# Assemble X
X = last7.join(lifetime, how='outer').join(cat, how='outer').fillna(0)
print(X.sort_index())

Takeaways: filter by as-of time first, then aggregate; fill missing users/categories with zeros.

Example 3 — Simple text/date-derived features

df = pd.DataFrame({
    'id':[1,2,3],
    'bio':['Loves hiking and data','Coffee enthusiast',''],
    'last_active': pd.to_datetime(['2023-03-10','2023-03-01','2023-02-25'])
}).set_index('id')

asof = pd.Timestamp('2023-03-10')
X = pd.DataFrame(index=df.index)
X['bio_len'] = df['bio'].str.len().fillna(0)
X['bio_word_count'] = df['bio'].str.split().apply(lambda x: len(x) if isinstance(x, list) else 0)
X['days_since_active'] = (asof - df['last_active']).dt.days.clip(lower=0)
print(X)

Takeaways: derive numeric features from text length and dates; avoid raw text in X unless vectorized separately.

Exercises

These mirror the graded exercises at the end of the page. Work in your own notebook, then compare your results using the solution revealers.

  • Checklist before you run models:
    • No NaNs in X
    • No object dtype columns in X
    • X.index equals y.index
    • Columns are consistent across splits
    • No future info used for any row

Exercise 1 — Build a clean X from mixed types

Use the small dataset to create X with: imputed numerics, signup_year, and one-hot city.

import pandas as pd
import numpy as np

users = pd.DataFrame({
    'age': [25, 30, np.nan, 22],
    'income': [50000, 64000, 58000, np.nan],
    'city': ['SF', 'NY', 'SF', 'LA'],
    'signup_date': pd.to_datetime(['2023-01-05','2022-11-10','2023-03-22','2023-03-01']),
    'churned': [0,1,0,1]
}, index=[101,102,103,104])

# Task: produce X with columns [age, income, signup_year, city_LA, city_NY, city_SF]
# with missing imputed (median for numerics), and no NaNs.

Exercise 2 — Aggregate transactions with a 7-day window

Create customer features as of 2023-03-08: total_spend_7d, n_tx_7d, avg_ticket_lifetime, and spend per category A/B/C.

import pandas as pd

tx = pd.DataFrame({
    'user_id':[1,1,1,2,2,3],
    'amount':[20,35,15,50,10,5],
    'category':['A','B','A','A','C','B'],
    'ts': pd.to_datetime(['2023-03-01','2023-03-05','2023-03-07','2023-03-02','2023-03-09','2023-03-08'])
})

# Task: filter by as-of date, build the requested features, return a DataFrame indexed by user_id.

Common mistakes and self-check

Leakage from the future

Fix: define an as-of timestamp per row; slice data to ts <= asof BEFORE aggregations.

Mismatched indices between X and y

Fix: align via X = X.loc[y.index]; assert X.index.equals(y.index).

Object dtype sneaking into X

Fix: one-hot or map to numbers; check X.select_dtypes(include='object').columns is empty.

Inconsistent columns across splits

Fix: keep a master list of columns; reindex each split to it, filling missing columns with 0 and dropping extras.
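
Most of these fixes can be bundled into one fail-fast helper (a sketch; the name validate_X is ours). Leakage still needs the as-of discipline above, since it cannot be asserted from X alone:

import pandas as pd

def validate_X(X: pd.DataFrame, y: pd.Series, expected_columns: list) -> None:
    """Fail fast on the common mistakes above."""
    assert X.index.equals(y.index), "X and y indices are misaligned"
    assert X.isna().sum().sum() == 0, "X contains NaNs"
    assert X.select_dtypes(include='object').columns.empty, "object dtype columns in X"
    assert list(X.columns) == list(expected_columns), "columns drifted from the master list"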

Practical projects

  • Churn baseline: customer features from usage logs (counts, recency, distinct actions) and a simple logistic regression.
  • Basket value prediction: aggregate last-14-day spend, item diversity, and category one-hots; predict next order value.
  • Fraud signals: rolling sums, velocity features, and cross-features (amount x hour-of-day) for a gradient boosting model.

Mini challenge

You have sessions with columns [user_id, duration_sec, device, ts]. Build one row per user as of a given date with: total_duration_7d, sessions_7d, avg_duration_lifetime, and device one-hots (devices seen up to the as-of date). Ensure no NaNs, and keep columns consistent even if a device category is missing.

Learning path

  • Before: Data cleaning, joins, and indexing in pandas.
  • Now: Feature Matrix Construction (this lesson).
  • Next: Train/validation splits and evaluation; then feature scaling/selection; finally, modeling pipelines.

Next steps

  • Complete the exercises and reveal the solutions to compare.
  • Open the Quick Test at the end of this page to check mastery. The test is available to everyone; only logged-in users get saved progress.
  • Package your transformations as reusable functions for repeatability.

Practice Exercises

2 exercises to complete

Instructions

Construct X from the provided users DataFrame with:

  • Median imputation for age and income
  • signup_year from signup_date
  • One-hot city with int dtype
  • No NaNs and no object dtype in X

Dataset

import pandas as pd
import numpy as np

users = pd.DataFrame({
    'age': [25, 30, np.nan, 22],
    'income': [50000, 64000, 58000, np.nan],
    'city': ['SF', 'NY', 'SF', 'LA'],
    'signup_date': pd.to_datetime(['2023-01-05','2022-11-10','2023-03-22','2023-03-01']),
    'churned': [0,1,0,1]
}, index=[101,102,103,104])

y = users['churned']

Expected Output

Index: [101, 102, 103, 104]
Columns: ['age', 'income', 'signup_year', 'city_LA', 'city_NY', 'city_SF']
Values (rows 101..104):
  • age: [25.0, 30.0, 25.0, 22.0]
  • income: [50000.0, 64000.0, 58000.0, 58000.0]
  • signup_year: [2023, 2022, 2023, 2023]
  • city_LA: [0, 0, 0, 1]
  • city_NY: [0, 1, 0, 0]
  • city_SF: [1, 0, 1, 0]

Feature Matrix Construction — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

