Why this matters
Great models start with a reliable feature matrix X: rows are observations, columns are numeric features, and nothing leaks from the target. As a Data Scientist, you will:
- Turn raw tables (users, orders, logs) into model-ready X and y.
- Aggregate events (e.g., transactions) to customer-level features.
- Encode categories, handle dates, and prevent data leakage in time-based data.
- Guarantee consistent columns for train/validation/production.
Who this is for
- Beginner–intermediate data scientists who know basic pandas and NumPy.
- Anyone preparing data for classical ML (logistic regression, tree models, gradient boosting).
Prerequisites
- Python, pandas basics (DataFrame, index, groupby, merge).
- NumPy basics (arrays, vectorized operations).
- Basic ML vocabulary (features, target, train/test split).
Concept explained simply
A feature matrix is a 2D table (X) where each row is one training example and each column is a numeric descriptor of that example. The target y is a separate 1D array aligned by index.
- Rows = what you want to predict for (users, sessions, products).
- Columns = numeric features derived from raw data (counts, flags, one-hots, rolling means).
- No target info inside X. No future info when predicting the past.
Mental model
Think in layers:
- Define the prediction grain: one row per user/order/session at a specific time.
- Collect sources: static attributes, recent behavior, historical aggregates.
- Transform to numbers: impute, encode, aggregate, scale (if needed).
- Align indices and columns: X.index equals y.index, same feature columns across splits.
- Freeze logic so new data produces the same columns (stable schema).
Practical recipe
- Choose the row unit (e.g., one row per customer on a cutoff date).
- As-of time: decide the latest timestamp allowed per row to avoid leakage.
- Clean numerics: fill missing, cap outliers if needed.
- Encode categoricals: one-hot or ordinal mapping; avoid object dtype in X.
- Aggregate events: groupby, pivot, rolling windows.
- Assemble X: merge features; ensure consistent columns; fill NaNs.
- Validate: check shapes/dtypes, no leakage, X.index == y.index.
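Putting the recipe together, here is a minimal skeleton, assuming hypothetical events and users tables keyed by user_id (adapt the names and steps to your data):
import pandas as pd
def build_X(events, users, asof):
    """Illustrative skeleton of the recipe; table and column names are assumptions."""
    past = events[events['ts'] <= asof]                      # as-of slice: no future rows
    agg = past.groupby('user_id').agg(n_events=('ts', 'count'),
                                      total_amount=('amount', 'sum'))
    X = users.join(agg)                                      # assemble: static + behavioral
    X[['n_events', 'total_amount']] = X[['n_events', 'total_amount']].fillna(0)
    X['age'] = X['age'].fillna(X['age'].median())            # clean numerics
    X = pd.get_dummies(X, columns=['city'], dtype=int)       # encode categoricals
    assert X.isna().sum().sum() == 0                         # validate
    return X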
Tips that save hours
- Use pd.get_dummies(..., dtype=int) to get compact 0/1 integer one-hots.
- Reindex train/valid/test to the same columns, filling missing columns with 0 (see the sketch after this list).
- For time data, slice transactions up to the row's cutoff timestamp before aggregating.
- Prefer vectorized operations and groupby over Python loops.
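A minimal sketch of the reindex tip, assuming X_train, X_valid, and X_test already exist:
feature_columns = list(X_train.columns)  # freeze the training schema
X_valid = X_valid.reindex(columns=feature_columns, fill_value=0)  # add missing as 0, drop extras
X_test = X_test.reindex(columns=feature_columns, fill_value=0)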
Worked examples
Example 1 — Basic tabular features
import pandas as pd
import numpy as np
users = pd.DataFrame({
    'age': [25, 30, np.nan, 22],
    'income': [50000, 64000, 58000, np.nan],
    'city': ['SF', 'NY', 'SF', 'LA'],
    'signup_date': pd.to_datetime(['2023-01-05', '2022-11-10', '2023-03-22', '2023-03-01']),
    'churned': [0, 1, 0, 1],
}, index=[101, 102, 103, 104])
# Target
y = users['churned']
# Features
X = users.drop(columns=['churned']).copy()
X['age'] = X['age'].fillna(X['age'].median())          # median imputation
X['income'] = X['income'].fillna(X['income'].median())
X['signup_year'] = X['signup_date'].dt.year            # numeric date part
X = X.drop(columns=['signup_date'])                    # drop the raw datetime column
X = pd.get_dummies(X, columns=['city'], dtype=int)     # one-hot encode city
# Final checks
assert X.isna().sum().sum() == 0
assert X.index.equals(y.index)
print(X)
Takeaways: impute numerics, derive date parts, one-hot categoricals, keep index aligned.
Example 2 — Aggregating transactions to customer level
tx = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 3],
    'amount': [20, 35, 15, 50, 10, 5],
    'category': ['A', 'B', 'A', 'A', 'C', 'B'],
    'ts': pd.to_datetime(['2023-03-01', '2023-03-05', '2023-03-07', '2023-03-02', '2023-03-09', '2023-03-08']),
})
asof = pd.Timestamp('2023-03-08')
# Slice to avoid leakage
past = tx[tx['ts'] <= asof]
# 7-day window features
win = past[past['ts'] >= asof - pd.Timedelta(days=7)]
last7 = win.groupby('user_id').agg(total_spend_7d=('amount', 'sum'),
                                   n_tx_7d=('amount', 'count'))
# Lifetime features (up to asof)
lifetime = past.groupby('user_id').agg(avg_ticket_lifetime=('amount','mean'))
# Category spend pivot
cat = past.pivot_table(index='user_id', columns='category', values='amount', aggfunc='sum', fill_value=0)
cat.columns = [f"spend_cat_{c}" for c in cat.columns]
# Assemble X
X = last7.join(lifetime, how='outer').join(cat, how='outer').fillna(0)
print(X.sort_index())
Takeaways: filter by as-of time first, then aggregate; fill missing users/categories with zeros.
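If the targets live in a separate table, align y to X by index before modeling. A minimal sketch, assuming a hypothetical labels Series keyed by user_id:
labels = pd.Series({1: 0, 2: 1, 3: 0}, name='churned')  # hypothetical per-user targets
X = X.sort_index()
y = labels.loc[X.index]          # reorder y to match X's rows exactly
assert X.index.equals(y.index)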
Example 3 — Simple text/date-derived features
df = pd.DataFrame({
    'id': [1, 2, 3],
    'bio': ['Loves hiking and data', 'Coffee enthusiast', ''],
    'last_active': pd.to_datetime(['2023-03-10', '2023-03-01', '2023-02-25']),
}).set_index('id')
asof = pd.Timestamp('2023-03-10')
X = pd.DataFrame(index=df.index)                  # start from the target grain's index
X['bio_len'] = df['bio'].str.len().fillna(0)      # characters in bio
X['bio_word_count'] = df['bio'].str.split().str.len().fillna(0).astype(int)  # tokens in bio
X['days_since_active'] = (asof - df['last_active']).dt.days.clip(lower=0)    # recency, floored at 0
print(X)
Takeaways: derive numeric features from text length and dates; avoid raw text in X unless vectorized separately.
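If you do want token-level features rather than just lengths, vectorize the text in a separate step and join the result back. A minimal sketch using scikit-learn's CountVectorizer (the choice of vectorizer is an assumption; fit it on training rows only so the column schema stays stable):
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
bow = vec.fit_transform(df['bio'].fillna(''))    # sparse matrix, one column per token
bow_df = pd.DataFrame(bow.toarray(), index=df.index,
                      columns=[f'bio_tok_{t}' for t in vec.get_feature_names_out()])
X = X.join(bow_df)                               # numeric token counts, index-aligned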
Exercises
These mirror the graded exercises below. Work in your own notebook, then compare your answers using the solution revealers.
- Checklist before you run models (codified as asserts right after this list):
  - No NaNs in X
  - No object dtype columns in X
  - X.index equals y.index
  - Columns are consistent across splits
  - No future info used for any row
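One way to codify the checklist; validate_X is a hypothetical helper, and the no-future-info item still needs a human review of your as-of logic:
def validate_X(X, y, expected_columns):
    assert X.isna().sum().sum() == 0, 'NaNs in X'
    assert len(X.select_dtypes(include='object').columns) == 0, 'object dtype in X'
    assert X.index.equals(y.index), 'X and y indices differ'
    assert list(X.columns) == list(expected_columns), 'column schema drifted'
    # No automatic check for future leakage: audit the as-of slicing by hand.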
Exercise 1 — Build a clean X from mixed types
Use the small dataset to create X with: imputed numerics, signup_year, and one-hot city.
import pandas as pd
import numpy as np
users = pd.DataFrame({
    'age': [25, 30, np.nan, 22],
    'income': [50000, 64000, 58000, np.nan],
    'city': ['SF', 'NY', 'SF', 'LA'],
    'signup_date': pd.to_datetime(['2023-01-05', '2022-11-10', '2023-03-22', '2023-03-01']),
    'churned': [0, 1, 0, 1],
}, index=[101, 102, 103, 104])
# Task: produce X with columns [age, income, signup_year, city_LA, city_NY, city_SF]
# with missing imputed (median for numerics), and no NaNs.
Exercise 2 — Aggregate transactions with a 7-day window
Create customer features as of 2023-03-08: total_spend_7d, n_tx_7d, avg_ticket_lifetime, and spend per category A/B/C.
import pandas as pd
tx = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 3],
    'amount': [20, 35, 15, 50, 10, 5],
    'category': ['A', 'B', 'A', 'A', 'C', 'B'],
    'ts': pd.to_datetime(['2023-03-01', '2023-03-05', '2023-03-07', '2023-03-02', '2023-03-09', '2023-03-08']),
})
# Task: filter by as-of date, build the requested features, return a DataFrame indexed by user_id.
Common mistakes and self-check
Leakage from the future
Fix: define an as-of timestamp per row; slice data to ts <= asof BEFORE aggregations.
Mismatched indices between X and y
Fix: align via X = X.loc[y.index]; assert X.index.equals(y.index).
Object dtype sneaking into X
Fix: one-hot or map to numbers; check X.select_dtypes(include='object').columns is empty.
Inconsistent columns across splits
Fix: keep a master list of columns; reindex each split to it, filling missing columns with 0 and dropping extras.
Practical projects
- Churn baseline: customer features from usage logs (counts, recency, distinct actions) and a simple logistic regression.
- Basket value prediction: aggregate last-14-day spend, item diversity, and category one-hots; predict next order value.
- Fraud signals: rolling sums, velocity features, and cross-features (amount × hour-of-day; sketched below) for a gradient boosting model.
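For the cross-feature idea, a minimal sketch, assuming a tx frame with amount and ts columns:
tx['hour'] = tx['ts'].dt.hour                    # hour of day, 0-23
tx['amount_x_hour'] = tx['amount'] * tx['hour']  # simple multiplicative cross-feature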
Mini challenge
You have sessions with columns [user_id, duration_sec, device, ts]. Build one row per user as of a given date with: total_duration_7d, sessions_7d, avg_duration_lifetime, and device one-hots (for devices seen up to the as-of date). Ensure there are no NaNs and that columns stay consistent even if a device category is missing.
Learning path
- Before: Data cleaning, joins, and indexing in pandas.
- Now: Feature Matrix Construction (this lesson).
- Next: Train/validation splits and evaluation; then feature scaling/selection; finally, modeling pipelines.
Next steps
- Complete the exercises and reveal the solutions to compare.
- Open the Quick Test at the end of this page to check mastery. The test is available to everyone; only logged-in users get saved progress.
- Package your transformations as reusable functions for repeatability.