What is Feature Engineering for Data Scientists?
Feature engineering is the practice of transforming raw data into informative inputs that make models simpler, faster, and more accurate. As a Data Scientist, you use it to convert messy, heterogeneous data (categorical, text, timestamps, numeric) into model-ready features—while avoiding leakage and keeping your pipeline reproducible.
What tasks does it unlock in a Data Scientist role?
- Build robust baselines quickly with standard scaling/encoding.
- Boost model signal using interactions, time-based lags, and rolling statistics.
- Handle outliers and skew for stable training.
- Create entity-level aggregates (per user, store, device) for personalized predictions.
- Prevent leakage to protect real-world performance.
- Reduce noise with feature selection and keep pipelines maintainable.
Who this is for
- Data Scientists and ML practitioners who build supervised or unsupervised models.
- Analysts moving into ML and wanting reliable, reproducible pipelines.
- Engineers who support model training and need clear, leak-free features.
Prerequisites
- Comfort with Python and pandas (joins, groupby, datetime operations).
- Basic scikit-learn (estimators, cross-validation, pipelines).
- Understanding of train/validation/test splits; for time series, time-aware splits.
Learning path (practical roadmap)
- Solidify the basics: clean splits, baseline model, simple encoders/scalers.
  Mini tasks:
  - Create a train/validation/test split that reflects production timing.
  - Fit a baseline LogisticRegression with only numeric features.
  - Record baseline metrics and latency.
- Categorical and numeric transforms: one-hot, ordinal, target/frequency encoding; standard vs robust scaling.
  Mini tasks:
  - Try OneHotEncoder with handle_unknown="ignore".
  - Compare StandardScaler vs RobustScaler on a skewed feature.
  - Implement target encoding with proper CV to avoid leakage.
- Interactions and crosses: polynomial terms, business-logic crosses (e.g., price_per_unit = price/quantity).
- Time-based features: lags, rolling windows, recency, seasonality flags; use only past information.
- Text features (TF-IDF): tokenize, n-grams, max_features; combine with numeric pipelines.
- Aggregations by entity: per-user/store/device stats for personalization.
- Outliers and robustness: trimming, winsorizing, robust scalers, transformations (log/Box-Cox).
- Feature selection: filter (corr, MI), wrapper (RFE), embedded (L1, tree importances).
- Leakage prevention and audit: feature timelines, separation of train/validation flows, cross-checks.
- Packaging: scikit-learn Pipeline + ColumnTransformer; version features and document assumptions.
Worked examples (ready-to-run patterns)
1) Encoding categoricals and scaling numerics with a single pipeline
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
# df: columns ["age","income","country","device","target"]
X = df.drop(columns=["target"])
y = df["target"]
num_cols = ["age","income"]
cat_cols = ["country","device"]
pre = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
pipe = Pipeline([
    ("pre", pre),
    ("clf", LogisticRegression(max_iter=1000)),
])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
pipe.fit(X_tr, y_tr)
print("Test AUC:", pipe.score(X_te, y_te))
Why this works
ColumnTransformer keeps all preprocessing inside the pipeline, so scalers and encoders are fitted only on the training data passed to fit; nothing is learned from validation or test rows.
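To sanity-check this, cross-validate the entire pipeline so each fold refits the scaler and encoder on its own training portion. A minimal sketch, reusing pipe, X, and y from above:
from sklearn.model_selection import cross_val_score
# Each fold refits StandardScaler/OneHotEncoder on that fold's training rows only.
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print(f"CV AUC: {scores.mean():.3f} +/- {scores.std():.3f}")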
2) Target encoding (with CV to avoid leakage)
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
# Single column target encoding using out-of-fold mean
cat = "country"
K = 5
kf = KFold(n_splits=K, shuffle=True, random_state=42)
train_te = pd.Series(index=df.index, dtype=float)
for tr_idx, val_idx in kf.split(df):
    tr_mean = df.iloc[tr_idx].groupby(cat)["target"].mean()
    # .to_numpy() avoids index alignment surprises when assigning by position
    train_te.iloc[val_idx] = df.iloc[val_idx][cat].map(tr_mean).to_numpy()
# For unseen categories, use global mean
global_mean = df["target"].mean()
train_te = train_te.fillna(global_mean)
df[cat+"_te"] = train_te
Key rule
Never compute encodings using the full target column. Use out-of-fold means or compute encodings only from the training fold in each split.
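To score held-out or production data, map the encoding learned from training rows only. A minimal sketch, assuming a df_train/df_test split with the same country and target columns:
# Encode the held-out frame with means computed on training rows only;
# unseen categories fall back to the training global mean.
train_means = df_train.groupby(cat)["target"].mean()
df_test[cat + "_te"] = df_test[cat].map(train_means).fillna(df_train["target"].mean())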
3) Useful feature crosses and engineered ratios
df["price_per_unit"] = df["price"] / (df["quantity"].replace(0, np.nan))
df["price_per_unit"] = df["price_per_unit"].fillna(df["price"]) # safe fallback
# Polynomial interactions for selected numeric columns
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(df[["age","income"]])
When to use
Use business-meaningful crosses (ratios, differences). Polynomial features can help linear models but may overfit; regularize and select carefully.
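One way to keep polynomial terms in check is to generate them inside a pipeline and lean on regularization. A sketch reusing X and y from worked example 1 (the C value is illustrative):
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
# Degree-2 terms are created per CV fold; stronger regularization (smaller C)
# counteracts the extra capacity they add.
poly_pipe = Pipeline([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(C=0.5, max_iter=1000)),
])
print(cross_val_score(poly_pipe, X[["age", "income"]], y, cv=5, scoring="roc_auc").mean())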
4) Time-based features: lags and rolling windows
# df sorted by ["entity_id","timestamp"] with a target like next_purchase
import pandas as pd
def add_lag_roll(g, col, lags=(1, 7), windows=(7, 28)):
    g = g.copy()
    for L in lags:
        g[f"{col}_lag{L}"] = g[col].shift(L)
    for W in windows:
        g[f"{col}_rollmean{W}"] = g[col].shift(1).rolling(W).mean()
        g[f"{col}_rollstd{W}"] = g[col].shift(1).rolling(W).std()
    return g
features = df.sort_values(["entity_id", "timestamp"]).groupby("entity_id", group_keys=False).apply(
    lambda g: add_lag_roll(g, col="sales")
)
Leakage guard
Shift before rolling and never peek at the current row when computing aggregates.
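Recency is another past-only feature worth adding here; a short sketch, assuming timestamp is a datetime column:
# Days since the previous event for each entity; the first row per entity gets NaN.
df = df.sort_values(["entity_id", "timestamp"])
df["days_since_prev"] = df.groupby("entity_id")["timestamp"].diff().dt.days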
5) Text TF-IDF basics (with dimensionality control)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
text_pipe = Pipeline([
    ("tfidf", TfidfVectorizer(
        ngram_range=(1, 2),
        max_features=20000,
        min_df=3,
    )),
    ("clf", LogisticRegression(max_iter=1000)),
])
# train_texts / val_texts: iterables of raw strings; labels are the matching targets
text_pipe.fit(train_texts, train_labels)
print(text_pipe.score(val_texts, val_labels))
Tips
- Adjust max_features to control memory and overfitting.
- Use n-grams to capture short phrases; start with (1,2).
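To combine text with numeric features in one model, route the text column through TF-IDF inside a ColumnTransformer. A sketch assuming a DataFrame with a notes_text column plus the numeric columns used earlier:
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
# TfidfVectorizer needs a 1-D array of strings, so the text column is passed by
# name (not wrapped in a list); "notes_text", "age", "income" are assumed columns.
pre_mixed = ColumnTransformer([
    ("text", TfidfVectorizer(ngram_range=(1, 2), max_features=20000, min_df=3), "notes_text"),
    ("num", StandardScaler(), ["age", "income"]),
])
mixed_pipe = Pipeline([("pre", pre_mixed), ("clf", LogisticRegression(max_iter=1000))])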
6) Aggregation features by entity (personalization)
# Simple per-customer lifetime stats; see the leakage note and sketch below
# for a strictly historical, per-row version
hist = df.sort_values(["customer_id", "date"])
agg = hist.groupby("customer_id").agg({
    "amount": ["mean", "sum", "count"],
    "discount": ["mean"],
})
agg.columns = ["_".join(c) for c in agg.columns]
features = hist.join(agg, on="customer_id", how="left", rsuffix="_cust")
Avoid future info
When training per-timestamp, compute aggregates up to the current date only. Pre-compute rolling aggregates within each entity group.
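A strictly historical variant shifts an expanding window by one row, so each row only sees that customer's earlier purchases. A sketch reusing hist from above:
# Expanding stats shifted by one row: the current purchase never contributes
# to its own feature values.
grp = hist.groupby("customer_id")["amount"]
hist["amount_hist_mean"] = grp.transform(lambda s: s.shift(1).expanding().mean())
hist["amount_hist_count"] = grp.cumcount()  # number of prior purchases for this customer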
Drills and exercises
- Build a pipeline that handles both numeric (scaled) and categorical (one-hot) features and reproduces a baseline metric.
- Implement out-of-fold target encoding for one categorical column and compare to one-hot.
- Create at least two ratio features and one interaction term; evaluate with regularization.
- For a time series, add lag-1, lag-7, and 7-day rolling mean features without leakage.
- Vectorize text with TF-IDF (1–2 grams) and cap at 20k features; measure memory and training time.
- Run a simple feature selection (L1 or tree importances) and remove 10% least useful features; note metric change.
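For the selection drill, a minimal sketch using L1-based SelectFromModel on top of the preprocessing from worked example 1 (pre, X_tr, y_tr); the threshold is an assumption to tune against your validation metric:
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
# An L1-penalized model scores features after preprocessing; threshold controls
# how aggressively low-weight features are dropped.
select_pipe = Pipeline([
    ("pre", pre),
    ("select", SelectFromModel(
        LogisticRegression(penalty="l1", solver="liblinear", C=0.5),
        threshold="median",
    )),
    ("clf", LogisticRegression(max_iter=1000)),
])
select_pipe.fit(X_tr, y_tr)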
Common mistakes and debugging tips
- Data leakage via encodings or aggregates: Always compute statistics (means, encodings, scalers) on training folds only. For time series, use past-only windows.
Debug
Re-run with a strict time-aware split. If validation performance drops sharply vs random CV, you likely had leakage.
- Scalers fitted outside the split: Fit scalers (and other transformers) inside the pipeline so they never see validation data.
- Exploding cardinality with one-hot: Cap rare categories (e.g., replace counts < 10 with "OTHER"; see the sketch after this list). Consider target/frequency encoding with CV.
- Outliers breaking the model: Use RobustScaler or log-transform skewed features. Winsorize extreme tails carefully.
- Too many interactions: Start small; monitor validation metrics and training time. Use L1 or feature importance to prune.
- Ignoring text sparsity: Limit TF-IDF features and tune min_df. Use linear models that handle sparse matrices well.
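The rare-category capping mentioned above can be as simple as the sketch below, assuming a single categorical column and the X_tr/X_te split from worked example 1; newer scikit-learn versions (1.1+) can also do this inside OneHotEncoder via min_frequency.
# Categories seen fewer than 10 times in the training data collapse to "OTHER";
# the same mapping is reused for the test split.
counts = X_tr["country"].value_counts()
rare = counts[counts < 10].index
X_tr["country"] = X_tr["country"].where(~X_tr["country"].isin(rare), "OTHER")
X_te["country"] = X_te["country"].where(~X_te["country"].isin(rare), "OTHER")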
Mini project: Personalized purchase prediction
Goal: Predict whether a customer will make a purchase next week using transactional, categorical, text (optional), and time features.
Data sketch (you can use your own CSV or generate synthetic)
Columns: [customer_id, date, amount, discount, store, device, notes_text (optional), target_next_week]
- Baseline: Split by time (train: older dates; validation: newer); see the split sketch after this list. Fit a LogisticRegression on numeric features only.
- Add categorical encodings: One-hot store and device. Compare to target/frequency encoding (CV-based).
- Time features: For each customer, compute lag-1 amount, 7-day rolling mean/std, and recency (days since last purchase).
- Aggregations by entity: Per-customer lifetime mean, sum, and purchase count to date (no future info).
- Robustness: Apply RobustScaler to skewed monetary features; winsorize 1st/99th percentiles if needed.
- (Optional) Text: Use TF-IDF on notes_text with max_features=10k and add to the model.
- Selection: Use L1-regularized LogisticRegression to prune weak features.
- Report: AUC/PR, calibration, and top-10 most influential features (by absolute coef for linear models or tree importances).
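A minimal sketch of the time-based baseline split; the four-week cutoff and the column names are assumptions to adapt to your data:
import pandas as pd
from sklearn.linear_model import LogisticRegression
# Train on older dates, validate on the most recent four weeks (date is assumed datetime).
cutoff = df["date"].max() - pd.Timedelta(weeks=4)
train_df = df[df["date"] <= cutoff]
valid_df = df[df["date"] > cutoff]
num_cols = ["amount", "discount"]  # numeric-only baseline
baseline = LogisticRegression(max_iter=1000).fit(train_df[num_cols], train_df["target_next_week"])
print("Validation accuracy:", baseline.score(valid_df[num_cols], valid_df["target_next_week"]))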
Deliverables checklist
- Reproducible Pipeline/ColumnTransformer code.
- Strict time-based split and leakage checks.
- Comparison table: baseline vs engineered features.
- Short write-up: 5–10 bullet insights.
Practical projects
- Customer churn prediction with entity aggregates and text-based support notes.
- House price modeling using interactions and robust scaling for skewed prices.
- News/topic classification with TF-IDF and feature selection for speed.
- Sensor fault detection using lag/rolling statistics and outlier-robust transforms.
Subskills
- Encoding Categorical Variables: One-hot, ordinal, target/frequency encoding with CV to avoid leakage.
- Scaling And Normalization: Standard, MinMax, Robust scaling; log transforms for skew.
- Feature Crosses And Interactions: Ratios, differences, domain-driven crosses, and polynomial terms.
- Text Features TF-IDF Basics: Tokenization, n-grams, max_features, handling sparsity.
- Time-Based Features (Lags and Rolling Windows): Shifted lags, rolling means/std, recency and seasonality flags.
- Aggregation Features By Entity: Per-entity counts, means, sums; historical-only computation.
- Handling Outliers And Robust Features: Winsorizing, trimming, robust scalers, transformations.
- Feature Selection Basics: Filter, wrapper, embedded methods; interpretability and speed gains.
- Leakage Prevention: Split discipline, timeline validation, out-of-fold stats, audit checks.
Next steps
- Package your preprocessing into a single scikit-learn Pipeline for deployment.
- Evaluate stability: retrain with different time windows and seeds; check metric variance.
- Explore interpretability (permutation importance, SHAP) to validate feature logic; a short permutation-importance sketch follows this list.
- Move to the next skill in your path after you pass the exam below.
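Permutation importance works directly on a fitted pipeline; a sketch reusing pipe, X_te, and y_te from worked example 1 (scoring and n_repeats are illustrative choices):
from sklearn.inspection import permutation_importance
# Shuffle one input column at a time and measure the AUC drop on held-out data.
result = permutation_importance(pipe, X_te, y_te, scoring="roc_auc", n_repeats=10, random_state=42)
for name, imp in sorted(zip(X_te.columns, result.importances_mean), key=lambda t: -t[1]):
    print(f"{name}: {imp:.4f}")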