Who this is for
You build ML models with mixed data types and want clean, repeatable preprocessing that avoids data leakage and works end-to-end in training, validation, and production.
Prerequisites
- Python basics (functions, lists, dicts)
- Familiarity with NumPy/Pandas data structures
- Basic ML concepts (fit/transform, train/test split, classification/regression)
Learning path
- Understand the problem: why preprocessing must be consistent.
- Learn Pipeline and ColumnTransformer mental models.
- Build a basic numeric + categorical pipeline.
- Extend to text, dates, custom transformers.
- Integrate with cross-validation and grid search safely.
- Package and evaluate end-to-end.
Why this matters
- Prevents data leakage: preprocessing is fitted inside each CV fold, on the training portion only.
- Reproducibility: one object represents the full workflow from raw columns to predictions.
- Production readiness: the same steps used in training are applied to new data without manual hacks.
- Speed: iterate safely with grid search without rewriting preprocessing code.
Real tasks you will do as an ML Engineer
- Train a churn model mixing numeric usage metrics and categorical plan types.
- Score leads using text notes + structured CRM fields.
- Ship a fraud model where the same preprocessing must run in batch and online.
Concept explained simply
A Pipeline chains steps: transformers first (imputation, scaling, encoding), then an estimator (model). A ColumnTransformer applies different transformers to different column subsets at once. Together, they let you define “how to turn raw columns into model-ready features” in one place.
Mental model
Imagine a factory conveyor belt. Each station changes the product. A ColumnTransformer is a switchboard that routes different columns to different stations (e.g., numbers to a scaler, categories to an encoder). The final station is your model.
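A two-station version of that conveyor belt, as a minimal sketch (X, y, and X_new stand in for your data):
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
pipe = Pipeline([('impute', SimpleImputer()), ('model', LogisticRegression())])
# pipe.fit(X, y)       -> fits the imputer on X, transforms X, then fits the model
# pipe.predict(X_new)  -> reuses the fitted imputer (transform only), then predicts
The asymmetry is the whole point: each station learns its statistics once, during fit, and only replays them at prediction time.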
Key building blocks
- Pipeline(steps=[('name', transformer_or_estimator), ...])
- ColumnTransformer(transformers=[('name', transformer, [cols])], remainder='drop' or 'passthrough')
- Common transformers: SimpleImputer, StandardScaler, OneHotEncoder, FunctionTransformer, TfidfVectorizer
- Safety options: OneHotEncoder(handle_unknown='ignore'), ColumnTransformer(remainder='passthrough')
- Feature names: pipeline.get_feature_names_out() after fit (scikit-learn 1.0+ compatible transformers)
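A tiny sketch of how the pieces name their outputs (toy frame invented here; note the remainder__ prefix on passthrough columns):
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
df = pd.DataFrame({'age': [30, 40], 'income': [50.0, 60.0], 'city': ['a', 'b']})
ct = ColumnTransformer([('num', StandardScaler(), ['age', 'income'])],
                       remainder='passthrough')
ct.fit(df)
print(ct.get_feature_names_out())
# ['num__age' 'num__income' 'remainder__city']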
Worked examples
Example 1: Mixed numeric + categorical for classification
Goal: Predict churn using age (float, has missing), income (float), city (category), is_member (bool).
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
num_cols = ['age', 'income']
cat_cols = ['city', 'is_member']
numeric_pre = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])
categorical_pre = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore'))
])
preprocess = ColumnTransformer(
    transformers=[
        ('num', numeric_pre, num_cols),
        ('cat', categorical_pre, cat_cols)
    ]
)
clf = Pipeline(steps=[
    ('preprocess', preprocess),
    ('model', LogisticRegression(max_iter=1000))
])
# clf.fit(X_train, y_train); clf.predict(X_test)
- Why it works: Each column subset gets its own transformations, and nothing leaks as long as the whole pipeline is fitted inside each CV fold.
- Tip: If LogisticRegression warns about convergence, raise max_iter or try solver='liblinear'; use class_weight='balanced' for imbalanced classes.
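A quick smoke test for the pipeline above, on a toy frame invented here (the NaN in age exercises the imputer):
import numpy as np
import pandas as pd
X = pd.DataFrame({
    'age': [34.0, np.nan, 52.0, 23.0],
    'income': [48000.0, 61000.0, 75000.0, 39000.0],
    'city': ['austin', 'boston', 'austin', 'denver'],
    'is_member': [True, False, True, False],
})
y = pd.Series([0, 1, 0, 1])
clf.fit(X, y)
print(clf.predict(X))  # predictions for the four toy rows; the point is it runs end-to-end
print(clf.named_steps['preprocess'].get_feature_names_out())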
Example 2: Text + numeric features
Goal: Predict sentiment using review text and review_length (numeric).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
text_col = 'review'
num_cols = ['review_length']
text_pre = TfidfVectorizer(ngram_range=(1,2), min_df=2)
num_pre = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
preprocess = ColumnTransformer([
    ('txt', text_pre, text_col),
    ('num', num_pre, num_cols)
], remainder='drop')
clf = Pipeline([
    ('preprocess', preprocess),
    ('model', LinearSVC())
])
# Works with cross_val_score(clf, X, y, cv=5)
Note: TfidfVectorizer consumes raw text, so the text column is passed as a string ('review') rather than a list; ColumnTransformer then hands the vectorizer the 1D column of strings it expects.
Example 3: Custom transform + grid search
Goal: Log-transform skewed income and tune model hyperparameters.
import numpy as np
from sklearn.preprocessing import FunctionTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
log1p = FunctionTransformer(np.log1p, feature_names_out='one-to-one')  # np.log1p directly (not a lambda) keeps the pipeline picklable
num_cols = ['age', 'income']
cat_cols = ['city']
numeric_pre = Pipeline([
    ('imputer', SimpleImputer()),
    # After SimpleImputer the data is a plain array with no column names,
    # so the inner ColumnTransformer must select by position: per num_cols,
    # column 0 is age and column 1 is income.
    ('log_income', ColumnTransformer([
        ('log', log1p, [1]),
        ('pass_age', 'passthrough', [0])
    ], remainder='drop')),
    ('scaler', StandardScaler())
])
categorical_pre = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore'))
])
preprocess = ColumnTransformer([
    ('num', numeric_pre, num_cols),
    ('cat', categorical_pre, cat_cols)
])
pipe = Pipeline([
    ('preprocess', preprocess),
    ('model', RandomForestClassifier(random_state=0))
])
params = {
    'model__n_estimators': [100, 300],
    'model__max_depth': [None, 10]
}
gs = GridSearchCV(pipe, params, cv=5, n_jobs=-1)
# gs.fit(X, y); gs.best_params_
The FunctionTransformer gives the log transform a named, inspectable place inside the pipeline. Grid search tunes only the estimator here, but preprocessing parameters are reachable through the same double-underscore syntax, as sketched below.
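For instance, a grid can reach into the preprocessing through the step__param path (names below follow this example's step names):
params = {
    'model__n_estimators': [100, 300],
    'model__max_depth': [None, 10],
    'preprocess__num__imputer__strategy': ['mean', 'median'],  # tune the numeric imputer too
}
gs = GridSearchCV(pipe, params, cv=5, n_jobs=-1)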
Common mistakes and self-check
- Mistake: Scaling/encoding before splitting data. Self-check: Ensure all preprocessing lives inside Pipeline; you call fit on the pipeline, not on raw transformers outside CV.
- Mistake: OneHotEncoder without handle_unknown='ignore' causing errors on new categories. Self-check: Try a small holdout with unseen categories and confirm prediction runs.
- Mistake: Column order mismatches when manually concatenating arrays. Self-check: Use ColumnTransformer; after fit, verify get_feature_names_out() shape and order.
- Mistake: Using fit_transform on test data. Self-check: Only call transform on validation/test splits; the Pipeline handles this if you only call fit on training data.
- Mistake: Losing columns because remainder='drop' is ColumnTransformer's default, so unlisted columns silently disappear. Self-check: Set remainder='passthrough' deliberately when untransformed columns should survive, and verify the feature count after fit.
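To guard against the first mistake in practice, keep every fitted step inside the pipeline and let the CV utility do the splitting. A minimal sketch, reusing clf from Example 1 (X and y are the raw, unpreprocessed data):
from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf, X, y, cv=5)
# Each fold re-fits the imputer, scaler, and encoder on that fold's training
# rows only, so no test-fold statistics leak into preprocessing.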
Exercises
Do these hands-on tasks.
Exercise 1: Mixed-type preprocessing pipeline
Create a pipeline that imputes and scales numeric columns and one-hot encodes categoricals, then trains LogisticRegression.
- Numeric: age (missing values), income
- Categorical: city, is_member
Starter code idea
# Build numeric_pre, categorical_pre, preprocess, then clf = Pipeline([...])
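One possible skeleton (median imputation here is a choice, not a requirement):
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
numeric_pre = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
categorical_pre = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore'))
])
preprocess = ColumnTransformer([
    ('num', numeric_pre, ['age', 'income']),
    ('cat', categorical_pre, ['city', 'is_member'])
])
clf = Pipeline([
    ('preprocess', preprocess),
    ('model', LogisticRegression(max_iter=1000))
])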
Exercise 2: Remove leakage with a Pipeline
Refactor code that currently imputes and scales the full dataset before train/test split. Move all steps into a Pipeline and evaluate with cross_val_score.
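For concreteness, the leaky pattern looks like this (X and y are assumed to be a numeric DataFrame and a target Series):
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# LEAKY: the imputer and scaler see every row, including future test rows,
# before the split happens.
X_prep = StandardScaler().fit_transform(SimpleImputer().fit_transform(X))
X_train, X_test, y_train, y_test = train_test_split(X_prep, y, random_state=0)
Your refactor should move both steps into a Pipeline so cross_val_score fits them per fold.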
Exercise 3: Text + numeric ColumnTransformer
Build a pipeline that vectorizes a text column with TfidfVectorizer and scales a numeric length column, then trains LinearSVC.
Practical projects
- Customer churn scoring: Combine demographics (numeric), plan info (categorical), and last ticket text (short text) into one pipeline and evaluate with ROC AUC.
- Credit risk baseline: Impute, scale, and encode application features; add a custom transformer for log-transforming skewed amounts; tune model with grid search.
- Support triage: TfidfVectorizer for issue description + categorical priority + numeric time-open; train a classifier and export the fitted pipeline for inference.
Mini challenge
Design a single Pipeline to predict whether a rider will reorder:
- Numeric: trips_last_30d (skewed), avg_basket_value (missing)
- Categorical: city_tier, device_type (may have unseen values)
- Text: last_feedback (short free text)
Requirements:
- Impute numeric, log-transform trips_last_30d, scale numeric
- One-hot encode categoricals safely
- TF–IDF for text
- Evaluate with 5-fold CV without leakage
Hint
Use a ColumnTransformer with three branches (numeric, categorical, text) inside a Pipeline feeding your estimator. Verify with get_feature_names_out() after fit.
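One possible scaffold (the numeric branch is split in two so the log applies only to trips_last_30d; LogisticRegression is a placeholder estimator):
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
trips_pre = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('log', FunctionTransformer(np.log1p, feature_names_out='one-to-one')),
    ('scaler', StandardScaler())
])
basket_pre = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
cat_pre = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore'))
])
preprocess = ColumnTransformer([
    ('trips', trips_pre, ['trips_last_30d']),
    ('basket', basket_pre, ['avg_basket_value']),
    ('cat', cat_pre, ['city_tier', 'device_type']),
    ('txt', TfidfVectorizer(min_df=2), 'last_feedback')  # string selector for the raw text column
])
pipe = Pipeline([('preprocess', preprocess), ('model', LogisticRegression(max_iter=1000))])
# cross_val_score(pipe, X, y, cv=5) satisfies the no-leakage requirement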
Next steps
- Add feature selection (e.g., SelectKBest) as a Pipeline step and tune k.
- Use FunctionTransformer to convert dates into cyclical encodings (month, weekday) and include it in the ColumnTransformer (sketched below).
- Export and load the fitted pipeline with joblib for consistent production scoring.
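A sketch of the cyclical-date idea (signup_date is a hypothetical column name; weekday works the same way with period 7):
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer
def month_sin_cos(X):
    # X arrives as a one-column frame/array of dates; return sin/cos of the month
    m = pd.to_datetime(pd.DataFrame(X).iloc[:, 0]).dt.month.to_numpy()
    return np.c_[np.sin(2 * np.pi * m / 12), np.cos(2 * np.pi * m / 12)]
cyclical_month = FunctionTransformer(month_sin_cos)
# Include as ('dates', cyclical_month, ['signup_date']) in your ColumnTransformer.
And exporting the fitted pipeline is two lines:
import joblib
joblib.dump(clf, 'churn_pipeline.joblib')          # after clf.fit(...)
clf_loaded = joblib.load('churn_pipeline.joblib')  # identical transforms at scoring time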