Preprocessing Pipelines And Column Transformers

Learn Preprocessing Pipelines And Column Transformers for free with explanations, exercises, and a quick test (for Machine Learning Engineers).

Published: January 1, 2026 | Updated: January 1, 2026

Who this is for

You build ML models with mixed data types and want clean, repeatable preprocessing that avoids data leakage and works end-to-end in training, validation, and production.

Prerequisites

  • Python basics (functions, lists, dicts)
  • Familiarity with NumPy/Pandas data structures
  • Basic ML concepts (fit/transform, train/test split, classification/regression)

Learning path

  1. Understand the problem: why preprocessing must be consistent.
  2. Learn Pipeline and ColumnTransformer mental models.
  3. Build a basic numeric + categorical pipeline.
  4. Extend to text, dates, custom transformers.
  5. Integrate with cross-validation and grid search safely.
  6. Package and evaluate end-to-end.

Why this matters

  • Prevents data leakage: imputation, scaling, and encoding statistics are learned only from the training portion of each CV fold, never from validation or test rows.
  • Reproducibility: one object represents the full workflow from raw columns to predictions.
  • Production readiness: the same steps used in training are applied to new data without manual hacks.
  • Speed: iterate safely with grid search without rewriting preprocessing code.

Real tasks you will do as an ML Engineer

  • Train a churn model mixing numeric usage metrics and categorical plan types.
  • Score leads using text notes + structured CRM fields.
  • Ship a fraud model where the same preprocessing must run in batch and online.

Concept explained simply

A Pipeline chains steps: transformers first (imputation, scaling, encoding), then an estimator (model). A ColumnTransformer applies different transformers to different column subsets at once. Together, they let you define “how to turn raw columns into model-ready features” in one place.

Mental model

Imagine a factory conveyor belt. Each station changes the product. A ColumnTransformer is a switchboard that routes different columns to different stations (e.g., numbers to a scaler, categories to an encoder). The final station is your model.

Key building blocks

  • Pipeline(steps=[('name', transformer_or_estimator), ...])
  • ColumnTransformer(transformers=[('name', transformer, [cols])], remainder='drop' or 'passthrough')
  • Common transformers: SimpleImputer, StandardScaler, OneHotEncoder, FunctionTransformer, TfidfVectorizer
  • Safety options: OneHotEncoder(handle_unknown='ignore'), ColumnTransformer(remainder='passthrough')
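  • Feature names: after fit, call get_feature_names_out() on the preprocessing part, e.g. pipeline[:-1].get_feature_names_out(); the final estimator does not provide feature names (scikit-learn 1.0+ with compatible transformers)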

Worked examples

Example 1: Mixed numeric + categorical for classification

Goal: Predict churn using age (float, has missing), income (float), city (category), is_member (bool).

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression

num_cols = ['age', 'income']
cat_cols = ['city', 'is_member']

numeric_pre = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_pre = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore'))
])

preprocess = ColumnTransformer(
    transformers=[
        ('num', numeric_pre, num_cols),
        ('cat', categorical_pre, cat_cols)
    ]
)

clf = Pipeline(steps=[
    ('preprocess', preprocess),
    ('model', LogisticRegression(max_iter=1000))
])

# clf.fit(X_train, y_train); clf.predict(X_test)
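
  • Why it works: each column set gets its own transformations, and nothing leaks as long as the whole pipeline is fit inside each CV fold.
  • Tip: if you see convergence warnings, raise max_iter or try solver='liblinear'; class_weight='balanced' addresses class imbalance rather than convergence.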
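
To see Example 1 run end to end, here is a sketch that rebuilds the same pipeline and evaluates it with 5-fold cross-validation. The dataset is synthetic and generated only for this illustration (the distributions, the 20 injected missing ages, and the target rule are all assumptions), so the scores themselves carry no meaning.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data, invented for illustration only.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age": rng.normal(40, 12, n),
    "income": rng.normal(50_000, 15_000, n),
    "city": rng.choice(["A", "B", "C"], n),
    "is_member": rng.choice([True, False], n),
})
X.loc[rng.choice(n, 20, replace=False), "age"] = np.nan  # inject missing ages
y = (X["income"] + rng.normal(0, 10_000, n) > 50_000).astype(int)

numeric_pre = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
categorical_pre = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric_pre, ["age", "income"]),
    ("cat", categorical_pre, ["city", "is_member"]),
])
clf = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

# Each fold refits the imputer, scaler, and encoder on its own training part.
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())
```

Because the full pipeline is passed to cross_val_score, the imputation mean and scaling statistics are recomputed per fold rather than once on all rows.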

Example 2: Text + numeric features

Goal: Predict sentiment using review text and review_length (numeric).

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

text_col = 'review'
num_cols = ['review_length']

text_pre = TfidfVectorizer(ngram_range=(1,2), min_df=2)
num_pre = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

preprocess = ColumnTransformer([
    ('txt', text_pre, text_col),
    ('num', num_pre, num_cols)
], remainder='drop')

clf = Pipeline([
    ('preprocess', preprocess),
    ('model', LinearSVC())
])

# Works with cross_val_score(clf, X, y, cv=5)

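Note: TfidfVectorizer expects a 1-D sequence of raw strings, so the text column is given to ColumnTransformer as a single string ('review'), not as a list (['review']). A string selects a 1-D column; a list selects a 2-D frame, which is what the numeric transformers expect.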

Example 3: Custom transform + grid search

Goal: Log-transform skewed income and tune model hyperparameters.

import numpy as np
from sklearn.preprocessing import FunctionTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

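# Pass np.log1p directly: a lambda cannot be pickled, which would block joblib.dump later.
log1p = FunctionTransformer(np.log1p, feature_names_out='one-to-one')

skewed_cols = ['income']
num_cols = ['age']
cat_cols = ['city']

# Skewed numeric columns: impute, log-transform, then scale.
# (Nesting a name-based ColumnTransformer after SimpleImputer fails,
# because the imputer returns a NumPy array without column names.)
skewed_pre = Pipeline([
    ('imputer', SimpleImputer()),
    ('log', log1p),
    ('scaler', StandardScaler())
])

# Remaining numeric columns: impute, then scale.
numeric_pre = Pipeline([
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler())
])

categorical_pre = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore'))
])

preprocess = ColumnTransformer([
    ('skewed', skewed_pre, skewed_cols),
    ('num', numeric_pre, num_cols),
    ('cat', categorical_pre, cat_cols)
])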

pipe = Pipeline([
    ('preprocess', preprocess),
    ('model', RandomForestClassifier(random_state=0))
])

params = {
    'model__n_estimators': [100, 300],
    'model__max_depth': [None, 10]
}

gs = GridSearchCV(pipe, params, cv=5, n_jobs=-1)
# gs.fit(X, y); gs.best_params_

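FunctionTransformer wraps the log transform as a named pipeline step, so it is applied identically during training, cross-validation, and prediction. Grid search tunes only the estimator here, but preprocessing parameters can be tuned the same way using step__substep__parameter names (for example, preprocess__cat__imputer__strategy).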
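
As a sketch of tuning a preprocessing parameter alongside a model parameter, the grid below searches over the numeric imputation strategy and the forest depth at the same time. The data is synthetic, and the smaller forest (n_estimators=20) and cv=3 are choices made only to keep the illustration fast.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data, invented for illustration only.
rng = np.random.default_rng(0)
n = 120
X = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.lognormal(10, 1, n),
    "city": rng.choice(["A", "B", "C"], n),
})
X.loc[rng.choice(n, 10, replace=False), "age"] = np.nan
y = (X["income"] > X["income"].median()).astype(int)

numeric_pre = Pipeline([
    ("imputer", SimpleImputer()),
    ("scaler", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric_pre, ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])
pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestClassifier(n_estimators=20, random_state=0)),
])

# Double underscores walk down the nesting:
# pipeline step -> ColumnTransformer branch -> inner step -> parameter.
params = {
    "preprocess__num__imputer__strategy": ["mean", "median"],
    "model__max_depth": [None, 5],
}
gs = GridSearchCV(pipe, params, cv=3)
gs.fit(X, y)
print(gs.best_params_)
```

pipe.get_params().keys() lists every tunable name if you are unsure how a nested parameter is spelled.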

Common mistakes and self-check

  • Mistake: Scaling/encoding before splitting data. Self-check: Ensure all preprocessing lives inside Pipeline; you call fit on the pipeline, not on raw transformers outside CV.
  • Mistake: OneHotEncoder without handle_unknown='ignore' causing errors on new categories. Self-check: Try a small holdout with unseen categories and confirm prediction runs.
  • Mistake: Column order mismatches when manually concatenating arrays. Self-check: Use ColumnTransformer; after fit, verify get_feature_names_out() shape and order.
  • Mistake: Using fit_transform on test data. Self-check: Only call transform on validation/test splits; the Pipeline handles this if you only call fit on training data.
  • Mistake: Dropping important columns via remainder='drop' unintentionally. Self-check: Intentionally set remainder='passthrough' when needed and verify feature count.
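
Two of these self-checks can be scripted. The sketch below, on synthetic data with made-up column names, confirms that the scaler inside a fitted pipeline learned its mean from the training split only, and that prediction still runs on a category never seen during training.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data, invented for illustration only.
rng = np.random.default_rng(42)
n = 100
X = pd.DataFrame({
    "income": rng.normal(50_000, 10_000, n),
    "city": rng.choice(["A", "B"], n),
})
y = (X["income"] > 50_000).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = Pipeline([
    ("preprocess", ColumnTransformer([
        ("num", StandardScaler(), ["income"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ])),
    ("model", LogisticRegression(max_iter=1000)),
])
clf.fit(X_train, y_train)  # fit on the training split only

# Self-check 1: the scaler's mean comes from X_train, not from all rows.
scaler = clf.named_steps["preprocess"].named_transformers_["num"]
fit_on_train_only = np.isclose(scaler.mean_[0], X_train["income"].mean())
print(fit_on_train_only)

# Self-check 2: an unseen category ("Z") does not break prediction.
unseen = pd.DataFrame({"income": [48_000.0], "city": ["Z"]})
pred = clf.predict(unseen)
print(pred)
```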

Exercises

Do these hands-on tasks, then take the Quick Test at the end.

Exercise 1: Mixed-type preprocessing pipeline

Create a pipeline that imputes and scales numeric columns and one-hot encodes categoricals, then trains LogisticRegression.

  • Numeric: age (missing values), income
  • Categorical: city, is_member

Starter code idea

# Build numeric_pre, categorical_pre, preprocess, then clf = Pipeline([...])

Exercise 2: Remove leakage with a Pipeline

Refactor code that currently imputes and scales the full dataset before train/test split. Move all steps into a Pipeline and evaluate with cross_val_score.

Exercise 3: Text + numeric ColumnTransformer

Build a pipeline that vectorizes a text column with TfidfVectorizer and scales a numeric length column, then trains LinearSVC.

Practical projects

  • Customer churn scoring: Combine demographics (numeric), plan info (categorical), and last ticket text (short text) into one pipeline and evaluate with ROC AUC.
  • Credit risk baseline: Impute, scale, and encode application features; add a custom transformer for log-transforming skewed amounts; tune model with grid search.
  • Support triage: TfidfVectorizer for issue description + categorical priority + numeric time-open; train a classifier and export the fitted pipeline for inference.

Mini challenge

Design a single Pipeline to predict whether a rider will reorder:

  • Numeric: trips_last_30d (skewed), avg_basket_value (missing)
  • Categorical: city_tier, device_type (may have unseen values)
  • Text: last_feedback (short free text)

Requirements:

  • Impute numeric, log-transform trips_last_30d, scale numeric
  • One-hot encode categoricals safely
  • TF–IDF for text
  • Evaluate with 5-fold CV without leakage

Hint

Use a ColumnTransformer with three branches (numeric, categorical, text) inside a Pipeline feeding your estimator. After fitting, call get_feature_names_out() on the ColumnTransformer to verify the feature count and order.

Next steps

  • Add feature selection (e.g., SelectKBest) as a Pipeline step and tune k.
  • Use FunctionTransformer to encode date parts cyclically (month, weekday) and include it in the ColumnTransformer.
  • Export and load the fitted pipeline with joblib for consistent production scoring.
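
The first and last of these steps can be sketched together: SelectKBest becomes a pipeline step whose k is tuned by grid search, and the best fitted pipeline is saved and reloaded with joblib. The dataset comes from make_classification, and the file name pipeline.joblib and the temporary directory are placeholders for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, invented for illustration only.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif)),
    ("model", LogisticRegression(max_iter=1000)),
])

# Feature selection is refit inside every fold, so k is tuned without leakage.
gs = GridSearchCV(pipe, {"select__k": [2, 4, 8]}, cv=5)
gs.fit(X, y)
print(gs.best_params_)

# Round-trip the fitted pipeline through a file (placeholder path).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pipeline.joblib")
    joblib.dump(gs.best_estimator_, path)
    loaded = joblib.load(path)

# The reloaded pipeline should reproduce the original predictions.
same = np.array_equal(loaded.predict(X), gs.best_estimator_.predict(X))
print(same)
```

In practice, load such files only from trusted sources and with the same scikit-learn version that produced them, since joblib files are pickle-based.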

Practice Exercises

Instructions

Create a scikit-learn Pipeline that:

  • Imputes numeric age (mean) and income (mean), then scales them.
  • Imputes categorical city and is_member (most_frequent), then OneHotEncoder(handle_unknown='ignore').
  • Combines with ColumnTransformer and trains LogisticRegression(max_iter=1000).
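  • After fit, call get_feature_names_out() on the preprocessing steps, for example pipeline[:-1].get_feature_names_out(), and check that it returns names. Calling it on the full pipeline raises an error because LogisticRegression does not provide feature names.

Expected Output

The pipeline fits without errors; pipeline[:-1].get_feature_names_out() returns an array of feature names combining the numeric and one-hot columns.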

Preprocessing Pipelines And Column Transformers — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.
