Who this is for
You build ML models with mixed data types and want clean, repeatable preprocessing that avoids data leakage and works end-to-end in training, validation, and production.
Prerequisites
- Python basics (functions, lists, dicts)
- Familiarity with NumPy/Pandas data structures
- Basic ML concepts (fit/transform, train/test split, classification/regression)
Learning path
- Understand the problem: why preprocessing must be consistent.
- Learn Pipeline and ColumnTransformer mental models.
- Build a basic numeric + categorical pipeline.
- Extend to text, dates, custom transformers.
- Integrate with cross-validation and grid search safely.
- Package and evaluate end-to-end.
Why this matters
- Prevents data leakage: preprocessing is fitted inside each CV fold, on the training portion only.
- Reproducibility: one object represents the full workflow from raw columns to predictions.
- Production readiness: the same steps used in training are applied to new data without manual hacks.
- Speed: iterate safely with grid search without rewriting preprocessing code.
Real tasks you will do as an ML Engineer
- Train a churn model mixing numeric usage metrics and categorical plan types.
- Score leads using text notes + structured CRM fields.
- Ship a fraud model where the same preprocessing must run in batch and online.
Concept explained simply
A Pipeline chains steps: transformers first (imputation, scaling, encoding), then an estimator (model). A ColumnTransformer applies different transformers to different column subsets at once. Together, they let you define “how to turn raw columns into model-ready features” in one place.
Mental model
Imagine a factory conveyor belt. Each station changes the product. A ColumnTransformer is a switchboard that routes different columns to different stations (e.g., numbers to a scaler, categories to an encoder). The final station is your model.
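A two-station version of that conveyor belt, as a minimal sketch (X, y, and X_new stand in for your data):
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
pipe = Pipeline([('impute', SimpleImputer()), ('model', LogisticRegression())])
# pipe.fit(X, y)       -> fits the imputer on X, transforms X, then fits the model
# pipe.predict(X_new)  -> reuses the fitted imputer (transform only), then predicts
The asymmetry is the whole point: each station learns its statistics once, during fit, and only replays them at prediction time.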
Key building blocks
- Pipeline(steps=[('name', transformer_or_estimator), ...])
- ColumnTransformer(transformers=[('name', transformer, [cols])], remainder='drop' or 'passthrough')
- Common transformers: SimpleImputer, StandardScaler, OneHotEncoder, FunctionTransformer, TfidfVectorizer
- Safety options: OneHotEncoder(handle_unknown='ignore'), ColumnTransformer(remainder='passthrough')
- Feature names: pipeline.get_feature_names_out() after fit (scikit-learn 1.0+ compatible transformers)
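A tiny sketch of how the pieces name their outputs (toy frame invented here; note the remainder__ prefix on passthrough columns):
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
df = pd.DataFrame({'age': [30, 40], 'income': [50.0, 60.0], 'city': ['a', 'b']})
ct = ColumnTransformer([('num', StandardScaler(), ['age', 'income'])],
                       remainder='passthrough')
ct.fit(df)
print(ct.get_feature_names_out())
# ['num__age' 'num__income' 'remainder__city']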
Worked examples
Example 1: Mixed numeric + categorical for classification
Goal: Predict churn using age (float, has missing), income (float), city (category), is_member (bool).
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
num_cols = ['age', 'income']
cat_cols = ['city', 'is_member']
numeric_pre = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])
categorical_pre = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore'))
])
preprocess = ColumnTransformer(
    transformers=[
        ('num', numeric_pre, num_cols),
        ('cat', categorical_pre, cat_cols)
    ]
)
clf = Pipeline(steps=[
    ('preprocess', preprocess),
    ('model', LogisticRegression(max_iter=1000))
])
# clf.fit(X_train, y_train); clf.predict(X_test)
- Why it works: Each column subset gets its own transformations, and nothing leaks as long as the whole pipeline is fitted inside each CV fold.
- Tip: If LogisticRegression warns about convergence, raise max_iter or try solver='liblinear'; use class_weight='balanced' for imbalanced classes.
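A quick smoke test for the pipeline above, on a toy frame invented here (the NaN in age exercises the imputer):
import numpy as np
import pandas as pd
X = pd.DataFrame({
    'age': [34.0, np.nan, 52.0, 23.0],
    'income': [48000.0, 61000.0, 75000.0, 39000.0],
    'city': ['austin', 'boston', 'austin', 'denver'],
    'is_member': [True, False, True, False],
})
y = pd.Series([0, 1, 0, 1])
clf.fit(X, y)
print(clf.predict(X))  # predictions for the four toy rows; the point is it runs end-to-end
print(clf.named_steps['preprocess'].get_feature_names_out())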
Example 2: Text + numeric features
Goal: Predict sentiment using review text and review_length (numeric).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
text_col = 'review'
num_cols = ['review_length']
text_pre = TfidfVectorizer(ngram_range=(1,2), min_df=2)
num_pre = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
preprocess = ColumnTransformer([
    ('txt', text_pre, text_col),
    ('num', num_pre, num_cols)
], remainder='drop')
clf = Pipeline([
    ('preprocess', preprocess),
    ('model', LinearSVC())
])
# Works with cross_val_score(clf, X, y, cv=5)
Note: TfidfVectorizer consumes raw text, so the text column is passed as a string ('review') rather than a list; ColumnTransformer then hands the vectorizer the 1D column of strings it expects.
Example 3: Custom transform + grid search
Goal: Log-transform skewed income and tune model hyperparameters.
import numpy as np
from sklearn.preprocessing import FunctionTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
log1p = FunctionTransformer(np.log1p, feature_names_out='one-to-one')  # np.log1p directly (not a lambda) keeps the pipeline picklable
num_cols = ['age', 'income']
cat_cols = ['city']
numeric_pre = Pipeline([
    ('imputer', SimpleImputer()),
    # After SimpleImputer the data is a plain array with no column names,
    # so the inner ColumnTransformer must select by position: per num_cols,
    # column 0 is age and column 1 is income.
    ('log_income', ColumnTransformer([
        ('log', log1p, [1]),
        ('pass_age', 'passthrough', [0])
    ], remainder='drop')),
    ('scaler', StandardScaler())
])
categorical_pre = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore'))
])
preprocess = ColumnTransformer([
    ('num', numeric_pre, num_cols),
    ('cat', categorical_pre, cat_cols)
])
pipe = Pipeline([
    ('preprocess', preprocess),
    ('model', RandomForestClassifier(random_state=0))
])
params = {
    'model__n_estimators': [100, 300],
    'model__max_depth': [None, 10]
}
gs = GridSearchCV(pipe, params, cv=5, n_jobs=-1)
# gs.fit(X, y); gs.best_params_
The FunctionTransformer gives the log transform a named, inspectable place inside the pipeline. Grid search tunes only the estimator here, but preprocessing parameters are reachable through the same double-underscore syntax, as sketched below.
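For instance, a grid can reach into the preprocessing through the step__param path (names below follow this example's step names):
params = {
    'model__n_estimators': [100, 300],
    'model__max_depth': [None, 10],
    'preprocess__num__imputer__strategy': ['mean', 'median'],  # tune the numeric imputer too
}
gs = GridSearchCV(pipe, params, cv=5, n_jobs=-1)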
Common mistakes and self-check
- Mistake: Scaling/encoding before splitting data. Self-check: Ensure all preprocessing lives inside Pipeline; you call fit on the pipeline, not on raw transformers outside CV.
- Mistake: OneHotEncoder without handle_unknown='ignore' causing errors on new categories. Self-check: Try a small holdout with unseen categories and confirm prediction runs.
- Mistake: Column order mismatches when manually concatenating arrays. Self-check: Use ColumnTransformer; after fit, verify get_feature_names_out() shape and order.
- Mistake: Using fit_transform on test data. Self-check: Only call transform on validation/test splits; the Pipeline handles this if you only call fit on training data.
- Mistake: Losing columns because remainder='drop' is ColumnTransformer's default, so unlisted columns silently disappear. Self-check: Set remainder='passthrough' deliberately when untransformed columns should survive, and verify the feature count after fit.
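To guard against the first mistake in practice, keep every fitted step inside the pipeline and let the CV utility do the splitting. A minimal sketch, reusing clf from Example 1 (X and y are the raw, unpreprocessed data):
from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf, X, y, cv=5)
# Each fold re-fits the imputer, scaler, and encoder on that fold's training
# rows only, so no test-fold statistics leak into preprocessing.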
Exercises
Do these hands-on tasks.
Exercise 1: Mixed-type preprocessing pipeline
Create a pipeline that imputes and scales numeric columns and one-hot encodes categoricals, then trains LogisticRegression.
- Numeric: age (missing values), income
- Categorical: city, is_member
Starter code idea
# Build numeric_pre, categorical_pre, preprocess, then clf = Pipeline([...])
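One possible skeleton (median imputation here is a choice, not a requirement):
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
numeric_pre = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
categorical_pre = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore'))
])
preprocess = ColumnTransformer([
    ('num', numeric_pre, ['age', 'income']),
    ('cat', categorical_pre, ['city', 'is_member'])
])
clf = Pipeline([
    ('preprocess', preprocess),
    ('model', LogisticRegression(max_iter=1000))
])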
Exercise 2: Remove leakage with a Pipeline
Refactor code that currently imputes and scales the full dataset before train/test split. Move all steps into a Pipeline and evaluate with cross_val_score.
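For concreteness, the leaky pattern looks like this (X and y are assumed to be a numeric DataFrame and a target Series):
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# LEAKY: the imputer and scaler see every row, including future test rows,
# before the split happens.
X_prep = StandardScaler().fit_transform(SimpleImputer().fit_transform(X))
X_train, X_test, y_train, y_test = train_test_split(X_prep, y, random_state=0)
Your refactor should move both steps into a Pipeline so cross_val_score fits them per fold.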
Exercise 3: Text + numeric ColumnTransformer
Build a pipeline that vectorizes a text column with TfidfVectorizer and scales a numeric length column, then trains LinearSVC.
Practical projects
- Customer churn scoring: Combine demographics (numeric), plan info (categorical), and last ticket text (short text) into one pipeline and evaluate with ROC AUC.
- Credit risk baseline: Impute, scale, and encode application features; add a custom transformer for log-transforming skewed amounts; tune model with grid search.
- Support triage: TfidfVectorizer for issue description + categorical priority + numeric time-open; train a classifier and export the fitted pipeline for inference.
Mini challenge
Design a single Pipeline to predict whether a rider will reorder:
- Numeric: trips_last_30d (skewed), avg_basket_value (missing)
- Categorical: city_tier, device_type (may have unseen values)
- Text: last_feedback (short free text)
Requirements:
- Impute numeric, log-transform trips_last_30d, scale numeric
- One-hot encode categoricals safely
- TF–IDF for text
- Evaluate with 5-fold CV without leakage
Hint
Use a ColumnTransformer with three branches (numeric, categorical, text) inside a Pipeline feeding your estimator. Verify with get_feature_names_out() after fit.
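One possible scaffold (the numeric branch is split in two so the log applies only to trips_last_30d; LogisticRegression is a placeholder estimator):
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
trips_pre = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('log', FunctionTransformer(np.log1p, feature_names_out='one-to-one')),
    ('scaler', StandardScaler())
])
basket_pre = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
cat_pre = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore'))
])
preprocess = ColumnTransformer([
    ('trips', trips_pre, ['trips_last_30d']),
    ('basket', basket_pre, ['avg_basket_value']),
    ('cat', cat_pre, ['city_tier', 'device_type']),
    ('txt', TfidfVectorizer(min_df=2), 'last_feedback')  # string selector for the raw text column
])
pipe = Pipeline([('preprocess', preprocess), ('model', LogisticRegression(max_iter=1000))])
# cross_val_score(pipe, X, y, cv=5) satisfies the no-leakage requirement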
Next steps
- Add feature selection (e.g., SelectKBest) as a Pipeline step and tune k.
- Use FunctionTransformer to convert dates into cyclical encodings (month, weekday) and include it in the ColumnTransformer (sketched below).
- Export and load the fitted pipeline with joblib for consistent production scoring.
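A sketch of the cyclical-date idea (signup_date is a hypothetical column name; weekday works the same way with period 7):
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer
def month_sin_cos(X):
    # X arrives as a one-column frame/array of dates; return sin/cos of the month
    m = pd.to_datetime(pd.DataFrame(X).iloc[:, 0]).dt.month.to_numpy()
    return np.c_[np.sin(2 * np.pi * m / 12), np.cos(2 * np.pi * m / 12)]
cyclical_month = FunctionTransformer(month_sin_cos)
# Include as ('dates', cyclical_month, ['signup_date']) in your ColumnTransformer.
And exporting the fitted pipeline is two lines:
import joblib
joblib.dump(clf, 'churn_pipeline.joblib')          # after clf.fit(...)
clf_loaded = joblib.load('churn_pipeline.joblib')  # identical transforms at scoring time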