
Training Pipelines With Sklearn

Learn Training Pipelines With Sklearn for free with explanations, exercises, and a quick test (for Machine Learning Engineers).

Published: January 1, 2026 | Updated: January 1, 2026

Who this is for

You build or maintain ML models and want reliable, repeatable training that avoids data leakage. Ideal for Machine Learning Engineers moving from notebooks to production-ready workflows.

Prerequisites

  • Comfortable with Python and NumPy/Pandas basics
  • Know how to train simple scikit-learn models (e.g., LogisticRegression, RandomForest)
  • Basic understanding of cross-validation and metrics (accuracy, F1)

Why this matters

In real teams, models fail not because of algorithms but because of messy preprocessing, inconsistent training vs inference steps, and hidden data leakage. Pipelines in scikit-learn bundle preprocessing, feature engineering, and modeling into one consistent object you can tune, validate, and deploy safely.

  • Hiring task examples: Build a pipeline for mixed-type data; tune hyperparameters without leakage.
  • Production task examples: Version a pipeline; retrain with new data; export a single artifact that is predict-ready.

Concept explained simply

A Pipeline is like a conveyor belt. Raw data enters on one side; a sequence of steps (impute, encode, scale, model) runs in a fixed order; predictions come out the other side. You fit the entire belt once and use it everywhere.
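
A minimal sketch of that idea. The two-step pipeline and toy data below are illustrative, not part of the worked examples:

from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)

belt = Pipeline([
    ("scale", StandardScaler()),                   # step 1: preprocessing
    ("model", LogisticRegression(max_iter=1000))   # step 2: final estimator
])

belt.fit(X, y)              # fits the scaler, then the model, in order
print(belt.predict(X[:5]))  # raw rows in, predictions out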

Mental model

  • ColumnTransformer = routes: send each column type to the right preprocessing lane.
  • Pipeline = steps: chain transformations and the final estimator.
  • GridSearchCV/RandomizedSearchCV = automatic tuner: tries parameter combos across the whole pipeline safely.
  • Joblib = shrink-wrap: save and load the whole conveyor belt.

Core components you will use

  • Pipeline(steps=[("prep", ...), ("model", ...)])
  • ColumnTransformer(transformers=[("num", num_pipe, num_cols), ("cat", cat_pipe, cat_cols), ("text", text_pipe, text_cols)])
  • FunctionTransformer or custom TransformerMixin for custom features (a short sketch follows this list)
  • GridSearchCV/RandomizedSearchCV for tuning
  • cross_val_score/cross_validate for honest evaluation
  • joblib.dump/joblib.load for persistence
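
Custom feature logic is the one item above that the worked examples below do not cover. Here is a minimal sketch using FunctionTransformer; the log transform and the commented column routing are illustrative assumptions:

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# log1p is a common way to tame skewed, non-negative features such as income
log_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("log", FunctionTransformer(np.log1p)),
    ("scale", StandardScaler())
])

# Drop it into a ColumnTransformer lane like any other sub-pipeline, e.g.:
# ColumnTransformer([("skewed", log_pipe, ["income"]), ...])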

Worked examples

Example 1: Mixed numeric + categorical preprocessing with a single predict-ready object

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Fake dataset
rng = np.random.RandomState(0)
N = 500
X = pd.DataFrame({
    "age": rng.normal(40, 12, size=N),
    "income": rng.normal(60000, 15000, size=N),
    "city": rng.choice(["NY", "SF", "LA"], size=N),
    "plan": rng.choice(["basic", "pro"], size=N)
})
y = (X["income"] + 1000*(X["plan"] == "pro") + rng.normal(0, 5000, size=N) > 60000).astype(int)

num_cols = ["age", "income"]
cat_cols = ["city", "plan"]

num_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler())
])

cat_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

prep = ColumnTransformer([
    ("num", num_pipe, num_cols),
    ("cat", cat_pipe, cat_cols)
])

pipe = Pipeline([
    ("prep", prep),
    ("clf", LogisticRegression(max_iter=1000))
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
pipe.fit(X_train, y_train)
pred = pipe.predict(X_test)
print("Accuracy:", accuracy_score(y_test, pred))

Why this is good: imputing, encoding, scaling, and modeling are all applied consistently during fit and predict, reducing errors.
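
To inspect what the fitted preprocessing actually produced, the fitted ColumnTransformer can list its output columns. This assumes a recent scikit-learn (1.1 or later), where get_feature_names_out works through nested pipelines:

# After pipe.fit(X_train, y_train) above:
feature_names = pipe.named_steps["prep"].get_feature_names_out()
print(len(feature_names), "features after preprocessing")
print(feature_names[:4])  # e.g. num__age, num__income, cat__city_LA, cat__city_NY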

Example 2: Hyperparameter tuning without leakage

from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import f1_score, make_scorer

param_grid = {
    "clf__C": [0.1, 1, 10],
    "clf__penalty": ["l2"],
    "clf__solver": ["lbfgs"]
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grid = GridSearchCV(
    estimator=pipe,
    param_grid=param_grid,
    scoring=make_scorer(f1_score),
    cv=cv,
    n_jobs=-1,
    refit=True
)

grid.fit(X, y)
print("Best params:", grid.best_params_)
print("Best CV F1:", grid.best_score_)

best_model = grid.best_estimator_
# best_model is still a Pipeline; you can call .predict on raw X

Leakage is prevented because the imputer, encoder, and scaler are refit inside each CV fold using only that fold's training split.
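
The same leak-free behavior applies to plain cross-validation: pass the whole pipeline, and each fold refits preprocessing on its own training split. A minimal sketch reusing pipe, X, y, and cv from the examples above:

from sklearn.model_selection import cross_val_score

# Every fold fits imputer, encoder, scaler, and model from scratch
scores = cross_val_score(pipe, X, y, cv=cv, scoring="f1", n_jobs=-1)
print("F1 per fold:", scores.round(3))
print("Mean F1:", scores.mean().round(3))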

Example 3: Combining text and structured features

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# Sample data with text
X = pd.DataFrame({
    "review": ["great battery", "poor screen", "excellent camera", "battery ok"],
    "price": [299, 199, 499, 249],
    "brand": ["A", "B", "A", "C"]
})
y = np.array([1, 0, 1, 0])

text_col = "review"  # a single column name (string, not list) so TfidfVectorizer receives a 1D column of text
num_cols = ["price"]
cat_cols = ["brand"]

text_pipe = Pipeline([
    ("tfidf", TfidfVectorizer())
])

num_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler())
])

cat_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

prep = ColumnTransformer([
    ("text", text_pipe, "review"),
    ("num", num_pipe, num_cols),
    ("cat", cat_pipe, cat_cols)
])

pipe = Pipeline([
    ("prep", prep),
    ("clf", RandomForestClassifier(random_state=0))
])

pipe.fit(X, y)
print(pipe.predict(X))

Different modalities are handled in a single ColumnTransformer. The RandomForest sees a combined sparse/dense feature space seamlessly.
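
Nested parameters stay tunable through double-underscore paths that walk the structure (pipeline step, ColumnTransformer lane, inner step, parameter). The grid below is an illustrative assumption; cv=2 is used only because this toy dataset has four rows:

from sklearn.model_selection import GridSearchCV

param_grid = {
    "prep__text__tfidf__max_features": [50, 100],  # TfidfVectorizer inside the "text" lane
    "clf__n_estimators": [100, 300]                # the final RandomForestClassifier
}
search = GridSearchCV(pipe, param_grid=param_grid, cv=2, scoring="accuracy")
search.fit(X, y)
print(search.best_params_)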

How to build robust sklearn pipelines (steps)

  1. List your raw inputs: numeric, categorical, text, dates, images (if any).
  2. Decide per-type preprocessing: impute, encode, scale, tokenize.
  3. Create per-type sub-pipelines.
  4. Assemble with ColumnTransformer.
  5. Append a final estimator in a Pipeline.
  6. Tune with GridSearchCV/RandomizedSearchCV using step-prefixed param names (e.g., clf__n_estimators).
  7. Evaluate with cross-validation on the full pipeline.
  8. Persist the fitted pipeline with joblib and reuse it for inference (see the sketch below).
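
A minimal persistence sketch for step 8, assuming the fitted pipe and X_test from Example 1; the file name is an illustrative choice:

import joblib

# Save the whole fitted pipeline (preprocessing + model) as one artifact
joblib.dump(pipe, "churn_pipeline.joblib")

# Later, in an inference script or batch job:
loaded = joblib.load("churn_pipeline.joblib")
preds = loaded.predict(X_test)  # raw DataFrame in, predictions out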

Exercises

These mirror the graded exercise below. Do them directly in your IDE or notebook.

Exercise 1: Build and tune a leak-free pipeline

Goal: Create a pipeline for a synthetic binary classification dataset with mixed features; tune C for LogisticRegression; report CV accuracy and the best params.

  • Numeric: two columns with missing values and scaling
  • Categorical: two columns with unknown categories at test time

Checklist:

  • [ ] Use SimpleImputer for both numeric and categorical
  • [ ] Use StandardScaler for numeric
  • [ ] Use OneHotEncoder(handle_unknown="ignore") for categorical
  • [ ] Use Pipeline + ColumnTransformer
  • [ ] Tune clf__C over [0.1, 1, 10] with StratifiedKFold(5)
  • [ ] Print best params and mean CV accuracy

Common mistakes and how to self-check

  • Fitting preprocessing on the full dataset before CV. Self-check: Does your code call fit only on the pipeline object and not separately on transformers?
  • Forgetting handle_unknown="ignore" in OneHotEncoder. Self-check: Can your pipeline predict on data with unseen categories without errors? (A quick check is sketched after this list.)
  • Scaling after the model or outside the pipeline. Self-check: Is scale inside the Pipeline before the estimator?
  • Using column names that do not exist at inference. Self-check: Keep the same schema and order; prefer DataFrame inputs with explicit column lists.
  • Not setting max_iter for solvers. Self-check: Avoid convergence warnings by setting reasonable max_iter.
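
A quick self-check for the unseen-category point, assuming the fitted pipe from Example 1. The city value "Austin" is made up and never appears in training:

new_row = pd.DataFrame({
    "age": [35.0],
    "income": [np.nan],     # missing value: filled by the imputer
    "city": ["Austin"],     # unseen category: ignored by the encoder
    "plan": ["pro"]
})
print(pipe.predict(new_row))  # should run without raising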

Practical projects

  • Customer churn: Mixed numeric/categorical features; compare LogisticRegression vs XGBoost (via sklearn API) with a shared preprocessing pipeline.
  • Text + metadata sentiment model: Tfidf on review text plus price/brand features in a ColumnTransformer.
  • Time-sliced retraining: Fit a pipeline on months 1–6, validate on month 7; package with joblib and load for batch scoring.

Learning path

  1. Refresh scikit-learn transformers and estimators
  2. Master ColumnTransformer and Pipeline
  3. Practice leak-free cross-validation over full pipelines
  4. Hyperparameter tuning with GridSearchCV/RandomizedSearchCV
  5. Custom transformers (FunctionTransformer or TransformerMixin)
  6. Model persistence, inference scripts, and schema checks

Next steps

  • Wrap your pipeline in a simple script that loads a CSV and outputs predictions
  • Add input validation (column presence, dtypes); a combined sketch of both follows this list
  • Benchmark two models under the same preprocessing
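
A minimal batch-scoring sketch covering the first two items. The file names, function name, and required-column list are illustrative assumptions based on Example 1:

import joblib
import pandas as pd

REQUIRED_COLS = ["age", "income", "city", "plan"]  # schema from Example 1

def score_csv(input_csv, output_csv, model_path="churn_pipeline.joblib"):
    pipe = joblib.load(model_path)
    df = pd.read_csv(input_csv)

    # Input validation: fail loudly if the schema does not match
    missing = [c for c in REQUIRED_COLS if c not in df.columns]
    if missing:
        raise ValueError(f"Input is missing required columns: {missing}")

    df["prediction"] = pipe.predict(df[REQUIRED_COLS])
    df.to_csv(output_csv, index=False)

# Example usage: score_csv("new_customers.csv", "scored.csv")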

Mini challenge

Build a pipeline that handles:

  • Numeric: impute median, scale
  • Categorical: impute most_frequent, one-hot with unknowns
  • Text: Tfidf with min_df=2

Train a linear model (e.g., LogisticRegression) and report cross-validated F1. Acceptance: You can call predict on a small test batch containing a new category and it runs without errors.

Practice Exercises

1 exercise to complete

Instructions

Create a synthetic dataset and build a full Pipeline with ColumnTransformer. Tune LogisticRegression C. Report best params and CV accuracy.

Starter code
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1) Data
data_X, y = make_classification(n_samples=800, n_features=2, n_informative=2, n_redundant=0, random_state=0)
numA = data_X[:, 0]
numB = data_X[:, 1]
rng = np.random.RandomState(0)
# Inject some missing values
mask = rng.rand(numA.shape[0]) < 0.1
numA[mask] = np.nan

cat1 = rng.choice(["red","green","blue"], size=800)
cat2 = rng.choice(["basic","pro"], size=800)

X = pd.DataFrame({"numA": numA, "numB": numB, "cat1": cat1, "cat2": cat2})

# 2) TODO: Build numeric and categorical pipelines
num_cols = ["numA", "numB"]
cat_cols = ["cat1", "cat2"]

num_pipe = Pipeline([
    # fill here
])

cat_pipe = Pipeline([
    # fill here
])

prep = ColumnTransformer([
    ("num", num_pipe, num_cols),
    ("cat", cat_pipe, cat_cols)
])

pipe = Pipeline([
    ("prep", prep),
    ("clf", LogisticRegression(max_iter=1000))
])

# 3) Tune C
param_grid = {"clf__C": [0.1, 1, 10]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(pipe, param_grid=param_grid, scoring="accuracy", cv=cv, n_jobs=-1, refit=True)
search.fit(X, y)

print("Best params:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
Expected Output
Printed best_params_ with a value of clf__C among [0.1, 1, 10] and a mean CV accuracy score (typically > 0.70 on this synthetic data).

Training Pipelines With Sklearn — Quick Test

Test your knowledge with 7 questions. Pass with 70% or higher.
