
Training Pipelines With Sklearn

Learn Training Pipelines With Sklearn for free with explanations, exercises, and a quick test (for Machine Learning Engineers).

Published: January 1, 2026 | Updated: January 1, 2026

Who this is for

You build or maintain ML models and want reliable, repeatable training that avoids data leakage. Ideal for Machine Learning Engineers moving from notebooks to production-ready workflows.

Prerequisites

  • Comfortable with Python and NumPy/Pandas basics
  • Know how to train simple scikit-learn models (e.g., LogisticRegression, RandomForest)
  • Basic understanding of cross-validation and metrics (accuracy, F1)

Why this matters

In real teams, models fail not because of algorithms but because of messy preprocessing, inconsistent training vs inference steps, and hidden data leakage. Pipelines in scikit-learn bundle preprocessing, feature engineering, and modeling into one consistent object you can tune, validate, and deploy safely.

  • Hiring task examples: Build a pipeline for mixed-type data; tune hyperparameters without leakage.
  • Production task examples: Version a pipeline; retrain with new data; export a single artifact that is predict-ready.

Concept explained simply

A Pipeline is like a conveyor belt. Raw data enters on one side; a sequence of steps (impute, encode, scale, model) runs in a fixed order; predictions come out the other side. You fit the entire belt once and use it everywhere.
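
A minimal sketch of that idea. The two-step pipeline and toy data below are illustrative, not part of the worked examples:

from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)

belt = Pipeline([
    ("scale", StandardScaler()),                   # step 1: preprocessing
    ("model", LogisticRegression(max_iter=1000))   # step 2: final estimator
])

belt.fit(X, y)              # fits the scaler, then the model, in order
print(belt.predict(X[:5]))  # raw rows in, predictions out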

Mental model

  • ColumnTransformer = routes: send each column type to the right preprocessing lane.
  • Pipeline = steps: chain transformations and the final estimator.
  • GridSearchCV/RandomizedSearchCV = automatic tuner: tries parameter combos across the whole pipeline safely.
  • Joblib = shrink-wrap: save and load the whole conveyor belt.

Core components you will use

  • Pipeline(steps=[("prep", ...), ("model", ...)])
  • ColumnTransformer(transformers=[("num", num_pipe, num_cols), ("cat", cat_pipe, cat_cols), ("text", text_pipe, text_cols)])
  • FunctionTransformer or custom TransformerMixin for custom features (a short sketch follows this list)
  • GridSearchCV/RandomizedSearchCV for tuning
  • cross_val_score/cross_validate for honest evaluation
  • joblib.dump/joblib.load for persistence
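
Custom feature logic is the one item above that the worked examples below do not cover. Here is a minimal sketch using FunctionTransformer; the log transform and the commented column routing are illustrative assumptions:

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# log1p is a common way to tame skewed, non-negative features such as income
log_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("log", FunctionTransformer(np.log1p)),
    ("scale", StandardScaler())
])

# Drop it into a ColumnTransformer lane like any other sub-pipeline, e.g.:
# ColumnTransformer([("skewed", log_pipe, ["income"]), ...])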

Worked examples

Example 1: Mixed numeric + categorical preprocessing with a single predict-ready object

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Fake dataset
rng = np.random.RandomState(0)
N = 500
X = pd.DataFrame({
    "age": rng.normal(40, 12, size=N),
    "income": rng.normal(60000, 15000, size=N),
    "city": rng.choice(["NY", "SF", "LA"], size=N),
    "plan": rng.choice(["basic", "pro"], size=N)
})
y = (X["income"] + 1000*(X["plan"] == "pro") + rng.normal(0, 5000, size=N) > 60000).astype(int)

num_cols = ["age", "income"]
cat_cols = ["city", "plan"]

num_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler())
])

cat_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

prep = ColumnTransformer([
    ("num", num_pipe, num_cols),
    ("cat", cat_pipe, cat_cols)
])

pipe = Pipeline([
    ("prep", prep),
    ("clf", LogisticRegression(max_iter=1000))
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
pipe.fit(X_train, y_train)
pred = pipe.predict(X_test)
print("Accuracy:", accuracy_score(y_test, pred))

Why this is good: imputing, encoding, scaling, and modeling are all applied consistently during fit and predict, reducing errors.
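
To inspect what the fitted preprocessing actually produced, the fitted ColumnTransformer can list its output columns. This assumes a recent scikit-learn (1.1 or later), where get_feature_names_out works through nested pipelines:

# After pipe.fit(X_train, y_train) above:
feature_names = pipe.named_steps["prep"].get_feature_names_out()
print(len(feature_names), "features after preprocessing")
print(feature_names[:4])  # e.g. num__age, num__income, cat__city_LA, cat__city_NY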

Example 2: Hyperparameter tuning without leakage

from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import f1_score, make_scorer

param_grid = {
    "clf__C": [0.1, 1, 10],
    "clf__penalty": ["l2"],
    "clf__solver": ["lbfgs"]
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grid = GridSearchCV(
    estimator=pipe,
    param_grid=param_grid,
    scoring=make_scorer(f1_score),
    cv=cv,
    n_jobs=-1,
    refit=True
)

grid.fit(X, y)
print("Best params:", grid.best_params_)
print("Best CV F1:", grid.best_score_)

best_model = grid.best_estimator_
# best_model is still a Pipeline; you can call .predict on raw X

Leakage is prevented because the imputer, encoder, and scaler are refit inside each CV fold using only that fold's training split.
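
The same leak-free behavior applies to plain cross-validation: pass the whole pipeline, and each fold refits preprocessing on its own training split. A minimal sketch reusing pipe, X, y, and cv from the examples above:

from sklearn.model_selection import cross_val_score

# Every fold fits imputer, encoder, scaler, and model from scratch
scores = cross_val_score(pipe, X, y, cv=cv, scoring="f1", n_jobs=-1)
print("F1 per fold:", scores.round(3))
print("Mean F1:", scores.mean().round(3))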

Example 3: Combining text and structured features

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# Sample data with text
X = pd.DataFrame({
    "review": ["great battery", "poor screen", "excellent camera", "battery ok"],
    "price": [299, 199, 499, 249],
    "brand": ["A", "B", "A", "C"]
})
y = np.array([1, 0, 1, 0])

text_col = "review"  # a single column name (string, not list) so TfidfVectorizer receives a 1D column of text
num_cols = ["price"]
cat_cols = ["brand"]

text_pipe = Pipeline([
    ("tfidf", TfidfVectorizer())
])

num_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler())
])

cat_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

prep = ColumnTransformer([
    ("text", text_pipe, "review"),
    ("num", num_pipe, num_cols),
    ("cat", cat_pipe, cat_cols)
])

pipe = Pipeline([
    ("prep", prep),
    ("clf", RandomForestClassifier(random_state=0))
])

pipe.fit(X, y)
print(pipe.predict(X))

Different modalities are handled in a single ColumnTransformer. The RandomForest sees a combined sparse/dense feature space seamlessly.
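
Nested parameters stay tunable through double-underscore paths that walk the structure (pipeline step, ColumnTransformer lane, inner step, parameter). The grid below is an illustrative assumption; cv=2 is used only because this toy dataset has four rows:

from sklearn.model_selection import GridSearchCV

param_grid = {
    "prep__text__tfidf__max_features": [50, 100],  # TfidfVectorizer inside the "text" lane
    "clf__n_estimators": [100, 300]                # the final RandomForestClassifier
}
search = GridSearchCV(pipe, param_grid=param_grid, cv=2, scoring="accuracy")
search.fit(X, y)
print(search.best_params_)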

How to build robust sklearn pipelines (steps)

  1. List your raw inputs: numeric, categorical, text, dates, images (if any).
  2. Decide per-type preprocessing: impute, encode, scale, tokenize.
  3. Create per-type sub-pipelines.
  4. Assemble with ColumnTransformer.
  5. Append a final estimator in a Pipeline.
  6. Tune with GridSearchCV/RandomizedSearchCV using step-prefixed param names (e.g., clf__n_estimators).
  7. Evaluate with cross-validation on the full pipeline.
  8. Persist the fitted pipeline with joblib and reuse it for inference (see the sketch below).
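
A minimal persistence sketch for step 8, assuming the fitted pipe and X_test from Example 1; the file name is an illustrative choice:

import joblib

# Save the whole fitted pipeline (preprocessing + model) as one artifact
joblib.dump(pipe, "churn_pipeline.joblib")

# Later, in an inference script or batch job:
loaded = joblib.load("churn_pipeline.joblib")
preds = loaded.predict(X_test)  # raw DataFrame in, predictions out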

Exercises

These mirror the graded exercise below. Do them directly in your IDE or notebook.

Exercise 1: Build and tune a leak-free pipeline

Goal: Create a pipeline for a synthetic binary classification dataset with mixed features; tune C for LogisticRegression; report CV accuracy and the best params.

  • Numeric: two columns with missing values and scaling
  • Categorical: two columns with unknown categories at test time

Checklist:

  • [ ] Use SimpleImputer for both numeric and categorical
  • [ ] Use StandardScaler for numeric
  • [ ] Use OneHotEncoder(handle_unknown="ignore") for categorical
  • [ ] Use Pipeline + ColumnTransformer
  • [ ] Tune clf__C over [0.1, 1, 10] with StratifiedKFold(5)
  • [ ] Print best params and mean CV accuracy

Common mistakes and how to self-check

  • Fitting preprocessing on the full dataset before CV. Self-check: Does your code call fit only on the pipeline object and not separately on transformers?
  • Forgetting handle_unknown="ignore" in OneHotEncoder. Self-check: Can your pipeline predict on data with unseen categories without errors? (A quick check is sketched after this list.)
  • Scaling after the model or outside the pipeline. Self-check: Is scale inside the Pipeline before the estimator?
  • Using column names that do not exist at inference. Self-check: Keep the same schema and order; prefer DataFrame inputs with explicit column lists.
  • Not setting max_iter for solvers. Self-check: Avoid convergence warnings by setting reasonable max_iter.
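
A quick self-check for the unseen-category point, assuming the fitted pipe from Example 1. The city value "Austin" is made up and never appears in training:

new_row = pd.DataFrame({
    "age": [35.0],
    "income": [np.nan],     # missing value: filled by the imputer
    "city": ["Austin"],     # unseen category: ignored by the encoder
    "plan": ["pro"]
})
print(pipe.predict(new_row))  # should run without raising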

Practical projects

  • Customer churn: Mixed numeric/categorical features; compare LogisticRegression vs XGBoost (via sklearn API) with a shared preprocessing pipeline.
  • Text + metadata sentiment model: Tfidf on review text plus price/brand features in a ColumnTransformer.
  • Time-sliced retraining: Fit a pipeline on months 1–6, validate on month 7; package with joblib and load for batch scoring.

Learning path

  1. Refresh scikit-learn transformers and estimators
  2. Master ColumnTransformer and Pipeline
  3. Practice leak-free cross-validation over full pipelines
  4. Hyperparameter tuning with GridSearchCV/RandomizedSearchCV
  5. Custom transformers (FunctionTransformer or TransformerMixin)
  6. Model persistence, inference scripts, and schema checks

Next steps

  • Wrap your pipeline in a simple script that loads a CSV and outputs predictions
  • Add input validation (column presence, dtypes); a combined sketch of both follows this list
  • Benchmark two models under the same preprocessing
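
A minimal batch-scoring sketch covering the first two items. The file names, function name, and required-column list are illustrative assumptions based on Example 1:

import joblib
import pandas as pd

REQUIRED_COLS = ["age", "income", "city", "plan"]  # schema from Example 1

def score_csv(input_csv, output_csv, model_path="churn_pipeline.joblib"):
    pipe = joblib.load(model_path)
    df = pd.read_csv(input_csv)

    # Input validation: fail loudly if the schema does not match
    missing = [c for c in REQUIRED_COLS if c not in df.columns]
    if missing:
        raise ValueError(f"Input is missing required columns: {missing}")

    df["prediction"] = pipe.predict(df[REQUIRED_COLS])
    df.to_csv(output_csv, index=False)

# Example usage: score_csv("new_customers.csv", "scored.csv")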

Mini challenge

Build a pipeline that handles:

  • Numeric: impute median, scale
  • Categorical: impute most_frequent, one-hot with unknowns
  • Text: Tfidf with min_df=2

Train a linear model (e.g., LogisticRegression) and report cross-validated F1. Acceptance: You can call predict on a small test batch containing a new category and it runs without errors.

Practice Exercises

1 exercise to complete

Instructions

Create a synthetic dataset and build a full Pipeline with ColumnTransformer. Tune LogisticRegression C. Report best params and CV accuracy.

Starter code
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1) Data
data_X, y = make_classification(n_samples=800, n_features=2, n_informative=2, n_redundant=0, random_state=0)
numA = data_X[:, 0]
numB = data_X[:, 1]
rng = np.random.RandomState(0)
# Inject some missing values
mask = rng.rand(numA.shape[0]) < 0.1
numA[mask] = np.nan

cat1 = rng.choice(["red","green","blue"], size=800)
cat2 = rng.choice(["basic","pro"], size=800)

X = pd.DataFrame({"numA": numA, "numB": numB, "cat1": cat1, "cat2": cat2})

# 2) TODO: Build numeric and categorical pipelines
num_cols = ["numA", "numB"]
cat_cols = ["cat1", "cat2"]

num_pipe = Pipeline([
    # fill here
])

cat_pipe = Pipeline([
    # fill here
])

prep = ColumnTransformer([
    ("num", num_pipe, num_cols),
    ("cat", cat_pipe, cat_cols)
])

pipe = Pipeline([
    ("prep", prep),
    ("clf", LogisticRegression(max_iter=1000))
])

# 3) Tune C
param_grid = {"clf__C": [0.1, 1, 10]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(pipe, param_grid=param_grid, scoring="accuracy", cv=cv, n_jobs=-1, refit=True)
search.fit(X, y)

print("Best params:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
Expected Output
Printed best_params_ with a value of clf__C among [0.1, 1, 10] and a mean CV accuracy score (typically > 0.70 on this synthetic data).

Training Pipelines With Sklearn — Quick Test

Test your knowledge with 7 questions. Pass with 70% or higher.
