
Integrating Custom Transformers

Learn to integrate custom transformers for free, with explanations, exercises, and a quick test (aimed at Machine Learning Engineers).

Published: January 1, 2026 | Updated: January 1, 2026

Why this matters

Custom transformers let you put your domain logic directly into ML pipelines. That means cleaner training code, consistent preprocessing in production, and easier experimentation.

  • Automate feature cleaning, encoding, and scaling inside Pipelines.
  • Ensure the exact same transforms run during training, validation, and inference.
  • Package domain logic so teammates can reuse it safely.
  • Measure the impact of a transform with cross-validation and model-registry workflows.

Who this is for

  • Machine Learning Engineers who need reproducible preprocessing in pipelines.
  • Data Scientists turning notebooks into deployable pipelines.
  • MLOps engineers standardizing data flows across training and serving.

Prerequisites

  • Comfortable with Python and NumPy.
  • Basic experience with at least one ML framework (scikit-learn, PyTorch, or TensorFlow).
  • Know how to train a simple model and evaluate it.

Concept explained simply

A transformer is a small, reusable component that takes in data and returns transformed data. In scikit-learn, it implements fit (optional) and transform; in PyTorch, it is a callable object used in transforms.Compose; in TensorFlow, it is typically a function used with tf.data or a custom Keras Layer.

Deep dive: stateful vs stateless
  • Stateless: transform does not depend on the dataset (e.g., log1p, clipping). fit just returns self.
  • Stateful: transform needs parameters learned from data (e.g., column means, a vocabulary). fit stores the state; transform uses it (see the sketch below).
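
A minimal sketch of the stateful case: suppose we want to center each column by its mean learned from the training data. ColumnMeanCenter is an illustrative name, following the scikit-learn API used in the examples below.

from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

class ColumnMeanCenter(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.means_ = X.mean(axis=0)  # learned state, stored during fit
        return self
    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return X - self.means_        # reuses the state learned in fit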

Mental model

Think of your pipeline as a conveyor belt. Each transformer is a station on the belt. Every station must accept the same number of items coming in as going out (same number of rows) and should clearly label what it produces (feature names or shape). If a station needs to learn something (like the average), it learns during fit and reuses it later without peeking at labels or test data.
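
For example, fitting the sketch above on training data only and then reusing the learned state on unseen data:

X_train = np.array([[1.0, 10.0], [3.0, 30.0]])
X_new = np.array([[2.0, 20.0]])

center = ColumnMeanCenter().fit(X_train)  # learns column means from training data only
print(center.transform(X_new))            # applies those same means to new data: [[0. 0.]]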

Design checklist for production-grade transformers

  • API: For scikit-learn, implement __init__ (with only simple attributes), fit(X, y=None), transform(X), and optionally get_feature_names_out.
  • Shapes: Preserve number of rows; document the number of output columns.
  • Types: Accept NumPy arrays and pandas DataFrames where reasonable; return NumPy arrays unless integrating with pandas explicitly.
  • Deterministic: Avoid global random state; accept random_state and seed local generators.
  • Vectorized: Prefer vectorized NumPy/TF operations over Python loops.
  • Serializable: scikit-learn objects should be pickle/joblib safe; store only arrays and scalars.
  • Performance: Avoid unnecessary copies; use in-place operations when safe; batch operations in tf.data and PyTorch.
  • Naming: Provide get_feature_names_out when the number of features matches input; otherwise document outputs clearly.
  • Validation: Raise informative errors on invalid inputs (NaNs, wrong dtypes). A skeleton covering most of these points is sketched below.
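
A skeleton that touches most of these points, as a sketch rather than a reference implementation. ClipToRange is an illustrative transformer that clips values to a fixed range; it is deterministic, so random_state handling is omitted. check_array and check_is_fitted are scikit-learn's validation helpers.

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted
import numpy as np

class ClipToRange(BaseEstimator, TransformerMixin):
    def __init__(self, lower=0.0, upper=1.0):
        # __init__ only stores simple attributes, unchanged (keeps cloning and grid search safe)
        self.lower = lower
        self.upper = upper
    def fit(self, X, y=None):
        X = check_array(X)  # validation: rejects NaNs and non-numeric dtypes with clear errors
        self.n_features_in_ = X.shape[1]
        return self
    def transform(self, X):
        check_is_fitted(self, "n_features_in_")
        X = check_array(X)
        return np.clip(X, self.lower, self.upper)  # vectorized; row count preserved
    def get_feature_names_out(self, input_features=None):
        if input_features is None:
            input_features = [f"x{i}" for i in range(self.n_features_in_)]
        return np.array([f"clip_{c}" for c in input_features])

Because it stores only floats and an integer, the object is pickle/joblib safe and slots directly into a Pipeline or ColumnTransformer.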

Worked examples

Example 1 — scikit-learn: Custom Log1p transformer in a ColumnTransformer

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd

class Log1p(BaseEstimator, TransformerMixin):
    def __init__(self, eps=1e-6):
        self.eps = eps
    def fit(self, X, y=None):
        X = np.asarray(X)
        self.n_features_in_ = X.shape[1] if X.ndim == 2 else 1
        return self
    def transform(self, X):
        X = np.asarray(X, dtype=float)
        X = np.clip(X, self.eps, None)
        return np.log1p(X)
    def get_feature_names_out(self, input_features=None):
        if input_features is None:
            return np.array([f"log1p_{i}" for i in range(getattr(self, 'n_features_in_', 0))])
        return np.array([f"log1p_{c}" for c in input_features])

# Sample data
X = pd.DataFrame({
    'price': [100.0, 200.0, 0.0, 50.0],
    'area': [30.0, 60.0, 45.0, 10.0],
    'city': ['A', 'B', 'A', 'C']
})
y = np.array([1, 0, 1, 0])

num_features = ['price', 'area']
cat_features = ['city']

num_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('log', Log1p()),
    ('scale', StandardScaler())
])

cat_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

pre = ColumnTransformer([
    ('num', num_pipe, num_features),
    ('cat', cat_pipe, cat_features)
])

Z = pre.fit_transform(X, y)
print(Z.shape)  # rows preserved, columns = 2 (num) + 3 (cat) = 5

What to notice
  • Log1p is stateless; fit returns self but sets n_features_in_ for names.
  • It works inside a Pipeline and ColumnTransformer.
  • Rows are preserved and output is well-defined.

Example 2 — PyTorch: Custom transform in a Compose

import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms

class ToFloatTensor:
    def __call__(self, sample):
        return torch.as_tensor(sample, dtype=torch.float32)

class Standardize:
    def __init__(self, mean, std, eps=1e-6):
        self.mean = torch.tensor(mean)
        self.std = torch.tensor(std)
        self.eps = eps
    def __call__(self, x):
        return (x - self.mean) / (self.std + self.eps)

class ToyDataset(Dataset):
    def __init__(self, X, transform=None):
        self.X = X
        self.transform = transform
    def __len__(self):
        return len(self.X)
    def __getitem__(self, idx):
        x = self.X[idx]
        if self.transform:
            x = self.transform(x)
        return x

X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
transform = transforms.Compose([
    ToFloatTensor(),
    Standardize(mean=[2.0, 20.0], std=[1.0, 10.0])
])

ds = ToyDataset(X, transform=transform)
loader = DataLoader(ds, batch_size=2)

for batch in loader:
    print(batch.shape)  # torch.Size([2, 2])

What to notice
  • Transforms are callables chained left to right.
  • They should be fast, pure functions; keep randomness controlled (one approach is sketched below).
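
One way to keep randomness controlled is to give a stochastic transform its own torch.Generator instead of relying on the global seed. AddGaussianNoise is an illustrative name, not a torchvision class.

import torch

class AddGaussianNoise:
    def __init__(self, std=0.1, seed=0):
        self.std = std
        self.gen = torch.Generator().manual_seed(seed)  # local generator, not the global RNG
    def __call__(self, x):
        noise = torch.randn(x.shape, generator=self.gen) * self.std
        return x + noise

It can be appended to the transforms.Compose chain above; with a fixed seed the noisy samples are reproducible from run to run in a single-process DataLoader.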

Example 3 — TensorFlow: Custom tf.data map

import tensorflow as tf

X = tf.constant([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]], dtype=tf.float32)
mean = tf.constant([2.0, 20.0], dtype=tf.float32)
std = tf.constant([1.0, 10.0], dtype=tf.float32)

def standardize(x):
    eps = tf.constant(1e-6, dtype=tf.float32)
    return (x - mean) / (std + eps)

@tf.function
def pipeline(batch):
    return standardize(batch)

ds = tf.data.Dataset.from_tensor_slices(X).batch(2)
ds = ds.map(pipeline, num_parallel_calls=tf.data.AUTOTUNE).prefetch(tf.data.AUTOTUNE)

for b in ds:
    tf.print(tf.shape(b))  # [2, 2]

What to notice
  • Use tf.function for graph execution.
  • Map is parallelizable; add cache/prefetch for performance.

Exercises

Complete these hands-on tasks. They mirror the graded exercises below and help you prepare for the quick test.

  • Exercise ex1: Build a stateful scikit-learn transformer and integrate it in a Pipeline.
  • Exercise ex2: Build a PyTorch transform chain and verify batch shapes.

Checklist before you run
  • Input and output row counts match.
  • No use of global random state.
  • Works with simple synthetic data first.

Common mistakes and self-checks

  • Changing row counts. Self-check: compare len(X) before/after transform.
  • Leaking target info in fit. Self-check: ensure fit ignores y or uses it only when intended (rare for transformers).
  • Using Python loops instead of vectorization. Self-check: time a small benchmark; prefer NumPy/TF ops.
  • Not handling dtypes or NaNs. Self-check: add explicit dtype casts and clear errors for invalid inputs.
  • Non-serializable attributes. Self-check: store only arrays, numbers, and strings; avoid open file handles or lambdas in attributes. A small test sketch covering several of these checks follows below.
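
Several of these self-checks can be written as quick tests. A minimal sketch against the Log1p transformer from Example 1, using plain asserts; the same checks translate directly to pytest.

import pickle
import numpy as np

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

t = Log1p().fit(X)
Z = t.transform(X)

assert Z.shape[0] == X.shape[0]               # row count preserved
assert Z.dtype == np.float64                  # explicit float dtype
restored = pickle.loads(pickle.dumps(t))      # serialization round-trip
assert np.allclose(Z, restored.transform(X))  # deterministic after reload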

Practical projects

  • Numeric features kit: Implement and package a set of transformers (Winsorize, Log1p, RobustScale) and benchmark them on a regression dataset.
    • Acceptance: Each transformer passes unit tests for shapes, NaN handling, and serialization.
  • Text preprocessing pipeline: Custom tokenizer transformer + character filters integrated before TF-IDF.
    • Acceptance: Cross-validation shows measurable improvement vs baseline tokenizer.
  • Image augmentation bundle (PyTorch or TF): Compose deterministic and stochastic transforms with a seed.
    • Acceptance: Reproducible augmentations given a fixed seed; training runs are comparable.

Learning path

  1. Rebuild the examples above from scratch and run them.
  2. Create one stateless and one stateful transformer for scikit-learn.
  3. Add serialization and unit tests (shape, dtype, determinism).
  4. Integrate into an end-to-end pipeline and evaluate with cross-validation.
  5. Port the idea to PyTorch (transforms.Compose) or TensorFlow (tf.data).

Next steps

  • Refactor your current ML project to move preprocessing into a Pipeline or tf.data.
  • Add logging around transformers (input/output shapes, timing); a lightweight wrapper is sketched below.
  • Create a small internal library of reusable transformers.
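
For the logging idea above, one lightweight approach is a wrapper that logs shapes and timing around any inner transformer. LoggingTransformer is an illustrative name, not a scikit-learn class.

import logging
import time
from sklearn.base import BaseEstimator, TransformerMixin

logger = logging.getLogger(__name__)

class LoggingTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, transformer):
        self.transformer = transformer  # any fitted-or-fittable transformer
    def fit(self, X, y=None):
        self.transformer.fit(X, y)
        return self
    def transform(self, X):
        start = time.perf_counter()
        Z = self.transformer.transform(X)
        logger.info("%s: %s -> %s in %.4fs",
                    type(self.transformer).__name__,
                    getattr(X, "shape", len(X)), Z.shape,
                    time.perf_counter() - start)
        return Z

It can wrap any step in a Pipeline, e.g. ('log_scale', LoggingTransformer(StandardScaler())); configure logging (for example with logging.basicConfig(level=logging.INFO)) to see the messages.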

Quick Test

Take the quick test to check your understanding. Available to everyone; only logged-in users get saved progress.

Mini challenge

Implement a scikit-learn transformer that caps outliers using the 5th and 95th percentiles computed on training data, then logs and scales the result in a Pipeline. Verify cross-validation works and the transformer is serializable.

Practice Exercises

2 exercises to complete

Instructions

  1. Implement a PercentileCapper transformer with parameters lower=5, upper=95.
  2. fit(X) computes per-column lower/upper percentiles and stores them.
  3. transform(X) caps values to [lower_p, upper_p] and returns a NumPy array.
  4. Integrate it into a Pipeline with SimpleImputer(strategy='median') and StandardScaler.
  5. Run cross_val_score with a LogisticRegression on synthetic numeric data; print mean score.

Expected Output
Printed cross-validation mean score and a transformed array with the same number of rows as input.

Integrating Custom Transformers — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

