Why this matters
Custom transformers let you put your domain logic directly into ML pipelines. That means cleaner training code, consistent preprocessing in production, and easier experimentation.
- Automate feature cleaning, encoding, and scaling inside Pipelines.
- Ensure the exact same transforms run during training, validation, and inference.
- Package domain logic so teammates can reuse it safely.
- Measure the impact of a transform with cross-validation and model registry workflows.
Who this is for
- Machine Learning Engineers who need reproducible preprocessing in pipelines.
- Data Scientists turning notebooks into deployable pipelines.
- MLOps engineers standardizing data flows across training and serving.
Prerequisites
- Comfortable with Python and NumPy.
- Basic experience with at least one ML framework (scikit-learn, PyTorch, or TensorFlow).
- Know how to train a simple model and evaluate it.
Concept explained simply
A transformer is a small, reusable component that takes in data and returns transformed data. In scikit-learn, it implements fit (which may simply return self) and transform; in PyTorch, it is a callable object used in transforms.Compose; in TensorFlow, it is typically a function used with tf.data or a custom Keras Layer.
Deep dive: stateful vs stateless
- Stateless: transform does not depend on the dataset (e.g., log1p, clipping). fit just returns self.
- Stateful: transform needs learned parameters from data (e.g., computing means, vocabulary). fit stores state, transform uses it.
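For instance, a minimal stateful scikit-learn transformer can learn per-column means during fit and reuse them in transform. The class name MeanCenter below is illustrative, not a library class:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MeanCenter(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.means_ = X.mean(axis=0)  # learned state; the trailing underscore marks a fitted attribute
        return self
    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return X - self.means_  # reuses the state learned in fit, even on new data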
Mental model
Think of your pipeline as a conveyor belt. Each transformer is a station on the belt. Every station must send out the same number of items as it receives (same number of rows) and should clearly label what it produces (feature names or shape). If a station needs to learn something (like the average), it learns during fit and reuses it later without peeking at labels or test data.
Design checklist for production-grade transformers
- API: For scikit-learn, implement __init__ (which only stores simple parameters), fit(X, y=None), transform(X), and optionally get_feature_names_out.
- Shapes: Preserve number of rows; document the number of output columns.
- Types: Accept NumPy arrays and pandas DataFrames where reasonable; return NumPy arrays unless integrating with pandas explicitly.
- Deterministic: Avoid global random state; accept random_state and seed local generators.
- Vectorized: Prefer vectorized NumPy/TF operations over Python loops.
- Serializable: scikit-learn objects should be pickle/joblib safe; store only arrays and scalars.
- Performance: Avoid unnecessary copies; use in-place operations when safe; batch operations in tf.data and PyTorch.
- Naming: Provide get_feature_names_out whenever output feature names can be derived from the inputs; otherwise document the outputs clearly.
- Validation: Raise informative errors on invalid inputs (NaNs, wrong dtypes). A minimal skeleton that follows this checklist is sketched after the list.
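Putting the checklist together, a skeleton might look like the sketch below; the class name MyTransformer, the strength parameter, and the placeholder transform logic are illustrative, not part of any library.

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MyTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, strength=1.0, random_state=None):
        # __init__ only stores parameters; no validation or computation here
        self.strength = strength
        self.random_state = random_state

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        if np.isnan(X).any():
            raise ValueError("MyTransformer does not accept NaNs; impute first.")
        self.n_features_in_ = X.shape[1]
        self.rng_ = np.random.default_rng(self.random_state)  # local, seeded generator
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        if X.shape[1] != self.n_features_in_:
            raise ValueError("Unexpected number of input columns.")
        return X * self.strength  # placeholder logic; row count is preserved

    def get_feature_names_out(self, input_features=None):
        if input_features is None:
            input_features = [f"x{i}" for i in range(self.n_features_in_)]
        return np.asarray([f"mytf_{c}" for c in input_features])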
Worked examples
Example 1 — scikit-learn: Custom Log1p transformer in a ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd
class Log1p(BaseEstimator, TransformerMixin):
    """Stateless transformer: clips values below eps, then applies log1p."""
    def __init__(self, eps=1e-6):
        self.eps = eps

    def fit(self, X, y=None):
        # Stateless: fit only records the input width so feature names can be generated.
        X = np.asarray(X)
        self.n_features_in_ = X.shape[1] if X.ndim == 2 else 1
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        X = np.clip(X, self.eps, None)  # clip values below eps so log1p never sees negatives
        return np.log1p(X)

    def get_feature_names_out(self, input_features=None):
        if input_features is None:
            return np.array([f"log1p_{i}" for i in range(getattr(self, 'n_features_in_', 0))])
        return np.array([f"log1p_{c}" for c in input_features])

# Sample data
X = pd.DataFrame({
    'price': [100.0, 200.0, 0.0, 50.0],
    'area': [30.0, 60.0, 45.0, 10.0],
    'city': ['A', 'B', 'A', 'C']
})
y = np.array([1, 0, 1, 0])

num_features = ['price', 'area']
cat_features = ['city']

num_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('log', Log1p()),
    ('scale', StandardScaler())
])

cat_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

pre = ColumnTransformer([
    ('num', num_pipe, num_features),
    ('cat', cat_pipe, cat_features)
])

Z = pre.fit_transform(X, y)
print(Z.shape)  # rows preserved, columns = 2 (num) + 3 (cat) = 5
What to notice
- Log1p is stateless; fit returns self but sets n_features_in_ for names.
- It works inside a Pipeline and ColumnTransformer.
- Rows are preserved and output is well-defined.
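Because every step in the numeric and categorical pipelines (including Log1p) implements get_feature_names_out, the fitted ColumnTransformer can also report its output names; the exact prefixes depend on your scikit-learn version:

print(pre.get_feature_names_out())
# e.g. names like 'num__log1p_price', 'num__log1p_area', 'cat__city_A', ...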
Example 2 — PyTorch: Custom transform in a Compose
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
class ToFloatTensor:
    def __call__(self, sample):
        return torch.as_tensor(sample, dtype=torch.float32)

class Standardize:
    def __init__(self, mean, std, eps=1e-6):
        self.mean = torch.tensor(mean)
        self.std = torch.tensor(std)
        self.eps = eps
    def __call__(self, x):
        return (x - self.mean) / (self.std + self.eps)

class ToyDataset(Dataset):
    def __init__(self, X, transform=None):
        self.X = X
        self.transform = transform
    def __len__(self):
        return len(self.X)
    def __getitem__(self, idx):
        x = self.X[idx]
        if self.transform:
            x = self.transform(x)
        return x

X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
transform = transforms.Compose([
    ToFloatTensor(),
    Standardize(mean=[2.0, 20.0], std=[1.0, 10.0])
])

ds = ToyDataset(X, transform=transform)
loader = DataLoader(ds, batch_size=2)
for batch in loader:
    print(batch.shape)  # torch.Size([2, 2])
What to notice
- Transforms are callables chained left-to-right.
- They should be fast, pure functions; keep randomness controlled.
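If a transform needs randomness (e.g., augmentation), one way to keep it controlled is to give the transform its own seeded generator instead of relying on global state; AddGaussianNoise below is a sketch, not a torchvision class:

import torch

class AddGaussianNoise:
    def __init__(self, std=0.1, seed=0):
        self.std = std
        self.generator = torch.Generator().manual_seed(seed)  # local RNG, no global state
    def __call__(self, x):
        noise = torch.randn(x.shape, generator=self.generator) * self.std
        return x + noise

Note that with multi-process data loading (num_workers > 0), each worker gets its own copy of the generator, so seed per worker (e.g., via worker_init_fn) if exact reproducibility across workers matters.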
Example 3 — TensorFlow: Custom tf.data map
import tensorflow as tf
X = tf.constant([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]], dtype=tf.float32)
mean = tf.constant([2.0, 20.0], dtype=tf.float32)
std = tf.constant([1.0, 10.0], dtype=tf.float32)
def standardize(x):
    eps = tf.constant(1e-6, dtype=tf.float32)
    return (x - mean) / (std + eps)

@tf.function
def pipeline(batch):
    return standardize(batch)

ds = tf.data.Dataset.from_tensor_slices(X).batch(2)
ds = ds.map(pipeline, num_parallel_calls=tf.data.AUTOTUNE).prefetch(tf.data.AUTOTUNE)

for b in ds:
    tf.print(tf.shape(b))  # [2, 2]
What to notice
- Use tf.function for graph execution.
- Map is parallelizable via num_parallel_calls; add cache and prefetch for performance (one possible ordering is sketched below).
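If the mapped data fits in memory, one possible ordering (assuming the deterministic standardize above) puts cache between map and prefetch so mapped batches are reused across epochs:

ds = (tf.data.Dataset.from_tensor_slices(X)
      .batch(2)
      .map(pipeline, num_parallel_calls=tf.data.AUTOTUNE)
      .cache()        # reuse mapped batches across epochs
      .prefetch(tf.data.AUTOTUNE))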
Exercises
Complete these hands-on tasks. They mirror the graded exercises below and help you prepare for the quick test.
- Exercise ex1: Build a stateful scikit-learn transformer and integrate it in a Pipeline.
- Exercise ex2: Build a PyTorch transform chain and verify batch shapes.
Checklist before you run
- Input and output row counts match.
- No use of global random state.
- Works with simple synthetic data first.
Common mistakes and self-checks
- Changing row counts. Self-check: compare len(X) before/after transform.
- Leaking target info in fit. Self-check: ensure fit ignores y or uses it only when intended (rare for transformers).
- Using Python loops instead of vectorization. Self-check: time a small benchmark; prefer NumPy/TF ops.
- Not handling dtypes or NaNs. Self-check: add explicit dtype casts and clear errors for invalid inputs.
- Non-serializable attributes. Self-check: store only arrays, numbers, and strings; avoid open file handles or lambdas in attributes. A smoke-test sketch follows this list.
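A quick way to run several of these self-checks at once is a small smoke test that round-trips the transformer through joblib; this sketch reuses the Log1p class from Example 1, and the helper name smoke_test is illustrative:

import io
import joblib
import numpy as np

def smoke_test(transformer, X):
    Xt = transformer.fit_transform(X)
    assert len(Xt) == len(X), "row count changed"  # rows preserved
    buf = io.BytesIO()
    joblib.dump(transformer, buf)                  # serializable?
    buf.seek(0)
    restored = joblib.load(buf)
    assert np.allclose(restored.transform(X), Xt), "output changed after reload"

smoke_test(Log1p(), np.array([[1.0, 10.0], [2.0, 20.0]]))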
Practical projects
- Numeric features kit: Implement and package a set of transformers (Winsorize, Log1p, RobustScale) and benchmark them on a regression dataset.
  - Acceptance: Each transformer passes unit tests for shapes, NaN handling, and serialization.
- Text preprocessing pipeline: Custom tokenizer transformer + character filters integrated before TF-IDF.
  - Acceptance: Cross-validation shows measurable improvement vs baseline tokenizer.
- Image augmentation bundle (PyTorch or TF): Compose deterministic and stochastic transforms with a seed.
  - Acceptance: Reproducible augmentations given a fixed seed; training runs are comparable.
Learning path
- Rebuild the examples above from scratch and run them.
- Create one stateless and one stateful transformer for scikit-learn.
- Add serialization and unit tests (shape, dtype, determinism).
- Integrate into an end-to-end pipeline and evaluate with cross-validation.
- Port the idea to PyTorch (transforms.Compose) or TensorFlow (tf.data).
Next steps
- Refactor your current ML project to move preprocessing into a Pipeline or tf.data.
- Add logging around transformers (input/output shapes, timing); a wrapper sketch follows this list.
- Create a small internal library of reusable transformers.
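One lightweight way to add such logging is a wrapper transformer that records shapes and timing around the step it wraps; LoggedStep is an illustrative name, not a scikit-learn utility:

import time
import logging
from sklearn.base import BaseEstimator, TransformerMixin

logger = logging.getLogger(__name__)

class LoggedStep(BaseEstimator, TransformerMixin):
    def __init__(self, step, name="step"):
        self.step = step
        self.name = name
    def fit(self, X, y=None):
        self.step.fit(X, y)
        return self
    def transform(self, X):
        start = time.perf_counter()
        Xt = self.step.transform(X)
        logger.info("%s: %s -> %s in %.4fs", self.name,
                    getattr(X, "shape", len(X)), getattr(Xt, "shape", len(Xt)),
                    time.perf_counter() - start)
        return Xt

It wraps any existing step, e.g. Pipeline([('log', LoggedStep(Log1p(), name='log1p')), ...]).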
Quick Test
Take the quick test to check your understanding.
Mini challenge
Implement a scikit-learn transformer that caps outliers using the 5th and 95th percentiles computed on training data, then logs and scales the result in a Pipeline. Verify cross-validation works and the transformer is serializable.