Why this matters
Custom transformers let you put your domain logic directly into ML pipelines. That means cleaner training code, consistent preprocessing in production, and easier experimentation.
- Automate feature cleaning, encoding, and scaling inside Pipelines.
- Ensure the exact same transforms run during training, validation, and inference.
- Package domain logic so teammates can reuse it safely.
- Measure the impact of a transform with cross-validation and model registry workflows.
Who this is for
- Machine Learning Engineers who need reproducible preprocessing in pipelines.
- Data Scientists turning notebooks into deployable pipelines.
- MLOps engineers standardizing data flows across training and serving.
Prerequisites
- Comfortable with Python and NumPy.
- Basic experience with at least one ML framework (scikit-learn, PyTorch, or TensorFlow).
- Know how to train a simple model and evaluate it.
Concept explained simply
A transformer is a small, reusable component that takes in data and returns transformed data. In scikit-learn, it implements fit (which may simply return self) and transform; in PyTorch, it is a callable object used in transforms.Compose; in TensorFlow, it is typically a function used with tf.data or a custom Keras Layer.
Deep dive: stateful vs stateless
- Stateless: transform does not depend on the dataset (e.g., log1p, clipping). fit just returns self.
- Stateful: transform needs learned parameters from data (e.g., computing means, vocabulary). fit stores state, transform uses it.
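For instance, a minimal stateful scikit-learn transformer can learn per-column means during fit and reuse them in transform. The class name MeanCenter below is illustrative, not a library class:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MeanCenter(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.means_ = X.mean(axis=0)  # learned state; the trailing underscore marks a fitted attribute
        return self
    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return X - self.means_  # reuses the state learned in fit, even on new data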
Mental model
Think of your pipeline as a conveyor belt. Each transformer is a station on the belt. Every station must send out the same number of items as it receives (same number of rows) and should clearly label what it produces (feature names or shape). If a station needs to learn something (like the average), it learns during fit and reuses it later without peeking at labels or test data.
Design checklist for production-grade transformers
- API: For scikit-learn, implement __init__ (which only stores simple parameters), fit(X, y=None), transform(X), and optionally get_feature_names_out.
- Shapes: Preserve number of rows; document the number of output columns.
- Types: Accept NumPy arrays and pandas DataFrames where reasonable; return NumPy arrays unless integrating with pandas explicitly.
- Deterministic: Avoid global random state; accept random_state and seed local generators.
- Vectorized: Prefer vectorized NumPy/TF operations over Python loops.
- Serializable: scikit-learn objects should be pickle/joblib safe; store only arrays and scalars.
- Performance: Avoid unnecessary copies; use in-place operations when safe; batch operations in tf.data and PyTorch.
- Naming: Provide get_feature_names_out whenever output feature names can be derived from the inputs; otherwise document the outputs clearly.
- Validation: Raise informative errors on invalid inputs (NaNs, wrong dtypes). A minimal skeleton that follows this checklist is sketched after the list.
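Putting the checklist together, a skeleton might look like the sketch below; the class name MyTransformer, the strength parameter, and the placeholder transform logic are illustrative, not part of any library.

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MyTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, strength=1.0, random_state=None):
        # __init__ only stores parameters; no validation or computation here
        self.strength = strength
        self.random_state = random_state

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        if np.isnan(X).any():
            raise ValueError("MyTransformer does not accept NaNs; impute first.")
        self.n_features_in_ = X.shape[1]
        self.rng_ = np.random.default_rng(self.random_state)  # local, seeded generator
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        if X.shape[1] != self.n_features_in_:
            raise ValueError("Unexpected number of input columns.")
        return X * self.strength  # placeholder logic; row count is preserved

    def get_feature_names_out(self, input_features=None):
        if input_features is None:
            input_features = [f"x{i}" for i in range(self.n_features_in_)]
        return np.asarray([f"mytf_{c}" for c in input_features])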
Worked examples
Example 1 — scikit-learn: Custom Log1p transformer in a ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd
class Log1p(BaseEstimator, TransformerMixin):
    """Stateless transformer: clips values below eps, then applies log1p."""
    def __init__(self, eps=1e-6):
        self.eps = eps

    def fit(self, X, y=None):
        # Stateless: fit only records the input width so feature names can be generated.
        X = np.asarray(X)
        self.n_features_in_ = X.shape[1] if X.ndim == 2 else 1
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        X = np.clip(X, self.eps, None)  # clip values below eps so log1p never sees negatives
        return np.log1p(X)

    def get_feature_names_out(self, input_features=None):
        if input_features is None:
            return np.array([f"log1p_{i}" for i in range(getattr(self, 'n_features_in_', 0))])
        return np.array([f"log1p_{c}" for c in input_features])

# Sample data
X = pd.DataFrame({
    'price': [100.0, 200.0, 0.0, 50.0],
    'area': [30.0, 60.0, 45.0, 10.0],
    'city': ['A', 'B', 'A', 'C']
})
y = np.array([1, 0, 1, 0])

num_features = ['price', 'area']
cat_features = ['city']

num_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('log', Log1p()),
    ('scale', StandardScaler())
])

cat_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

pre = ColumnTransformer([
    ('num', num_pipe, num_features),
    ('cat', cat_pipe, cat_features)
])

Z = pre.fit_transform(X, y)
print(Z.shape)  # rows preserved, columns = 2 (num) + 3 (cat) = 5
What to notice
- Log1p is stateless; fit returns self but sets n_features_in_ for names.
- It works inside a Pipeline and ColumnTransformer.
- Rows are preserved and output is well-defined.
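Because every step in the numeric and categorical pipelines (including Log1p) implements get_feature_names_out, the fitted ColumnTransformer can also report its output names; the exact prefixes depend on your scikit-learn version:

print(pre.get_feature_names_out())
# e.g. names like 'num__log1p_price', 'num__log1p_area', 'cat__city_A', ...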
Example 2 — PyTorch: Custom transform in a Compose
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
class ToFloatTensor:
    def __call__(self, sample):
        return torch.as_tensor(sample, dtype=torch.float32)

class Standardize:
    def __init__(self, mean, std, eps=1e-6):
        self.mean = torch.tensor(mean)
        self.std = torch.tensor(std)
        self.eps = eps
    def __call__(self, x):
        return (x - self.mean) / (self.std + self.eps)

class ToyDataset(Dataset):
    def __init__(self, X, transform=None):
        self.X = X
        self.transform = transform
    def __len__(self):
        return len(self.X)
    def __getitem__(self, idx):
        x = self.X[idx]
        if self.transform:
            x = self.transform(x)
        return x

X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
transform = transforms.Compose([
    ToFloatTensor(),
    Standardize(mean=[2.0, 20.0], std=[1.0, 10.0])
])

ds = ToyDataset(X, transform=transform)
loader = DataLoader(ds, batch_size=2)
for batch in loader:
    print(batch.shape)  # torch.Size([2, 2])
What to notice
- Transforms are callables chained left-to-right.
- They should be fast, pure functions; keep randomness controlled.
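If a transform needs randomness (e.g., augmentation), one way to keep it controlled is to give the transform its own seeded generator instead of relying on global state; AddGaussianNoise below is a sketch, not a torchvision class:

import torch

class AddGaussianNoise:
    def __init__(self, std=0.1, seed=0):
        self.std = std
        self.generator = torch.Generator().manual_seed(seed)  # local RNG, no global state
    def __call__(self, x):
        noise = torch.randn(x.shape, generator=self.generator) * self.std
        return x + noise

Note that with multi-process data loading (num_workers > 0), each worker gets its own copy of the generator, so seed per worker (e.g., via worker_init_fn) if exact reproducibility across workers matters.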
Example 3 — TensorFlow: Custom tf.data map
import tensorflow as tf
X = tf.constant([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]], dtype=tf.float32)
mean = tf.constant([2.0, 20.0], dtype=tf.float32)
std = tf.constant([1.0, 10.0], dtype=tf.float32)
def standardize(x):
    eps = tf.constant(1e-6, dtype=tf.float32)
    return (x - mean) / (std + eps)

@tf.function
def pipeline(batch):
    return standardize(batch)

ds = tf.data.Dataset.from_tensor_slices(X).batch(2)
ds = ds.map(pipeline, num_parallel_calls=tf.data.AUTOTUNE).prefetch(tf.data.AUTOTUNE)

for b in ds:
    tf.print(tf.shape(b))  # [2, 2]
What to notice
- Use tf.function for graph execution.
- Map is parallelizable via num_parallel_calls; add cache and prefetch for performance (one possible ordering is sketched below).
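If the mapped data fits in memory, one possible ordering (assuming the deterministic standardize above) puts cache between map and prefetch so mapped batches are reused across epochs:

ds = (tf.data.Dataset.from_tensor_slices(X)
      .batch(2)
      .map(pipeline, num_parallel_calls=tf.data.AUTOTUNE)
      .cache()        # reuse mapped batches across epochs
      .prefetch(tf.data.AUTOTUNE))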
Exercises
Complete these hands-on tasks. They mirror the graded exercises below and help you prepare for the quick test.
- Exercise ex1: Build a stateful scikit-learn transformer and integrate it in a Pipeline.
- Exercise ex2: Build a PyTorch transform chain and verify batch shapes.
Checklist before you run
- Input and output row counts match.
- No use of global random state.
- Works with simple synthetic data first.
Common mistakes and self-checks
- Changing row counts. Self-check: compare len(X) before/after transform.
- Leaking target info in fit. Self-check: ensure fit ignores y or uses it only when intended (rare for transformers).
- Using Python loops instead of vectorization. Self-check: time a small benchmark; prefer NumPy/TF ops.
- Not handling dtypes or NaNs. Self-check: add explicit dtype casts and clear errors for invalid inputs.
- Non-serializable attributes. Self-check: store only arrays, numbers, and strings; avoid open file handles or lambdas in attributes. A smoke-test sketch follows this list.
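A quick way to run several of these self-checks at once is a small smoke test that round-trips the transformer through joblib; this sketch reuses the Log1p class from Example 1, and the helper name smoke_test is illustrative:

import io
import joblib
import numpy as np

def smoke_test(transformer, X):
    Xt = transformer.fit_transform(X)
    assert len(Xt) == len(X), "row count changed"  # rows preserved
    buf = io.BytesIO()
    joblib.dump(transformer, buf)                  # serializable?
    buf.seek(0)
    restored = joblib.load(buf)
    assert np.allclose(restored.transform(X), Xt), "output changed after reload"

smoke_test(Log1p(), np.array([[1.0, 10.0], [2.0, 20.0]]))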
Practical projects
- Numeric features kit: Implement and package a set of transformers (Winsorize, Log1p, RobustScale) and benchmark them on a regression dataset.
  - Acceptance: Each transformer passes unit tests for shapes, NaN handling, and serialization.
- Text preprocessing pipeline: Custom tokenizer transformer + character filters integrated before TF-IDF.
  - Acceptance: Cross-validation shows measurable improvement vs baseline tokenizer.
- Image augmentation bundle (PyTorch or TF): Compose deterministic and stochastic transforms with a seed.
  - Acceptance: Reproducible augmentations given a fixed seed; training runs are comparable.
Learning path
- Rebuild the examples above from scratch and run them.
- Create one stateless and one stateful transformer for scikit-learn.
- Add serialization and unit tests (shape, dtype, determinism).
- Integrate into an end-to-end pipeline and evaluate with cross-validation.
- Port the idea to PyTorch (transforms.Compose) or TensorFlow (tf.data).
Next steps
- Refactor your current ML project to move preprocessing into a Pipeline or tf.data.
- Add logging around transformers (input/output shapes, timing); a wrapper sketch follows this list.
- Create a small internal library of reusable transformers.
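One lightweight way to add such logging is a wrapper transformer that records shapes and timing around the step it wraps; LoggedStep is an illustrative name, not a scikit-learn utility:

import time
import logging
from sklearn.base import BaseEstimator, TransformerMixin

logger = logging.getLogger(__name__)

class LoggedStep(BaseEstimator, TransformerMixin):
    def __init__(self, step, name="step"):
        self.step = step
        self.name = name
    def fit(self, X, y=None):
        self.step.fit(X, y)
        return self
    def transform(self, X):
        start = time.perf_counter()
        Xt = self.step.transform(X)
        logger.info("%s: %s -> %s in %.4fs", self.name,
                    getattr(X, "shape", len(X)), getattr(Xt, "shape", len(Xt)),
                    time.perf_counter() - start)
        return Xt

It wraps any existing step, e.g. Pipeline([('log', LoggedStep(Log1p(), name='log1p')), ...]).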
Quick Test
Take the quick test to check your understanding.
Mini challenge
Implement a scikit-learn transformer that caps outliers using the 5th and 95th percentiles computed on training data, then logs and scales the result in a Pipeline. Verify cross-validation works and the transformer is serializable.