Why ML Frameworks matter for Machine Learning Engineers
ML frameworks turn ideas into reliable, production-ready systems. As a Machine Learning Engineer, you use frameworks like scikit-learn, PyTorch, and TensorFlow to build reproducible pipelines, tune models, deploy artifacts, and explain predictions. Mastering them means you can move fast, avoid data leakage, and deliver models that are measurable, maintainable, and ready for real users.
- Ship end-to-end pipelines with consistent preprocessing and inference.
- Run cross-validation at scale and compare models fairly.
- Handle imbalanced data and tune thresholds for real metrics.
- Persist artifacts for deployment and governance.
- Offer feature importance and basic explainability to stakeholders.
Practical roadmap
- Build a simple sklearn Pipeline with preprocessing + a baseline model. Keep it end-to-end.
- Add ColumnTransformer to handle numeric, categorical, and date features correctly.
- Introduce cross-validation and pick metrics that reflect business goals (e.g., ROC AUC, PR AUC, F1).
- Tune hyperparameters with GridSearchCV or RandomizedSearchCV; refit on the best configuration according to your chosen metric.
- Address class imbalance with class weights, resampling, and threshold tuning.
- Persist artifacts (pipeline + metadata) with joblib. Track versions.
- Explain predictions via coefficients/permutation importance and sanity checks.
- Integrate custom transformers to encapsulate domain logic.
- Reproducibility: set seeds, fix CV splits, log parameters, and capture environment info.
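To make the last roadmap item concrete, a minimal reproducibility sketch: fix seeds, pin the CV splitter, and capture environment info alongside each run (exactly which fields you record is up to you).
import json, platform, random
import numpy as np
import sklearn
from sklearn.model_selection import StratifiedKFold
SEED = 42
random.seed(SEED)      # Python's built-in RNG
np.random.seed(SEED)   # NumPy's global RNG; also pass random_state to estimators and splitters
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)  # identical folds on every rerun
env = {
    "python": platform.python_version(),
    "numpy": np.__version__,
    "sklearn": sklearn.__version__,
    "seed": SEED,
}
print(json.dumps(env, indent=2))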
Milestone checklist
- Pipeline runs end-to-end from raw dataframe to predictions.
- No data leakage across train/test or CV folds.
- Model selection is metric-driven with cross-validation.
- Artifacts saved with versions and metadata.
- Basic feature importance and threshold selection documented.
Worked examples (copy–paste runnable)
1) End-to-end Pipeline with ColumnTransformer
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score, f1_score
from sklearn.base import BaseEstimator, TransformerMixin
# Custom transformer: group rare categories
class RareCategoryGrouper(BaseEstimator, TransformerMixin):
    def __init__(self, min_freq=20):
        self.min_freq = min_freq
    def fit(self, X, y=None):
        # X is a pandas DataFrame of categorical columns
        self.rare_map_ = {}
        for col in X.columns:
            counts = X[col].value_counts()
            keep = counts[counts >= self.min_freq].index
            self.rare_map_[col] = set(keep)
        return self
    def transform(self, X):
        X = X.copy()
        for col in X.columns:
            keep = self.rare_map_[col]
            X[col] = np.where(X[col].isin(keep), X[col], "OTHER")
        return X
# Mock dataset
rng = np.random.default_rng(42)
N = 2000
num1 = rng.normal(0, 1, N)
num2 = rng.normal(5, 2, N)
cat = rng.choice(["A","B","C","D","E","F","G"], size=N, p=[.3,.25,.2,.1,.05,.05,.05])
y = (0.3*num1 + 0.1*num2 + (cat == 'A').astype(int) + rng.normal(0,0.5,N) > 0.7).astype(int)
df = pd.DataFrame({"num1": num1, "num2": num2, "cat": cat})
num_features = ["num1", "num2"]
cat_features = ["cat"]
preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_features),
        ("cat", Pipeline(steps=[
            ("rare", RareCategoryGrouper(min_freq=50)),
            ("ohe", OneHotEncoder(handle_unknown="ignore"))
        ]), cat_features)
    ]
)
clf = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000))
])
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2, stratify=y, random_state=42)
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:,1]
roc = roc_auc_score(y_test, proba)
pred = (proba > 0.35).astype(int) # tuned threshold example
f1 = f1_score(y_test, pred)
print({"roc_auc": round(roc,3), "f1@0.35": round(f1,3)})
2) Model selection with cross-validation and GridSearchCV
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.linear_model import LogisticRegression
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
param_grid = {
    "model__C": [0.1, 0.5, 1.0, 2.0],
    "model__penalty": ["l2"],
    "model__solver": ["lbfgs"]
}
# Reuse 'clf' from the previous example (the pipeline whose final step is named 'model')
grid = GridSearchCV(
    estimator=clf,
    param_grid=param_grid,
    scoring={"roc_auc": "roc_auc", "f1": "f1"},
    refit="roc_auc",
    cv=cv,
    n_jobs=-1,
    return_train_score=True
)
grid.fit(X_train, y_train)
print("Best params:", grid.best_params_)
print("CV best roc_auc:", round(grid.best_score_, 3))
# Use best_estimator_ directly
best_clf = grid.best_estimator_
print("Test roc_auc:", round(roc_auc_score(y_test, best_clf.predict_proba(X_test)[:,1]), 3))
3) Handling imbalanced data: class weights and threshold tuning
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, f1_score, average_precision_score
X, y = make_classification(n_samples=8000, n_features=20, weights=[0.95, 0.05], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
# Class weights counter imbalance
model = LogisticRegression(max_iter=1000, class_weight="balanced")
model.fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:,1]
print("AP:", round(average_precision_score(y_te, proba), 3)) # PR-AUC
# Tune threshold for F1
prec, rec, thr = precision_recall_curve(y_te, proba)
f1s = 2 * (prec*rec) / (prec + rec + 1e-12)
best_idx = f1s[:-1].argmax()   # the last precision/recall pair has no threshold
best_thr = thr[best_idx]       # precision[i]/recall[i] correspond to thresholds[i]
pred = (proba >= best_thr).astype(int)
print({"best_threshold": round(float(best_thr),3), "F1": round(f1_score(y_te, pred),3)})
# Optional (requires imbalanced-learn): SMOTE or undersampling
# from imblearn.over_sampling import SMOTE
# X_res, y_res = SMOTE(random_state=42).fit_resample(X_tr, y_tr)
# model.fit(X_res, y_res)
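If the goal is a recall floor rather than maximal F1 (as in the drills and the mini project below), a sketch that reuses prec, rec, and thr from above to pick the best-precision threshold among those meeting recall ≥ 0.85; the 0.85 target is illustrative.
import numpy as np
TARGET_RECALL = 0.85
ok = rec[:-1] >= TARGET_RECALL                     # prec[i]/rec[i] pair with thr[i]; the final pair has no threshold
if ok.any():
    idx = np.where(ok)[0][prec[:-1][ok].argmax()]  # among qualifying thresholds, maximize precision
    print({"threshold": round(float(thr[idx]), 3),
           "precision": round(float(prec[idx]), 3),
           "recall": round(float(rec[idx]), 3)})
else:
    print("No threshold reaches the recall target; revisit the model or features.")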
4) Persisting the trained pipeline with metadata
import os, time, json
import sklearn
from joblib import dump
# Suppose 'best_clf' is your final pipeline
run_ts = time.strftime("%Y%m%d-%H%M%S")
out_dir = f"artifacts/run-{run_ts}"
os.makedirs(out_dir, exist_ok=True)
# Save model
dump(best_clf, os.path.join(out_dir, "model.joblib"))
# Save run metadata
meta = {
    "framework": "scikit-learn",
    "version": sklearn.__version__,
    "cv": "StratifiedKFold(n_splits=5, shuffle=True, random_state=42)",
    "best_params": getattr(grid, "best_params_", None),
    "metrics": {"roc_auc_test": float(roc_auc_score(y_test, best_clf.predict_proba(X_test)[:, 1]))}
}
with open(os.path.join(out_dir, "meta.json"), "w") as f:
    json.dump(meta, f, indent=2)
print("Saved to:", out_dir)
5) Feature importance: coefficients and permutation importance
from sklearn.inspection import permutation_importance
import numpy as np
# Coefficients (model-level) for LogisticRegression, aligned with the expanded feature names
logreg = best_clf.named_steps["model"]
pre = best_clf.named_steps["preprocess"]
num_names = list(pre.transformers_[0][2])
cat_pipeline = pre.transformers_[1][1]
cat_ohe = cat_pipeline.named_steps["ohe"]
cat_names = list(cat_ohe.get_feature_names_out(pre.transformers_[1][2]))
all_names = num_names + cat_names
coefs = sorted(zip(all_names, logreg.coef_[0]), key=lambda x: abs(x[1]), reverse=True)
print("Coefficients by magnitude:")
for name, val in coefs:
    print(name, round(float(val), 4))
# Permutation importance on the whole pipeline (model-agnostic).
# It permutes the raw input columns, so importances align with X_test.columns, not the OHE-expanded names.
r = permutation_importance(best_clf, X_test, y_test, n_repeats=10, random_state=42, scoring="roc_auc")
imp = sorted(zip(X_test.columns, r.importances_mean), key=lambda x: x[1], reverse=True)
print("Features by permutation importance:")
for name, val in imp:
    print(name, round(float(val), 4))
Drills and exercises
- Convert a notebook model into a single sklearn Pipeline that runs .predict on raw DataFrame input.
- Add ColumnTransformer: scale numeric, one-hot encode categorical, and engineer at least one date or text-length feature (see the transformer sketch after this list).
- Run 5-fold Stratified CV and report mean ± std for ROC AUC and F1.
- Try GridSearchCV and RandomizedSearchCV on at least two model families (LogisticRegression, RandomForest).
- Handle imbalance: compare class_weight and resampling; tune decision threshold for F1 and for recall ≥ desired level.
- Export model.joblib and meta.json; reload and score on a fresh holdout split.
- Implement a custom transformer (e.g., RareCategoryGrouper, Log1pTransformer) and reuse it in the pipeline.
- Fix a seed and verify you can reproduce results across reruns (within tolerance).
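For the date/text-length drill above, a minimal sketch of an sklearn-compatible transformer that turns a datetime column into numeric parts; the column name signup_date is purely illustrative.
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
class DatePartsTransformer(BaseEstimator, TransformerMixin):
    """Derive numeric month and day-of-week features from one datetime column."""
    def __init__(self, column="signup_date"):  # illustrative column name
        self.column = column
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        dates = pd.to_datetime(X[self.column])
        return pd.DataFrame({"month": dates.dt.month, "dayofweek": dates.dt.dayofweek}, index=X.index)
# Plug it into the ColumnTransformer alongside the other branches, e.g.:
# ("dates", DatePartsTransformer(column="signup_date"), ["signup_date"])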
Common mistakes and debugging tips
- Data leakage: fitting scalers/encoders on full data before CV. Fix by putting all preprocessing inside Pipeline/ColumnTransformer.
- Wrong CV for classification: using KFold instead of StratifiedKFold. Prefer stratification for class balance in folds.
- Using accuracy on imbalanced tasks: prefer PR AUC, ROC AUC, F1, or cost-weighted metrics.
- Mismatched feature names after OHE: retrieve get_feature_names_out from the encoder (or the ColumnTransformer) so coefficient and importance plots line up with the expanded features.
- Non-reproducible results: forgot random_state in CV, model, or resampling. Set seeds consistently.
- Overfitting in tuning: reusing test set during GridSearch. Keep a final untouched holdout or use nested CV.
- Forgetting to persist metadata: store versions, params, and metrics together with the model file.
Quick debugging checklist
- Check that .fit is called only on training folds within CV.
- Verify class distribution per fold.
- Plot precision–recall curve and pick threshold aligned to business cost.
- Ensure pipeline .predict works on raw, messy input (unexpected categories should not crash).
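Two of these checks as code: class distribution per fold, and prediction on raw input with an unseen category. This sketch reuses cv, X_train, y_train, and clf from the examples above; handle_unknown="ignore" in the OneHotEncoder should turn the unseen level into an all-zero encoding rather than an error.
from collections import Counter
for i, (_, va_idx) in enumerate(cv.split(X_train, y_train)):
    print(f"fold {i} class counts:", Counter(np.asarray(y_train)[va_idx]))
messy = pd.DataFrame({"num1": [0.1], "num2": [4.2], "cat": ["NEVER_SEEN"]})
print("Proba on unseen category:", float(clf.predict_proba(messy)[:, 1][0]))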
Mini project: Credit default risk scorer
Goal: Build a reproducible binary classifier that flags potential defaults with high recall and acceptable precision.
- Data prep: create numeric and categorical features; include at least one domain rule as a custom transformer.
- Pipeline: ColumnTransformer with scaling and OHE; include custom transformer.
- CV and tuning: Stratified 5-fold; optimize PR AUC; pick hyperparameters via RandomizedSearchCV.
- Imbalance: compare class_weight versus SMOTE (optional); choose the approach that yields higher PR AUC on CV.
- Threshold: select threshold to achieve recall ≥ 0.85 with maximal precision.
- Persistence: save model.joblib and meta.json with CV scores and chosen threshold.
- Explainability: compute permutation importance; list the top 5 drivers.
Acceptance criteria
- Reproducible: rerun produces similar CV scores (± small variance).
- Generalizable: performance on holdout within 5% of CV mean.
- Operational: single pipeline .predict on raw DataFrame.
- Documented: meta.json contains params, versions, metrics, and threshold.
Subskills
- Training Pipelines With Sklearn — Build end-to-end pipelines that wrap preprocessing and models for safe, reusable training and inference. Est. 60–90 min.
- Preprocessing Pipelines And Column Transformers — Cleanly handle numeric, categorical, and custom features with ColumnTransformer. Est. 60–90 min.
- Model Selection And Hyperparameter Tuning — Use GridSearchCV/RandomizedSearchCV with proper scoring and refit. Est. 60–120 min.
- Cross Validation And Metrics — Choose CV schemes and business-aligned metrics; report mean ± std. Est. 45–90 min.
- Handling Imbalanced Data — Apply class weights, resampling, and threshold tuning for PR AUC/F1. Est. 60–120 min.
- Model Persistence And Serialization — Save/load pipelines with joblib and versioned metadata. Est. 30–60 min.
- Feature Importance And Explainability Basics — Use coefficients and permutation importance responsibly. Est. 45–90 min.
- Integrating Custom Transformers — Encapsulate domain logic via sklearn-compatible transformers. Est. 45–90 min.
- Reproducible Training Runs — Control seeds, CV splits, and record environment info. Est. 30–60 min.
- Managing Random Seeds — Systematically set seeds for numpy, sklearn, and any resampling. Est. 20–40 min.
Who this is for
- Early-career ML Engineers building their first production-ready models.
- Data Scientists moving beyond notebooks to reproducible pipelines.
- Software Engineers integrating ML into products.
Prerequisites
- Comfortable with Python, NumPy, pandas.
- Basic ML knowledge: train/test split, overfitting, classification/regression.
- Familiarity with Jupyter or scripts and virtual environments.
Learning path
- Wrap your current model in a Pipeline.
- Add ColumnTransformer and re-run CV.
- Introduce hyperparameter search; keep a clean holdout.
- Handle imbalance and choose a decision threshold.
- Persist artifacts + metadata and rehearse reload + inference.
- Add a custom transformer and permutation importance.
Next steps
- Automate training: parameterize scripts and lock random seeds.
- Containerize inference: load model.joblib in a minimal service (see the sketch below).
- Expand explainability: compare permutation importance across folds to check stability.
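For the containerization step above, a minimal inference-service sketch; FastAPI is an assumption not used elsewhere in this guide, and the artifact path is a placeholder to replace with a real run directory.
# serve.py - minimal sketch; FastAPI and the artifact path are assumptions
import pandas as pd
from joblib import load
from fastapi import FastAPI
app = FastAPI()
model = load("artifacts/run-<timestamp>/model.joblib")  # replace with a real run directory
@app.post("/predict")
def predict(record: dict):
    # expects one raw record with the same columns the pipeline was trained on
    X = pd.DataFrame([record])
    return {"probability": float(model.predict_proba(X)[:, 1][0])}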