Why this matters
As a Machine Learning Engineer, you must choose models that generalize well and tune them efficiently. Real tasks include selecting the right algorithm under time/budget constraints, preventing data leakage in pipelines, and optimizing hyperparameters to meet business metrics (e.g., AUC for fraud detection, F1 for imbalanced churn). Doing this well improves reliability, reduces costs, and speeds delivery.
Who this is for
- Engineers building production ML systems who need reliable, reproducible model evaluation.
- Data scientists moving from notebooks to pipelines with clear selection criteria.
- Analysts upgrading models with systematic tuning instead of trial-and-error.
Prerequisites
- Python and basic NumPy/Pandas.
- Intro ML: train/test split, overfitting, common metrics.
- Comfort with scikit-learn Pipelines and model APIs.
Concept explained simply
Model selection decides which algorithm and settings work best for your problem. Hyperparameter tuning searches the space of configuration knobs (e.g., tree depth, learning rate) to squeeze the most performance without overfitting.
Mental model
Think of the process as nested loops:
- Inner loop: Try hyperparameters and pick the best using cross-validation on training folds.
- Outer loop: Evaluate that choice on unseen folds (or a final holdout) to estimate true performance.
This separation prevents you from fooling yourself with optimistic scores.
Key techniques you will use
- Stratified k-fold cross-validation for robust estimates (classification).
- Proper metric choice: AUC/PR-AUC for imbalance, RMSE/MAE for regression.
- Search methods: GridSearchCV, RandomizedSearchCV, and Bayesian/early-stopping strategies (concepts).
- Validation and learning curves to spot under/overfitting (a validation-curve sketch follows this list).
- Pipelines to avoid leakage (scaling/encoding done inside CV).
- Nested cross-validation for unbiased model comparison.
- Early stopping and regularization to control variance.
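Validation curves are not covered by the worked examples below, so here is a minimal sketch of one, assuming the same breast_cancer dataset and RandomForestClassifier used later; the max_depth range is illustrative.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve
X, y = load_breast_cancer(return_X_y=True)
depths = np.arange(2, 16, 2)
train_scores, val_scores = validation_curve(
    RandomForestClassifier(random_state=42), X, y,
    param_name="max_depth", param_range=depths,
    scoring="roc_auc", cv=5, n_jobs=-1
)
# A large train/validation gap at high depth suggests overfitting;
# low scores on both curves suggest underfitting.
for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"max_depth={d}: train AUC={tr:.3f}, val AUC={va:.3f}")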
Leakage checklist
- All preprocessing inside a Pipeline.
- Stratified splits for classification.
- No target leakage features (post-outcome signals).
- Time series: use time-aware splits (e.g., TimeSeriesSplit), not random k-fold; see the sketch below.
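For time-ordered data, scikit-learn's TimeSeriesSplit keeps every training window strictly before its test window. A minimal sketch, using hypothetical placeholder arrays X_ts and y_ts that stand in for time-ordered features and targets:
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
X_ts = np.arange(24).reshape(12, 2)  # placeholder time-ordered features
y_ts = np.arange(12) % 2             # placeholder binary target
tscv = TimeSeriesSplit(n_splits=4)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X_ts)):
    # Training indices always precede test indices, so no future information leaks in.
    print(f"fold {fold}: train {train_idx.min()}-{train_idx.max()}, test {test_idx.min()}-{test_idx.max()}")
Pass tscv as the cv argument to cross_val_score or a search object in place of StratifiedKFold.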
Worked examples
Example 1: Logistic Regression vs Random Forest (classification)
Goal: Choose the better model by ROC AUC using stratified 5-fold CV on the built-in breast_cancer dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
pipe_lr = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=2000))
])
pipe_rf = Pipeline([
    # Trees are insensitive to feature scaling, so no scaler is needed here.
    ("clf", RandomForestClassifier(random_state=42))
])
auc = "roc_auc"  # built-in scorer; uses decision scores or probabilities, not hard labels
scores_lr = cross_val_score(pipe_lr, X, y, cv=cv, scoring=auc)
scores_rf = cross_val_score(pipe_rf, X, y, cv=cv, scoring=auc)
print("LR AUC:", scores_lr.mean(), "+-", scores_lr.std())
print("RF AUC:", scores_rf.mean(), "+-", scores_rf.std())
Pick the higher mean AUC. Note standard deviation to assess stability.
Example 2: Randomized search for Random Forest
Goal: Tune key hyperparameters quickly.
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
param_dist = {
    "clf__n_estimators": randint(100, 600),
    "clf__max_depth": randint(2, 20),
    "clf__min_samples_split": randint(2, 20),
    "clf__min_samples_leaf": randint(1, 10),
    "clf__max_features": ["sqrt", "log2", None]
}
pipe_rf = Pipeline([
    ("clf", RandomForestClassifier(random_state=42))
])
search = RandomizedSearchCV(
    pipe_rf, param_distributions=param_dist, n_iter=40,
    scoring=auc, cv=cv, n_jobs=-1, random_state=42
)
search.fit(X, y)
print(search.best_params_)
print("Best CV AUC:", search.best_score_)
Random search covers broad spaces efficiently; increase n_iter if time allows.
Example 3: Nested CV with SVC pipeline
Goal: Unbiased performance estimate while tuning C and gamma.
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score
inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
pipe_svc = Pipeline([
    ("scaler", StandardScaler()),
    # probability=True enables predict_proba (slower); decision_function alone also works for ROC AUC.
    ("clf", SVC(probability=True))
])
param_grid = {"clf__C": [0.1, 1, 10], "clf__gamma": [0.01, 0.1, 1]}
inner_search = GridSearchCV(pipe_svc, param_grid=param_grid, scoring=auc, cv=inner)
nested_scores = cross_val_score(inner_search, X, y, cv=outer, scoring=auc)
print("Nested CV AUC:", nested_scores.mean(), "+-", nested_scores.std())
Use nested CV when comparing families of models or when you need a reliable performance estimate for reporting.
Hands-on steps
- Define your success metric. For imbalance, prefer ROC AUC or PR AUC.
- Build a Pipeline that includes all preprocessing.
- Choose a CV strategy: StratifiedKFold for classification, KFold for regression, or time-based splits.
- Start with RandomizedSearchCV for breadth; switch to GridSearchCV near the promising region (a refinement sketch follows these steps).
- Validate with learning/validation curves if scores plateau or diverge.
- Report mean ± std across folds and selected hyperparameters. Keep seeds for reproducibility.
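As a sketch of the randomized-then-grid workflow, one possible refinement step reusing pipe_rf, cv, auc, X, and y from the worked examples above; the values in refine_grid are placeholders you would center on whatever region the randomized search identified.
from sklearn.model_selection import GridSearchCV
refine_grid = {
    "clf__n_estimators": [300, 400, 500],
    "clf__max_depth": [8, 10, 12],
    "clf__min_samples_leaf": [1, 2, 4],
}
refine = GridSearchCV(pipe_rf, param_grid=refine_grid, scoring=auc, cv=cv, n_jobs=-1)
refine.fit(X, y)
print("Refined params:", refine.best_params_)
print("Refined CV AUC:", refine.best_score_)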
Mini tasks
- Create a Pipeline for a gradient boosting model with StandardScaler only if needed (trees do not need scaling).
- Write a function that returns a scorer for PR AUC using average_precision_score (one possible sketch follows these tasks).
- Plot a learning curve to test if more data helps.
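One possible sketch for the PR AUC scorer task; it assumes a recent scikit-learn version where make_scorer accepts response_method (the built-in string "average_precision" is an equivalent shortcut).
from sklearn.metrics import average_precision_score, make_scorer
def pr_auc_scorer():
    # PR AUC needs probability estimates, not hard labels, hence response_method="predict_proba".
    return make_scorer(average_precision_score, response_method="predict_proba")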
Exercises
Do these in order. Everyone can take the exercises and the quick test; only logged-in users will have their progress saved.
- [ex1] Compare LogisticRegression vs RandomForest on breast_cancer with Stratified 5-fold CV using ROC AUC. Report mean ± std and the winner.
- [ex2] Use RandomizedSearchCV to tune RandomForest. Report best params and CV AUC. Compare to untuned baseline.
- [ex3] Perform nested CV for an SVC with a scaling pipeline and a small grid. Report nested mean ± std AUC.
Self-check checklist
- All preprocessing inside a Pipeline.
- Used StratifiedKFold for classification.
- Reported mean and std across folds.
- Separated inner tuning from outer evaluation when required.
- Fixed random seeds where applicable.
Common mistakes and how to self-check
- Data leakage: Scaling or encoding done before CV. Fix: Put everything inside a Pipeline.
- Wrong metric: Using accuracy on imbalanced data. Fix: Use ROC AUC or PR AUC.
- Over-tuning: Too many rounds on the same validation split. Fix: Use CV or nested CV and stop early when improvements stall.
- Ignoring variance: Reporting only the best fold. Fix: Always include mean ± std.
- Comparing models on different splits. Fix: Use the same CV object for fair comparison.
DIY audit
- Print class proportions per fold to confirm stratification.
- Check Pipeline steps: no transformation applied outside it.
- Log search space and random seeds for reproducibility.
Practical projects
- Churn prediction: Build pipelines for logistic regression, random forest, and gradient boosting; tune with randomized search; choose based on PR AUC.
- Credit risk scoring: Evaluate imbalanced metrics (ROC AUC, PR AUC), calibrate probabilities, and document thresholds.
- Time-series demand forecasting: Use time-aware splits (rolling origin), compare linear models vs tree-based regressors; report MAPE/RMSE.
Learning path
- Refresh metrics and CV strategies.
- Master Pipelines to eliminate leakage.
- Practice RandomizedSearchCV; then targeted GridSearchCV.
- Use validation/learning curves to diagnose bias/variance.
- Adopt nested CV for final comparison and reporting.
- Automate: reusable functions for CV, scoring, and reporting.
Next steps
- Extend to Bayesian optimization and early stopping in gradient boosting (an early-stopping sketch follows this list).
- Add calibration curves when decisions depend on probability estimates.
- Monitor drift and re-run selection periodically as data changes.
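A minimal sketch of built-in early stopping with scikit-learn's HistGradientBoostingClassifier on the same breast_cancer data; the max_iter and n_iter_no_change values are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import HistGradientBoostingClassifier
X, y = load_breast_cancer(return_X_y=True)
hgb = HistGradientBoostingClassifier(
    max_iter=1000,           # upper bound on boosting iterations
    early_stopping=True,     # hold out a validation fraction internally
    validation_fraction=0.1,
    n_iter_no_change=20,     # stop once the validation score stops improving
    random_state=42,
)
hgb.fit(X, y)
print("Boosting iterations actually used:", hgb.n_iter_)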
Take the Quick Test
Ready to check your understanding? Take the short test below. Everyone can access it; only logged-in users will have their progress saved.