
Model Selection And Hyperparameter Tuning

Learn Model Selection And Hyperparameter Tuning for free with explanations, exercises, and a quick test (for Machine Learning Engineers).

Published: January 1, 2026 | Updated: January 1, 2026

Why this matters

As a Machine Learning Engineer, you must choose models that generalize well and tune them efficiently. Real tasks include selecting the right algorithm under time/budget constraints, preventing data leakage in pipelines, and optimizing hyperparameters to meet business metrics (e.g., AUC for fraud detection, F1 for imbalanced churn). Doing this well improves reliability, reduces costs, and speeds delivery.

Who this is for

  • Engineers building production ML systems who need reliable, reproducible model evaluation.
  • Data scientists moving from notebooks to pipelines with clear selection criteria.
  • Analysts upgrading models with systematic tuning instead of trial-and-error.

Prerequisites

  • Python and basic NumPy/Pandas.
  • Intro ML: train/test split, overfitting, common metrics.
  • Comfort with scikit-learn Pipelines and model APIs.

Concept explained simply

Model selection decides which algorithm and settings work best for your problem. Hyperparameter tuning searches the space of configuration knobs (e.g., tree depth, learning rate) to squeeze the most performance without overfitting.

Mental model

Think of the process as nested loops:

  • Inner loop: Try hyperparameters and pick the best using cross-validation on training folds.
  • Outer loop: Evaluate that choice on unseen folds (or a final holdout) to estimate true performance.

This separation prevents you from fooling yourself with optimistic scores.
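
To make the two loops concrete, here is the same idea written out with explicit Python loops on a toy setup (a minimal sketch, assuming scikit-learn; the dataset and the small C grid are placeholders chosen only to keep it runnable). Example 3 below does the same thing with GridSearchCV and cross_val_score.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
candidate_C = [0.01, 0.1, 1.0, 10.0]  # hyperparameter grid for the inner loop

outer_scores = []
for train_idx, test_idx in outer.split(X, y):  # outer loop: honest evaluation
    X_tr, y_tr = X[train_idx], y[train_idx]
    best_C, best_cv_auc = None, -np.inf
    for C in candidate_C:  # inner loop: pick hyperparameters by CV on training folds
        fold_aucs = []
        for in_tr, in_val in inner.split(X_tr, y_tr):
            model = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=2000))
            model.fit(X_tr[in_tr], y_tr[in_tr])
            proba = model.predict_proba(X_tr[in_val])[:, 1]
            fold_aucs.append(roc_auc_score(y_tr[in_val], proba))
        if np.mean(fold_aucs) > best_cv_auc:
            best_cv_auc, best_C = np.mean(fold_aucs), C
    # refit the winning configuration on the full outer-training fold,
    # then score it on data the inner loop never saw
    final = make_pipeline(StandardScaler(), LogisticRegression(C=best_C, max_iter=2000))
    final.fit(X_tr, y_tr)
    proba = final.predict_proba(X[test_idx])[:, 1]
    outer_scores.append(roc_auc_score(y[test_idx], proba))

print("Nested CV AUC:", np.mean(outer_scores), "+-", np.std(outer_scores))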

Key techniques you will use

  • Stratified k-fold cross-validation for robust estimates (classification).
  • Proper metric choice: AUC/PR-AUC for imbalance, RMSE/MAE for regression.
  • Search methods: GridSearchCV, RandomizedSearchCV, and Bayesian/early-stopping strategies (concepts).
  • Validation and learning curves to spot under/overfitting.
  • Pipelines to avoid leakage (scaling/encoding done inside CV).
  • Nested cross-validation for unbiased model comparison.
  • Early stopping and regularization to control variance.

Leakage checklist

  • All preprocessing inside a Pipeline.
  • Stratified splits for classification.
  • No target leakage features (post-outcome signals).
  • Time series: use time-aware splits, not random k-fold.
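
The first and last checklist items are the easiest to get wrong, so here is a minimal sketch that combines them (assuming scikit-learn; the synthetic daily series below is invented purely to make the example runnable): all preprocessing lives inside the Pipeline, and TimeSeriesSplit replaces random k-fold so each validation window lies strictly after its training window.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_days = 365
X = np.column_stack([
    np.arange(n_days),                          # trend feature
    np.sin(2 * np.pi * np.arange(n_days) / 7),  # weekly seasonality feature
])
y = 0.05 * X[:, 0] + 3 * X[:, 1] + rng.normal(scale=0.5, size=n_days)

pipe = Pipeline([
    ("scaler", StandardScaler()),  # fit inside each training window only, so no leakage
    ("reg", Ridge(alpha=1.0)),
])

tscv = TimeSeriesSplit(n_splits=5)  # always trains on the past, validates on the future
scores = cross_val_score(pipe, X, y, cv=tscv, scoring="neg_root_mean_squared_error")
print("RMSE per fold:", -scores)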

Worked examples

Example 1: Logistic Regression vs Random Forest (classification)

Goal: Choose the better model by ROC AUC using stratified 5-fold CV on the built-in breast_cancer dataset.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

pipe_lr = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=2000))
])
pipe_rf = Pipeline([
    ("clf", RandomForestClassifier(random_state=42))
])

auc = "roc_auc"  # built-in scorer string; make_scorer(..., needs_threshold=True) is deprecated in recent scikit-learn

scores_lr = cross_val_score(pipe_lr, X, y, cv=cv, scoring=auc)
scores_rf = cross_val_score(pipe_rf, X, y, cv=cv, scoring=auc)

print("LR AUC:", scores_lr.mean(), "+-", scores_lr.std())
print("RF AUC:", scores_rf.mean(), "+-", scores_rf.std())

Pick the higher mean AUC. Note standard deviation to assess stability.
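
Because the same cv object (with a fixed random_state) produced identical folds for both models, the two score arrays are paired fold by fold. A small optional sketch of that comparison:

import numpy as np

# positive values mean the random forest beat logistic regression on that fold
diff = scores_rf - scores_lr
print("Per-fold AUC difference:", np.round(diff, 4))
print("Mean difference:", diff.mean(), "+-", diff.std())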

Example 2: Randomized search for Random Forest

Goal: Tune key hyperparameters quickly.

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_dist = {
    "clf__n_estimators": randint(100, 600),
    "clf__max_depth": randint(2, 20),
    "clf__min_samples_split": randint(2, 20),
    "clf__min_samples_leaf": randint(1, 10),
    "clf__max_features": ["sqrt", "log2", None]
}

pipe_rf = Pipeline([
    ("clf", RandomForestClassifier(random_state=42))
])

search = RandomizedSearchCV(
    pipe_rf, param_distributions=param_dist, n_iter=40,
    scoring=auc, cv=cv, n_jobs=-1, random_state=42
)
search.fit(X, y)
print(search.best_params_)
print("Best CV AUC:", search.best_score_)

Random search covers broad spaces efficiently; increase n_iter if time allows.
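
Once random search has localized a promising region, you can refine it with a small targeted grid. A hedged sketch, reusing search, pipe_rf, auc, and cv from above (the +-2 and +-1 neighborhoods are illustrative choices, not a rule):

from sklearn.model_selection import GridSearchCV

best = search.best_params_
refine_grid = {
    # keep the other winners fixed, vary only depth and leaf size around the optimum
    "clf__n_estimators": [best["clf__n_estimators"]],
    "clf__max_features": [best["clf__max_features"]],
    "clf__min_samples_split": [best["clf__min_samples_split"]],
    "clf__max_depth": [max(2, best["clf__max_depth"] - 2),
                       best["clf__max_depth"],
                       best["clf__max_depth"] + 2],
    "clf__min_samples_leaf": [max(1, best["clf__min_samples_leaf"] - 1),
                              best["clf__min_samples_leaf"],
                              best["clf__min_samples_leaf"] + 1],
}

refine = GridSearchCV(pipe_rf, param_grid=refine_grid, scoring=auc, cv=cv, n_jobs=-1)
refine.fit(X, y)
print("Refined CV AUC:", refine.best_score_)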

Example 3: Nested CV with SVC pipeline

Goal: Unbiased performance estimate while tuning C and gamma.

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score

inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

pipe_svc = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", SVC(probability=True))
])

param_grid = {"clf__C": [0.1, 1, 10], "clf__gamma": [0.01, 0.1, 1]}
inner_search = GridSearchCV(pipe_svc, param_grid=param_grid, scoring=auc, cv=inner)

nested_scores = cross_val_score(inner_search, X, y, cv=outer, scoring=auc)
print("Nested CV AUC:", nested_scores.mean(), "+-", nested_scores.std())

Use nested CV when comparing families of models or when you need a reliable performance estimate for reporting.

Hands-on steps

  1. Define your success metric. For imbalance, prefer ROC AUC or PR AUC.
  2. Build a Pipeline that includes all preprocessing.
  3. Choose a CV strategy: StratifiedKFold for classification, KFold for regression, or time-based splits.
  4. Start with RandomizedSearchCV for breadth; switch to GridSearchCV near the promising region.
  5. Validate with learning/validation curves if scores plateau or diverge.
  6. Report mean ± std across folds and selected hyperparameters. Keep seeds for reproducibility.
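
Steps 3 and 6 are easy to wrap in a small reusable helper. A minimal sketch (the function name and defaults are suggestions, not a standard API):

from sklearn.model_selection import StratifiedKFold, cross_val_score

def evaluate(pipeline, X, y, scoring="roc_auc", n_splits=5, seed=42):
    """Cross-validate a pipeline and report mean +- std of the chosen metric."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_val_score(pipeline, X, y, cv=cv, scoring=scoring, n_jobs=-1)
    print(f"{scoring}: {scores.mean():.4f} +- {scores.std():.4f} (seed={seed})")
    return scores.mean(), scores.std()

# usage with the pipelines from Example 1:
# evaluate(pipe_lr, X, y)
# evaluate(pipe_rf, X, y, scoring="average_precision")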

Mini tasks

  • Create a Pipeline for a gradient boosting model; add StandardScaler only if needed (tree-based models do not require feature scaling).
  • Write a function that returns a scorer for PR AUC using average_precision_score.
  • Plot a learning curve to test if more data helps.
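
A hedged sketch for mini tasks 2 and 3, assuming scikit-learn >= 1.4 (where make_scorer accepts response_method; older versions used needs_proba=True) and reusing pipe_lr, X, y, and cv from Example 1. Note that the built-in string scorer "average_precision" is usually all you need.

import numpy as np
from sklearn.metrics import average_precision_score, make_scorer
from sklearn.model_selection import learning_curve

def pr_auc_scorer():
    """Return a PR AUC scorer built from average_precision_score."""
    return make_scorer(average_precision_score, response_method="predict_proba")

# learning curve: does validation PR AUC keep improving with more training data?
train_sizes, train_scores, val_scores = learning_curve(
    pipe_lr, X, y, cv=cv, scoring="average_precision",
    train_sizes=np.linspace(0.1, 1.0, 5), n_jobs=-1,
)
print("Train PR AUC:", train_scores.mean(axis=1))
print("CV PR AUC:   ", val_scores.mean(axis=1))
# If the CV curve is still rising at the largest training size, more data may help;
# plot train_sizes against both curves to see this visually.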

Exercises

Do these in order. Note: everyone can take the exercises and the quick test; only logged-in users will have their progress saved.

  • [ex1] Compare LogisticRegression vs RandomForest on breast_cancer with Stratified 5-fold CV using ROC AUC. Report mean ± std and the winner.
  • [ex2] Use RandomizedSearchCV to tune RandomForest. Report best params and CV AUC. Compare to untuned baseline.
  • [ex3] Perform nested CV for an SVC with a scaling pipeline and a small grid. Report nested mean ± std AUC.

Self-check checklist

  • All preprocessing inside a Pipeline.
  • Used StratifiedKFold for classification.
  • Reported mean and std across folds.
  • Separated inner tuning from outer evaluation when required.
  • Fixed random seeds where applicable.

Common mistakes and how to self-check

  • Data leakage: Scaling or encoding done before CV. Fix: Put everything inside a Pipeline.
  • Wrong metric: Using accuracy on imbalanced data. Fix: Use ROC AUC or PR AUC.
  • Over-tuning: Too many rounds on the same validation split. Fix: Use CV or nested CV and stop early when improvements stall.
  • Ignoring variance: Reporting only the best fold. Fix: Always include mean ± std.
  • Comparing models on different splits. Fix: Use the same CV object for fair comparison.

DIY audit

  • Print fold shapes and class balance to confirm stratification (see the sketch after this list).
  • Check Pipeline steps: no transformation applied outside it.
  • Log search space and random seeds for reproducibility.
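
A short audit sketch for the first item, reusing X, y, and the cv object from the worked examples: print fold sizes and the positive-class rate to confirm that StratifiedKFold preserves the label ratio in every fold.

for i, (train_idx, test_idx) in enumerate(cv.split(X, y)):
    print(f"fold {i}: train={len(train_idx)} test={len(test_idx)} "
          f"positive rate train={y[train_idx].mean():.3f} test={y[test_idx].mean():.3f}")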

Practical projects

  • Churn prediction: Build pipelines for logistic regression, random forest, and gradient boosting; tune with randomized search; choose based on PR AUC.
  • Credit risk scoring: Evaluate imbalanced metrics (ROC AUC, PR AUC), calibrate probabilities, and document thresholds.
  • Time-series demand forecasting: Use time-aware splits (rolling origin), compare linear models vs tree-based regressors; report MAPE/RMSE.

Learning path

  1. Refresh metrics and CV strategies.
  2. Master Pipelines to eliminate leakage.
  3. Practice RandomizedSearchCV; then targeted GridSearchCV.
  4. Use validation/learning curves to diagnose bias/variance.
  5. Adopt nested CV for final comparison and reporting.
  6. Automate: reusable functions for CV, scoring, and reporting.

Next steps

  • Extend to Bayesian optimization and early stopping in gradient boosting (a sketch of early stopping follows this list).
  • Add calibration curves when decisions depend on probability estimates.
  • Monitor drift and re-run selection periodically as data changes.
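
As a taste of the first item, scikit-learn's HistGradientBoostingClassifier supports built-in early stopping; a minimal sketch reusing X, y, and cv from the examples above (the max_iter and patience values are illustrative):

from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score

hgb = HistGradientBoostingClassifier(
    max_iter=1000,            # upper bound on boosting rounds
    early_stopping=True,      # holds out validation_fraction of each training set
    validation_fraction=0.1,
    n_iter_no_change=20,      # stop once 20 rounds pass without improvement
    random_state=42,
)
scores = cross_val_score(hgb, X, y, cv=cv, scoring="roc_auc")
print("HGB AUC:", scores.mean(), "+-", scores.std())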

Take the Quick Test

Ready to check your understanding? Take the short test below. Everyone can access it; only logged-in users will have their progress saved.

Practice Exercises

3 exercises to complete

Instructions

Use the breast_cancer dataset. Build two Pipelines: one with StandardScaler + LogisticRegression(max_iter=2000), and one with RandomForestClassifier(random_state=42). Use StratifiedKFold(n_splits=5, shuffle=True, random_state=42) with scoring="roc_auc". Report mean ± std for each and declare the winner.

Deliverables:

  • Mean ± std AUC for both models.
  • Chosen model and a one-sentence justification.

Expected Output

Two AUC numbers with standard deviations and a clear statement selecting the model with the higher mean AUC. The justification should mention both performance and stability.

Model Selection And Hyperparameter Tuning — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.
