Cross Validation And Metrics

Learn Cross Validation And Metrics for free with explanations, exercises, and a quick test (for Machine Learning Engineers).

Published: January 1, 2026 | Updated: January 1, 2026

Why this matters

As a Machine Learning Engineer, you need reliable estimates of model performance before shipping. Cross-validation (CV) prevents overfitting to a single split, and metrics translate model behavior into business-relevant numbers. You will use CV and metrics to:

  • Select models and hyperparameters confidently.
  • Guard against data leakage and optimistic results.
  • Communicate performance trade-offs (e.g., precision vs. recall) to stakeholders.
  • Build repeatable evaluation pipelines in ML frameworks.
Real tasks you will face
  • Estimate uplift from a new classifier on imbalanced data with StratifiedKFold and F1.
  • Compare regression models using MAE vs. RMSE to align with business penalties.
  • Set up TimeSeriesSplit for forecasting without peeking into the future.
  • Run nested CV to avoid leakage during hyperparameter tuning.

Note: The quick test on this page is available to everyone. Only logged-in users have their progress saved.

Who this is for and prerequisites

Who this is for

  • Engineers deploying models and needing trustworthy evaluation.
  • Data scientists moving from notebooks to production-ready pipelines.
  • Analysts learning to interpret model metrics properly.

Prerequisites

  • Basic Python and machine learning concepts (features, target, train/validation/test).
  • Familiarity with a framework like scikit-learn; OK if you are still learning.
  • Comfort with arrays, dataframes, and simple model training loops.

Concept explained simply

Cross-validation splits your data into multiple folds, trains on all but one fold, and validates on the held-out fold. Repeat so each fold serves as the validation set once, then average the metric across folds. This reduces the luck of a single split.
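
A minimal sketch of that loop, using a synthetic dataset purely for illustration (5-fold CV with accuracy as the metric):

# Manual K-fold loop: train on k-1 folds, validate on the held-out fold,
# then average the metric across folds. Synthetic data for illustration only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in cv.split(X):
    model = LogisticRegression(max_iter=200)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(accuracy_score(y[val_idx], model.predict(X[val_idx])))

print(f"accuracy: {np.mean(fold_scores):.3f} ± {np.std(fold_scores):.3f}")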

Mental model

Imagine testing a running shoe on different tracks. If it performs well across many tracks, you trust it more. CV is your set of tracks; metrics are your stopwatch numbers.

Core techniques

Common CV patterns and when to use

  • K-Fold: General purpose; shuffle data if IID assumptions hold.
  • Stratified K-Fold: For classification with class imbalance; preserves class ratios.
  • Group K-Fold: When samples share a group (e.g., multiple rows per user); prevents leakage across groups.
  • Leave-One-Out (LOOCV): Each sample is its own validation fold; expensive and high-variance, so mainly useful for very small datasets.
  • TimeSeriesSplit (rolling/expanding window): For temporal data; avoids training on the future.
  • Nested CV: Use inner CV for tuning and outer CV for unbiased performance estimation.
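
All of these splitters share the same scikit-learn interface: construct the splitter, then iterate over the (train_idx, val_idx) pairs from .split(X, y, groups). A quick construction sketch (parameter values are illustrative, not recommendations):

# Each splitter yields (train_idx, val_idx) pairs via .split(); only the
# grouping/ordering logic differs.
from sklearn.model_selection import (
    KFold, StratifiedKFold, GroupKFold, LeaveOneOut, TimeSeriesSplit
)

splitters = {
    "kfold": KFold(n_splits=5, shuffle=True, random_state=0),
    "stratified": StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    "grouped": GroupKFold(n_splits=5),           # pass groups=... to .split()
    "loocv": LeaveOneOut(),                      # one sample per validation fold
    "time_series": TimeSeriesSplit(n_splits=5),  # no shuffling; folds move forward in time
}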

Metric selection quick guide

Classification
  • Accuracy: Only if classes are balanced and errors are equally costly.
  • Precision: Of the predicted positives, how many are truly positive.
  • Recall: Of all actual positives, how many did we catch.
  • F1: Harmonic mean of precision and recall; good for imbalanced classes.
  • ROC-AUC: Ranking quality across thresholds; can be optimistic on heavy imbalance.
  • PR-AUC: Better than ROC-AUC when positives are rare.
Regression
  • MAE: Average absolute error; robust to outliers; easy to explain.
  • MSE / RMSE: Squared error penalizes large mistakes more; RMSE is in target units.
  • R²: Proportion of variance explained; can be misleading if baseline is poor.
Other considerations
  • Macro vs. micro averaging: Macro treats classes equally; micro is instance-weighted.
  • Threshold tuning: Pick a threshold by optimizing a metric on validation folds, then lock it for test (see the sketch after this list).
  • Report mean ± standard deviation across folds, and keep fold-wise results for debugging.
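
To make the threshold-tuning bullet concrete, here is a minimal sketch that picks the lowest probability threshold meeting a target precision on a validation fold. The arrays are toy values, and "lowest threshold that meets the target" is just one simple heuristic:

# Choose a threshold on validation data that satisfies a precision target,
# then lock it before the final test evaluation. Toy arrays for illustration.
import numpy as np
from sklearn.metrics import precision_recall_curve

y_val = np.array([0, 0, 1, 0, 1, 1, 0, 1, 0, 0])
proba_val = np.array([0.10, 0.40, 0.35, 0.80, 0.65, 0.90, 0.20, 0.55, 0.30, 0.05])

precision, recall, thresholds = precision_recall_curve(y_val, proba_val)
target_precision = 0.75

# precision/recall have one more entry than thresholds; drop the last point to align.
meets_target = precision[:-1] >= target_precision
chosen = thresholds[meets_target].min() if meets_target.any() else None
print("chosen threshold:", chosen)

In practice you would repeat this per validation fold, inspect or average the chosen thresholds across folds, and freeze the final value before touching the test set.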

Framework blueprints (sklearn-style)

Reusable pattern
# Classification with stratified CV and pipeline
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, f1_score

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=200))
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
score = cross_val_score(pipe, X, y, cv=cv, scoring=make_scorer(f1_score))
print(score.mean(), score.std())

Time series pattern
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_absolute_error

cv = TimeSeriesSplit(n_splits=5)
mae_scores = []
for train_idx, val_idx in cv.split(X):
    model = ...  # your regressor
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[val_idx])
    mae_scores.append(mean_absolute_error(y[val_idx], pred))
print(sum(mae_scores)/len(mae_scores))

Worked examples

Example 1: Imbalanced classification with F1 and PR-AUC

  1. Use StratifiedKFold (5 folds, shuffle).
  2. Evaluate F1 and PR-AUC; compare to accuracy.
  3. Pick the model with best mean F1; report std.
from sklearn.metrics import f1_score, average_precision_score
from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
f1s, pras = [], []
for tr, va in cv.split(X, y):
    m = ...  # e.g., gradient boosting
    m.fit(X[tr], y[tr])
    p = m.predict(X[va])
    proba = getattr(m, "predict_proba", None)  # not every model exposes probabilities
    f1s.append(f1_score(y[va], p))
    if proba is not None:
        pras.append(average_precision_score(y[va], proba(X[va])[:, 1]))
print("F1:", sum(f1s)/len(f1s), "PR-AUC:", sum(pras)/len(pras) if pras else "n/a")

Self-check: If accuracy is high but F1 is low, your positives are likely rare and being missed.

Example 2: Regression model choice with MAE vs. RMSE

  1. Compare Linear Regression vs. Random Forest using 5-fold CV.
  2. Compute MAE and RMSE per fold.
  3. Pick the metric based on business impact: if large errors are much worse, choose RMSE; if error costs are roughly linear, choose MAE.
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error, mean_squared_error
import math

cv = KFold(n_splits=5, shuffle=True, random_state=1)
mae_lr, rmse_lr = [], []
for tr, va in cv.split(X):
    lr = ...  # LinearRegression()
    lr.fit(X[tr], y[tr])
    p = lr.predict(X[va])
    mae_lr.append(mean_absolute_error(y[va], p))
    rmse_lr.append(math.sqrt(mean_squared_error(y[va], p)))
print("LR MAE mean:", sum(mae_lr)/len(mae_lr))
print("LR RMSE mean:", sum(rmse_lr)/len(rmse_lr))

Self-check: If RMSE is much larger than MAE, a few large errors dominate; inspect outliers.

Example 3: Time series with expanding window

  1. Ensure each validation fold is strictly ahead of training in time.
  2. Use TimeSeriesSplit; compute MAE or sMAPE.
  3. Aggregate mean score across folds; avoid shuffling.
from sklearn.model_selection import TimeSeriesSplit

cv = TimeSeriesSplit(n_splits=4)
for tr, va in cv.split(X):
    model = ...
    model.fit(X[tr], y[tr])
    pred = model.predict(X[va])
    # compute MAE/sMAPE

Self-check: If you see unusually high validation performance, verify you did not leak future features (e.g., target lag alignment).

Example 4: Nested CV for unbiased performance

  1. Outer CV (e.g., 5-fold) estimates generalization.
  2. Inner CV tunes hyperparameters (e.g., GridSearchCV).
  3. Only report outer fold scores.
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import f1_score

outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

params = {"clf__C": [0.1, 1, 10]}
outer_scores = []
for tr, va in outer.split(X, y):
    pipe = ...  # pipeline
    gs = GridSearchCV(pipe, param_grid=params, cv=inner, scoring="f1")
    gs.fit(X[tr], y[tr])
    best = gs.best_estimator_
    p = best.predict(X[va])
    outer_scores.append(f1_score(y[va], p))
print(sum(outer_scores)/len(outer_scores))

Self-check: If you tune hyperparameters on the full dataset and then cross-validate, that is leakage. Tune only within inner folds.

Example 5: Group-aware CV to prevent leakage

  1. When multiple rows belong to the same user/order, use GroupKFold.
  2. Ensure groups do not appear in both train and validation of the same fold.
  3. Evaluate chosen metric per fold.
from sklearn.model_selection import GroupKFold

cv = GroupKFold(n_splits=5)
for tr, va in cv.split(X, y, groups=user_id):
    model = ...
    model.fit(X[tr], y[tr])
    pred = model.predict(X[va])
    # compute metric

Self-check: Confirm no group ID appears in both train and validation indices within a fold.
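
A quick way to run that self-check programmatically, reusing cv, X, y, and user_id from the example above (user_id assumed to be a NumPy array aligned with the rows of X):

# Assert that no group ID appears in both train and validation within a fold.
import numpy as np

for tr, va in cv.split(X, y, groups=user_id):
    overlap = np.intersect1d(user_id[tr], user_id[va])
    assert overlap.size == 0, f"groups leaked across folds: {overlap}"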

How to run a solid evaluation (step-by-step)

  1. Freeze the problem: Define target, positive class, metric(s), and deployment threshold needs.
  2. Pick CV scheme: IID → KFold/Stratified; temporal → TimeSeriesSplit; grouped → GroupKFold; tuning → Nested CV.
  3. Build a pipeline: Include preprocessing inside the pipeline so CV fits it on training folds only.
  4. Tune inside CV: Use GridSearchCV/RandomizedSearchCV with an inner CV.
  5. Aggregate and inspect: Report mean ± std; keep fold-wise metrics to spot instability (see the sketch after these steps).
  6. Lock decisions: Freeze threshold/hyperparameters; finally evaluate once on a held-out test set.
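
Step 5 can be handled in one call with cross_validate, which scores several metrics per fold; a sketch assuming pipe, X, y, and cv are defined as in the classification blueprint above:

# Score multiple metrics per fold in one call, then report mean ± std for each.
from sklearn.model_selection import cross_validate

results = cross_validate(
    pipe, X, y, cv=cv,
    scoring={"f1": "f1", "precision": "precision", "recall": "recall"},
)
for name in ("f1", "precision", "recall"):
    scores = results[f"test_{name}"]
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")

Keeping the raw per-fold arrays, rather than only the means, is what lets you spot an unstable fold later.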

Common mistakes and self-check

  • Data leakage: Fitting scalers/encoders outside the pipeline leaks information. Self-check: Does each fold fit preprocessing only on training indices? (See the sketch after this list.)
  • Wrong CV for time series: Shuffling breaks temporal order. Self-check: Validation timestamps must be strictly after training.
  • Over-relying on accuracy: On imbalance, use F1/PR-AUC. Self-check: Compare accuracy vs. F1; if they diverge, accuracy is misleading.
  • Choosing metrics misaligned with costs: If big errors are expensive, prefer RMSE. Self-check: Map errors to money and choose accordingly.
  • Reporting only best fold: Always report average and spread. Self-check: Keep fold-wise results.
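
To make the leakage bullet concrete, the difference is simply where the scaler is fitted; a minimal sketch assuming a generic feature matrix X and binary target y:

# Leaky: the scaler is fitted on the full dataset before CV, so every
# validation fold has influenced the preprocessing statistics.
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X_scaled = StandardScaler().fit_transform(X)   # fitted on all rows: leakage
leaky_scores = cross_val_score(LogisticRegression(max_iter=200), X_scaled, y, cv=5)

# Safe: the scaler lives inside the pipeline, so each fold refits it on
# that fold's training rows only.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=200)),
])
safe_scores = cross_val_score(pipe, X, y, cv=5)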

Practical projects

  • Credit default classifier: StratifiedKFold with F1/PR-AUC; tune threshold for a fixed precision.
  • House prices regressor: Compare MAE vs. RMSE; report which aligns with business penalties.
  • Sales forecasting: TimeSeriesSplit with expanding window; compare naive baseline vs. model.
  • User-level churn model: GroupKFold by user; show how standard K-Fold overestimates performance.

Learning path

  • Start: K-Fold and accuracy/MAE basics.
  • Next: StratifiedKFold, F1, PR-AUC for imbalance.
  • Then: Pipelines and nested CV for tuning.
  • Advanced: TimeSeriesSplit, GroupKFold, custom scoring, threshold calibration.

Exercises

Do the tasks below. They mirror the exercise section on this page. Check your work using the solution toggles.

Exercise 1 — Compare two classifiers with stratified 5-fold CV

  1. Prepare a binary classification dataset (real or simulated). Ensure class imbalance (~10–20% positives) if possible.
  2. Build a pipeline with scaling and a linear model (e.g., logistic regression).
  3. Build a second model (e.g., tree-based).
  4. Use StratifiedKFold(n_splits=5, shuffle=True, random_state=42).
  5. Compute per-fold precision, recall, F1, and PR-AUC for both models.
  6. Report mean ± std for each metric. Pick the winner by F1. Briefly justify.
  • Checklist:
    • Preprocessing inside pipeline.
    • No leakage across folds.
    • Stratified splits used.
    • At least two metrics compared beyond accuracy.
Need a nudge?
  • If probabilities are available, compute PR-AUC via average_precision_score.
  • For precision/recall/F1, use predictions from a 0.5 threshold.
  • Report both mean and standard deviation across folds.

Mini challenge

You must deliver a fraud model with at least 0.80 precision and as high recall as possible. Using stratified 5-fold CV, design a procedure to choose a probability threshold and report expected precision/recall after deployment. Write your steps and which plots/metrics you will inspect. Keep it to 6–8 bullet points.

Next steps

  • Apply nested CV to one of your existing projects and compare results to your previous single split.
  • Add threshold tuning inside each fold and lock it before final test evaluation.
  • Create a simple reporting template to always show mean ± std and fold-wise metrics.

Practice Exercises

1 exercise to complete

Instructions

Goal: Evaluate two classifiers on an imbalanced dataset using StratifiedKFold and pick the winner by mean F1.

  1. Create or load a binary dataset (target with ~10–20% positives if possible). If needed, simulate with a standard generator.
  2. Build Model A: a pipeline with StandardScaler and LogisticRegression.
  3. Build Model B: a pipeline with StandardScaler and GradientBoostingClassifier or RandomForestClassifier.
  4. Use StratifiedKFold(n_splits=5, shuffle=True, random_state=42).
  5. For each model and fold, compute: precision, recall, F1, and PR-AUC.
  6. Report mean ± std for each metric. Select the model with higher mean F1 (break ties with PR-AUC).
  • Deliverables:
    • Per-fold metrics (concise).
    • Overall mean ± std.
    • 1–2 sentences justifying the winner.
Expected Output
Example: Model A — F1: 0.61 ± 0.03, PR-AUC: 0.58 ± 0.04; Model B — F1: 0.66 ± 0.02, PR-AUC: 0.65 ± 0.03. Winner: Model B by F1; more stable across folds.

Cross Validation And Metrics — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
