Why this matters
As a Machine Learning Engineer, you need reliable estimates of model performance before shipping. Cross-validation (CV) keeps a single lucky split from skewing those estimates, and metrics translate model behavior into business-relevant numbers. You will use CV and metrics to:
- Select models and hyperparameters confidently.
- Guard against data leakage and optimistic results.
- Communicate performance trade-offs (e.g., precision vs. recall) to stakeholders.
- Build repeatable evaluation pipelines in ML frameworks.
Real tasks you will face
- Estimate uplift from a new classifier on imbalanced data with StratifiedKFold and F1.
- Compare regression models using MAE vs. RMSE to align with business penalties.
- Set up TimeSeriesSplit for forecasting without peeking into the future.
- Run nested CV to avoid leakage during hyperparameter tuning.
Who this is for and prerequisites
Who this is for
- Engineers deploying models and needing trustworthy evaluation.
- Data scientists moving from notebooks to production-ready pipelines.
- Analysts learning to interpret model metrics properly.
Prerequisites
- Basic Python and machine learning concepts (features, target, train/validation/test).
- Familiarity with a framework like scikit-learn; OK if you are still learning.
- Comfort with arrays, dataframes, and simple model training loops.
Concept explained simply
Cross-validation splits your data into several folds, trains on all but one fold, and validates on the held-out fold. Repeating this for every fold and averaging the metric reduces the luck of any single split.
Mental model
Imagine testing a running shoe on different tracks. If it performs well across many tracks, you trust it more. CV is your set of tracks; metrics are your stopwatch numbers.
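A minimal sketch of that idea, assuming numpy arrays X and y and any scikit-learn classifier (logistic regression here is just a stand-in):
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
cv = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, val_idx in cv.split(X):
    model = LogisticRegression(max_iter=200)   # any estimator would do
    model.fit(X[train_idx], y[train_idx])      # train on k-1 folds
    fold_scores.append(accuracy_score(y[val_idx], model.predict(X[val_idx])))  # validate on the held-out fold
print(np.mean(fold_scores), "+/-", np.std(fold_scores))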
Core techniques
Common CV patterns and when to use
- K-Fold: General purpose; shuffle data if IID assumptions hold.
- Stratified K-Fold: For classification with class imbalance; preserves class ratios.
- Group K-Fold: When samples share a group (e.g., multiple rows per user); prevents leakage across groups.
- Leave-One-Out (LOOCV): Fits one model per sample, so it is expensive; the estimate can also be high-variance. Rarely worth it beyond small datasets.
- TimeSeriesSplit (rolling/expanding window): For temporal data; avoids training on the future.
- Nested CV: Use inner CV for tuning and outer CV for unbiased performance estimation.
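As a quick orientation, here is how these splitters are typically instantiated (X, y, and groups are placeholders, as in the snippets later on this page); nested CV is a pattern built from two splitters rather than a separate class:
from sklearn.model_selection import (
    KFold, StratifiedKFold, GroupKFold, LeaveOneOut, TimeSeriesSplit
)
kf = KFold(n_splits=5, shuffle=True, random_state=42)             # IID data
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # preserves class ratios
gkf = GroupKFold(n_splits=5)                                      # pass groups= to split()
loo = LeaveOneOut()                                               # one sample per validation fold
tss = TimeSeriesSplit(n_splits=5)                                 # never shuffle temporal data
# Every splitter yields (train_idx, val_idx) index pairs:
for train_idx, val_idx in gkf.split(X, y, groups=groups):
    pass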
Metric selection quick guide
Classification
- Accuracy: Only if classes are balanced and errors are equally costly.
- Precision: Of the predicted positives, how many are truly positive.
- Recall: Of all actual positives, how many did we catch.
- F1: Harmonic mean of precision and recall; good for imbalanced classes.
- ROC-AUC: Ranking quality across thresholds; can be optimistic on heavy imbalance.
- PR-AUC: Better than ROC-AUC when positives are rare.
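A sketch of how these are computed with sklearn.metrics, assuming y_true holds the labels, y_pred thresholded predictions, and y_score positive-class probabilities (all placeholders):
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score,
)
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)             # of predicted positives, fraction correct
rec = recall_score(y_true, y_pred)                 # of actual positives, fraction caught
f1 = f1_score(y_true, y_pred)                      # harmonic mean of precision and recall
roc_auc = roc_auc_score(y_true, y_score)           # needs scores/probabilities, not hard labels
pr_auc = average_precision_score(y_true, y_score)  # PR-AUC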
Regression
- MAE: Average absolute error; robust to outliers; easy to explain.
- MSE / RMSE: Squared error penalizes large mistakes more; RMSE is in target units.
- R²: Proportion of variance explained; can be misleading if baseline is poor.
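And the regression counterparts, again assuming y_true and y_pred arrays:
import math
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
mae = mean_absolute_error(y_true, y_pred)
rmse = math.sqrt(mean_squared_error(y_true, y_pred))  # RMSE, in target units
r2 = r2_score(y_true, y_pred)                         # compare against a sensible baseline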
Other considerations
- Macro vs. micro averaging: Macro treats classes equally; micro is instance-weighted.
- Threshold tuning: Pick a threshold by optimizing a metric on validation folds, then lock it for test.
- Report mean ± standard deviation across folds, and keep fold-wise results for debugging.
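A sketch of the averaging and threshold-tuning points, assuming y_true/y_pred for a multi-class task and y_val/val_scores for validation labels and positive-class probabilities (all placeholders; the candidate grid is illustrative):
import numpy as np
from sklearn.metrics import f1_score
# Macro vs. micro averaging on a multi-class problem
f1_macro = f1_score(y_true, y_pred, average="macro")  # every class counts equally
f1_micro = f1_score(y_true, y_pred, average="micro")  # every instance counts equally
# Threshold tuning: pick the threshold that maximizes F1 on validation scores, then lock it
thresholds = np.linspace(0.05, 0.95, 19)
best_t = max(thresholds, key=lambda t: f1_score(y_val, (val_scores >= t).astype(int)))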
Framework blueprints (sklearn-style)
Reusable pattern
# Classification with stratified CV and pipeline
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, f1_score
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=200)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
score = cross_val_score(pipe, X, y, cv=cv, scoring=make_scorer(f1_score))
print(score.mean(), score.std())
Time series pattern
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_absolute_error
cv = TimeSeriesSplit(n_splits=5)
mae_scores = []
for train_idx, val_idx in cv.split(X):
    model = ...  # your regressor
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[val_idx])
    mae_scores.append(mean_absolute_error(y[val_idx], pred))
print(sum(mae_scores) / len(mae_scores))
Worked examples
Example 1: Imbalanced classification with F1 and PR-AUC
- Use StratifiedKFold (5 folds, shuffle).
- Evaluate F1 and PR-AUC; compare to accuracy.
- Pick the model with best mean F1; report std.
from sklearn.metrics import f1_score, average_precision_score
from sklearn.model_selection import StratifiedKFold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
f1s, pras = [], []
for tr, va in cv.split(X, y):
    m = ...  # e.g., gradient boosting
    m.fit(X[tr], y[tr])
    p = m.predict(X[va])
    f1s.append(f1_score(y[va], p))
    proba = getattr(m, "predict_proba", None)
    if proba is not None:
        pras.append(average_precision_score(y[va], proba(X[va])[:, 1]))
print("F1:", sum(f1s) / len(f1s), "PR-AUC:", sum(pras) / len(pras))
Self-check: If accuracy is high but F1 is low, your positives are likely rare and being missed.
Example 2: Regression model choice with MAE vs. RMSE
- Compare Linear Regression vs. Random Forest using 5-fold CV.
- Compute MAE and RMSE per fold.
- Pick the metric based on business impact: if large errors are disproportionately costly, choose RMSE; if costs scale linearly with error size, choose MAE.
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error, mean_squared_error
import math
cv = KFold(n_splits=5, shuffle=True, random_state=1)
mae_lr, rmse_lr = [], []
for tr, va in cv.split(X):
    lr = ...  # LinearRegression()
    lr.fit(X[tr], y[tr])
    p = lr.predict(X[va])
    mae_lr.append(mean_absolute_error(y[va], p))
    rmse_lr.append(math.sqrt(mean_squared_error(y[va], p)))
print("LR MAE mean:", sum(mae_lr) / len(mae_lr))
print("LR RMSE mean:", sum(rmse_lr) / len(rmse_lr))
Self-check: If RMSE is much larger than MAE, a few large errors dominate; inspect outliers.
Example 3: Time series with expanding window
- Ensure each validation fold is strictly ahead of training in time.
- Use TimeSeriesSplit; compute MAE or sMAPE.
- Aggregate mean score across folds; avoid shuffling.
from sklearn.model_selection import TimeSeriesSplit
cv = TimeSeriesSplit(n_splits=4)
for tr, va in cv.split(X):
    model = ...
    model.fit(X[tr], y[tr])
    pred = model.predict(X[va])
    # compute MAE/sMAPE
Self-check: If you see unusually high validation performance, verify you did not leak future features (e.g., target lag alignment).
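scikit-learn has no built-in sMAPE, so if you go that route, a minimal helper such as the one below can be dropped into the loop above (the 200 * |y_pred - y_true| / (|y_true| + |y_pred|) form is one common convention):
import numpy as np
def smape(y_true, y_pred):
    # Symmetric MAPE in percent; the small epsilon guards against division by zero
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = np.abs(y_true) + np.abs(y_pred) + 1e-9
    return np.mean(200.0 * np.abs(y_pred - y_true) / denom)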
Example 4: Nested CV for unbiased performance
- Outer CV (e.g., 5-fold) estimates generalization.
- Inner CV tunes hyperparameters (e.g., GridSearchCV).
- Only report outer fold scores.
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import f1_score
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
params = {"clf__C": [0.1, 1, 10]}
outer_scores = []
for tr, va in outer.split(X, y):
    pipe = ...  # pipeline
    gs = GridSearchCV(pipe, param_grid=params, cv=inner, scoring="f1")
    gs.fit(X[tr], y[tr])
    best = gs.best_estimator_
    p = best.predict(X[va])
    outer_scores.append(f1_score(y[va], p))
print(sum(outer_scores) / len(outer_scores))
Self-check: If you tune hyperparameters on the full dataset and then cross-validate, that is leakage. Tune only within inner folds.
Example 5: Group-aware CV to prevent leakage
- When multiple rows belong to the same user/order, use GroupKFold.
- Ensure groups do not appear in both train and validation of the same fold.
- Evaluate chosen metric per fold.
from sklearn.model_selection import GroupKFold
cv = GroupKFold(n_splits=5)
for tr, va in cv.split(X, y, groups=user_id):
    model = ...
    model.fit(X[tr], y[tr])
    pred = model.predict(X[va])
    # compute metric
Self-check: Confirm no group ID appears in both train and validation indices within a fold.
How to run a solid evaluation (step-by-step)
- Freeze the problem: Define target, positive class, metric(s), and deployment threshold needs.
- Pick CV scheme: IID → KFold/Stratified; temporal → TimeSeriesSplit; grouped → GroupKFold; tuning → Nested CV.
- Build a pipeline: Include preprocessing inside the pipeline so CV fits it on training folds only.
- Tune inside CV: Use GridSearchCV/RandomizedSearchCV with an inner CV.
- Aggregate and inspect: Report mean ± std; keep fold-wise metrics to spot instability.
- Lock decisions: Freeze threshold/hyperparameters; finally evaluate once on a held-out test set.
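For steps 3 to 5, cross_validate is a convenient way to keep fold-wise scores for several metrics at once (pipe, X, y as in the blueprint above):
from sklearn.model_selection import cross_validate, StratifiedKFold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = cross_validate(
    pipe, X, y, cv=cv,
    scoring=["f1", "average_precision"],  # built-in scorer names
    return_train_score=True,              # train scores help spot over/underfitting per fold
)
for name in ("test_f1", "test_average_precision"):
    print(name, results[name].mean(), "+/-", results[name].std(), results[name])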
Common mistakes and self-check
- Data leakage: Fitting scalers/encoders outside the pipeline leaks info. Self-check: Does each fold fit preprocessing only on training indices?
- Wrong CV for time series: Shuffling breaks temporal order. Self-check: Validation timestamps must be strictly after training.
- Over-relying on accuracy: On imbalance, use F1/PR-AUC. Self-check: Compare accuracy vs. F1; if they diverge, accuracy is misleading.
- Choosing metrics misaligned with costs: If big errors are expensive, prefer RMSE. Self-check: Map errors to money and choose accordingly.
- Reporting only best fold: Always report average and spread. Self-check: Keep fold-wise results.
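To make the leakage self-check concrete, here is a minimal contrast between the leaky and the safe pattern (StandardScaler and logistic regression are just stand-ins; with imputers, encoders, or feature selection fit on the full data the gap is usually larger):
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Leaky: the scaler sees validation rows before CV runs, so fold scores can be optimistic
X_scaled = StandardScaler().fit_transform(X)
leaky = cross_val_score(LogisticRegression(max_iter=200), X_scaled, y, cv=cv, scoring="f1")
# Safe: the pipeline refits the scaler on the training indices of each fold only
pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression(max_iter=200))])
safe = cross_val_score(pipe, X, y, cv=cv, scoring="f1")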
Practical projects
- Credit default classifier: StratifiedKFold with F1/PR-AUC; tune threshold for a fixed precision.
- House prices regressor: Compare MAE vs. RMSE; report which aligns with business penalties.
- Sales forecasting: TimeSeriesSplit with expanding window; compare naive baseline vs. model.
- User-level churn model: GroupKFold by user; show how standard K-Fold overestimates performance.
Learning path
- Start: K-Fold and accuracy/MAE basics.
- Next: StratifiedKFold, F1, PR-AUC for imbalance.
- Then: Pipelines and nested CV for tuning.
- Advanced: TimeSeriesSplit, GroupKFold, custom scoring, threshold calibration.
Exercises
Do the tasks below. They mirror the exercise section on this page. Check your work using the solution toggles.
Exercise 1 — Compare two classifiers with stratified 5-fold CV
- Prepare a binary classification dataset (real or simulated). Ensure class imbalance (~10–20% positives) if possible.
- Build a pipeline with scaling and a linear model (e.g., logistic regression).
- Build a second model (e.g., tree-based).
- Use StratifiedKFold(n_splits=5, shuffle=True, random_state=42).
- Compute per-fold precision, recall, F1, and PR-AUC for both models.
- Report mean ± std for each metric. Pick the winner by F1. Briefly justify.
- Checklist:
- Preprocessing inside pipeline.
- No leakage across folds.
- Stratified splits used.
- At least two metrics compared beyond accuracy.
Need a nudge?
- If probabilities are available, compute PR-AUC via average_precision_score.
- For precision/recall/F1, use predictions from a 0.5 threshold.
- Report both mean and standard deviation across folds.
Mini challenge
You must deliver a fraud model with at least 0.80 precision and as high recall as possible. Using stratified 5-fold CV, design a procedure to choose a probability threshold and report expected precision/recall after deployment. Write your steps and which plots/metrics you will inspect. Keep it to 6–8 bullet points.
Next steps
- Apply nested CV to one of your existing projects and compare results to your previous single split.
- Add threshold tuning inside each fold and lock it before final test evaluation.
- Create a simple reporting template to always show mean ± std and fold-wise metrics.