Why this matters
As a Machine Learning Engineer, you need reliable estimates of model performance before shipping. Cross-validation (CV) keeps a single lucky split from skewing those estimates, and metrics translate model behavior into business-relevant numbers. You will use CV and metrics to:
- Select models and hyperparameters confidently.
- Guard against data leakage and optimistic results.
- Communicate performance trade-offs (e.g., precision vs. recall) to stakeholders.
- Build repeatable evaluation pipelines in ML frameworks.
Real tasks you will face
- Estimate uplift from a new classifier on imbalanced data with StratifiedKFold and F1.
- Compare regression models using MAE vs. RMSE to align with business penalties.
- Set up TimeSeriesSplit for forecasting without peeking into the future.
- Run nested CV to avoid leakage during hyperparameter tuning.
Who this is for and prerequisites
Who this is for
- Engineers deploying models and needing trustworthy evaluation.
- Data scientists moving from notebooks to production-ready pipelines.
- Analysts learning to interpret model metrics properly.
Prerequisites
- Basic Python and machine learning concepts (features, target, train/validation/test).
- Familiarity with a framework like scikit-learn; OK if you are still learning.
- Comfort with arrays, dataframes, and simple model training loops.
Concept explained simply
Cross-validation splits your data into several folds, trains on all but one fold, and validates on the held-out fold. Repeating this for every fold and averaging the metric reduces the luck of any single split.
Mental model
Imagine testing a running shoe on different tracks. If it performs well across many tracks, you trust it more. CV is your set of tracks; metrics are your stopwatch numbers.
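A minimal sketch of that idea, assuming numpy arrays X and y and any scikit-learn classifier (logistic regression here is just a stand-in):
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
cv = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, val_idx in cv.split(X):
    model = LogisticRegression(max_iter=200)   # any estimator would do
    model.fit(X[train_idx], y[train_idx])      # train on k-1 folds
    fold_scores.append(accuracy_score(y[val_idx], model.predict(X[val_idx])))  # validate on the held-out fold
print(np.mean(fold_scores), "+/-", np.std(fold_scores))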
Core techniques
Common CV patterns and when to use
- K-Fold: General purpose; shuffle data if IID assumptions hold.
- Stratified K-Fold: For classification with class imbalance; preserves class ratios.
- Group K-Fold: When samples share a group (e.g., multiple rows per user); prevents leakage across groups.
- Leave-One-Out (LOOCV): Fits one model per sample, so it is expensive; the estimate can also be high-variance. Rarely worth it beyond small datasets.
- TimeSeriesSplit (rolling/expanding window): For temporal data; avoids training on the future.
- Nested CV: Use inner CV for tuning and outer CV for unbiased performance estimation.
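As a quick orientation, here is how these splitters are typically instantiated (X, y, and groups are placeholders, as in the snippets later on this page); nested CV is a pattern built from two splitters rather than a separate class:
from sklearn.model_selection import (
    KFold, StratifiedKFold, GroupKFold, LeaveOneOut, TimeSeriesSplit
)
kf = KFold(n_splits=5, shuffle=True, random_state=42)             # IID data
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # preserves class ratios
gkf = GroupKFold(n_splits=5)                                      # pass groups= to split()
loo = LeaveOneOut()                                               # one sample per validation fold
tss = TimeSeriesSplit(n_splits=5)                                 # never shuffle temporal data
# Every splitter yields (train_idx, val_idx) index pairs:
for train_idx, val_idx in gkf.split(X, y, groups=groups):
    pass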
Metric selection quick guide
Classification
- Accuracy: Only if classes are balanced and errors are equally costly.
- Precision: Of the predicted positives, how many are truly positive.
- Recall: Of all actual positives, how many did we catch.
- F1: Harmonic mean of precision and recall; good for imbalanced classes.
- ROC-AUC: Ranking quality across thresholds; can be optimistic on heavy imbalance.
- PR-AUC: Better than ROC-AUC when positives are rare.
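A sketch of how these are computed with sklearn.metrics, assuming y_true holds the labels, y_pred thresholded predictions, and y_score positive-class probabilities (all placeholders):
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score,
)
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)             # of predicted positives, fraction correct
rec = recall_score(y_true, y_pred)                 # of actual positives, fraction caught
f1 = f1_score(y_true, y_pred)                      # harmonic mean of precision and recall
roc_auc = roc_auc_score(y_true, y_score)           # needs scores/probabilities, not hard labels
pr_auc = average_precision_score(y_true, y_score)  # PR-AUC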
Regression
- MAE: Average absolute error; robust to outliers; easy to explain.
- MSE / RMSE: Squared error penalizes large mistakes more; RMSE is in target units.
- R²: Proportion of variance explained; can be misleading if baseline is poor.
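And the regression counterparts, again assuming y_true and y_pred arrays:
import math
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
mae = mean_absolute_error(y_true, y_pred)
rmse = math.sqrt(mean_squared_error(y_true, y_pred))  # RMSE, in target units
r2 = r2_score(y_true, y_pred)                         # compare against a sensible baseline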
Other considerations
- Macro vs. micro averaging: Macro treats classes equally; micro is instance-weighted.
- Threshold tuning: Pick a threshold by optimizing a metric on validation folds, then lock it for test.
- Report mean ± standard deviation across folds, and keep fold-wise results for debugging.
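A sketch of the averaging and threshold-tuning points, assuming y_true/y_pred for a multi-class task and y_val/val_scores for validation labels and positive-class probabilities (all placeholders; the candidate grid is illustrative):
import numpy as np
from sklearn.metrics import f1_score
# Macro vs. micro averaging on a multi-class problem
f1_macro = f1_score(y_true, y_pred, average="macro")  # every class counts equally
f1_micro = f1_score(y_true, y_pred, average="micro")  # every instance counts equally
# Threshold tuning: pick the threshold that maximizes F1 on validation scores, then lock it
thresholds = np.linspace(0.05, 0.95, 19)
best_t = max(thresholds, key=lambda t: f1_score(y_val, (val_scores >= t).astype(int)))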
Framework blueprints (sklearn-style)
Reusable pattern
# Classification with stratified CV and pipeline
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, f1_score
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=200)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
score = cross_val_score(pipe, X, y, cv=cv, scoring=make_scorer(f1_score))
print(score.mean(), score.std())
Time series pattern
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_absolute_error
cv = TimeSeriesSplit(n_splits=5)
mae_scores = []
for train_idx, val_idx in cv.split(X):
    model = ...  # your regressor
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[val_idx])
    mae_scores.append(mean_absolute_error(y[val_idx], pred))
print(sum(mae_scores) / len(mae_scores))
Worked examples
Example 1: Imbalanced classification with F1 and PR-AUC
- Use StratifiedKFold (5 folds, shuffle).
- Evaluate F1 and PR-AUC; compare to accuracy.
- Pick the model with best mean F1; report std.
from sklearn.metrics import f1_score, average_precision_score
from sklearn.model_selection import StratifiedKFold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
f1s, pras = [], []
for tr, va in cv.split(X, y):
    m = ...  # e.g., gradient boosting
    m.fit(X[tr], y[tr])
    p = m.predict(X[va])
    f1s.append(f1_score(y[va], p))
    proba = getattr(m, "predict_proba", None)
    if proba is not None:
        pras.append(average_precision_score(y[va], proba(X[va])[:, 1]))
print("F1:", sum(f1s) / len(f1s), "PR-AUC:", sum(pras) / len(pras))
Self-check: If accuracy is high but F1 is low, your positives are likely rare and being missed.
Example 2: Regression model choice with MAE vs. RMSE
- Compare Linear Regression vs. Random Forest using 5-fold CV.
- Compute MAE and RMSE per fold.
- Pick the metric based on business impact: if large errors are disproportionately costly, choose RMSE; if costs scale linearly with error size, choose MAE.
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error, mean_squared_error
import math
cv = KFold(n_splits=5, shuffle=True, random_state=1)
mae_lr, rmse_lr = [], []
for tr, va in cv.split(X):
    lr = ...  # LinearRegression()
    lr.fit(X[tr], y[tr])
    p = lr.predict(X[va])
    mae_lr.append(mean_absolute_error(y[va], p))
    rmse_lr.append(math.sqrt(mean_squared_error(y[va], p)))
print("LR MAE mean:", sum(mae_lr) / len(mae_lr))
print("LR RMSE mean:", sum(rmse_lr) / len(rmse_lr))
Self-check: If RMSE is much larger than MAE, a few large errors dominate; inspect outliers.
Example 3: Time series with expanding window
- Ensure each validation fold is strictly ahead of training in time.
- Use TimeSeriesSplit; compute MAE or sMAPE.
- Aggregate mean score across folds; avoid shuffling.
from sklearn.model_selection import TimeSeriesSplit
cv = TimeSeriesSplit(n_splits=4)
for tr, va in cv.split(X):
    model = ...
    model.fit(X[tr], y[tr])
    pred = model.predict(X[va])
    # compute MAE/sMAPE
Self-check: If you see unusually high validation performance, verify you did not leak future features (e.g., target lag alignment).
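scikit-learn has no built-in sMAPE, so if you go that route, a minimal helper such as the one below can be dropped into the loop above (the 200 * |y_pred - y_true| / (|y_true| + |y_pred|) form is one common convention):
import numpy as np
def smape(y_true, y_pred):
    # Symmetric MAPE in percent; the small epsilon guards against division by zero
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = np.abs(y_true) + np.abs(y_pred) + 1e-9
    return np.mean(200.0 * np.abs(y_pred - y_true) / denom)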
Example 4: Nested CV for unbiased performance
- Outer CV (e.g., 5-fold) estimates generalization.
- Inner CV tunes hyperparameters (e.g., GridSearchCV).
- Only report outer fold scores.
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import f1_score
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
params = {"clf__C": [0.1, 1, 10]}
outer_scores = []
for tr, va in outer.split(X, y):
    pipe = ...  # pipeline
    gs = GridSearchCV(pipe, param_grid=params, cv=inner, scoring="f1")
    gs.fit(X[tr], y[tr])
    best = gs.best_estimator_
    p = best.predict(X[va])
    outer_scores.append(f1_score(y[va], p))
print(sum(outer_scores) / len(outer_scores))
Self-check: If you tune hyperparameters on the full dataset and then cross-validate, that is leakage. Tune only within inner folds.
Example 5: Group-aware CV to prevent leakage
- When multiple rows belong to the same user/order, use GroupKFold.
- Ensure groups do not appear in both train and validation of the same fold.
- Evaluate chosen metric per fold.
from sklearn.model_selection import GroupKFold
cv = GroupKFold(n_splits=5)
for tr, va in cv.split(X, y, groups=user_id):
    model = ...
    model.fit(X[tr], y[tr])
    pred = model.predict(X[va])
    # compute metric
Self-check: Confirm no group ID appears in both train and validation indices within a fold.
How to run a solid evaluation (step-by-step)
- Freeze the problem: Define target, positive class, metric(s), and deployment threshold needs.
- Pick CV scheme: IID → KFold/Stratified; temporal → TimeSeriesSplit; grouped → GroupKFold; tuning → Nested CV.
- Build a pipeline: Include preprocessing inside the pipeline so CV fits it on training folds only.
- Tune inside CV: Use GridSearchCV/RandomizedSearchCV with an inner CV.
- Aggregate and inspect: Report mean ± std; keep fold-wise metrics to spot instability.
- Lock decisions: Freeze threshold/hyperparameters; finally evaluate once on a held-out test set.
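For steps 3 to 5, cross_validate is a convenient way to keep fold-wise scores for several metrics at once (pipe, X, y as in the blueprint above):
from sklearn.model_selection import cross_validate, StratifiedKFold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = cross_validate(
    pipe, X, y, cv=cv,
    scoring=["f1", "average_precision"],  # built-in scorer names
    return_train_score=True,              # train scores help spot over/underfitting per fold
)
for name in ("test_f1", "test_average_precision"):
    print(name, results[name].mean(), "+/-", results[name].std(), results[name])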
Common mistakes and self-check
- Data leakage: Fitting scalers/encoders outside the pipeline leaks info. Self-check: Does each fold fit preprocessing only on training indices?
- Wrong CV for time series: Shuffling breaks temporal order. Self-check: Validation timestamps must be strictly after training.
- Over-relying on accuracy: On imbalance, use F1/PR-AUC. Self-check: Compare accuracy vs. F1; if they diverge, accuracy is misleading.
- Choosing metrics misaligned with costs: If big errors are expensive, prefer RMSE. Self-check: Map errors to money and choose accordingly.
- Reporting only best fold: Always report average and spread. Self-check: Keep fold-wise results.
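To make the leakage self-check concrete, here is a minimal contrast between the leaky and the safe pattern (StandardScaler and logistic regression are just stand-ins; with imputers, encoders, or feature selection fit on the full data the gap is usually larger):
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Leaky: the scaler sees validation rows before CV runs, so fold scores can be optimistic
X_scaled = StandardScaler().fit_transform(X)
leaky = cross_val_score(LogisticRegression(max_iter=200), X_scaled, y, cv=cv, scoring="f1")
# Safe: the pipeline refits the scaler on the training indices of each fold only
pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression(max_iter=200))])
safe = cross_val_score(pipe, X, y, cv=cv, scoring="f1")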
Practical projects
- Credit default classifier: StratifiedKFold with F1/PR-AUC; tune threshold for a fixed precision.
- House prices regressor: Compare MAE vs. RMSE; report which aligns with business penalties.
- Sales forecasting: TimeSeriesSplit with expanding window; compare naive baseline vs. model.
- User-level churn model: GroupKFold by user; show how standard K-Fold overestimates performance.
Learning path
- Start: K-Fold and accuracy/MAE basics.
- Next: StratifiedKFold, F1, PR-AUC for imbalance.
- Then: Pipelines and nested CV for tuning.
- Advanced: TimeSeriesSplit, GroupKFold, custom scoring, threshold calibration.
Exercises
Do the tasks below. They mirror the exercise section on this page. Check your work using the solution toggles.
Exercise 1 — Compare two classifiers with stratified 5-fold CV
- Prepare a binary classification dataset (real or simulated). Ensure class imbalance (~10–20% positives) if possible.
- Build a pipeline with scaling and a linear model (e.g., logistic regression).
- Build a second model (e.g., tree-based).
- Use StratifiedKFold(n_splits=5, shuffle=True, random_state=42).
- Compute per-fold precision, recall, F1, and PR-AUC for both models.
- Report mean ± std for each metric. Pick the winner by F1. Briefly justify.
- Checklist:
- Preprocessing inside pipeline.
- No leakage across folds.
- Stratified splits used.
- At least two metrics compared beyond accuracy.
Need a nudge?
- If probabilities are available, compute PR-AUC via average_precision_score.
- For precision/recall/F1, use predictions from a 0.5 threshold.
- Report both mean and standard deviation across folds.
Mini challenge
You must deliver a fraud model with at least 0.80 precision and as high recall as possible. Using stratified 5-fold CV, design a procedure to choose a probability threshold and report expected precision/recall after deployment. Write your steps and which plots/metrics you will inspect. Keep it to 6–8 bullet points.
Next steps
- Apply nested CV to one of your existing projects and compare results to your previous single split.
- Add threshold tuning inside each fold and lock it before final test evaluation.
- Create a simple reporting template to always show mean ± std and fold-wise metrics.