Train, validation, and test sets
- Train set: used to fit model parameters.
- Validation set: used to tune choices (hyperparameters, thresholds, features). You may reuse it during development.
- Test set: used once at the end to report final performance. Keep it fully untouched during model development.
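Before choosing a strategy, it helps to see the mechanics. Below is a minimal sketch of a two-step 70/15/15 holdout, assuming scikit-learn; the synthetic dataset stands in for your own X and y.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for your own feature matrix X and labels y.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Carve off 15% for test first, then split the rest into train/validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42
)
# 0.15 / 0.85 of the remainder ~= 15% of the original data.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.15 / 0.85, random_state=42
)
```

Carving off the test set first makes it easy to keep it untouched while you iterate on the remaining 85%.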
Choosing a split strategy
Pick the strategy that matches your data and goal:
- Classic holdout (i.i.d. data, enough samples): 70/15/15 or 60/20/20 train/val/test.
- Imbalanced classification: use stratified splits so class ratios are similar across sets.
- Grouped data (e.g., multiple rows per user/patient): use group-aware splitting to keep all samples from a group in the same set.
- Time series: split by time. Validation/test must be later than train. Use rolling/expanding window CV if possible.
- Small datasets: prefer k-fold cross-validation (e.g., 5-fold) for model selection; keep a small final test set if feasible.
- Hyperparameter tuning: use cross-validation on the train set or a dedicated validation set; for rigorous estimates, use nested cross-validation.
- Feature preprocessing: fit scalers/encoders only on train; apply them to val/test to avoid leakage (see the pipeline sketch after this list).
- Multiple models: compare on the same validation protocol; only evaluate the final choice on test.
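The preprocessing rule deserves a concrete illustration. A minimal sketch, assuming scikit-learn: putting the scaler and model in one Pipeline guarantees that, during cross-validation, the scaler is fitted on the training fold only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, random_state=0)  # stand-in data

# The pipeline is refit from scratch on each training fold, so the scaler
# never sees validation rows at fit time -- no preprocessing leakage.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)
print("mean CV accuracy:", scores.mean())
```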
Rules of thumb
- Keep test set sacred — evaluate once at the end.
- If classes are skewed or events are rare, stratify; if rows share a group (user, patient, session), split by group.
- For time series, never shuffle across time.
- When in doubt, simulate your real-world scenario in your split.
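These rules map directly onto scikit-learn's splitter classes. A quick-reference sketch, with variable names that are purely illustrative:

```python
from sklearn.model_selection import (
    GroupKFold,
    KFold,
    StratifiedKFold,
    TimeSeriesSplit,
)

cv_iid = KFold(n_splits=5, shuffle=True, random_state=0)               # i.i.d. rows
cv_skewed = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # keeps class ratios per fold
cv_grouped = GroupKFold(n_splits=5)        # pass groups=... to .split() to isolate users/patients
cv_temporal = TimeSeriesSplit(n_splits=5)  # validation folds always come after training folds
```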
Worked examples
Example 1 — Balanced tabular classification (i.i.d.)
Data: 50,000 rows, 10 numeric features, balanced classes.
- Split: 70% train, 15% validation, 15% test (stratification is optional with balanced classes).
- Flow: Fit model on train; tune hyperparameters on validation; once finalized, retrain on train+validation; evaluate once on test.
- Why: Plenty of i.i.d. data makes a simple holdout reliable.
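Continuing the split sketch above, a minimal version of this flow; best_params is a hypothetical stand-in for whatever your validation tuning produced.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Reuses X_train, X_val, X_test, y_train, y_val, y_test from the earlier sketch.
best_params = {"n_estimators": 300, "max_depth": 8}  # hypothetical tuning result

# Once tuning is done, validation data is free to use for the final fit.
X_final = np.vstack([X_train, X_val])
y_final = np.concatenate([y_train, y_val])

final_model = RandomForestClassifier(**best_params, random_state=42).fit(X_final, y_final)
print("test accuracy:", accuracy_score(y_test, final_model.predict(X_test)))  # one evaluation only
```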
Example 2 — Imbalanced fraud detection
Data: 200,000 transactions, 0.5% fraud.
- Split: Stratified 70/15/15 to keep positive rate stable.
- Training: Use class weights or focal loss; tune the decision threshold on validation to optimize F1 or recall at a fixed precision (sketched below).
- Sanity checks: No user overlap leakage; no target leakage via post-transaction features.
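A sketch of the threshold-tuning step, assuming scikit-learn; the 0.80 precision floor is an assumed target, and the synthetic data mimics the roughly 0.5% positive rate.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in: ~0.5% positives, like the fraud example.
X, y = make_classification(n_samples=20_000, weights=[0.995], flip_y=0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)
val_scores = clf.predict_proba(X_val)[:, 1]

# Pick the threshold with the best recall among those meeting the precision floor.
precision, recall, thresholds = precision_recall_curve(y_val, val_scores)
meets_floor = precision[:-1] >= 0.80    # thresholds has one fewer entry than precision/recall
best = np.argmax(recall[:-1] * meets_floor)  # assumes at least one threshold qualifies
print("chosen threshold:", thresholds[best])
```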
Example 3 — Time series demand forecasting
Data: Daily sales 2019-01-01 to 2024-06-30.
- Split by time: Train = 2019-01-01 to 2023-12-31; Validation = 2024-01-01 to 2024-03-31; Test = 2024-04-01 to 2024-06-30.
- Rolling CV: Multiple validation windows (e.g., Jan, Feb, Mar 2024) to choose hyperparameters robustly.
- Leakage guard: Only use features available up to prediction time (lagged features, moving averages fitted on past only).
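A minimal pandas sketch of this date-based split plus leakage-safe lag features; random numbers stand in for real sales.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2019-01-01", "2024-06-30", freq="D")
df = pd.DataFrame({"sales": np.random.default_rng(0).random(len(idx))}, index=idx)

# Lag features first, computed strictly from the past (shift prevents same-day peeking).
df["sales_lag_7"] = df["sales"].shift(7)                     # value from a week earlier
df["sales_ma_28"] = df["sales"].shift(1).rolling(28).mean()  # trailing 28-day average

train = df.loc["2019-01-01":"2023-12-31"]
val = df.loc["2024-01-01":"2024-03-31"]
test = df.loc["2024-04-01":"2024-06-30"]
# For rolling CV, scikit-learn's TimeSeriesSplit yields expanding train windows
# with validation folds that are always later in time.
```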
Example 4 — Grouped medical data
Data: 12,000 records from 2,000 patients, multiple visits each.
- Split: Group-aware by patient IDs so each patient appears in exactly one set.
- Ratios: 70/15/15 by patients, not by rows.
- Reason: Prevent the model from seeing the same patient in train and validation, which would inflate metrics.
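A sketch of the patient-level split using scikit-learn's GroupShuffleSplit; the synthetic IDs stand in for real patient identifiers.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n = 12_000
patient_ids = rng.integers(0, 2_000, size=n)  # stand-in for real patient IDs
X = rng.normal(size=(n, 5))
y = rng.integers(0, 2, size=n)

# 15% of *patients* (not rows) go to test; repeat on the remainder to carve out validation.
gss = GroupShuffleSplit(n_splits=1, test_size=0.15, random_state=0)
trainval_idx, test_idx = next(gss.split(X, y, groups=patient_ids))
assert set(patient_ids[trainval_idx]).isdisjoint(patient_ids[test_idx])  # no patient overlap
```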
Step-by-step: implement your split
- Define evaluation goal. What metric and deployment scenario? Classification threshold? Forecast horizon?
- Identify data structure. i.i.d., time-ordered, groups, imbalance.
- Choose split method. Holdout, stratified, group, time-based, k-fold, or nested CV.
- Prevent leakage. Fit preprocessing on train only. Keep temporal order. Isolate groups.
- Tune and select. Use validation (or CV) to choose hyperparameters and thresholds.
- Finalize. Retrain on train+validation with chosen settings; evaluate once on test; record metrics and confidence intervals where possible.
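A compact sketch tying these steps together, assuming scikit-learn: GridSearchCV does the tuning by cross-validation on the training portion, and the test set is scored exactly once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=5_000, random_state=0)  # stand-in data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=0
)

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression(max_iter=1000))])
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=5)  # leakage-safe tuning on train
search.fit(X_train, y_train)  # refit=True retrains the best pipeline on all of X_train

print("test score:", search.score(X_test, y_test))  # the one and only test evaluation
```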
Mini checklist before you evaluate
- Test set untouched until final evaluation
- Preprocessing fitted only on train
- Correct split type for data structure (i.i.d./time/group)
- Metrics match business goal
- Threshold/feature selection decided using validation only
Exercises
Do these to lock in understanding.
Exercise 1: Design a split plan for imbalanced credit defaults
Goal: Propose ratios, split type, tuning approach, and leakage checks.
- Data: 500,000 customers, 2% default rate, multiple loans per customer.
- Target metric: Recall at 10% false positive rate.
Hints
- Imbalance + multiple loans per customer → stratification and grouping.
- Threshold tuning belongs on validation.
Exercise 2: Time-based split for energy forecasting
Goal: Create a rolling validation plan for a 7-day ahead forecast.
- Data: Hourly energy usage, 2022-01-01 to 2024-12-31.
- Target metric: MAPE on the next 7 days.
Hints
- Use expanding or sliding windows.
- Validation windows should simulate the 7-day horizon.
Exercise completion checklist
- You specified split ratios and method
- You named the evaluation metric and where to optimize it
- You listed leakage risks and controls
- You explained how you will finalize and test once
Common mistakes and self-check
- Peeking at test set. Self-check: Did you change anything after viewing test metrics? If yes, you invalidated the test.
- Random split for time series. Self-check: Do any training rows occur after validation rows in time? If yes, fix with a time-based split.
- Ignoring groups. Self-check: Can the same user/patient appear in both train and validation? If yes, use group-aware splitting.
- Leaky preprocessing. Self-check: Are scalers/imputers fitted on full data? Refit them on train only.
- Using accuracy on imbalanced data. Self-check: Does a dummy model score high? Choose metrics like ROC-AUC, PR-AUC, recall/precision at target FPR.
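For the last self-check, a dummy baseline makes the accuracy trap concrete; a sketch with a synthetic 99/1 dataset:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, average_precision_score

X, y = make_classification(n_samples=10_000, weights=[0.99], flip_y=0, random_state=0)

# Always predicts the majority class -- no learning at all.
dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
print("accuracy:", accuracy_score(y, dummy.predict(X)))  # ~0.99, dangerously flattering
print("PR-AUC:", average_precision_score(y, dummy.predict_proba(X)[:, 1]))  # ~0.01, the base rate
```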
Practical projects
- Customer churn prediction: Group-aware stratified split by customer ID; compare threshold policies at fixed churn budget.
- Retail sales forecasting: Rolling window validation across seasons; choose horizon-specific features.
- Click-through prediction: Session-level grouping; evaluate PR-AUC and calibration drift between validation and test.
Mini challenge
You have 80,000 emails with 1% spam. Each sender may appear multiple times. Build a split plan and list three leakage checks you will perform. Keep the test absolutely untouched. Write your plan in 5 bullet points.
Who this is for
- Beginners who know basic supervised learning and want trustworthy evaluation.
- Practitioners switching to time series or grouped datasets.
- Anyone preparing for ML interviews or productionizing models.
Prerequisites
- Basic ML concepts: train vs. test, overfitting, common metrics.
- Comfort with data preprocessing (scaling, encoding, imputation).
- Awareness of your business metric (what matters in production).
Learning path
- Review evaluation metrics suited to your problem.
- Learn split strategies: holdout, stratified, group, time-based.
- Practice with k-fold and nested CV for tuning.
- Apply to a real dataset; document decisions and leakage checks.
- Finalize: retrain on train+val; evaluate once on test and report.
Next steps
- Implement the split strategy in your current project and log every decision.
- Run at least one alternative split (e.g., different validation windows) to test robustness.
- Proceed to the quick test below to confirm understanding.
Quick Test
Take the short quiz to check your understanding.