Why this matters
As a Computer Vision Engineer, you will find that bad splits silently ruin model results. Data leakage, class imbalance, duplicates across sets, and mismatched distributions inflate metrics that then collapse in production. Good splits let you tune hyperparameters, compare models fairly, and estimate real-world performance.
- You will regularly design dataset splits for new projects.
- You will audit vendor-delivered datasets for leakage.
- You will justify split strategy to teammates and stakeholders.
Concept explained simply
We divide labeled data into three parts:
- Train: The model learns patterns here.
- Validation: You tune and select models here.
- Test: You evaluate final performance once, at the end.
Mental model
Think of a science fair: you practice at home (train), do trial runs with friends (validation), and present to judges once (test). Don’t show practice answers to the judges.
Core rules for splitting CV datasets
- Match the prediction unit: Split by the entity you must generalize to (e.g., patient, video, product, camera, location).
- Avoid leakage: No near-duplicates or correlated samples across sets (e.g., frames from the same video, images from the same patient, same scene with minor edits).
- Keep distributions realistic: The validation and test sets should reflect production (classes, lighting, devices, seasons, locations).
- Use stratification for class imbalance: Ensure minority classes exist in validation and test.
- Use group-wise splits: When multiple samples come from a single source (e.g., patient, video, store), keep all samples from a group in the same split (see the sketch after this list).
- Use time-based splits if production is chronological: Train on past, validate on recent past, test on future.
- Freeze the test set: Don’t tune on it. Look at it at the end.
- Document everything: Ratios, grouping keys, random seeds, and checks you performed.
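To make the group-wise rule concrete, here is a minimal sketch using scikit-learn's GroupShuffleSplit. The column names image_id, group_id, and label are hypothetical placeholders for your own schema, and note that GroupShuffleSplit sizes splits by the share of groups, not samples, so verify the resulting image counts when group sizes vary.

```python
# Minimal sketch: a 70/15/15 group-wise split with scikit-learn.
# Column names (group_id, etc.) are hypothetical placeholders.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def group_split(df: pd.DataFrame, seed: int = 42):
    # Carve off ~70% of groups as train (shares are over groups,
    # so image counts can deviate if group sizes vary).
    outer = GroupShuffleSplit(n_splits=1, train_size=0.70, random_state=seed)
    train_idx, rest_idx = next(outer.split(df, groups=df["group_id"]))
    train, rest = df.iloc[train_idx], df.iloc[rest_idx]

    # Split the remaining groups into equal val/test halves, still group-wise.
    inner = GroupShuffleSplit(n_splits=1, train_size=0.50, random_state=seed)
    val_idx, test_idx = next(inner.split(rest, groups=rest["group_id"]))
    val, test = rest.iloc[val_idx], rest.iloc[test_idx]

    # Group isolation check: no group may appear in more than one split.
    g = [set(s["group_id"]) for s in (train, val, test)]
    assert g[0].isdisjoint(g[1]) and g[0].isdisjoint(g[2]) and g[1].isdisjoint(g[2])
    return train, val, test

# Usage (hypothetical): train, val, test = group_split(images_df)
```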
What ratios should I use?
- Common defaults: 70/15/15 or 80/10/10 (train/val/test).
- Small datasets: Consider k-fold cross-validation for model selection plus a small held-out test if possible (see the sketch after this list).
- Very large datasets: You can reduce validation/test percentages (e.g., 90/5/5) while keeping absolute counts high.
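For the small-dataset case, one possible protocol is to freeze a small stratified test set once, then run stratified k-fold on the remainder for model selection. This sketch uses scikit-learn with placeholder features and labels.

```python
# Sketch: hold out a small frozen test set, then use stratified k-fold
# on the remainder for model selection. Features/labels are placeholders.
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

X = np.arange(300).reshape(-1, 1)        # placeholder features
y = np.array([0] * 270 + [1] * 30)       # 10% minority class

# Stratified hold-out so the minority class is present in the test set.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42)

# 5-fold stratified cross-validation on the development portion.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (tr, va) in enumerate(cv.split(X_dev, y_dev)):
    print(f"fold {fold}: {len(tr)} train / {len(va)} val, "
          f"val positives: {y_dev[va].mean():.0%}")
```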
Special cases (CV-specific)
- Object detection/segmentation: Prevent the same scene or object instance from appearing in more than one split.
- Video tasks: Split by video (or by subject), not by frame.
- Medical imaging: Split by patient or study, not by image.
- Satellite/street-view: Split by location tiles to avoid spatial leakage (tile-key sketch below).
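A minimal sketch of the tile idea: derive a grouping key by snapping each image's coordinates to a grid cell, then split by that key. The 0.1° tile size is an arbitrary assumption to tune for your imagery.

```python
# Sketch: bucket coordinates into grid tiles and use the tile as the grouping key.
def tile_key(lat: float, lon: float, tile_deg: float = 0.1) -> str:
    """Snap a lat/lon pair to a tile_deg-sized tile; split by this key."""
    return f"{int(lat // tile_deg)}_{int(lon // tile_deg)}"

# All images falling in the same tile end up in the same split.
print(tile_key(40.4168, -3.7038))   # e.g., "404_-38"
```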
Worked examples
Example 1 — Detection across factories (group + stratified)
Dataset: 8,000 images from 15 factories, 120,000 labeled boxes across classes {person: 80k, vest: 37.6k, helmet: 2.4k}. Goal: a 70/15/15 split with factory-level grouping, ensuring helmet appears in both val and test.
- Choose grouping key: factory_id. All images from the same factory stay in one split.
- Target ratio: 70/15/15 by images. The number of factories per split can be uneven if some factories are larger.
- Stratify by class counts: Since helmet is rare (~2% of boxes), set a floor of roughly 50–100 helmet instances in each of val and test.
- Plan: Train ≈ 5,600 images, Val ≈ 1,200, Test ≈ 1,200. With 2.4k helmets total, expect ≈ 360 helmets each in val and test if helmets are spread roughly evenly across factories; verify actual counts after assignment, since rare classes often cluster in a few groups.
- Check: No factory overlaps across splits; class presence confirmed in val/test; random seed documented.
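One way to implement this plan is a greedy assignment of whole factories to splits, placing the largest factories first into whichever split is furthest below its target share. The per-factory image counts below are synthetic stand-ins for real metadata.

```python
# Sketch: assign whole factories to splits so image counts land near 70/15/15.
# The per-factory image counts are synthetic; use your real metadata.
import random

random.seed(0)
factory_images = {f"factory_{i:02d}": random.randint(300, 900) for i in range(15)}
total = sum(factory_images.values())
targets = {"train": 0.70, "val": 0.15, "test": 0.15}

splits = {name: [] for name in targets}
counts = dict.fromkeys(targets, 0)

# Greedy: largest factories first, each into the split furthest below target.
for factory, n in sorted(factory_images.items(), key=lambda kv: -kv[1]):
    deficit = {s: targets[s] - counts[s] / total for s in targets}
    best = max(deficit, key=deficit.get)
    splits[best].append(factory)
    counts[best] += n

for name in splits:
    print(name, counts[name], f"{counts[name] / total:.0%}")
# After assignment, count helmet instances per split and reassign factories
# (or re-seed) if val or test falls below the 50-instance minimum.
```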
Example 2 — Medical classification (patient-wise)
Dataset: 220 patients, each with 5 images. Task: Pneumonia classification.
- Prediction unit: patient. Split by patient_id.
- Ratio: 60/20/20 by patients → Train: 132 patients, Val: 44, Test: 44.
- Ensure class balance: Check positive/negative patient counts in each split; move whole patients if needed to balance.
- Check leakage: All images from each patient remain in the same split.
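A sketch of this patient-wise split using scikit-learn's train_test_split on patient IDs, stratified by a patient-level label; the IDs and the 30% positive rate are illustrative.

```python
# Sketch: 60/20/20 patient-wise split, stratified by patient-level label.
# Patient IDs and labels are illustrative placeholders.
from sklearn.model_selection import train_test_split

patients = [f"p{i:03d}" for i in range(220)]
labels = [1] * 66 + [0] * 154            # e.g., 30% positive patients

# 60% of patients for training, then the remaining 40% halved into val/test.
train_p, rest_p, train_y, rest_y = train_test_split(
    patients, labels, train_size=0.60, stratify=labels, random_state=42)
val_p, test_p, val_y, test_y = train_test_split(
    rest_p, rest_y, train_size=0.50, stratify=rest_y, random_state=42)

print(len(train_p), len(val_p), len(test_p))   # 132 44 44
# Every image then inherits its patient's split, so no patient spans two sets.
```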
Example 3 — Video action recognition (time-based)
Dataset: 1,000 clips recorded 2019–2022. Goal: Estimate future performance.
- Prediction unit: video_id. Split by video, not frame.
- Time split: Train on 2019–2020, Validate on 2021, Test on 2022.
- Balance classes within each period to avoid class drift between sets.
- Check: No overlap of the same original footage across sets.
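A sketch of the chronological split, assuming each clip's metadata carries a recording year; the clip records here are synthetic.

```python
# Sketch: chronological split by recording year. Clip metadata is synthetic.
from collections import Counter

clips = [{"video_id": f"v{i:04d}", "year": 2019 + i % 4, "label": i % 5}
         for i in range(1000)]

train = [c for c in clips if c["year"] <= 2020]   # 2019-2020
val   = [c for c in clips if c["year"] == 2021]
test  = [c for c in clips if c["year"] == 2022]

# Compare class distributions across periods to spot class drift.
for name, split in [("train", train), ("val", val), ("test", test)]:
    print(name, len(split), Counter(c["label"] for c in split))
```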
Practical workflow
- Define prediction unit: image, patient, video, location, etc.
- Pick a split strategy: random, stratified, group-wise, time-based, or combinations (e.g., stratified group splits).
- Set ratios and random seed; document them.
- Apply de-duplication (e.g., perceptual hashing) before splitting to reduce near-duplicates (see the hashing sketch after this list).
- Run checks:
- Group isolation: No group appears in more than one split.
- Class presence: Every class appears in val/test (as needed).
- Distribution sanity: Compare histograms of classes, resolutions, sources, times of day.
- No augmentation leakage: Don’t copy augmented versions into other splits.
- Freeze test set; use only train+val for iteration.
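For the de-duplication step, a minimal sketch using the third-party Pillow and imagehash packages (pip install Pillow imagehash); the 4-bit Hamming-distance threshold is an assumption to tune on your data, and the pairwise loop is fine for small sets but quadratic in image count.

```python
# Sketch: flag near-duplicate images via perceptual hashing before splitting.
# Requires third-party packages: pip install Pillow imagehash
from pathlib import Path
from PIL import Image
import imagehash

def find_near_duplicates(image_dir: str, max_distance: int = 4):
    """Return pairs of filenames whose perceptual hashes differ by
    at most max_distance bits (a tunable threshold)."""
    seen = {}          # filename -> perceptual hash
    pairs = []
    for path in sorted(Path(image_dir).glob("*.jpg")):
        h = imagehash.phash(Image.open(path))
        for other, other_h in seen.items():
            if h - other_h <= max_distance:   # Hamming distance between hashes
                pairs.append((path.name, other))
        seen[path.name] = h
    return pairs

# Merge each duplicate pair into one group (or drop one copy) before splitting.
```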
If the dataset is small or imbalanced
- Use k-fold cross-validation for model selection; keep a tiny untouched test set if possible.
- Use stratified splits; for detection/segmentation, stratify by images containing the class or by instance counts.
- Apply oversampling/augmentation only in the training set.
Exercises
Do these now, then review your answers against the self-check checklist below.
Exercise 1 — Factory detection split plan
You manage an object detection dataset: 8,000 images, 15 factories, 120,000 boxes ({person: 80k, vest: 37.6k, helmet: 2.4k}). Create a 70/15/15 split that:
- Groups by factory_id (no factory appears in more than one split).
- Stratifies so helmet appears in val and test with at least 50 instances each.
- Documents estimated image counts and helmet counts per split.
Deliverable: A short plan listing split ratios, group rule, and checks you will run.
Exercise 2 — Patient-wise medical split
You have 220 patients (5 images each). Target is pneumonia classification with 30% positive patients overall. Build a 60/20/20 patient-wise split and estimate patient counts per class in each set. State how you will adjust if validation ends up with only 10% positives.
- Self-check checklist:
- Did you define the grouping key and stick to it?
- Did you ensure minority class presence in val and test?
- Did you freeze the test set and avoid peeking when tuning?
- Did you document seed, ratios, and checks?
Common mistakes and self-check
- Leakage via near-duplicates: Same scene cropped differently ends up in train and val. Fix: deduplicate before splitting; group by scene.
- Subject leakage: Same patient/vehicle/actor in multiple sets. Fix: group-wise splitting.
- Time leakage: Using future data for training. Fix: time-based split.
- Imbalance ignored: Minority class absent in val/test. Fix: stratified/group-stratified splitting and minimum count checks.
- Tuning on the test set: Re-using test for hyperparameter decisions. Fix: keep a strict final test; optionally create a dev-test separate from final test.
- Augmentation leakage: Augmented copies spread across sets. Fix: generate augmentations only within the training pipeline.
Mini challenge
You are building a street-sign detector with 20 cities, images collected across seasons and day/night. Propose a split strategy that tests generalization to new seasons and to unseen cities. Specify: grouping keys, ratios, how you ensure rare sign types appear in val/test, and the checks you would run.
Who this is for
- Computer Vision Engineers and Data Scientists preparing datasets.
- Labeling leads and MLEs auditing data quality.
- Students learning robust evaluation practices.
Prerequisites
- Basic understanding of supervised learning and evaluation metrics.
- Familiarity with your dataset schema (IDs, labels, metadata).
Learning path
- Understand prediction unit and leakage risks.
- Choose split strategy (random/stratified/group/time-based).
- Implement, verify distributions, and document seed and rules.
- Freeze test set and iterate on train/val only.
Practical projects
- Retail shelf detector: Group by store and ensure rare product classes appear in val/test.
- Bird species classifier: Group by photographer to avoid same shoot across splits; stratify rare species.
- Road segmentation: Tile imagery by geographic grid; split by tiles and time, verify class coverage.
Next steps
- Learn data labeling quality checks and consensus strategies.
- Practice data versioning and experiment tracking to keep splits reproducible.
- Study augmentation strategies that reduce overfitting without causing leakage.