Why this matters
In computer vision training, your GPU can sit idle if the data loader cannot decode, augment, and deliver batches fast enough. Efficient data loaders keep the GPU fed so you train faster, iterate more, and reduce compute costs. Real tasks include tuning PyTorch DataLoader or tf.data pipelines, selecting the right number of workers, pinning memory, ordering cache/shuffle/batch/prefetch correctly, and choosing file formats that minimize I/O overhead.
- Faster experiments: More epochs per day and quicker ablations.
- Stable training: Smooth input pipelines reduce training jitter and timeouts.
- Scalable pipelines: Same principles carry to multi-GPU and cluster training.
Concept explained simply
Think of your data pipeline like a factory line:
- Read bytes from storage (disk/remote)
- Decode image/video
- Apply transforms/augmentations
- Batch samples
- Transfer to GPU
The speed of the line equals the speed of the slowest step. You optimize by parallelizing CPU work, reducing I/O overhead, and overlapping CPU/GPU work.
Mental model
Imagine your model as a hungry GPU that must be fed N images every second. The CPU team (workers) prepares meals (decode/augment) and the runner (pinned memory transfer) delivers to the GPU table. If meals are late, the GPU starves; if meals are early, a small buffer (prefetch) prevents hiccups.
Throughput targets by model size (approximate)
- Small CNNs: 1–5k imgs/sec
- Mid-size models (e.g., ResNet-50): 300–800 imgs/sec
- Large ViTs: 100–300 imgs/sec
Use these only as sanity checks. Actual numbers depend on hardware, augmentations, and image sizes.
Key tools and knobs
- Parallelism: num_workers (PyTorch), num_parallel_calls and interleave (TensorFlow)
- Overlap: prefetch_factor (PyTorch), dataset.prefetch(AUTOTUNE) (TensorFlow)
- Transfer speed: pin_memory=True plus non_blocking copies (PyTorch); prefetch to the accelerator where supported (TensorFlow)
- Stability: persistent_workers=True (PyTorch), deterministic options (TF) when needed
- Order of ops: read/decode → cache (if fits) → shuffle → map/augment → batch → prefetch
- File strategy: Prefer NVMe/SSD, avoid millions of tiny files; use sharded records (e.g., TFRecord) when possible
- Transforms: Move heavy aug to GPU where available; keep CPU aug vectorized
- Collation: Efficient collate_fn; avoid expensive Python object manipulation
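As a hedged sketch of the collation point, here is a minimal custom collate_fn assuming each sample is a (C, H, W) tensor paired with an integer label. The default collate already handles this simple case; writing your own matters when samples are not plain tensors or when you want to keep per-sample Python work out of the batching path.
# Minimal collate_fn sketch (assumes (tensor, int_label) samples)
import torch

def fast_collate(samples):
    # torch.stack produces one contiguous (B, C, H, W) tensor in a single call
    images = torch.stack([img for img, _ in samples], dim=0)
    # Build the label tensor once instead of appending Python objects per sample
    labels = torch.tensor([label for _, label in samples], dtype=torch.long)
    return images, labels

# loader = DataLoader(train_ds, batch_size=128, num_workers=8, collate_fn=fast_collate)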
How to choose num_workers (PyTorch)
- Start with num_workers = number of physical CPU cores / 2
- Measure images/sec for 1, 2, 4, 8, 12, 16 workers
- Pick the smallest value within 5% of the best throughput to save CPU
Use persistent_workers=True after you find a good value to avoid per-epoch startup overhead.
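A minimal sketch of that sweep, assuming a Dataset named train_ds (defined as in the worked example below) with enough batches to cover the measurement window:
# num_workers sweep sketch (assumes an existing Dataset named train_ds)
import time
from torch.utils.data import DataLoader

def imgs_per_sec(loader, num_batches=50):
    it = iter(loader)
    next(it)                      # warm up workers and prefetch buffers
    t0, n = time.time(), 0
    for _ in range(num_batches):
        batch = next(it)
        n += batch[0].shape[0]
    return n / (time.time() - t0)

for workers in [1, 2, 4, 8, 12, 16]:
    loader = DataLoader(train_ds, batch_size=128, shuffle=True,
                        num_workers=workers, pin_memory=True)
    print(workers, 'workers ->', round(imgs_per_sec(loader)), 'imgs/sec')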
Worked examples
Example 1: PyTorch — baseline vs tuned
# Baseline DataLoader (slow)
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
train_ds = datasets.ImageFolder('data/train', transform=transform)
baseline = DataLoader(train_ds, batch_size=64, shuffle=True, num_workers=0)

# Tuned DataLoader
opt = DataLoader(
    train_ds,
    batch_size=128,
    shuffle=True,
    num_workers=8,
    pin_memory=True,
    prefetch_factor=4,
    persistent_workers=True,
)
# Measuring throughput (no model, data-only)
# Note: with num_workers > 0, run this under if __name__ == '__main__' on platforms that spawn workers
import time

for loader in [baseline, opt]:
    # Warm up: spin up workers and fill prefetch buffers
    it = iter(loader)
    for _ in range(10):
        next(it)
    # Measure on the same warmed-up iterator
    t0 = time.time()
    n = 0
    for i, batch in enumerate(it):
        n += batch[0].shape[0]
        if i == 100:
            break
    dt = time.time() - t0
    print('imgs/sec:', n / dt)
Key improvements: a larger batch size (if it fits in GPU memory), multiple workers, pinned memory, and prefetching. Always pass non_blocking=True when calling .to(device) so host-to-GPU copies can overlap with compute.
# Example transfer inside the training loop (pinned memory makes non_blocking copies asynchronous)
images = images.to(device, non_blocking=True)
labels = labels.to(device, non_blocking=True)
Example 2: TensorFlow — tf.data with AUTOTUNE
import tensorflow as tf

IMG_SIZE = (224, 224)
BATCH = 128

files = tf.data.Dataset.list_files('data/train/*/*.jpg', shuffle=True)

def load_decode(path):
    img = tf.io.read_file(path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, IMG_SIZE)
    img = tf.cast(img, tf.float32) / 255.0
    # The class name is the parent directory of the file
    label = tf.strings.split(path, '/')[-2]
    return img, label

# Parallel decode; interleave is mainly useful when each element expands into a
# file-backed dataset (e.g., sharded TFRecords), so a parallel map suffices here
pipeline = files.map(load_decode, num_parallel_calls=tf.data.AUTOTUNE)

# If RAM allows, cache decoded images for subsequent epochs
pipeline = pipeline.cache()

# Shuffle after cache so you reshuffle cheaply each epoch
pipeline = pipeline.shuffle(10_000)

pipeline = pipeline.batch(BATCH, drop_remainder=True).prefetch(tf.data.AUTOTUNE)

# Iterate once to test speed
for imgs, labels in pipeline.take(1):
    print(imgs.shape)
Ordering is crucial: decode → cache → shuffle → batch → prefetch. Use AUTOTUNE to adapt to your hardware.
Example 3: PyTorch — class balance without killing throughput
from torch.utils.data import DataLoader, WeightedRandomSampler
import numpy as np

# Assume train_ds.targets contains integer labels
class_counts = np.bincount(train_ds.targets)
class_weights = 1.0 / class_counts
sample_weights = class_weights[train_ds.targets]

sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(train_ds),  # keep epoch length stable
    replacement=True,
)

loader_balanced = DataLoader(
    train_ds, batch_size=128, sampler=sampler, num_workers=8,
    pin_memory=True, prefetch_factor=4, persistent_workers=True,
)
A sampler lets you rebalance classes while retaining parallel loading and prefetching.
How to measure and iterate
- Start with a simple baseline. Record imgs/sec and GPU utilization.
- Increase num_workers until gains flatten. Enable persistent_workers.
- Use pin_memory and non_blocking transfers on CUDA.
- Enable prefetching. Adjust prefetch_factor (PyTorch) or use prefetch(AUTOTUNE) (TF).
- Reorder ops: cache early if possible, shuffle after cache, batch then prefetch.
- Consider file strategy: switch to SSD/NVMe, or shard into larger records if many tiny files.
- Move heavy augmentations to GPU where supported, or reduce their cost.
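As a hedged sketch of moving augmentation onto the GPU, here is batch-level normalization plus a random horizontal flip applied to an already-transferred batch using plain torch ops (no extra libraries assumed; the 'cuda' device and ImageNet statistics are assumptions for illustration):
# GPU-side augmentation sketch: operate on the whole batch after the host->GPU copy
import torch

mean = torch.tensor([0.485, 0.456, 0.406], device='cuda').view(1, 3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225], device='cuda').view(1, 3, 1, 1)

def gpu_augment(images):
    # images: (B, 3, H, W) float tensor already on the GPU
    images = (images - mean) / std                     # batched normalize
    flip = torch.rand(images.shape[0], device=images.device) < 0.5
    images[flip] = torch.flip(images[flip], dims=[3])  # random horizontal flip
    return images

# Inside the training loop, after images.to(device, non_blocking=True):
# images = gpu_augment(images)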
Mini measurement task
Time 100 batches, compute imgs/sec, and observe GPU utilization with a system monitor. Change only one knob at a time.
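A minimal sketch of that measurement, assuming a DataLoader named loader; watch GPU utilization separately with a tool such as nvidia-smi while it runs:
# Time 100 batches and report imgs/sec (data-only loop)
import time

it = iter(loader)
next(it)                      # warm up workers and prefetch buffers
t0, n = time.time(), 0
for _ in range(100):
    batch = next(it)
    n += batch[0].shape[0]
print('imgs/sec:', n / (time.time() - t0))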
Exercises
Exercise 1 — Tune a PyTorch DataLoader for throughput
Goal: Increase images/sec on a folder dataset by at least 2× over a zero-worker baseline.
- Create an ImageFolder dataset with Resize→CenterCrop→ToTensor.
- Measure imgs/sec for num_workers in {0, 2, 4, 8, 12}, batch_size in {64, 128}, pin_memory on/off.
- Enable persistent_workers and prefetch_factor; remeasure.
- Report the best config and the improvement factor.
Exercise 2 — Build a fast tf.data pipeline
Goal: Build a pipeline that caches, shuffles, batches, and prefetches efficiently.
- Create a dataset from list_files over a labeled directory tree of JPEGs.
- Implement decode→resize→normalize in a map with num_parallel_calls=AUTOTUNE.
- Add cache, then shuffle(10000), then batch(128), then prefetch(AUTOTUNE).
- Measure steps/sec for 500 steps and record the result.
Readiness checklist
- GPU utilization is consistently high (e.g., >85%) during training
- Data queue never runs empty for long stretches
- Order of ops follows: decode → cache → shuffle → batch → prefetch
- num_workers (or parallel_calls) tuned by measurement, not guesswork
- pin_memory (PyTorch) or efficient prefetching (TF) enabled
- File strategy avoids millions of tiny random reads
Common mistakes and self-check
- Too few workers: GPU idles; fix by increasing num_workers/parallel_calls.
- Heavy CPU aug in a single thread: parallelize or move to GPU equivalents.
- Wrong order: shuffling before cache (TF) freezes one shuffled order into the cache, so later epochs are not reshuffled; prefer cache then shuffle.
- No pin_memory: host→GPU copies block longer; enable pin_memory and non_blocking transfers.
- Millions of tiny files on HDD: shard into larger records or use SSD/NVMe (see the sharding sketch after this list).
- Inefficient collate: avoid Python object juggling; keep tensors contiguous.
- No persistent workers: slow per-epoch restarts; enable persistent_workers=True.
- Small prefetch buffers: increase prefetch_factor (PyTorch) or let prefetch(AUTOTUNE) size the buffer (TF).
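If sharding is the fix, here is a hedged sketch of writing and reading TFRecord shards; the directory layout and file names are hypothetical. Note that interleave earns its keep here, where each element expands into a file-backed dataset:
# Sharding sketch: pack many small JPEGs into a few TFRecord shards, then read them back in parallel
import tensorflow as tf

def write_shard(paths, labels, out_path):
    with tf.io.TFRecordWriter(out_path) as writer:
        for path, label in zip(paths, labels):
            features = tf.train.Features(feature={
                'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[open(path, 'rb').read()])),
                'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
            })
            writer.write(tf.train.Example(features=features).SerializeToString())

# Reading: interleave parallel reads across shards, then decode/parse downstream
shards = tf.data.Dataset.list_files('data/shards/train-*.tfrecord', shuffle=True)
records = shards.interleave(tf.data.TFRecordDataset,
                            cycle_length=8,
                            num_parallel_calls=tf.data.AUTOTUNE)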
Self-check tips:
- Compare step time with and without the model (data-only loop); see the sketch after these tips. If data-only is slow, the loader is the bottleneck.
- Monitor GPU utilization; if low while CPU is saturated, tune the loader.
- Log queue depth if available; avoid frequent queue underflows.
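A minimal sketch of the data-only vs. full-step comparison from the first tip, assuming a tuned loader and a model already on the GPU (both names are assumptions):
# Compare data-only time vs. data + model step time to locate the bottleneck
import time
import torch

def time_batches(loader, num_batches=50, model=None, device='cuda'):
    it = iter(loader)
    next(it)                                   # warm up
    t0 = time.time()
    for _ in range(num_batches):
        images, labels = next(it)
        if model is not None:
            images = images.to(device, non_blocking=True)
            with torch.no_grad():
                model(images)
            torch.cuda.synchronize()           # include GPU time in the measurement
    return (time.time() - t0) / num_batches

data_only = time_batches(loader)
with_model = time_batches(loader, model=model)
print('data-only s/step:', data_only, '| with model s/step:', with_model)
# If the two numbers are close, the input pipeline is the bottleneck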
Practical projects
- Project 1: Convert a folder dataset to TFRecords and compare training throughput vs raw files. Report imgs/sec and epoch time.
- Project 2: Build two pipelines for the same dataset: (A) CPU augmentations, (B) GPU-friendly augmentations. Compare speed and accuracy.
- Project 3: Multi-worker profiling: run 1, 4, 8, 16 workers on SSD vs HDD and visualize throughput scaling.
Who this is for
Computer Vision Engineers and ML practitioners who train models on image or video data and want to maximize GPU utilization and reduce training time.
Prerequisites
- Basic Python and familiarity with PyTorch or TensorFlow
- Understanding of batching, epochs, and GPU training loops
- Ability to run simple timing/profiling experiments
Learning path
- Learn the data pipeline stages and bottlenecks
- Tune parallelism, pinning, and prefetching
- Optimize operation order and caching
- Adopt better file strategies (SSD, sharding)
- Profile, iterate, and document best configs
Mini challenge
Within 30 minutes, double your baseline images/sec on a given dataset by changing only loader parameters and operation order. Record before/after metrics and settings.
Next steps
- Integrate your tuned loader into your training template
- Add a small profiler that records imgs/sec, GPU utilization, and queue depth each epoch
- Prepare a one-page playbook with your best-known configurations per machine
Quick Test
Take the quick test to check your understanding.