
Efficient Data Loaders

Learn Efficient Data Loaders for free with explanations, exercises, and a quick test (for Computer Vision Engineers).

Published: January 5, 2026 | Updated: January 5, 2026

Why this matters

In computer vision training, your GPU can sit idle if the data loader cannot decode, augment, and deliver batches fast enough. Efficient data loaders keep the GPU fed so you train faster, iterate more, and reduce compute costs. Real tasks include tuning PyTorch DataLoader or tf.data pipelines, selecting the right number of workers, pinning memory, ordering cache/shuffle/batch/prefetch correctly, and choosing file formats that minimize I/O overhead.

  • Faster experiments: More epochs per day and quicker ablations.
  • Stable training: Smooth input pipelines reduce training jitter and timeouts.
  • Scalable pipelines: Same principles carry to multi-GPU and cluster training.

Concept explained simply

Think of your data pipeline like a factory line:

  1. Read bytes from storage (disk/remote)
  2. Decode image/video
  3. Apply transforms/augmentations
  4. Batch samples
  5. Transfer to GPU

The speed of the line equals the speed of the slowest step. You optimize by parallelizing CPU work, reducing I/O overhead, and overlapping CPU/GPU work.
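
To make the stages concrete, here is a minimal sketch of a custom PyTorch Dataset in which each __getitem__ call performs stages 1–3 on the CPU; the list of (path, label) pairs and the transform are assumptions for illustration. Stage 4 (batching) happens inside the DataLoader, and stage 5 is the .to(device) call in your training loop.

from PIL import Image
from torch.utils.data import Dataset

class ImageListDataset(Dataset):
    """Minimal sketch; `samples` is an assumed list of (image_path, label) pairs."""
    def __init__(self, samples, transform=None):
        self.samples = samples
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, label = self.samples[idx]
        img = Image.open(path).convert('RGB')  # stages 1-2: read bytes, decode
        if self.transform is not None:
            img = self.transform(img)          # stage 3: transforms/augmentations
        return img, label                      # stage 4 (batching) happens in the DataLoader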

Mental model

Imagine your model as a hungry GPU that must be fed N images every second. The CPU team (workers) prepares meals (decode/augment) and the runner (pinned memory transfer) delivers to the GPU table. If meals are late, the GPU starves; if meals are early, a small buffer (prefetch) prevents hiccups.

Throughput targets by model size (approximate)
  • Small CNNs: 1–5k imgs/sec
  • Mid models (ResNet50): 300–800 imgs/sec
  • Large ViT: 100–300 imgs/sec

Use these only as sanity checks. Actual numbers depend on hardware, augmentations, and image sizes.

Key tools and knobs

  • Parallelism: num_workers (PyTorch), num_parallel_calls and interleave (TensorFlow)
  • Overlap: prefetch_factor (PyTorch), dataset.prefetch(AUTOTUNE) (TensorFlow)
  • Transfer speed: pin_memory=True (PyTorch), tf.data on GPU-friendly tensors
  • Stability: persistent_workers=True (PyTorch), deterministic options (TF) when needed
  • Order of ops: read/decode → cache (if fits) → shuffle → map/augment → batch → prefetch
  • File strategy: Prefer NVMe/SSD, avoid millions of tiny files; use sharded records (e.g., TFRecord) when possible
  • Transforms: Move heavy aug to GPU where available (see the GPU-side sketch after this list); keep CPU aug vectorized
  • Collation: Efficient collate_fn; avoid expensive Python object manipulation
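
When CPU augmentation dominates, part of it can move to the GPU. A minimal sketch of GPU-side normalization plus per-sample horizontal flips, assuming the loader yields uint8 CHW batches (e.g., via torchvision.transforms.PILToTensor) and a CUDA device is available:

import torch

device = torch.device('cuda')
mean = torch.tensor([0.485, 0.456, 0.406], device=device).view(1, 3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225], device=device).view(1, 3, 1, 1)

def gpu_augment(images):
    # images: uint8 tensor of shape (B, 3, H, W), already moved to the GPU
    images = images.float().div_(255.0)
    images = (images - mean) / std                        # normalize on the GPU
    flip = torch.rand(images.shape[0], device=device) < 0.5
    images[flip] = torch.flip(images[flip], dims=[3])     # per-sample horizontal flip
    return images

# In the training loop:
# images = gpu_augment(images.to(device, non_blocking=True))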

How to choose num_workers (PyTorch)
  1. Start with num_workers = number of physical CPU cores / 2
  2. Measure images/sec for 1, 2, 4, 8, 12, 16 workers
  3. Pick the smallest value within 5% of the best throughput to save CPU

Use persistent_workers=True after you find a good value to avoid per-epoch startup overhead.
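
A minimal benchmarking sketch for that sweep, assuming train_ds is an image dataset such as the ImageFolder dataset built in Example 1 below and that it holds comfortably more than 105 batches at the chosen batch size:

import time
from torch.utils.data import DataLoader

def measure(num_workers, batches=100, batch_size=128):
    loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True,
                        num_workers=num_workers, pin_memory=True)
    it = iter(loader)
    for _ in range(5):            # warmup: spin up workers, fill buffers
        next(it)
    t0 = time.time()
    n = 0
    for _ in range(batches):
        images, _ = next(it)
        n += images.shape[0]
    return n / (time.time() - t0)

for w in (0, 2, 4, 8, 12, 16):
    print(f'num_workers={w}: {measure(w):.0f} imgs/sec')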

Worked examples

Example 1: PyTorch — baseline vs tuned

# Baseline DataLoader (slow)
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
train_ds = datasets.ImageFolder('data/train', transform=transform)

baseline = DataLoader(train_ds, batch_size=64, shuffle=True, num_workers=0)

# Tuned DataLoader
opt = DataLoader(
    train_ds,
    batch_size=128,
    shuffle=True,
    num_workers=8,
    pin_memory=True,
    prefetch_factor=4,
    persistent_workers=True,
)

# Measuring throughput (no model, data-only)
import time
for name, loader in [('baseline', baseline), ('tuned', opt)]:
    it = iter(loader)
    # warmup: let workers start and prefetch buffers fill
    for _ in range(10):
        next(it)
    # measure the next 100 batches from the same iterator
    t0 = time.time()
    n = 0
    for _ in range(100):
        images, _ = next(it)
        n += images.shape[0]
    dt = time.time() - t0
    print(f'{name} imgs/sec: {n / dt:.0f}')

Key improvements: larger batch size (if it fits in memory), multiple workers, pinned memory, and prefetching. Pair pin_memory=True with non_blocking=True when calling .to(device) so host-to-GPU copies can overlap with compute.

# Example transfer inside training loop
images, labels = images.to(device, non_blocking=True), labels.to(device, non_blocking=True)

Example 2: TensorFlow — tf.data with AUTOTUNE

import tensorflow as tf

IMG_SIZE = (224, 224)
BATCH = 128

files = tf.data.Dataset.list_files('data/train/*/*.jpg', shuffle=True)

def load_decode(path):
    img = tf.io.read_file(path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, IMG_SIZE)
    img = tf.cast(img, tf.float32) / 255.0
    label = tf.strings.split(path, '/')[-2]  # class-name string; fine for a throughput test
    return img, label

# Decode and augment in parallel across CPU cores (interleave is most useful for
# sharded record files; see the TFRecord sketch after this example)
pipeline = files.map(load_decode, num_parallel_calls=tf.data.AUTOTUNE)

# If RAM allows, cache decoded images for later epochs (a 224×224 float32 RGB image is ~0.6 MB)
pipeline = pipeline.cache()

# Shuffle after cache so you reshuffle cheaply each epoch
pipeline = pipeline.shuffle(10_000)

pipeline = pipeline.batch(BATCH, drop_remainder=True).prefetch(tf.data.AUTOTUNE)

# Iterate once to test speed
for imgs, labels in pipeline.take(1):
    print(imgs.shape)

Ordering is crucial: decode → cache → shuffle → batch → prefetch. Use AUTOTUNE to adapt to your hardware.
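
The interleave transform earns its keep once the data is sharded into record files (see the file strategy above): several shards are read concurrently. A minimal sketch, assuming shards named data/train-*.tfrecord whose examples store a JPEG-encoded 'image' bytes feature and an integer 'label' feature (both feature names are assumptions for illustration):

import tensorflow as tf

FEATURES = {
    'image': tf.io.FixedLenFeature([], tf.string),  # JPEG bytes (assumed feature name)
    'label': tf.io.FixedLenFeature([], tf.int64),   # integer class id (assumed)
}

def parse(record):
    ex = tf.io.parse_single_example(record, FEATURES)
    img = tf.image.decode_jpeg(ex['image'], channels=3)
    img = tf.image.resize(img, (224, 224)) / 255.0
    return img, ex['label']

shards = tf.data.Dataset.list_files('data/train-*.tfrecord', shuffle=True)

pipeline = (
    shards.interleave(
        tf.data.TFRecordDataset,              # each call opens one shard
        cycle_length=8,                       # shards read concurrently
        num_parallel_calls=tf.data.AUTOTUNE,
    )
    .map(parse, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(10_000)
    .batch(128, drop_remainder=True)
    .prefetch(tf.data.AUTOTUNE)
)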

Example 3: PyTorch — class balance without killing throughput

from torch.utils.data import DataLoader, WeightedRandomSampler
import numpy as np

# Assume train_ds.targets contains integer labels
class_counts = np.bincount(train_ds.targets)
class_weights = 1.0 / class_counts
sample_weights = class_weights[train_ds.targets]

sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(train_ds),  # keep epoch length stable
    replacement=True
)

loader_balanced = DataLoader(
    train_ds, batch_size=128, sampler=sampler, num_workers=8,
    pin_memory=True, prefetch_factor=4, persistent_workers=True
)

Use a sampler to rebalance the class distribution while retaining parallel loading and prefetching.

How to measure and iterate

  1. Start with a simple baseline. Record imgs/sec and GPU utilization.
  2. Increase num_workers until gains flatten. Enable persistent_workers.
  3. Use pin_memory and non_blocking transfers on CUDA.
  4. Enable prefetching. Adjust prefetch_factor (PyTorch) or AUTOTUNE (TF).
  5. Reorder ops: cache early if possible, shuffle after cache, batch then prefetch.
  6. Consider file strategy: switch to SSD/NVMe, or shard into larger records if many tiny files.
  7. Move heavy augmentations to GPU where supported, or reduce their cost.

Mini measurement task

Time 100 batches, compute imgs/sec, and observe GPU utilization with a system monitor. Change only one knob at a time.
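
One way to capture both numbers in the same run is to sample NVML inside the training loop. A rough sketch; pynvml (installed via the nvidia-ml-py package), loader, device, and the training step itself are assumptions here, not part of this lesson's code:

import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

n, utils, t0 = 0, [], time.time()
for i, (images, labels) in enumerate(loader):
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    # ... forward/backward/optimizer step goes here ...
    utils.append(pynvml.nvmlDeviceGetUtilizationRates(handle).gpu)
    n += images.shape[0]
    if i == 99:                  # time exactly 100 batches
        break
dt = time.time() - t0
print(f'imgs/sec: {n / dt:.0f} | mean GPU util: {sum(utils) / len(utils):.0f}%')
pynvml.nvmlShutdown()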

Exercises

Exercise 1 — Tune a PyTorch DataLoader for throughput

Goal: Increase images/sec on a folder dataset by at least 2× over a zero-worker baseline.

  1. Create an ImageFolder dataset with Resize→CenterCrop→ToTensor.
  2. Measure imgs/sec for num_workers in {0, 2, 4, 8, 12}, batch_size in {64, 128}, pin_memory on/off.
  3. Enable persistent_workers and prefetch_factor; remeasure.
  4. Report the best config and the improvement factor.

Exercise 2 — Build a fast tf.data pipeline

Goal: Build a pipeline that caches, shuffles, batches, and prefetches efficiently.

  1. Create a dataset from list_files over a labeled directory tree of JPEGs.
  2. Implement decode→resize→normalize in a map with num_parallel_calls=AUTOTUNE.
  3. Add cache, then shuffle(10000), then batch(128), then prefetch(AUTOTUNE).
  4. Measure steps/sec for 500 steps and record the result.

Readiness checklist

  • GPU utilization is consistently high (e.g., >85%) during training
  • Data queue never runs empty for long stretches
  • Order of ops follows: decode → cache → shuffle → batch → prefetch
  • num_workers (or parallel_calls) tuned by measurement, not guesswork
  • pin_memory (PyTorch) or efficient prefetching (TF) enabled
  • File strategy avoids millions of tiny random reads

Common mistakes and self-check

  • Too few workers: GPU idles; fix by increasing num_workers/parallel_calls.
  • Heavy CPU aug in a single thread: parallelize or move to GPU equivalents.
  • Wrong order: shuffling before cache (TF) bakes one fixed order into the cache, so later epochs are not reshuffled; prefer cache then shuffle.
  • No pin_memory: host→GPU copies block longer; enable pin_memory and non_blocking transfers.
  • Millions of tiny files on HDD: shard into larger records or use SSD/NVMe.
  • Inefficient collate: avoid Python object juggling; keep tensors contiguous.
  • No persistent workers: slow per-epoch restarts; enable persistent_workers=True.
  • Small prefetch buffers: increase prefetch_factor (PT) or AUTOTUNE (TF).

Self-check tips:

  • Compare step time with and without the model (data-only loop); if the data-only loop is nearly as slow as the full step, the loader is the bottleneck (see the sketch after these tips).
  • Monitor GPU utilization; if low while CPU is saturated, tune the loader.
  • Log queue depth if available; avoid frequent queue underflows.
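
A minimal sketch of the first tip (forward pass only, which is enough to see whether the loader keeps up), assuming loader, model, and a CUDA device already exist:

import time
import torch

def ms_per_step(loader, steps=100, model=None, device='cuda'):
    it = iter(loader)
    next(it)                                   # warmup batch
    t0 = time.time()
    for _ in range(steps):
        images, _ = next(it)
        if model is not None:
            images = images.to(device, non_blocking=True)
            with torch.no_grad():
                model(images)
            torch.cuda.synchronize()           # count GPU time in the measurement
    return 1000 * (time.time() - t0) / steps

data_only = ms_per_step(loader)
full_step = ms_per_step(loader, model=model.eval())
print(f'data-only: {data_only:.1f} ms/step | with model: {full_step:.1f} ms/step')
# If the two numbers are close, the input pipeline (not the model) dominates.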

Practical projects

  • Project 1: Convert a folder dataset to TFRecords and compare training throughput vs raw files. Report imgs/sec and epoch time.
  • Project 2: Build two pipelines for the same dataset: (A) CPU augmentations, (B) GPU-friendly augmentations. Compare speed and accuracy.
  • Project 3: Multi-worker profiling: run 1, 4, 8, 16 workers on SSD vs HDD and visualize throughput scaling.

Who this is for

Computer Vision Engineers and ML practitioners who train models on image or video data and want to maximize GPU utilization and reduce training time.

Prerequisites

  • Basic Python and familiarity with PyTorch or TensorFlow
  • Understanding of batching, epochs, and GPU training loops
  • Ability to run simple timing/profiling experiments

Learning path

  1. Learn the data pipeline stages and bottlenecks
  2. Tune parallelism, pinning, and prefetching
  3. Optimize operation order and caching
  4. Adopt better file strategies (SSD, sharding)
  5. Profile, iterate, and document best configs

Mini challenge

Within 30 minutes, double your baseline images/sec on a given dataset by changing only loader parameters and operation order. Record before/after metrics and settings.

Next steps

  • Integrate your tuned loader into your training template
  • Add a small profiler that records imgs/sec, GPU utilization, and queue depth each epoch
  • Prepare a one-page playbook with your best-known configurations per machine

Quick Test

Take the quick test to check your understanding. The test is available to everyone; only logged-in users will have their progress saved.

Practice Exercises

2 exercises to complete

Instructions

Goal: Increase images/sec by at least 2× over a zero-worker baseline on a folder dataset.

  1. Create an ImageFolder dataset with Resize(256) → CenterCrop(224) → ToTensor.
  2. Benchmark imgs/sec for combinations: num_workers ∈ {0,2,4,8,12}, batch_size ∈ {64,128}, pin_memory ∈ {False,True}.
  3. Enable persistent_workers=True and prefetch_factor ∈ {2,4}. Re-benchmark.
  4. Use non_blocking=True on tensor .to(device) during timing with a lightweight model or dummy step.
  5. Submit a short report: best config, baseline vs best imgs/sec, and your reasoning.

Expected Output

A short report with baseline and best images/sec, chosen parameters (batch_size, num_workers, pin_memory, prefetch_factor, persistent_workers), and observed improvement factor.

Efficient Data Loaders — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.

