Who this is for
This lesson is for computer vision engineers and ML practitioners who prepare image datasets and training pipelines for classification, detection, and segmentation. If you need models that generalize to new viewpoints and layouts, you will use crops, flips, and rotations regularly.
Prerequisites
- Comfort with basic image concepts: width, height, channels, coordinate systems (top-left origin).
- Understanding of dataset splits (train/val/test) and why augmentation belongs on training only.
- Knowledge of your task type: classification vs detection (bounding boxes) vs segmentation (masks) vs keypoints.
Why this matters
In real projects you rarely control how an object appears: it may be off-center, partially cropped, rotated slightly, or mirrored. Crops, flips, and rotations simulate these variations so your model:
- Recognizes objects despite camera framing (random crops simulate zoom/shift).
- Handles mirror symmetry (horizontal flips for people, animals, many objects).
- Is robust to slight orientation changes (small-angle rotations for everyday scenes).
Typical tasks where you need these:
- Image classification: improve accuracy and reduce overfitting with random-resized crops and flips.
- Object detection: increase recall with random crops; update bounding boxes consistently.
- Segmentation and keypoints: apply the same geometric transform to masks/landmarks.
Concept explained simply
- Crop: take a rectangular region of the image (random or center), optionally resize to the model's input size. This simulates zoom and framing changes.
- Flip: mirror the image horizontally (common) or vertically (less common). Horizontal flip often helps when left/right orientation does not change the class.
- Rotation: rotate the image by a small angle (e.g., ±10–20°). Use with care if orientation matters (e.g., digits, text, road signs).
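To make these concrete, here is a minimal NumPy/SciPy sketch of all three operations at the array level; the 300×300 window and the 12° angle are arbitrary illustration values:

```python
import numpy as np
from scipy.ndimage import rotate

# Assumed H x W x C uint8 image; array indexing is img[y, x] (top-left origin).
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(480, 640, 3), dtype=np.uint8)

# Crop: slice out a rectangular region (here a random 300x300 window).
top = int(rng.integers(0, img.shape[0] - 300))
left = int(rng.integers(0, img.shape[1] - 300))
crop = img[top:top + 300, left:left + 300]

# Horizontal flip: reverse the x (width) axis.
flipped = img[:, ::-1]

# Small rotation: scipy handles interpolation; order=1 is bilinear, keep size.
rotated = rotate(img, angle=12, reshape=False, order=1)
```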
Important: For detection, segmentation, and keypoints you must transform labels with the image. For example:
- Bounding boxes: adjust coordinates after crop/flip/rotation; clip to image boundaries.
- Masks: apply exactly the same spatial transform (use nearest-neighbor interpolation so mask values stay discrete).
- Keypoints: rotate and flip each (x, y) and update visibility if points move outside the image.
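As a sketch of what "transform labels with the image" looks like in code (the helper names are hypothetical), assuming [x_min, y_min, x_max, y_max] boxes with a max-exclusive convention:

```python
# Hypothetical helpers showing labels transformed in lockstep with the image.
# Boxes are [x_min, y_min, x_max, y_max] with the max coordinate exclusive.

def flip_box_horizontal(box, width):
    """Mirror a box across the vertical center line of an image of this width."""
    x_min, y_min, x_max, y_max = box
    return [width - x_max, y_min, width - x_min, y_max]

def crop_box(box, crop_left, crop_top, crop_w, crop_h):
    """Shift a box into crop coordinates and clip it to the crop boundaries."""
    x_min, y_min, x_max, y_max = box
    x_min = min(max(x_min - crop_left, 0), crop_w)
    x_max = min(max(x_max - crop_left, 0), crop_w)
    y_min = min(max(y_min - crop_top, 0), crop_h)
    y_max = min(max(y_max - crop_top, 0), crop_h)
    return [x_min, y_min, x_max, y_max]  # may be degenerate; filter it downstream

def flip_keypoint_horizontal(x, y, width):
    """Mirror a keypoint (pixel-index convention, hence width - 1 - x)."""
    return width - 1 - x, y
```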
Mental model
Think of these as camera jitters you add during training:
- Crop = you moved the camera slightly closer or off-center.
- Flip = the scene is mirrored, but its meaning is unchanged (if symmetry holds).
- Rotation = the camera tilted a bit.
Set simple rules: apply to training only, sample parameters randomly within safe ranges, and always keep labels in sync.
Parameters that often work (tune per dataset)
- Random-resized crop: scale range 0.6–1.0 of area, aspect ratio 3/4–4/3.
- Horizontal flip probability: 0.5 (commonly effective; reduce if left/right matters).
- Rotation: ±10–20° for general scenes; 0° if class is orientation-sensitive (digits, traffic signs, text).
- Detection safety: drop boxes with very small remaining area after crop (e.g., keep if at least 20% of original area remains).
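The detection-safety rule might be implemented with a small filter like this hypothetical sketch, with the 20% threshold as its default:

```python
def keep_box(original_box, cropped_box, min_area_frac=0.2):
    """Keep a box only if enough of its original area survives the crop."""
    def area(b):
        return max(b[2] - b[0], 0) * max(b[3] - b[1], 0)
    orig = area(original_box)
    return orig > 0 and area(cropped_box) / orig >= min_area_frac
```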
Worked examples
Example 1 — Classification pipeline
- Random-resized crop to 224×224 from a larger image using scale [0.6, 1.0] and aspect ratio [3/4, 4/3].
- Horizontal flip with p=0.5.
- Small rotation within ±15°, applied with p=0.3.
Effect: The model sees the same object at different scales, off-center positions, mirrored views, and mild tilts. Validation stays deterministic (e.g., resize + center crop only).
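One way to express this pipeline is with torchvision transforms; Resize(256) before the validation center crop is a common convention, not something this lesson mandates:

```python
import torchvision.transforms as T

# Training: random-resized crop + flip + occasional small rotation.
train_tf = T.Compose([
    T.RandomResizedCrop(224, scale=(0.6, 1.0), ratio=(3 / 4, 4 / 3)),
    T.RandomHorizontalFlip(p=0.5),
    T.RandomApply([T.RandomRotation(degrees=15)], p=0.3),  # ±15°, 30% of samples
    T.ToTensor(),
])

# Validation: deterministic preprocessing only.
val_tf = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
])
```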
Example 2 — Detection: crop + flip box update
Image: 640×480 (W×H). Box: [x_min=150, y_min=120, x_max=420, y_max=360].
- Center-crop to 480×480: left boundary at x=80, right at x=560.
- Translate x by −80: box becomes [70, 120, 340, 360]. Clip to [0,480].
- Horizontal flip over width 480 using x' = W − x: new box = [480−340, 120, 480−70, 360] = [140, 120, 410, 360].
Always apply the same transforms to every box, and discard any box whose remaining area falls below your threshold after clipping.
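You can check the arithmetic above with a few lines of plain Python:

```python
# Reproduce Example 2 step by step (plain Python, max-exclusive coordinates).
W, H = 640, 480
box = [150, 120, 420, 360]

# Center-crop to 480x480: the crop window spans x in [80, 560).
crop_w = crop_h = 480
left = (W - crop_w) // 2  # 80

# Translate into crop coordinates and clip.
x_min = min(max(box[0] - left, 0), crop_w)  # 70
x_max = min(max(box[2] - left, 0), crop_w)  # 340
y_min, y_max = box[1], box[3]               # height is unchanged by this crop

# Horizontal flip over the cropped width: x' = W_crop - x, which swaps min/max.
flipped = [crop_w - x_max, y_min, crop_w - x_min, y_max]
print(flipped)  # [140, 120, 410, 360]
```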
Example 3 — Keypoint rotation (90° clockwise)
Square image 256×256. Keypoint at (x=40, y=100). For a 90° clockwise rotation, a common mapping is: (x', y') = (H − 1 − y, x). Here H=256.
- x' = 256 − 1 − 100 = 155
- y' = 40
For arbitrary angles, rotate each point around the image center and reproject; for masks use nearest-neighbor interpolation; for images use bilinear.
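Here is a sketch of both cases in pixel-index coordinates with a top-left origin; since the y axis points down, a positive angle in the matrix below rotates clockwise:

```python
import math

# 90° clockwise mapping from the worked example: (x', y') = (H - 1 - y, x).
def rotate90_cw(x, y, h):
    return h - 1 - y, x

print(rotate90_cw(40, 100, 256))  # (155, 40)

# Arbitrary angle: rotate the point around the image center, then reproject.
# In image coordinates (y down) this matrix rotates clockwise for positive
# angles; with angle_deg=90 it reproduces the mapping above. Points landing
# outside [0, w) x [0, h) should be marked as not visible.
def rotate_point(x, y, angle_deg, w, h):
    cx, cy = (w - 1) / 2, (h - 1) / 2
    a = math.radians(angle_deg)
    dx, dy = x - cx, y - cy
    xr = cx + dx * math.cos(a) - dy * math.sin(a)
    yr = cy + dx * math.sin(a) + dy * math.cos(a)
    return xr, yr
```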
Common mistakes and self-check
- Applying augmentation to validation/test. Self-check: ensure only deterministic preprocess (resize/center crop) is used for val/test.
- Not updating labels. Self-check: visually draw boxes/masks on augmented images for a small batch and inspect.
- Too aggressive rotations for orientation-sensitive classes. Self-check: run a small ablation comparing ±0°, ±10°, ±20°; choose the best.
- Vertical flips where gravity matters (e.g., pedestrians). Self-check: confirm label semantics still hold after flipping.
- Off-by-one errors in box flipping. Self-check: verify formula convention (e.g., x' = W − x for [min,max) coordinates).
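For the "not updating labels" self-check, a small gallery script along these lines (the augment argument stands in for your own pipeline) makes most errors obvious at a glance:

```python
from PIL import ImageDraw

def save_debug_gallery(samples, augment, out_prefix="debug"):
    """Draw updated boxes on augmented images so label bugs are visible.

    `samples` yields (PIL.Image, list-of-boxes); `augment` is your own
    (hypothetical here) function returning (augmented_image, updated_boxes).
    """
    for i, (img, boxes) in enumerate(samples):
        aug_img, aug_boxes = augment(img, boxes)
        draw = ImageDraw.Draw(aug_img)
        for box in aug_boxes:
            draw.rectangle(list(box), outline="red", width=2)
        aug_img.save(f"{out_prefix}_{i}.png")
```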
Exercises
Do this hands-on task. Then check your work with the solution. Use the checklist to verify quality.
Exercise 1 — Crop then flip a bounding box
Image: 256×256. Box: [30, 50, 130, 200]. Crop: take the top-left region covering 80% of the width and height (from (0,0) to (204,204)), then apply a horizontal flip on the cropped image.
- Assumption: coordinates are [x_min, y_min, x_max, y_max] with max exclusive; the flip uses x' = W − x, where W is the cropped width.
What are the final box coordinates after crop and flip?
Solution
- After crop (0,0)-(204,204): box remains [30, 50, 130, 200].
- Flip over W=204: new x_min = 204 − 130 = 74; new x_max = 204 − 30 = 174.
- Final: [74, 50, 174, 200].
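You can confirm the solution numerically with a short script:

```python
# Verify Exercise 1 numerically (max-exclusive convention, x' = W - x).
box = [30, 50, 130, 200]
crop_w = crop_h = 204  # top-left crop from (0, 0) to (204, 204)

# Clip to the crop; no shift is needed because the crop origin is (0, 0).
x_min, y_min = max(box[0], 0), max(box[1], 0)
x_max, y_max = min(box[2], crop_w), min(box[3], crop_h)

# Horizontal flip over the cropped width.
final = [crop_w - x_max, y_min, crop_w - x_min, y_max]
print(final)  # [74, 50, 174, 200]
```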
Exercise checklist
- Used the same crop region for image and labels.
- Clipped boxes to the cropped boundaries.
- Applied the flip with a consistent coordinate convention.
- Verified no negative width/height after transforms.
Practical projects
- Build a classification training script with random-resized crops, p=0.5 horizontal flips, and ±15° rotations. Compare accuracy vs no augmentation.
- Create a detection visualizer: randomly crop and flip images and overlay adjusted boxes; export a small gallery to sanity-check label transforms.
- Segmentation mini-pipeline: rotate image+mask pairs by random small angles; ensure mask interpolation is nearest and boundaries remain crisp.
Learning path
- Master crops, flips, rotations (this lesson).
- Add scale and translation jitter; then photometric augments (color/brightness).
- Task-specific augments: Cutout, MixUp, Mosaic (for detection) after you are solid on label-safe geometry.
- Evaluation: run ablations to measure each augment's impact before adding more.
Mini challenge
Design a policy for a classification dataset of everyday objects where orientation mostly does not matter. Constraints:
- Average 2–3 geometric transforms per sample.
- No rotation beyond ±20°.
- Keep at least 60% of the original area in crops.
Deliverables: the parameter ranges you chose and a brief note on why (one sentence each). Bonus: run a quick ablation to compare with/without rotations.
Next steps
- Instrument a small ablation to quantify the benefit of each transform.
- Introduce task-aware constraints (e.g., disable flips for text-heavy datasets).
- Proceed to more advanced augmentations after your label-transform logic is robust.