Why this matters
In detection and segmentation, the model must find and outline objects under varied conditions. Augmentation simulates those conditions while keeping labels consistent. You will use it to:
- Increase robustness to lighting, scale, rotation, and occlusion.
- Balance class frequency by smartly re-using rare objects (e.g., copy–paste).
- Improve generalization on small datasets without collecting new images.
Real tasks you will face
- Build a training pipeline that flips, scales, and crops images while updating bounding boxes and masks.
- Filter boxes after crops when objects become too small or disappear.
- Mix multiple images (mosaic/mixup) and merge labels safely.
Concept explained simply
Augmentation is editing the image and moving its labels the same way. If the picture moves, labels move with it.
Mental model
Think of a printed photo with transparent stickers for boxes/masks on top. Any transform you apply to the photo (flip, rotate, crop) must also be applied to the stickers in the same order.
Core techniques you will use
Geometric transforms (update labels!)
- Flip: Horizontal flip over width W. New box x-coordinates: x_min' = W - x_max, x_max' = W - x_min. Masks/polygons: reflect all x coordinates.
- Scale/Resize: Multiply all coordinates by scale (or apply affine matrix). Masks: resample with nearest-neighbor to keep discrete labels.
- Rotate: Apply rotation matrix around image center, then re-box polygons to get new boxes. Masks: rotate with nearest-neighbor.
- Crop/Paste/Pad: Subtract crop offset, then clip to image bounds. Drop boxes/masks that shrink too much.
- Translate: Add shift to coordinates; clip to bounds.
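The box updates above can be sketched as small helpers. This is a minimal illustration (function names are my own, not from any library); boxes are `(x_min, y_min, x_max, y_max)` tuples:

```python
# Box updates for common geometric transforms (illustrative sketch).

def hflip_box(box, W):
    """Horizontal flip over image width W: x_min' = W - x_max, x_max' = W - x_min."""
    x_min, y_min, x_max, y_max = box
    return (W - x_max, y_min, W - x_min, y_max)

def scale_box(box, s):
    """Uniform scale: multiply every coordinate by s."""
    return tuple(c * s for c in box)

def crop_box(box, ox, oy, crop_w, crop_h):
    """Subtract the crop offset, then clip to the crop bounds.
    Returns None if the box leaves the crop entirely."""
    x_min, y_min, x_max, y_max = box
    x_min, x_max = x_min - ox, x_max - ox
    y_min, y_max = y_min - oy, y_max - oy
    x_min, x_max = max(0, x_min), min(crop_w, x_max)
    y_min, y_max = max(0, y_min), min(crop_h, y_max)
    if x_min >= x_max or y_min >= y_max:
        return None  # box fell completely outside the crop
    return (x_min, y_min, x_max, y_max)
```

Translation is crop with a negated offset and the original image bounds; rotation needs the full affine matrix applied to polygon vertices, then a re-box.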
Photometric transforms (no label geometry changes)
- Brightness/contrast/saturation/hue, gamma, noise, blur. Boxes and masks stay the same shape and position.
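A brightness/contrast jitter on pixel values makes the "no label change" point concrete. A toy sketch on a grayscale image stored as nested lists (real pipelines operate on arrays, but the rule is the same):

```python
# Photometric jitter: pixel' = clip(contrast * pixel + brightness).
# Boxes and masks pass through completely unchanged.

def jitter(image, contrast=1.0, brightness=0.0):
    return [[min(255, max(0, int(contrast * p + brightness))) for p in row]
            for row in image]

img = [[100, 200], [0, 50]]
out = jitter(img, contrast=1.2, brightness=10)
```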
Cutout / Random erasing
- Mask out random rectangular regions. Keep boxes that remain sufficiently visible; optionally record each box's remaining visible fraction so it can be used for filtering.
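Deciding whether a box survives an erase comes down to the fraction of its area left visible. A possible helper (names are illustrative):

```python
# Visible fraction of a box after a rectangle is erased: subtract the
# intersection of the box with the erased region from the box area.

def visible_fraction(box, erase):
    x_min, y_min, x_max, y_max = box
    ex0, ey0, ex1, ey1 = erase
    box_area = (x_max - x_min) * (y_max - y_min)
    iw = max(0, min(x_max, ex1) - max(x_min, ex0))  # intersection width
    ih = max(0, min(y_max, ey1) - max(y_min, ey0))  # intersection height
    return 1.0 - (iw * ih) / box_area

# Keep the box only if enough of it is still visible:
keep = visible_fraction((0, 0, 10, 10), (0, 0, 5, 10)) >= 0.3
```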
Copy–Paste (instance-level)
- Cut segmented instances from one image, paste into another; transform the pasted mask and box consistently. Where instances overlap, adjust the occlusion ordering so the pasted object correctly hides (or is hidden by) existing ones.
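The core mechanics can be sketched with nested lists: pasted pixels overwrite the target wherever the source mask is on, and the source box shifts by the same offset. This is a bare-bones illustration; real implementations also blend edges and re-clip occluded labels:

```python
# Copy-paste sketch: paste a source instance (binary mask) into a target
# image at an offset; shift its box by the same offset.

def paste_instance(target, src, src_mask, offset):
    ox, oy = offset
    H, W = len(target), len(target[0])
    out = [row[:] for row in target]
    for y, row in enumerate(src_mask):
        for x, on in enumerate(row):
            ty, tx = y + oy, x + ox
            if on and 0 <= ty < H and 0 <= tx < W:
                out[ty][tx] = src[y][x]  # pasted pixels occlude the target
    return out

def shift_box(box, offset):
    ox, oy = offset
    x0, y0, x1, y1 = box
    return (x0 + ox, y0 + oy, x1 + ox, y1 + oy)
```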
Mosaic / MixUp (image-level)
- Mosaic: Combine 2–4 images into one canvas; transform and translate labels from each image.
- MixUp: Blend two images; keep labels from both. For segmentation, keep instance masks; for semantic masks, blending is typically discouraged, so prefer CutMix-style mixing with hard masks instead.
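For mosaic, the label remap is just a per-tile translation followed by a merge. A sketch for a 2×2 mosaic (tile sizes and offsets here are assumptions for the example):

```python
# 2x2 mosaic label remap: each tile's boxes are translated by that tile's
# canvas offset, then all boxes are merged into one list.

def mosaic_boxes(per_tile_boxes, offsets):
    """per_tile_boxes: one box list per tile; offsets: (ox, oy) per tile."""
    merged = []
    for boxes, (ox, oy) in zip(per_tile_boxes, offsets):
        for (x0, y0, x1, y1) in boxes:
            merged.append((x0 + ox, y0 + oy, x1 + ox, y1 + oy))
    return merged

# Four 320x240 tiles on a 640x480 canvas:
offsets = [(0, 0), (320, 0), (0, 240), (320, 240)]
labels = mosaic_boxes([[(10, 10, 50, 50)], [], [(0, 0, 20, 20)], []], offsets)
```

If tiles are also resized or randomly cropped before placement, apply those transforms to each tile's boxes first, then translate.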
Label handling rules
- Boxes are usually stored as (x_min, y_min, x_max, y_max). Always clip to [0, W] × [0, H]. Ensure x_min ≤ x_max and y_min ≤ y_max after transforms.
- Masks: Use nearest-neighbor interpolation to avoid label bleeding across classes. Preserve instance IDs and class indices.
- Polygons: Transform every vertex; re-compute bounding boxes from transformed polygons when needed.
- Filtering after crops: drop tiny boxes (e.g., min area) or low-visibility boxes (e.g., min IoU with original or min visible fraction).
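The rules above can be combined into one box sanitizer: clip to bounds, drop degenerate or tiny boxes, and drop boxes whose visible fraction (clipped area over original area) falls below a threshold. A sketch with illustrative defaults:

```python
# Box sanitizer: clip to [0, W] x [0, H], enforce min size, and enforce a
# minimum visible fraction after clipping.

def sanitize_boxes(boxes, W, H, min_size=2, min_visible=0.3):
    kept = []
    for (x0, y0, x1, y1) in boxes:
        orig_area = max(0, x1 - x0) * max(0, y1 - y0)
        cx0, cy0 = max(0, x0), max(0, y0)
        cx1, cy1 = min(W, x1), min(H, y1)
        w, h = cx1 - cx0, cy1 - cy0
        if orig_area == 0 or w < min_size or h < min_size:
            continue  # degenerate or too small
        if (w * h) / orig_area < min_visible:
            continue  # mostly outside the frame
        kept.append((cx0, cy0, cx1, cy1))
    return kept
```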
Worked examples
Example 1: Horizontal flip for detection
- Image size: W = 1024, H = 768.
- Original box: (x_min=100, y_min=200, x_max=300, y_max=500).
- Flip horizontally: x_min' = 1024 - 300 = 724; x_max' = 1024 - 100 = 924; y stays the same.
Result: (724, 200, 924, 500).
Example 2: Scale + crop for detection
- Image: 640×480. Box: (120, 60, 360, 300).
- Scale by 1.5 → new image 960×720; box → (180, 90, 540, 450).
- Crop 640×480 at offset (x=200, y=100): subtract offsets → (−20, −10, 340, 350); clip to [0..640]×[0..480] → (0, 0, 340, 350).
- Check min area: area = 340×350 = 119,000 px²; keep.
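Example 2 can be checked mechanically by chaining the same steps in code (an illustrative sketch, not a library API):

```python
# Scale -> subtract crop offset -> clip, matching Example 2.

def scale_then_crop(box, s, ox, oy, crop_w, crop_h):
    x0, y0, x1, y1 = (c * s for c in box)                 # scale
    x0, y0, x1, y1 = x0 - ox, y0 - oy, x1 - ox, y1 - oy   # crop offset
    x0, y0 = max(0, x0), max(0, y0)                       # clip
    x1, y1 = min(crop_w, x1), min(crop_h, y1)
    return (x0, y0, x1, y1)

box = scale_then_crop((120, 60, 360, 300), 1.5, 200, 100, 640, 480)
area = (box[2] - box[0]) * (box[3] - box[1])  # min-area check input
```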
Example 3: Rotate + mask rule for segmentation
- Image: 512×512. Instance mask M (integer IDs).
- Rotate by +15° around center: apply affine rotation to the image and to M using nearest-neighbor for M.
- Polygons: rotate each vertex with same matrix; if polygon self-intersects after rotation+discretization, re-simplify or keep as mask.
Key: Never use bilinear for label masks; it creates invalid class indices.
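A toy nearest-neighbor resize shows why the rule holds: every output value is copied from an input pixel, so only original class IDs can appear. Bilinear would average across the 0/2 boundary and invent a class 1. A pure-Python sketch on a mask stored as nested lists:

```python
# Nearest-neighbor resize for an integer label mask: each output pixel
# copies the nearest input pixel, so no new class IDs are created.

def nn_resize(mask, out_h, out_w):
    in_h, in_w = len(mask), len(mask[0])
    return [[mask[int(y * in_h / out_h)][int(x * in_w / out_w)]
             for x in range(out_w)]
            for y in range(out_h)]

mask = [[0, 2],
        [0, 2]]
big = nn_resize(mask, 4, 4)
# All values in `big` are still in {0, 2}.
```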
Quality guards (keep labels trustworthy)
- Visibility threshold: drop boxes whose visible fraction after crop/erase is below, e.g., 0.3.
- Min size: drop boxes with width or height below, e.g., 2–3 pixels.
- Clipping: always clip boxes and polygons to image bounds.
- Order matters: apply the exact same transform chain, in the same order, to image and labels.
- Interpolation: images can use bilinear/bicubic; masks use nearest-neighbor.
- Randomness: set a seed per sample if you need reproducibility across image/label ops.
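One way to guarantee the image and its labels see identical randomness is to sample the transform parameters once per sample, then pass the same dictionary to both ops. A sketch (parameter names and ranges are illustrative):

```python
import random

# Sample all random transform parameters once, from a per-sample seed,
# so image and label transforms are guaranteed to agree.

def sample_params(sample_id, base_seed=0):
    rng = random.Random(hash((base_seed, sample_id)))  # reproducible per sample
    return {
        "flip": rng.random() < 0.5,
        "angle": rng.uniform(-10, 10),
        "scale": rng.uniform(0.8, 1.2),
    }

p1 = sample_params(sample_id=7)
p2 = sample_params(sample_id=7)
# Same sample id -> identical parameters for the image op and the label op.
```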
Self-check prompts
- After augmentations, are any boxes inverted (x_min > x_max) or outside bounds?
- Did class IDs or instance IDs change unintentionally after mask resampling?
- Are rare classes still present after random crops? If not, adjust crop policy.
Who this is for
- Engineers training object detectors or instance/semantic segmentation models.
- Practitioners working with limited data who need strong generalization.
Prerequisites
- Comfort with image coordinates and arrays.
- Basic understanding of object detection boxes and segmentation masks.
- Familiarity with affine transforms (scale, rotate, translate).
Learning path
- Master geometric transforms and consistent label updates.
- Add photometric transforms safely.
- Introduce advanced tactics (copy–paste, mosaic) and filtering rules.
- Tune augmentation strength with validation feedback.
Hands-on exercises
Do the exercise below.
Exercise ex1: Pipeline math for box transforms
You have an image W×H = 640×480 and a box (100, 120, 220, 300).
- Scale uniformly by 1.25 (image becomes 800×600).
- Crop 640×480 at top-left offset (60, 40).
- Flip horizontally across the 640-pixel width.
Compute the final box (x_min, y_min, x_max, y_max). If any part leaves the frame during the crop, clip to bounds first, then flip.
- Checklist: I multiplied coordinates by scale, subtracted crop offsets, clipped, then flipped.
- Checklist: I kept y unchanged during the horizontal flip.
- Checklist: Final coordinates are within [0..640]×[0..480] and ordered correctly.
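Once you have worked the exercise by hand, you can verify your answer with a small checker that applies the same steps in order (a sketch; it deliberately does not print the expected result):

```python
# Exercise checker: scale -> crop offset -> clip -> horizontal flip,
# applied in that order with shared parameters.

def exercise_pipeline(box, s, crop_off, crop_size):
    ox, oy = crop_off
    cw, ch = crop_size
    x0, y0, x1, y1 = (c * s for c in box)                 # 1) scale
    x0, y0, x1, y1 = x0 - ox, y0 - oy, x1 - ox, y1 - oy   # 2) crop offset
    x0, y0 = max(0, x0), max(0, y0)                       # 3) clip
    x1, y1 = min(cw, x1), min(ch, y1)
    return (cw - x1, y0, cw - x0, y1)                     # 4) horizontal flip

final = exercise_pipeline((100, 120, 220, 300), 1.25, (60, 40), (640, 480))
```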
Common mistakes and how to catch them
- Using bilinear for masks: creates non-integer labels. Fix: nearest-neighbor for any label map.
- Forgetting to update boxes after crop/flip: leads to label drift. Fix: unit-test each transform.
- Not filtering tiny/hidden boxes: noisy supervision. Fix: apply min-size and visibility thresholds.
- Wrong flip math: use x_min' = W - x_max, x_max' = W - x_min for horizontal flip.
- Applying transforms in different orders for image and labels: always share the same parameters/seed.
Quick self-test
After a random pipeline, quickly verify: number of boxes before vs. after (account for filtering), min box size non-zero, masks remain integer-valued, no NaNs/Infs.
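These checks translate directly into assertions you can run at the end of the pipeline (a sketch; adapt the names to your data structures):

```python
import math

# Post-pipeline sanity checks: box count accounting, ordering, finiteness,
# and integer-valued masks.

def sanity_check(boxes, masks, n_before, n_filtered):
    assert len(boxes) == n_before - n_filtered, "box count mismatch"
    for (x0, y0, x1, y1) in boxes:
        assert x1 > x0 and y1 > y0, "degenerate or inverted box"
        assert all(math.isfinite(c) for c in (x0, y0, x1, y1)), "NaN/Inf in box"
    for row in masks:
        assert all(isinstance(v, int) for v in row), "non-integer mask value"
    return True
```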
Practical projects
- Balanced cropper for small objects: Implement a random crop that retries until at least one target class remains visible.
- Copy–paste rare instances: Extract object masks from a few images and paste them into others with slight scale/rotation.
- Mosaic for multi-scale training: Build a 2×2 mosaic combining four images and re-map all labels.
Mini challenge
Design a strong-yet-stable augmentation policy for a small dataset with tiny objects. Combine random crop with min visibility 0.5, light rotation (±10°), brightness/contrast jitter, and occasional copy–paste for rare classes. Measure mAP/IoU changes against a baseline and adjust thresholds to reduce label loss.
Next steps
- Add per-class augmentation strength (e.g., more copy–paste for underrepresented classes).
- Validate on a clean hold-out set to detect over-augmentation (performance drops or unstable training).
- Document your policy and thresholds so teammates can reproduce results.