What this subskill covers
What good looks like
- Clear labeling guideline with examples and counter-examples
- Planned sampling (random + stratified on rare/hard cases)
- Objective acceptance criteria (e.g., IoU thresholds, per-class rules)
- Inter-annotator agreement tracked and improved over time
- Gold tasks (known answers) to calibrate and monitor raters
- Logged corrections and feedback loop to labelers
Why this matters
As a Computer Vision Engineer, you will frequently:
- Validate vendor labels for detection/segmentation before training
- Audit new data after model drift is suspected
- Calibrate reviewers and labelers on edge cases (occlusion, glare, tiny objects)
- Quantify whether dataset revisions improved downstream model metrics
- Design human-in-the-loop (HITL) review for production monitoring
Concept explained simply
Think of a label audit like quality inspection on a manufacturing line. You do not re-check every item; you check enough, in the right places, with objective rules, to be confident the whole batch is good. When issues appear, you map them to root causes and fix the process.
Mental model
- Sampling: where to look
- Criteria: how to judge
- Agreement: how consistent humans are
- Feedback: how to reduce future errors
Key components and metrics
Sampling strategies
- Random sampling: baseline health check
- Stratified sampling: ensure rare classes and hard conditions are included (e.g., night, occlusion, small objects); an allocation sketch follows this list
- Confidence-based: focus on model low-confidence or high-disagreement items
- Temporal/geo sampling: catch drift across time, devices, or locations
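The sketch below shows one way to turn these strategies into a concrete allocation, assuming per-stratum image counts are known. The 30-image floor is an illustrative choice, not a rule (the exercise later in this section upweights rare strata more aggressively).

```python
import random

def allocate_audit_sample(stratum_sizes, budget, min_per_stratum=30):
    """Proportional allocation with a floor so rare strata are not starved."""
    total = sum(stratum_sizes.values())
    alloc = {s: max(min_per_stratum, round(budget * n / total))
             for s, n in stratum_sizes.items()}
    # If the floor pushed the total over budget, trim from the largest strata.
    over = sum(alloc.values()) - budget
    while over > 0:
        largest = max(alloc, key=alloc.get)
        alloc[largest] -= 1
        over -= 1
    return alloc

def draw_sample(image_ids_by_stratum, allocation, seed=0):
    """Randomly draw the allocated number of images from each stratum."""
    rng = random.Random(seed)
    return {s: rng.sample(image_ids_by_stratum[s], k=min(k, len(image_ids_by_stratum[s])))
            for s, k in allocation.items()}

sizes = {"highway": 7500, "city": 2000, "night": 500}
print(allocate_audit_sample(sizes, budget=240))
# {'highway': 162, 'city': 48, 'night': 30}
```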
Audit checks
- Instruction adherence: correct class and hierarchy applied
- Geometry quality: bounding box tightness, mask accuracy (evaluate via IoU or boundary F-score; see the IoU sketch after this list)
- Completeness: no missing instances; no duplicated IDs across frames
- Appropriateness: redaction where required; no labeling PII beyond approved scope
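For the geometry check, here is a minimal box-IoU sketch, assuming boxes come as (x1, y1, x2, y2) tuples; the 0.85 threshold is the example value used elsewhere in this section.

```python
def box_iou(a, b):
    """IoU between two axis-aligned boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def passes_geometry_check(labeler_box, reviewer_box, threshold=0.85):
    """True if the labeler's box is tight enough relative to the reviewer's reference."""
    return box_iou(labeler_box, reviewer_box) >= threshold

print(box_iou((10, 10, 50, 50), (12, 12, 50, 50)))  # ~0.90
```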
Agreement metrics
- Pairwise agreement: percent agreement, Cohen’s kappa (two raters)
- Multi-rater: Fleiss’ kappa or Krippendorff’s alpha
- For shapes: compute overlap metrics (IoU) between rater and reference
Quick refresher: Cohen’s kappa
kappa = (Po − Pe) / (1 − Pe), where Po is observed agreement and Pe is chance agreement from marginal probabilities.
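A direct implementation of this formula, assuming the two raters' labels are available as parallel lists; labels can be any hashable values.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """kappa = (Po - Pe) / (1 - Pe) for two raters over the same items."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    po = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement Pe from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    pe = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return (po - pe) / (1 - pe) if pe != 1 else 1.0

# Tiny example: two raters on six items.
print(cohens_kappa(["car", "car", "bike", "bike", "car", "bike"],
                   ["car", "bike", "bike", "bike", "car", "car"]))  # ~0.33
```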
Gold sets and calibration
- Gold tasks: known answers placed in labeling queues (5–10% is common)
- Use gold performance for rater onboarding, spot checks, and drift detection (a monitoring sketch follows this list)
- Rotate fresh gold regularly and include edge cases
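A sketch of gold-task monitoring, assuming each rater's answers on gold items are logged as (rater, task, label) records; the 0.98 accuracy threshold mirrors the example gold criterion in the step-by-step plan below and is only an assumption.

```python
from collections import defaultdict

def gold_accuracy_by_rater(gold_answers, rater_answers):
    """gold_answers: {task_id: true_label}; rater_answers: [(rater_id, task_id, label), ...]"""
    hits, totals = defaultdict(int), defaultdict(int)
    for rater_id, task_id, label in rater_answers:
        if task_id in gold_answers:          # only score gold tasks
            totals[rater_id] += 1
            hits[rater_id] += int(label == gold_answers[task_id])
    return {r: hits[r] / totals[r] for r in totals}

def flag_raters(accuracy_by_rater, threshold=0.98):
    """Raters below threshold get re-calibration before more production work."""
    return [r for r, acc in accuracy_by_rater.items() if acc < threshold]

gold = {"g1": "crack", "g2": "no_crack", "g3": "crack"}
answers = [("alice", "g1", "crack"), ("alice", "g2", "no_crack"), ("alice", "g3", "crack"),
           ("bob", "g1", "crack"), ("bob", "g2", "crack"), ("bob", "g3", "crack")]
acc = gold_accuracy_by_rater(gold, answers)
print(acc, flag_raters(acc))  # bob falls below the example threshold
```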
Bias and subgroup checks
- Audit performance across lighting, device type, backgrounds, demographics where ethically and legally appropriate
- Compare error rates per subgroup to surface hidden bias (a small tallying sketch follows this list)
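One way this comparison might be tallied, assuming each audited item carries a subgroup tag and a reviewed correct/incorrect outcome; the tags and log format are assumptions.

```python
from collections import defaultdict

def error_rate_by_subgroup(records):
    """records: iterable of (subgroup, is_error) pairs from the audit log."""
    errors, totals = defaultdict(int), defaultdict(int)
    for subgroup, is_error in records:
        totals[subgroup] += 1
        errors[subgroup] += int(is_error)
    return {g: errors[g] / totals[g] for g in totals}

audit_log = [("day", False), ("day", False), ("day", True),
             ("night", True), ("night", True), ("night", False)]
print(error_rate_by_subgroup(audit_log))
# {'day': 0.33..., 'night': 0.66...} -> night labels need attention
```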
Disagreement handling
- Adjudication: a trained lead reviews disagreements and updates the guideline
- Versioning: keep guideline versions and note which version produced each label
Worked examples
Example 1: Object detection boxes (pedestrians)
Plan: Stratify by occlusion (none/partial/heavy). Sample 60 images per stratum (180 total). Review criteria: IoU with lead reviewer ≥ 0.85; missing instances counted as critical errors.
Outcome: Mean IoU = 0.88, but 12% of images have at least one missing person. Action: Add training examples of crowded scenes; update the guideline with “count rules” and re-check the same strata next week.
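A per-image check like the one implied by Example 1 might look like the sketch below; the field names and return shape are illustrative.

```python
def review_image(ious, missing_count, iou_threshold=0.85):
    """Return (passes, reasons) for one image under the Example 1 criteria."""
    reasons = []
    if ious and min(ious) < iou_threshold:
        reasons.append(f"box IoU below {iou_threshold}")
    if missing_count > 0:
        reasons.append("missing instance (critical)")
    return (not reasons, reasons)

print(review_image(ious=[0.91, 0.88, 0.79], missing_count=1))
# (False, ['box IoU below 0.85', 'missing instance (critical)'])
```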
Example 2: Segmentation masks (surface cracks)
Three annotators label 120 patches (binary: crack/no crack). Compute Fleiss’ kappa. Result: κ ≈ 0.62 (moderate). Action: Clarify whether hairline cracks count as cracks, add high-resolution exemplars, and re-run on 60 patches; κ improves to 0.75 (substantial).
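A compact Fleiss’ kappa sketch for this kind of setup (N items, a fixed number of raters per item, two categories); the small table below is illustrative, not the actual study data.

```python
import numpy as np

def fleiss_kappa(counts):
    """counts: (N_items, K_categories) array; each row sums to the raters per item."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1)[0]                  # raters per item (assumed constant)
    p_j = counts.sum(axis=0) / counts.sum()    # overall category proportions
    P_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))  # per-item agreement
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return (P_bar - P_e) / (1 - P_e)

# 5 illustrative patches, 3 raters each, columns = [crack, no_crack]
table = [[3, 0], [2, 1], [0, 3], [1, 2], [3, 0]]
print(round(fleiss_kappa(table), 2))
```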
Example 3: Model-assisted review
Use model confidence bins. Sample 100 items from 0.2–0.4 confidence and 100 from 0.4–0.6. The correction rate is 28% in the lower bin vs 9% in the mid bin. Action: Expand human review to the 0.2–0.5 confidence range in production until retraining reduces corrections below 10%.
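Correction rates per confidence bin could be tallied roughly as follows; the bin edges and review-log layout are assumptions.

```python
import bisect
from collections import defaultdict

def correction_rate_by_bin(reviews, edges=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)):
    """reviews: iterable of (model_confidence, was_corrected) pairs."""
    corrected, totals = defaultdict(int), defaultdict(int)
    for conf, was_corrected in reviews:
        i = min(bisect.bisect_right(edges, conf) - 1, len(edges) - 2)
        bin_label = f"{edges[i]:.1f}-{edges[i + 1]:.1f}"
        totals[bin_label] += 1
        corrected[bin_label] += int(was_corrected)
    return {b: corrected[b] / totals[b] for b in sorted(totals)}

reviews = [(0.25, True), (0.31, True), (0.38, False),
           (0.45, False), (0.52, True), (0.58, False)]
print(correction_rate_by_bin(reviews))
```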
Step-by-step: Run a label audit
- Define acceptance criteria: e.g., box IoU ≥ 0.85; no missing instances; class accuracy ≥ 98% on gold (a rework-trigger sketch follows these steps)
- Choose sampling: random + stratified (rare classes, hard conditions). Include a slice from recent data for drift.
- Prepare gold: 20–50 items with adjudicated labels covering easy and hard cases.
- Review with checklist: see the reviewer checklist below; log every correction and its reason.
- Measure agreement: compute kappa/IoU; track by stratum and rater.
- Analyze error themes: taxonomy gaps, ambiguous boundaries, tool issues.
- Feedback loop: update guidelines, retrain labelers, and adjust tooling.
- Re-audit: sample again to confirm improvements.
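To tie the steps together, here is a minimal sketch of the acceptance check and rework trigger, assuming per-stratum audit summaries (mean IoU, missing-instance rate, gold class accuracy) have already been computed; the thresholds echo the example criteria from the first step.

```python
from dataclasses import dataclass

@dataclass
class StratumResult:
    name: str
    mean_iou: float
    missing_instance_rate: float   # missing instances / total instances
    gold_class_accuracy: float

def needs_rework(result, min_iou=0.85, max_missing=0.01, min_gold_acc=0.98):
    """True if any acceptance criterion is violated for this stratum."""
    return (result.mean_iou < min_iou
            or result.missing_instance_rate > max_missing
            or result.gold_class_accuracy < min_gold_acc)

results = [
    StratumResult("highway", 0.90, 0.004, 0.99),
    StratumResult("night",   0.84, 0.020, 0.97),
]
for r in results:
    print(r.name, "rework" if needs_rework(r) else "accept")
```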
Checklists
Reviewer checklist
- Guideline version and examples are open
- Label taxonomy loaded and unambiguous
- Image quality acceptable (else mark as unusable)
- All instances labeled; boxes tight; masks aligned with object edges
- Hard cases flagged for adjudication rather than guessed
- Corrections and rationales logged
Data stewardship checklist
- Privacy-safe data; sensitive content appropriately handled
- No personally identifiable information labeled unless explicitly approved
- Balanced sampling across conditions and subgroups
- Versioning for guidelines and datasets
Exercises
Do these exercises before the quick test.
- Exercise 1: Sampling and acceptance plan (see details below)
- Exercise 2: Compute Cohen’s kappa from a confusion table
Exercise 1 — Instructions
You have 10,000 images for vehicle detection with three strata: highway (7,500), city (2,000), night (500). You can review ~240 images this week. Design a sampling plan and define acceptance criteria for boxes (IoU) and completeness.
- Deliverable: counts per stratum, acceptance thresholds, and a rule to trigger rework.
Exercise 1 — Hints
- Use proportional allocation but upweight rare/hard strata
- IoU ≥ 0.85 is common for tight boxes; completeness matters too
Exercise 1 — Expected output
A stratified sample with night upweighted, clear IoU and completeness thresholds, and a rework trigger rule.
Exercise 1 — Solution
Plan: Highway 120, City 80, Night 40 (night is 5% of data but gets 16.7% of the sample). Criteria: mean IoU ≥ 0.85 with no more than 5% of images having any box with IoU < 0.75; missing instances ≤ 1% of total instances. Trigger rework if any stratum violates these criteria or if the missing-instance rate exceeds 1.5% overall.
Exercise 2 — Instructions
Two annotators label 200 images (binary: object present/absent). Confusion counts: A=present/B=present: 70; A=present/B=absent: 10; A=absent/B=present: 20; A=absent/B=absent: 100. Compute Cohen’s kappa and decide if agreement is acceptable (threshold 0.70).
Exercise 2 — Hints
- Po = (agree)/N
- Pe from marginals: (row1*col1 + row2*col2)/N^2
Exercise 2 — Expected output
Kappa near 0.69 and a brief decision comment.
Exercise 2 — Solution
Po = (70 + 100)/200 = 0.85. Row totals: 80 present, 120 absent. Column totals: 90 present, 110 absent. Pe = (80*90 + 120*110)/200^2 = 20400/40000 = 0.51. Kappa = (0.85 − 0.51)/(1 − 0.51) ≈ 0.694. Below 0.70 threshold; improve guidelines, then re-measure.
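As a cross-check, the same value can be recovered from raw label pairs rebuilt from the confusion counts, assuming scikit-learn is available (its cohen_kappa_score implements the same formula).

```python
from sklearn.metrics import cohen_kappa_score

# Rebuild the 200 label pairs from the four confusion-table cells.
cells = [("present", "present", 70), ("present", "absent", 10),
         ("absent", "present", 20), ("absent", "absent", 100)]
rater_a = [a for a, _, n in cells for _ in range(n)]
rater_b = [b for _, b, n in cells for _ in range(n)]
print(round(cohen_kappa_score(rater_a, rater_b), 3))  # 0.694
```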
Common mistakes and self-check
- Auditing only easy cases: Always stratify by hard conditions
- Mixing correction and adjudication: Separate quick fixes from formal guideline updates
- No versioning: You cannot attribute improvements without dataset/guideline versions
- Ignoring completeness: High IoU is meaningless if instances are missing
- Single reviewer only: Use at least periodic multi-rater checks
- No gold tasks: You lose calibration and drift detection
Self-check questions
- Do I have acceptance thresholds per stratum?
- Are gold tasks covering edge cases?
- Is disagreement resolved and documented with guideline changes?
Practical projects
- Build a lightweight audit dashboard: upload samples, record IoU, missing counts, and disagreement
- Create a 50-item gold set with annotated rationales and counter-examples
- Write a one-page reviewer guide with visual “do/do-not” examples
Who this is for, prerequisites, learning path
Who this is for
- Computer Vision Engineers owning dataset quality
- ML Ops/QA roles supporting production monitoring
Prerequisites
- Basic understanding of detection/segmentation tasks and IoU
- Comfort with simple statistics (proportions, agreement)
Learning path
- Start: This audit subskill
- Then: Error bucketing and root-cause analysis
- Next: Active learning and data curation for retraining
Mini challenge
Your classifier underperforms at night and on small objects. In five bullet points, propose an audit plan that will confirm whether label quality issues exist and guide improvements.
Next steps
- Run a pilot audit on one recent data batch
- Expand gold tasks and repeat weekly until metrics stabilize
- Integrate review results into retraining data selection