Why this matters for Applied Scientists
Models fail when data and labels are wrong. As an Applied Scientist, you turn ambiguous business problems into measurable outcomes. A strong data and label strategy helps you: select representative data, define clear labels, control noise, scale labeling cost-effectively, and close the loop with active learning. This unlocks dependable model performance, faster iterations, and easier debugging.
Who this is for
- Applied Scientists and ML Engineers building models that depend on labeled data.
- Data Scientists planning pilots, PoCs, or production models with human labeling.
- Team leads who need a repeatable, auditable data process.
Prerequisites
- Basic Python and NumPy/Pandas.
- Familiarity with train/validation/test splits and evaluation metrics.
- Comfort with classification concepts (precision/recall), and basic probability.
Learning path
- Define the task and acceptance criteria — Write 2–3 target metrics and minimum thresholds, and identify edge cases (see the sketch after this list).
- Design the dataset — Map sources, sampling strategy (stratified, temporal, or domain-balanced), and expected class balance.
- Create a label taxonomy and guidelines — Make labels mutually exclusive and collectively exhaustive; add positive/negative examples and counterexamples.
- Run a pilot annotation — Label a small batch with 2+ annotators, measure agreement, and refine guidelines.
- Scale with weak supervision — Write labeling rules or use distant supervision to cheaply expand coverage; estimate noise.
- Loop with active learning — Prioritize uncertain or high-value samples for human labeling.
- Augment thoughtfully — Use augmentation to improve robustness without leaking labels.
- Version and document — Track dataset versions, label guideline revisions, and lineage (how each sample was obtained and labeled).
- Privacy and compliance — Remove PII, minimize retention, and document consent and access controls.
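To make the first step concrete, here is a minimal sketch of acceptance criteria captured as a plain Python dict; the task name, metric names, and thresholds are illustrative assumptions, not a fixed schema.
# Hypothetical acceptance criteria for a support-ticket triage task.
# All names and thresholds below are illustrative assumptions.
task_definition = {
    "task": "support_ticket_intent",
    "target_metrics": {
        "macro_f1": 0.75,             # minimum acceptable macro-F1
        "recall_class_refund": 0.85,  # recall on a business-critical class
    },
    "edge_cases": [
        "multilingual messages",
        "messages mixing two intents",
        "empty or emoji-only messages",
    ],
}
Keeping this next to the dataset makes "done" testable: a candidate model either meets the thresholds or it does not.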
Worked examples
1) Stratified sampling for an imbalanced dataset
Goal: keep class ratios consistent across splits and optionally weight minority classes.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
# Example data
df = pd.DataFrame({
    "text": [f"message {i}" for i in range(10)],  # placeholder texts
    "label": [0, 1, 0, 0, 1, 0, 0, 0, 1, 0],      # imbalanced
})
train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df["label"]
)
# Optional: compute class weights for training
classes = np.unique(train_df["label"])  # classes present in the training split
weights = compute_class_weight(class_weight="balanced", classes=classes, y=train_df["label"])
class_weight_map = dict(zip(classes, weights))
print(class_weight_map)
Tip: For temporal data, use time-based splits first, then stratify within time windows if needed.
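A minimal sketch of that time-first split, assuming the DataFrame has a "timestamp" column (the column name and the 80% cutoff are assumptions for illustration):
# Time-based split: train on the past, evaluate on the most recent period.
# Assumes df has a "timestamp" column; the 80% quantile cutoff is arbitrary.
df = df.sort_values("timestamp")
cutoff = df["timestamp"].quantile(0.8)
train_df = df[df["timestamp"] <= cutoff]
test_df = df[df["timestamp"] > cutoff]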
2) Inter-annotator agreement (Cohen’s kappa)
Goal: quantify consistency between two annotators and update guidelines.
from sklearn.metrics import cohen_kappa_score
# Two annotators labeled the same 20 items
ann_a = [0,1,0,1,1,0,2,1,0,2, 1,0,2,2,1,0,1,2,0,1]
ann_b = [0,1,0,1,0,0,2,1,0,2, 1,0,2,1,1,0,1,2,0,2]
kappa = cohen_kappa_score(ann_a, ann_b)
print(f"Cohen's kappa: {kappa:.2f}")
# Rough guide (Landis & Koch): 0.61–0.80 substantial, 0.41–0.60 moderate. Investigate disagreements.
Action: Extract top disagreement cases and add explicit rules to the guideline for them.
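Continuing from the two annotator lists above, a small sketch that surfaces the most frequent disagreement pairs:
from collections import Counter
# Count which label pairs the two annotators confuse most often.
disagreements = Counter((a, b) for a, b in zip(ann_a, ann_b) if a != b)
for (a, b), n in disagreements.most_common():
    print(f"A said {a}, B said {b}: {n} time(s)")
Label pairs that dominate this list are the ones that need explicit decision rules in the guideline.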
3) Simple weak supervision with labeling functions
Goal: bootstrap labels using heuristics and combine votes.
ABSTAIN = -1

def lf_contains_refund(x):
    return 1 if "refund" in x.lower() else ABSTAIN

def lf_contains_thanks(x):
    return 0 if "thank" in x.lower() else ABSTAIN

def lf_length_short(x):
    return 1 if len(x.split()) < 4 else ABSTAIN
texts = [
    "refund please", "thanks for the help", "need a refund asap", "question"
]
votes = []
for t in texts:
    row = [lf_contains_refund(t), lf_contains_thanks(t), lf_length_short(t)]
    votes.append(row)
# Majority vote ignoring ABSTAIN
labels = []
for row in votes:
    vals = [v for v in row if v != ABSTAIN]
    if len(vals) == 0:
        labels.append(ABSTAIN)
    else:
        counts = {c: vals.count(c) for c in set(vals)}
        labels.append(max(counts, key=counts.get))
print(labels)  # Pseudo-labels to prioritize for review
Improve: Estimate per-rule accuracy and correlation to weight votes rather than naive majority.
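A sketch of accuracy-weighted voting, continuing from the votes above; the per-rule accuracies are made-up placeholders you would estimate on a small hand-labeled sample:
# Hypothetical per-rule accuracies (placeholders, not measured values).
rule_acc = [0.9, 0.7, 0.55]
weighted_labels = []
for row in votes:
    scores = {}
    for vote, acc in zip(row, rule_acc):
        if vote != ABSTAIN:
            scores[vote] = scores.get(vote, 0.0) + acc  # accuracy-weighted tally
    weighted_labels.append(max(scores, key=scores.get) if scores else ABSTAIN)
print(weighted_labels)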
4) Active learning via uncertainty sampling
Goal: pick the most informative unlabeled samples.
import numpy as np
from sklearn.linear_model import LogisticRegression
# Synthetic stand-ins for the seed labeled set and the unlabeled pool
rng = np.random.default_rng(42)
X_l = rng.normal(size=(100, 5))     # seed labeled features
y_l = rng.integers(0, 3, size=100)  # seed labels (3 classes)
X_u = rng.normal(size=(1000, 5))    # unlabeled features
clf = LogisticRegression(max_iter=1000).fit(X_l, y_l)
probs = clf.predict_proba(X_u)
# Margin sampling: smallest gap between the top-2 class probabilities
sorted_probs = np.sort(probs, axis=1)
margins = sorted_probs[:, -1] - sorted_probs[:, -2]
query_idx = np.argsort(margins)[:50]  # pick the 50 most uncertain
X_query = X_u[query_idx]
# Send X_query for human labeling, then retrain
Variation: add diversity (e.g., cluster then sample uncertain points across clusters).
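A rough sketch of that cluster-then-sample variation, continuing from the code above (the cluster count is an arbitrary choice):
from sklearn.cluster import KMeans
# Cluster the unlabeled pool, then take the most uncertain point per cluster.
k = 10  # arbitrary number of clusters for illustration
clusters = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_u)
diverse_idx = []
for c in range(k):
    members = np.where(clusters == c)[0]  # indices of points in cluster c
    if len(members) > 0:
        diverse_idx.append(members[np.argmin(margins[members])])  # most uncertain
X_query = X_u[diverse_idx]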
5) Safe augmentation for text and images
Goal: improve robustness without changing label meaning.
# Text: light synonym replacement (whitelist-based)
import random
syn = {"quick": ["fast", "rapid"], "buy": ["purchase"]}
def aug_text(t):
    out = []
    for w in t.split():
        if w.lower() in syn and random.random() < 0.2:
            out.append(random.choice(syn[w.lower()]))
        else:
            out.append(w)
    return " ".join(out)
# Image: deterministic horizontal flip
from PIL import Image
def flip_horizontal(path):
    with Image.open(path) as img:
        return img.transpose(Image.Transpose.FLIP_LEFT_RIGHT)
Rule: Never augment validation/test. Keep augmentations label-preserving; avoid operations that change class identity.
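A brief sketch of applying aug_text to training data only, reusing train_df from example 1 and assuming class 1 is the minority (both are assumptions for illustration):
# Augment only minority-class rows in the training split; never touch val/test.
minority = train_df[train_df["label"] == 1]
aug_rows = minority.assign(text=minority["text"].map(aug_text))
train_aug = pd.concat([train_df, aug_rows], ignore_index=True)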
6) Lightweight dataset versioning and lineage
Goal: track exact data, label guide version, and how labels were produced.
import hashlib, json
def file_hash(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()
manifest = {
    "dataset_version": "1.2.0",  # major: guideline change, minor: new data, patch: fixes
    "sources": ["support_tickets_2025Q1"],
    "splits": {"train": 3200, "val": 400, "test": 400},
    "label_guideline_version": "v3",
    "labeling_methods": ["human_v2", "weak_rules_v1"],
    "artifacts": {"train_csv_sha256": file_hash("train.csv")},
}
with open("dataset_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
Practice: Store the manifest with your dataset and reference it in experiments for reproducibility.
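One way to sketch that reference at experiment start, reusing file_hash from above (failing fast on a mismatch is a suggested convention, not a requirement):
# Verify the data on disk matches the manifest before any training run.
with open("dataset_manifest.json") as f:
    manifest = json.load(f)
expected = manifest["artifacts"]["train_csv_sha256"]
assert file_hash("train.csv") == expected, "train.csv does not match manifest"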
Drills and exercises
- Write a 6–10 class label taxonomy for a problem you know. Mark any overlapping labels and merge or clarify them.
- Draft 8–12 bullet rules in a labeling guideline. Add at least 5 counterexamples.
- Design a sampling plan: identify domains/sources, target class ratios, and how you will avoid temporal leakage.
- Create 5 labeling functions (weak rules) and estimate which are noisy. Note one way to reduce each rule’s noise.
- Run a 50-sample dual-annotation pilot. Compute Cohen’s kappa and propose 3 guideline changes.
- List 3 safe augmentations for your task and 3 unsafe ones (with reasons).
- Write a one-page dataset card: data sources, consent, PII mitigation, known biases, and contact for issues.
Common mistakes and debugging tips
- Overlapping labels: If annotators disagree often between two labels, merge or add a decision rule.
- Ignoring the negative/other class: Define it clearly with examples; otherwise it becomes a sink for confusion.
- Sampling mismatch: Training on one domain, testing on another. Stratify by domain/time; hold out future periods.
- Leakage from augmentation: Never augment validation/test; avoid label-changing transforms.
- One-shot guidelines: Expect 2–3 iterations. Pilot, measure agreement, and revise.
- Weak rule conflicts: Estimate rule accuracies; don’t treat all rules equally.
- Untracked changes: Tie model runs to dataset and guideline versions. Store a manifest with hashes.
- Privacy gaps: Strip PII early, minimize retention, and restrict access to raw data (a minimal redaction sketch follows this list).
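A minimal sketch of early PII stripping with regexes for emails and phone-like numbers; treat it as a starting point only, since names, addresses, and IDs need NER-based detection and human review:
import re
# Naive regex redaction: emails and phone-like numbers only.
# A starting point, not a complete PII solution.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s()-]{7,}\d")
def redact(text):
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)
print(redact("Contact jane.doe@example.com or +1 (555) 123-4567"))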
Mini project: Build a data and label strategy for a pilot model
- Choose a task: e.g., triage support messages into 5 intents. Define target metrics (e.g., macro-F1 ≥ 0.75).
- Taxonomy + guideline v1: 5 labels, each with definition, 3 positives, 3 counterexamples, and tie-break rules.
- Sampling plan: 1,000 messages across 3 sources and last 3 months; preserve source ratios.
- Pilot annotation (n=150): Two annotators label the same set. Compute kappa and adjust the guideline.
- Weak supervision: Write 5 rules to tag obvious cases. Combine with majority or weighted voting.
- Active learning iteration: Train a baseline, pick top-100 uncertain samples, label them, retrain once.
- Augmentation: Add safe text paraphrases for minority classes only to training.
- Versioning + card: Save dataset_manifest.json (v1.0.0) and a dataset card with privacy notes.
Hints
- Target kappa ≥ 0.6 before scaling annotation.
- Track how many labels come from humans vs. weak rules; review a sample of weak labels.
- Stop augmenting if validation performance on minority classes does not improve.
Success criteria: Clear guideline v2+, kappa ≥ 0.6, a reproducible dataset manifest, and a baseline model that meets or approaches target metrics.
Practical projects
- Customer intent classification: end-to-end taxonomy, guidelines, pilot IAA, and active learning loop.
- Product defect image tagging: define visually distinct classes, safe augmentations, and stratified temporal splits.
- PII redaction data: build a labeled NER dataset with strict privacy rules and auditable lineage.
Subskills
- Dataset Design And Sampling — Choose sources, prevent leakage, and ensure splits match production distributions.
- Label Taxonomy And Guidelines — Make labels unambiguous, mutually exclusive, and well-documented with examples.
- Annotation QA And Agreement — Measure and improve consistency (kappa/alpha), and run spot checks.
- Weak Supervision Basics — Use heuristic rules and distant signals to cheaply expand labeled data.
- Active Learning Basics — Prioritize uncertain or diverse samples to maximize label impact.
- Data Augmentation Strategy — Apply label-preserving transforms to improve robustness only in training.
- Dataset Versioning And Lineage — Track versions, hashes, guideline revisions, and how labels were produced.
- Privacy And Compliance For Data — Remove PII, document consent, and minimize retention with access controls.
Next steps
- Finish the subskills in order, then run the mini project end-to-end.
- Adopt a simple manifest template for every dataset and tie it to your experiments.
- Take the skill exam below to check your readiness. Everyone can take it; logged-in users get saved progress.