Data And Label Strategy

Learn Data And Label Strategy for Applied Scientists for free: roadmap, examples, subskills, and a skill exam.

Published: January 7, 2026 | Updated: January 7, 2026

Why this matters for Applied Scientists

Models fail when data and labels are wrong. As an Applied Scientist, you turn ambiguous business problems into measurable outcomes. A strong data and label strategy helps you: select representative data, define clear labels, control noise, scale labeling cost-effectively, and close the loop with active learning. This unlocks dependable model performance, faster iterations, and easier debugging.

Who this is for

  • Applied Scientists and ML Engineers building models that depend on labeled data.
  • Data Scientists planning pilots, PoCs, or production models with human labeling.
  • Team leads who need a repeatable, auditable data process.

Prerequisites

  • Basic Python and NumPy/Pandas.
  • Familiarity with train/validation/test splits and evaluation metrics.
  • Comfort with classification concepts (precision/recall), and basic probability.

Learning path

  1. Define the task and acceptance criteria — Write 2–3 target metrics and minimum thresholds. Identify edge cases. (A minimal acceptance-criteria sketch follows this list.)
  2. Design the dataset — Map sources, sampling strategy (stratified, temporal, or domain-balanced), and expected class balance.
  3. Create a label taxonomy and guidelines — Make labels mutually exclusive and collectively exhaustive; add positive/negative examples and counterexamples.
  4. Run a pilot annotation — Label a small batch with 2+ annotators, measure agreement, and refine guidelines.
  5. Scale with weak supervision — Write labeling rules or use distant supervision to cheaply expand coverage; estimate noise.
  6. Loop with active learning — Prioritize uncertain or high-value samples for human labeling.
  7. Augment thoughtfully — Use augmentation to improve robustness without leaking labels.
  8. Version and document — Track dataset versions, label guideline revisions, and lineage (how each sample was obtained and labeled).
  9. Privacy and compliance — Remove PII, minimize retention, and document consent and access controls.
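
As a starting point for step 1, here is a minimal sketch of acceptance criteria captured as code so every experiment is checked against the same thresholds; the metric names and target values are illustrative assumptions, not prescriptions.

# Assumed targets for a classification pilot; adjust names and thresholds to your task
ACCEPTANCE = {"macro_f1": 0.75, "minority_recall": 0.70}

def meets_acceptance(metrics, criteria=ACCEPTANCE):
    """Return True only if every target metric meets its minimum threshold."""
    return all(metrics.get(name, 0.0) >= threshold for name, threshold in criteria.items())

print(meets_acceptance({"macro_f1": 0.78, "minority_recall": 0.72}))  # True
print(meets_acceptance({"macro_f1": 0.78, "minority_recall": 0.65}))  # False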

Worked examples

1) Stratified sampling for an imbalanced dataset

Goal: keep class ratios consistent across splits and optionally weight minority classes.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Example data: 10 short texts with imbalanced labels
df = pd.DataFrame({
    "text": [f"example text {i}" for i in range(10)],
    "label": [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]  # imbalanced
})

train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df["label"]
)

# Optional: compute class weights for training (classes must be a NumPy array)
classes = df["label"].unique()
weights = compute_class_weight(class_weight="balanced", classes=classes, y=train_df["label"])
class_weight_map = dict(zip(classes, weights))
print(class_weight_map)

Tip: For temporal data, use time-based splits first, then stratify within time windows if needed.
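
A minimal sketch of that time-based split, assuming a DataFrame with a timestamp column; the column name, synthetic data, and 80/20 cutoff are illustrative.

# Hypothetical frame with timestamps; replace with your own data
events = pd.DataFrame({
    "text": [f"msg {i}" for i in range(10)],
    "label": [0, 1, 0, 0, 1, 0, 0, 0, 1, 0],
    "timestamp": pd.date_range("2025-01-01", periods=10, freq="D"),
})

events = events.sort_values("timestamp")
cutoff = events["timestamp"].quantile(0.8)        # hold out the most recent ~20% as test
train_df = events[events["timestamp"] <= cutoff]
test_df = events[events["timestamp"] > cutoff]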

2) Inter-annotator agreement (Cohen’s kappa)

Goal: quantify consistency between two annotators and update guidelines.

from sklearn.metrics import cohen_kappa_score

# Two annotators labeled the same 20 items
ann_a = [0,1,0,1,1,0,2,1,0,2, 1,0,2,2,1,0,1,2,0,1]
ann_b = [0,1,0,1,0,0,2,1,0,2, 1,0,2,1,1,0,1,2,0,2]

kappa = cohen_kappa_score(ann_a, ann_b)
print(f"Cohen's kappa: {kappa:.2f}")
# 0.60–0.80: substantial; 0.40–0.60: moderate. Investigate disagreements.

Action: Extract top disagreement cases and add explicit rules to the guideline for them.
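
One way to surface those cases, continuing the ann_a/ann_b example above:

# Indices where the two annotators disagree; pull these items up for adjudication
disagreements = [i for i, (a, b) in enumerate(zip(ann_a, ann_b)) if a != b]
print(disagreements)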

3) Simple weak supervision with labeling functions

Goal: bootstrap labels using heuristics and combine votes.

ABSTAIN = -1

def lf_contains_refund(x):
    return 1 if "refund" in x.lower() else ABSTAIN

def lf_contains_thanks(x):
    return 0 if "thank" in x.lower() else ABSTAIN

def lf_length_short(x):
    return 1 if len(x.split()) < 4 else ABSTAIN

texts = [
    "refund please", "thanks for the help", "need a refund asap", "question"
]

votes = []
for t in texts:
    row = [lf_contains_refund(t), lf_contains_thanks(t), lf_length_short(t)]
    votes.append(row)

# Majority vote, ignoring ABSTAIN
labels = []
for row in votes:
    vals = [v for v in row if v != ABSTAIN]
    if len(vals) == 0:
        labels.append(ABSTAIN)
    else:
        counts = {c: vals.count(c) for c in set(vals)}
        labels.append(max(counts, key=counts.get))

print(labels)  # Pseudo-labels to prioritize for review

Improve: Estimate per-rule accuracy and correlation to weight votes rather than naive majority.
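
A rough sketch of that improvement: score each rule on a small hand-labeled dev set and weight its votes by estimated accuracy. The dev labels below are illustrative assumptions, and correlation between rules is not modeled here (frameworks such as Snorkel handle that part).

# Hypothetical hand labels for the same four texts, used only to score the rules
dev_labels = [1, 0, 1, 0]
rules = [lf_contains_refund, lf_contains_thanks, lf_length_short]

# Per-rule accuracy over the items where the rule did not abstain
rule_acc = []
for rule in rules:
    scored = [(rule(t), y) for t, y in zip(texts, dev_labels) if rule(t) != ABSTAIN]
    rule_acc.append(sum(p == y for p, y in scored) / len(scored) if scored else 0.5)

# Weighted vote: each non-abstaining rule contributes its estimated accuracy
weighted = []
for t in texts:
    scores = {}
    for rule, acc in zip(rules, rule_acc):
        v = rule(t)
        if v != ABSTAIN:
            scores[v] = scores.get(v, 0.0) + acc
    weighted.append(max(scores, key=scores.get) if scores else ABSTAIN)

print(rule_acc, weighted)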

4) Active learning via uncertainty sampling

Goal: pick the most informative unlabeled samples.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins so the example runs; replace with your real features and labels
rng = np.random.default_rng(0)
X_l = rng.normal(size=(100, 5))     # seed labeled features
y_l = rng.integers(0, 2, size=100)  # seed labels
X_u = rng.normal(size=(1000, 5))    # unlabeled features

clf = LogisticRegression(max_iter=1000).fit(X_l, y_l)
probs = clf.predict_proba(X_u)

# Margin sampling: smallest gap between the top-2 class probabilities
sorted_probs = np.sort(probs, axis=1)
margins = sorted_probs[:, -1] - sorted_probs[:, -2]
query_idx = np.argsort(margins)[:50]  # pick the 50 most uncertain samples

X_query = X_u[query_idx]
# Send X_query for human labeling, then retrain

Variation: add diversity (e.g., cluster then sample uncertain points across clusters).
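
A sketch of that variation, reusing X_u and margins from above; the cluster count and per-cluster budget are arbitrary choices.

from sklearn.cluster import KMeans

# Cluster the unlabeled pool, then take the most uncertain samples from each cluster
k = 10
clusters = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_u)

query_idx = []
for c in range(k):
    members = np.where(clusters == c)[0]
    query_idx.extend(members[np.argsort(margins[members])[:5]].tolist())  # 5 per cluster

X_query = X_u[query_idx]  # send for human labeling as before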

5) Safe augmentation for text and images

Goal: improve robustness without changing label meaning.

# Text: light synonym replacement (whitelist-based)
import random
syn = {"quick": ["fast", "rapid"], "buy": ["purchase"]}

def aug_text(t):
    out = []
    for w in t.split():
        if w.lower() in syn and random.random() < 0.2:
            out.append(random.choice(syn[w.lower()]))
        else:
            out.append(w)
    return " ".join(out)

# Image: deterministic horizontal flip
from PIL import Image

def flip_horizontal(path):
    img = Image.open(path)
    return img.transpose(Image.FLIP_LEFT_RIGHT)

Rule: Never augment validation/test. Keep augmentations label-preserving; avoid operations that change class identity.
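
A small sketch of that rule in practice, applying aug_text only to minority-class rows of a hypothetical training split; the texts and minority class below are assumptions.

# Hypothetical training split; class 1 is the minority here
train_texts = ["quick question about my order", "where can I buy this", "need a refund asap"]
train_labels = [0, 0, 1]

minority = 1
extra = [aug_text(t) for t, y in zip(train_texts, train_labels) if y == minority]
train_texts_aug = train_texts + extra
train_labels_aug = train_labels + [minority] * len(extra)
# Validation and test sets are left untouched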

6) Lightweight dataset versioning and lineage

Goal: track exact data, label guide version, and how labels were produced.

import hashlib, json

def file_hash(path):
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            h.update(chunk)
    return h.hexdigest()

manifest = {
  "dataset_version": "1.2.0",  # major: guideline change, minor: new data, patch: fixes
  "sources": ["support_tickets_2025Q1"],
  "splits": {"train": 3200, "val": 400, "test": 400},
  "label_guideline_version": "v3",
  "labeling_methods": ["human_v2", "weak_rules_v1"],
  "artifacts": {"train_csv_sha256": file_hash("train.csv")}
}

with open("dataset_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)

Practice: Store the manifest with your dataset and reference it in experiments for reproducibility.
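
One lightweight way to do that: read the manifest back at experiment time and store its versions with the run record. The run name and output file below are hypothetical.

with open("dataset_manifest.json") as f:
    manifest = json.load(f)

run_record = {
    "run_name": "baseline_logreg_v1",  # hypothetical experiment name
    "dataset_version": manifest["dataset_version"],
    "label_guideline_version": manifest["label_guideline_version"],
    "metrics": {},  # fill in from your evaluation
}
with open("run_record.json", "w") as f:
    json.dump(run_record, f, indent=2)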

Drills and exercises

  • Write a 6–10 class label taxonomy for a problem you know. Mark any overlapping labels and merge or clarify them.
  • Draft 8–12 bullet rules in a labeling guideline. Add at least 5 counterexamples.
  • Design a sampling plan: identify domains/sources, target class ratios, and how you will avoid temporal leakage.
  • Create 5 labeling functions (weak rules) and estimate which are noisy. Note one way to reduce each rule’s noise.
  • Run a 50-sample dual-annotation pilot. Compute Cohen’s kappa and propose 3 guideline changes.
  • List 3 safe augmentations for your task and 3 unsafe ones (with reasons).
  • Write a one-page dataset card: data sources, consent, PII mitigation, known biases, and contact for issues.

Common mistakes and debugging tips

  • Overlapping labels: If annotators disagree often between two labels, merge or add a decision rule.
  • Ignoring the negative/other class: Define it clearly with examples; otherwise it becomes a sink for confusion.
  • Sampling mismatch: Training on one domain, testing on another. Stratify by domain/time; hold out future periods.
  • Leakage from augmentation: Never augment validation/test; avoid label-changing transforms.
  • One-shot guidelines: Expect 2–3 iterations. Pilot, measure agreement, and revise.
  • Weak rule conflicts: Estimate rule accuracies; don’t treat all rules equally.
  • Untracked changes: Tie model runs to dataset and guideline versions. Store a manifest with hashes.
  • Privacy gaps: Strip PII early, minimize retention, and restrict access to raw data.

Mini project: Build a data and label strategy for a pilot model

  1. Choose a task: e.g., triage support messages into 5 intents. Define target metrics (e.g., macro-F1 ≥ 0.75).
  2. Taxonomy + guideline v1: 5 labels, each with definition, 3 positives, 3 counterexamples, and tie-break rules.
  3. Sampling plan: 1,000 messages across 3 sources and last 3 months; preserve source ratios (a proportional-sampling sketch follows this list).
  4. Pilot annotation (n=150): Two annotators label the same set. Compute kappa and adjust the guideline.
  5. Weak supervision: Write 5 rules to tag obvious cases. Combine with majority or weighted voting.
  6. Active learning iteration: Train a baseline, pick top-100 uncertain samples, label them, retrain once.
  7. Augmentation: Add safe text paraphrases for minority classes only to training.
  8. Versioning + card: Save dataset_manifest.json (v1.0.0) and a dataset card with privacy notes.
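
For step 3, a sketch of source-proportional sampling with pandas; the candidate pool and sampling fraction below are illustrative assumptions.

import pandas as pd

# Hypothetical candidate pool tagged by source
pool = pd.DataFrame({
    "text": [f"msg {i}" for i in range(30)],
    "source": ["email"] * 15 + ["chat"] * 10 + ["phone"] * 5,
})

# Sample the same fraction from each source so source ratios are preserved
sample = pool.groupby("source").sample(frac=1/3, random_state=42)
print(sample["source"].value_counts())
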
Hints
  • Target kappa ≥ 0.6 before scaling annotation.
  • Track how many labels come from humans vs. weak rules; review a sample of weak labels.
  • Stop augmenting if validation performance on minority classes does not improve.

Success criteria: Clear guideline v2+, kappa ≥ 0.6, a reproducible dataset manifest, and a baseline model that meets or approaches target metrics.

Practical projects

  • Customer intent classification: end-to-end taxonomy, guidelines, pilot IAA, and active learning loop.
  • Product defect image tagging: define visually distinct classes, safe augmentations, and stratified temporal splits.
  • PII redaction data: build a labeled NER dataset with strict privacy rules and auditable lineage.

Subskills

  • Dataset Design And Sampling — Choose sources, prevent leakage, and ensure splits match production distributions.
  • Label Taxonomy And Guidelines — Make labels unambiguous, mutually exclusive, and well-documented with examples.
  • Annotation QA And Agreement — Measure and improve consistency (kappa/alpha), and run spot checks.
  • Weak Supervision Basics — Use heuristic rules and distant signals to cheaply expand labeled data.
  • Active Learning Basics — Prioritize uncertain or diverse samples to maximize label impact.
  • Data Augmentation Strategy — Apply label-preserving transforms to improve robustness only in training.
  • Dataset Versioning And Lineage — Track versions, hashes, guideline revisions, and how labels were produced.
  • Privacy And Compliance For Data — Remove PII, document consent, and minimize retention with access controls.

Next steps

  • Finish the subskills in order, then run the mini project end-to-end.
  • Adopt a simple manifest template for every dataset and tie it to your experiments.
  • Take the skill exam below to check your readiness. Everyone can take it; logged-in users get saved progress.

Data And Label Strategy — Skill Exam

This exam checks your practical understanding of data and label strategy: dataset design, labeling quality, weak supervision, active learning, augmentation, versioning, and privacy. Score 70% or higher to pass. You can retake anytime. Everyone can take the exam; only logged-in users have their progress saved.

11 questions | 70% to pass
