Why this matters for Applied Scientists
Models fail when data and labels are wrong. As an Applied Scientist, you turn ambiguous business problems into measurable outcomes. A strong data and label strategy helps you: select representative data, define clear labels, control noise, scale labeling cost-effectively, and close the loop with active learning. This unlocks dependable model performance, faster iterations, and easier debugging.
Who this is for
- Applied Scientists and ML Engineers building models that depend on labeled data.
- Data Scientists planning pilots, PoCs, or production models with human labeling.
- Team leads who need a repeatable, auditable data process.
Prerequisites
- Basic Python and NumPy/Pandas.
- Familiarity with train/validation/test splits and evaluation metrics.
- Comfort with classification concepts (precision/recall), and basic probability.
Learning path
- Define the task and acceptance criteria — Write 2–3 target metrics and minimum thresholds, and identify edge cases (see the sketch after this list).
- Design the dataset — Map sources, sampling strategy (stratified, temporal, or domain-balanced), and expected class balance.
- Create a label taxonomy and guidelines — Make labels mutually exclusive and collectively exhaustive; add positive/negative examples and counterexamples.
- Run a pilot annotation — Label a small batch with 2+ annotators, measure agreement, and refine guidelines.
- Scale with weak supervision — Write labeling rules or use distant supervision to cheaply expand coverage; estimate noise.
- Loop with active learning — Prioritize uncertain or high-value samples for human labeling.
- Augment thoughtfully — Use augmentation to improve robustness without leaking labels.
- Version and document — Track dataset versions, label guideline revisions, and lineage (how each sample was obtained and labeled).
- Privacy and compliance — Remove PII, minimize retention, and document consent and access controls.
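To make the first step concrete, here is a minimal sketch of acceptance criteria captured as a plain Python dict; the task name, metric names, and thresholds are illustrative assumptions, not a fixed schema.
# Hypothetical acceptance criteria for a support-ticket triage task.
# All names and thresholds below are illustrative assumptions.
task_definition = {
    "task": "support_ticket_intent",
    "target_metrics": {
        "macro_f1": 0.75,             # minimum acceptable macro-F1
        "recall_class_refund": 0.85,  # recall on a business-critical class
    },
    "edge_cases": [
        "multilingual messages",
        "messages mixing two intents",
        "empty or emoji-only messages",
    ],
}
Keeping this next to the dataset makes "done" testable: a candidate model either meets the thresholds or it does not.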
Worked examples
1) Stratified sampling for an imbalanced dataset
Goal: keep class ratios consistent across splits and optionally weight minority classes.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
# Example data
df = pd.DataFrame({
    "text": [f"message {i}" for i in range(10)],  # placeholder texts
    "label": [0, 1, 0, 0, 1, 0, 0, 0, 1, 0],      # imbalanced
})
train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df["label"]
)
# Optional: compute class weights for training
classes = np.unique(train_df["label"])  # classes present in the training split
weights = compute_class_weight(class_weight="balanced", classes=classes, y=train_df["label"])
class_weight_map = dict(zip(classes, weights))
print(class_weight_map)
Tip: For temporal data, use time-based splits first, then stratify within time windows if needed.
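A minimal sketch of that time-first split, assuming the DataFrame has a "timestamp" column (the column name and the 80% cutoff are assumptions for illustration):
# Time-based split: train on the past, evaluate on the most recent period.
# Assumes df has a "timestamp" column; the 80% quantile cutoff is arbitrary.
df = df.sort_values("timestamp")
cutoff = df["timestamp"].quantile(0.8)
train_df = df[df["timestamp"] <= cutoff]
test_df = df[df["timestamp"] > cutoff]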
2) Inter-annotator agreement (Cohen’s kappa)
Goal: quantify consistency between two annotators and update guidelines.
from sklearn.metrics import cohen_kappa_score
# Two annotators labeled the same 20 items
ann_a = [0,1,0,1,1,0,2,1,0,2, 1,0,2,2,1,0,1,2,0,1]
ann_b = [0,1,0,1,0,0,2,1,0,2, 1,0,2,1,1,0,1,2,0,2]
kappa = cohen_kappa_score(ann_a, ann_b)
print(f"Cohen's kappa: {kappa:.2f}")
# Rough guide (Landis & Koch): 0.61–0.80 substantial, 0.41–0.60 moderate. Investigate disagreements.
Action: Extract top disagreement cases and add explicit rules to the guideline for them.
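Continuing from the two annotator lists above, a small sketch that surfaces the most frequent disagreement pairs:
from collections import Counter
# Count which label pairs the two annotators confuse most often.
disagreements = Counter((a, b) for a, b in zip(ann_a, ann_b) if a != b)
for (a, b), n in disagreements.most_common():
    print(f"A said {a}, B said {b}: {n} time(s)")
Label pairs that dominate this list are the ones that need explicit decision rules in the guideline.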
3) Simple weak supervision with labeling functions
Goal: bootstrap labels using heuristics and combine votes.
ABSTAIN = -1

def lf_contains_refund(x):
    return 1 if "refund" in x.lower() else ABSTAIN

def lf_contains_thanks(x):
    return 0 if "thank" in x.lower() else ABSTAIN

def lf_length_short(x):
    return 1 if len(x.split()) < 4 else ABSTAIN
texts = [
    "refund please", "thanks for the help", "need a refund asap", "question"
]
votes = []
for t in texts:
    row = [lf_contains_refund(t), lf_contains_thanks(t), lf_length_short(t)]
    votes.append(row)
# Majority vote ignoring ABSTAIN
labels = []
for row in votes:
    vals = [v for v in row if v != ABSTAIN]
    if len(vals) == 0:
        labels.append(ABSTAIN)
    else:
        counts = {c: vals.count(c) for c in set(vals)}
        labels.append(max(counts, key=counts.get))
print(labels)  # Pseudo-labels to prioritize for review
Improve: Estimate per-rule accuracy and correlation to weight votes rather than naive majority.
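A sketch of accuracy-weighted voting, continuing from the votes above; the per-rule accuracies are made-up placeholders you would estimate on a small hand-labeled sample:
# Hypothetical per-rule accuracies (placeholders, not measured values).
rule_acc = [0.9, 0.7, 0.55]
weighted_labels = []
for row in votes:
    scores = {}
    for vote, acc in zip(row, rule_acc):
        if vote != ABSTAIN:
            scores[vote] = scores.get(vote, 0.0) + acc  # accuracy-weighted tally
    weighted_labels.append(max(scores, key=scores.get) if scores else ABSTAIN)
print(weighted_labels)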
4) Active learning via uncertainty sampling
Goal: pick the most informative unlabeled samples.
import numpy as np
from sklearn.linear_model import LogisticRegression
# Synthetic stand-ins for the seed labeled set and the unlabeled pool
rng = np.random.default_rng(42)
X_l = rng.normal(size=(100, 5))     # seed labeled features
y_l = rng.integers(0, 3, size=100)  # seed labels (3 classes)
X_u = rng.normal(size=(1000, 5))    # unlabeled features
clf = LogisticRegression(max_iter=1000).fit(X_l, y_l)
probs = clf.predict_proba(X_u)
# Margin sampling: smallest gap between the top-2 class probabilities
sorted_probs = np.sort(probs, axis=1)
margins = sorted_probs[:, -1] - sorted_probs[:, -2]
query_idx = np.argsort(margins)[:50]  # pick the 50 most uncertain
X_query = X_u[query_idx]
# Send X_query for human labeling, then retrain
Variation: add diversity (e.g., cluster then sample uncertain points across clusters).
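A rough sketch of that cluster-then-sample variation, continuing from the code above (the cluster count is an arbitrary choice):
from sklearn.cluster import KMeans
# Cluster the unlabeled pool, then take the most uncertain point per cluster.
k = 10  # arbitrary number of clusters for illustration
clusters = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_u)
diverse_idx = []
for c in range(k):
    members = np.where(clusters == c)[0]  # indices of points in cluster c
    if len(members) > 0:
        diverse_idx.append(members[np.argmin(margins[members])])  # most uncertain
X_query = X_u[diverse_idx]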
5) Safe augmentation for text and images
Goal: improve robustness without changing label meaning.
# Text: light synonym replacement (whitelist-based)
import random
syn = {"quick": ["fast", "rapid"], "buy": ["purchase"]}
def aug_text(t):
    out = []
    for w in t.split():
        if w.lower() in syn and random.random() < 0.2:
            out.append(random.choice(syn[w.lower()]))
        else:
            out.append(w)
    return " ".join(out)
# Image: deterministic horizontal flip
from PIL import Image
def flip_horizontal(path):
    with Image.open(path) as img:
        return img.transpose(Image.Transpose.FLIP_LEFT_RIGHT)
Rule: Never augment validation/test. Keep augmentations label-preserving; avoid operations that change class identity.
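A brief sketch of applying aug_text to training data only, reusing train_df from example 1 and assuming class 1 is the minority (both are assumptions for illustration):
# Augment only minority-class rows in the training split; never touch val/test.
minority = train_df[train_df["label"] == 1]
aug_rows = minority.assign(text=minority["text"].map(aug_text))
train_aug = pd.concat([train_df, aug_rows], ignore_index=True)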
6) Lightweight dataset versioning and lineage
Goal: track exact data, label guide version, and how labels were produced.
import hashlib, json
def file_hash(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()
manifest = {
    "dataset_version": "1.2.0",  # major: guideline change, minor: new data, patch: fixes
    "sources": ["support_tickets_2025Q1"],
    "splits": {"train": 3200, "val": 400, "test": 400},
    "label_guideline_version": "v3",
    "labeling_methods": ["human_v2", "weak_rules_v1"],
    "artifacts": {"train_csv_sha256": file_hash("train.csv")},
}
with open("dataset_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
Practice: Store the manifest with your dataset and reference it in experiments for reproducibility.
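One way to sketch that reference at experiment start, reusing file_hash from above (failing fast on a mismatch is a suggested convention, not a requirement):
# Verify the data on disk matches the manifest before any training run.
with open("dataset_manifest.json") as f:
    manifest = json.load(f)
expected = manifest["artifacts"]["train_csv_sha256"]
assert file_hash("train.csv") == expected, "train.csv does not match manifest"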
Drills and exercises
- Write a 6–10 class label taxonomy for a problem you know. Mark any overlapping labels and merge or clarify them.
- Draft 8–12 bullet rules in a labeling guideline. Add at least 5 counterexamples.
- Design a sampling plan: identify domains/sources, target class ratios, and how you will avoid temporal leakage.
- Create 5 labeling functions (weak rules) and estimate which are noisy. Note one way to reduce each rule’s noise.
- Run a 50-sample dual-annotation pilot. Compute Cohen’s kappa and propose 3 guideline changes.
- List 3 safe augmentations for your task and 3 unsafe ones (with reasons).
- Write a one-page dataset card: data sources, consent, PII mitigation, known biases, and contact for issues.
Common mistakes and debugging tips
- Overlapping labels: If annotators disagree often between two labels, merge or add a decision rule.
- Ignoring the negative/other class: Define it clearly with examples; otherwise it becomes a sink for confusion.
- Sampling mismatch: Training on one domain, testing on another. Stratify by domain/time; hold out future periods.
- Leakage from augmentation: Never augment validation/test; avoid label-changing transforms.
- One-shot guidelines: Expect 2–3 iterations. Pilot, measure agreement, and revise.
- Weak rule conflicts: Estimate rule accuracies; don’t treat all rules equally.
- Untracked changes: Tie model runs to dataset and guideline versions. Store a manifest with hashes.
- Privacy gaps: Strip PII early, minimize retention, and restrict access to raw data (a minimal redaction sketch follows this list).
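A minimal sketch of early PII stripping with regexes for emails and phone-like numbers; treat it as a starting point only, since names, addresses, and IDs need NER-based detection and human review:
import re
# Naive regex redaction: emails and phone-like numbers only.
# A starting point, not a complete PII solution.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s()-]{7,}\d")
def redact(text):
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)
print(redact("Contact jane.doe@example.com or +1 (555) 123-4567"))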
Mini project: Build a data and label strategy for a pilot model
- Choose a task: e.g., triage support messages into 5 intents. Define target metrics (e.g., macro-F1 ≥ 0.75).
- Taxonomy + guideline v1: 5 labels, each with definition, 3 positives, 3 counterexamples, and tie-break rules.
- Sampling plan: 1,000 messages across 3 sources and last 3 months; preserve source ratios.
- Pilot annotation (n=150): Two annotators label the same set. Compute kappa and adjust the guideline.
- Weak supervision: Write 5 rules to tag obvious cases. Combine with majority or weighted voting.
- Active learning iteration: Train a baseline, pick top-100 uncertain samples, label them, retrain once.
- Augmentation: Add safe text paraphrases for minority classes only to training.
- Versioning + card: Save dataset_manifest.json (v1.0.0) and a dataset card with privacy notes.
Hints
- Target kappa ≥ 0.6 before scaling annotation.
- Track how many labels come from humans vs. weak rules; review a sample of weak labels.
- Stop augmenting if validation performance on minority classes does not improve.
Success criteria: Clear guideline v2+, kappa ≥ 0.6, a reproducible dataset manifest, and a baseline model that meets or approaches target metrics.
Practical projects
- Customer intent classification: end-to-end taxonomy, guidelines, pilot IAA, and active learning loop.
- Product defect image tagging: define visually distinct classes, safe augmentations, and stratified temporal splits.
- PII redaction data: build a labeled NER dataset with strict privacy rules and auditable lineage.
Subskills
- Dataset Design And Sampling — Choose sources, prevent leakage, and ensure splits match production distributions.
- Label Taxonomy And Guidelines — Make labels unambiguous, mutually exclusive, and well-documented with examples.
- Annotation QA And Agreement — Measure and improve consistency (kappa/alpha), and run spot checks.
- Weak Supervision Basics — Use heuristic rules and distant signals to cheaply expand labeled data.
- Active Learning Basics — Prioritize uncertain or diverse samples to maximize label impact.
- Data Augmentation Strategy — Apply label-preserving transforms to improve robustness only in training.
- Dataset Versioning And Lineage — Track versions, hashes, guideline revisions, and how labels were produced.
- Privacy And Compliance For Data — Remove PII, document consent, and minimize retention with access controls.
Next steps
- Finish the subskills in order, then run the mini project end-to-end.
- Adopt a simple manifest template for every dataset and tie it to your experiments.
- Take the skill exam below to check your readiness. Everyone can take it; logged-in users get saved progress.