Who this is for
Applied Scientists and ML practitioners building real-world classifiers, ranking systems, or detectors where positives are rare and data can be messy.
Prerequisites
- Comfort with train/validation/test splits and cross-validation
- Basic understanding of classification metrics (precision, recall, F1)
- Ability to train baseline models (logistic regression, tree-based models, simple neural nets)
Why this matters
- Fraud/abuse detection: positive class is rare; false positives are costly.
- Medical triage: prioritize recall for critical cases while keeping precision acceptable.
- Anomaly and risk scoring: thresholds decide real-world actions and budgets.
- User-generated data: mislabeled items and outliers are common and can mislead models.
Concept explained simply
Imbalance means one class appears much less often than others. If you only optimize accuracy, a model can predict the majority class and look good while missing almost all positives. Noise means some labels or features are wrong or untrustworthy. Both issues distort learning and metrics if not handled deliberately.
Mental model
- Signal vs. noise: Your job is to amplify true signal (useful patterns) and dampen noise (errors/outliers).
- Cost-aware decisions: Not all mistakes cost the same; tune metrics and thresholds to match business impact.
- Data curation loop: Train → diagnose errors → clean or reweight → retrain. Iterate.
Key techniques
Metrics that reflect imbalance
- Use Precision-Recall curves, AUPRC, F1 (macro for multi-class), balanced accuracy, recall at fixed precision.
- Report class-specific metrics and confusion matrices, not just accuracy or AUROC.
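A minimal evaluation sketch using scikit-learn (the labels and scores below are made up for illustration):
```python
import numpy as np
from sklearn.metrics import (average_precision_score, balanced_accuracy_score,
                             classification_report, confusion_matrix)

# Hypothetical validation labels and model scores.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1])
y_score = np.array([0.10, 0.20, 0.15, 0.30, 0.05, 0.40, 0.20, 0.90, 0.35, 0.70])

# AUPRC (average precision) summarizes the precision-recall curve.
print("AUPRC:", average_precision_score(y_true, y_score))

# Class-specific metrics and the confusion matrix at a provisional 0.5 threshold.
y_pred = (y_score >= 0.5).astype(int)
print(classification_report(y_true, y_pred, digits=3))
print(confusion_matrix(y_true, y_pred))
print("Balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
```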
Resampling
- Undersampling: Reduce majority-class examples to balance the classes. Fast, but risks discarding useful information.
- Oversampling: Duplicate or synthesize minority-class examples (e.g., SMOTE, ADASYN). Risks overfitting; validate carefully.
- Do resampling inside cross-validation folds to avoid leakage.
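One way to keep resampling inside the folds is an imbalanced-learn pipeline, as in this sketch (assumes the imbalanced-learn package; the dataset is synthetic):
```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic dataset with roughly 2% positives for illustration.
X, y = make_classification(n_samples=5000, weights=[0.98], flip_y=0.01,
                           random_state=0)

# SMOTE sits inside the pipeline, so it is refit on each training fold only
# and never sees the corresponding validation fold.
pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="average_precision")
print("AUPRC per fold:", scores.round(3))
```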
Class weights and loss design
- Class weights: Increase penalty for minority errors. Often a strong baseline.
- Focal loss: Down-weights easy examples to focus learning on hard/rare cases.
- Cost-sensitive tuning: Encode business costs into the loss or post-hoc thresholding.
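A sketch of both ideas: class weights via scikit-learn's built-in option, and a binary focal loss written out in NumPy (the alpha/gamma defaults below are common choices, not tuned values):
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Class weights: scikit-learn can derive them from class frequencies.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
# clf.fit(X_train, y_train)  # hypothetical training split

def focal_loss(y_true, p, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss (sketch): down-weights easy, well-classified examples.

    y_true: 0/1 labels; p: predicted probability of class 1.
    alpha weights the positive class; gamma controls focus on hard examples.
    """
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y_true == 1, p, 1 - p)            # probability of the true class
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)
    return -np.mean(alpha_t * (1 - p_t) ** gamma * np.log(p_t))
```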
Threshold tuning
- Optimize threshold on validation to hit targets like “Precision ≥ 0.9” or maximize Fβ given business priorities.
- Use PR curve and operating points; calibrate probabilities first if needed.
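A sketch of constraint-based threshold selection on a validation set (labels and scores are whatever your model produced on validation):
```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def threshold_for_precision(y_true, y_score, min_precision=0.9):
    """Lowest validation threshold whose precision meets the target.

    Recall never increases as the threshold rises, so this choice maximizes
    recall subject to the precision constraint.
    """
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    ok = precision[:-1] >= min_precision      # align: one fewer threshold than PR points
    if not ok.any():
        return None                           # target precision is unreachable here
    i = int(np.argmax(ok))                    # first (lowest) qualifying threshold
    return thresholds[i], precision[i], recall[i]
```
Freeze the chosen threshold before touching the test set, and apply it there only once.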
Handling noisy labels
- Spot suspicious labels: high per-example loss, disagreement across models, or annotator inconsistency.
- Actions: relabel a small ‘golden’ subset, down-weight noisy examples, or filter with loss-based pruning.
- Early stopping and regularization reduce memorization of noise.
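A loss-based flagging sketch using out-of-fold predictions (the model choice and the 2% flag rate are placeholders; assumes integer class labels 0..K-1):
```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_predict

def flag_suspicious_labels(X, y, top_fraction=0.02):
    """Rank examples by out-of-fold log loss and flag the worst ones (sketch).

    High per-example loss often means a mislabeled or out-of-distribution item;
    flagged indices should go to human review, not automatic relabeling.
    """
    proba = cross_val_predict(GradientBoostingClassifier(), X, y,
                              cv=5, method="predict_proba")
    eps = 1e-7
    p_true = np.clip(proba[np.arange(len(y)), y], eps, 1.0)  # prob of labeled class
    per_example_loss = -np.log(p_true)
    n_flag = max(1, int(top_fraction * len(y)))
    return np.argsort(per_example_loss)[::-1][:n_flag]
```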
Feature noise and outliers
- Robust preprocessing: clipping/winsorization, robust scalers (median/IQR), text/image normalization.
- Outlier handling: isolation forests or simple rules to flag extreme values for review.
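A sketch of robust preprocessing plus outlier flagging (the 1%/99% clip points and the 1% contamination rate are illustrative, not recommendations):
```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
X[:5] *= 25  # a few extreme rows to illustrate

# Winsorize: clip each feature to its 1st/99th percentile.
low, high = np.percentile(X, [1, 99], axis=0)
X_clipped = np.clip(X, low, high)

# Robust scaling uses the median and IQR, so extremes barely move the statistics.
X_scaled = RobustScaler().fit_transform(X_clipped)

# Flag (don't silently drop) likely outliers for manual review.
flags = IsolationForest(contamination=0.01, random_state=0).fit_predict(X)
review_idx = np.where(flags == -1)[0]
```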
Cross-validation strategy
- Use stratified k-fold to maintain class proportions.
- Use group-aware splitting when the same user, session, or duplicated item produces multiple rows, so one entity never appears in both train and validation folds.
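A sketch of both splitters in scikit-learn (StratifiedGroupKFold requires a reasonably recent version; `user_ids` is a hypothetical group column):
```python
from sklearn.model_selection import StratifiedKFold, StratifiedGroupKFold

# Stratified folds keep the rare-class proportion in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Group-aware splitting keeps all rows from one user/session in a single fold.
group_cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=0)
# for train_idx, val_idx in group_cv.split(X, y, groups=user_ids): ...
```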
Calibration
- Calibrate outputs (Platt scaling, isotonic) to get reliable probabilities before thresholding or ranking.
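A minimal calibration sketch with scikit-learn; the wrapped LinearSVC stands in for any model whose raw scores are not well-calibrated probabilities:
```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

# "sigmoid" is Platt scaling; "isotonic" is non-parametric and needs more data.
# Calibration is fit on internal cross-validation folds.
calibrated = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5)
# calibrated.fit(X_train, y_train); calibrated.predict_proba(X_val)  # hypothetical splits
```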
Monitoring after deployment
- Track class priors, score distributions, precision/recall at your operating threshold, and drift signals.
- Maintain a labeled trickle (golden set) to continuously assess quality.
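A small monitoring sketch comparing a live window to a reference window (the KS test here is one of several reasonable drift signals; all names are illustrative):
```python
import numpy as np
from scipy.stats import ks_2samp

def drift_report(ref_scores, live_scores, ref_prior, live_labels=None):
    """Compare live score distribution and class prior to a reference window (sketch)."""
    ks = ks_2samp(ref_scores, live_scores)
    report = {"ks_statistic": ks.statistic, "ks_pvalue": ks.pvalue}
    if live_labels is not None:              # e.g., from the labeled golden trickle
        report["prior_shift"] = float(np.mean(live_labels)) - ref_prior
    return report
```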
Worked examples
Example 1: Fraud detection (binary, rare positives)
- Baseline: logistic regression + class weights; stratified CV; metric: AUPRC and Recall@Precision≥0.9.
- Tune threshold to achieve Precision ≥ 0.9 on validation. Expect recall to drop; trade off consciously.
- Add focal loss or oversample minority if recall is too low; re-evaluate on PR curve.
Result pattern: AUROC can be high but AUPRC and recall@high-precision are the deciding metrics.
Example 2: Noisy labels in review classification
- Train a simple model; track per-example losses over epochs.
- Flag items with persistently high loss; sample and audit 50 of them to estimate noise rate.
- Down-weight or relabel flagged items; early stop to avoid memorizing noise; retrain.
Result pattern: Small relabeling of a targeted subset often yields larger gains than extra epochs.
Example 3: Long-tail multi-class support topics
- Metric: macro-F1 and per-class recall to protect tail classes.
- Downsample head classes + oversample minority; apply class weights during training.
- For one-vs-rest outputs, tune per-class thresholds to meet minimum precision per topic.
Result pattern: Macro-F1 rises even if overall accuracy changes only slightly.
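For Example 3's per-class thresholds, here is a sketch assuming one-vs-rest scores in a matrix `proba` of shape (n_samples, n_classes) and integer labels:
```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def per_class_thresholds(y_true, proba, min_precision=0.8):
    """For each class, pick the lowest one-vs-rest threshold that meets the
    precision floor, which maximizes that class's recall (sketch)."""
    out = {}
    for c in range(proba.shape[1]):
        y_bin = (y_true == c).astype(int)
        precision, recall, thr = precision_recall_curve(y_bin, proba[:, c])
        ok = precision[:-1] >= min_precision
        out[c] = float(thr[np.argmax(ok)]) if ok.any() else None  # None: floor unreachable
    return out
```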
Exercises
Note: Everyone can attempt the exercises and view solutions. If you log in, your progress will be saved.
Exercise 1 — Threshold tuning for precision/recall
Dataset (score, label) for 20 items:
[(0.97,1),(0.92,1),(0.89,0),(0.77,1),(0.65,0),(0.62,0),(0.58,0),(0.54,0),(0.49,0),(0.45,0),(0.43,0),(0.39,0),(0.36,0),(0.33,0),(0.30,0),(0.27,0),(0.20,0),(0.12,0),(0.08,0),(0.03,0)]
- Task: Choose a threshold that achieves Precision ≥ 0.9 and Recall ≥ 0.5.
- Compute the confusion matrix (TP, FP, FN, TN) at your chosen threshold.
Hints
- Sort by score and evaluate precision/recall at each unique score as a candidate threshold.
- Recall = TP / Positives (there are 3 positives). Precision = TP / (TP + FP).
Solution
Set threshold at 0.92. Predictions ≥ 0.92 are positives: (0.97,1) and (0.92,1). TP=2, FP=0, FN=1, TN=17. Precision=1.0, Recall=2/3≈0.667. Both targets satisfied.
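A few lines to verify the arithmetic (the list is the dataset above):
```python
data = [(0.97,1),(0.92,1),(0.89,0),(0.77,1),(0.65,0),(0.62,0),(0.58,0),(0.54,0),
        (0.49,0),(0.45,0),(0.43,0),(0.39,0),(0.36,0),(0.33,0),(0.30,0),(0.27,0),
        (0.20,0),(0.12,0),(0.08,0),(0.03,0)]

threshold = 0.92
tp = sum(1 for s, y in data if s >= threshold and y == 1)
fp = sum(1 for s, y in data if s >= threshold and y == 0)
fn = sum(1 for s, y in data if s < threshold and y == 1)
tn = sum(1 for s, y in data if s < threshold and y == 0)
print(tp, fp, fn, tn)                  # 2 0 1 17
print("precision:", tp / (tp + fp))    # 1.0
print("recall:", tp / (tp + fn))       # 0.666...
```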
Exercise 2 — Identify likely mislabeled items
1D dataset (x, label) with k=3 nearest neighbors:
[(0.02,0),(0.05,0),(0.07,0),(0.09,1),(0.10,0),(0.90,1),(0.92,1),(0.94,0),(0.95,1),(0.97,1),(0.98,1),(0.99,1)]
- Task: Using a simple k-NN majority check, flag indices most likely mislabeled and propose a cleaning action.
Hints
- Look at items near 0.09 and 0.94; compare their labels with their 3 nearest neighbors.
- If label ≠ local majority, consider relabeling or down-weighting.
Solution
Flag (0.09,1) and (0.94,0). Both disagree with nearby neighbors. Action: send for relabeling; if unable, down-weight in training or exclude after manual spot-check.
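A quick script that reproduces the k-NN majority check (breaking distance ties by index order is an arbitrary choice):
```python
points = [(0.02,0),(0.05,0),(0.07,0),(0.09,1),(0.10,0),(0.90,1),
          (0.92,1),(0.94,0),(0.95,1),(0.97,1),(0.98,1),(0.99,1)]

flagged = []
for i, (x, y) in enumerate(points):
    # Three nearest neighbors by distance, excluding the point itself.
    neighbors = sorted((j for j in range(len(points)) if j != i),
                       key=lambda j: abs(points[j][0] - x))[:3]
    majority = 1 if sum(points[j][1] for j in neighbors) >= 2 else 0
    if y != majority:
        flagged.append((i, x, y))

print(flagged)  # the items at x=0.09 (label 1) and x=0.94 (label 0)
```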
Checklist before you ship
- Stratified split used; no leakage across folds
- Primary metrics match business goals (e.g., Recall@Precision≥X)
- Threshold tuned on validation; probabilities calibrated if needed
- Resampling/weights applied within CV only
- Noisy labels audited; high-loss items reviewed or down-weighted
- Outliers handled with robust preprocessing
- Per-class performance reported; tails not ignored
- Deployment monitoring plan defined (drift, golden set, alerting)
Common mistakes
- Optimizing accuracy in extreme imbalance → trivial all-negative model looks good
- Evaluating before threshold tuning and calibration → misleading metrics
- Resampling before splitting → leakage inflates results
- Ignoring label noise → the model memorizes errors
- Using only AUROC → hides poor precision in rare classes
How to self-check
- Compare AUROC and AUPRC; a high AUROC alongside a much lower AUPRC suggests imbalance is hiding poor precision on the rare class.
- Plot PR curve and mark your operating point; does it meet business targets?
- Inspect top-loss examples; do they look mislabeled or out-of-distribution?
- Re-run CV with and without resampling; consistent gains indicate real improvements.
Practical projects
- Fraud-like simulation: Create a 1% positive dataset; train weighted logistic regression; tune threshold for Precision ≥ 0.95 and report Recall.
- Long-tail text topics: Build a multi-class classifier; compare micro vs macro-F1; apply class weights and per-class thresholds.
- Noisy label cleanup: Take a small image or text set, inject 10% label noise, use loss-based filtering and partial relabeling, and measure recovery.
Learning path
- Master imbalance-aware metrics (PR curves, AUPRC, macro-F1).
- Practice stratified CV and leakage-safe resampling.
- Apply class weights and focal loss; compare to oversampling.
- Tune thresholds to business constraints and calibrate probabilities.
- Detect and mitigate noisy labels and outliers.
- Design a monitoring plan and golden-set evaluation post-deployment.
Mini challenge
Given a validation PR curve and a cost rule “False Positive costs 1, False Negative costs 10,” choose an operating point and justify it in two sentences.
Next steps
Take the Quick Test below to check your understanding. The test is available to everyone; log in if you want your progress to be saved.