Why this matters
In real ML systems, the positive class is often rare: fraud, churn, defects, failures, abusive content. A naive model can score high accuracy by always predicting the majority class, yet miss the very cases you care about. As a Machine Learning Engineer, you must design training, evaluation, and deployment that preserve minority signal and deliver business value.
- Flag fraudulent transactions without overwhelming analysts.
- Identify high-risk patients while keeping false alarms manageable.
- Catch defective items early in production pipelines.
Real tasks you may handle
- Build stratified training and cross-validation splits.
- Pick metrics that reflect costs (recall at fixed precision).
- Apply class weighting, resampling, or specialized losses.
- Tune decision thresholds for a target alert budget.
- Calibrate probabilities for stable thresholding in production.
Who this is for
Engineers training classifiers where positives are rare (1–20%). Useful for fraud detection, medical alerts, safety systems, content moderation, quality control, and churn prediction.
Prerequisites
- Comfort with binary classification basics (precision, recall, confusion matrix).
- Familiarity with an ML framework (scikit-learn, XGBoost/LightGBM, PyTorch/TensorFlow).
- Basic Python and data handling with Pandas/NumPy.
Concept explained simply
Imbalanced data means one class is much more common than the other. If positives are 1% and you always predict negative, you get 99% accuracy yet detect nothing. So we train and evaluate in ways that amplify signal from the minority class without overfitting to noise.
Mental model
Imagine panning for gold in a river. There is lots of sand (majority), little gold (minority). You need the right sieve (metrics), water flow (sampling/weights), and routine (cross-validation) so you keep more gold and less sand. Too coarse: you lose gold (low recall). Too fine: you keep too much sand (low precision). Your goal is a practical balance.
Core techniques (practical recipe)
- Stratify and baseline
- Use stratified train/validation splits to preserve class ratios.
- Compute naive baselines: predict-all-negative, predict-positive at the base rate (see the baseline sketch after this list).
- Pick business-aligned metrics
- Prefer PR AUC, F1, recall at fixed precision, or precision at fixed recall.
- ROC AUC can look good when the positive class is tiny; be cautious.
- Balance signal
- Class weights or sample weights in the loss function.
- Resampling: random under/over-sampling, SMOTE/variants (use inside CV).
- Algorithm knobs: e.g., scale_pos_weight for gradient boosting.
- Tune threshold and calibrate
- Choose a decision threshold that meets alert or cost constraints.
- Calibrate probabilities (Platt scaling or isotonic) if thresholds drift.
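Here is a minimal sketch of the baseline step above, assuming stratified X_train/X_val, y_train/y_val splits (the illustrative names used in the worked examples below) and 0/1 labels.
import numpy as np
from sklearn.metrics import accuracy_score

base_rate = y_train.mean()
print("Positive base rate:", base_rate)

# Predict-all-negative baseline: accuracy looks great, recall is zero.
all_negative = np.zeros_like(y_val)
print("All-negative accuracy:", accuracy_score(y_val, all_negative))  # roughly 1 - base_rate

# No-skill reference for PR AUC: an uninformative scorer sits at precision == prevalence,
# so a useful model should clearly beat average precision of about the base rate.
print("No-skill PR AUC reference:", y_val.mean())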
When to use what
- Class weights: First try; simple, robust, minimal leakage risk.
- Under-sampling: Fast when majority is huge; risk losing information.
- Over-sampling: Keeps all data; risk of overfitting duplicates.
- SMOTE/variants: Synthetic positives; powerful but use only within each CV fold.
- Algorithm-specific: Tree boosting often responds well to scale_pos_weight (see the sketch below).
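For the algorithm-specific route, a hedged sketch using XGBoost's scale_pos_weight, assuming the xgboost package and the same illustrative splits (LightGBM exposes a similarly named parameter); random under/over-sampling plug into the same imblearn Pipeline pattern shown in Example 3 via RandomUnderSampler or RandomOverSampler.
import numpy as np
from xgboost import XGBClassifier  # assumes xgboost is installed
from sklearn.metrics import average_precision_score

# Common heuristic: weight positives by the negative/positive ratio (y_train assumed 0/1 ints).
neg, pos = np.bincount(y_train)
xgb_model = XGBClassifier(
    n_estimators=300,
    learning_rate=0.1,
    scale_pos_weight=neg / pos,  # up-weights errors on the rare positive class
    eval_metric="aucpr",         # monitor PR AUC during training
    random_state=42,
)
xgb_model.fit(X_train, y_train)
print("PR AUC:", average_precision_score(y_val, xgb_model.predict_proba(X_val)[:, 1]))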
Worked examples
Example 1 — Stratified split and metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, average_precision_score
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
# Train a quick baseline (logistic regression here; XGBClassifier works too)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
proba = model.predict_proba(X_val)[:,1]
y_pred = (proba >= 0.5).astype(int)
print(classification_report(y_val, y_pred, digits=3))
print("PR AUC (Average Precision):", average_precision_score(y_val, proba))Why: Stratification preserves ratios. PR AUC reflects performance on the rare class better than plain accuracy.
Example 2 — Class weights
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight
import numpy as np
classes = np.unique(y_train)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
class_weight = {cls: w for cls, w in zip(classes, weights)}
clf = LogisticRegression(max_iter=200, class_weight=class_weight)
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_val)[:,1]
Why: The loss penalizes misclassifying minority cases more, shifting the model to pay attention to them.
Example 3 — SMOTE inside cross-validation
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import make_scorer, average_precision_score
from sklearn.ensemble import RandomForestClassifier
smote = SMOTE(k_neighbors=5, random_state=42)
rf = RandomForestClassifier(n_estimators=200, random_state=42)
pipeline = Pipeline(steps=[('smote', smote), ('rf', rf)])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# PR AUC scorer; on scikit-learn < 1.4 use make_scorer(average_precision_score, needs_proba=True)
pr_auc_scorer = make_scorer(average_precision_score, response_method='predict_proba')
scores = cross_val_score(pipeline, X, y, scoring=pr_auc_scorer, cv=cv)
print("Mean PR AUC:", scores.mean())Why: Applying SMOTE within each fold avoids training on synthetic points derived from validation data (leakage).
Example 4 — Threshold tuning to meet alert budget
import numpy as np
from sklearn.metrics import precision_recall_curve
proba = model.predict_proba(X_val)[:,1]
prec, rec, thr = precision_recall_curve(y_val, proba)
# Suppose you can handle precision ≥ 0.9
ok_idx = np.where(prec[:-1] >= 0.90)[0] # ignore last point as it has no threshold
best = ok_idx[np.argmax(rec[ok_idx])] # highest recall under the precision constraint
chosen_thr = thr[best]
print("Chosen threshold:", chosen_thr, "Precision:", prec[best], "Recall:", rec[best])Why: Pick a threshold that satisfies business constraints instead of relying on 0.5.
Exercises
Do these hands-on tasks, then check your work against the checklist and self-check questions below.
Exercise 1 — Weighted loss and thresholding
Goal: Train a weighted classifier, report PR AUC and F1, and choose a threshold to achieve recall ≥ 0.75 if possible.
- Split with stratification.
- Compute balanced class weights and train LogisticRegression.
- Report PR AUC and F1 at threshold 0.5.
- Use precision_recall_curve to find the highest threshold that still gets recall ≥ 0.75. Report precision/recall at that threshold.
Hint
Use compute_class_weight and classification_report, then scan the precision-recall curve.
Exercise 2 — SMOTE in CV vs. data leakage
Goal: Compare vanilla model vs. SMOTE-in-pipeline using PR AUC via 5-fold Stratified CV.
- Build Model A: RandomForestClassifier with default settings.
- Build Model B: Pipeline(SMOTE → RandomForestClassifier).
- Evaluate both with cross_val_score on PR AUC.
- Explain why fitting SMOTE before the split would inflate scores.
Hint
Use imblearn.pipeline.Pipeline and a probability-based PR AUC scorer (e.g., the built-in 'average_precision' scoring string, or make_scorer(average_precision_score) configured to use predict_proba).
- [Checklist] You used stratified splits.
- [Checklist] You reported PR AUC.
- [Checklist] You avoided leakage by embedding SMOTE in CV.
- [Checklist] You tuned threshold to a business constraint, not to 0.5.
Common mistakes and self-check
- Mistake: Using accuracy or ROC AUC alone. Self-check: Do you report PR AUC or precision/recall at a target?
- Mistake: Oversampling the entire dataset before CV (leakage). Self-check: Is resampling inside each fold via a Pipeline?
- Mistake: Ignoring decision thresholds. Self-check: Did you compute a threshold that satisfies cost/alert constraints?
- Mistake: Comparing models at different class priors. Self-check: Are splits stratified and random_state fixed?
- Mistake: Uncalibrated probabilities in production. Self-check: Did you evaluate calibration (e.g., a reliability curve) and consider Platt scaling or isotonic regression? A minimal calibration sketch follows this list.
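For that last self-check, a minimal calibration sketch, assuming the illustrative X_train/X_val split from the worked examples; CalibratedClassifierCV expects an unfitted estimator and refits it on internal folds.
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss

# 'sigmoid' is Platt scaling; 'isotonic' is more flexible but needs plenty of positives.
base = RandomForestClassifier(n_estimators=200, class_weight='balanced', random_state=42)
calibrated = CalibratedClassifierCV(base, method='sigmoid', cv=5)
calibrated.fit(X_train, y_train)

proba = calibrated.predict_proba(X_val)[:, 1]
print("Brier score:", brier_score_loss(y_val, proba))

# Reliability curve: mean predicted probability vs. observed positive fraction per bin.
frac_pos, mean_pred = calibration_curve(y_val, proba, n_bins=10)
for p, f in zip(mean_pred, frac_pos):
    print(f"predicted {p:.2f} -> observed {f:.2f}")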
Practical projects
- Fraud alerts: Maximize recall at precision ≥ 0.9 with class weights; add SMOTE if recall plateaus.
- Churn prediction: Optimize PR AUC; deliver top-N customers with calibrated probabilities.
- Manufacturing defects: Under-sample majority to speed training; threshold for a fixed daily review budget.
Mini tasks you can add
- Estimate base rate and compute naive PR AUC baseline.
- Try focal loss in a deep model and compare to class weights (a PyTorch-style sketch follows this list).
- Stress-test thresholds when class prior shifts ±50%.
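For the focal-loss mini task, here is a minimal PyTorch-style sketch of binary focal loss, assuming raw logits and 0/1 float targets; alpha and gamma are the usual knobs, and exact formulations vary slightly across papers and libraries.
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss on raw logits; targets are 0./1. floats."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')  # = -log(p_t)
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balancing factor
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()       # (1 - p_t)^gamma down-weights easy examples

# Quick smoke test on random data (illustrative only)
logits = torch.randn(8)
targets = torch.randint(0, 2, (8,)).float()
print(binary_focal_loss(logits, targets))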
Learning path
- Master evaluation: confusion matrix, precision/recall, PR AUC, cost-sensitive metrics.
- Apply class weights and algorithm-specific imbalance parameters.
- Use resampling properly within CV (SMOTE/under/over sampling).
- Tune thresholds to constraints; calibrate probabilities.
- Monitor in production: drift in class prior and threshold performance.
Next steps
- Complete the exercises above on a dataset you know.
- Take the quick test below to check your understanding.
- Apply these techniques to your active project and record metric changes (PR AUC, recall at precision constraint).
Mini challenge
You have an alert budget of 100 per day and a validation set with 20,000 samples, base rate 1%. Your model outputs probabilities. Design a simple procedure to pick a threshold that yields approximately 100 positives while keeping precision ≥ 0.85. Write the steps, then implement them with a precision-recall curve and a sweep of thresholds.