
Handling Imbalanced Data

Learn how to handle imbalanced data for free, with explanations, exercises, and a quick test (aimed at Machine Learning Engineers).

Published: January 1, 2026 | Updated: January 1, 2026

Why this matters

In real ML systems, the positive class is often rare: fraud, churn, defects, failures, abusive content. A naive model can score high accuracy by always predicting the majority class, yet miss the very cases you care about. As a Machine Learning Engineer, you must design training, evaluation, and deployment so that they preserve the minority signal and deliver business value.

  • Flag fraudulent transactions without overwhelming analysts.
  • Identify high-risk patients while keeping false alarms manageable.
  • Catch defective items early in production pipelines.
Real tasks you may handle
  • Build stratified training and cross-validation splits.
  • Pick metrics that reflect costs (recall at fixed precision).
  • Apply class weighting, resampling, or specialized losses.
  • Tune decision thresholds for a target alert budget.
  • Calibrate probabilities for stable thresholding in production.

Who this is for

Engineers training classifiers where positives are rare (1–20%). Useful for fraud detection, medical alerts, safety systems, content moderation, quality control, and churn prediction.

Prerequisites

  • Comfort with binary classification basics (precision, recall, confusion matrix).
  • Familiarity with an ML framework (scikit-learn, XGBoost/LightGBM, PyTorch/TensorFlow).
  • Basic Python and data handling with Pandas/NumPy.

Concept explained simply

Imbalanced data means one class is much more common than the other. If positives are 1% and you always predict negative, you get 99% accuracy yet detect nothing. So we train and evaluate in ways that amplify the minority-class signal without overfitting to noise.
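
To see this in numbers, here is a minimal sketch with made-up labels (10,000 samples, 1% positive; the array names are illustrative):
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 10,000 made-up labels with a 1% positive rate
y_demo = np.zeros(10_000, dtype=int)
y_demo[:100] = 1
y_pred = np.zeros_like(y_demo)  # a "model" that always predicts the majority class

print("Accuracy:", accuracy_score(y_demo, y_pred))  # 0.99 -- looks impressive
print("Recall:", recall_score(y_demo, y_pred))      # 0.0  -- finds no positives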

Mental model

Imagine panning for gold in a river. There is lots of sand (majority), little gold (minority). You need the right sieve (metrics), water flow (sampling/weights), and routine (cross-validation) so you keep more gold and less sand. Too coarse: you lose gold (low recall). Too fine: you keep too much sand (low precision). Your goal is a practical balance.

Core techniques (practical recipe)

  1. Stratify and baseline
    • Use stratified train/validation splits to preserve class ratios.
    • Compute naive baselines: predict-all-negative, predict-positive at base rate.
  2. Pick business-aligned metrics
    • Prefer PR AUC, F1, recall at fixed precision, or precision at fixed recall.
    • ROC AUC can look good when the positive class is tiny; be cautious.
  3. Balance signal
    • Class weights or sample weights in the loss function.
    • Resampling: random under/over-sampling, SMOTE/variants (use inside CV).
    • Algorithm knobs: e.g., scale_pos_weight for gradient boosting.
  4. Tune threshold and calibrate
    • Choose a decision threshold that meets alert or cost constraints.
    • Calibrate probabilities (Platt scaling or isotonic) if thresholds drift.
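
For step 4, a minimal calibration sketch using scikit-learn's CalibratedClassifierCV (assuming the stratified X_train/y_train, X_val/y_val split from the worked examples below):
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression

# Fit the base model with internal 5-fold CV and remap its scores to calibrated probabilities.
# Isotonic regression is flexible; method='sigmoid' (Platt scaling) is safer on small datasets.
calibrated = CalibratedClassifierCV(LogisticRegression(max_iter=1000, class_weight='balanced'),
                                    method='isotonic', cv=5)
calibrated.fit(X_train, y_train)
proba_cal = calibrated.predict_proba(X_val)[:, 1]  # use these calibrated scores for thresholding
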
When to use what
  • Class weights: First try; simple, robust, minimal leakage risk.
  • Under-sampling: Fast when majority is huge; risk losing information.
  • Over-sampling: Keeps all data; risk of overfitting duplicates.
  • SMOTE/variants: Synthetic positives; powerful but use only within each CV fold.
  • Algorithm-specific: Tree boosting often responds well to scale_pos_weight.
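
To illustrate the last point, a minimal sketch with XGBoost (assuming the xgboost package and the same X_train/y_train, X_val split used in the worked examples):
import numpy as np
from xgboost import XGBClassifier

# Common heuristic: scale_pos_weight = number of negatives / number of positives
n_pos = int(np.sum(y_train == 1))
n_neg = int(np.sum(y_train == 0))

xgb = XGBClassifier(n_estimators=300, scale_pos_weight=n_neg / n_pos)
xgb.fit(X_train, y_train)
proba_xgb = xgb.predict_proba(X_val)[:, 1]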

Worked examples

Example 1 — Stratified split and metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, average_precision_score

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  stratify=y, random_state=42)
# Quick baseline model (swap in XGBClassifier or another estimator if you prefer)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
proba = model.predict_proba(X_val)[:,1]
y_pred = (proba >= 0.5).astype(int)

print(classification_report(y_val, y_pred, digits=3))
print("PR AUC (Average Precision):", average_precision_score(y_val, proba))

Why: Stratification preserves ratios. PR AUC reflects performance on the rare class better than plain accuracy.
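
A useful reference point: the average precision of a chance-level model is roughly the positive base rate, so compare the printed PR AUC against it. One line, assuming y_val from above:
print("Chance-level PR AUC (base rate):", y_val.mean())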

Example 2 — Class weights
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight
import numpy as np

classes = np.unique(y_train)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
class_weight = {cls: w for cls, w in zip(classes, weights)}

clf = LogisticRegression(max_iter=200, class_weight=class_weight)
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_val)[:,1]

Why: The loss penalizes misclassifying minority cases more, shifting the model to pay attention to them.
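
Note that scikit-learn can derive the same weights internally; a shorter equivalent, assuming the same X_train/y_train:
from sklearn.linear_model import LogisticRegression

# class_weight='balanced' computes n_samples / (n_classes * class_count) for each class at fit time
clf = LogisticRegression(max_iter=200, class_weight='balanced')
clf.fit(X_train, y_train)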

Example 3 — SMOTE inside cross-validation
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import make_scorer, average_precision_score
from sklearn.ensemble import RandomForestClassifier

smote = SMOTE(k_neighbors=5, random_state=42)
rf = RandomForestClassifier(n_estimators=200, random_state=42)
pipeline = Pipeline(steps=[('smote', smote), ('rf', rf)])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
pr_auc_scorer = make_scorer(average_precision_score, needs_proba=True)

scores = cross_val_score(pipeline, X, y, scoring=pr_auc_scorer, cv=cv)
print("Mean PR AUC:", scores.mean())

Why: Applying SMOTE within each fold avoids training on synthetic points derived from validation data (leakage).

Example 4 — Threshold tuning to meet alert budget
import numpy as np
from sklearn.metrics import precision_recall_curve

proba = model.predict_proba(X_val)[:,1]
prec, rec, thr = precision_recall_curve(y_val, proba)
# Suppose you can handle precision ≥ 0.9
ok_idx = np.where(prec[:-1] >= 0.90)[0]  # ignore last point as it has no threshold
best = ok_idx[np.argmax(rec[ok_idx])]  # highest recall under the precision constraint
chosen_thr = thr[best]
print("Chosen threshold:", chosen_thr, "Precision:", prec[best], "Recall:", rec[best])

Why: Pick a threshold that satisfies business constraints instead of relying on 0.5.
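
A related pattern applies when the constraint is an alert budget rather than a precision floor; a sketch assuming a hypothetical budget of 500 alerts on this validation set:
import numpy as np

budget = 500                            # hypothetical number of alerts analysts can review
order = np.argsort(proba)[::-1]         # sample indices, highest score first
thr_budget = proba[order[budget - 1]]   # threshold that raises roughly `budget` alerts
alerts = proba >= thr_budget
print("Alerts:", alerts.sum(),
      "Precision among alerts:", np.asarray(y_val)[alerts].mean())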

Exercises

Do these hands-on tasks. You can compare your answers with the solutions provided. Everyone can attempt; logged-in users get saved progress.

Exercise 1 — Weighted loss and thresholding

Goal: Train a weighted classifier, report PR AUC and F1, and choose a threshold to achieve recall ≥ 0.75 if possible.

  1. Split with stratification.
  2. Compute balanced class weights and train LogisticRegression.
  3. Report PR AUC and F1 at threshold 0.5.
  4. Use precision_recall_curve to find the highest threshold that still gets recall ≥ 0.75. Report precision/recall at that threshold.
Hint

Use compute_class_weight and classification_report, then scan the precision-recall curve.

Show reference solution
# See the Exercises panel solution for full code.

Exercise 2 — SMOTE in CV vs. data leakage

Goal: Compare vanilla model vs. SMOTE-in-pipeline using PR AUC via 5-fold Stratified CV.

  1. Build Model A: RandomForestClassifier with default settings.
  2. Build Model B: Pipeline(SMOTE → RandomForestClassifier).
  3. Evaluate both with cross_val_score on PR AUC.
  4. Explain why fitting SMOTE before the split would inflate scores.
Hint

Use imblearn.pipeline.Pipeline and make_scorer(average_precision_score).

Show reference solution
# See the Exercises panel solution for full code.
  • [Checklist] You used stratified splits.
  • [Checklist] You reported PR AUC.
  • [Checklist] You avoided leakage by embedding SMOTE in CV.
  • [Checklist] You tuned threshold to a business constraint, not to 0.5.

Common mistakes and self-check

  • Mistake: Using accuracy or ROC AUC alone. Self-check: Do you report PR AUC or precision/recall at a target?
  • Mistake: Oversampling the entire dataset before CV (leakage). Self-check: Is resampling inside each fold via a Pipeline?
  • Mistake: Ignoring decision thresholds. Self-check: Did you compute a threshold that satisfies cost/alert constraints?
  • Mistake: Comparing models at different class priors. Self-check: Are splits stratified and random_state fixed?
  • Mistake: Uncalibrated probabilities in production. Self-check: Did you evaluate calibration (e.g., reliability curve) and consider Platt or isotonic?
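
For the last self-check, a minimal reliability check with scikit-learn's calibration_curve, assuming the validation probabilities proba and labels y_val from the worked examples:
from sklearn.calibration import calibration_curve

# Bucket predictions into 10 bins and compare mean predicted probability to the observed positive rate.
prob_true, prob_pred = calibration_curve(y_val, proba, n_bins=10)
for pred, obs in zip(prob_pred, prob_true):
    print(f"predicted {pred:.2f} -> observed {obs:.2f}")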

Practical projects

  • Fraud alerts: Maximize recall at precision ≥ 0.9 with class weights; add SMOTE if recall plateaus.
  • Churn prediction: Optimize PR AUC; deliver top-N customers with calibrated probabilities.
  • Manufacturing defects: Under-sample majority to speed training; threshold for a fixed daily review budget.
Mini tasks you can add
  • Estimate base rate and compute naive PR AUC baseline.
  • Try focal loss in a deep model and compare to class weights (see the sketch after this list).
  • Stress-test thresholds when class prior shifts ±50%.
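
For the focal-loss task, a minimal PyTorch sketch of one common binary formulation (the alpha and gamma values are illustrative defaults):
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Down-weights easy examples so hard, often minority-class, examples dominate the gradient.
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-dependent weighting
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

# Tiny usage example with made-up logits and labels
logits = torch.tensor([2.0, -1.0, 0.5])
targets = torch.tensor([1.0, 0.0, 1.0])
print(binary_focal_loss(logits, targets))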

Learning path

  1. Master evaluation: confusion matrix, precision/recall, PR AUC, cost-sensitive metrics.
  2. Apply class weights and algorithm-specific imbalance parameters.
  3. Use resampling properly within CV (SMOTE/under/over sampling).
  4. Tune thresholds to constraints; calibrate probabilities.
  5. Monitor in production: drift in class prior and threshold performance.

Next steps

  • Complete the exercises above on a dataset you know.
  • Take the quick test below to check understanding. Everyone can take it; logged-in users get saved progress.
  • Apply these techniques to your active project and record metric changes (PR AUC, recall at precision constraint).

Mini challenge

You have an alert budget of 100 per day and a validation set with 20,000 samples, base rate 1%. Your model outputs probabilities. Design a simple procedure to pick a threshold that yields approximately 100 positives while keeping precision ≥ 0.85. Write the steps, then implement them with a precision-recall curve and a sweep of thresholds.

Practice Exercises

2 exercises to complete

Instructions

Train a LogisticRegression with balanced class weights, evaluate PR AUC and F1 at threshold 0.5, then choose a threshold that achieves recall ≥ 0.75 if feasible.

  1. Stratified split.
  2. Compute class_weight='balanced' or use compute_class_weight.
  3. Report PR AUC and F1 at 0.5.
  4. Use precision_recall_curve to find a threshold giving recall ≥ 0.75; report precision/recall.
Expected Output
Text summary including: PR AUC, F1@0.5, chosen threshold, precision and recall at the chosen threshold. If recall ≥ 0.75 is not achievable, state the best recall at precision ≥ your target.

Handling Imbalanced Data — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.


Ask questions about this tool