Why this matters
In real ML systems, the positive class is often rare: fraud, churn, defects, failures, abusive content. A naive model can score high accuracy by always predicting the majority class, yet miss the very cases you care about. As a Machine Learning Engineer, you must design training, evaluation, and deployment that preserve minority signal and deliver business value.
- Flag fraudulent transactions without overwhelming analysts.
- Identify high-risk patients while keeping false alarms manageable.
- Catch defective items early in production pipelines.
Real tasks you may handle
- Build stratified training and cross-validation splits.
- Pick metrics that reflect costs (recall at fixed precision).
- Apply class weighting, resampling, or specialized losses.
- Tune decision thresholds for a target alert budget.
- Calibrate probabilities for stable thresholding in production.
Who this is for
Engineers training classifiers where positives are rare (1–20%). Useful for fraud detection, medical alerts, safety systems, content moderation, quality control, and churn prediction.
Prerequisites
- Comfort with binary classification basics (precision, recall, confusion matrix).
- Familiarity with an ML framework (scikit-learn, XGBoost/LightGBM, PyTorch/TensorFlow).
- Basic Python and data handling with Pandas/NumPy.
Concept explained simply
Imbalanced data means one class is much more common than the other. If positives are 1% and you always predict negative, you get 99% accuracy yet detect nothing. So we train and evaluate in ways that amplify signal from the minority class without overfitting to noise.
Mental model
Imagine panning for gold in a river. There is lots of sand (majority), little gold (minority). You need the right sieve (metrics), water flow (sampling/weights), and routine (cross-validation) so you keep more gold and less sand. Too coarse: you lose gold (low recall). Too fine: you keep too much sand (low precision). Your goal is a practical balance.
Core techniques (practical recipe)
- Stratify and baseline
- Use stratified train/validation splits to preserve class ratios.
- Compute naive baselines: predict-all-negative, predict-positive at the base rate (see the baseline sketch after this list).
- Pick business-aligned metrics
- Prefer PR AUC, F1, recall at fixed precision, or precision at fixed recall.
- ROC AUC can look good when the positive class is tiny; be cautious.
- Balance signal
- Class weights or sample weights in the loss function.
- Resampling: random under/over-sampling, SMOTE/variants (use inside CV).
- Algorithm knobs: e.g., scale_pos_weight for gradient boosting.
- Tune threshold and calibrate
- Choose a decision threshold that meets alert or cost constraints.
- Calibrate probabilities (Platt scaling or isotonic) if thresholds drift.
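Here is a minimal sketch of the baseline step above, assuming stratified X_train/X_val, y_train/y_val splits (the illustrative names used in the worked examples below) and 0/1 labels.
import numpy as np
from sklearn.metrics import accuracy_score

base_rate = y_train.mean()
print("Positive base rate:", base_rate)

# Predict-all-negative baseline: accuracy looks great, recall is zero.
all_negative = np.zeros_like(y_val)
print("All-negative accuracy:", accuracy_score(y_val, all_negative))  # roughly 1 - base_rate

# No-skill reference for PR AUC: an uninformative scorer sits at precision == prevalence,
# so a useful model should clearly beat average precision of about the base rate.
print("No-skill PR AUC reference:", y_val.mean())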
When to use what
- Class weights: First try; simple, robust, minimal leakage risk.
- Under-sampling: Fast when majority is huge; risk losing information.
- Over-sampling: Keeps all data; risk of overfitting duplicates.
- SMOTE/variants: Synthetic positives; powerful but use only within each CV fold.
- Algorithm-specific: Tree boosting often responds well to scale_pos_weight (see the sketch below).
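For the algorithm-specific route, a hedged sketch using XGBoost's scale_pos_weight, assuming the xgboost package and the same illustrative splits (LightGBM exposes a similarly named parameter); random under/over-sampling plug into the same imblearn Pipeline pattern shown in Example 3 via RandomUnderSampler or RandomOverSampler.
import numpy as np
from xgboost import XGBClassifier  # assumes xgboost is installed
from sklearn.metrics import average_precision_score

# Common heuristic: weight positives by the negative/positive ratio (y_train assumed 0/1 ints).
neg, pos = np.bincount(y_train)
xgb_model = XGBClassifier(
    n_estimators=300,
    learning_rate=0.1,
    scale_pos_weight=neg / pos,  # up-weights errors on the rare positive class
    eval_metric="aucpr",         # monitor PR AUC during training
    random_state=42,
)
xgb_model.fit(X_train, y_train)
print("PR AUC:", average_precision_score(y_val, xgb_model.predict_proba(X_val)[:, 1]))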
Worked examples
Example 1 — Stratified split and metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, average_precision_score
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
# Train a quick baseline (logistic regression here; XGBClassifier works too)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
proba = model.predict_proba(X_val)[:,1]
y_pred = (proba >= 0.5).astype(int)
print(classification_report(y_val, y_pred, digits=3))
print("PR AUC (Average Precision):", average_precision_score(y_val, proba))Why: Stratification preserves ratios. PR AUC reflects performance on the rare class better than plain accuracy.
Example 2 — Class weights
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight
import numpy as np
classes = np.unique(y_train)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
class_weight = {cls: w for cls, w in zip(classes, weights)}
clf = LogisticRegression(max_iter=200, class_weight=class_weight)
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_val)[:,1]
Why: The loss penalizes misclassifying minority cases more, shifting the model to pay attention to them.
Example 3 — SMOTE inside cross-validation
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import make_scorer, average_precision_score
from sklearn.ensemble import RandomForestClassifier
smote = SMOTE(k_neighbors=5, random_state=42)
rf = RandomForestClassifier(n_estimators=200, random_state=42)
pipeline = Pipeline(steps=[('smote', smote), ('rf', rf)])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# PR AUC scorer; on scikit-learn < 1.4 use make_scorer(average_precision_score, needs_proba=True)
pr_auc_scorer = make_scorer(average_precision_score, response_method='predict_proba')
scores = cross_val_score(pipeline, X, y, scoring=pr_auc_scorer, cv=cv)
print("Mean PR AUC:", scores.mean())Why: Applying SMOTE within each fold avoids training on synthetic points derived from validation data (leakage).
Example 4 — Threshold tuning to meet alert budget
import numpy as np
from sklearn.metrics import precision_recall_curve
proba = model.predict_proba(X_val)[:,1]
prec, rec, thr = precision_recall_curve(y_val, proba)
# Suppose you can handle precision ≥ 0.9
ok_idx = np.where(prec[:-1] >= 0.90)[0] # ignore last point as it has no threshold
best = ok_idx[np.argmax(rec[ok_idx])] # highest recall under the precision constraint
chosen_thr = thr[best]
print("Chosen threshold:", chosen_thr, "Precision:", prec[best], "Recall:", rec[best])Why: Pick a threshold that satisfies business constraints instead of relying on 0.5.
Exercises
Do these hands-on tasks, then check your work against the checklist and self-check questions below.
Exercise 1 — Weighted loss and thresholding
Goal: Train a weighted classifier, report PR AUC and F1, and choose a threshold to achieve recall ≥ 0.75 if possible.
- Split with stratification.
- Compute balanced class weights and train LogisticRegression.
- Report PR AUC and F1 at threshold 0.5.
- Use precision_recall_curve to find the highest threshold that still gets recall ≥ 0.75. Report precision/recall at that threshold.
Hint
Use compute_class_weight and classification_report, then scan the precision-recall curve.
Exercise 2 — SMOTE in CV vs. data leakage
Goal: Compare vanilla model vs. SMOTE-in-pipeline using PR AUC via 5-fold Stratified CV.
- Build Model A: RandomForestClassifier with default settings.
- Build Model B: Pipeline(SMOTE → RandomForestClassifier).
- Evaluate both with cross_val_score on PR AUC.
- Explain why fitting SMOTE before the split would inflate scores.
Hint
Use imblearn.pipeline.Pipeline and a probability-based PR AUC scorer (e.g., the built-in 'average_precision' scoring string, or make_scorer(average_precision_score) configured to use predict_proba).
- [Checklist] You used stratified splits.
- [Checklist] You reported PR AUC.
- [Checklist] You avoided leakage by embedding SMOTE in CV.
- [Checklist] You tuned threshold to a business constraint, not to 0.5.
Common mistakes and self-check
- Mistake: Using accuracy or ROC AUC alone. Self-check: Do you report PR AUC or precision/recall at a target?
- Mistake: Oversampling the entire dataset before CV (leakage). Self-check: Is resampling inside each fold via a Pipeline?
- Mistake: Ignoring decision thresholds. Self-check: Did you compute a threshold that satisfies cost/alert constraints?
- Mistake: Comparing models at different class priors. Self-check: Are splits stratified and random_state fixed?
- Mistake: Uncalibrated probabilities in production. Self-check: Did you evaluate calibration (e.g., a reliability curve) and consider Platt scaling or isotonic regression? A minimal calibration sketch follows this list.
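For that last self-check, a minimal calibration sketch, assuming the illustrative X_train/X_val split from the worked examples; CalibratedClassifierCV expects an unfitted estimator and refits it on internal folds.
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss

# 'sigmoid' is Platt scaling; 'isotonic' is more flexible but needs plenty of positives.
base = RandomForestClassifier(n_estimators=200, class_weight='balanced', random_state=42)
calibrated = CalibratedClassifierCV(base, method='sigmoid', cv=5)
calibrated.fit(X_train, y_train)

proba = calibrated.predict_proba(X_val)[:, 1]
print("Brier score:", brier_score_loss(y_val, proba))

# Reliability curve: mean predicted probability vs. observed positive fraction per bin.
frac_pos, mean_pred = calibration_curve(y_val, proba, n_bins=10)
for p, f in zip(mean_pred, frac_pos):
    print(f"predicted {p:.2f} -> observed {f:.2f}")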
Practical projects
- Fraud alerts: Maximize recall at precision ≥ 0.9 with class weights; add SMOTE if recall plateaus.
- Churn prediction: Optimize PR AUC; deliver top-N customers with calibrated probabilities.
- Manufacturing defects: Under-sample majority to speed training; threshold for a fixed daily review budget.
Mini tasks you can add
- Estimate base rate and compute naive PR AUC baseline.
- Try focal loss in a deep model and compare to class weights (a PyTorch-style sketch follows this list).
- Stress-test thresholds when class prior shifts ±50%.
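For the focal-loss mini task, here is a minimal PyTorch-style sketch of binary focal loss, assuming raw logits and 0/1 float targets; alpha and gamma are the usual knobs, and exact formulations vary slightly across papers and libraries.
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss on raw logits; targets are 0./1. floats."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')  # = -log(p_t)
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balancing factor
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()       # (1 - p_t)^gamma down-weights easy examples

# Quick smoke test on random data (illustrative only)
logits = torch.randn(8)
targets = torch.randint(0, 2, (8,)).float()
print(binary_focal_loss(logits, targets))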
Learning path
- Master evaluation: confusion matrix, precision/recall, PR AUC, cost-sensitive metrics.
- Apply class weights and algorithm-specific imbalance parameters.
- Use resampling properly within CV (SMOTE/under/over sampling).
- Tune thresholds to constraints; calibrate probabilities.
- Monitor in production: drift in class prior and threshold performance.
Next steps
- Complete the exercises above on a dataset you know.
- Take the quick test below to check your understanding.
- Apply these techniques to your active project and record metric changes (PR AUC, recall at precision constraint).
Mini challenge
You have an alert budget of 100 per day and a validation set with 20,000 samples, base rate 1%. Your model outputs probabilities. Design a simple procedure to pick a threshold that yields approximately 100 positives while keeping precision ≥ 0.85. Write the steps, then implement them with a precision-recall curve and a sweep of thresholds.