Why this matters
In production, models face changing users, markets, devices, and data pipelines. Data drift (and feature drift) is the shift in input distributions between training/reference data and current production data. If it goes unnoticed, it leads to silent performance decay, fairness issues, and risky business decisions.
- Real task: Detect when a key feature’s distribution shifts and trigger an alert before model KPIs degrade.
- Real task: Quantify which features drifted the most and prioritize retraining or pipeline fixes.
- Real task: Set thresholds and windows so alerts are actionable, not noisy.
Concept explained simply
Data drift: input data distribution changes over time. Feature drift: drift measured on individual features. Concept drift: the relationship between inputs and outputs changes (even if inputs look the same). Here we focus on data/feature drift detectable without labels.
Mental model
Think of your training data as a "map" and production as the "terrain." Drift tells you how far the current terrain moved from the map. Small deviations are normal; persistent or large ones mean the map needs updating.
Metrics and methods that work
Univariate metrics (per-feature)
- Numerical: KS statistic; Wasserstein distance; Jensen–Shannon distance; PSI (Population Stability Index).
- Categorical: Jensen–Shannon distance; Chi-square test; simple share changes of top categories and of the overall category mix.
- Text or embeddings: compare embedding distributions via KS/JSD or distances on reduced dimensions.
Tip: Add a tiny epsilon (e.g., 1e-6) to the bin probabilities so that ratio-based metrics like KL/PSI never hit log(0) or divide by zero.
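As a concrete starting point, here is a minimal sketch of two of these per-feature metrics, PSI on quantile bins derived from the baseline and Jensen–Shannon distance for categorical counts, using NumPy/SciPy. The bin count, epsilon value, and synthetic data are illustrative assumptions, not a prescribed setup.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import ks_2samp, wasserstein_distance

EPS = 1e-6  # keeps PSI's log-ratio finite when a bin is empty

def psi(train_vals, prod_vals, n_bins=10):
    """Population Stability Index on quantile bins derived from the baseline."""
    edges = np.quantile(train_vals, np.linspace(0, 1, n_bins + 1))
    p_train, _ = np.histogram(train_vals, bins=edges)
    # clip production values so out-of-range points land in the edge bins
    p_prod, _ = np.histogram(np.clip(prod_vals, edges[0], edges[-1]), bins=edges)
    p_train = p_train / p_train.sum() + EPS
    p_prod = p_prod / p_prod.sum() + EPS
    return float(np.sum((p_prod - p_train) * np.log(p_prod / p_train)))

def js_distance(train_counts, prod_counts):
    """Jensen-Shannon distance between two categorical count vectors."""
    p = np.asarray(train_counts, dtype=float) + EPS
    q = np.asarray(prod_counts, dtype=float) + EPS
    return float(jensenshannon(p / p.sum(), q / q.sum()))

# Illustrative usage on synthetic data (production sample is shifted on purpose)
rng = np.random.default_rng(0)
train, prod = rng.normal(0, 1, 10_000), rng.normal(0.3, 1.1, 10_000)
print(psi(train, prod))
print(ks_2samp(train, prod).statistic, wasserstein_distance(train, prod))
print(js_distance([900, 80, 20], [800, 150, 50]))
```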
Multivariate drift
- Domain classifier: train a classifier to distinguish reference vs. production rows. AUC ≈ 0.5 means the two sets are indistinguishable (no drift); higher AUC indicates stronger drift.
- MMD (Maximum Mean Discrepancy) or energy distance on a selected feature set or embeddings.
- PCA/UMAP shift: track centroid distance or overlap of low-dimensional projections.
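To make the domain-classifier idea concrete, here is a small sketch using scikit-learn. The model choice, cross-validation setup, and synthetic feature matrices are assumptions; any classifier that outputs probabilities works.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

def domain_classifier_auc(reference_X, production_X):
    """Train a classifier to tell reference rows from production rows.
    AUC near 0.5 -> distributions look alike; AUC well above 0.5 -> drift."""
    X = np.vstack([reference_X, production_X])
    y = np.concatenate([np.zeros(len(reference_X)), np.ones(len(production_X))])
    # cross-validated probabilities avoid rewarding simple memorization
    proba = cross_val_predict(GradientBoostingClassifier(), X, y,
                              cv=5, method="predict_proba")[:, 1]
    return roc_auc_score(y, proba)

# Illustrative usage with synthetic feature matrices (mean shift on every feature)
rng = np.random.default_rng(0)
ref = rng.normal(0, 1, size=(2000, 5))
prod = rng.normal(0.2, 1, size=(2000, 5))
print(domain_classifier_auc(ref, prod))
```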
Aggregating drift
- Share of features over threshold (e.g., % features with PSI ≥ 0.2).
- Weighted average by feature importance (e.g., SHAP or permutation importance).
- Top-N drifted features with severity scoring for quick triage.
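A possible aggregation helper, assuming you already have per-feature PSI values and feature importances as plain dicts; the feature names, scores, and the 0.2 threshold below are made up for illustration.

```python
import numpy as np

def aggregate_drift(psi_by_feature, importance_by_feature, threshold=0.2):
    """Combine per-feature PSI values into model-level drift signals."""
    feats = list(psi_by_feature)
    psis = np.array([psi_by_feature[f] for f in feats])
    w = np.array([importance_by_feature.get(f, 0.0) for f in feats])
    share_over = float((psis >= threshold).mean())                       # fraction of drifted features
    weighted = float(np.average(psis, weights=w)) if w.sum() else float(psis.mean())
    top = sorted(psi_by_feature.items(), key=lambda kv: kv[1], reverse=True)[:3]
    return {"share_over_threshold": share_over,
            "importance_weighted_psi": weighted,
            "top_drifted": top}

print(aggregate_drift({"income": 0.12, "age": 0.03, "device_type": 0.31},
                      {"income": 0.5, "age": 0.2, "device_type": 0.3}))
```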
Windows and baselines
- Reference baseline: training set or a clean validation slice; or a rolling baseline (e.g., last 30 days).
- Production window: daily, hourly, or per batch; choose a window with enough samples to be statistically meaningful.
- Segmented monitoring: track slices (region, device, account type) to localize issues.
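As a sketch of how daily, per-segment windows with a minimum-sample guard might be built, assuming a flat production log with event_time, device_type, and income columns (all names, the synthetic data, and the 200-row floor are illustrative):

```python
import numpy as np
import pandas as pd

# Assumed production log; in practice this comes from your logging pipeline
rng = np.random.default_rng(0)
logs = pd.DataFrame({
    "event_time": pd.date_range("2024-01-01", periods=50_000, freq="min"),
    "device_type": rng.choice(["ios", "android", "web"], 50_000),
    "income": rng.normal(50_000, 20_000, 50_000),
})

# Daily windows per segment, with a minimum-N guard before any drift metric is computed
for (day, segment), values in logs.groupby(
        [pd.Grouper(key="event_time", freq="D"), "device_type"])["income"]:
    if len(values) < 200:          # too few rows to be statistically meaningful; skip
        continue
    # psi(baseline_income, values.to_numpy()) would run here (see the PSI sketch above)
    print(day.date(), segment, len(values))
```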
Set it up step-by-step
- Choose baseline: start with training or validation; add a rolling baseline for seasonality.
- Pick per-feature metrics: KS/Wasserstein/JSD/PSI for numeric; JSD/Chi-square for categorical.
- Define windows: e.g., daily production versus full training; require minimum N per feature/bin.
- Aggregate: % features over threshold and weighted average by importance.
- Thresholds & alerts: e.g., PSI ≥ 0.2 major; KS ≥ 0.2 notable. Use warning and critical tiers.
- Backtest: replay past months; check alert precision/recall against known incidents.
- Document response playbooks: when to retrain, recalibrate, or hotfix pipelines.
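A compact way to encode the thresholds and tiers from the steps above is a small policy object; every metric name and number here is an illustrative assumption to tune during backtesting, not a standard.

```python
# Illustrative alert policy; adapt metrics, tiers, and numbers per model.
DRIFT_POLICY = {
    "min_samples_per_feature": 500,          # suppress checks below this N
    "per_feature": {
        "psi": {"warning": 0.1, "critical": 0.2},
        "ks":  {"warning": 0.1, "critical": 0.2},
        "jsd": {"warning": 0.1, "critical": 0.2},
    },
    "model_level": {
        "share_features_critical": 0.2,      # >= 20% of features critical -> page the owner
        "importance_weighted_psi": 0.15,
    },
}

def tier(metric_name, value, policy=DRIFT_POLICY):
    """Map a per-feature metric value to 'ok' / 'warning' / 'critical'."""
    t = policy["per_feature"][metric_name]
    if value >= t["critical"]:
        return "critical"
    if value >= t["warning"]:
        return "warning"
    return "ok"

print(tier("psi", 0.122))   # -> 'warning'
```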
Quick checklist before you alert
- Enough samples in the window (per feature and per category).
- No sudden missing-value spikes.
- Binning or encoding consistent with baseline.
- Seasonality accounted for (compare Friday-to-Friday, not Friday-to-Monday).
- Segment-level view checked (no hidden drift in a small but critical slice).
Worked examples
Example 1 — Credit risk: PSI on a numeric feature
Feature: income. Bins (train vs. prod proportions): [0–20k]: 0.20 vs 0.35; [20–40k]: 0.40 vs 0.30; [40–60k]: 0.30 vs 0.25; [60k+]: 0.10 vs 0.10. PSI ≈ 0.122 (moderate drift). Action: watch closely; if other key features also drift, trigger retrain evaluation.
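You can reproduce the 0.122 figure directly from the bin proportions given above:

```python
import numpy as np

train = np.array([0.20, 0.40, 0.30, 0.10])   # income bin shares in training
prod = np.array([0.35, 0.30, 0.25, 0.10])    # income bin shares in production
psi = np.sum((prod - train) * np.log(prod / train))
print(round(float(psi), 3))   # 0.122 -> moderate drift
```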
Example 2 — E-commerce: Domain classifier AUC
Train a classifier to distinguish reference vs. yesterday’s data on 20 features. AUC = 0.84 indicates notable multivariate drift. Top drifted features by importance-weighted JSD: device_type, country. Action: segment dashboards by device_type to see if a new device release changed traffic; consider retraining if business KPIs drop.
Example 3 — Fraud: KS and rare categories
KS on transaction_amount = 0.25 (critical). merchant_category introduced a new rare value whose share rose from 0% to 1.5%. Action: ensure encoding handles unseen categories; check the pipeline; if it is a valid trend, update the vocabulary and assess model calibration.
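A sketch of both checks on synthetic data; the distribution parameters, category names, and counts are made up for illustration and will not reproduce the exact numbers above.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
ref_amount = rng.lognormal(3.0, 1.0, 50_000)    # baseline transaction_amount
prod_amount = rng.lognormal(3.4, 1.2, 20_000)   # shifted production sample
print(ks_2samp(ref_amount, prod_amount).statistic)   # KS statistic for transaction_amount

# Flag merchant_category values seen in production but absent from the baseline vocabulary
ref_vocab = {"grocery", "fuel", "travel"}
prod_counts = {"grocery": 9000, "fuel": 6000, "travel": 4700, "crypto_atm": 300}
total = sum(prod_counts.values())
unseen = {c: n / total for c, n in prod_counts.items() if c not in ref_vocab}
print(unseen)   # {'crypto_atm': 0.015} -> a new rare value at 1.5%
```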
Exercises (do these now)
Exercise 1 — Compute PSI
You have training vs. production counts for 4 bins of a numeric feature (each total 1000 rows):
- Bin1: Train 200, Prod 350
- Bin2: Train 400, Prod 300
- Bin3: Train 300, Prod 250
- Bin4: Train 100, Prod 100
Compute PSI and interpret the result. (Use PSI = Σ (p2 - p1) * ln(p2/p1), with proportions p1=train, p2=prod.)
Exercise 2 — Design a drift alert policy
Data: 10 numeric features, 5 categorical; daily window ~20k rows; importance known. Define:
- Per-feature metrics and thresholds.
- Aggregation rule for a model-level alert.
- Minimum sample rules.
- An action plan for warning vs. critical.
Self-check checklist
- Did you ensure proportions sum to 1 per distribution when computing PSI?
- Did you avoid zero-probability bins (epsilon or merged bins)?
- Do thresholds differ for numeric vs. categorical where needed?
- Do you have both per-feature and aggregated alerts?
- Did you specify minimum N to avoid noisy alerts?
Common mistakes and how to self-check
- Confusing data drift with concept drift. Self-check: Are you measuring inputs only? If yes, it’s data/feature drift.
- Only univariate checks. Self-check: Add a domain classifier AUC or MMD across key features.
- Ignoring sample size. Self-check: Enforce per-feature minimum N and suppress alerts if not met.
- Static bins causing artifacts. Self-check: Use quantile bins fit on the baseline and keep them fixed over time (see the sketch after this list).
- No segmentation. Self-check: Track at least one business-critical slice (e.g., region, device).
- One-size thresholds. Self-check: Tiered thresholds (warning/critical) and importance weighting.
- Alert without playbook. Self-check: For each alert tier, define owner, action, and timeout.
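For the static-bin pitfall above, a minimal sketch of fitting quantile edges once on the baseline, persisting them, and reusing them unchanged for every production window; the file name, feature values, and bin count are illustrative assumptions.

```python
import json
import numpy as np

def fit_bins(baseline_vals, n_bins=10):
    """Fit quantile bin edges on the baseline once; persist and reuse them unchanged."""
    return np.quantile(baseline_vals, np.linspace(0, 1, n_bins + 1)).tolist()

def bin_shares(vals, edges):
    """Share of rows per fixed bin; out-of-range values are clipped into the edge bins."""
    counts, _ = np.histogram(np.clip(vals, edges[0], edges[-1]), bins=edges)
    return counts / counts.sum()

rng = np.random.default_rng(0)
edges = fit_bins(rng.normal(50_000, 20_000, 100_000))
with open("income_bins.json", "w") as f:    # store edges next to the model artifact
    json.dump(edges, f)
print(bin_shares(rng.normal(55_000, 22_000, 10_000), edges))
```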
Practical projects
- Build an offline drift dashboard: load a baseline CSV and a production CSV; compute per-feature KS/JSD/PSI; output the top 5 drifted features with severities and a simple HTML report.
- Backtest alert thresholds: replay 90 days of data windows; record alerts; compare against known incidents to tune thresholds.
- Segmented monitoring: choose one slice (e.g., device_type); compute slice-level drift and compare to global; show where drift originates.
Who this is for
- MLOps engineers and data engineers running models in production.
- Data scientists responsible for model reliability.
Prerequisites
- Basic statistics: distributions, percentiles, hypothesis testing.
- Feature engineering basics (binning, encoding).
- Familiarity with your production data pipeline and logging.
Learning path
- Understand data vs. feature vs. concept drift.
- Implement univariate drift metrics per feature type.
- Add multivariate drift (domain classifier, MMD).
- Design windows, thresholds, and aggregation.
- Backtest and operationalize alerts with a playbook.
Next steps
- Instrument logging to capture feature distributions per window and per key segment.
- Automate drift reports and tiered alerts.
- Connect alerts to retraining or calibration workflows.
Mini challenge
Your model’s AUC dropped slightly from 0.86 to 0.83 this week. Univariate drift shows only two minor offenders (PSI ≈ 0.12 each). Domain classifier AUC is 0.82. Propose a plan in 5 steps to investigate and mitigate. Include at least one segmented analysis and one operational action.