Why this matters
Models face the real world, not your training set. When incoming data changes, predictions can degrade quietly. Detecting data drift early protects KPIs, reduces incident time, and guides retraining decisions.
- Real task: Set up monitors that alert when key features no longer look like training data.
- Real task: Decide if a performance drop is due to input drift, label drift, or concept drift.
- Real task: Propose mitigation: recalibration, retraining, data pipeline fix, or threshold tweak.
Concept explained simply
Data drift means the distribution of your inputs changes over time compared to a baseline (usually training data or a recent stable period). Feature drift focuses on a single feature’s distribution change. Label drift is a change in the target proportion. Concept drift is a change in the relationship between inputs and target.
- Feature drift: P(Xi) changes (e.g., average price rises).
- Data drift: P(X) changes overall (e.g., user mix shifts to mobile-only users).
- Label drift: P(y) changes (e.g., conversion rate jumps due to a promotion).
- Concept drift: P(y|X) changes (e.g., same users behave differently due to policy changes).
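If the probabilistic notation feels abstract, the toy NumPy simulation below makes it concrete: a synthetic baseline, then feature drift (P(X) moves, P(y|X) fixed) and concept drift (P(X) fixed, P(y|X) changes). All numbers and names are illustrative, not from any real system.

```python
# Toy simulation contrasting feature drift and concept drift (illustrative numbers only).
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n = 10_000

# Baseline: X ~ N(0, 1) and P(y=1 | X) = sigmoid(2X)
x_base = rng.normal(0, 1, n)
y_base = rng.binomial(1, sigmoid(2 * x_base))

# Feature/data drift: P(X) shifts, the relationship P(y|X) is unchanged
x_shifted = rng.normal(1, 1, n)
y_feature_drift = rng.binomial(1, sigmoid(2 * x_shifted))

# Concept drift: P(X) unchanged, but the relationship P(y|X) changes sign
y_concept_drift = rng.binomial(1, sigmoid(-2 * x_base))

print("baseline label rate: ", y_base.mean())
print("after feature drift: ", y_feature_drift.mean())  # label rate moves as a side effect
print("after concept drift: ", y_concept_drift.mean())  # label rate can even stay the same
```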
Mental model
Think of your model as calibrated to a particular climate. Data drift is the weather shifting: sometimes gradual (seasonal), sometimes sudden (marketing campaign). Your monitors are thermometers and barometers. If they spike, you:
- Verify the signal (is it a data issue or real world change?).
- Measure impact (is model performance affected?).
- Act (fix data, retrain, recalibrate, or update thresholds).
Common drift metrics
- PSI (Population Stability Index): bucket-based difference between baseline and current proportions. Rough guide: < 0.1 minimal, 0.1–0.25 moderate, > 0.25 significant.
- KS test (continuous): compares cumulative distributions of baseline vs current.
- Chi-square (categorical): tests if category frequencies changed.
- Jensen–Shannon / KL divergence: information-theoretic distance between distributions.
- Earth Mover’s Distance (Wasserstein): "cost" to morph one distribution into another.
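PSI gets a worked example and a full monitor later in this lesson; the sketch below shows how the other metrics can be computed on two numeric samples with SciPy. The samples and parameters are synthetic stand-ins, not real data.

```python
# Hedged sketch: KS, Wasserstein, and Jensen-Shannon distance on two numeric samples.
import numpy as np
from scipy import stats
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)
baseline = rng.normal(50_000, 15_000, 5_000)   # e.g. a numeric feature at training time
current = rng.normal(45_000, 18_000, 5_000)    # a shifted serving-window sample

print("KS test:    ", stats.ks_2samp(baseline, current))
print("Wasserstein:", stats.wasserstein_distance(baseline, current))

# Jensen-Shannon needs discrete distributions: bin both samples on baseline quantiles
edges = np.quantile(baseline, np.linspace(0, 1, 11))
p = np.histogram(np.clip(baseline, edges[0], edges[-1]), edges)[0] / len(baseline)
q = np.histogram(np.clip(current, edges[0], edges[-1]), edges)[0] / len(current)
print("JS distance:", jensenshannon(p, q))
```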
Worked examples
Example 1: Feature-level PSI for a numeric feature
Feature: user_income. Baseline (train) vs current (serving week) proportions across 3 bins:
- Bin A (< 40k): baseline 0.20, current 0.32
- Bin B (40–80k): baseline 0.60, current 0.50
- Bin C (> 80k): baseline 0.20, current 0.18
PSI per bin = (curr - base) * ln(curr / base): Bin A ≈ 0.056, Bin B ≈ 0.018, Bin C ≈ 0.002. Summing bins gives a total PSI of about 0.08, just under the 0.1 warning threshold; the 12-point swing in Bin A still warrants investigation and a performance check. The snippet below reproduces the arithmetic.
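A quick NumPy version, using only the proportions from the example:

```python
# Reproduce Example 1: per-bin PSI contributions and the total.
import numpy as np

baseline = np.array([0.20, 0.60, 0.20])   # bins A, B, C
current = np.array([0.32, 0.50, 0.18])

contributions = (current - baseline) * np.log(current / baseline)
print(dict(zip("ABC", contributions.round(4))))     # {'A': 0.0564, 'B': 0.0182, 'C': 0.0021}
print("total PSI:", round(contributions.sum(), 4))  # 0.0767
```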
Example 2: Categorical feature drift with Chi-square
Feature: traffic_source. Baseline: direct 50%, email 30%, search 20%. Current: direct 35%, email 45%, search 20%.
- Large shift from direct → email; with a reasonable sample size, a chi-square test would flag this as a significant change (sketched below).
- Action: check campaign launches, confirm the model saw similar email-heavy users in training, and verify model performance by source.
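A minimal sketch of that check. Only the proportions come from the example; the 2,000-session current window is an assumed sample size.

```python
# Test current category counts against the counts implied by the baseline mix.
import numpy as np
from scipy.stats import chisquare

baseline_props = np.array([0.50, 0.30, 0.20])   # direct, email, search
observed = np.array([700, 900, 400])            # assumed current counts (n = 2,000)
expected = baseline_props * observed.sum()

stat, p = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2={stat:.1f}, p={p:.3g}")            # large statistic, tiny p-value -> drift flagged
```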
Example 3: Label drift vs concept drift
Baseline conversion rate: 5%. Current: 8% after a site-wide promotion.
- Label drift: P(y) changed due to external promo.
- If AUC (ranking quality) holds up but predicted probabilities now run below observed rates, it is mainly label drift: recalibration may help in the short term, and retraining can be scheduled post-promo. A falling AUC would instead point to concept drift.
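A hedged sketch of how to separate the two signals. `y_true` and `y_prob` are placeholder names for the current window's outcomes and model scores; the simulated data just mimics a promotion that lifts the base rate.

```python
# Does the model still rank well (AUC) while its probabilities under-predict
# the new base rate (calibration)?
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import roc_auc_score

def drift_diagnosis(y_true, y_prob):
    auc = roc_auc_score(y_true, y_prob)
    prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)
    gap = np.max(np.abs(prob_true - prob_pred))
    print(f"AUC={auc:.3f}  max calibration gap={gap:.3f}  "
          f"observed rate={y_true.mean():.3f}  mean score={y_prob.mean():.3f}")

# Simulated promo: scores still rank users correctly, but the true rate is lifted.
rng = np.random.default_rng(1)
y_prob = rng.uniform(0, 1, 5_000)
y_true = rng.binomial(1, np.clip(1.6 * y_prob, 0, 1))
drift_diagnosis(y_true, y_prob)   # expect a healthy AUC with a visible calibration gap
```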
Example 4: Text domain shift
Customer messages shift from mostly English to a growing share of Spanish. Token distributions and a language-ID monitor both show drift.
- Action: enable a multilingual model or add translation, retrain on mixed-language data, and monitor language share over time (a minimal share monitor is sketched below).
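A minimal sketch of a language-share monitor. `detect_language` is a placeholder for whatever language-identification step you already run (for example, a wrapper around a language-ID library); the tolerance and any baseline shares are illustrative.

```python
# Track the share of each language per window and flag large moves vs the baseline.
from collections import Counter

def language_share(messages, detect_language):
    counts = Counter(detect_language(m) for m in messages)
    total = sum(counts.values()) or 1
    return {lang: n / total for lang, n in counts.items()}

def share_alerts(current_share, baseline_share, tol=0.10):
    """Flag languages whose share moved more than `tol` from the baseline."""
    langs = set(current_share) | set(baseline_share)
    return {lang: current_share.get(lang, 0.0) for lang in langs
            if abs(current_share.get(lang, 0.0) - baseline_share.get(lang, 0.0)) > tol}
```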
Detecting drift in practice
Choose baseline and windows
- Baseline: training data or a recent stable production window (e.g., first 2 weeks after launch).
- Current window: sliding time windows (e.g., daily or hourly) to compute metrics.
- Sample size: avoid tiny windows (noisy). As a rule of thumb, aim for 300+ samples per comparison when possible.
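One possible windowing helper, assuming serving logs land in a pandas DataFrame with a datetime column named `ts` (the column name and the 300-row minimum are placeholders to adapt):

```python
# Slice serving logs into daily windows and skip windows too small to compare reliably.
import pandas as pd

MIN_SAMPLES = 300

def daily_windows(df: pd.DataFrame):
    for day, window in df.groupby(pd.Grouper(key="ts", freq="D")):
        if len(window) >= MIN_SAMPLES:
            yield day, window
```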
What to monitor
- High-importance features (top SHAP/feature importance).
- High-risk features (known to change seasonally or with campaigns).
- Label rate (if available) and key business outcome metrics.
Thresholds and runbooks
- Example thresholds: PSI ≥ 0.1 (warn), ≥ 0.25 (alert). Tailor by criticality.
- When an alert fires: validate the data pipeline, segment by cohort (source, geography), check model performance metrics (AUC, calibration), and decide on an action.
- Actions: data fix, feature engineering update, recalibration, retrain, fallback rules, or temporary threshold changes.
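A tiny helper that maps a PSI value to the example severity levels; the thresholds are the illustrative defaults above and should be tuned per model criticality.

```python
# Map a PSI value to a severity level used by the runbook.
def psi_severity(psi_value: float, warn: float = 0.10, alert: float = 0.25) -> str:
    if psi_value >= alert:
        return "alert"   # open an incident and start the runbook
    if psi_value >= warn:
        return "warn"    # investigate at the next review
    return "ok"

print(psi_severity(0.08), psi_severity(0.18), psi_severity(0.31))  # ok warn alert
```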
Step-by-step: setting up a PSI monitor
- Define stable bins on baseline data (e.g., quantile bins for numeric features).
- Each window, compute bin proportions for current data.
- Compute PSI per feature; aggregate for dashboard.
- Trigger alerts when PSI crosses thresholds for N consecutive windows to reduce noise.
- Log drift context: time, segments, sample sizes, recent deployments, known events.
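A minimal sketch tying these steps together, assuming pandas DataFrames `baseline_df` and per-window frames with the numeric features you monitor (names and defaults are assumptions, not a specific library's API):

```python
# Frozen baseline bins, per-window proportions, PSI per feature, and an alert
# rule that requires N consecutive breaches.
import numpy as np
import pandas as pd

def fit_bins(baseline_df: pd.DataFrame, features, n_bins: int = 10):
    """Freeze quantile bin edges on the baseline so PSI stays comparable over time."""
    return {f: np.quantile(baseline_df[f].dropna(), np.linspace(0, 1, n_bins + 1))
            for f in features}  # heavily tied features may need deduplicated edges

def bin_proportions(series: pd.Series, edges, eps: float = 1e-6):
    clipped = series.dropna().clip(edges[0], edges[-1])   # keep out-of-range values in end bins
    counts = np.histogram(clipped, edges)[0]
    return np.clip(counts / max(len(clipped), 1), eps, None)  # eps avoids log(0)

def psi_per_feature(baseline_df, window_df, bins):
    result = {}
    for feature, edges in bins.items():
        b = bin_proportions(baseline_df[feature], edges)
        c = bin_proportions(window_df[feature], edges)
        result[feature] = float(np.sum((c - b) * np.log(c / b)))
    return result

def should_alert(psi_history, threshold: float = 0.25, consecutive: int = 2):
    """Alert only after `consecutive` windows above threshold, to cut noise."""
    recent = psi_history[-consecutive:]
    return len(recent) == consecutive and all(v >= threshold for v in recent)
```

In a scheduler, `fit_bins` runs once on the baseline; `psi_per_feature` runs per window and feeds a per-feature history that `should_alert` consumes.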
Step-by-step: categorical drift with Chi-square
- Collect category counts for baseline and current window.
- Compute expected counts from baseline proportions.
- Run the chi-square test; if the p-value is below your significance level, flag drift.
- Investigate category-level performance and recent marketing changes.
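The same steps as a reusable helper. Category names, the alpha level, and the example counts are assumptions; note that brand-new categories absent from the baseline are ignored by this test and deserve their own alert.

```python
# Expected counts come from the baseline category mix; the current window's
# counts are tested against them.
from scipy.stats import chisquare

def categorical_drift(current_counts: dict, baseline_props: dict, alpha: float = 0.01):
    cats = sorted(baseline_props)                         # fixed category order from baseline
    observed = [current_counts.get(c, 0) for c in cats]   # new categories are ignored here
    n = sum(observed)
    expected = [baseline_props[c] * n for c in cats]      # counts implied by the baseline mix
    # Chi-square is unreliable when expected counts are very small (rule of thumb: < 5).
    stat, p = chisquare(f_obs=observed, f_exp=expected)
    return {"chi2": stat, "p_value": p, "drift": p < alpha}

# Example with the traffic_source numbers from Example 2 (assumed n = 2,000)
print(categorical_drift({"direct": 700, "email": 900, "search": 400},
                        {"direct": 0.50, "email": 0.30, "search": 0.20}))
```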
Exercises
Practice the exact skills you need in production.
Exercise 1: Compute PSI and interpret
Feature: session_length (minutes). Use 3 bins with baseline vs current proportions:
- Bin 1 (< 5): baseline 0.25, current 0.40
- Bin 2 (5–15): baseline 0.55, current 0.50
- Bin 3 (> 15): baseline 0.20, current 0.10
Tasks:
- Compute PSI per bin and total PSI.
- Interpret the result using 0.1/0.25 thresholds.
- Propose the first two checks you would run after an alert.
Hints
- PSI per bin = (curr - base) * ln(curr / base); sum the per-bin values for the total.
- Investigate segments that changed most.
Exercise 2: Identify drift type and choose monitors
Scenario: Training positive rate was 18%. Last week it was 27%. Feature drift monitors show no major shifts in top-5 features; AUC dropped slightly, calibration off at high scores.
- Is this data drift, label drift, concept drift, or a mix?
- What additional monitor or analysis would you add next?
- What near-term mitigation would you try?
Hints
- Compare P(y), P(X), and P(y|X) changes.
- Look at reliability diagrams and segment-wise performance.
Checklist to self-verify
- Did you compute PSI correctly (signs and logs)?
- Did you distinguish label drift from feature drift?
- Did you propose both monitoring and mitigation steps?
Common mistakes and self-check
- Only monitoring overall PSI: Large shifts can hide in subsegments. Self-check: view drift by key cohorts (channel, country).
- Changing bins each window: Makes PSI inconsistent. Self-check: fix bins from baseline.
- Alerting on tiny samples: Leads to noise. Self-check: enforce minimum window size or consecutive-window rule.
- Confusing label drift with concept drift: Self-check: check model discrimination (AUC) vs calibration and input drift.
- Ignoring data quality issues: Self-check: validate missing rates, schema, and unexpected zeros before blaming the model.
Practical projects
- Build a drift dashboard: PSI for top 10 features, Chi-square for categoricals, label rate trend, AUC and calibration plots.
- Runbook automation: When PSI >= 0.25 for 2 days, create an incident with suggested checks and owners.
- Shadow retraining: Keep a weekly retrain pipeline and compare offline metrics before promoting.
Who this is for
- Machine Learning Engineers and Data Scientists deploying models to production.
- ML Ops practitioners responsible for reliability and model health.
Prerequisites
- Basic probability and statistics (distributions, p-values).
- Familiarity with common ML metrics (AUC, calibration, log loss).
- Comfort with aggregations and time-windowed analysis.
Learning path
- Learn data and feature drift (this lesson).
- Monitor label and concept drift with performance metrics.
- Set thresholds, alerts, and runbooks in your monitoring stack.
- Automate retraining and calibration workflows.
Next steps
- Add drift monitors to one real model you own; start with 3–5 critical features.
- Document a one-page runbook with thresholds and actions.
- Schedule a monthly review of drift trends and retraining cadence.
Mini challenge
Your model’s overall PSI is low, but conversion in the UK drops sharply. Design a quick experiment to confirm whether a specific feature (e.g., price) drifted only in the UK, and outline the next action if confirmed.
Ready to test yourself?
Take the quick test below.