
Concept Drift And Performance Monitoring

Learn Concept Drift And Performance Monitoring for free with explanations, exercises, and a quick test (for MLOps engineers).

Published: January 4, 2026 | Updated: January 4, 2026

Who this is for

  • MLOps engineers who need reliable alerting on ML health.
  • Data scientists deploying models to production.
  • Engineers maintaining data pipelines that feed ML models.

Prerequisites

  • Basic understanding of supervised ML (classification/regression).
  • Comfort with common metrics (precision, recall, MAE/MAPE).
  • Know how your model is served and where logs/metrics are stored.

Learning path

  1. Understand drift types and why performance can degrade.
  2. Choose monitoring windows, baselines, and thresholds.
  3. Implement input/data drift checks and performance checks.
  4. Alert, investigate, and roll back or retrain.

Why this matters

In production, your model faces changing data and behavior. Without monitoring, silent failures cost money and trust.

  • Fraud models must catch new fraud patterns fast.
  • Recommendation models need to adapt to seasonal shifts.
  • Forecasting models must handle promotions and anomalies.

Real tasks you will do on the job
  • Set up input drift alerts to catch schema and distribution shifts.
  • Track delayed performance (e.g., weekly labels) and proxy metrics until labels arrive.
  • Build runbooks to investigate alerts and decide rollback vs. retrain.

Concept explained simply

Concept drift is when the relationship between features and the target changes over time. Even if inputs look similar, the meaning of patterns can shift. Data/input drift is when feature distributions change; this often precedes or accompanies concept drift. Performance drift is the observed drop in model metrics in production.

  • Covariate shift: P(X) changes, P(y|X) same (inputs differ).
  • Prior/label shift: P(y) changes (class balance shifts).
  • Concept change: P(y|X) changes (the mapping changes).
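
To see the difference in code, here is a minimal, hypothetical simulation: in the covariate-shift case only P(X) moves while the labeling rule stays fixed; in the concept-change case P(X) stays put but the rule itself changes. The distributions, the label() rule, and the coefficient w are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def label(x, w):
    # Toy labeling rule standing in for P(y|X): positive when w * x exceeds 1.
    return (w * x > 1.0).astype(int)

# Baseline: X ~ N(1, 1), mapping uses w = 1.0
x_base = rng.normal(1.0, 1.0, 10_000)
y_base = label(x_base, w=1.0)

# Covariate shift: P(X) moves, mapping unchanged
x_shifted = rng.normal(2.0, 1.0, 10_000)
y_covariate = label(x_shifted, w=1.0)

# Concept change: P(X) unchanged, mapping changes (w drops to 0.5)
y_concept = label(x_base, w=0.5)

print("baseline positive rate:       ", y_base.mean())
print("covariate shift positive rate:", y_covariate.mean())
print("concept change positive rate: ", y_concept.mean())
```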

Mental model

Imagine a river (data) feeding a water wheel (model). If the river’s flow changes (distribution), the wheel turns differently. If the wheel’s blades wear out (mapping), performance drops even if flow looks normal. Monitor both the river and the wheel’s output.

Key metrics and signals

  • Input/data drift: PSI, KS test, JS/KL divergence, mean/variance shift, missing rates, schema changes.
  • Target/performance: Precision/Recall/F1, ROC/PR AUC, MAE/MAPE/WAPE, calibration error, segment metrics.
  • Proxy signals when labels are delayed: score distribution drift, entropy, win rate at a decision threshold, and lagged business outcomes (approximations).
  • Operational: latency, error rates, timeouts, feature freshness, data volume.

Typical thresholds (starting points)
  • PSI: 0.1 moderate, 0.2 high (investigate).
  • KS test p-value < 0.05 suggests a significant distribution difference (very large samples will flag even tiny shifts).
  • Precision/Recall drop > 10–20% from baseline: investigate.
  • Missing rate increase > 5%: investigate the upstream pipeline.

Tune thresholds to your risk tolerance and label delay.
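
A minimal sketch of the two most common input-drift checks listed above, PSI and the two-sample KS test, assuming numpy and scipy are available. The bin count, epsilon, and the synthetic baseline/current samples are illustrative choices, not fixed recommendations.

```python
import numpy as np
from scipy.stats import ks_2samp

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a baseline sample and a current sample."""
    # Bin edges come from the baseline so both samples share the same grid;
    # current values outside that range simply fall out of the histogram.
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual) + eps
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(1)
baseline = rng.normal(0.0, 1.0, 5_000)
current = rng.normal(0.4, 1.2, 5_000)   # deliberately drifted sample

result = ks_2samp(baseline, current)
print(f"PSI={psi(baseline, current):.3f}  (0.1 moderate, 0.2 high)")
print(f"KS statistic={result.statistic:.3f}, p-value={result.pvalue:.4f}")
```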

Worked examples

Example 1: Classification with delayed labels

Baseline (last month): Precision 0.90, Recall 0.80. Current week: labels not yet available. Score distribution shows a lower median score and higher entropy.

  1. Check input drift (PSI across key features). Suppose PSI=0.25 for device_type.
  2. Proxy check: fraction of positive decisions at threshold dropped from 6% to 3%.
  3. Action: raise a warning alert; tighten rules for high-risk segment until labels arrive; schedule targeted backtest once labels land.

Why this points to risk

High PSI on a feature tied to fraud patterns plus fewer high scores suggests covariate shift; potential concept change is plausible.
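
Since labels have not landed yet in this example, the checks lean on proxies. Here is a rough sketch of the two proxy signals used above, decision rate at the production threshold and mean score entropy, computed on hypothetical score arrays; the Beta distributions only stand in for real score logs.

```python
import numpy as np

def decision_rate(scores: np.ndarray, threshold: float) -> float:
    """Fraction of cases flagged positive at the production threshold."""
    return float((scores >= threshold).mean())

def mean_binary_entropy(scores: np.ndarray, eps: float = 1e-9) -> float:
    """Average entropy of predicted probabilities (higher = less confident model)."""
    p = np.clip(scores, eps, 1 - eps)
    return float(np.mean(-p * np.log2(p) - (1 - p) * np.log2(1 - p)))

rng = np.random.default_rng(2)
baseline_scores = rng.beta(0.5, 0.5, 20_000)  # stand-in for last month: confident, scores near 0 or 1
current_scores = rng.beta(2.0, 3.0, 20_000)   # stand-in for this week: scores bunch mid-range

THRESHOLD = 0.5
print(f"decision rate: baseline {decision_rate(baseline_scores, THRESHOLD):.3f} "
      f"-> current {decision_rate(current_scores, THRESHOLD):.3f}")
print(f"mean entropy:  baseline {mean_binary_entropy(baseline_scores):.3f} "
      f"-> current {mean_binary_entropy(current_scores):.3f}")
```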

Example 2: Regression demand forecasting

Baseline MAPE: 12%. Current week MAPE (with labels): 22%. Input features show a shift in the weekend_flag distribution due to a holiday campaign.

  1. Segment analysis: on campaign days, MAPE=30%; non-campaign days, MAPE=14%.
  2. Diagnosis: a previously unseen promotion effect; concept change in the price-to-demand mapping under promotion.
  3. Action: temporary rule-based uplift; add promotion features; retrain with recent windows.

Quick calc

Overall MAPE increase: (22-12)/12 = 83% relative increase. Alert threshold likely exceeded.
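
A minimal sketch of the segment breakdown used in this example: overall MAPE plus MAPE split by a campaign flag, and the quick calc above as code. The demand, forecast, and flag arrays are invented; in practice they would come from your actuals and prediction logs.

```python
import numpy as np

def mape(actual: np.ndarray, forecast: np.ndarray) -> float:
    """Mean absolute percentage error; assumes actuals are non-zero."""
    return float(np.mean(np.abs((actual - forecast) / actual)))

# Hypothetical daily demand, forecasts, and a campaign flag
actual   = np.array([100, 120,  90, 200, 210, 110,  95], dtype=float)
forecast = np.array([ 95, 115, 100, 140, 150, 105, 100], dtype=float)
campaign = np.array([  0,   0,   0,   1,   1,   0,   0], dtype=bool)

print(f"overall MAPE:      {mape(actual, forecast):.1%}")
print(f"campaign days:     {mape(actual[campaign], forecast[campaign]):.1%}")
print(f"non-campaign days: {mape(actual[~campaign], forecast[~campaign]):.1%}")

# The quick calc from the text: relative increase of the weekly MAPE over baseline.
baseline_mape, current_mape = 0.12, 0.22
print(f"relative increase: {(current_mape - baseline_mape) / baseline_mape:.0%}")  # ~83%
```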

Example 3: NLP topic classifier

OOV rate grew from 2% to 9%; embedding centroid distance increased by 0.35 (baseline 0.10±0.05).

  1. Data drift clearly present (language shift).
  2. Performance proxy: confidence entropy increased.
  3. Action: expand vocabulary, fine-tune on recent data; consider temporary human review for low-confidence cases.

Signal combination

Combine OOV rate, centroid shift, and entropy into a composite health score to stabilize alerts.
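
One way to combine these signals into a single health score is a weighted sum of normalized components. The weights, normalization caps, and warning/critical cut-offs below are assumptions to illustrate the idea, not calibrated values.

```python
def composite_health(oov_rate: float, centroid_shift: float, entropy_delta: float) -> float:
    """Weighted blend of drift signals, each scaled to roughly [0, 1]."""
    # Normalization caps are assumptions; tune them to your own baselines.
    oov_component      = min(oov_rate / 0.10, 1.0)        # 10% OOV = fully degraded
    centroid_component = min(centroid_shift / 0.50, 1.0)  # 0.5 distance = fully degraded
    entropy_component  = min(max(entropy_delta, 0.0) / 0.30, 1.0)
    return 0.4 * oov_component + 0.4 * centroid_component + 0.2 * entropy_component

# Values from the NLP example above; entropy_delta is a hypothetical increase.
score = composite_health(oov_rate=0.09, centroid_shift=0.35, entropy_delta=0.12)
status = "critical" if score >= 0.7 else "warning" if score >= 0.4 else "ok"
print(f"health score={score:.2f} -> {status}")
```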

How to monitor in production

  1. Choose baselines: training set, recent clean window, or champion model’s last stable period.
  2. Windowing: rolling daily for input drift; weekly or per-cohort for performance when labels are delayed.
  3. Compute stats: per feature PSI/KS, per metric precision/recall, segment breakdowns (e.g., region, device, new vs. returning).
  4. Thresholds & SLOs: define warning and critical levels; include grace periods to avoid flapping.
  5. Alerts: route to on-call; include runbook links, recent changes, and top contributing features.
  6. Investigate: check data quality first (volume, schema, missingness), then feature drift, then model scores; reproduce locally with sampled data.
  7. Respond: roll back to the previous model, hotfix thresholds, or trigger retraining; document post-incident notes (a drift-check sketch follows the runbook snippet below).

Runbook snippet (copy/paste into your doc)
  • Step 1: Verify data pipeline health (freshness, counts, nulls).
  • Step 2: Inspect top-PSI features and sample records.
  • Step 3: Compare score histograms and decision rates.
  • Step 4: If labels exist, compute metrics; else apply proxy checks.
  • Step 5: Decide rollback vs. retrain; open incident ticket if critical.
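
Below is a compact sketch of how the windowing, thresholds, and anti-flapping logic above might look as a scheduled job: per-feature PSI against a baseline window, warning/critical levels, and a simple grace period (N consecutive breaching windows) before alerting. The thresholds, grace count, and synthetic data are assumptions; the psi helper repeats the one from the earlier sketch.

```python
from collections import defaultdict
import numpy as np

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index (same helper as in the earlier sketch)."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    a = np.histogram(actual, bins=edges)[0] / len(actual) + eps
    return float(np.sum((a - e) * np.log(a / e)))

WARNING_PSI, CRITICAL_PSI = 0.1, 0.2
GRACE = 2  # consecutive breaching windows required before alerting (anti-flapping)
breach_counts = defaultdict(int)

def check_feature_drift(baseline, current):
    """Return alert messages for features whose PSI breach persists past the grace period."""
    alerts = []
    for name, base_values in baseline.items():
        score = psi(base_values, current[name])
        level = ("critical" if score >= CRITICAL_PSI
                 else "warning" if score >= WARNING_PSI else None)
        if level is None:
            breach_counts[name] = 0  # reset the streak when the feature looks healthy
            continue
        breach_counts[name] += 1
        if breach_counts[name] >= GRACE:
            alerts.append(f"[{level}] {name}: PSI={score:.2f} for {breach_counts[name]} windows")
    return alerts

# Hypothetical rolling windows keyed by feature name
rng = np.random.default_rng(3)
baseline = {"price": rng.normal(20, 5, 5000), "age": rng.normal(35, 10, 5000)}
current  = {"price": rng.normal(26, 5, 5000), "age": rng.normal(35, 10, 5000)}

for day in range(3):  # simulate three daily runs against the same drifted window
    for alert in check_feature_drift(baseline, current):
        print(f"day {day}: {alert}")
```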

Exercises

These mirror the graded exercises below. Try them here first, then submit your answers in the Quick Test area to track your progress.

Exercise 1: Decide if an alert should fire

Baseline (binary classifier): Precision=0.88, Recall=0.76 at threshold 0.6. SLO: alert if relative drop >= 15% in either metric. Current (with labels): Precision=0.80, Recall=0.60.

  • Task A: Compute relative drops for both metrics.
  • Task B: Decide if alert triggers and classify the likely issue (data drift vs concept change vs both) given that PSI for feature price_band=0.05 and for device_type=0.23.

Solution

Precision drop: (0.88-0.80)/0.88=9.1% (no alert). Recall drop: (0.76-0.60)/0.76=21.1% (alert). device_type PSI=0.23 indicates input drift; performance drop suggests possible concept change too. Trigger alert; investigate device_type shift first.
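
The same arithmetic as a tiny script, so the SLO rule can be reused for other metric pairs; the numbers are the ones from the exercise.

```python
def relative_drop(baseline: float, current: float) -> float:
    """Relative drop of a metric versus its baseline value."""
    return (baseline - current) / baseline

SLO = 0.15  # alert if the relative drop is >= 15% for either metric

metrics = {"precision": (0.88, 0.80), "recall": (0.76, 0.60)}
for name, (base, cur) in metrics.items():
    drop = relative_drop(base, cur)
    print(f"{name}: drop={drop:.1%} -> {'ALERT' if drop >= SLO else 'ok'}")
```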

Exercise 2: Simple drift rule on regression inputs

Baseline feature price mean=20.0 (std 5.0), missing rate=0.5%. Current mean=25.5, missing rate=6.5%. Rule: alert if abs% change in mean > 20% OR missing rate > 5%.

  • Task A: Compute abs% change in mean.
  • Task B: Decide alert and primary investigation path.

Solution

Abs% change: (25.5-20)/20=27.5% > 20%, and missing rate 6.5% > 5%. Alert fires. Investigate missingness (pipeline or upstream schema) and understand why mean jumped—look for segment or supply changes.
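
The drift rule from this exercise as a small, hypothetical check function; the 20% mean-change and 5% missing-rate limits match the rule above.

```python
def input_drift_alert(base_mean: float, cur_mean: float, cur_missing_rate: float,
                      mean_change_limit: float = 0.20, missing_limit: float = 0.05) -> bool:
    """Alert if the absolute relative mean change OR the missing rate breaches its limit."""
    mean_change = abs(cur_mean - base_mean) / base_mean
    print(f"abs mean change={mean_change:.1%}, missing rate={cur_missing_rate:.1%}")
    return mean_change > mean_change_limit or cur_missing_rate > missing_limit

print("alert:", input_drift_alert(base_mean=20.0, cur_mean=25.5, cur_missing_rate=0.065))
```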

Checklist

  • I can distinguish input drift vs concept drift vs performance drift.
  • I can compute relative metric changes.
  • I defined warning/critical thresholds and windows.
  • I know how to act: investigate, roll back, or retrain.

Common mistakes and self-check

  • Mistake: Only watching aggregate metrics. Self-check: Do I monitor key segments (region, device, customer type)?
  • Mistake: Ignoring label delay. Self-check: Do I have proxy metrics until labels arrive?
  • Mistake: Treating any drift as bad. Self-check: Is the drift beneficial or neutral (e.g., campaign uplift)?
  • Mistake: Thresholds too sensitive (alert fatigue). Self-check: Do I have grace periods and combined signals?
  • Mistake: Not separating data quality from model drift. Self-check: Do I check freshness, counts, and nulls first?

Practical projects

  • Build a dashboard that shows PSI per feature, decision rate, and delayed performance over rolling windows.
  • Create a runbook and simulate three incidents: schema change, covariate shift, and concept change; document decisions.
  • Implement a canary evaluation comparing champion vs challenger models with identical monitoring.

Mini challenge

Design a composite health score that blends: top-3 feature PSI (weighted), decision rate change, and confidence entropy. Define warning and critical thresholds and describe the action plan for each.

Next steps

  • Instrument feature-level logging if missing.
  • Add segment-based alerting for your top risk cohorts.
  • Plan a periodic retraining cadence and backtesting protocol.

Practice Exercises

2 exercises to complete

Instructions

Baseline (binary classifier): Precision=0.88, Recall=0.76 at threshold 0.6. SLO: alert if relative drop >= 15% in either metric. Current (with labels): Precision=0.80, Recall=0.60. PSI: price_band=0.05, device_type=0.23.

  • Compute relative drops for both metrics.
  • Decide if alert triggers and classify the likely issue.

Expected Output

Alert triggers due to recall relative drop >= 15%. Likely input drift on device_type and possible concept change.

Concept Drift And Performance Monitoring — Quick Test

Test your knowledge with 6 questions. Pass with 70% or higher.
