
Monitoring And Drift Basics

Learn Monitoring And Drift Basics for free with explanations, exercises, and a quick test (for Data Scientists).

Published: January 1, 2026 | Updated: January 1, 2026

Who this is for

Data Scientists, MLOps engineers, and analysts who need production models to stay accurate, fair, and reliable over time.

Prerequisites

  • Basic model metrics (accuracy, AUC, precision/recall)
  • Understanding of training/validation/test splits
  • Comfort with reading distributions and simple statistics (mean, variance, percentiles)

Learning path

  • Before this: Offline model evaluation basics (metrics, validation strategy)
  • This lesson: Monitoring and drift fundamentals for production
  • Next: Alerting design, retraining policies, and champion–challenger deployments

Why this matters

Real-world models face changing users, markets, and data pipelines. Without monitoring, performance silently degrades and harms business outcomes.

  • Fraud detection: detect spikes in false negatives after a new product launch
  • Recommenders: track click-through rate decay and content inventory shifts
  • Risk scoring: monitor shifts in applicant demographics and credit bureau feeds
  • Support triage: ensure topic distributions and language mix don’t drift

Concept explained simply

Monitoring checks your model’s health in production, continuously. Drift is a change in what your model sees or in how the world behaves.

  • Data (covariate) drift: p(x) changes. Example: average transaction amount doubles.
  • Label/prior drift: p(y) changes. Example: fraud rate drops from 5% to 2%.
  • Concept drift: p(y|x) changes. Example: same spending pattern has new meaning after a policy change.
  • Prediction drift: p(ŷ) changes. Example: your model starts outputting higher risk scores overall.

Monitoring spans four buckets:

  • Quality metrics: AUC, log loss, calibration, lift (when labels available)
  • Data integrity: missing rates, type mismatches, out-of-range values
  • Distribution drift: PSI, KS test, Chi-square, KL divergence, MMD
  • Operational: latency, error rate, throughput, cost, fairness slices
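
To make the distribution-drift bucket concrete, here is a minimal Python sketch of two of the univariate tests named above (a two-sample KS test for a numerical feature and a Chi-square test for a categorical one) using SciPy; the data is synthetic and the function names are illustrative:

import numpy as np
from scipy.stats import ks_2samp, chi2_contingency

def numeric_drift(reference, live):
    # Two-sample KS test: a small p-value suggests the distributions differ.
    statistic, p_value = ks_2samp(reference, live)
    return statistic, p_value

def categorical_drift(reference, live, categories):
    # Chi-square test on category counts between the two windows.
    ref_counts = [int(np.sum(np.asarray(reference) == c)) for c in categories]
    live_counts = [int(np.sum(np.asarray(live) == c)) for c in categories]
    _, p_value, _, _ = chi2_contingency([ref_counts, live_counts])
    return p_value

rng = np.random.default_rng(0)
print(numeric_drift(rng.normal(0, 1, 5000), rng.normal(0.3, 1, 5000)))        # shifted mean
print(categorical_drift(["A"] * 500 + ["B"] * 500, ["A"] * 650 + ["B"] * 350, ["A", "B"]))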

Mental model

Think of monitoring like a thermostat plus a dashboard:

  • Reference window: a period of healthy behavior (often validation or an early stable week)
  • Live window: the most recent period (e.g., last day or last 10k predictions)
  • Comparator: statistical tests or distance scores between live and reference
  • Control rules: thresholds, grace periods, and multi-signal alerts
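
A minimal Python sketch of that window setup, using a synthetic prediction log (in practice you would load your own; the column names here are assumptions):

import numpy as np
import pandas as pd

# Synthetic prediction log standing in for your own logged traffic.
rng = np.random.default_rng(0)
log = pd.DataFrame({
    "timestamp": pd.Timestamp("2026-01-01")
                 + pd.to_timedelta(rng.integers(0, 31 * 24, 50_000), unit="h"),
    "score": rng.uniform(0, 1, 50_000),
})

now = log["timestamp"].max()
live = log[log["timestamp"] > now - pd.Timedelta(hours=24)]           # last 24h
reference = log[log["timestamp"] <= now - pd.Timedelta(hours=24)]     # or a fixed, known-good baseline
# Feed reference["score"] (and key features) vs. live["score"] into a comparator such as PSI or KS.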

A simple setup that works

  1. Define a reference window (training or a proven good production week).
  2. Select windows: e.g., live = last 24h, reference = last 30 days or fixed baseline.
  3. Log essentials: timestamp, features, prediction score, model version, cohort keys (country, device, segment), and ground-truth label when it arrives.
  4. Choose signals: PSI per key numerical feature, top-categories share for categoricals, missing-rate checks, AUC/logloss on delayed labels, latency p95.
  5. Set thresholds: e.g., PSI < 0.1 (green), 0.1–0.25 (amber), > 0.25 (red). Add grace periods to avoid flapping.
  6. Alerting: fire when two or more signals are red, or when a signal stays amber for 3 consecutive windows (see the sketch after this list).
  7. Response playbook: inspect cohorts, roll back model, recalibrate, or trigger retraining.
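
A minimal Python sketch of steps 5 and 6 above, using the same threshold bands; the function names and the exact alert rule are illustrative, not a prescribed implementation:

def psi_band(psi_value):
    # Thresholds from step 5; tune them to your own historical variability.
    if psi_value > 0.25:
        return "red"
    if psi_value >= 0.10:
        return "amber"
    return "green"

def should_alert(current_bands, consecutive_amber_windows):
    # current_bands: one band per monitored signal in the latest window.
    # consecutive_amber_windows: how many windows in a row have been at least amber.
    reds = sum(band == "red" for band in current_bands)
    return reds >= 2 or consecutive_amber_windows >= 3

print(should_alert(["red", "red", "green"], consecutive_amber_windows=0))      # True: page someone
print(should_alert(["amber", "green", "green"], consecutive_amber_windows=1))  # False: keep watching
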
What to log (minimal)
  • request_id, timestamp, model_version
  • features_after_processing (or hashes + stats)
  • prediction score and decision
  • cohort keys (e.g., market, device)
  • ground truth when available, with its timestamp
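
One possible shape for that record, sketched as a Python dataclass; the field names are assumptions, not a required schema:

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PredictionLog:
    request_id: str
    timestamp: str                      # ISO-8601, e.g. "2026-01-01T12:00:00Z"
    model_version: str
    features: dict                      # features after preprocessing (or hashes + summary stats)
    score: float                        # raw model output
    decision: str                       # thresholded action actually taken
    cohort: dict = field(default_factory=dict)     # e.g. {"market": "DE", "device": "ios"}
    label: Optional[int] = None         # ground truth, attached later when it arrives
    label_timestamp: Optional[str] = None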

Worked examples

Example 1: Credit scoring PSI

Feature: income. Reference distribution (binned monthly income in USD): [0–2k: 30%, 2–5k: 45%, 5–10k: 20%, 10k+: 5%]. Live: [0–2k: 18%, 2–5k: 44%, 5–10k: 28%, 10k+: 10%].

PSI per bin = (live - ref) * ln(live / ref).

  • 0–2k: (0.18-0.30)*ln(0.18/0.30) ≈ (-0.12)*ln(0.6) ≈ (-0.12)*(-0.511) ≈ 0.061
  • 2–5k: (0.44-0.45)*ln(0.44/0.45) ≈ (-0.01)*(-0.022) ≈ 0.0002
  • 5–10k: (0.28-0.20)*ln(0.28/0.20) ≈ (0.08)*ln(1.4) ≈ 0.08*0.336 ≈ 0.027
  • 10k+: (0.10-0.05)*ln(0.10/0.05) ≈ 0.05*ln(2) ≈ 0.05*0.693 ≈ 0.035

Total PSI ≈ 0.061 + 0.0002 + 0.027 + 0.035 ≈ 0.123. Interpretation: mild-to-moderate drift; investigate before it grows.
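
The same calculation as a small Python sketch, reproducing the income example (an epsilon guards against empty bins):

import math

def psi(reference_shares, live_shares, eps=1e-6):
    # Shares must come from the same bin edges; eps avoids log(0) on empty bins.
    total = 0.0
    for ref, live in zip(reference_shares, live_shares):
        ref, live = max(ref, eps), max(live, eps)
        total += (live - ref) * math.log(live / ref)
    return total

income_ref  = [0.30, 0.45, 0.20, 0.05]
income_live = [0.18, 0.44, 0.28, 0.10]
print(round(psi(income_ref, income_live), 3))  # ≈ 0.123 -> amber band, worth investigating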

Example 2: CTR model with delayed labels

Labels arrive after 24–48 hours. You monitor prediction drift (mean score, plus weekly calibration via reliability curves), data drift (impression country mix, device type), and operational latency. You compute AUC with a two-day lag and compare the 7-day rolling AUC against the reference. Threshold: an AUC drop > 0.03 or a mean-score shift > 0.1 triggers investigation.
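
A minimal Python sketch of the lagged evaluation on synthetic data; in practice you would join your prediction log with the label feed (e.g., on request_id) instead of generating labels:

import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 20_000
df = pd.DataFrame({
    "day": pd.Timestamp("2026-01-01") + pd.to_timedelta(rng.integers(0, 30, n), unit="D"),
    "score": rng.uniform(0, 1, n),
})
# Synthetic labels loosely correlated with the score (stand-in for clicks).
df["label"] = (rng.uniform(0, 1, n) < 0.2 + 0.3 * df["score"]).astype(int)

# Only evaluate days whose labels have matured (two-day lag).
mature = df[df["day"] <= df["day"].max() - pd.Timedelta(days=2)]
daily_auc = mature.groupby("day").apply(lambda g: roc_auc_score(g["label"], g["score"]))
rolling_auc = daily_auc.rolling(7).mean()
print(rolling_auc.tail())  # compare against the reference AUC; a drop > 0.03 -> investigate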

Example 3: Image classifier concept drift

A retailer adds new product styles. p(x) changes (textures, backgrounds), and p(y|x) changes because old visual cues no longer indicate category. Monitoring detects higher misclassification in the “Shoes” slice and increased entropy of predictions. Response: add new labeled examples, retrain, and add slice-specific monitoring for new categories.
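
One of those signals, average prediction entropy per slice, is cheap to compute from softmax outputs; a minimal Python sketch with made-up values:

import numpy as np

def mean_entropy_by_slice(probs, slices):
    # probs: (n_samples, n_classes) softmax outputs; slices: (n_samples,) slice labels.
    eps = 1e-12
    entropy = -np.sum(probs * np.log(probs + eps), axis=1)
    return {s: float(entropy[slices == s].mean()) for s in np.unique(slices)}

probs = np.array([[0.90, 0.05, 0.05],
                  [0.40, 0.30, 0.30],
                  [0.34, 0.33, 0.33]])
slices = np.array(["Shoes", "Shoes", "Bags"])
print(mean_entropy_by_slice(probs, slices))  # a rising "Shoes" value would mirror this example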

Choosing thresholds and windows

  • Start conservative: PSI red > 0.25, amber 0.1–0.25; KS p-value < 0.01 = red
  • Use rolling windows sized to traffic: enough examples for stable stats (e.g., 5–10k events)
  • Require persistence: 2 consecutive red windows, or 3 amber, before paging
  • Combine signals: drift + performance drop is more actionable than either alone
Rules of thumb
  • Univariate tests catch big shifts; multivariate tests catch subtle interactions
  • Monitor key slices (region, device, language) to avoid hidden degradation
  • Calibrate or retrain when you see stable drift plus KPI impact

Quick checks and diagnostics

  • If labels are delayed: watch prediction drift and calibration drift; compute performance when labels land
  • If traffic is low: aggregate over longer windows and use exact tests (e.g., Fisher’s exact test; see the sketch after this list)
  • If features are many: track top-K features by importance and a rotating subset
  • If data is noisy: add data-quality gates (schema, ranges, missingness)
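
For the low-traffic case, a minimal Python sketch of Fisher’s exact test on a 2×2 count table (reference vs. live, with vs. without some binary property such as a missing value); the counts are made up:

from scipy.stats import fisher_exact

# Rows: reference window, live window. Columns: events with the property, events without it.
table = [[12, 188],   # reference: 12 of 200 events had the property
         [22, 178]]   # live:      22 of 200 events
odds_ratio, p_value = fisher_exact(table)
print(odds_ratio, p_value)  # a small p-value suggests a real shift despite the low volume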

Common mistakes and self-check

  • Mistake: Using training data as the only reference forever. Fix: refresh reference to a known-good recent period.
  • Mistake: Alerting on a single metric. Fix: require multiple corroborating signals or persistence.
  • Mistake: Ignoring slices. Fix: set up key cohort monitoring.
  • Mistake: No ground-truth join. Fix: design a pipeline to attach labels later.
  • Mistake: Thresholds copied from blogs. Fix: calibrate thresholds using historical variability.
Self-check checklist
  • I have a clear reference window
  • I log features, predictions, versions, and cohorts
  • I monitor distribution drift and performance (when labels exist)
  • I defined thresholds, windows, and alert rules
  • I have a playbook for investigation and rollback

Practical projects

  • Build a PSI/KS dashboard for your last project with daily windows and cohort slices
  • Simulate label delay and compute lagged AUC/logloss with rolling evaluation
  • Create a drift playbook: from alert triage to retrain decision in 5 steps

Exercises

These mirror the exercises below. Try them before opening solutions.

  1. ex1: Compute PSI for a categorical feature with 4 bins and interpret it.
  2. ex2: Design a minimum monitoring plan for a churn model with 14-day label delay.
Exercise checklist
  • I computed per-bin contributions correctly
  • I summed PSI and interpreted against thresholds
  • My plan includes drift, performance (delayed), and ops metrics
  • I specified windows, thresholds, and actions

Mini challenge

Your model’s mean prediction increased by 0.15 overnight, PSI on two key features is 0.18 and 0.28, and latency p95 is stable. What do you do first?

Suggested approach
  • Investigate the feature with PSI 0.28 and recent data ingestion changes
  • Check slice breakdowns to see where drift concentrates
  • Run calibration check; if miscalibrated, consider recalibration or rollback
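
A minimal Python sketch of that calibration check, using scikit-learn’s reliability-curve helper on synthetic, deliberately overconfident scores:

import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(2)
y_prob = rng.uniform(0, 1, 5000)
y_true = (rng.uniform(0, 1, 5000) < 0.7 * y_prob).astype(int)   # labels occur less often than scored

frac_positive, mean_predicted = calibration_curve(y_true, y_prob, n_bins=10)
for predicted, observed in zip(mean_predicted, frac_positive):
    print(f"predicted ~ {predicted:.2f}   observed ~ {observed:.2f}")
# Observed rates consistently below the predicted scores point to miscalibration:
# recalibrate (e.g., isotonic or Platt scaling) or roll back if the gap is severe.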

Next steps

  • Add multivariate drift tests (MMD, or train a drift classifier as sketched below)
  • Introduce champion–challenger evaluation with canary traffic
  • Automate retraining triggers with human-in-the-loop approvals
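
The “train a drift classifier” idea is a compact multivariate test: label reference rows 0 and live rows 1, fit a classifier, and read its cross-validated AUC (near 0.5 means the two windows are hard to tell apart). A minimal Python sketch with synthetic data:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
reference = rng.normal(0, 1, size=(5000, 5))
live = rng.normal(0, 1, size=(5000, 5))
live[:, 0] += 0.4                      # inject a shift in one feature

X = np.vstack([reference, live])
y = np.concatenate([np.zeros(len(reference)), np.ones(len(live))])
auc = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=0),
                      X, y, cv=3, scoring="roc_auc").mean()
print(round(auc, 3))  # well above 0.5 -> drift; inspect feature importances to localize it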

Practice Exercises

2 exercises to complete

Exercise 1 (ex1): Instructions

You have a feature with 4 categories (A, B, C, D). Reference shares: A=0.50, B=0.20, C=0.20, D=0.10. Live shares: A=0.40, B=0.25, C=0.25, D=0.10.

Task: Compute PSI per category and total. Use PSI = (live - ref) * ln(live / ref). Interpret the result using green (<0.1), amber (0.1–0.25), red (>0.25).

Expected Output
A table or list with per-category PSI and the total PSI value, plus a short interpretation (green/amber/red).

Monitoring And Drift Basics — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

