
Monitoring And Drift Basics

Learn Monitoring And Drift Basics for free with explanations, exercises, and a quick test (for Data Scientists).

Published: January 1, 2026 | Updated: January 1, 2026

Who this is for

Data Scientists, MLOps engineers, and analysts who need production models to stay accurate, fair, and reliable over time.

Prerequisites

  • Basic model metrics (accuracy, AUC, precision/recall)
  • Understanding of training/validation/test splits
  • Comfort with reading distributions and simple statistics (mean, variance, percentiles)

Learning path

  • Before this: Offline model evaluation basics (metrics, validation strategy)
  • This lesson: Monitoring and drift fundamentals for production
  • Next: Alerting design, retraining policies, and champion–challenger deployments

Why this matters

Real-world models face changing users, markets, and data pipelines. Without monitoring, performance silently degrades and harms business outcomes.

  • Fraud detection: detect spikes in false negatives after a new product launch
  • Recommenders: track click-through rate decay and content inventory shifts
  • Risk scoring: monitor shifts in applicant demographics and credit bureau feeds
  • Support triage: ensure topic distributions and language mix don’t drift

Concept explained simply

Monitoring checks your model’s health in production, continuously. Drift is a change in what your model sees or in how the world behaves.

  • Data (covariate) drift: p(x) changes. Example: average transaction amount doubles.
  • Label/prior drift: p(y) changes. Example: fraud rate drops from 5% to 2%.
  • Concept drift: p(y|x) changes. Example: same spending pattern has new meaning after a policy change.
  • Prediction drift: p(ŷ) changes. Example: your model starts outputting higher risk scores overall.

Monitoring spans four buckets:

  • Quality metrics: AUC, log loss, calibration, lift (when labels available)
  • Data integrity: missing rates, type mismatches, out-of-range values
  • Distribution drift: PSI, KS test, Chi-square, KL divergence, MMD
  • Operational: latency, error rate, throughput, cost, fairness slices
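
To make the distribution-drift bucket concrete, here is a minimal Python sketch of two of the univariate tests named above (a two-sample KS test for a numerical feature and a Chi-square test for a categorical one) using SciPy; the data is synthetic and the function names are illustrative:

import numpy as np
from scipy.stats import ks_2samp, chi2_contingency

def numeric_drift(reference, live):
    # Two-sample KS test: a small p-value suggests the distributions differ.
    statistic, p_value = ks_2samp(reference, live)
    return statistic, p_value

def categorical_drift(reference, live, categories):
    # Chi-square test on category counts between the two windows.
    ref_counts = [int(np.sum(np.asarray(reference) == c)) for c in categories]
    live_counts = [int(np.sum(np.asarray(live) == c)) for c in categories]
    _, p_value, _, _ = chi2_contingency([ref_counts, live_counts])
    return p_value

rng = np.random.default_rng(0)
print(numeric_drift(rng.normal(0, 1, 5000), rng.normal(0.3, 1, 5000)))        # shifted mean
print(categorical_drift(["A"] * 500 + ["B"] * 500, ["A"] * 650 + ["B"] * 350, ["A", "B"]))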

Mental model

Think of monitoring like a thermostat plus a dashboard:

  • Reference window: a period of healthy behavior (often validation or an early stable week)
  • Live window: the most recent period (e.g., last day or last 10k predictions)
  • Comparator: statistical tests or distance scores between live and reference
  • Control rules: thresholds, grace periods, and multi-signal alerts
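
A minimal Python sketch of that window setup, using a synthetic prediction log (in practice you would load your own; the column names here are assumptions):

import numpy as np
import pandas as pd

# Synthetic prediction log standing in for your own logged traffic.
rng = np.random.default_rng(0)
log = pd.DataFrame({
    "timestamp": pd.Timestamp("2026-01-01")
                 + pd.to_timedelta(rng.integers(0, 31 * 24, 50_000), unit="h"),
    "score": rng.uniform(0, 1, 50_000),
})

now = log["timestamp"].max()
live = log[log["timestamp"] > now - pd.Timedelta(hours=24)]           # last 24h
reference = log[log["timestamp"] <= now - pd.Timedelta(hours=24)]     # or a fixed, known-good baseline
# Feed reference["score"] (and key features) vs. live["score"] into a comparator such as PSI or KS.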

A simple setup that works

  1. Define a reference window (training or a proven good production week).
  2. Select windows: e.g., live = last 24h, reference = last 30 days or fixed baseline.
  3. Log essentials: timestamp, features, prediction score, model version, cohort keys (country, device, segment), and ground-truth label when it arrives.
  4. Choose signals: PSI per key numerical feature, top-categories share for categoricals, missing-rate checks, AUC/logloss on delayed labels, latency p95.
  5. Set thresholds: e.g., PSI < 0.1 (green), 0.1–0.25 (amber), > 0.25 (red). Add grace periods to avoid flapping.
  6. Alerting: fire when two or more signals are red, or when a signal stays amber for 3 consecutive windows (see the sketch after this list).
  7. Response playbook: inspect cohorts, roll back model, recalibrate, or trigger retraining.
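
A minimal Python sketch of steps 5 and 6 above, using the same threshold bands; the function names and the exact alert rule are illustrative, not a prescribed implementation:

def psi_band(psi_value):
    # Thresholds from step 5; tune them to your own historical variability.
    if psi_value > 0.25:
        return "red"
    if psi_value >= 0.10:
        return "amber"
    return "green"

def should_alert(current_bands, consecutive_amber_windows):
    # current_bands: one band per monitored signal in the latest window.
    # consecutive_amber_windows: how many windows in a row have been at least amber.
    reds = sum(band == "red" for band in current_bands)
    return reds >= 2 or consecutive_amber_windows >= 3

print(should_alert(["red", "red", "green"], consecutive_amber_windows=0))      # True: page someone
print(should_alert(["amber", "green", "green"], consecutive_amber_windows=1))  # False: keep watching
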
What to log (minimal)
  • request_id, timestamp, model_version
  • features_after_processing (or hashes + stats)
  • prediction score and decision
  • cohort keys (e.g., market, device)
  • ground truth when available, with its timestamp
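
One possible shape for that record, sketched as a Python dataclass; the field names are assumptions, not a required schema:

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PredictionLog:
    request_id: str
    timestamp: str                      # ISO-8601, e.g. "2026-01-01T12:00:00Z"
    model_version: str
    features: dict                      # features after preprocessing (or hashes + summary stats)
    score: float                        # raw model output
    decision: str                       # thresholded action actually taken
    cohort: dict = field(default_factory=dict)     # e.g. {"market": "DE", "device": "ios"}
    label: Optional[int] = None         # ground truth, attached later when it arrives
    label_timestamp: Optional[str] = None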

Worked examples

Example 1: Credit scoring PSI

Feature: income. Reference distribution (binned monthly income in USD): [0–2k: 30%, 2–5k: 45%, 5–10k: 20%, 10k+: 5%]. Live: [0–2k: 18%, 2–5k: 44%, 5–10k: 28%, 10k+: 10%].

PSI per bin = (live - ref) * ln(live / ref).

  • 0–2k: (0.18-0.30)*ln(0.18/0.30) ≈ (-0.12)*ln(0.6) ≈ (-0.12)*(-0.511) ≈ 0.061
  • 2–5k: (0.44-0.45)*ln(0.44/0.45) ≈ (-0.01)*(-0.022) ≈ 0.0002
  • 5–10k: (0.28-0.20)*ln(0.28/0.20) ≈ (0.08)*ln(1.4) ≈ 0.08*0.336 ≈ 0.027
  • 10k+: (0.10-0.05)*ln(0.10/0.05) ≈ 0.05*ln(2) ≈ 0.05*0.693 ≈ 0.035

Total PSI ≈ 0.061 + 0.0002 + 0.027 + 0.035 ≈ 0.123. Interpretation: mild-to-moderate drift; investigate before it grows.
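
The same calculation as a small Python sketch, reproducing the income example (an epsilon guards against empty bins):

import math

def psi(reference_shares, live_shares, eps=1e-6):
    # Shares must come from the same bin edges; eps avoids log(0) on empty bins.
    total = 0.0
    for ref, live in zip(reference_shares, live_shares):
        ref, live = max(ref, eps), max(live, eps)
        total += (live - ref) * math.log(live / ref)
    return total

income_ref  = [0.30, 0.45, 0.20, 0.05]
income_live = [0.18, 0.44, 0.28, 0.10]
print(round(psi(income_ref, income_live), 3))  # ≈ 0.123 -> amber band, worth investigating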

Example 2: CTR model with delayed labels

Labels arrive after 24–48 hours. You monitor prediction drift (mean score, plus weekly calibration via reliability curves), data drift (impression country mix, device type), and operational latency. You compute AUC with a two-day lag and compare the 7-day rolling AUC against the reference. Threshold: an AUC drop > 0.03 or a mean-score shift > 0.1 triggers investigation.
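
A minimal Python sketch of the lagged evaluation on synthetic data; in practice you would join your prediction log with the label feed (e.g., on request_id) instead of generating labels:

import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 20_000
df = pd.DataFrame({
    "day": pd.Timestamp("2026-01-01") + pd.to_timedelta(rng.integers(0, 30, n), unit="D"),
    "score": rng.uniform(0, 1, n),
})
# Synthetic labels loosely correlated with the score (stand-in for clicks).
df["label"] = (rng.uniform(0, 1, n) < 0.2 + 0.3 * df["score"]).astype(int)

# Only evaluate days whose labels have matured (two-day lag).
mature = df[df["day"] <= df["day"].max() - pd.Timedelta(days=2)]
daily_auc = mature.groupby("day").apply(lambda g: roc_auc_score(g["label"], g["score"]))
rolling_auc = daily_auc.rolling(7).mean()
print(rolling_auc.tail())  # compare against the reference AUC; a drop > 0.03 -> investigate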

Example 3: Image classifier concept drift

A retailer adds new product styles. p(x) changes (textures, backgrounds), and p(y|x) changes because old visual cues no longer indicate category. Monitoring detects higher misclassification in the “Shoes” slice and increased entropy of predictions. Response: add new labeled examples, retrain, and add slice-specific monitoring for new categories.
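
One of those signals, average prediction entropy per slice, is cheap to compute from softmax outputs; a minimal Python sketch with made-up values:

import numpy as np

def mean_entropy_by_slice(probs, slices):
    # probs: (n_samples, n_classes) softmax outputs; slices: (n_samples,) slice labels.
    eps = 1e-12
    entropy = -np.sum(probs * np.log(probs + eps), axis=1)
    return {s: float(entropy[slices == s].mean()) for s in np.unique(slices)}

probs = np.array([[0.90, 0.05, 0.05],
                  [0.40, 0.30, 0.30],
                  [0.34, 0.33, 0.33]])
slices = np.array(["Shoes", "Shoes", "Bags"])
print(mean_entropy_by_slice(probs, slices))  # a rising "Shoes" value would mirror this example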

Choosing thresholds and windows

  • Start conservative: PSI red > 0.25, amber 0.1–0.25; KS p-value < 0.01 = red
  • Use rolling windows sized to traffic: enough examples for stable stats (e.g., 5–10k events)
  • Require persistence: 2 consecutive red windows, or 3 amber, before paging
  • Combine signals: drift + performance drop is more actionable than either alone
Rules of thumb
  • Univariate tests catch big shifts; multivariate tests catch subtle interactions
  • Monitor key slices (region, device, language) to avoid hidden degradation
  • Calibrate or retrain when you see stable drift plus KPI impact

Quick checks and diagnostics

  • If labels are delayed: watch prediction drift and calibration drift; compute performance when labels land
  • If traffic is low: aggregate over longer windows and use exact tests (e.g., Fisher’s exact test; see the sketch after this list)
  • If features are many: track top-K features by importance and a rotating subset
  • If data is noisy: add data-quality gates (schema, ranges, missingness)
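
For the low-traffic case, a minimal Python sketch of Fisher’s exact test on a 2×2 count table (reference vs. live, with vs. without some binary property such as a missing value); the counts are made up:

from scipy.stats import fisher_exact

# Rows: reference window, live window. Columns: events with the property, events without it.
table = [[12, 188],   # reference: 12 of 200 events had the property
         [22, 178]]   # live:      22 of 200 events
odds_ratio, p_value = fisher_exact(table)
print(odds_ratio, p_value)  # a small p-value suggests a real shift despite the low volume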

Common mistakes and self-check

  • Mistake: Using training data as the only reference forever. Fix: refresh reference to a known-good recent period.
  • Mistake: Alerting on a single metric. Fix: require multiple corroborating signals or persistence.
  • Mistake: Ignoring slices. Fix: set up key cohort monitoring.
  • Mistake: No ground-truth join. Fix: design a pipeline to attach labels later.
  • Mistake: Thresholds copied from blogs. Fix: calibrate thresholds using historical variability.
Self-check checklist
  • I have a clear reference window
  • I log features, predictions, versions, and cohorts
  • I monitor distribution drift and performance (when labels exist)
  • I defined thresholds, windows, and alert rules
  • I have a playbook for investigation and rollback

Practical projects

  • Build a PSI/KS dashboard for your last project with daily windows and cohort slices
  • Simulate label delay and compute lagged AUC/logloss with rolling evaluation
  • Create a drift playbook: from alert triage to retrain decision in 5 steps

Exercises

These mirror the exercises below. Try them before opening solutions.

  1. ex1: Compute PSI for a categorical feature with 4 bins and interpret it.
  2. ex2: Design a minimum monitoring plan for a churn model with 14-day label delay.
Exercise checklist
  • I computed per-bin contributions correctly
  • I summed PSI and interpreted against thresholds
  • My plan includes drift, performance (delayed), and ops metrics
  • I specified windows, thresholds, and actions

Mini challenge

Your model’s mean prediction increased by 0.15 overnight, PSI on two key features is 0.18 and 0.28, and latency p95 is stable. What do you do first?

Suggested approach
  • Investigate the feature with PSI 0.28 and recent data ingestion changes
  • Check slice breakdowns to see where drift concentrates
  • Run calibration check; if miscalibrated, consider recalibration or rollback
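
A minimal Python sketch of that calibration check, using scikit-learn’s reliability-curve helper on synthetic, deliberately overconfident scores:

import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(2)
y_prob = rng.uniform(0, 1, 5000)
y_true = (rng.uniform(0, 1, 5000) < 0.7 * y_prob).astype(int)   # labels occur less often than scored

frac_positive, mean_predicted = calibration_curve(y_true, y_prob, n_bins=10)
for predicted, observed in zip(mean_predicted, frac_positive):
    print(f"predicted ~ {predicted:.2f}   observed ~ {observed:.2f}")
# Observed rates consistently below the predicted scores point to miscalibration:
# recalibrate (e.g., isotonic or Platt scaling) or roll back if the gap is severe.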

Next steps

  • Add multivariate drift tests (MMD, or train a drift classifier as sketched below)
  • Introduce champion–challenger evaluation with canary traffic
  • Automate retraining triggers with human-in-the-loop approvals
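
The “train a drift classifier” idea is a compact multivariate test: label reference rows 0 and live rows 1, fit a classifier, and read its cross-validated AUC (near 0.5 means the two windows are hard to tell apart). A minimal Python sketch with synthetic data:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
reference = rng.normal(0, 1, size=(5000, 5))
live = rng.normal(0, 1, size=(5000, 5))
live[:, 0] += 0.4                      # inject a shift in one feature

X = np.vstack([reference, live])
y = np.concatenate([np.zeros(len(reference)), np.ones(len(live))])
auc = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=0),
                      X, y, cv=3, scoring="roc_auc").mean()
print(round(auc, 3))  # well above 0.5 -> drift; inspect feature importances to localize it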

Practice Exercises

2 exercises to complete

Exercise 1 (ex1): Instructions

You have a feature with 4 categories (A, B, C, D). Reference shares: A=0.50, B=0.20, C=0.20, D=0.10. Live shares: A=0.40, B=0.25, C=0.25, D=0.10.

Task: Compute PSI per category and total. Use PSI = (live - ref) * ln(live / ref). Interpret the result using green (<0.1), amber (0.1–0.25), red (>0.25).

Expected Output
A table or list with per-category PSI and the total PSI value, plus a short interpretation (green/amber/red).

Monitoring And Drift Basics — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

