Why this skill matters for Machine Learning Engineers
Models don’t fail only at training time—they fail in production when latency spikes, inputs shift, labels arrive late, or feedback loops skew behavior. Monitoring ML systems lets you catch issues early, explain incidents, and iterate faster. It unlocks reliable inference services, safer launches, and trustworthy decision-making across teams.
- Ship confidently: track latency, throughput, and error rates to protect user experience.
- Protect model quality: detect data and concept drift before business KPIs suffer.
- Build trust: monitor calibration, bias/fairness, and prediction quality with feedback loops.
- Respond fast: alerting, dashboards, and structured logs enable root cause analysis (RCA).
Who this is for
- Machine Learning Engineers deploying or maintaining models in production.
- Data Scientists handing off models to engineering teams.
- Platform/Infra engineers supporting model-serving systems.
Prerequisites
- Python and basic data analysis (pandas/numpy).
- Familiarity with model metrics (accuracy, AUC, F1) and classification/regression basics.
- Comfort with REST/gRPC services and JSON logs.
- Basic understanding of dashboards and alerting concepts (SLOs, thresholds).
Learning path and roadmap
1) Instrumentation foundations
Log inputs, outputs, model/version, timing, and request IDs. Compute latency, throughput, error rates. Define SLIs/SLOs.
2) Drift monitoring
Measure data and feature drift (distribution shifts) and concept drift (performance shifts). Choose tests (KS/PSI) and windows.
3) Quality feedback loops
Ingest delayed ground truth, compute rolling metrics, and close the loop with retraining triggers.
4) Calibration, bias, and fairness
Monitor reliability (ECE), segment metrics by groups, and set guardrails against unintended harm.
5) Alerting, dashboards, and RCA
Design alerts to avoid noise, build focused dashboards, and practice incident response and root cause analysis.
Milestone checklist
- SLIs/SLOs defined for latency, errors, throughput.
- Data and concept drift jobs run on schedule.
- Feedback loop from labels to rolling metrics is operating.
- Calibration and fairness dashboards exist with thresholds.
- Alert runbooks and RCA template maintained.
Worked examples
Example 1: Compute latency, throughput, and error rate from logs
Given a day of inference logs, compute p50/p95 latency, requests per minute, and error rate.
import pandas as pd
# Example rows: ts, req_id, status, latency_ms, model, version
logs = pd.DataFrame([
    {"ts":"2026-01-01T10:00:00Z","req_id":"a1","status":200,"latency_ms":42,"model":"clf","version":"1"},
    {"ts":"2026-01-01T10:00:01Z","req_id":"a2","status":500,"latency_ms":120,"model":"clf","version":"1"},
    {"ts":"2026-01-01T10:00:02Z","req_id":"a3","status":200,"latency_ms":80,"model":"clf","version":"1"},
])
logs['ts'] = pd.to_datetime(logs['ts'])
lat_p50 = logs['latency_ms'].quantile(0.5)
lat_p95 = logs['latency_ms'].quantile(0.95)
error_rate = (logs['status'] >= 400).mean()
thr_per_min = logs.resample('1min', on='ts')['req_id'].count().rename('rpm').mean()
print({"p50_ms": lat_p50, "p95_ms": lat_p95, "error_rate": error_rate, "rpm": thr_per_min})- Interpretation: p95 spikes usually impact user experience and should back alerts. Error rate captures correctness of the service call, not model quality.
Try it
Window by version and confirm whether a new model release changed p95 more than 20%.
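One way to tackle the version check, as a minimal sketch that reuses the logs frame above and assumes version "1" is the baseline release:
# Compare p95 latency per model version against a baseline version (assumption: "1" is the baseline)
p95_by_version = logs.groupby('version')['latency_ms'].quantile(0.95)
baseline = p95_by_version.get('1')
for version, p95 in p95_by_version.items():
    change = (p95 - baseline) / baseline
    flag = "ALERT" if abs(change) > 0.20 else "ok"
    print(f"version={version} p95_ms={p95:.1f} change_vs_baseline={change:+.1%} {flag}")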
Example 2: Detect data drift with KS test and PSI
Compare a current batch to a reference for a numeric feature.
import numpy as np
from scipy.stats import ks_2samp
def population_stability_index(ref, cur, bins=10):
    # Simple PSI for a 1D numeric feature, binned by reference quantiles
    qs = np.quantile(ref, np.linspace(0, 1, bins + 1))
    ref_hist, _ = np.histogram(ref, bins=qs)
    cur_hist, _ = np.histogram(cur, bins=qs)
    ref_prop = np.clip(ref_hist / max(ref_hist.sum(), 1), 1e-6, 1)
    cur_prop = np.clip(cur_hist / max(cur_hist.sum(), 1), 1e-6, 1)
    psi = np.sum((cur_prop - ref_prop) * np.log(cur_prop / ref_prop))
    return psi
np.random.seed(0)
ref = np.random.normal(0, 1, 5000)
cur = np.random.normal(0.5, 1.2, 5000) # shifted
ks_stat, ks_p = ks_2samp(ref, cur)
psi = population_stability_index(ref, cur)
print({"ks_stat": ks_stat, "ks_p": ks_p, "psi": psi})- Rules of thumb: PSI < 0.1 small, 0.1–0.25 moderate, > 0.25 significant drift. Use with context.
Try it
Run the same for categorical features by comparing proportion vectors across reference and current.
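A minimal sketch of the categorical variant, assuming you already have per-category counts for the reference and current windows (the category names below are illustrative):
import numpy as np
def categorical_psi(ref_counts, cur_counts, eps=1e-6):
    # Align categories, convert counts to proportions, then apply the PSI formula
    cats = sorted(set(ref_counts) | set(cur_counts))
    ref = np.array([ref_counts.get(c, 0) for c in cats], dtype=float)
    cur = np.array([cur_counts.get(c, 0) for c in cats], dtype=float)
    ref_p = np.clip(ref / max(ref.sum(), 1), eps, 1)
    cur_p = np.clip(cur / max(cur.sum(), 1), eps, 1)
    return float(np.sum((cur_p - ref_p) * np.log(cur_p / ref_p)))
ref_counts = {"mobile": 6000, "desktop": 3500, "tablet": 500}
cur_counts = {"mobile": 7500, "desktop": 2200, "tablet": 300}
print({"categorical_psi": categorical_psi(ref_counts, cur_counts)})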
Example 3: Concept drift via rolling performance
Use delayed labels to compute rolling AUC and alert on degradation.
import pandas as pd
from sklearn.metrics import roc_auc_score
# Assume predictions table with columns: ts, id, score, y_true (may be delayed)
df = pd.DataFrame([
    {"ts":"2026-01-01", "score":0.8, "y_true":1},
    {"ts":"2026-01-02", "score":0.6, "y_true":0},
    {"ts":"2026-01-03", "score":0.7, "y_true":1},
])
df['ts'] = pd.to_datetime(df['ts'])
def rolling_auc(data, window='7D', min_points=50):
    out = []
    for end in pd.date_range(data['ts'].min(), data['ts'].max(), freq='D'):
        start = end - pd.Timedelta(window)
        batch = data[(data['ts'] > start) & (data['ts'] <= end)]
        if len(batch) >= min_points and len(batch['y_true'].unique()) > 1:
            out.append({"ts": end, "auc": roc_auc_score(batch['y_true'], batch['score'])})
    return pd.DataFrame(out)
auc_df = rolling_auc(df, window='14D', min_points=200)  # empty on this toy sample; tune min_points to real traffic volumes
# Alert idea: flag if AUC drops by >= 5% versus a 30-day baseline or falls below an absolute floor.
Try it
Add segment-level AUC (e.g., by region) and compare to global AUC to find localized drift.
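A minimal sketch of the segment-level variant, assuming the labeled predictions frame also carries a segment column such as region (hypothetical here):
import pandas as pd
from sklearn.metrics import roc_auc_score
def segment_auc(data, segment_col, min_points=50):
    # Compute AUC per segment, skipping segments that are too small or single-class
    rows = []
    for seg, batch in data.groupby(segment_col):
        if len(batch) >= min_points and batch['y_true'].nunique() > 1:
            rows.append({segment_col: seg, "auc": roc_auc_score(batch['y_true'], batch['score']), "n": len(batch)})
    return pd.DataFrame(rows)
# Usage idea: seg_df = segment_auc(labeled_predictions, 'region'); compare each row to the global AUC and flag gaps above ~0.05.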
Example 4: Calibration monitoring (ECE)
Expected Calibration Error (ECE) summarizes how well predicted probabilities match observed frequencies.
import numpy as np
def ece(scores, labels, n_bins=10):
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    bins = np.linspace(0, 1, n_bins + 1)
    total = 0.0
    for i in range(n_bins):
        # Close the last bin on the right so a score of exactly 1.0 is counted
        upper = bins[i + 1] if i < n_bins - 1 else bins[i + 1] + 1e-12
        idx = (scores >= bins[i]) & (scores < upper)
        if idx.sum() == 0:
            continue
        conf = scores[idx].mean()   # mean predicted probability in the bin
        acc = labels[idx].mean()    # observed positive rate in the bin
        total += idx.mean() * abs(acc - conf)
    return total
# Example
scores = [0.1,0.3,0.9,0.7,0.2,0.8]
labels = [0,0,1,1,0,1]
print({"ece": ece(scores, labels, n_bins=5)})- Monitor ECE and reliability curves. Rising ECE suggests recalibration (e.g., temperature scaling) or retraining.
Try it
Compute ECE per segment (e.g., device type) to catch calibration drift in subpopulations.
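A minimal sketch of per-segment ECE, reusing the ece() function above; the device values are illustrative:
import pandas as pd
seg_df = pd.DataFrame({
    "score":  [0.1, 0.3, 0.9, 0.7, 0.2, 0.8],
    "y_true": [0, 0, 1, 1, 0, 1],
    "device": ["ios", "android", "ios", "android", "ios", "android"],
})
per_segment_ece = {device: ece(batch['score'], batch['y_true'], n_bins=5)
                   for device, batch in seg_df.groupby('device')}
print(per_segment_ece)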
Example 5: Alert rules and SLOs
Define SLIs and actionable alerts.
# Pseudo-config
SLOs:
  inference_latency_p95_ms: 250
  request_error_rate: 0.01   # 1%
  model_auc_floor: 0.70
Alerts:
  - name: HighLatencyP95
    expr: p95_latency_ms > 250 for 10m
    severity: page
    runbook: Investigate upstream dependencies, autoscaling, hot caches
  - name: RisingErrorRate
    expr: error_rate > 0.02 for 15m
    severity: page
    runbook: Check recent deploys, roll back if correlated
  - name: ConceptDrift
    expr: rolling_auc_7d < 0.72 AND drop_vs_30d >= 0.05
    severity: ticket
    runbook: Trigger label audit and candidate retrain
Try it
Add a noise-reduction strategy: require 2 consecutive evaluation windows before paging.
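A minimal sketch of that persistence rule, assuming you keep a boolean breach history per alert (oldest first, one entry per evaluation window):
def should_page(breach_history, required_consecutive=2):
    # Page only if the most recent N windows all breached the threshold
    if len(breach_history) < required_consecutive:
        return False
    return all(breach_history[-required_consecutive:])
print(should_page([False, True, True]))   # True: two consecutive breaches
print(should_page([True, False, True]))   # False: breaches are not consecutive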
Drills and exercises
- Compute p50/p95 latency and error rate for each model version in a sample log.
- Run a KS test on 3 key numeric features and calculate PSI; decide which needs investigation.
- Build a 14-day rolling AUC chart with a 30-day baseline band.
- Calculate ECE overall and by a chosen segment; propose a recalibration plan if ECE doubles.
- Draft SLOs and two alert rules; include a short runbook for each.
- Create a JSON logging schema that includes request_id, model_version, feature_hash, and inference_time_ms.
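For the last drill, a minimal sketch of a structured log record; any fields beyond request_id, model_version, feature_hash, and inference_time_ms are assumptions:
import json, time, uuid, hashlib
def log_inference(features, prediction, model_version, started_at):
    # Emit one JSON log line per request; in production this would go to your log pipeline
    record = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "request_id": str(uuid.uuid4()),
        "model_version": model_version,
        "feature_hash": hashlib.sha256(json.dumps(features, sort_keys=True).encode()).hexdigest()[:16],
        "inference_time_ms": round((time.time() - started_at) * 1000, 2),
        "prediction": prediction,
    }
    print(json.dumps(record))
started_at = time.time()
log_inference({"age": 42, "country": "US"}, prediction=0.87, model_version="clf-1.3.0", started_at=started_at)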
Common mistakes and debugging tips
Mistake: Monitoring only infra, not model quality
Symptom: Service is healthy but business KPI drops. Fix: Add concept drift metrics (AUC/F1/regression error) with delayed labels and segment breakdowns.
Mistake: Alert noise and pager fatigue
Symptom: Frequent false positives. Fix: Use percent-change plus absolute thresholds, require persistence (X of Y windows), and set appropriate severities.
Mistake: No ground truth pipeline
Symptom: Unable to compute performance metrics. Fix: Build a feedback loop to join predictions with later labels and handle delays and missingness.
Mistake: Missing context in logs
Symptom: RCA stalls. Fix: Log model/version, request/trace IDs, key feature summaries, and upstream dependency markers.
Mistake: Ignoring calibration and fairness
Symptom: Well-ranked but miscalibrated scores, or harm to groups. Fix: Track ECE and group metrics (TPR/FPR/parity) and set guardrails.
Practical projects
Mini project: Shadow deployment monitor
Goal: Safely evaluate a new classifier (v2) in shadow alongside v1.
- Route a copy of traffic to v2 without affecting users. Log v1 and v2 predictions with shared request_id.
- Build a job to compare v1 vs v2: latency p95, error rate, ECE, and where predictions disagree most (see the comparison sketch after the hints below).
- When labels arrive, compute rolling AUC for both; alert if v2 underperforms or drifts more than v1.
- Deliver a dashboard with top drifted features and a short launch decision summary.
Hints
- Use a stable reference window (e.g., last 14 days) for drift tests.
- Log version, checksum of features, and any preprocessing flags.
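A minimal sketch of the v1-vs-v2 comparison job, assuming shadow logs can be joined on request_id; the frame contents and the 0.5 decision threshold are assumptions:
import pandas as pd
v1 = pd.DataFrame({"request_id": ["a1", "a2", "a3"], "score_v1": [0.81, 0.40, 0.65], "latency_ms_v1": [40, 55, 48]})
v2 = pd.DataFrame({"request_id": ["a1", "a2", "a3"], "score_v2": [0.79, 0.72, 0.60], "latency_ms_v2": [38, 61, 52]})
both = v1.merge(v2, on="request_id", how="inner")
both["disagree"] = (both["score_v1"] >= 0.5) != (both["score_v2"] >= 0.5)   # assumes a 0.5 decision threshold
print({
    "p95_latency_v1_ms": both["latency_ms_v1"].quantile(0.95),
    "p95_latency_v2_ms": both["latency_ms_v2"].quantile(0.95),
    "disagreement_rate": both["disagree"].mean(),
})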
Project idea: Streaming metrics aggregator
Consume inference events, compute per-minute latency/error SLIs, and write compact time series for dashboards.
Project idea: Fairness and calibration board
Compute ECE and TPR/FPR by key groups and surface guardrail breaches with runbook links.
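A minimal sketch of the group-level TPR/FPR computation; the group labels and data are illustrative:
import pandas as pd
preds = pd.DataFrame({
    "y_true": [1, 0, 1, 1, 0, 0, 1, 0],
    "y_pred": [1, 0, 0, 1, 1, 0, 1, 0],
    "group":  ["A", "A", "A", "B", "B", "B", "B", "A"],
})
def group_rates(df):
    # TPR = predicted-positive rate among actual positives; FPR = among actual negatives
    out = {}
    for g, batch in df.groupby("group"):
        pos = batch[batch["y_true"] == 1]
        neg = batch[batch["y_true"] == 0]
        out[g] = {"tpr": pos["y_pred"].mean() if len(pos) else float("nan"),
                  "fpr": neg["y_pred"].mean() if len(neg) else float("nan"),
                  "n": len(batch)}
    return out
print(group_rates(preds))
# Guardrail idea: alert if the TPR or FPR gap between any two groups exceeds a chosen threshold, e.g. 0.10.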
Subskills
- Latency Throughput Error Rate Monitoring
- Data Drift And Feature Drift
- Concept Drift And Performance Monitoring
- Prediction Quality Feedback Loops
- Calibration Monitoring Basics
- Alerting And Dashboards
- Logging Inputs Outputs And Metadata
- Monitoring Bias And Fairness Basics
- Root Cause Analysis For ML Incidents
Next steps
- Automate your first end-to-end monitoring pipeline (logs to metrics to alerts).
- Add feedback-loop metrics with delayed labels and segment-level views.
- Expand guardrails: calibration and fairness checks with clear thresholds and runbooks.
- Practice incident response with a mock RCA using your logs and dashboards.