Who this is for
If you build, deploy, or maintain ML pipelines (data ingestion, feature processing, training, or inference), this lesson will help you keep them reliable, fast, and cost-effective.
Prerequisites
- Basic understanding of ML pipelines: data in, process, output (features, models, predictions).
- Comfort reading logs and basic metrics (counters, gauges, histograms).
- Familiarity with scheduling/orchestration concepts (DAGs, runs, tasks).
Why this matters
Real-world ML pipelines fail silently without monitoring. Typical MLOps tasks that depend on pipeline health monitoring:
- Ensure features are fresh before daily inference.
- Catch schema or null-value spikes before they break training.
- Detect drift early to trigger retraining.
- Keep latency under SLOs for user-facing predictions.
- Control costs by spotting runaway resource use.
What can go wrong without monitoring?
- Models use stale or missing features, causing sudden performance drops.
- Training runs succeed but consume corrupted data because of a silent upstream failure.
- Inference endpoints respond slowly; product KPIs drop.
- Alert storms lead to fatigue; real incidents get ignored.
Concept explained simply
Pipeline health monitoring is the continuous measurement of the signals that prove your ML pipeline is working as intended (correct, timely, fast, and cost-aware), plus alerting when it is not.
Mental model
Think of your pipeline as a factory line:
- Inputs: raw materials (data).
- Stations: transform, assemble (feature engineering, training).
- Outputs: finished goods (features, models, predictions).
- Quality sensors: detect delays, defects, shortages.
- Runbook: what to do when a sensor trips.
Monitoring installs these sensors and codifies the runbook.
What to monitor: key signals
- Availability and status: run success rate, task failures, retries, mean time to recover (MTTR).
- Timeliness: data freshness and end-to-end latency; schedule adherence.
- Volume and completeness: row counts, feature coverage, missing/null rates.
- Data quality and schema: type/shape checks, valid ranges, categorical cardinality.
- Statistical health: covariate (input distribution) drift, label drift, training-serving skew (see the drift sketch after this list).
- Performance: throughput (rows/sec), p95/p99 step latencies, resource saturation.
- Prediction health: error rate, confidence distribution, proxies for business KPIs.
- Cost signals: compute hours, storage growth, hot-path vs. batch costs.
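To make the statistical-health signal above concrete, here is a minimal drift-check sketch using the population stability index (PSI); the bin count, the synthetic samples, and the 0.2 alert threshold are illustrative assumptions, not fixed rules.
# PSI (population stability index) drift check -- a sketch; bins and threshold are illustrative.
import numpy as np

def psi(baseline, current, bins=10, eps=1e-6):
    # Bin edges come from the baseline so both samples are compared on the same grid.
    edges = np.histogram_bin_edges(baseline, bins=bins)
    base_pct = np.histogram(baseline, bins=edges)[0] / max(len(baseline), 1) + eps
    curr_pct = np.histogram(current, bins=edges)[0] / max(len(current), 1) + eps
    return float(np.sum((curr_pct - base_pct) * np.log(curr_pct / base_pct)))

baseline = np.random.normal(0.0, 1.0, 10_000)  # last week's values of one numeric feature
current = np.random.normal(0.8, 1.0, 10_000)   # today's values, shifted to simulate drift
score = psi(baseline, current)
print(f"PSI={score:.3f}", "drift" if score > 0.2 else "ok")  # 0.2 is a common rule of thumb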
SLI, SLO, SLA quick definitions
- SLI (indicator): the metric you measure (e.g., data freshness in minutes).
- SLO (objective): the target (e.g., freshness <= 60 minutes for 99% of runs).
- SLA (agreement): external contract; often legal. Not always needed for internal ML.
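As a small illustration of how an SLI rolls up into an SLO, here is a sketch that checks attainment against a 99% target; the freshness values are made up.
# SLO attainment check for "freshness <= 60 min on 99% of days" -- illustrative numbers.
daily_freshness_min = [42, 55, 130, 48, 59, 38, 45, 52, 40, 49]  # hypothetical SLI history
good_days = sum(1 for m in daily_freshness_min if m <= 60)
attainment = good_days / len(daily_freshness_min)                # 0.90 in this toy sample
print(f"attainment={attainment:.1%}, SLO met: {attainment >= 0.99}")
# The gap between attainment and the target is the error budget you have spent.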
SLOs, alerts, and triage
- Define user impact: What breaks if this pipeline degrades? Choose a few SLIs that map to user experience (e.g., p95 inference latency, daily feature freshness).
- Set SLOs: Pick achievable targets and error budgets (e.g., 99% of runs under 20 min).
- Instrument: Emit metrics, logs, and traces per step. Tag by run_id, dataset, model_version.
- Alert: Page on user-impacting, sustained SLO violations. Route the rest to tickets. Use multi-window (short + long) detection to avoid noise (see the sketch after this list).
- Runbook: For each alert, include probable causes, quick checks, rollback/disable steps, and owners.
- Review: Weekly error-budget and post-incident reviews to refine thresholds.
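One way to implement the multi-window detection mentioned above is to require both a short and a long window to be in breach before paging; the window contents and the fraction thresholds below are placeholder assumptions.
# Multi-window alerting sketch: page only when both windows look bad -- thresholds are illustrative.
def breach_fraction(samples, threshold):
    # Fraction of SLI samples in the window that violate the threshold.
    return sum(1 for s in samples if s > threshold) / max(len(samples), 1)

def should_page(short_window, long_window, sli_threshold, short_frac=0.9, long_frac=0.5):
    # Short window catches "it is bad right now"; long window confirms it is sustained.
    return (breach_fraction(short_window, sli_threshold) >= short_frac
            and breach_fraction(long_window, sli_threshold) >= long_frac)

# Hypothetical p95 latency samples (ms) from the last 5 minutes and the last hour.
last_5m = [150, 160, 155]
last_1h = [90, 95, 150, 160, 155, 158]
print(should_page(last_5m, last_1h, sli_threshold=120))  # True: breach is both current and sustained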
Worked examples
Example 1: Data freshness for daily features
Scenario: A batch feature pipeline must deliver features by 6:00 UTC.
- SLI: minutes from max(source_ingest_ts) to now.
- SLO: freshness <= 60 min for 99% of days.
- Alert: Trigger if freshness > 60 min for >= 15 min AND after 5:30 UTC.
- Runbook snippet: Check upstream extractor status, partition lag, storage quota; if stuck, backfill last partition and re-run transform.
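A minimal sketch of this freshness SLI and alert condition; the timestamps and the way you obtain max(source_ingest_ts) are assumptions.
# Freshness SLI and alert condition for the 6:00 UTC delivery -- an illustrative sketch.
from datetime import datetime, timezone

def freshness_minutes(max_source_ingest_ts, now_utc):
    # SLI: minutes elapsed since the newest ingested record.
    return (now_utc - max_source_ingest_ts).total_seconds() / 60

def freshness_alert(fresh_min, minutes_in_breach, now_utc):
    # Alert: freshness > 60 min, sustained for >= 15 min, and only after 05:30 UTC.
    past_0530_utc = (now_utc.hour, now_utc.minute) >= (5, 30)
    return fresh_min > 60 and minutes_in_breach >= 15 and past_0530_utc

now = datetime(2024, 1, 15, 5, 45, tzinfo=timezone.utc)          # hypothetical evaluation time
last_ingest = datetime(2024, 1, 15, 4, 10, tzinfo=timezone.utc)  # hypothetical max(source_ingest_ts)
fresh = freshness_minutes(last_ingest, now)                      # 95 minutes
print(freshness_alert(fresh, minutes_in_breach=20, now_utc=now)) # True -> page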
Example 2: Catching schema changes in training data
Scenario: Upstream adds a new category value.
- SLI: categorical cardinality delta and null-rate per column.
- SLO: cardinality change <= 20% week-over-week OR approved change ticket.
- Alert: Ticket-only if a single column changes; page if > 5 columns change at once.
- Runbook snippet: Compare profile to baseline; if intentional, update schema contract; if not, quarantine offending partitions.
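A sketch of the profile comparison, assuming pandas DataFrames for this week's and last week's training data; the tiny frames are hypothetical, and the 20% cardinality delta, 5% null-rate delta, and 5-column paging rule mirror the example above.
# Week-over-week schema/profile check -- a sketch with tiny hypothetical DataFrames.
import pandas as pd

def profile(df):
    return {col: {"cardinality": df[col].nunique(), "null_rate": df[col].isna().mean()}
            for col in df.columns}

def changed_columns(current, baseline, max_card_delta=0.20, max_null_delta=0.05):
    flagged = []
    for col, base in baseline.items():
        curr = current.get(col)
        if curr is None:  # column dropped upstream
            flagged.append(col)
        elif base["cardinality"] and abs(curr["cardinality"] - base["cardinality"]) / base["cardinality"] > max_card_delta:
            flagged.append(col)
        elif curr["null_rate"] - base["null_rate"] > max_null_delta:
            flagged.append(col)
    return flagged

baseline_df = pd.DataFrame({"plan": ["free", "pro", "free"], "email": ["a@x", None, "c@x"]})
current_df = pd.DataFrame({"plan": ["free", "pro", "team", "enterprise"], "email": ["a@x", "b@x", None, None]})
flagged = changed_columns(profile(current_df), profile(baseline_df))
severity = "page" if len(flagged) > 5 else ("ticket" if flagged else "ok")
print(flagged, severity)  # ['plan', 'email'] ticket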
Example 3: Inference p95 latency regression
Scenario: Real-time predictions must be < 120 ms at p95.
- SLI: p95 latency in ms, error rate.
- SLO: p95 <= 120 ms for 99.5% of 5-min windows; error rate < 1%.
- Alert: Page if two consecutive 5-min windows breach.
- Runbook snippet: Check autoscaling events, model size drift, feature service latency; rollback to previous model if needed.
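A sketch of the paging rule, assuming per-window p95 values have already been computed by your metrics backend; the numbers are hypothetical.
# Page when two consecutive 5-min windows breach the p95 <= 120 ms target -- an illustrative sketch.
def page_on_consecutive_breaches(window_p95s_ms, threshold_ms=120, consecutive=2):
    streak = 0
    for value in window_p95s_ms:             # ordered oldest to newest
        streak = streak + 1 if value > threshold_ms else 0
        if streak >= consecutive:
            return True
    return False

recent_windows = [98, 112, 131, 127, 119]    # hypothetical p95 per 5-min window, in ms
print(page_on_consecutive_breaches(recent_windows))  # True: windows 3 and 4 both breached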
Hands-on: instrument a toy pipeline
Imagine a batch pipeline with steps: extract, validate, transform, train, publish.
# Pseudo-instrumentation (language-agnostic)
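# metric_counter/metric_gauge/metric_histogram and timer() are hypothetical helpers, not a specific library's API.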
metric_counter('runs_started', tags={'pipeline':'churn_batch'})
with timer('extract_duration'): extract()
validate()
metric_gauge('null_rate_email', value=compute_null_rate('email'))
with timer('transform_duration'): transform()
metric_histogram('row_counts', value=rows_processed)
with timer('train_duration'): train()
metric_gauge('auc', value=eval_auc)
metric_counter('runs_succeeded', tags={'pipeline':'churn_batch'})
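The same flow as runnable Python, as a sketch that assumes the prometheus_client library as the metrics backend and uses stub step functions; substitute your own orchestrator hooks and real step logic.
# One possible concrete realization of the pseudo-instrumentation above -- a sketch, assuming
# prometheus_client as the metrics backend; extract/transform/train are stubs.
import random
import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

RUNS_STARTED = Counter("runs_started", "Pipeline runs started", ["pipeline"])
RUNS_SUCCEEDED = Counter("runs_succeeded", "Pipeline runs succeeded", ["pipeline"])
STEP_DURATION = Histogram("step_duration_seconds", "Step duration in seconds", ["pipeline", "step"])
NULL_RATE_EMAIL = Gauge("null_rate_email", "Null rate of the email column", ["pipeline"])
ROWS_PROCESSED = Gauge("rows_processed", "Rows processed in the last run", ["pipeline"])
MODEL_AUC = Gauge("model_auc", "Validation AUC of the last trained model", ["pipeline"])

def extract(): time.sleep(0.1)                                   # stub; replace with real extraction
def transform(): time.sleep(0.1)                                 # stub; replace with real feature logic
def train(): time.sleep(0.2); return random.uniform(0.70, 0.80)  # stub; returns a fake AUC

def run(pipeline="churn_batch"):
    RUNS_STARTED.labels(pipeline=pipeline).inc()
    with STEP_DURATION.labels(pipeline=pipeline, step="extract").time():
        extract()
    NULL_RATE_EMAIL.labels(pipeline=pipeline).set(0.02)    # would come from a real validation step
    with STEP_DURATION.labels(pipeline=pipeline, step="transform").time():
        transform()
    ROWS_PROCESSED.labels(pipeline=pipeline).set(125_000)  # would come from the transform step
    with STEP_DURATION.labels(pipeline=pipeline, step="train").time():
        auc = train()
    MODEL_AUC.labels(pipeline=pipeline).set(auc)
    RUNS_SUCCEEDED.labels(pipeline=pipeline).inc()

start_http_server(8000)   # exposes /metrics for your monitoring system to scrape
run()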
Alert examples
- extract_duration p95 > 10 min for 3 windows.
- null_rate_email > 5% sustained 30 min.
- auc < 0.70 after training; compare against the baseline of 0.75 ± 0.02.
Instrumentation checklist
- Emit at least one counter (runs), one gauge (freshness/null-rate), and one histogram (duration).
- Tag metrics by run_id and step to slice failures quickly.
- Add a simple runbook note for each alert.
Exercises
These mirror the practice tasks below. Do them now, then take the Quick Test.
- Exercise 1: Define SLIs/SLOs for a nightly feature pipeline and write two alert rules.
- Exercise 2: Add drift checks and choose thresholds that avoid false positives.
- Exercise 3: Create a runbook for a failed step with long queues and missing partitions.
Common mistakes
- Too many alerts: leads to fatigue. Fix: alert only on user-impacting, sustained breaches; route the rest to dashboards/tickets.
- Mean-only latency: hides tail problems. Fix: use p95/p99.
- Unlabeled metrics: hard to debug. Fix: tag by run_id, step, dataset, model_version.
- No baseline: thresholds are guessed. Fix: collect 2–4 weeks of metrics, then set data-informed SLOs.
- One-off checks: not automated. Fix: encode checks in pipeline code; fail fast with clear messages.
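For the last fix above, a minimal fail-fast check encoded directly in the pipeline; the column name and thresholds are illustrative.
# Fail fast with a clear message instead of letting bad data flow downstream -- illustrative thresholds.
import pandas as pd

def validate_features(df, min_rows=100_000, max_null_rate=0.05, column="email"):
    null_rate = df[column].isna().mean()
    if len(df) < min_rows:
        raise ValueError(f"validate_features: only {len(df)} rows, expected >= {min_rows}")
    if null_rate > max_null_rate:
        raise ValueError(f"validate_features: {column} null rate {null_rate:.1%} exceeds {max_null_rate:.0%}")
    return df

# validate_features(features_df)  # call between transform and train; features_df is your (hypothetical) feature table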
Self-check before you ship
- Can you prove the last successful run produced fresh, complete features?
- Can you see where time is spent per step (p95)?
- Do you have at most 3 paging alerts for this pipeline?
- Does each alert link to a runbook with owners and rollback steps?
- Can you trace a request from input to prediction with IDs?
Practical projects
- Batch: Instrument a daily feature pipeline with freshness, null-rate, and row-count checks; add a backfill runbook.
- Training: Add schema and drift profiling; auto-block model publication if metrics degrade beyond thresholds.
- Serving: Track p95/p99 latency and error rate; implement canary rollback when SLO breaches persist.
Mini challenge
In one page, define the top 5 SLIs for your most important pipeline, set SLOs with error budgets, and write one high-severity alert with an actionable runbook. Keep it minimal but testable within a week.
Learning path
- Instrument basic metrics (runs, durations, freshness).
- Add data quality and schema checks.
- Introduce drift monitoring and model-quality gates.
- Define SLOs and implement paging vs. ticketing.
- Automate runbooks and post-incident reviews.
Next steps
- Apply instrumentation to one real pipeline this week.
- Reduce noisy alerts by 50% via thresholds and grouping.
- Schedule a weekly error-budget review for the next month.
Quick Test