
Anomaly Detection For Pipelines

Learn Anomaly Detection For Pipelines for free with explanations, exercises, and a quick test (for Data Platform Engineers).

Published: January 11, 2026 | Updated: January 11, 2026

Who this is for

  • Data Platform Engineers who operate batch or streaming pipelines.
  • Data Engineers adding reliability guardrails to ingestion, transformation, and serving layers.

Prerequisites

  • [ ] Basic statistics: mean/median, standard deviation, percentiles.
  • [ ] Comfort with SQL and one scripting language (e.g., Python) to compute metrics.
  • [ ] Familiarity with your pipelines’ SLAs/SLOs (freshness, latency, data volume).
  • [ ] Access to historical pipeline run metadata (counts, durations, error rates).

Learning path

  1. Map your pipeline: stages, inputs/outputs, SLOs.
  2. Choose signals: volume, freshness, schema, distribution, duplicates, business KPIs.
  3. Pick detection methods: thresholds, seasonal baselines, robust z-score, change-points.
  4. Design alert rules: severity, noise reduction, escalation, runbook links.
  5. Iterate with feedback: tune baselines, suppress expected spikes, improve context.

Why this matters

In real teams, you will be asked to:

  • Detect sudden drops in ingestion volume before dashboards go blank.
  • Catch schema drift (e.g., new column type) before jobs fail.
  • Flag latency regressions that threaten freshness SLOs.
  • Spot distribution shifts (e.g., null-rate or categorical mix changes) that break models.
  • Reduce alert fatigue by suppressing known seasonal spikes and correlating related alerts.

Concept explained simply

Anomaly detection answers: “Is today’s behavior too different from what we usually see?” You define a baseline from historical behavior, measure current signals, compare, and alert when deviation crosses a threshold. Good systems also add context, suppress noise, and learn from feedback.

Mental model
  1. Observe: Collect signals per run/window (volume, freshness, etc.).
  2. Baseline: Estimate “normal” (seasonal medians, quantiles, EWMA).
  3. Detect: Compare now vs baseline (z-score, percent deviation, change-point).
  4. Decide: Apply rules (severity, consecutive breaches, cool-down).
  5. Act: Alert with context and a short runbook. Record feedback.
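
A minimal sketch of this loop in Python (standard library only). The percent-deviation detector, the 20% threshold, and the signal name are illustrative assumptions, not a prescribed implementation:

  from statistics import median

  def detect_percent_deviation(current: float, history: list[float], max_pct: float = 20.0):
      """Detect: compare today's value against the historical median (the baseline)."""
      baseline = median(history)                          # 2. Baseline
      deviation_pct = 100.0 * (current - baseline) / baseline
      return deviation_pct, abs(deviation_pct) > max_pct  # 3. Detect

  def run_check(signal: str, current: float, history: list[float]) -> None:
      deviation_pct, is_anomaly = detect_percent_deviation(current, history)
      if is_anomaly:                                      # 4. Decide (severity, cool-downs, etc. go here)
          print(f"ALERT {signal}: {deviation_pct:+.1f}% vs baseline; see runbook")  # 5. Act
      else:
          print(f"OK {signal}: {deviation_pct:+.1f}% vs baseline")

  # 1. Observe: today's volume plus the last four same-weekday runs (millions of rows).
  run_check("rows_ingested", current=7.9, history=[10.1, 10.3, 9.9, 10.2])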

Core signals to monitor

  • Volume: rows/files/bytes per run/window; new vs late-arriving rows.
  • Freshness: max event_time, lag to now, staleness vs SLO.
  • Latency: job duration, end-to-end time, queue time.
  • Schema: column add/drop/type change; incompatible encodings.
  • Quality: null/blank rate, duplicates, out-of-range values, referential integrity.
  • Distribution: numeric percentiles, categorical proportions, drift metrics (e.g., PSI).
  • Failures: error rate, retry count, dead-letter queue size.
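
A minimal sketch of computing a few of the signals above for one load window, assuming pandas and an illustrative events table (column names, timestamps, and the reference clock are hypothetical):

  import pandas as pd

  # Hypothetical raw events for one load window.
  events = pd.DataFrame({
      "event_time": ["2026-01-10 23:50", "2026-01-11 00:10", "2026-01-11 00:20"],
      "email": ["a@example.com", None, "c@example.com"],
      "order_id": [1, 2, 2],
  })
  events["event_time"] = pd.to_datetime(events["event_time"])

  now = pd.Timestamp("2026-01-11 01:00")
  signals = {
      "volume_rows": len(events),                                                    # volume
      "freshness_lag_min": (now - events["event_time"].max()).total_seconds() / 60,  # freshness
      "null_rate_email": events["email"].isna().mean(),                              # quality
      "dup_rate_order_id": 1 - events["order_id"].nunique() / len(events),           # duplicates
  }
  print(signals)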

Detection methods (when to use what)

  • Static thresholds: quick guardrails for hard SLOs (e.g., freshness < 2h).
  • Seasonal baselines: day-of-week or hour-of-day medians/quantiles for periodic patterns.
  • Robust z-score: use median and MAD for skewed metrics (see the code sketch after this list). Formula:
    robust_z = 0.6745 * (x - median) / MAD
  • EWMA/EMA: smooth noisy metrics and detect gradual drifts.
  • Change-point detection: detect regime shifts (e.g., sudden mean change).
  • Drift metrics: PSI/JS divergence to track distribution shifts in features.
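
A minimal sketch (standard library only) of two of the methods above: the robust z-score with median/MAD, and an EWMA smoother. Thresholds and sample values are illustrative:

  from statistics import median

  def robust_z(x: float, history: list[float]) -> float:
      """robust_z = 0.6745 * (x - median) / MAD, where MAD is the median absolute deviation."""
      med = median(history)
      mad = median(abs(h - med) for h in history)
      if mad == 0:
          return 0.0  # no spread in history; fall back to another rule in practice
      return 0.6745 * (x - med) / mad

  def ewma(values: list[float], alpha: float = 0.3) -> float:
      """Exponentially weighted moving average; a higher alpha reacts faster to change."""
      smoothed = values[0]
      for v in values[1:]:
          smoothed = alpha * v + (1 - alpha) * smoothed
      return smoothed

  print(round(robust_z(7.9, [10.1, 10.3, 9.9, 10.2]), 1))  # ≈ -15.2, a clear anomaly
  print(round(ewma([30.0, 32.0, 29.0, 55.0]), 1))          # drifts toward the latest spike
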
Noise reduction patterns
  • Consecutive breaches: alert only after N consecutive anomalies.
  • Hysteresis: different enter/exit thresholds to avoid flapping (both this and consecutive breaches are sketched in code after this list).
  • Cool-down windows: pause alerts after acknowledgement.
  • Correlation: group related alerts from the same run or dataset.
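
A minimal sketch of the first two patterns; the thresholds are illustrative, and a real system would persist alert state between runs:

  def consecutive_breaches(scores: list[float], threshold: float = 3.5, n: int = 3) -> bool:
      """Alert only if the last n detection scores all breach the threshold."""
      return len(scores) >= n and all(abs(s) > threshold for s in scores[-n:])

  class HysteresisAlert:
      """Enter the alert state above `enter`; clear it only when the score falls below `leave`."""
      def __init__(self, enter: float = 3.5, leave: float = 2.5):
          self.enter, self.leave = enter, leave
          self.active = False

      def update(self, score: float) -> bool:
          if not self.active and abs(score) > self.enter:
              self.active = True
          elif self.active and abs(score) < self.leave:
              self.active = False
          return self.active

  print(consecutive_breaches([1.0, 4.2, 3.8, 5.1]))        # True: last 3 scores breach 3.5
  h = HysteresisAlert()
  print([h.update(s) for s in [3.0, 4.0, 3.0, 2.0]])       # [False, True, True, False]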

Worked examples

Example 1: Volume with weekday seasonality

Scenario: Daily ingestion has weekly patterns. Build a per-weekday baseline using the last 4 weeks.

  1. For each weekday, compute median count and MAD from the last 4 same-weekday runs.
  2. Compute robust z for today. If |z| > 3.5, mark anomaly.
  3. Severity: moderate if 3.5–5, high if > 5 or absolute drop > 25%.
Small numeric demo
Last 4 Mondays (rows in millions): 10.1, 10.3, 9.9, 10.2
Median = 10.15
Absolute deviations: 0.05, 0.15, 0.25, 0.05 -> MAD = 0.10 (median of the deviations)
Today = 7.9
robust_z = 0.6745*(7.9 - 10.15)/0.10 ≈ -15.2 (high anomaly)
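
The same demo reproduced in Python (standard library only), so the baseline math is easy to check:

  from statistics import median

  history = [10.1, 10.3, 9.9, 10.2]                 # last 4 Mondays, millions of rows
  today = 7.9

  med = median(history)                             # 10.15
  mad = median(abs(h - med) for h in history)       # median of {0.05, 0.15, 0.25, 0.05} = 0.10
  z = 0.6745 * (today - med) / mad                  # ≈ -15.2
  print(round(med, 2), round(mad, 2), round(z, 1))  # |z| > 5, so severity is high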

Example 2: Null rate spike

Scenario: The email column's null_rate jumps from a typical 1–2% to 15%.

  1. Baseline: rolling day-of-week 90th percentile null_rate.
  2. Rule: alert if today > baseline + 5 percentage points AND > 2x baseline.
  3. Auto-silence if upstream outage already acknowledged for the same window.
Reasoning

Percentile baseline handles skew; dual-threshold reduces noise from tiny increases.
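
A minimal sketch of the rule from step 2. The baseline is hard-coded here for illustration; in practice it would come from the rolling day-of-week 90th percentile:

  def null_rate_alert(today_rate: float, baseline_p90: float,
                      min_abs_increase: float = 0.05, min_ratio: float = 2.0) -> bool:
      """Alert only if today exceeds the baseline by 5+ percentage points AND by at least 2x."""
      return today_rate > baseline_p90 + min_abs_increase and today_rate > min_ratio * baseline_p90

  print(null_rate_alert(today_rate=0.15, baseline_p90=0.02))  # True: 15% vs a ~2% baseline
  print(null_rate_alert(today_rate=0.04, baseline_p90=0.02))  # False: 2x, but only +2 points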

Example 3: Latency regression

Scenario: The job normally completes in 25–35 minutes. Today it is at 55 minutes and still rising.

  1. Compute EWMA of duration with alpha=0.3.
  2. Alert if duration > EWMA + 3*rolling MAD OR absolute > 45 minutes.
  3. Severity high if SLO breach risk within the next run window.
Tip

Always keep a hard cap for SLOs in addition to statistical rules.
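
A minimal sketch combining the EWMA baseline, a rolling MAD band, and the hard cap from the tip; the duration history and parameters are illustrative:

  from statistics import median

  def latency_alert(durations_min: list[float], today_min: float,
                    alpha: float = 0.3, k: float = 3.0, hard_cap_min: float = 45.0) -> bool:
      """Alert if today's duration exceeds EWMA + k * MAD, or the hard SLO cap."""
      ewma = durations_min[0]
      for d in durations_min[1:]:
          ewma = alpha * d + (1 - alpha) * ewma
      med = median(durations_min)
      mad = median(abs(d - med) for d in durations_min)
      return today_min > ewma + k * mad or today_min > hard_cap_min

  print(latency_alert([28, 31, 27, 33, 30], today_min=55))  # True: breaches both rules
  print(latency_alert([28, 31, 27, 33, 30], today_min=34))  # False: within the normal band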

Implementation checklist

  • [ ] List critical datasets and their SLOs.
  • [ ] Define per-dataset signals and owners.
  • [ ] Choose baselines per signal (static, seasonal, robust).
  • [ ] Set noise controls: consecutive breaches, cool-down, hysteresis.
  • [ ] Attach context to alerts: last normal value, chart, upstream run IDs.
  • [ ] Create short runbooks for top 5 failure modes.
  • [ ] Review weekly: tune thresholds and retire noisy alerts.

Common mistakes and self-check

  • Mistake: Using mean/stdev on skewed metrics. Self-check: Compare median/MAD vs mean/stdev; pick the one with fewer false alarms.
  • Mistake: Ignoring seasonality. Self-check: Plot metric by weekday/hour; if patterns exist, switch to seasonal baselines.
  • Mistake: Alerting on single-point spikes. Self-check: Require 2–3 consecutive breaches for non-critical signals.
  • Mistake: No root-cause context. Self-check: Include upstream job status and recent schema changes in the alert.
  • Mistake: One-size-fits-all thresholds. Self-check: Calibrate per dataset and severity.

Practical projects

  • Project A: Add anomaly detection to one high-impact dataset. Deliver: signal list, baselines, alert rules, and a runbook. Success: meaningful alert fired in dry-run without excessive noise.
  • Project B: Build a weekly anomaly review. Deliver: dashboard of anomalies, suppression reasons, and tuning actions. Success: 30–50% reduction in noisy alerts in 2 weeks.

Exercises

Complete Exercises 1–2 and verify with the solutions. Use this checklist while working:

  • [ ] Identify the right baseline for the signal.
  • [ ] Compute the detection score (z-score, percent deviation, etc.).
  • [ ] Decide on thresholds and severity.
  • [ ] Add at least one noise-reduction rule.
Exercise 1 — Weekday volume robust z-score

Compute robust z-score for today’s volume vs a weekday baseline and decide if it is an anomaly.

Exercise 2 — Null-rate alert rule with suppression

Design an alerting rule for a null rate with weekly seasonality and propose suppression logic.

Mini challenge

Your streaming topic suddenly shows a 5x spike in message rate between 18:00–19:00 daily for a week. Propose a detection and alerting plan that avoids paging every evening but still catches unexpected changes. Include: baseline choice, detection rule, and noise controls.

Next steps

  • Implement one method per signal (volume, freshness, latency) this week.
  • Add context to alerts and create concise runbooks.
  • Take the quick test to check your understanding.


Practice Exercises

2 exercises to complete

Instructions (Exercise 1)

You have the last 4 Monday row counts (in millions): 10.1, 10.3, 9.9, 10.2. Today’s Monday count is 7.9.

  1. Compute the median and MAD of the 4 historical values.
  2. Compute the robust z-score using 0.6745 * (x - median) / MAD.
  3. Mark anomaly if |z| > 3.5 and classify severity (moderate if 3.5–5, high if > 5).
Expected Output
Anomaly: yes; robust z ≈ -15.2; Severity: high

Anomaly Detection For Pipelines — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.

