

Anomaly Detection Basics

Learn Anomaly Detection Basics for free with explanations, exercises, and a quick test (for Data Engineers).

Published: January 8, 2026 | Updated: January 8, 2026

Why this matters

As a Data Engineer, you’re the guardian of pipelines. Anomaly detection helps you catch issues early so downstream dashboards, ML models, and stakeholders aren’t misled.

  • Spot sudden drops/spikes in daily events before KPIs break.
  • Catch null-rate surges after an upstream schema change.
  • Detect late-arriving files or lagging streaming consumers.
  • Identify distribution shifts (e.g., unusually high order amounts) that may signal bugs or fraud.

Concept explained simply

An anomaly is a data point or pattern that deviates from what you normally expect. It’s not always an error, but it deserves attention.

  • Outlier: a single point that looks unusual relative to its neighbors.
  • Shift/drift: the whole distribution changes (e.g., mean increases for several days).
  • Break: abrupt structural change (e.g., schema change or missing batch).

Mental model

Think of a thermostat with a comfort range. You set a band where temperature is fine; anything outside triggers a check. Your job is to set a reasonable band (baseline) and a reaction plan.

Core signals to monitor

  • Volume: rows per batch/window.
  • Freshness/latency: time since last successful load.
  • Null/empty rates per column.
  • Distinct counts/uniqueness (e.g., duplicate IDs).
  • Value ranges and distribution percentiles (p50/p95/p99).
  • Schema changes: column added/removed/type changed.
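
Most of these signals are cheap to compute per batch. A minimal stdlib-only sketch over a small batch of rows (the row shape and field names are illustrative assumptions, not from any specific pipeline):

```python
# A batch of rows as it might arrive from a load step (illustrative data).
rows = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": None},
    {"id": 2, "amount": 35.0},   # duplicate id
    {"id": 3, "amount": 12.5},
]

volume = len(rows)                                            # rows per batch
null_rate = sum(r["amount"] is None for r in rows) / volume   # null rate per column
distinct_ids = len({r["id"] for r in rows})                   # uniqueness check
duplicate_ids = volume - distinct_ids
```

Each of these values would then be written to an audit table and compared against its baseline.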

Methods you can start with

  • Rule-based thresholds: fixed min/max, or % change vs yesterday/last week.
  • Rolling mean + z-score: flag when |value - mean| > k * std over a recent window.
  • Median + MAD (robust): more stable when you have outliers in the baseline.
  • Seasonal baseline: compare today to the same weekday/hour in prior weeks.
  • Multivariate sanity rules: e.g., orders_count ≈ sum of per-region counts.
Quick formulas

z-score: z = (x - mean) / std. Flag if |z| ≥ k (common k: 3).

Robust z-score: z_r = 0.6745 * (x - median) / MAD, where MAD = median(|x - median|).

Percent change: pct = (today - baseline) / baseline.
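
These formulas translate directly into a few lines of Python; a minimal sketch (function names are illustrative, not from any library):

```python
from statistics import mean, median, pstdev

def z_score(x, baseline):
    """Standard z-score of x against a baseline window (population std)."""
    return (x - mean(baseline)) / pstdev(baseline)

def robust_z(x, baseline):
    """Robust z-score using median and MAD; stable when the baseline has outliers."""
    med = median(baseline)
    mad = median(abs(v - med) for v in baseline)
    return 0.6745 * (x - med) / mad

def pct_change(today, baseline_value):
    """Relative change of today's value vs a baseline value."""
    return (today - baseline_value) / baseline_value
```

A check then flags when the returned score crosses your chosen threshold k.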

Worked examples

  1. Daily events z-score.

    Last 7 days (baseline): 1005, 995, 1015, 1003, 990, 1008, 995. Today: 1120.

    • mean ≈ 1001.57; std ≈ 8.07.
    • z ≈ (1120 - 1001.57) / 8.07 ≈ 14.7 → anomaly.
  2. Null-rate percent change.

    Baseline null rate avg = 1.0%. Today = 3.5%.

    • pct change = (3.5 - 1.0) / 1.0 = 250% ↑.
    • If your rule is “alert if >= 100% increase,” this triggers.
  3. Freshness SLA.

    Batch should arrive by 02:00 UTC daily. Today it arrived 05:10 UTC.

    • Delay = 3h10m > 1h SLA ⇒ alert as freshness anomaly.
  4. Distribution shift via quantiles.

    Order amount p95 last 14 days ≈ $220. Today p95 = $350.

    • Rule: flag if p95 > baseline_p95 * 1.4 (i.e., a 40% increase) → 220 * 1.4 = 308; 350 > 308 → alert.
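
Examples 3 and 4 reduce to simple comparisons; a sketch with the examples' numbers (the date and variable names are illustrative):

```python
from datetime import datetime, timedelta

# Example 3: freshness SLA (batch due 02:00 UTC, arrived 05:10 UTC; date is arbitrary)
expected = datetime(2026, 1, 8, 2, 0)
arrived = datetime(2026, 1, 8, 5, 10)
delay = arrived - expected
freshness_alert = delay > timedelta(hours=1)       # 3h10m > 1h SLA -> alert

# Example 4: quantile shift -- flag if today's p95 exceeds baseline p95 by >40%
baseline_p95, today_p95 = 220.0, 350.0
quantile_alert = today_p95 > baseline_p95 * 1.4    # 350 > 308 -> alert
```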

How to implement in practice

  1. Select signals: choose 3–6 per table (volume, freshness, key column nulls, p95 of amounts).
  2. Baseline: rolling 7–28 day window, weekday-aware if seasonal.
  3. Checks: start with rules + z/MAD thresholds.
  4. Run: execute in your orchestrator after loads (batches) or in stream windows (e.g., 5-min).
  5. Alerting: route to on-call channel; include context (table, metric, last good value, link to run).
  6. Noise control: cooldowns, require 2 consecutive breaches, or use higher thresholds for low-volume data.
  7. Review: log outcomes, tune thresholds monthly.
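
The "require 2 consecutive breaches" idea from step 6 can be sketched as a small helper (names are illustrative; in practice the breach history would be read from an audit table rather than held in memory):

```python
def should_alert(breach_history, required=2):
    """Alert only when the last `required` checks all breached,
    suppressing one-off spikes."""
    if len(breach_history) < required:
        return False
    return all(breach_history[-required:])
```

A single spike does not alert; two breaches in a row do, which trades a little detection latency for much less noise.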
Example: simple SQL-style checks
-- Batch volume vs 7-day avg
WITH base AS (
  SELECT AVG(row_count) AS avg_rows, STDDEV_POP(row_count) AS sd_rows
  FROM daily_counts
  WHERE dt BETWEEN CURRENT_DATE - 7 AND CURRENT_DATE - 1
), today AS (
  SELECT row_count AS rows_today FROM daily_counts WHERE dt = CURRENT_DATE
)
SELECT CASE WHEN ABS(rows_today - avg_rows) >= 3 * sd_rows THEN 'ALERT' ELSE 'OK' END AS status
FROM base, today;

-- Null-rate robust check (median + MAD); one-sided, alerts only on rate increases
WITH hist AS (
  SELECT null_rate FROM col_null_rates WHERE dt BETWEEN CURRENT_DATE - 14 AND CURRENT_DATE - 1
), m AS (
  SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY null_rate) AS med FROM hist
), dev AS (
  SELECT ABS(null_rate - m.med) AS d FROM hist, m
), mad AS (
  SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY d) AS mad FROM dev
), t AS (
  SELECT null_rate AS today FROM col_null_rates WHERE dt = CURRENT_DATE
)
SELECT CASE WHEN 0.6745 * (today - m.med) / NULLIF(mad.mad,0) >= 5 THEN 'ALERT' ELSE 'OK' END
FROM t, m, mad;
  

Hands-on exercises

Try these small, realistic tasks. Compare your results with the solutions below each exercise card.

Exercise 1 — Rolling z-score on daily orders

Data (rows/day): D-9..D-1 = 980, 1020, 1005, 995, 1015, 1003, 990, 1008, 995; Today D0 = 1120. Use the previous 7 days (D-7..D-1) as baseline to compute mean, std, and z for D0. Threshold: |z| ≥ 3.

Solution

Baseline (D-7..D-1): 1005, 995, 1015, 1003, 990, 1008, 995.

mean ≈ 1001.57; std ≈ 8.07; z ≈ (1120 - 1001.57)/8.07 ≈ 14.7 → Alert.
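
You can verify the arithmetic with a few lines of Python, using the population standard deviation to match the solution:

```python
from statistics import mean, pstdev

baseline = [1005, 995, 1015, 1003, 990, 1008, 995]   # D-7..D-1
mu, sd = mean(baseline), pstdev(baseline)
z = (1120 - mu) / sd
print(round(mu, 2), round(sd, 2), round(z, 1))  # → 1001.57 8.07 14.7
```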

Exercise 2 — Robust null-rate alert

Last 14 days null rates (%): 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9, 1.3, 0.7, 1.0, 1.2, 0.9, 1.0. Today: 8.0. Use robust z: z_r = 0.6745 * (x - median) / MAD. Alert if z_r ≥ 5.

Solution

median = 1.0; MAD = median(|x - median|) = 0.1; z_r = 0.6745 * (8.0 - 1.0)/0.1 ≈ 47.2 → Alert.
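
Again, a quick Python check of the arithmetic:

```python
from statistics import median

rates = [1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9, 1.3, 0.7, 1.0, 1.2, 0.9, 1.0]
med = median(rates)
mad = median(abs(r - med) for r in rates)
z_r = 0.6745 * (8.0 - med) / mad
print(round(med, 1), round(mad, 1), round(z_r, 1))  # → 1.0 0.1 47.2
```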

Exercise checklist

  • I used only the specified baseline window (no peeking at today).
  • I computed mean/median and std/MAD correctly (units consistent).
  • I applied the threshold exactly as defined.
  • I wrote down an alert message that includes value, baseline, and z-score.

Common mistakes and self-checks

  • Too-sensitive thresholds: k set too low. Self-check: is the alert rate above 5% week-over-week? Raise k or require 2 consecutive breaches.
  • Ignoring seasonality: comparing Monday to Sunday. Self-check: stratify baselines by weekday/hour.
  • One-off spikes causing future noise: outliers pollute mean/std. Self-check: prefer median+MAD for baselines.
  • Alerting on low volume: small counts swing wildly. Self-check: add minimum volume guard before applying tests.
  • Mixing missing vs zero: Null ≠ 0. Self-check: verify semantic meaning per column.
  • No backtesting: thresholds unproven. Self-check: replay last 60–90 days and count true/false alerts.

Practical projects

  1. Warehouse table sentry: For a fact table, add checks: daily row count, p95 amount, null rate of key columns, freshness. Store results in an audit table and render a simple dashboard.
  2. Streaming heartbeat: In a stream, compute 5-min windows of events/sec and end-to-end latency. Alert if events/sec z ≥ 3 or latency > SLA for 2 windows.
  3. Seasonal baseline: Build weekday-aware baselines for volume and compare today vs median of the last 4 same-weekdays.
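
Project 3's same-weekday comparison can be sketched as follows (the 30% threshold and the function name are illustrative assumptions):

```python
from statistics import median

def weekday_baseline_check(history, today, max_pct=0.30):
    """Compare today's volume to the median of the last 4 same-weekday
    values. `history` holds recent same-weekday volumes, oldest first.
    Returns True when the relative change exceeds max_pct (anomaly)."""
    baseline = median(history[-4:])
    pct = (today - baseline) / baseline
    return abs(pct) > max_pct
```

For a Monday run, `history` would be the volumes of the previous four Mondays; the median keeps one odd Monday from skewing the baseline.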

Learning path

  1. Start: rule-based thresholds (min/max, % change) on core signals.
  2. Add: rolling mean/std and median/MAD for stability.
  3. Seasonality: weekday/hour-aware comparisons.
  4. Streaming: sliding windows and consecutive-breach logic.
  5. Scale: central metrics registry, alert templates, backtesting and tuning cadence.

Who this is for

  • Aspiring and practicing Data Engineers who maintain batch/stream pipelines.
  • Analytics Engineers responsible for data model reliability.
  • ML Engineers needing robust input data sanity checks.

Prerequisites

  • Basic SQL (aggregations, window functions helpful).
  • Comfort with descriptive statistics (mean, median, std, quantiles).
  • Familiarity with your orchestration tool and alert channel.

Next steps

  • Instrument 3–5 checks on one critical table; review in a week.
  • Add a weekday-aware baseline for volume.
  • Implement a cooldown or consecutive-breach rule to reduce noise.

Mini challenge

Your orders table doubled volume today, but revenue is flat. In 3 lines, propose which two checks would have caught this earlier and what context you’d include in the alert.

Quick Test



Test your knowledge with 10 questions. Pass with 70% or higher.
