Why this matters
As a Data Engineer, you’re the guardian of pipelines. Anomaly detection helps you catch issues early so downstream dashboards, ML models, and stakeholders aren’t misled.
- Spot sudden drops/spikes in daily events before KPIs break.
- Catch null-rate surges after an upstream schema change.
- Detect late-arriving files or lagging streaming consumers.
- Identify distribution shifts (e.g., unusually high order amounts) that may signal bugs or fraud.
Concept explained simply
An anomaly is a data point or pattern that deviates from what you normally expect. It’s not always an error, but it deserves attention.
- Outlier: a single point that looks unusual relative to its neighbors.
- Shift/drift: the whole distribution changes (e.g., mean increases for several days).
- Break: abrupt structural change (e.g., schema change or missing batch).
Mental model
Think of a thermostat with a comfort range. You set a band where temperature is fine; anything outside triggers a check. Your job is to set a reasonable band (baseline) and a reaction plan.
Core signals to monitor
- Volume: rows per batch/window.
- Freshness/latency: time since last successful load.
- Null/empty rates per column.
- Distinct counts/uniqueness (e.g., duplicate IDs).
- Value ranges and distribution percentiles (p50/p95/p99).
- Schema changes: column added/removed/type changed.
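The signals above can be computed in a few lines of plain Python. This is a minimal sketch over a hypothetical batch of rows; the row shape, the `amount` column, and all values are made up for illustration:

```python
from statistics import quantiles

# Hypothetical batch: rows as dicts; "amount" may be None (missing).
rows = [{"id": i, "amount": float(i)} for i in range(1, 101)]
rows[3]["amount"] = None                  # simulate a null
rows.append({"id": 100, "amount": 50.0})  # simulate a duplicate id

volume = len(rows)                                      # rows per batch
null_rate = sum(r["amount"] is None for r in rows) / volume
distinct_ids = len({r["id"] for r in rows})             # uniqueness check
dupes = volume - distinct_ids
amounts = [r["amount"] for r in rows if r["amount"] is not None]
# n=100 yields the p1..p99 cut points; pick p50/p95/p99 by index.
p50, p95, p99 = (quantiles(amounts, n=100)[i] for i in (49, 94, 98))

print(volume, round(null_rate, 3), dupes, p95)
```

In a real pipeline these numbers would come from SQL aggregates or your profiler, but the definitions are the same.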
Methods you can start with
- Rule-based thresholds: fixed min/max, or % change vs yesterday/last week.
- Rolling mean + z-score: flag when |value - mean| > k * std over a recent window.
- Median + MAD (robust): more stable when you have outliers in the baseline.
- Seasonal baseline: compare today to the same weekday/hour in prior weeks.
- Multivariate sanity rules: e.g., orders_count ≈ sum of per-region counts.
Show quick formulas
z-score: z = (x - mean) / std. Flag if |z| ≥ k (common k: 3).
Robust z-score: z_r = 0.6745 * (x - median) / MAD, where MAD = median(|x - median|).
Percent change: pct = (today - baseline) / baseline.
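These formulas translate directly to Python. A minimal sketch (function names are mine; the z-score uses the population std, which matches the worked numbers that follow):

```python
from statistics import mean, median, pstdev

def zscore(x, baseline):
    """Classic z-score of x against a baseline window (population std)."""
    m, s = mean(baseline), pstdev(baseline)
    return (x - m) / s

def robust_z(x, baseline):
    """Robust z-score: 0.6745 * (x - median) / MAD."""
    med = median(baseline)
    mad = median(abs(v - med) for v in baseline)
    return 0.6745 * (x - med) / mad

def pct_change(today, baseline_value):
    """Fractional change vs a baseline value (0.5 == 50% increase)."""
    return (today - baseline_value) / baseline_value

week = [1005, 995, 1015, 1003, 990, 1008, 995]
print(round(zscore(1120, week), 1))  # roughly 14.7 for this baseline
```

Guard against a zero std or MAD (constant baselines) before dividing in production code.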
Worked examples
- Daily events z-score. Last 7 days (baseline): 1005, 995, 1015, 1003, 990, 1008, 995. Today: 1120.
  - mean ≈ 1001.57; std ≈ 8.07 (population).
  - z ≈ (1120 - 1001.57) / 8.07 ≈ 14.7 → anomaly.
- Null-rate percent change. Baseline null-rate average = 1.0%. Today = 3.5%.
  - pct change = (3.5 - 1.0) / 1.0 = 250% increase.
  - If your rule is "alert if ≥ 100% increase," this triggers.
- Freshness SLA. The batch should arrive by 02:00 UTC daily. Today it arrived at 05:10 UTC.
  - Delay = 3h10m > 1h SLA ⇒ alert as a freshness anomaly.
- Distribution shift via quantiles. Order-amount p95 over the last 14 days ≈ $220. Today p95 = $350.
  - Rule: flag if p95 > baseline_p95 × 1.4 → 220 × 1.4 = 308; 350 > 308 ⇒ alert.
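The freshness and quantile checks can be verified in a few lines. A sketch reusing the example numbers (the dates, the 1-hour SLA, and the 40% threshold are the illustrative values from above):

```python
from datetime import datetime, timedelta, timezone

# Freshness: batch expected by 02:00 UTC; alert if more than 1h late.
expected = datetime(2024, 6, 3, 2, 0, tzinfo=timezone.utc)
arrived = datetime(2024, 6, 3, 5, 10, tzinfo=timezone.utc)
delay = arrived - expected
freshness_alert = delay > timedelta(hours=1)

# Quantile shift: flag if today's p95 exceeds baseline p95 by more than 40%.
baseline_p95, today_p95 = 220.0, 350.0
quantile_alert = today_p95 > baseline_p95 * 1.4

print(delay, freshness_alert, quantile_alert)
```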
How to implement in practice
- Select signals: choose 3–6 per table (volume, freshness, key column nulls, p95 of amounts).
- Baseline: rolling 7–28 day window, weekday-aware if seasonal.
- Checks: start with rules + z/MAD thresholds.
- Run: execute in your orchestrator after loads (batches) or in stream windows (e.g., 5-min).
- Alerting: route to on-call channel; include context (table, metric, last good value, link to run).
- Noise control: cooldowns, require 2 consecutive breaches, or use higher thresholds for low-volume data.
- Review: log outcomes, tune thresholds monthly.
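The steps above can be sketched as a tiny check runner. All names here are illustrative; the fixed min/max band and the two-consecutive-breach rule come from the list:

```python
def run_checks(metrics, rules, history):
    """Evaluate rule-based checks after a load.
    metrics: {check_name: observed value}
    rules:   {check_name: (lo, hi) allowed band}
    history: {check_name: consecutive breach count}, mutated in place.
    Returns alert messages (with context for the on-call channel)."""
    alerts = []
    for name, (lo, hi) in rules.items():
        value = metrics[name]
        breached = not (lo <= value <= hi)
        history[name] = history.get(name, 0) + 1 if breached else 0
        if history[name] >= 2:  # noise control: require 2 consecutive breaches
            alerts.append(f"ALERT {name}: value={value}, band=({lo}, {hi})")
    return alerts

history = {}
rules = {"row_count": (900, 1100)}
run_checks({"row_count": 1200}, rules, history)           # first breach: no alert
alerts = run_checks({"row_count": 1250}, rules, history)  # second breach: alert
print(alerts)
```

In practice the `rules` dict would be loaded from a metrics registry and `history` persisted in your audit table between runs.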
Example: simple SQL-style checks
```sql
-- Batch volume vs 7-day avg
WITH base AS (
  SELECT AVG(row_count) AS avg_rows,
         STDDEV_POP(row_count) AS sd_rows
  FROM daily_counts
  WHERE dt BETWEEN CURRENT_DATE - 7 AND CURRENT_DATE - 1
),
today AS (
  SELECT row_count AS rows_today
  FROM daily_counts
  WHERE dt = CURRENT_DATE
)
SELECT CASE WHEN ABS(rows_today - avg_rows) >= 3 * sd_rows
            THEN 'ALERT' ELSE 'OK' END AS status
FROM base, today;

-- Null-rate robust check (median + MAD)
WITH hist AS (
  SELECT null_rate
  FROM col_null_rates
  WHERE dt BETWEEN CURRENT_DATE - 14 AND CURRENT_DATE - 1
),
m AS (
  SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY null_rate) AS med
  FROM hist
),
dev AS (
  SELECT ABS(null_rate - m.med) AS d
  FROM hist, m
),
mad AS (
  SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY d) AS mad
  FROM dev
),
t AS (
  SELECT null_rate AS today
  FROM col_null_rates
  WHERE dt = CURRENT_DATE
)
SELECT CASE WHEN 0.6745 * (today - m.med) / NULLIF(mad.mad, 0) >= 5
            THEN 'ALERT' ELSE 'OK' END AS status
FROM t, m, mad;
```
Hands-on exercises
Try these small, realistic tasks, then compare your results with the solutions below each exercise.
Exercise 1 — Rolling z-score on daily orders
Data (rows/day): D-9..D-1 = 980, 1020, 1005, 995, 1015, 1003, 990, 1008, 995; Today D0 = 1120. Use the previous 7 days (D-7..D-1) as baseline to compute mean, std, and z for D0. Threshold: |z| ≥ 3.
Solution:
Baseline (D-7..D-1): 1005, 995, 1015, 1003, 990, 1008, 995.
mean ≈ 1001.57; std ≈ 8.07; z ≈ (1120 - 1001.57)/8.07 ≈ 14.7 → Alert.
Exercise 2 — Robust null-rate alert
Last 14 days null rates (%): 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9, 1.3, 0.7, 1.0, 1.2, 0.9, 1.0. Today: 8.0. Use robust z: z_r = 0.6745 * (x - median) / MAD. Alert if z_r ≥ 5.
Solution:
median = 1.0; MAD = median(|x - median|) = 0.1; z_r = 0.6745 × (8.0 - 1.0) / 0.1 ≈ 47.2 → Alert.
Exercise checklist
- I used only the specified baseline window (no peeking at today).
- I computed mean/median and std/MAD correctly (units consistent).
- I applied the threshold exactly as defined.
- I wrote down an alert message that includes value, baseline, and z-score.
Common mistakes and self-checks
- Too-sensitive thresholds: k set too low. Self-check: is your alert rate above ~5% week-over-week? Raise k or require 2 consecutive breaches.
- Ignoring seasonality: comparing Monday to Sunday. Self-check: stratify baselines by weekday/hour.
- One-off spikes causing future noise: outliers pollute mean/std. Self-check: prefer median+MAD for baselines.
- Alerting on low volume: small counts swing wildly. Self-check: add minimum volume guard before applying tests.
- Mixing missing vs zero: Null ≠ 0. Self-check: verify semantic meaning per column.
- No backtesting: thresholds unproven. Self-check: replay last 60–90 days and count true/false alerts.
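Backtesting can be sketched by replaying a metric series against the rolling rule. The series below reuses the daily-events baseline from the worked example plus a few made-up trailing days:

```python
from statistics import mean, pstdev

def backtest(series, window=7, k=3.0):
    """Replay history: score each day against its trailing window and
    count how many times the |z| >= k rule would have fired."""
    alerts = 0
    for i in range(window, len(series)):
        base = series[i - window:i]
        m, s = mean(base), pstdev(base)
        if s > 0 and abs(series[i] - m) >= k * s:
            alerts += 1
    return alerts

history = [1005, 995, 1015, 1003, 990, 1008, 995, 1120, 1001, 998]
print(backtest(history))  # the 1120 spike fires once
```

Note how the 1120 spike then inflates the trailing mean/std for later days, which is exactly the baseline-pollution pitfall that median+MAD avoids.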
Practical projects
- Warehouse table sentry: For a fact table, add checks: daily row count, p95 amount, null rate of key columns, freshness. Store results in an audit table and render a simple dashboard.
- Streaming heartbeat: In a stream, compute 5-min windows of events/sec and end-to-end latency. Alert if events/sec z ≥ 3 or latency > SLA for 2 windows.
- Seasonal baseline: Build weekday-aware baselines for volume and compare today vs median of the last 4 same-weekdays.
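The seasonal-baseline project can start from a helper like this. The `counts` data and dates are made up (June 3, 2024 falls on a Monday), and the four-week lookback is the value suggested above:

```python
from datetime import date, timedelta
from statistics import median

def weekday_baseline(counts, today, weeks=4):
    """Median volume of the last `weeks` same-weekdays before `today`.
    counts: {date: row_count} (illustrative shape)."""
    prior = [counts[today - timedelta(weeks=w)] for w in range(1, weeks + 1)]
    return median(prior)

# Made-up Monday volumes for the four prior Mondays.
counts = {date(2024, 6, 3) - timedelta(weeks=w): v
          for w, v in [(1, 2000), (2, 1980), (3, 2040), (4, 2010)]}
base = weekday_baseline(counts, date(2024, 6, 3))
print(base)  # median of the 4 prior Mondays
```

Compare today's volume against `base` with a percent-change or robust-z rule rather than against yesterday, so Mondays are judged against Mondays.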
Learning path
- Start: rule-based thresholds (min/max, % change) on core signals.
- Add: rolling mean/std and median/MAD for stability.
- Seasonality: weekday/hour-aware comparisons.
- Streaming: sliding windows and consecutive-breach logic.
- Scale: central metrics registry, alert templates, backtesting and tuning cadence.
Who this is for
- Aspiring and practicing Data Engineers who maintain batch/stream pipelines.
- Analytics Engineers responsible for data model reliability.
- ML Engineers needing robust input data sanity checks.
Prerequisites
- Basic SQL (aggregations, window functions helpful).
- Comfort with descriptive statistics (mean, median, std, quantiles).
- Familiarity with your orchestration tool and alert channel.
Next steps
- Instrument 3–5 checks on one critical table; review in a week.
- Add a weekday-aware baseline for volume.
- Implement a cooldown or consecutive-breach rule to reduce noise.
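A cooldown rule can be sketched as a small class. The 6-hour window and check name are assumed values, not a prescription:

```python
from datetime import datetime, timedelta

class Cooldown:
    """Suppress repeat alerts for the same check within a cooldown window."""
    def __init__(self, window=timedelta(hours=6)):
        self.window = window
        self.last_fired = {}

    def should_fire(self, check, now):
        last = self.last_fired.get(check)
        if last is not None and now - last < self.window:
            return False  # still cooling down: swallow the alert
        self.last_fired[check] = now
        return True

cd = Cooldown()
t0 = datetime(2024, 6, 3, 3, 0)
print(cd.should_fire("row_count", t0))                       # first alert fires
print(cd.should_fire("row_count", t0 + timedelta(hours=1)))  # suppressed
print(cd.should_fire("row_count", t0 + timedelta(hours=7)))  # window passed
```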
Mini challenge
Your orders table doubled volume today, but revenue is flat. In 3 lines, propose which two checks would have caught this earlier and what context you’d include in the alert.