Why this matters
As a Data Engineer, you’re the guardian of pipelines. Anomaly detection helps you catch issues early so downstream dashboards, ML models, and stakeholders aren’t misled.
- Spot sudden drops/spikes in daily events before KPIs break.
- Catch null-rate surges after an upstream schema change.
- Detect late-arriving files or lagging streaming consumers.
- Identify distribution shifts (e.g., unusually high order amounts) that may signal bugs or fraud.
Concept explained simply
An anomaly is a data point or pattern that deviates from what you normally expect. It’s not always an error, but it deserves attention.
- Outlier: a single point that looks unusual relative to its neighbors.
- Shift/drift: the whole distribution changes (e.g., mean increases for several days).
- Break: abrupt structural change (e.g., schema change or missing batch).
Mental model
Think of a thermostat with a comfort range. You set a band where temperature is fine; anything outside triggers a check. Your job is to set a reasonable band (baseline) and a reaction plan.
Core signals to monitor
- Volume: rows per batch/window.
- Freshness/latency: time since last successful load.
- Null/empty rates per column.
- Distinct counts/uniqueness (e.g., duplicate IDs).
- Value ranges and distribution percentiles (p50/p95/p99).
- Schema changes: column added/removed/type changed.
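The signals above can be computed in a few lines of plain Python. This is a minimal sketch over a hypothetical batch of rows; the row shape, the `amount` column, and all values are made up for illustration:

```python
from statistics import quantiles

# Hypothetical batch: rows as dicts; "amount" may be None (missing).
rows = [{"id": i, "amount": float(i)} for i in range(1, 101)]
rows[3]["amount"] = None                  # simulate a null
rows.append({"id": 100, "amount": 50.0})  # simulate a duplicate id

volume = len(rows)                                      # rows per batch
null_rate = sum(r["amount"] is None for r in rows) / volume
distinct_ids = len({r["id"] for r in rows})             # uniqueness check
dupes = volume - distinct_ids
amounts = [r["amount"] for r in rows if r["amount"] is not None]
# n=100 yields the p1..p99 cut points; pick p50/p95/p99 by index.
p50, p95, p99 = (quantiles(amounts, n=100)[i] for i in (49, 94, 98))

print(volume, round(null_rate, 3), dupes, p95)
```

In a real pipeline these numbers would come from SQL aggregates or your profiler, but the definitions are the same.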
Methods you can start with
- Rule-based thresholds: fixed min/max, or % change vs yesterday/last week.
- Rolling mean + z-score: flag when |value - mean| > k * std over a recent window.
- Median + MAD (robust): more stable when you have outliers in the baseline.
- Seasonal baseline: compare today to the same weekday/hour in prior weeks.
- Multivariate sanity rules: e.g., orders_count ≈ sum of per-region counts.
Show quick formulas
z-score: z = (x - mean) / std. Flag if |z| ≥ k (common k: 3).
Robust z-score: z_r = 0.6745 * (x - median) / MAD, where MAD = median(|x - median|).
Percent change: pct = (today - baseline) / baseline.
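These formulas translate directly to Python. A minimal sketch (function names are mine; the z-score uses the population std, which matches the worked numbers that follow):

```python
from statistics import mean, median, pstdev

def zscore(x, baseline):
    """Classic z-score of x against a baseline window (population std)."""
    m, s = mean(baseline), pstdev(baseline)
    return (x - m) / s

def robust_z(x, baseline):
    """Robust z-score: 0.6745 * (x - median) / MAD."""
    med = median(baseline)
    mad = median(abs(v - med) for v in baseline)
    return 0.6745 * (x - med) / mad

def pct_change(today, baseline_value):
    """Fractional change vs a baseline value (0.5 == 50% increase)."""
    return (today - baseline_value) / baseline_value

week = [1005, 995, 1015, 1003, 990, 1008, 995]
print(round(zscore(1120, week), 1))  # roughly 14.7 for this baseline
```

Guard against a zero std or MAD (constant baselines) before dividing in production code.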
Worked examples
- Daily events z-score. Last 7 days (baseline): 1005, 995, 1015, 1003, 990, 1008, 995. Today: 1120.
  - mean ≈ 1001.57; std ≈ 8.07 (population).
  - z ≈ (1120 - 1001.57) / 8.07 ≈ 14.7 → anomaly.
- Null-rate percent change. Baseline null-rate average = 1.0%. Today = 3.5%.
  - pct change = (3.5 - 1.0) / 1.0 = 250% increase.
  - If your rule is "alert if ≥ 100% increase," this triggers.
- Freshness SLA. The batch should arrive by 02:00 UTC daily. Today it arrived at 05:10 UTC.
  - Delay = 3h10m > 1h SLA ⇒ alert as a freshness anomaly.
- Distribution shift via quantiles. Order-amount p95 over the last 14 days ≈ $220. Today p95 = $350.
  - Rule: flag if p95 > baseline_p95 × 1.4 → 220 × 1.4 = 308; 350 > 308 ⇒ alert.
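The freshness and quantile checks can be verified in a few lines. A sketch reusing the example numbers (the dates, the 1-hour SLA, and the 40% threshold are the illustrative values from above):

```python
from datetime import datetime, timedelta, timezone

# Freshness: batch expected by 02:00 UTC; alert if more than 1h late.
expected = datetime(2024, 6, 3, 2, 0, tzinfo=timezone.utc)
arrived = datetime(2024, 6, 3, 5, 10, tzinfo=timezone.utc)
delay = arrived - expected
freshness_alert = delay > timedelta(hours=1)

# Quantile shift: flag if today's p95 exceeds baseline p95 by more than 40%.
baseline_p95, today_p95 = 220.0, 350.0
quantile_alert = today_p95 > baseline_p95 * 1.4

print(delay, freshness_alert, quantile_alert)
```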
How to implement in practice
- Select signals: choose 3–6 per table (volume, freshness, key column nulls, p95 of amounts).
- Baseline: rolling 7–28 day window, weekday-aware if seasonal.
- Checks: start with rules + z/MAD thresholds.
- Run: execute in your orchestrator after loads (batches) or in stream windows (e.g., 5-min).
- Alerting: route to on-call channel; include context (table, metric, last good value, link to run).
- Noise control: cooldowns, require 2 consecutive breaches, or use higher thresholds for low-volume data.
- Review: log outcomes, tune thresholds monthly.
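The steps above can be sketched as a tiny check runner. All names here are illustrative; the fixed min/max band and the two-consecutive-breach rule come from the list:

```python
def run_checks(metrics, rules, history):
    """Evaluate rule-based checks after a load.
    metrics: {check_name: observed value}
    rules:   {check_name: (lo, hi) allowed band}
    history: {check_name: consecutive breach count}, mutated in place.
    Returns alert messages (with context for the on-call channel)."""
    alerts = []
    for name, (lo, hi) in rules.items():
        value = metrics[name]
        breached = not (lo <= value <= hi)
        history[name] = history.get(name, 0) + 1 if breached else 0
        if history[name] >= 2:  # noise control: require 2 consecutive breaches
            alerts.append(f"ALERT {name}: value={value}, band=({lo}, {hi})")
    return alerts

history = {}
rules = {"row_count": (900, 1100)}
run_checks({"row_count": 1200}, rules, history)           # first breach: no alert
alerts = run_checks({"row_count": 1250}, rules, history)  # second breach: alert
print(alerts)
```

In practice the `rules` dict would be loaded from a metrics registry and `history` persisted in your audit table between runs.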
Example: simple SQL-style checks
```sql
-- Batch volume vs 7-day avg
WITH base AS (
  SELECT AVG(row_count) AS avg_rows,
         STDDEV_POP(row_count) AS sd_rows
  FROM daily_counts
  WHERE dt BETWEEN CURRENT_DATE - 7 AND CURRENT_DATE - 1
),
today AS (
  SELECT row_count AS rows_today
  FROM daily_counts
  WHERE dt = CURRENT_DATE
)
SELECT CASE WHEN ABS(rows_today - avg_rows) >= 3 * sd_rows
            THEN 'ALERT' ELSE 'OK' END AS status
FROM base, today;

-- Null-rate robust check (median + MAD)
WITH hist AS (
  SELECT null_rate
  FROM col_null_rates
  WHERE dt BETWEEN CURRENT_DATE - 14 AND CURRENT_DATE - 1
),
m AS (
  SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY null_rate) AS med
  FROM hist
),
dev AS (
  SELECT ABS(null_rate - m.med) AS d
  FROM hist, m
),
mad AS (
  SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY d) AS mad
  FROM dev
),
t AS (
  SELECT null_rate AS today
  FROM col_null_rates
  WHERE dt = CURRENT_DATE
)
SELECT CASE WHEN 0.6745 * (today - m.med) / NULLIF(mad.mad, 0) >= 5
            THEN 'ALERT' ELSE 'OK' END AS status
FROM t, m, mad;
```
Hands-on exercises
Try these small, realistic tasks, then compare your results with the solutions below each exercise.
Exercise 1 — Rolling z-score on daily orders
Data (rows/day): D-9..D-1 = 980, 1020, 1005, 995, 1015, 1003, 990, 1008, 995; Today D0 = 1120. Use the previous 7 days (D-7..D-1) as baseline to compute mean, std, and z for D0. Threshold: |z| ≥ 3.
Solution:
Baseline (D-7..D-1): 1005, 995, 1015, 1003, 990, 1008, 995.
mean ≈ 1001.57; std ≈ 8.07; z ≈ (1120 - 1001.57)/8.07 ≈ 14.7 → Alert.
Exercise 2 — Robust null-rate alert
Last 14 days null rates (%): 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9, 1.3, 0.7, 1.0, 1.2, 0.9, 1.0. Today: 8.0. Use robust z: z_r = 0.6745 * (x - median) / MAD. Alert if z_r ≥ 5.
Solution:
median = 1.0; MAD = median(|x - median|) = 0.1; z_r = 0.6745 × (8.0 - 1.0) / 0.1 ≈ 47.2 → Alert.
Exercise checklist
- I used only the specified baseline window (no peeking at today).
- I computed mean/median and std/MAD correctly (units consistent).
- I applied the threshold exactly as defined.
- I wrote down an alert message that includes value, baseline, and z-score.
Common mistakes and self-checks
- Too-sensitive thresholds: k set too low. Self-check: is your alert rate above ~5% week-over-week? Raise k or require 2 consecutive breaches.
- Ignoring seasonality: comparing Monday to Sunday. Self-check: stratify baselines by weekday/hour.
- One-off spikes causing future noise: outliers pollute mean/std. Self-check: prefer median+MAD for baselines.
- Alerting on low volume: small counts swing wildly. Self-check: add minimum volume guard before applying tests.
- Mixing missing vs zero: Null ≠ 0. Self-check: verify semantic meaning per column.
- No backtesting: thresholds unproven. Self-check: replay last 60–90 days and count true/false alerts.
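Backtesting can be sketched by replaying a metric series against the rolling rule. The series below reuses the daily-events baseline from the worked example plus a few made-up trailing days:

```python
from statistics import mean, pstdev

def backtest(series, window=7, k=3.0):
    """Replay history: score each day against its trailing window and
    count how many times the |z| >= k rule would have fired."""
    alerts = 0
    for i in range(window, len(series)):
        base = series[i - window:i]
        m, s = mean(base), pstdev(base)
        if s > 0 and abs(series[i] - m) >= k * s:
            alerts += 1
    return alerts

history = [1005, 995, 1015, 1003, 990, 1008, 995, 1120, 1001, 998]
print(backtest(history))  # the 1120 spike fires once
```

Note how the 1120 spike then inflates the trailing mean/std for later days, which is exactly the baseline-pollution pitfall that median+MAD avoids.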
Practical projects
- Warehouse table sentry: For a fact table, add checks: daily row count, p95 amount, null rate of key columns, freshness. Store results in an audit table and render a simple dashboard.
- Streaming heartbeat: In a stream, compute 5-min windows of events/sec and end-to-end latency. Alert if events/sec z ≥ 3 or latency > SLA for 2 windows.
- Seasonal baseline: Build weekday-aware baselines for volume and compare today vs median of the last 4 same-weekdays.
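The seasonal-baseline project can start from a helper like this. The `counts` data and dates are made up (June 3, 2024 falls on a Monday), and the four-week lookback is the value suggested above:

```python
from datetime import date, timedelta
from statistics import median

def weekday_baseline(counts, today, weeks=4):
    """Median volume of the last `weeks` same-weekdays before `today`.
    counts: {date: row_count} (illustrative shape)."""
    prior = [counts[today - timedelta(weeks=w)] for w in range(1, weeks + 1)]
    return median(prior)

# Made-up Monday volumes for the four prior Mondays.
counts = {date(2024, 6, 3) - timedelta(weeks=w): v
          for w, v in [(1, 2000), (2, 1980), (3, 2040), (4, 2010)]}
base = weekday_baseline(counts, date(2024, 6, 3))
print(base)  # median of the 4 prior Mondays
```

Compare today's volume against `base` with a percent-change or robust-z rule rather than against yesterday, so Mondays are judged against Mondays.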
Learning path
- Start: rule-based thresholds (min/max, % change) on core signals.
- Add: rolling mean/std and median/MAD for stability.
- Seasonality: weekday/hour-aware comparisons.
- Streaming: sliding windows and consecutive-breach logic.
- Scale: central metrics registry, alert templates, backtesting and tuning cadence.
Who this is for
- Aspiring and practicing Data Engineers who maintain batch/stream pipelines.
- Analytics Engineers responsible for data model reliability.
- ML Engineers needing robust input data sanity checks.
Prerequisites
- Basic SQL (aggregations, window functions helpful).
- Comfort with descriptive statistics (mean, median, std, quantiles).
- Familiarity with your orchestration tool and alert channel.
Next steps
- Instrument 3–5 checks on one critical table; review in a week.
- Add a weekday-aware baseline for volume.
- Implement a cooldown or consecutive-breach rule to reduce noise.
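A cooldown rule can be sketched as a small class. The 6-hour window and check name are assumed values, not a prescription:

```python
from datetime import datetime, timedelta

class Cooldown:
    """Suppress repeat alerts for the same check within a cooldown window."""
    def __init__(self, window=timedelta(hours=6)):
        self.window = window
        self.last_fired = {}

    def should_fire(self, check, now):
        last = self.last_fired.get(check)
        if last is not None and now - last < self.window:
            return False  # still cooling down: swallow the alert
        self.last_fired[check] = now
        return True

cd = Cooldown()
t0 = datetime(2024, 6, 3, 3, 0)
print(cd.should_fire("row_count", t0))                       # first alert fires
print(cd.should_fire("row_count", t0 + timedelta(hours=1)))  # suppressed
print(cd.should_fire("row_count", t0 + timedelta(hours=7)))  # window passed
```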
Mini challenge
Your orders table doubled volume today, but revenue is flat. In 3 lines, propose which two checks would have caught this earlier and what context you’d include in the alert.