Why this matters
Anomalies in data can signal pipeline failures, late arrivals, schema drifts, spikes in nulls, or business events. As an ETL Developer, you will:
- Guard SLAs by catching missing or delayed batches before downstream jobs run.
- Protect dashboards and models from bad inputs (e.g., sudden null spikes or duplicates).
- Detect data drift and unexpected behavior after code or source changes.
- Reduce firefighting by turning unknown surprises into clear, actionable alerts.
Who this is for
- ETL/ELT developers implementing data quality checks.
- Data engineers adding monitoring to pipelines.
- Analysts who own critical metrics and need guardrails.
Prerequisites
- Basic SQL (SELECT, GROUP BY, window functions helpful but not required).
- Comfort with simple statistics (mean, median, standard deviation, percentiles).
- Understanding of your pipeline schedules and SLAs.
Concept explained simply
An anomaly is a data point or period that deviates from what you normally expect. You define “normal” with simple rules or a baseline from recent history, then flag anything too far away.
Mental model
- Baseline: What did this metric look like recently? (e.g., last 14 days)
- Distance: How far is today from that baseline? (z-score, MAD, or % change)
- Decision: Is the distance bigger than a threshold? If yes, alert.
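The baseline/distance/decision loop above can be sketched in a few lines of Python (the window values, the new observation, and the threshold here are illustrative):

```python
from statistics import mean, stdev

def is_anomaly(history, today, k=3.0):
    """Flag `today` if it lies more than k standard deviations
    from the mean of `history` (the recent baseline window)."""
    baseline = mean(history)
    spread = stdev(history)  # sample std of the baseline window
    z = (today - baseline) / spread
    return abs(z) > k, z

# Illustrative baseline window and one new observation
flagged, z = is_anomaly([100, 101, 99, 100, 102, 99, 100], 130)
```

Swap in a different distance (MAD, percent change) without changing the decision step.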
Core techniques you can apply today
1) Static rules
- Example: row_count < 1000 or null_rate > 5%.
- Use for hard expectations (e.g., schema must not change).
2) Rolling mean + z-score
- Compute rolling mean and std over recent N days, flag if |(value - mean)/std| > k.
- Good for roughly stable metrics without strong outliers.
3) Median + MAD (robust)
- MAD = median(|x - median(x)|). Flag if |x - median| / (1.4826 * MAD) > k.
- More robust when data is skewed or has outliers.
4) Percentile/IQR rules
- Flag if value is outside [P5, P95] or beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR.
- Simple and interpretable without assuming normal distribution.
5) Seasonal baselines
- Compare Monday vs previous Mondays, hour-of-day vs same hour last week.
- Controls for weekly/hourly patterns to reduce false positives.
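The z-score, MAD, and IQR techniques above can be sketched side by side. This is an illustrative comparison using only the standard library, not production code:

```python
from statistics import mean, stdev, median, quantiles

def zscore_flag(history, x, k=3.0):
    """Rolling mean + z-score: sensitive, assumes roughly stable data."""
    return abs(x - mean(history)) / stdev(history) > k

def mad_flag(history, x, k=6.0):
    """Median + MAD: robust to skew and existing outliers in history."""
    med = median(history)
    mad = median(abs(v - med) for v in history)
    return abs(x - med) / (1.4826 * mad) > k

def iqr_flag(history, x, factor=1.5):
    """IQR fences: interpretable, no normality assumption."""
    q1, _, q3 = quantiles(history, n=4)
    iqr = q3 - q1
    return x < q1 - factor * iqr or x > q3 + factor * iqr
```

All three agree on an extreme value; they differ most when the history itself already contains outliers, which is where MAD and IQR hold up better.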
Worked examples
Example 1: Daily row count spike (rolling z-score)
Dates: 2025-01-01...2025-01-14
Row_count: [100k, 101k, 99k, 100k, 100k, 102k, 99k, 100k, 101k, 100k, 100k, 98k, 100k, 130k]
Baseline window: last 7 days
Rule: Flag if |z| > 3
Result: 2025-01-14 has z ≈ +32 (baseline mean ≈ 99.7k, sample std ≈ 0.95k) → anomaly
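Reproducing this example with the counts from the table above (the exact z depends slightly on whether you use the sample or population standard deviation; this sketch uses the sample std):

```python
from statistics import mean, stdev

counts = [100_000, 101_000, 99_000, 100_000, 100_000, 102_000, 99_000,
          100_000, 101_000, 100_000, 100_000, 98_000, 100_000, 130_000]

window = counts[-8:-1]  # the 7 days before 2025-01-14
z = (counts[-1] - mean(window)) / stdev(window)
```

With a deviation of ~30k against a std of under 1k, z lands far above any reasonable threshold.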
Example 2: Null rate spike in a column
Recent null_rate over 10 days ~ 0.2% to 0.6%
Today null_rate = 4.0%
Rule: Flag if today's null_rate > median(last 10) + 6*MAD
Median ≈ 0.4%, MAD small → 4.0% easily exceeds the threshold → anomaly
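A sketch of this check, using illustrative daily null rates in the 0.2%–0.6% range (the specific values are made up to match the example):

```python
from statistics import median

# Illustrative null rates for the last 10 days, as fractions
rates = [0.002, 0.004, 0.003, 0.005, 0.004, 0.006, 0.002, 0.004, 0.003, 0.005]
today = 0.040  # today's null rate: 4.0%

med = median(rates)
mad = median(abs(r - med) for r in rates)
threshold = med + 6 * mad
anomaly = today > threshold
```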
Example 3: Transaction amount outliers (MAD)
Amounts (USD): [12, 14, 11, 15, 13, 12, 13, 14, 11, 1000]
Median ≈ 13
MAD = median(|x - 13|) = 1
Robust z ≈ |x - 13| / (1.4826 * 1)
For 1000: 987 / (1.4826 * 1) ≈ 665.7 → anomaly
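The same computation in code, reproducing the median, MAD, and robust z from the example:

```python
from statistics import median

amounts = [12, 14, 11, 15, 13, 12, 13, 14, 11, 1000]

med = median(amounts)                        # 13
mad = median(abs(x - med) for x in amounts)  # 1
robust_z = {x: abs(x - med) / (1.4826 * mad) for x in amounts}
anomalies = [x for x in amounts if robust_z[x] > 6]
```

Note that the 1000 sitting in the history barely moves the median or the MAD, which is exactly why this detector is preferred for heavy-tailed data.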
How to set thresholds (practical)
- Collect history: at least 14–30 days of the metric you want to monitor.
- Choose a method: static, z-score, MAD, or percentile (start simple).
- Pick k: z-score k=3, MAD k=6–8, percentile P2–P98 to start.
- Control seasonality: use same weekday/hour baselines if patterns exist.
- Pilot silently: log detections for a week to measure noise.
- Enable alerts: route to a channel with clear, actionable text.
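The "pilot silently" step can be sketched as a detector wrapper that logs would-be alerts instead of paging anyone; the function and metric names here are hypothetical:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("anomaly-pilot")

def pilot_check(metric_name, value, detector, alerting_enabled=False):
    """Run a detector in silent-pilot mode: log what would have
    alerted, so noise can be measured before alerts go live."""
    if detector(value):
        msg = f"would alert: {metric_name}={value}"
        if alerting_enabled:
            log.warning(msg)  # in production, route to an alert channel
        else:
            log.info(msg)     # pilot mode: log only
        return True
    return False

fired = pilot_check("daily_row_count", 130_000, lambda v: v > 110_000)
```

After a week, count logged detections against real incidents to tune the threshold before flipping `alerting_enabled`.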
How to evaluate alerts
- Check freshness: is the partition late or missing?
- Check upstream: schema changes, source outages, or new backfills?
- Check scope: one table, many tables, or only a column?
- Decide action: rerun, hotfix transform, or temporarily suppress with a note.
Exercises
Complete these two hands-on tasks.
Exercise 1 — Rolling z-score for daily row counts
Given sample daily counts, compute a 7-day rolling mean and std (excluding today), then flag anomalies where |z| > 3.
Data (date, row_count):
D1 100000
D2 101000
D3 99000
D4 100000
D5 100000
D6 102000
D7 99000
D8 100000
D9 101000
D10 100000
D11 100000
D12 98000
D13 100000
D14 130000
- Compute z for D8–D14, using the previous 7 days (D-7..D-1) as the baseline.
- List dates flagged as anomalies.
Show solution
Baseline around 100k with std ≈ 1k. D14 at 130k gives z ≈ +30 → anomaly. Others near baseline → not anomalies.
Exercise 2 — MAD-based detector for transaction amounts
Use median and MAD to flag anomalies where robust_z = |x - median| / (1.4826*MAD) > 6.
Amounts: [12, 14, 11, 15, 13, 12, 13, 14, 11, 1000]
- Compute median and MAD.
- Flag any values with robust_z > 6.
Show solution
Median=13, MAD=1. Robust z for 1000 is ~665.7 → anomaly. Others <= ~1.3 → normal.
Checklist: before you deploy
- Metric has at least 14 days of history.
- Seasonality accounted for (weekday/hour).
- Thresholds chosen and piloted silently for a week.
- Alert includes table/column, time window, metric value, baseline, suggested action.
- False-positive handling agreed (temporary suppression rules).
Common mistakes and self-check
- Using mean/std on skewed data → switch to median/MAD.
- Ignoring seasonality → compare to same weekday/hour baseline.
- Alerting on a single noisy point → require persistence (e.g., 2 consecutive anomalies).
- Mixing incident types → separate freshness, volume, and column-quality alerts.
- No action plan → include runbook steps in alert message.
Practical projects you can ship
- Table freshness monitor: alert if the latest partition is older than expected.
- Daily volume guard: z-score anomaly for row_count per table.
- Column null-rate watcher: MAD-based check per critical column.
- Seasonal baseline: weekday-aware detector for sessions_per_day.
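As a starting point for the freshness monitor, here is a minimal sketch; how you obtain the latest partition timestamp depends on your warehouse, so that part is assumed:

```python
from datetime import datetime, timedelta, timezone

def is_stale(latest_partition, max_age, now=None):
    """Freshness check: flag if the newest partition is older
    than the expected landing interval."""
    now = now or datetime.now(timezone.utc)
    return now - latest_partition > max_age

# Illustrative: a daily table expected within 26 hours of the last load
now = datetime(2025, 1, 15, 9, 0, tzinfo=timezone.utc)
stale = is_stale(datetime(2025, 1, 13, 9, 0, tzinfo=timezone.utc),
                 timedelta(hours=26), now=now)
```

Padding the expected interval (26h rather than 24h) absorbs normal scheduling jitter before alerting.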
Learning path
- Start with static thresholds for critical must-not-break rules.
- Add rolling z-score on stable metrics without heavy skew.
- Upgrade to MAD for skewed or heavy-tailed metrics.
- Incorporate weekday/hour baselines to reduce noise.
- Introduce alert persistence and severity levels.
Next steps
- Implement one detector per category: freshness, volume, null-rate.
- Pilot silently for 7 days, measure precision/recall (manual review).
- Turn on alerts with clear actions and owners.
Mini challenge
Pick a metric with weekly seasonality (e.g., orders_per_day). Build a baseline using the last 4 same-weekday values and a MAD threshold. Require 2 consecutive anomalies to trigger. Write down the alert text you would send to your team.