Why this matters
Anomalies in data can signal pipeline failures, late arrivals, schema drifts, spikes in nulls, or business events. As an ETL Developer, you will:
- Guard SLAs by catching missing or delayed batches before downstream jobs run.
- Protect dashboards and models from bad inputs (e.g., sudden null spikes or duplicates).
- Detect data drift and unexpected behavior after code or source changes.
- Reduce firefighting by turning unknown surprises into clear, actionable alerts.
Who this is for
- ETL/ELT developers implementing data quality checks.
- Data engineers adding monitoring to pipelines.
- Analysts who own critical metrics and need guardrails.
Prerequisites
- Basic SQL (SELECT, GROUP BY, window functions helpful but not required).
- Comfort with simple statistics (mean, median, standard deviation, percentiles).
- Understanding of your pipeline schedules and SLAs.
Concept explained simply
An anomaly is a data point or period that deviates from what you normally expect. You define “normal” with simple rules or a baseline from recent history, then flag anything too far away.
Mental model
- Baseline: What did this metric look like recently? (e.g., last 14 days)
- Distance: How far is today from that baseline? (z-score, MAD, or % change)
- Decision: Is the distance bigger than a threshold? If yes, alert.
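The baseline/distance/decision loop above can be sketched in a few lines of Python (the window values, the new observation, and the threshold here are illustrative):

```python
from statistics import mean, stdev

def is_anomaly(history, today, k=3.0):
    """Flag `today` if it lies more than k standard deviations
    from the mean of `history` (the recent baseline window)."""
    baseline = mean(history)
    spread = stdev(history)  # sample std of the baseline window
    z = (today - baseline) / spread
    return abs(z) > k, z

# Illustrative baseline window and one new observation
flagged, z = is_anomaly([100, 101, 99, 100, 102, 99, 100], 130)
```

Swap in a different distance (MAD, percent change) without changing the decision step.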
Core techniques you can apply today
1) Static rules
- Example: row_count < 1000 or null_rate > 5%.
- Use for hard expectations (e.g., schema must not change).
2) Rolling mean + z-score
- Compute rolling mean and std over recent N days, flag if |(value - mean)/std| > k.
- Good for roughly stable metrics without strong outliers.
3) Median + MAD (robust)
- MAD = median(|x - median(x)|). Flag if |x - median| / (1.4826 * MAD) > k.
- More robust when data is skewed or has outliers.
4) Percentile/IQR rules
- Flag if value is outside [P5, P95] or beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR.
- Simple and interpretable without assuming normal distribution.
5) Seasonal baselines
- Compare Monday vs previous Mondays, hour-of-day vs same hour last week.
- Controls for weekly/hourly patterns to reduce false positives.
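The z-score, MAD, and IQR techniques above can be sketched side by side. This is an illustrative comparison using only the standard library, not production code:

```python
from statistics import mean, stdev, median, quantiles

def zscore_flag(history, x, k=3.0):
    """Rolling mean + z-score: sensitive, assumes roughly stable data."""
    return abs(x - mean(history)) / stdev(history) > k

def mad_flag(history, x, k=6.0):
    """Median + MAD: robust to skew and existing outliers in history."""
    med = median(history)
    mad = median(abs(v - med) for v in history)
    return abs(x - med) / (1.4826 * mad) > k

def iqr_flag(history, x, factor=1.5):
    """IQR fences: interpretable, no normality assumption."""
    q1, _, q3 = quantiles(history, n=4)
    iqr = q3 - q1
    return x < q1 - factor * iqr or x > q3 + factor * iqr
```

All three agree on an extreme value; they differ most when the history itself already contains outliers, which is where MAD and IQR hold up better.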
Worked examples
Example 1: Daily row count spike (rolling z-score)
Dates: 2025-01-01...2025-01-14
Row_count: [100k, 101k, 99k, 100k, 100k, 102k, 99k, 100k, 101k, 100k, 100k, 98k, 100k, 130k]
Baseline window: last 7 days
Rule: Flag if |z| > 3
Result: 2025-01-14 has z ≈ +32 (baseline mean ≈ 99.7k, sample std ≈ 0.95k) → anomaly
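Reproducing this example with the counts from the table above (the exact z depends slightly on whether you use the sample or population standard deviation; this sketch uses the sample std):

```python
from statistics import mean, stdev

counts = [100_000, 101_000, 99_000, 100_000, 100_000, 102_000, 99_000,
          100_000, 101_000, 100_000, 100_000, 98_000, 100_000, 130_000]

window = counts[-8:-1]  # the 7 days before 2025-01-14
z = (counts[-1] - mean(window)) / stdev(window)
```

With a deviation of ~30k against a std of under 1k, z lands far above any reasonable threshold.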
Example 2: Null rate spike in a column
Recent null_rate over 10 days ~ 0.2% to 0.6%
Today null_rate = 4.0%
Rule: Flag if today's null_rate > median(last 10) + 6*MAD
Median ≈ 0.4%, MAD small → 4.0% easily exceeds the threshold → anomaly
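A sketch of this check, using illustrative daily null rates in the 0.2%–0.6% range (the specific values are made up to match the example):

```python
from statistics import median

# Illustrative null rates for the last 10 days, as fractions
rates = [0.002, 0.004, 0.003, 0.005, 0.004, 0.006, 0.002, 0.004, 0.003, 0.005]
today = 0.040  # today's null rate: 4.0%

med = median(rates)
mad = median(abs(r - med) for r in rates)
threshold = med + 6 * mad
anomaly = today > threshold
```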
Example 3: Transaction amount outliers (MAD)
Amounts (USD): [12, 14, 11, 15, 13, 12, 13, 14, 11, 1000]
Median ≈ 13
MAD = median(|x - 13|) = 1
Robust z ≈ |x - 13| / (1.4826 * 1)
For 1000: 987 / (1.4826 * 1) ≈ 665.7 → anomaly
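The same computation in code, reproducing the median, MAD, and robust z from the example:

```python
from statistics import median

amounts = [12, 14, 11, 15, 13, 12, 13, 14, 11, 1000]

med = median(amounts)                        # 13
mad = median(abs(x - med) for x in amounts)  # 1
robust_z = {x: abs(x - med) / (1.4826 * mad) for x in amounts}
anomalies = [x for x in amounts if robust_z[x] > 6]
```

Note that the 1000 sitting in the history barely moves the median or the MAD, which is exactly why this detector is preferred for heavy-tailed data.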
How to set thresholds (practical)
- Collect history: at least 14–30 days of the metric you want to monitor.
- Choose a method: static, z-score, MAD, or percentile (start simple).
- Pick k: z-score k=3, MAD k=6–8, percentile P2–P98 to start.
- Control seasonality: use same weekday/hour baselines if patterns exist.
- Pilot silently: log detections for a week to measure noise.
- Enable alerts: route to a channel with clear, actionable text.
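The "pilot silently" step can be sketched as a detector wrapper that logs would-be alerts instead of paging anyone; the function and metric names here are hypothetical:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("anomaly-pilot")

def pilot_check(metric_name, value, detector, alerting_enabled=False):
    """Run a detector in silent-pilot mode: log what would have
    alerted, so noise can be measured before alerts go live."""
    if detector(value):
        msg = f"would alert: {metric_name}={value}"
        if alerting_enabled:
            log.warning(msg)  # in production, route to an alert channel
        else:
            log.info(msg)     # pilot mode: log only
        return True
    return False

fired = pilot_check("daily_row_count", 130_000, lambda v: v > 110_000)
```

After a week, count logged detections against real incidents to tune the threshold before flipping `alerting_enabled`.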
How to evaluate alerts
- Check freshness: is the partition late or missing?
- Check upstream: schema changes, source outages, or new backfills?
- Check scope: one table, many tables, or only a column?
- Decide action: rerun, hotfix transform, or temporarily suppress with a note.
Exercises
Complete these two hands-on tasks.
Exercise 1 — Rolling z-score for daily row counts
Given sample daily counts, compute a 7-day rolling mean and std (excluding today), then flag anomalies where |z| > 3.
Data (date, row_count):
D1 100000
D2 101000
D3 99000
D4 100000
D5 100000
D6 102000
D7 99000
D8 100000
D9 101000
D10 100000
D11 100000
D12 98000
D13 100000
D14 130000
- Compute z for D8–D14, using the previous 7 days (D-7..D-1) as the baseline.
- List dates flagged as anomalies.
Show solution
Baseline around 100k with std ≈ 1k. D14 at 130k gives z ≈ +30 → anomaly. Others near baseline → not anomalies.
Exercise 2 — MAD-based detector for transaction amounts
Use median and MAD to flag anomalies where robust_z = |x - median| / (1.4826*MAD) > 6.
Amounts: [12, 14, 11, 15, 13, 12, 13, 14, 11, 1000]
- Compute median and MAD.
- Flag any values with robust_z > 6.
Show solution
Median=13, MAD=1. Robust z for 1000 is ~665.7 → anomaly. Others <= ~1.3 → normal.
Checklist: before you deploy
- Metric has at least 14 days of history.
- Seasonality accounted for (weekday/hour).
- Thresholds chosen and piloted silently for a week.
- Alert includes table/column, time window, metric value, baseline, suggested action.
- False-positive handling agreed (temporary suppression rules).
Common mistakes and self-check
- Using mean/std on skewed data → switch to median/MAD.
- Ignoring seasonality → compare to same weekday/hour baseline.
- Alerting on a single noisy point → require persistence (e.g., 2 consecutive anomalies).
- Mixing incident types → separate freshness, volume, and column-quality alerts.
- No action plan → include runbook steps in alert message.
Practical projects you can ship
- Table freshness monitor: alert if the latest partition is older than expected.
- Daily volume guard: z-score anomaly for row_count per table.
- Column null-rate watcher: MAD-based check per critical column.
- Seasonal baseline: weekday-aware detector for sessions_per_day.
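As a starting point for the freshness monitor, here is a minimal sketch; how you obtain the latest partition timestamp depends on your warehouse, so that part is assumed:

```python
from datetime import datetime, timedelta, timezone

def is_stale(latest_partition, max_age, now=None):
    """Freshness check: flag if the newest partition is older
    than the expected landing interval."""
    now = now or datetime.now(timezone.utc)
    return now - latest_partition > max_age

# Illustrative: a daily table expected within 26 hours of the last load
now = datetime(2025, 1, 15, 9, 0, tzinfo=timezone.utc)
stale = is_stale(datetime(2025, 1, 13, 9, 0, tzinfo=timezone.utc),
                 timedelta(hours=26), now=now)
```

Padding the expected interval (26h rather than 24h) absorbs normal scheduling jitter before alerting.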
Learning path
- Start with static thresholds for critical must-not-break rules.
- Add rolling z-score on stable metrics without heavy skew.
- Upgrade to MAD for skewed or heavy-tailed metrics.
- Incorporate weekday/hour baselines to reduce noise.
- Introduce alert persistence and severity levels.
Next steps
- Implement one detector per category: freshness, volume, null-rate.
- Pilot silently for 7 days, measure precision/recall (manual review).
- Turn on alerts with clear actions and owners.
Mini challenge
Pick a metric with weekly seasonality (e.g., orders_per_day). Build a baseline using the last 4 same-weekday values and a MAD threshold. Require 2 consecutive anomalies to trigger. Write down the alert text you would send to your team.