Who this is for
- Data Platform Engineers who operate batch or streaming pipelines.
- Data Engineers adding reliability guardrails to ingestion, transformation, and serving layers.
Prerequisites
- [ ] Basic statistics: mean/median, standard deviation, percentiles.
- [ ] Comfort with SQL and one scripting language (e.g., Python) to compute metrics.
- [ ] Familiarity with your pipelines’ SLAs/SLOs (freshness, latency, data volume).
- [ ] Access to historical pipeline run metadata (counts, durations, error rates).
Learning path
- Map your pipeline: stages, inputs/outputs, SLOs.
- Choose signals: volume, freshness, schema, distribution, duplicates, business KPIs.
- Pick detection methods: thresholds, seasonal baselines, robust z-score, change-points.
- Design alert rules: severity, noise reduction, escalation, runbook links.
- Iterate with feedback: tune baselines, suppress expected spikes, improve context.
Why this matters
In real teams, you will be asked to:
- Detect sudden drops in ingestion volume before dashboards go blank.
- Catch schema drift (e.g., new column type) before jobs fail.
- Flag latency regressions that threaten freshness SLOs.
- Spot distribution shifts (e.g., null-rate or categorical mix changes) that break models.
- Reduce alert fatigue by suppressing known seasonal spikes and correlating related alerts.
Concept explained simply
Anomaly detection answers: “Is today’s behavior too different from what we usually see?” You define a baseline from historical behavior, measure current signals, compare, and alert when deviation crosses a threshold. Good systems also add context, suppress noise, and learn from feedback.
Mental model
- Observe: Collect signals per run/window (volume, freshness, etc.).
- Baseline: Estimate “normal” (seasonal medians, quantiles, EWMA).
- Detect: Compare now vs baseline (z-score, percent deviation, change-point).
- Decide: Apply rules (severity, consecutive breaches, cool-down).
- Act: Alert with context and a short runbook. Record feedback.
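A minimal end-to-end sketch of this loop in Python, using a simple percent-deviation check (the data, function names, and 25% threshold below are illustrative, not a prescribed API):

```python
def observe(run_metadata):
    """Collect one signal per run/window; here, rows loaded."""
    return run_metadata["rows"]

def baseline(history):
    """Estimate 'normal' from recent comparable runs (a simple mean here)."""
    return sum(history) / len(history)

def detect(current, normal):
    """Compare now vs baseline as a percent deviation."""
    return (current - normal) / normal

def decide(deviation, threshold=0.25):
    """Apply a rule; real systems add severity and consecutive-breach checks."""
    return abs(deviation) > threshold

def act(is_anomaly, deviation):
    """Alert with context and a runbook link; record feedback afterwards."""
    if is_anomaly:
        print(f"ALERT: volume deviates {deviation:+.0%} from baseline")

history = [98, 102, 100, 104]              # rows (millions) from recent runs
deviation = detect(observe({"rows": 71}), baseline(history))
act(decide(deviation), deviation)          # prints an alert for a ~30% drop
```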
Core signals to monitor
- Volume: rows/files/bytes per run/window; new vs late-arriving rows.
- Freshness: max event_time, lag to now, staleness vs SLO.
- Latency: job duration, end-to-end time, queue time.
- Schema: column add/drop/type change; incompatible encodings.
- Quality: null/blank rate, duplicates, out-of-range values, referential integrity.
- Distribution: numeric percentiles, categorical proportions, drift metrics (e.g., PSI).
- Failures: error rate, retry count, dead-letter queue size.
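As a concrete illustration, here is a short Python sketch that computes a few of these signals for one batch (the record layout and field names are assumptions for the example):

```python
from datetime import datetime, timezone

# Hypothetical batch of ingested records; field names are illustrative.
batch = [
    {"id": 1, "email": "a@example.com", "event_time": "2024-05-06T10:00:00+00:00"},
    {"id": 2, "email": None,            "event_time": "2024-05-06T10:05:00+00:00"},
    {"id": 2, "email": "b@example.com", "event_time": "2024-05-06T10:07:00+00:00"},
]

now = datetime.now(timezone.utc)
max_event_time = max(datetime.fromisoformat(r["event_time"]) for r in batch)

signals = {
    "volume_rows": len(batch),                                               # volume
    "freshness_lag_s": (now - max_event_time).total_seconds(),               # freshness vs SLO
    "null_rate_email": sum(r["email"] is None for r in batch) / len(batch),  # quality
    "duplicate_ids": len(batch) - len({r["id"] for r in batch}),             # duplicates
}
print(signals)
```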
Detection methods (when to use what)
- Static thresholds: quick guardrails for hard SLOs (e.g., freshness < 2h).
- Seasonal baselines: day-of-week or hour-of-day medians/quantiles for periodic patterns.
- Robust z-score: use median and MAD for skewed metrics. Formula:
robust_z = 0.6745 * (x - median) / MAD
- EWMA/EMA: smooth noisy metrics and detect gradual drifts.
- Change-point detection: detect regime shifts (e.g., sudden mean change).
- Drift metrics: PSI/JS divergence to track distribution shifts in features.
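Two of these methods sketched in Python: the robust z-score from the formula above and an EWMA smoother (data and thresholds are illustrative):

```python
from statistics import median

def robust_z(x, history):
    """Robust z-score: 0.6745 * (x - median) / MAD."""
    med = median(history)
    mad = median(abs(v - med) for v in history)
    if mad == 0:
        return 0.0  # flat history; fall back to a static threshold instead
    return 0.6745 * (x - med) / mad

def ewma(values, alpha=0.3):
    """Exponentially weighted moving average; higher alpha reacts faster."""
    smoothed = values[0]
    for v in values[1:]:
        smoothed = alpha * v + (1 - alpha) * smoothed
    return smoothed

# Seasonal variant: build the history from same-weekday (or same-hour) runs only.
print(robust_z(70, [98, 102, 100, 104]))   # large negative score -> likely anomaly
print(ewma([30, 32, 29, 31, 48]))          # smoothed value drifts toward the spike
```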
Noise reduction patterns
- Consecutive breaches: alert only after N consecutive anomalies.
- Hysteresis: different enter/exit thresholds to avoid flapping.
- Cool-down windows: pause alerts after acknowledgement.
- Correlation: group related alerts from the same run or dataset.
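A small sketch of two of these patterns, consecutive breaches and hysteresis (N and the enter/exit thresholds are illustrative):

```python
def breached_n_consecutive(flags, n=3):
    """Alert only if the last n windows were all anomalous."""
    return len(flags) >= n and all(flags[-n:])

def hysteresis(score, alerting, enter=3.5, exit_=2.0):
    """Enter the alerting state above `enter`; leave it only below `exit_`."""
    if not alerting and abs(score) > enter:
        return True
    if alerting and abs(score) < exit_:
        return False
    return alerting

print(breached_n_consecutive([False, True, True]))   # False: single-point spike ignored
print(breached_n_consecutive([True, True, True]))    # True: sustained anomaly

state = False
for score in [1.0, 4.2, 3.0, 2.4, 1.5]:
    state = hysteresis(score, state)
    print(score, state)   # stays alerting until the score drops below exit_=2.0
```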
Worked examples
Example 1: Volume with weekday seasonality
Scenario: Daily ingestion has weekly patterns. Build a per-weekday baseline using the last 4 weeks.
- For each weekday, compute median count and MAD from the last 4 same-weekday runs.
- Compute robust z for today. If |z| > 3.5, mark anomaly.
- Severity: moderate if |z| is 3.5–5; high if |z| > 5 or the absolute drop exceeds 25%.
Small numeric demo
- Last 4 Mondays (rows in millions): 10.1, 10.3, 9.9, 10.2
- Median = 10.15; absolute deviations: 0.05, 0.15, 0.25, 0.05, so MAD = 0.10
- Today = 7.9, so robust_z = 0.6745 * (7.9 - 10.15) / 0.10 ≈ -15.2 (high anomaly)
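The same arithmetic in a few lines of Python, to make the demo reproducible:

```python
from statistics import median

mondays = [10.1, 10.3, 9.9, 10.2]            # rows in millions, last 4 Mondays
med = median(mondays)                        # 10.15
mad = median(abs(x - med) for x in mondays)  # 0.10
z = 0.6745 * (7.9 - med) / mad               # about -15.2 -> high anomaly
print(round(med, 2), round(mad, 2), round(z, 1))
```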
Example 2: Null rate spike
Scenario: email column null_rate jumps from typical 1–2% to 15%.
- Baseline: rolling day-of-week 90th percentile null_rate.
- Rule: alert if today > baseline + 5 percentage points AND > 2x baseline.
- Auto-silence if upstream outage already acknowledged for the same window.
Reasoning
Percentile baseline handles skew; dual-threshold reduces noise from tiny increases.
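The dual-threshold rule and the auto-silence check can be expressed directly in Python (the parameter names and the acknowledgement flag are illustrative):

```python
def null_rate_alert(today, baseline_p90, min_jump_pp=0.05, min_ratio=2.0,
                    upstream_outage_acked=False):
    """Alert only if today's null rate exceeds the seasonal p90 baseline by both
    an absolute jump (in percentage points) and a relative ratio, and the window
    is not already covered by an acknowledged upstream outage."""
    if upstream_outage_acked:
        return False  # auto-silence
    return today > baseline_p90 + min_jump_pp and today > min_ratio * baseline_p90

print(null_rate_alert(today=0.15, baseline_p90=0.02))   # True: 15% vs a ~2% baseline
print(null_rate_alert(today=0.03, baseline_p90=0.02))   # False: tiny increase, no alert
```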
Example 3: Latency regression
Scenario: The job normally completes in 25–35 minutes; today it is at 55 minutes and rising.
- Compute EWMA of duration with alpha=0.3.
- Alert if duration > EWMA + 3 * rolling MAD, OR if the absolute duration exceeds 45 minutes.
- Severity: high if an SLO breach is likely within the next run window.
Tip
Always keep a hard cap for SLOs in addition to statistical rules.
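A sketch of the EWMA-plus-hard-cap rule from this example in Python (the window contents and the 45-minute cap are illustrative):

```python
from statistics import median

def latency_alert(durations_min, hard_cap_min=45, alpha=0.3, k=3):
    """Flag the latest duration if it exceeds EWMA + k * rolling MAD of the
    previous runs, or if it breaches the hard SLO cap regardless of statistics."""
    history, latest = durations_min[:-1], durations_min[-1]
    smoothed = history[0]
    for d in history[1:]:
        smoothed = alpha * d + (1 - alpha) * smoothed
    med = median(history)
    mad = median(abs(d - med) for d in history)
    return latest > smoothed + k * mad or latest > hard_cap_min

print(latency_alert([30, 28, 33, 31, 29, 55]))   # True: 55 min breaches both rules
print(latency_alert([30, 28, 33, 31, 29, 32]))   # False: within normal variation
```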
Implementation checklist
- [ ] List critical datasets and their SLOs.
- [ ] Define per-dataset signals and owners.
- [ ] Choose baselines per signal (static, seasonal, robust).
- [ ] Set noise controls: consecutive breaches, cool-down, hysteresis.
- [ ] Attach context to alerts: last normal value, chart, upstream run IDs.
- [ ] Create short runbooks for top 5 failure modes.
- [ ] Review weekly: tune thresholds and retire noisy alerts.
Common mistakes and self-check
- Mistake: Using mean/stdev on skewed metrics. Self-check: Compare median/MAD vs mean/stdev; pick the one with fewer false alarms.
- Mistake: Ignoring seasonality. Self-check: Plot metric by weekday/hour; if patterns exist, switch to seasonal baselines.
- Mistake: Alerting on single-point spikes. Self-check: Require 2–3 consecutive breaches for non-critical signals.
- Mistake: No root-cause context. Self-check: Include upstream job status and recent schema changes in the alert.
- Mistake: One-size-fits-all thresholds. Self-check: Calibrate per dataset and severity.
Practical projects
- Project A: Add anomaly detection to one high-impact dataset. Deliver: signal list, baselines, alert rules, and a runbook. Success: meaningful alert fired in dry-run without excessive noise.
- Project B: Build a weekly anomaly review. Deliver: dashboard of anomalies, suppression reasons, and tuning actions. Success: 30–50% reduction in noisy alerts in 2 weeks.
Exercises
Complete Exercises 1–2 and verify with the solutions. Use this checklist while working:
- [ ] Identify the right baseline for the signal.
- [ ] Compute the detection score (z-score, percent deviation, etc.).
- [ ] Decide on thresholds and severity.
- [ ] Add at least one noise-reduction rule.
Exercise 1 — Robust z-score on volume
Compute robust z-score for today’s volume vs a weekday baseline and decide if it is an anomaly.
Exercise 2 — Null-rate alert rule
Design an alerting rule for a null rate with weekly seasonality and propose suppression logic.
Mini challenge
Your streaming topic suddenly shows a 5x spike in message rate between 18:00 and 19:00 every day for a week. Propose a detection and alerting plan that avoids paging every evening but still catches unexpected changes. Include: baseline choice, detection rule, and noise controls.
Next steps
- Implement one method per signal (volume, freshness, latency) this week.
- Add context to alerts and create concise runbooks.
- Take the quick test to check your understanding.
Quick Test