Why this matters
As a Data Architect, you set standards that keep data trustworthy. An anomaly detection strategy lets teams catch issues early (sudden drops in orders, late-arriving batches, schema drift, spikes in nulls) so that dashboards and ML models don't make bad decisions.
- Real tasks: define baselines and thresholds; choose metrics (volume, freshness, null rate, distribution); set alert severity and runbooks; design seasonality-aware detection; plan rollout and shadow mode.
- Outcomes: fewer false positives, faster incident triage, and measurable data reliability.
Who this is for
- Data Architects and Senior Data Engineers who design platform standards.
- Analytics Engineers owning data quality SLIs/SLOs.
- Platform owners adding observability to pipelines and lakes/warehouses.
Prerequisites
- Comfort with SQL and basic statistics (mean/median, variance, percentiles).
- Understanding of your data platform's scheduling, partitioning, and lineage.
- Knowledge of core data quality dimensions (freshness, volume, completeness, schema).
Concept explained simply
An anomaly is a data behavior that's unlikely under "normal" conditions (e.g., order count is far below typical Tuesdays). Detection compares current metrics to a learned baseline and flags outliers.
Mental model
Think of a "thermostat" for data: you pick what to monitor (metrics), learn normal ranges (baselines), decide what's urgent (severity), and define what happens when the reading goes out of range (alerts and runbooks). Keep it stable by accounting for seasons (weekends, month-end), noise (small fluctuations), and late arrivals.
Strategy blueprint
- Choose metrics per asset: volume, row count, distinct keys, null rate, freshness, outliers in numeric fields, schema changes.
- Pick baseline type: static thresholds (quick), rolling statistics (median/MAD, EWMA), seasonal baselines (per weekday), or ML-based (only when needed).
- Define alert policy: severity levels (P1–P3), gating (minimum volume, minimum partitions affected), de-duplication, cooldowns, and routing.
- Handle data realities: partition-level checks, seasonality/holidays, late data windows, backfills, and replays.
- Roll out safely: shadow mode (no alerts) to learn baseline, then gradual enablement with tighter thresholds over time.
- Create runbooks: steps to validate anomaly, known false-positive patterns, owners, and rollback/patch strategies.
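As a sketch, the blueprint can be captured as a declarative check definition per asset. The field names, values, and table below are illustrative assumptions, not any particular tool's schema:

```python
# Minimal sketch of a per-asset anomaly-check definition.
# Every name and value here is illustrative, not a specific tool's schema.
order_checks = {
    "asset": "warehouse.sales.orders",           # hypothetical table under observation
    "metrics": ["row_count", "freshness_minutes", "pct_null:email"],
    "baseline": {
        "type": "rolling_median_mad",            # robust to single-day outliers
        "window_days": 28,
        "seasonality": "per_weekday",            # compare Tuesdays to prior Tuesdays
        "sigma_multiplier": 3.0,
    },
    "alert_policy": {
        "severity_map": {"row_count": "P1", "pct_null:email": "P2"},
        "gating": {"min_rows": 10_000, "min_partitions_affected": 1},
        "cooldown_minutes": 30,
        "dedupe_key": ["asset", "metric", "partition_date"],
    },
    "late_data": {"grace_minutes": 30, "suppress_during_backfill": True},
    "rollout": {"mode": "shadow"},               # no paging while learning the baseline
}
```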
Baseline options: quick reference
- Static threshold: simple, good for SLAs like freshness by 02:00.
- Rolling median + MAD: robust to outliers for counts and rates.
- EWMA/control charts: smooths noise; good for freshness and latency.
- Seasonal baselines: separate normal ranges by weekday/month-end.
- ML (e.g., isolation forest): for multivariate signals after simpler methods work.
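Seasonal baselines are the option above that most benefits from a structural example. A minimal sketch, assuming daily counts keyed by date, that groups history by weekday and computes a robust band per group (the history values are invented for illustration):

```python
from collections import defaultdict
from datetime import date
from statistics import median

# Per-weekday (seasonal) baseline: group history by weekday, then compute a
# robust median/MAD band for each weekday separately. Values are illustrative.
history: dict[date, int] = {
    date(2024, 6, 3): 995, date(2024, 6, 4): 1010, date(2024, 6, 5): 990,
    date(2024, 6, 10): 1005, date(2024, 6, 11): 1002, date(2024, 6, 12): 980,
}

by_weekday: dict[int, list[int]] = defaultdict(list)
for d, count in history.items():
    by_weekday[d.weekday()].append(count)

def seasonal_bounds(weekday: int, sigma: float = 3.0) -> tuple[float, float]:
    """3-sigma robust band computed only from the same weekday's history."""
    values = by_weekday[weekday]
    center = median(values)
    mad = median(abs(v - center) for v in values)
    spread = sigma * 1.4826 * max(mad, 1.0)   # floor MAD so a short, flat history still yields a band
    return center - spread, center + spread

print(seasonal_bounds(date(2024, 6, 18).weekday()))  # bounds for a Tuesday
```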
Worked examples
1) Daily order count with weekday seasonality
Metric: daily row_count for orders table. Last 7 days: 980, 1010, 995, 1005, 990, 1002, 740.
- Baseline: rolling median = 995; MAD = median(|x-995|) = 10; robust sigma ~ 1.4826*MAD = 14.826. (The window includes today's 740, but a single outlier barely shifts the median and MAD; that robustness is the point.)
- Thresholds (3 sigma): 995 ± 3*14.826 → [950.5, 1039.5].
- Today: 740 < 950.5 → anomaly.
- Severity: P1 if also revenue-impacting table and >20% drop; else P2.
Seasonality tip: compare Tuesdays to prior Tuesdays, not global median, once you have enough history.
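A minimal sketch of the calculation above using only the Python standard library; the numbers mirror the example, and the assumption that the 7-day window includes today's value is carried over:

```python
from statistics import median

# Daily row counts for the last 7 days (today's 740 included), from the example above.
counts = [980, 1010, 995, 1005, 990, 1002, 740]
today = counts[-1]

baseline = median(counts)                          # 995
mad = median(abs(x - baseline) for x in counts)    # 10
robust_sigma = 1.4826 * mad                        # ~14.8

lower = baseline - 3 * robust_sigma                # ~950.5
upper = baseline + 3 * robust_sigma                # ~1039.5

is_anomaly = not (lower <= today <= upper)
drop_pct = (baseline - today) / baseline           # ~25.6% drop

# Assumes a revenue-impacting table, per the severity rule above.
severity = "P1" if is_anomaly and drop_pct > 0.20 else ("P2" if is_anomaly else None)
print(f"bounds=[{lower:.1f}, {upper:.1f}] today={today} anomaly={is_anomaly} severity={severity}")
```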
2) Freshness SLA for daily batch
Metric: data_time_lag (now - max(event_time)) for table users_snapshot. SLA: data ready by 02:00 daily.
- Rule: P1 if lag > 2 hours after 02:15; P2 if lag > 1 hour after 02:00.
- Late data window: allow 30 minutes grace because upstream is variable.
- Trend: use an EWMA of arrival time to auto-adjust thresholds if the process shifts gradually (investigate if the trend worsens for 3 days).
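A sketch of how this could look in code. It is a simplified take: rather than auto-adjusting thresholds, it flags when the smoothed arrival time creeps toward the 02:00 deadline. The helper names and the smoothing factor are assumptions:

```python
from datetime import datetime

# Freshness check sketch for users_snapshot; helper names are assumptions.
# Rule from the example: P1 if lag > 2 hours after 02:15, P2 if lag > 1 hour after 02:00,
# with a 30-minute grace window because upstream timing is variable.

def freshness_severity(lag_minutes: float, now: datetime) -> str | None:
    """Map the current data lag (now - max(event_time), in minutes) to a severity."""
    after_0215 = now.time() >= datetime.strptime("02:15", "%H:%M").time()
    after_0200 = now.time() >= datetime.strptime("02:00", "%H:%M").time()
    if after_0215 and lag_minutes > 120:
        return "P1"
    if after_0200 and lag_minutes > 60:
        return "P2"
    return None

def ewma(values: list[float], alpha: float = 0.3) -> float:
    """Exponentially weighted moving average of recent arrival times
    (minutes after midnight), used to spot a gradual drift in the batch process."""
    smoothed = values[0]
    for v in values[1:]:
        smoothed = alpha * v + (1 - alpha) * smoothed
    return smoothed

# Illustrative arrival times drifting from 01:20 toward 01:50; if the smoothed arrival
# creeps within 30 minutes of the 02:00 deadline, investigate before the SLA breaches.
arrival_minutes = [80, 82, 85, 95, 110]
if ewma(arrival_minutes) > 120 - 30:
    print("Arrival trend is approaching the 02:00 SLA; investigate the upstream job.")
```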
3) Null rate spike on critical column
Metric: pct_null(email) by partition_date for marketing_events.
- Gating: only alert if row_count >= 10,000 (avoid noise on tiny partitions).
- Baseline: rolling median null rate with 3*MAD bounds.
- If null rate jumps from 0.5% to 12%, and volume is normal, raise P1 and include recent schema changes in alert context.
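A sketch of the gated null-rate check; the history values, helper name, and the 10x-baseline rule for choosing P1 over P2 are illustrative assumptions:

```python
from statistics import median

def null_rate_alert(history_pct_null: list[float], today_pct_null: float,
                    today_row_count: int, min_rows: int = 10_000) -> str | None:
    """Return a severity if today's null rate (in percent) is anomalous, else None.
    Gating: skip tiny partitions so noise on small volumes never pages anyone."""
    if today_row_count < min_rows:
        return None                                   # gated: not enough rows to judge
    baseline = median(history_pct_null)
    mad = median(abs(x - baseline) for x in history_pct_null)
    upper = baseline + 3 * 1.4826 * max(mad, 0.001)   # floor MAD so a flat history still has a band
    if today_pct_null > upper:
        # In the alert payload, include recent schema changes for context.
        return "P1" if today_pct_null > 10 * max(baseline, 0.001) else "P2"
    return None

# Example from the text: null rate jumps from ~0.5% to 12% on a normal-volume day.
print(null_rate_alert([0.4, 0.5, 0.6, 0.5, 0.5, 0.4, 0.6], 12.0, 250_000))  # -> "P1"
```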
Exercises (hands-on)
Work through these here, then check the solutions.
Exercise 1: Build a robust threshold for daily counts
Data (last 7 days counts): 980, 1010, 995, 1005, 990, 1002, 740.
- Step 1: Compute the rolling median.
- Step 2: Compute MAD and robust sigma (~1.4826*MAD).
- Step 3: Build 3-sigma bounds and decide if 740 is an anomaly.
- Step 4: Propose a severity mapping.
Exercise 2: Design an alert policy for freshness
Context: Table sales_daily expected by 02:00. Typical arrival 01:20 ± 15 min.
- Define: SLA, alert levels, gating, cooldown, and deduplication.
- Include: runbook first checks (upstream status, recent deploys).
Self-checklist
- I selected metrics aligned to data contract/business impact.
- Baselines account for seasonality or use robust statistics.
- I set severity levels and gating to reduce alert noise.
- I included late data windows and backfill handling.
- I drafted a clear runbook for responders.
Common mistakes and how to self-check
Using global mean/standard deviation on skewed data
Fix: prefer median/MAD or percentiles; add seasonal baselines.
No gating on tiny partitions
Fix: require minimum rows/partitions before alerting.
Ignoring late-arriving events
Fix: incorporate allowed_lateness windows/watermarks into freshness logic.
Alert fatigue from duplicate pages
Fix: dedupe alerts per asset and dimension; add cooldowns and escalate only on persistence (a sketch follows at the end of this section).
Deploying with alerts enabled on day 1
Fix: shadow mode for 1–2 weeks to learn the baseline; then gradually enable.
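A minimal sketch of the dedup-plus-cooldown fix above; the in-memory state and key fields are assumptions, and a real system would persist this state and handle routing:

```python
from datetime import datetime, timedelta

# Alert de-duplication with a cooldown window and escalation on persistence.
# State is held in memory for illustration only.
COOLDOWN = timedelta(minutes=30)
_last_fired: dict[tuple[str, str, str], datetime] = {}
_consecutive: dict[tuple[str, str, str], int] = {}

def should_page(asset: str, metric: str, partition: str, now: datetime) -> bool:
    """Dedupe by (asset, metric, partition); suppress repeats within the cooldown,
    and page only if the anomaly persists across two consecutive checks."""
    key = (asset, metric, partition)
    _consecutive[key] = _consecutive.get(key, 0) + 1
    last = _last_fired.get(key)
    if last is not None and now - last < COOLDOWN:
        return False                       # still inside cooldown: suppress duplicate page
    if _consecutive[key] < 2:
        return False                       # first hit: wait for persistence before paging
    _last_fired[key] = now
    return True

def mark_healthy(asset: str, metric: str, partition: str) -> None:
    """Reset the persistence counter once the metric returns to normal."""
    _consecutive.pop((asset, metric, partition), None)
```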
Practical projects
- Project 1: Instrument 3 critical tables with row_count, freshness, and null-rate checks. Deliver: documented baselines, thresholds, and a sample alert with context fields.
- Project 2: Design an alerting policy (P1âP3, routing, gating, cooldown). Deliver: a one-page runbook and two simulated incidents with resolutions.
- Project 3: Seasonal baseline for a weekly-peaked metric. Deliver: per-weekday thresholds and a report comparing false positive rates before/after.
Learning path
- Review data quality dimensions and contracts for your domain.
- Implement basic checks (static/freshness) on a pilot table.
- Add robust baselines (median/MAD, seasonal) and alert policy.
- Scale to partition-level detection and add runbooks.
- Optional: experiment with multivariate/ML methods after basics are stable.
Next steps
- Adopt shadow mode on one pipeline for 2 weeks.
- Roll out severity/gating and measure alert precision/recall.
- Document runbooks and owners; run a game day to practice incidents.
Mini challenge
You manage a table with strong month-end spikes and occasional backfills. Draft a short policy (3–5 bullets) covering: baseline choice, month-end special handling, backfill safe-mode, and alert cooldown.
Sample answer
- Baseline: per-weekday with extra month-end bucket; robust median/MAD.
- Month-end: widen upper bound by 1 sigma; pre-approved spike window.
- Backfills: disable freshness alerts; enable volume alerts with 2x upper cap.
- Cooldown: 30 minutes; escalate only if anomaly persists across 2 checks.
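One way the sample answer could be encoded as configuration; the field names and the spike-window length are assumptions, while the 1-sigma widening, 2x cap, cooldown, and persistence rule come from the bullets above:

```python
# Sketch of the mini-challenge policy as configuration; names are illustrative.
month_end_policy = {
    "baseline": {"type": "rolling_median_mad", "seasonality": ["per_weekday", "month_end_bucket"]},
    "month_end": {"upper_bound_widen_sigma": 1.0, "pre_approved_spike_window_days": 2},  # window length is a placeholder
    "backfill_safe_mode": {"freshness_alerts": False, "volume_upper_cap_multiplier": 2.0},
    "cooldown_minutes": 30,
    "escalate_after_consecutive_anomalies": 2,
}
```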