
Anomaly Detection Strategy

Learn Anomaly Detection Strategy for free with explanations, exercises, and a quick test (for Data Architects).

Published: January 18, 2026 | Updated: January 18, 2026

Why this matters

As a Data Architect, you set standards that keep data trustworthy. An anomaly detection strategy lets teams catch issues early—like sudden drops in orders, late-arriving batches, schema drifts, or spikes in nulls—so dashboards and ML models don’t make bad decisions.

  • Real tasks: define baselines and thresholds; choose metrics (volume, freshness, null rate, distribution); set alert severity and runbooks; design seasonality-aware detection; plan rollout and shadow mode.
  • Outcomes: fewer false positives, faster incident triage, and measurable data reliability.

Who this is for

  • Data Architects and Senior Data Engineers who design platform standards.
  • Analytics Engineers owning data quality SLIs/SLOs.
  • Platform owners adding observability to pipelines and lakes/warehouses.

Prerequisites

  • Comfort with SQL and basic statistics (mean/median, variance, percentiles).
  • Understanding of your data platform’s scheduling, partitioning, and lineage.
  • Knowledge of core data quality dimensions (freshness, volume, completeness, schema).

Concept explained simply

An anomaly is a data behavior that’s unlikely under “normal” conditions (e.g., order count is far below typical Tuesdays). Detection compares current metrics to a learned baseline and flags outliers.

Mental model

Think of a “thermostat” for data: you pick what to monitor (metrics), learn normal ranges (baselines), decide what’s urgent (severity), and define what happens when the temperature goes out of range (alerts and runbooks). Keep it stable by accounting for seasons (weekends, month-end), noise (small fluctuations), and late arrivals.

Strategy blueprint

  1. Choose metrics per asset: volume, row count, distinct keys, null rate, freshness, outliers in numeric fields, schema changes.
  2. Pick baseline type: static thresholds (quick), rolling statistics (median/MAD, EWMA), seasonal baselines (per weekday), or ML-based (only when needed).
  3. Define alert policy: severity levels (P1–P3), gating (minimum volume, minimum partitions affected), de-duplication, cooldowns, and routing.
  4. Handle data realities: partition-level checks, seasonality/holidays, late data windows, backfills, and replays.
  5. Roll out safely: shadow mode (no alerts) to learn baseline, then gradual enablement with tighter thresholds over time.
  6. Create runbooks: steps to validate an anomaly, known false-positive patterns, owners, and rollback/patch strategies (a configuration sketch covering steps 1–6 follows this list).
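
To make the blueprint concrete, here is a minimal sketch of how steps 1–6 might be captured as a per-asset monitoring config. It is plain Python, and every field name and value (the asset name, gating numbers, routing target, runbook URL) is an illustrative assumption rather than any specific tool's schema.

```python
# Hypothetical per-asset monitoring config capturing the blueprint above.
# All names and values are illustrative assumptions, not a tool's schema.
ORDERS_DAILY_POLICY = {
    "asset": "analytics.orders_daily",
    "metrics": ["row_count", "freshness_minutes", "pct_null(email)"],
    "baseline": {
        "type": "rolling_median_mad",   # alternatives: static, ewma, seasonal, ml
        "window_days": 28,
        "seasonality": "per_weekday",
    },
    "alerting": {
        "gating": {"min_rows": 10_000, "min_partitions_affected": 1},
        "severity": {"P1": "revenue_impacting and drop > 20%", "P2": "default"},
        "cooldown_minutes": 30,
        "dedupe_key": ["asset", "metric", "partition_date"],
        "route_to": "data-platform-oncall",
    },
    "data_realities": {"late_data_grace_minutes": 30, "suppress_during_backfill": True},
    "rollout": {"mode": "shadow", "shadow_days": 14},
    "runbook_url": "https://example.internal/runbooks/orders_daily",  # placeholder
}
```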

Baseline options: quick reference

  • Static threshold: simple, good for SLAs like freshness by 02:00.
  • Rolling median + MAD: robust to outliers for counts and rates.
  • EWMA/control charts: smooths noise; good for freshness and latency.
  • Seasonal baselines: separate normal ranges by weekday/month-end.
  • ML (e.g., isolation forest): for multivariate signals after simpler methods work; two of the simpler options are sketched below.
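
A minimal sketch, in plain Python, of two of the simpler options: a static freshness threshold and an EWMA control chart. The smoothing factor, band width, and sample arrival times are illustrative assumptions.

```python
# Minimal sketches of two baseline options; all constants are illustrative choices.

def static_freshness_breach(lag_minutes: float, sla_minutes: float = 120) -> bool:
    """Static threshold: flag if the data is later than the SLA allows."""
    return lag_minutes > sla_minutes

def ewma_control_limits(history: list[float], alpha: float = 0.3, k: float = 3.0):
    """EWMA control chart: smooth the series, then build k-sigma bands
    around the smoothed value using the one-step-ahead residuals."""
    ewma = history[0]
    residuals = []
    for x in history[1:]:
        residuals.append(x - ewma)
        ewma = alpha * x + (1 - alpha) * ewma
    sigma = (sum(r * r for r in residuals) / max(len(residuals), 1)) ** 0.5
    return ewma - k * sigma, ewma + k * sigma

# Example: arrival minutes after midnight for the last 10 runs.
arrivals = [78, 81, 80, 83, 79, 82, 85, 84, 88, 90]
low, high = ewma_control_limits(arrivals)
print(static_freshness_breach(lag_minutes=140))   # True: breached a 2-hour SLA
print(low <= arrivals[-1] <= high)                # True: latest arrival is inside the band
```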

Worked examples

1) Daily order count with weekday seasonality

Metric: daily row_count for orders table. Last 7 days: 980, 1010, 995, 1005, 990, 1002, 740.

  • Baseline: rolling median = 995; MAD = median(|x-995|) = 10; robust sigma ~ 1.4826*MAD = 14.826.
  • Thresholds (3 sigma): 995 ± 3*14.826 ≈ [950.5, 1039.5].
  • Today: 740 < 950.5 ⇒ anomaly.
  • Severity: P1 if the table is revenue-impacting and the drop exceeds 20%; otherwise P2.

Seasonality tip: once you have enough history, compare Tuesdays to prior Tuesdays rather than to a global median.
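
The arithmetic above can be reproduced with a few lines of standard-library Python; the 1.4826 factor scales MAD so it is comparable to a standard deviation for roughly normal data.

```python
from statistics import median

counts = [980, 1010, 995, 1005, 990, 1002, 740]    # last 7 days; today's value is 740

med = median(counts)                                # 995
mad = median([abs(x - med) for x in counts])        # 10
robust_sigma = 1.4826 * mad                         # 14.826
low, high = med - 3 * robust_sigma, med + 3 * robust_sigma   # ~[950.5, 1039.5]

today = counts[-1]
is_anomaly = not (low <= today <= high)             # True: 740 < 950.5
print(f"bounds=({low:.1f}, {high:.1f}) anomaly={is_anomaly}")
```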

2) Freshness SLA for daily batch

Metric: data_time_lag (now - max(event_time)) for table users_snapshot. SLA: data ready by 02:00 daily.

  • Rule: P2 if lag > 1 hour after 02:00; escalate to P1 if lag > 2 hours after 02:15.
  • Late data window: allow 30 minutes grace because upstream is variable.
  • Track an EWMA of arrival time to auto-adjust thresholds if the process shifts gradually (investigate if the trend worsens for 3 consecutive days).
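
One way to encode the freshness rule above as code, assuming the latest event timestamp has already been fetched from the warehouse. The 02:00 SLA, 02:15 escalation, and 30-minute grace window mirror the bullets; the function shape and how the grace window delays evaluation are illustrative assumptions.

```python
from datetime import datetime, timedelta

def freshness_severity(now: datetime, max_event_time: datetime) -> str | None:
    """Evaluate data lag (now - max event time) against the 02:00 SLA,
    with a 30-minute grace window before any alert fires. None means healthy."""
    sla = now.replace(hour=2, minute=0, second=0, microsecond=0)
    grace = timedelta(minutes=30)
    lag = now - max_event_time
    if now < sla + grace:                          # still inside the grace window
        return None
    if now >= sla + timedelta(minutes=15) and lag > timedelta(hours=2):
        return "P1"
    if lag > timedelta(hours=1):
        return "P2"
    return None

# Example: at 02:40 the newest event is from 00:10, so lag is about 2.5 hours -> P1.
now = datetime(2026, 1, 18, 2, 40)
print(freshness_severity(now, max_event_time=datetime(2026, 1, 18, 0, 10)))
```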

3) Null rate spike on critical column

Metric: pct_null(email) by partition_date for marketing_events.

  • Gating: only alert if row_count >= 10,000 (avoid noise on tiny partitions).
  • Baseline: rolling median null rate with 3*MAD bounds.
  • If the null rate jumps from 0.5% to 12% while volume is normal, raise a P1 and include recent schema changes in the alert context (a minimal sketch of this check follows).
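
A minimal sketch of the gated null-rate check, assuming a per-partition null-rate history has already been computed; the 10,000-row gate and 3*MAD bounds mirror the bullets above, and the sample numbers are illustrative.

```python
from statistics import median

def null_rate_alert(history: list[float], today_rate: float, today_rows: int,
                    min_rows: int = 10_000, k: float = 3.0) -> bool:
    """Gate on partition size, then flag if today's null rate leaves the MAD band."""
    if today_rows < min_rows:                       # gating: skip tiny partitions
        return False
    med = median(history)
    mad = median([abs(r - med) for r in history]) or 1e-6   # avoid a zero-width band
    return abs(today_rate - med) > k * 1.4826 * mad

# Example: baseline around 0.5% nulls, today jumps to 12% on a full-size partition.
history = [0.004, 0.005, 0.006, 0.005, 0.004, 0.005, 0.006]
print(null_rate_alert(history, today_rate=0.12, today_rows=250_000))   # True -> raise P1
```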

Exercises (hands-on)

These mirror the Practice Exercises at the end of this page. Try them here, then check your work against the expected output there.

Exercise 1: Build a robust threshold for daily counts

Data (last 7 days counts): 980, 1010, 995, 1005, 990, 1002, 740.

  • Step 1: Compute the rolling median.
  • Step 2: Compute MAD and robust sigma (~1.4826*MAD).
  • Step 3: Build 3-sigma bounds and decide if 740 is an anomaly.
  • Step 4: Propose a severity mapping.

Exercise 2: Design an alert policy for freshness

Context: Table sales_daily expected by 02:00. Typical arrival 01:20 ± 15 min.

  • Define: SLA, alert levels, gating, cooldown, and deduplication.
  • Include: runbook first checks (upstream status, recent deploys).

Self-checklist

  • I selected metrics aligned to data contract/business impact.
  • Baselines account for seasonality or used robust statistics.
  • I set severity levels and gating to reduce alert noise.
  • I included late data windows and backfill handling.
  • I drafted a clear runbook for responders.

Common mistakes and how to self-check

Using global mean/standard deviation on skewed data

Fix: prefer median/MAD or percentiles; add seasonal baselines.

No gating on tiny partitions

Fix: require minimum rows/partitions before alerting.

Ignoring late-arriving events

Fix: incorporate allowed_lateness windows/watermarks into freshness logic.

Alert fatigue from duplicate pages

Fix: dedupe alerts per asset and dimension; add cooldowns and escalate only when an anomaly persists (a minimal sketch follows).
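
As a sketch of dedupe-plus-cooldown, here is one way to gate alerts per (asset, metric, dimension) key. The 30-minute cooldown and escalate-after-two-consecutive-checks rule are illustrative assumptions.

```python
from datetime import datetime, timedelta

class AlertGate:
    """Suppress duplicate alerts for the same key within a cooldown window,
    and escalate only when the anomaly persists across consecutive checks."""

    def __init__(self, cooldown: timedelta = timedelta(minutes=30), persist_checks: int = 2):
        self.cooldown = cooldown
        self.persist_checks = persist_checks
        self.last_fired: dict[tuple, datetime] = {}
        self.streak: dict[tuple, int] = {}

    def decide(self, key: tuple, anomalous: bool, now: datetime) -> str:
        if not anomalous:
            self.streak[key] = 0                      # healthy check resets persistence
            return "ok"
        self.streak[key] = self.streak.get(key, 0) + 1
        last = self.last_fired.get(key)
        if last is not None and now - last < self.cooldown:
            return "suppressed"                       # deduped within the cooldown
        self.last_fired[key] = now
        return "escalate" if self.streak[key] >= self.persist_checks else "alert"

gate = AlertGate()
key = ("analytics.orders_daily", "row_count", "2026-01-18")
print(gate.decide(key, True, datetime(2026, 1, 18, 3, 0)))    # "alert"
print(gate.decide(key, True, datetime(2026, 1, 18, 3, 10)))   # "suppressed" (inside cooldown)
```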

Deploying with alerts enabled on day 1

Fix: shadow mode for 1–2 weeks to learn baseline; then gradually enable.

Practical projects

  • Project 1: Instrument 3 critical tables with row_count, freshness, and null-rate checks. Deliver: documented baselines, thresholds, and a sample alert with context fields.
  • Project 2: Design an alerting policy (P1–P3, routing, gating, cooldown). Deliver: a one-page runbook and two simulated incidents with resolutions.
  • Project 3: Seasonal baseline for a weekly-peaked metric. Deliver: per-weekday thresholds and a report comparing false positive rates before/after.

Learning path

  1. Review data quality dimensions and contracts for your domain.
  2. Implement basic checks (static/freshness) on a pilot table.
  3. Add robust baselines (median/MAD, seasonal) and alert policy.
  4. Scale to partition-level detection and add runbooks.
  5. Optional: experiment with multivariate/ML methods after basics are stable.

Next steps

  • Adopt shadow mode on one pipeline for 2 weeks.
  • Roll out severity/gating and measure alert precision/recall.
  • Document runbooks and owners; run a game day to practice incidents.

Mini challenge

You manage a table with strong month-end spikes and occasional backfills. Draft a short policy (3–5 bullets) covering: baseline choice, month-end special handling, backfill safe-mode, and alert cooldown.

Sample answer:
  • Baseline: per-weekday with extra month-end bucket; robust median/MAD.
  • Month-end: widen upper bound by 1 sigma; pre-approved spike window.
  • Backfills: disable freshness alerts; enable volume alerts with 2x upper cap.
  • Cooldown: 30 minutes; escalate only if anomaly persists across 2 checks.

Before you take the Quick Test

Practice Exercises

2 exercises to complete

Instructions

Data (last 7 days counts): 980, 1010, 995, 1005, 990, 1002, 740.

  • Compute the rolling median and MAD.
  • Convert MAD to robust sigma (~1.4826*MAD).
  • Create 3-sigma bounds and decide if 740 is an anomaly.
  • Propose a severity mapping (P1/P2) given a 25% drop on a revenue-critical table.

Expected Output

Median ≈ 995, MAD ≈ 10, bounds ≈ [950.5, 1039.5]. 740 is anomalous. Severity: P1 if business-critical; otherwise P2.

Anomaly Detection Strategy — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
