Who this is for
Data Architects
Prerequisites
- Basic knowledge of data pipelines (batch and/or streaming)
- Familiarity with data quality checks (freshness, schema, null rates, duplicates)
- Understanding of SLIs/SLOs and on-call concepts
Why this matters
In real teams, you will be asked to: define SLOs for data products, decide which metrics to monitor, reduce noisy alerts, set escalation paths, and ensure incidents are actionable with clear runbooks. Good design protects downstream analytics, ML models, and business decisions.
Concept explained simply
Monitoring observes signals that describe system and data health. Alerting routes a subset of those signals to humans when action is needed. Your job: choose the right signals, thresholds, and routing so issues are found early, understood quickly, and fixed fast.
Mental model
- Signals: what you measure (freshness, volume, schema, nulls, latency, costs)
- Policies: what good/bad looks like (SLOs, thresholds, windows)
- Pipelines: where checks run (ingest, transform, publish)
- Routing: who hears about it (owner group, escalation chain, business hours)
- Runbooks: how to act (diagnostics, rollback, disable, reprocess)
- Feedback: tune thresholds, suppress noise, add context
Tip: Separate monitors vs alerts
Collect many monitors. Alert on the few that require human action. Everything else should be visible on dashboards or weekly reports.
Data monitoring golden signals
- Freshness: is data updated on time? (e.g., within 2 hours of schedule)
- Completeness/Volume: row counts vs baseline; missing partitions
- Schema: unexpected columns/types, breaking changes
- Quality rules: null/duplicate rates, referential integrity, domain constraints
- Latency: job run time, end-to-end time to availability
- Lineage health: status of upstream dependencies
- Cost/Throughput: anomalies that hint at runaway jobs or throttling
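To make these signals concrete, here is a minimal Python sketch that derives two of them (freshness lag and volume versus a baseline) from simple load metadata. All names, timestamps, and thresholds are illustrative placeholders, not a specific tool's API.

from datetime import datetime, timedelta

# Hypothetical metadata about the latest load of a table (values are illustrative).
last_success_time = datetime(2026, 1, 17, 2, 4)    # when the last partition landed
now = datetime(2026, 1, 17, 3, 2)
rows_today = 940_000
expected_rows_today = 1_000_000                    # e.g., median for this weekday

# Freshness signal: how far behind the expected update are we?
freshness_lag = now - last_success_time
freshness_breached = freshness_lag > timedelta(hours=2)

# Volume signal: relative deviation from the baseline, with -30%/+50% guardrails.
volume_ratio = rows_today / expected_rows_today
volume_breached = not (0.7 <= volume_ratio <= 1.5)

print(f"freshness_lag={freshness_lag}, breached={freshness_breached}")
print(f"volume_ratio={volume_ratio:.2f}, breached={volume_breached}")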
Design checklist (use during planning)
- Define data product SLOs: freshness, completeness, quality, latency
- Choose SLIs and baselines per SLO
- Decide thresholds and windows (absolute and relative)
- Route by ownership and business impact
- Add context to alerts (links to runbook, lineage, last success, recent deploy)
- Implement noise controls (dedupe, grouping, rate limits, maintenance windows)
- Set escalation policy (time-based or severity-based)
- Test alerts end-to-end (simulate failures)
- Review monthly: false positives, MTTR, coverage gaps
Worked examples
Example 1: Daily batch table freshness SLO
Context: A sales_facts table updates daily by 02:00. Business dashboards load at 07:00.
- SLO: 99% of days, table is refreshed by 02:30.
- SLI: ingested_at of the latest partition (partition_date = yesterday) is at or before 02:30.
- Alert policy: warn at 02:15 (if late), page at 02:30, auto-suppress during planned maintenance.
- Routing: data-platform on-call (night), analytics leads (business hours for warn only).
- Runbook snippet: check upstream extract, verify partition arrival, trigger one-time backfill.
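A minimal sketch of this warn/page policy in Python, assuming a simple scheduler hook that knows whether yesterday's partition has landed and whether a maintenance window is active; the function name and inputs are hypothetical and the logic is simplified to the early-morning check.

from datetime import datetime, time

def freshness_alert(now: datetime, partition_landed: bool, in_maintenance: bool) -> str:
    """Return 'none', 'warn', or 'page' for the sales_facts freshness policy (illustrative)."""
    if in_maintenance or partition_landed:
        return "none"                 # suppressed, or already fresh
    if now.time() >= time(2, 30):
        return "page"                 # SLO breach: page data-platform on-call
    if now.time() >= time(2, 15):
        return "warn"                 # early warning to analytics leads
    return "none"

# Example: 02:20, yesterday's partition not yet loaded, no planned maintenance.
print(freshness_alert(datetime(2026, 1, 17, 2, 20), partition_landed=False, in_maintenance=False))  # warn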
Example 2: Streaming event rate drop
Context: Clickstream topic expected 10k events/min, fluctuates ±20% normally.
- SLO: 95% of minutes have volume within -30% to +50% of 7-day median for that minute-of-day.
- Alert policy: if 5-minute rolling average < lower bound for 10 minutes, page stream-oncall.
- Noise control: suppress during known release windows; group alerts by partition/region.
- Diagnostics: check broker lag, producer error rate, schema registry compatibility.
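The persistence rule in this example can be sketched as follows; the function, window sizes, and traffic numbers are illustrative, and a real implementation would read from your metrics store rather than in-memory lists.

from collections import deque
from statistics import mean

def should_page(events_per_min, baseline_per_min, window=5, persistence=10, lower_factor=0.7):
    """Page only when the rolling average stays below the seasonal lower bound
    for `persistence` consecutive minutes. All numbers here are illustrative."""
    recent = deque(maxlen=window)
    minutes_breached = 0
    for observed, baseline in zip(events_per_min, baseline_per_min):
        recent.append(observed)
        rolling_avg = mean(recent)
        if rolling_avg < lower_factor * baseline:   # below -30% of the 7-day median
            minutes_breached += 1
        else:
            minutes_breached = 0                    # the breach must be persistent
        if minutes_breached >= persistence:
            return True
    return False

# Twenty minutes of traffic that collapses after minute 5, against a flat 10k/min baseline.
observed = [10_000] * 5 + [4_000] * 15
print(should_page(observed, [10_000] * 20))  # True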
Example 3: dbt model tests spiking
Context: Transformation layer has tests on not_null and accepted_values.
- Policy: do not page on the first failure; page if the same test fails on 2 consecutive runs or affects >= 10% of rows.
- Routing: model owner team; CC platform if failures coincide with recent infra changes.
- Runbook: view model lineage, identify upstream null source, apply hotfix, re-run affected models only.
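One way to encode this paging policy, assuming you can read failed and total row counts from the test results and remember which tests failed on the previous run; the helper below is a hypothetical sketch, not dbt's API.

def dbt_test_should_page(test_name, failed_rows, total_rows, previous_failures):
    """Decide whether a failed test should page, per the policy above.
    `previous_failures` is a hypothetical set of test names that failed on the prior run."""
    failure_rate = failed_rows / total_rows if total_rows else 0.0
    repeated = test_name in previous_failures       # failed on 2 consecutive runs
    widespread = failure_rate >= 0.10               # affects >= 10% of rows
    return repeated or widespread

# First-time failure touching 0.4% of rows: notify the owner, do not page.
print(dbt_test_should_page("not_null_orders_customer_id", 4_000, 1_000_000, previous_failures=set()))  # False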
Alert design principles that reduce noise
- Actionability: every page must include owner, impact hypothesis, and next steps.
- Statefulness: page on breach persistence (X minutes/runs), not single spikes.
- Aggregation: group similar alerts (by table, data product, region).
- Context: include last success time, recent deploys, upstream status, sample errors.
- Dedupe: prevent repeated pages for the same condition within a cooldown window.
- Maintenance: honor silences for planned backfills and migrations.
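A simplified sketch of dedupe and silencing, assuming alerts are keyed by monitor and entity (for example, table or region); mature alerting tools provide these controls, so treat this only as an illustration of the behavior.

from datetime import datetime, timedelta

class AlertGate:
    """Drops repeat pages for the same (monitor, entity) key inside a cooldown window
    and honors maintenance silences. An in-memory sketch only."""
    def __init__(self, cooldown=timedelta(minutes=30)):
        self.cooldown = cooldown
        self.last_paged = {}          # key -> datetime of last page
        self.silenced_until = {}      # key -> datetime the silence expires

    def silence(self, key, until):
        self.silenced_until[key] = until

    def should_page(self, key, now):
        if now < self.silenced_until.get(key, datetime.min):
            return False                              # planned backfill/migration
        last = self.last_paged.get(key)
        if last is not None and now - last < self.cooldown:
            return False                              # deduped within the cooldown
        self.last_paged[key] = now
        return True

gate = AlertGate()
t0 = datetime(2026, 1, 17, 3, 0)
print(gate.should_page(("freshness", "sales_facts"), t0))                          # True: first page
print(gate.should_page(("freshness", "sales_facts"), t0 + timedelta(minutes=10)))  # False: cooldown

Grouping works the same way: widen the key (for example, to the data product instead of the table) so related breaches collapse into one page.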
Example alert payload template
{
"title": "Freshness breach: sales_facts",
"severity": "high",
"owner": "#data-platform-oncall",
"slo": "99% by 02:30",
"observed": "last_partition 2026-01-17, now 03:02",
"last_success": "2026-01-17 02:04",
"recent_change": "ETL version 4.2 deployed 01:30",
"lineage": ["extract_sales", "stg_sales"],
"next_steps": ["Check upstream job logs", "Trigger backfill job"],
"severity_escalates_in": "30m if unresolved"
}
Escalation and runbooks
- Page on-call for high severity; notify owner for medium; log-only for low.
- Escalate if not acknowledged in 5–10 minutes, or not resolved within the SLO error budget.
- Runbooks contain: probable causes, validation queries, rollback/backfill steps, contacts.
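As a rough illustration, the severity-to-action mapping and time-based escalation above could be expressed like this; the channel names and timeouts are placeholders for your own policy.

from datetime import timedelta

# Hypothetical severity policy: initial notification and when unacknowledged pages escalate.
ESCALATION_POLICY = {
    "high":   {"notify": "page on-call",  "escalate_after": timedelta(minutes=5)},
    "medium": {"notify": "notify owner",  "escalate_after": timedelta(minutes=30)},
    "low":    {"notify": "log only",      "escalate_after": None},
}

def next_action(severity, unacknowledged_for):
    policy = ESCALATION_POLICY[severity]
    deadline = policy["escalate_after"]
    if deadline is not None and unacknowledged_for >= deadline:
        return "escalate to secondary on-call"
    return policy["notify"]

print(next_action("high", timedelta(minutes=7)))   # escalate to secondary on-call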
Runbook skeleton
- Confirm alert state and timeframe
- Check upstream job status and recent releases
- Run a validation query (e.g., select max(partition_date))
- Apply fix (re-run step, backfill, revert schema)
- Communicate impact and ETA
- Post-incident review: update thresholds or tests
Setting metrics and SLOs
- Freshness SLI: now - last_success_time
- Completeness SLI: rows_today / expected_rows_today
- Quality SLI: 1 - (failed_rule_rows / total_rows)
- Latency SLI: job_end - data_arrival
Use rolling statistical baselines (e.g., median by day-of-week and hour) plus guardrails (absolute caps). Start conservative, then tighten.
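A minimal sketch of such a baseline, assuming you keep a history of (weekday, hour, row count) observations; the field layout and guardrail values are placeholders you would tune per data product.

from statistics import median

def expected_rows(history, weekday, hour, floor_rows=100_000, cap_rows=5_000_000):
    """Baseline = median of past observations for the same weekday and hour,
    clamped by absolute guardrails. `history` is a list of (weekday, hour, rows) tuples."""
    same_slot = [rows for wd, hr, rows in history if wd == weekday and hr == hour]
    baseline = median(same_slot) if same_slot else floor_rows
    return max(floor_rows, min(cap_rows, baseline))

# Four past Mondays at 02:00, used to judge the next Monday 02:00 load.
history = [(0, 2, 980_000), (0, 2, 1_020_000), (0, 2, 1_005_000), (0, 2, 995_000)]
print(expected_rows(history, weekday=0, hour=2))   # 1000000.0 (median of the four Mondays)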
Scope and environment
- Pipeline stage: validate early (ingest), verify at publish (data product SLA)
- Environment: dev alerts to PR owners; prod alerts to on-call
- Tenancy: route by domain (finance vs marketing) to correct responders
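One simple way to express this routing is a lookup from environment and domain to a responder, with a platform fallback; the channel names below are hypothetical.

# Hypothetical routing table: (environment, domain) -> responder channel.
ROUTING = {
    ("prod", "finance"):   "#finance-data-oncall",
    ("prod", "marketing"): "#marketing-data-oncall",
    ("dev",  "finance"):   "pr-author",              # dev alerts go to the PR owner
    ("dev",  "marketing"): "pr-author",
}

def route(environment, domain):
    return ROUTING.get((environment, domain), "#data-platform-oncall")  # fallback owner

print(route("prod", "finance"))   # #finance-data-oncall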
Exercises
These mirror the tasks below. If you are logged in, your exercise and quick test progress is saved; otherwise you can still complete everything for free.
- Exercise 1: Design a freshness alert policy for a daily table.
- Exercise 2: Draft a triage runbook for a stale data alert.
- Checklist before submitting:
  - Clear SLO and SLI defined
  - Thresholds and timing windows stated
  - Routing, escalation, and noise controls included
  - Runbook steps are actionable
Common mistakes and self-check
- Alerting on every failed check. Self-check: Does this alert demand human action now?
- Static thresholds ignoring seasonality. Self-check: Did you use time-of-day/day-of-week baselines?
- No context in alerts. Self-check: Does your payload contain owner, last success, recent changes?
- Missing suppression windows. Self-check: Do you silence during planned backfills/migrations?
- Orphan alerts. Self-check: Is there a named owner and escalation path?
- No post-incident tuning. Self-check: Do you review false positives monthly?
Practical projects
- Project A: Implement freshness and volume monitors for two data products; design alert payloads and routing; simulate failures and refine thresholds.
- Project B: Add schema drift detection to a pipeline and create a runbook that includes rollback and revalidation steps.
- Project C: Build a dashboard showing SLO attainment, MTTR, false positive rate, and top noisy monitors; propose three policy changes.
Learning path
- Before this: Data quality rules and validation patterns
- Now: Monitoring and alerting design (this lesson)
- Next: Incident management, postmortems, and SLO governance
Next steps
- Finish both exercises and compare with the provided solutions.
- Take the Quick Test to check your understanding.
- Apply one alerting improvement to a real or sample pipeline this week.
Mini challenge
Your hourly aggregation job is sometimes 20 minutes late during end-of-month. Propose an alert design that avoids paging during expected spikes but still catches true regressions. Include SLI, thresholds, routing, and suppression rules.
Note: The quick test is available to everyone for free. Only logged-in learners have progress saved over time.