Why this matters in the job
Healthy data pipelines deliver fresh, complete, and trustworthy data on time. Monitoring pipeline health lets you detect delays, data quality drift, and failures before stakeholders do. In real projects, you will:
- Track freshness and completeness for critical tables and events.
- Alert the right people with the right severity when SLAs or SLOs are at risk.
- Reduce noise by correlating retries, maintenance windows, and known incidents.
- Provide dashboards and runbooks so on-call engineers can act fast.
Who this is for
- Data Engineers and Analytics Engineers who operate scheduled or event-driven pipelines.
- Platform/Infra engineers supporting orchestration tools (e.g., Airflow, Dagster, Prefect, cloud schedulers).
- Team leads setting SLOs and on-call policies for data delivery.
Prerequisites
- Basic SQL (SELECT, WHERE, GROUP BY, date/time functions).
- Familiarity with a workflow orchestrator and task retries.
- Understanding of your data sources and target warehouse/lake.
Learning path
- Define golden signals: freshness, completeness, quality, timeliness, reliability, and cost.
- Instrument pipelines to emit events and metrics.
- Set SLOs/SLAs and thresholds; reduce alert noise with grace windows and correlation.
- Build dashboards and write concise runbooks.
- Drill incident response: simulate failures and verify alerts.
Concept explained simply
Monitoring pipeline health means measuring if data arrives when expected, in the right amount, and without errors—and alerting when it doesn’t.
Mental model
Think of pipelines like public transit:
- Freshness = how late the bus is.
- Completeness = how many passengers made it aboard vs expected.
- Quality = did passengers get to the right stop without being dropped off randomly?
- Reliability = buses run on schedule with acceptable failure/retry rates.
- Timeliness = total travel time stays within the timetable.
- Cost/Resource = fuel used to run the route.
Measure, compare to targets (SLOs), and page only when action is needed.
Key metrics and signals
- Freshness lag: now() minus last good event time or partition time.
- Completeness: delivered rows vs expected rows (absolute or percentage).
- Data quality checks: schema, null rates, duplicates, referential integrity, value ranges (see the sketch after this list).
- Timeliness: task and DAG duration (p50/p90/p95), queue wait time.
- Reliability: success rate, retry count, mean time to recovery (MTTR).
- Cost/Resource: compute hours, memory spill, and the cost of retries.
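The data quality checks above are a good place to start in plain SQL. A minimal sketch; staging.orders reappears in the worked examples below, but order_id, customer_id, and order_total are illustrative column names:
-- Basic quality checks on today's partition of staging.orders
SELECT
  COUNT(*) AS row_count,
  AVG(CASE WHEN customer_id IS NULL THEN 1.0 ELSE 0 END) AS customer_id_null_rate,
  COUNT(*) - COUNT(DISTINCT order_id) AS duplicate_order_ids,
  SUM(CASE WHEN order_total < 0 THEN 1 ELSE 0 END) AS negative_totals
FROM staging.orders
WHERE day = CURRENT_DATE;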
Checklist — define SLOs for a pipeline
- Freshness SLO: e.g., data less than 30 minutes old during business hours.
- Completeness SLO: e.g., 95–105% of the baseline volume for the same weekday.
- Quality SLO: critical checks must pass 99.9% of runs.
- Timeliness SLO: DAG completes in under 20 minutes p95.
- Reliability SLO: monthly success rate ≥ 99.5%.
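As a sketch of how the reliability SLO could be evaluated, assuming the orchestrator's run history is exported to a monitoring.dag_runs table (dag_id, started_at, state, and duration_seconds are illustrative names):
-- Monthly success rate per DAG against a 99.5% reliability SLO
SELECT
  dag_id,
  COUNT(*) AS total_runs,
  AVG(CASE WHEN state = 'success' THEN 1.0 ELSE 0 END) AS success_rate,
  CASE WHEN AVG(CASE WHEN state = 'success' THEN 1.0 ELSE 0 END) < 0.995
       THEN 'SLO at risk' ELSE 'OK' END AS status
FROM monitoring.dag_runs              -- hypothetical run log exported from the orchestrator
WHERE started_at >= DATE_TRUNC('month', CURRENT_DATE)
GROUP BY dag_id;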
Designing alerts that don’t wake you up unnecessarily
- Alert only on symptoms users feel: for example, critical data that is late or incomplete because of an upstream outage.
- Add grace periods and require N consecutive breaches before paging (see the sketch after this list).
- Use severity levels: INFO (notice), WARNING (investigate), CRITICAL (page).
- Suppress during maintenance windows and while retries are in progress.
- Include runbook links, owner, last change note, and quick diagnostic steps in the alert body.
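One way to implement the consecutive-breach rule is to store each check result and page from the stored history rather than from a single run. A sketch, assuming a hypothetical monitoring.freshness_checks table with checked_at, table_name, and lag_minutes:
-- Page only when the last two checks for a table both breached the 45-minute threshold
WITH recent AS (
  SELECT
    table_name,
    lag_minutes,
    ROW_NUMBER() OVER (PARTITION BY table_name ORDER BY checked_at DESC) AS rn
  FROM monitoring.freshness_checks   -- hypothetical table written by each check run
)
SELECT table_name, 'CRITICAL' AS severity
FROM recent
WHERE rn <= 2
GROUP BY table_name
HAVING COUNT(*) = 2
   AND MIN(CASE WHEN lag_minutes > 45 THEN 1 ELSE 0 END) = 1;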
Worked examples
Example 1 — Freshness alert for a table partition
Goal: Alert if table orders_fact is more than 45 minutes stale during 08:00–20:00, measuring freshness from max(event_time) in its raw.orders_events source.
- Compute freshness lag = now() - max(event_time).
- Alert if lag > 45 minutes for two consecutive checks (5-minute interval).
- Severity: WARNING at first breach, CRITICAL after 2 consecutive breaches.
SQL (warehouse-agnostic pseudocode)
SELECT
  DATE_TRUNC('minute', CURRENT_TIMESTAMP) AS checked_at,
  MAX(event_time) AS last_event_time,
  EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - MAX(event_time))) / 60 AS lag_minutes
FROM raw.orders_events;
-- Evaluate lag > 45 and business hours before paging
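The evaluation in the comment could be folded into the same query. A sketch that assumes business hours are checked in UTC; the consecutive-breach escalation to CRITICAL still happens in the alerting layer:
-- Evaluate the freshness check inside business hours only
WITH freshness AS (
  SELECT EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - MAX(event_time))) / 60 AS lag_minutes
  FROM raw.orders_events
)
SELECT
  lag_minutes,
  CASE
    WHEN EXTRACT(HOUR FROM CURRENT_TIMESTAMP) NOT BETWEEN 8 AND 19 THEN 'OK'  -- outside 08:00-20:00: no paging
    WHEN lag_minutes > 45 THEN 'WARNING'                                      -- CRITICAL after 2 consecutive breaches
    ELSE 'OK'
  END AS status
FROM freshness;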
Example 2 — Completeness check with weekday baseline
Goal: Detect if today’s row count deviates by more than 20% from the average of the last 4 occurrences of the same weekday.
WITH baseline AS (
  SELECT AVG(cnt) AS avg_cnt
  FROM (
    SELECT DATE(day) AS d, COUNT(*) AS cnt
    FROM staging.orders
    WHERE day BETWEEN CURRENT_DATE - INTERVAL '28 days' AND CURRENT_DATE - INTERVAL '1 day'
      AND EXTRACT(ISODOW FROM day) = EXTRACT(ISODOW FROM CURRENT_DATE)
    GROUP BY 1
  ) x
), today AS (
  SELECT COUNT(*) AS cnt FROM staging.orders WHERE day = CURRENT_DATE
)
SELECT (today.cnt - baseline.avg_cnt) / NULLIF(baseline.avg_cnt, 0) AS pct_diff
FROM today, baseline;
Action: If |pct_diff| > 0.2 for 2 runs and today’s volume > 10k, escalate to CRITICAL.
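If each run stores its result, the escalation rule becomes a small follow-up query; monitoring.completeness_checks is an assumed landing table for the output of the query above:
-- Escalate only if the deviation and the minimum-volume guard held on the last two runs
SELECT
  CASE WHEN SUM(CASE WHEN ABS(pct_diff) > 0.2 AND today_cnt > 10000 THEN 1 ELSE 0 END) = 2
       THEN 'CRITICAL' ELSE 'OK' END AS status
FROM (
  SELECT pct_diff, today_cnt
  FROM monitoring.completeness_checks  -- hypothetical table storing each run's pct_diff and row count
  ORDER BY checked_at DESC
  LIMIT 2
) last_two;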
Example 3 — Retry-aware failure alert
Goal: Don’t page on first failure when retries are configured.
- Task policy: max_retries = 3, retry_delay = 2m.
- Alert routing: WARNING on first failure; CRITICAL only if all retries exhausted or total delay > 10m.
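A sketch of this routing rule, assuming a per-attempt log table monitoring.task_attempts (dag_id, task_id, try_number, max_tries, state, ended_at are illustrative names):
-- WARNING on any failed attempt; CRITICAL only when the final allowed attempt failed
SELECT
  dag_id,
  task_id,
  try_number,
  CASE WHEN try_number >= max_tries THEN 'CRITICAL' ELSE 'WARNING' END AS severity
FROM monitoring.task_attempts            -- hypothetical log with one row per try
WHERE state = 'failed'
  AND ended_at >= CURRENT_TIMESTAMP - INTERVAL '15 minutes';
-- A total-delay condition (e.g., > 10m since the first attempt) can be added by joining the first try's start time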
Example 4 — Duration SLO regression
Goal: Alert if DAG duration p95 doubles vs 14-day baseline.
Compute rolling baseline p95 per DAG per hour. If current p95 > 2x baseline for 30 minutes, send WARNING; if for 1 hour, CRITICAL.
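A sketch of the p95 comparison over the same assumed monitoring.dag_runs table; PERCENTILE_CONT follows Postgres/Snowflake-style syntax:
-- Compare today's p95 duration per DAG to a 14-day baseline p95
WITH baseline AS (
  SELECT dag_id,
         PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration_seconds) AS p95_baseline
  FROM monitoring.dag_runs
  WHERE started_at >= CURRENT_DATE - INTERVAL '14 days'
    AND started_at <  CURRENT_DATE
  GROUP BY dag_id
), current_day AS (
  SELECT dag_id,
         PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration_seconds) AS p95_current
  FROM monitoring.dag_runs
  WHERE started_at >= CURRENT_DATE
  GROUP BY dag_id
)
SELECT c.dag_id, c.p95_current, b.p95_baseline
FROM current_day c
JOIN baseline b USING (dag_id)
WHERE c.p95_current > 2 * b.p95_baseline;  -- WARNING after 30 minutes in breach, CRITICAL after 1 hour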
Hands-on exercises
Complete these, then open the Quick Test below. Everyone can take the test; only logged-in users get saved progress.
Exercise 1 — Freshness lag detector
Given this ingestion log snapshot at 2026-01-08 10:00:00 UTC:
table_name,max_event_time
users,2026-01-08T09:30:00Z
orders,2026-01-08T08:55:00Z
inventory,2026-01-08T09:10:00Z
- Write SQL or pseudocode to compute lag_minutes = now - max_event_time.
- Output tables where lag_minutes > 45.
- Create an example alert payload with fields: component, metric, observed, threshold, severity.
Tip
Use epoch difference in minutes. Assume now = 2026-01-08 10:00:00 UTC.
Exercise 2 — Alert policy design
For DAG orders_daily with tasks extract_api, transform_orders, dq_orders_rowcount, and load_dw:
- Define thresholds, grace periods, and severity for each task.
- Include owners, channels, and suppression during maintenance 00:30–00:45 UTC.
- Represent the policy as a small JSON object.
Checklist — before you ship monitoring
- Metrics emitted: freshness, row counts, durations, retries.
- SLOs documented with on-call owner per DAG.
- Alerts grouped and deduplicated; retries respected.
- Dashboards show last 24h and 14d baselines.
- Runbook exists with quick checks and rollback/backfill steps.
Common mistakes and self-check
- Alerting on every failure even when retries succeed. Self-check: verify that alerts trigger only after retries are exhausted or the total delay exceeds the SLO.
- Using fixed thresholds that ignore weekday seasonality. Self-check: compare against same-weekday baselines or use percentage-of-baseline thresholds.
- Missing ownership in alerts. Self-check: ensure each alert contains an owner, an escalation path, and a link to the relevant runbook section.
- Only monitoring task status, not data quality. Self-check: add checks for nulls, duplicates, referential integrity, and schema drift.
Practical projects
- Instrument one DAG to emit freshness, completeness, and duration metrics; build a 14-day dashboard with SLO overlays.
- Create an alert policy file for two pipelines, including grace windows and maintenance suppression.
- Simulate an upstream outage, observe alerts, and document a runbook entry with mitigation steps.
Next steps
- Automate baseline calculations (weekday-aware) and store them for comparisons.
- Add cost and retry metrics to control resource usage.
- Introduce event-correlation to reduce duplicate alerts across dependent DAGs.
Quick Test
You can take the Quick Test for free. Only logged-in users get saved progress.
Mini challenge
Your nightly sales table finished on time, but analysts report today’s BI dashboard is missing some regions. Propose a 6-step investigation: which metrics and logs do you check first, how do you decide severity, and what temporary mitigation can you apply while you fix the root cause?