
Monitoring Pipeline Health

Learn Monitoring Pipeline Health for free with explanations, exercises, and a quick test (for Data Engineers).

Published: January 8, 2026 | Updated: January 8, 2026

Why this matters in the job

Healthy data pipelines deliver fresh, complete, and trustworthy data on time. Monitoring pipeline health lets you detect delays, data quality drifts, and failures before stakeholders do. In real projects, you will:

  • Track freshness and completeness for critical tables and events.
  • Alert the right people with the right severity when SLAs or SLOs are at risk.
  • Reduce noise by correlating retries, maintenance windows, and known incidents.
  • Provide dashboards and runbooks so on-call engineers can act fast.

Who this is for

  • Data Engineers and Analytics Engineers who operate scheduled or event-driven pipelines.
  • Platform/Infra engineers supporting orchestration tools (e.g., Airflow, Dagster, Prefect, cloud schedulers).
  • Team leads setting SLOs and on-call policies for data delivery.

Prerequisites

  • Basic SQL (SELECT, WHERE, GROUP BY, date/time functions).
  • Familiarity with a workflow orchestrator and task retries.
  • Understanding of your data sources and target warehouse/lake.

Learning path

  1. Define golden signals: freshness, completeness, quality, timeliness, reliability, and cost.
  2. Instrument pipelines to emit events and metrics.
  3. Set SLOs/SLAs and thresholds; reduce alert noise with grace windows and correlation.
  4. Build dashboards and write concise runbooks.
  5. Drill incident response: simulate failures and verify alerts.

Concept explained simply

Monitoring pipeline health means measuring if data arrives when expected, in the right amount, and without errors—and alerting when it doesn’t.

Mental model

Think of pipelines like public transit:

  • Freshness = how late the bus is.
  • Completeness = how many passengers made it aboard vs expected.
  • Quality = did passengers get to the right stop without being dropped off randomly?
  • Reliability = buses run on schedule with acceptable failure/retry rates.
  • Timeliness = total travel time stays within the timetable.
  • Cost/Resource = fuel used to run the route.

Measure, compare to targets (SLOs), and page only when action is needed.

Key metrics and signals

  • Freshness lag: now() minus last good event time or partition time.
  • Completeness: delivered rows vs expected rows (absolute or percentage).
  • Data quality checks: schema, null rates, duplicates, referential integrity, value ranges.
  • Timeliness: task and DAG duration (p50/p90/p95), queue wait time.
  • Reliability: success rate, retry count, mean time to recovery (MTTR).
  • Cost/Resource: compute hours, memory spill, and the cost of retries (several of these signals are sketched in SQL below).
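
The worked examples below cover freshness and completeness; as a minimal sketch of the remaining signals, the query below computes reliability, retry volume, and timeliness from a run log. It assumes a hypothetical ops.pipeline_runs table with columns dag_id, run_start, run_end, state, and try_number; the names are illustrative and not tied to any specific orchestrator.

-- Reliability and timeliness signals over the last 14 days (hypothetical ops.pipeline_runs table)
SELECT
  dag_id,
  COUNT(*) AS runs,
  AVG(CASE WHEN state = 'success' THEN 1.0 ELSE 0.0 END) AS success_rate,  -- reliability
  SUM(GREATEST(try_number - 1, 0)) AS total_retries,                       -- retries beyond the first attempt
  PERCENTILE_CONT(0.95) WITHIN GROUP (
    ORDER BY EXTRACT(EPOCH FROM (run_end - run_start)) / 60
  ) AS p95_duration_minutes                                                -- timeliness
FROM ops.pipeline_runs
WHERE run_start >= CURRENT_DATE - INTERVAL '14 days'
GROUP BY dag_id;
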
Checklist — define SLOs for a pipeline
  • Freshness SLO: e.g., data less than 30 minutes old during business hours.
  • Completeness SLO: e.g., 95–105% of the same-weekday baseline volume.
  • Quality SLO: critical checks must pass 99.9% of runs.
  • Timeliness SLO: DAG completes in under 20 minutes p95.
  • Reliability SLO: monthly success rate ≥ 99.5% (a check against this target is sketched below).
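
As a sketch of how these targets can be evaluated, the query below checks the reliability and timeliness SLOs for the current month against the same hypothetical ops.pipeline_runs table used above; the 99.5% and 20-minute values mirror this checklist.

-- SLO evaluation for the current month (hypothetical ops.pipeline_runs table)
SELECT
  dag_id,
  AVG(CASE WHEN state = 'success' THEN 1.0 ELSE 0.0 END) >= 0.995 AS reliability_slo_met,  -- success rate >= 99.5%
  PERCENTILE_CONT(0.95) WITHIN GROUP (
    ORDER BY EXTRACT(EPOCH FROM (run_end - run_start)) / 60
  ) < 20 AS timeliness_slo_met                                                             -- p95 under 20 minutes
FROM ops.pipeline_runs
WHERE run_start >= DATE_TRUNC('month', CURRENT_DATE)
GROUP BY dag_id;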

Designing alerts that don’t wake you up unnecessarily

  • Alert only on symptoms users feel, e.g., an upstream outage affecting critical data.
  • Add grace periods and require N consecutive breaches before paging (sketched in SQL after this list).
  • Use severity levels: INFO (notice), WARNING (investigate), CRITICAL (page).
  • Suppress during maintenance windows and while retries are in progress.
  • Include runbook links, owner, last change note, and quick diagnostic steps in the alert body.
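
A minimal sketch of the "N consecutive breaches" rule, assuming each freshness check writes its result to a hypothetical ops.freshness_checks table with columns table_name, checked_at, and lag_minutes; the table name and 45-minute threshold below are illustrative.

-- Page only if the two most recent checks both breached the 45-minute threshold
WITH recent AS (
  SELECT lag_minutes
  FROM ops.freshness_checks
  WHERE table_name = 'orders_fact'
  ORDER BY checked_at DESC
  LIMIT 2
)
SELECT (COUNT(*) = 2 AND MIN(lag_minutes) > 45) AS should_page
FROM recent;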

Worked examples

Example 1 — Freshness alert for a table partition

Goal: Alert if table orders_fact is more than 45 minutes stale during 08:00–20:00.

  1. Compute freshness lag = now() - max(event_time).
  2. Alert if lag > 45 minutes for two consecutive checks (5-minute interval).
  3. Severity: WARNING at first breach, CRITICAL after 2 consecutive breaches.

SQL (warehouse-agnostic pseudocode)
SELECT
  DATE_TRUNC('minute', CURRENT_TIMESTAMP) AS checked_at,
  MAX(event_time) AS last_event_time,
  EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - MAX(event_time)))/60 AS lag_minutes
FROM raw.orders_events;
-- Evaluate lag > 45 and business hours before paging

Example 2 — Completeness check with weekday baseline

Goal: Detect if today’s row count deviates by more than 20% from average of last 4 same weekdays.

WITH baseline AS (
  -- Average daily volume for the same weekday over the last 4 weeks
  SELECT AVG(cnt) AS avg_cnt
  FROM (
    SELECT DATE(day) AS d, COUNT(*) AS cnt
    FROM staging.orders
    WHERE day BETWEEN CURRENT_DATE - INTERVAL '28 days' AND CURRENT_DATE - INTERVAL '1 day'
      AND EXTRACT(ISODOW FROM day) = EXTRACT(ISODOW FROM CURRENT_DATE)
    GROUP BY 1
  ) x
), today AS (
  -- Rows loaded so far today
  SELECT COUNT(*) AS cnt FROM staging.orders WHERE day = CURRENT_DATE
)
SELECT (today.cnt - baseline.avg_cnt) / NULLIF(baseline.avg_cnt, 0) AS pct_diff
FROM today, baseline;

Action: If |pct_diff| > 0.2 for 2 runs and today’s volume > 10k, escalate to CRITICAL.

Example 3 — Retry-aware failure alert

Goal: Don’t page on first failure when retries are configured.

  • Task policy: max_retries = 3, retry_delay = 2m.
  • Alert routing: WARNING on first failure; CRITICAL only if all retries are exhausted or total delay > 10m (see the query sketch below).
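
A minimal sketch of this routing rule, assuming the orchestrator's run metadata is mirrored into a hypothetical ops.task_runs table with columns dag_id, task_id, try_number, max_tries, state, and end_time; real orchestrators expose similar fields under different names.

-- Page (CRITICAL) only when the final allowed attempt has failed
SELECT dag_id, task_id, try_number, end_time
FROM ops.task_runs
WHERE state = 'failed'
  AND try_number >= max_tries                                  -- all retries exhausted
  AND end_time >= CURRENT_TIMESTAMP - INTERVAL '15 minutes';   -- only recent failures
-- Failures with try_number < max_tries are routed as WARNING instead.
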
Example 4 — Duration SLO regression

Goal: Alert if DAG duration p95 doubles vs 14-day baseline.

Compute a rolling baseline p95 per DAG per hour. If the current p95 exceeds 2x the baseline for 30 minutes, send a WARNING; if the regression persists for a full hour, escalate to CRITICAL. One way to compute the comparison is sketched below.
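
A sketch of the comparison, reusing the hypothetical ops.pipeline_runs table from earlier; the "for 30 minutes / for 1 hour" escalation would be applied by the alerting system on top of this query's output.

-- Last hour's p95 duration vs a 14-day baseline, per DAG
WITH durations AS (
  SELECT
    dag_id,
    run_start,
    EXTRACT(EPOCH FROM (run_end - run_start)) / 60 AS duration_minutes
  FROM ops.pipeline_runs
  WHERE state = 'success'
), baseline AS (
  SELECT dag_id,
         PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration_minutes) AS p95_baseline
  FROM durations
  WHERE run_start >= CURRENT_DATE - INTERVAL '14 days'
  GROUP BY dag_id
), last_hour AS (
  SELECT dag_id,
         PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration_minutes) AS p95_now
  FROM durations
  WHERE run_start >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
  GROUP BY dag_id
)
SELECT b.dag_id, l.p95_now, b.p95_baseline,
       l.p95_now > 2 * b.p95_baseline AS duration_regression
FROM baseline b
JOIN last_hour l ON l.dag_id = b.dag_id;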

Hands-on exercises

Complete these, then open the Quick Test below. Everyone can take the test; only logged-in users get saved progress.

Exercise 1 — Freshness lag detector

Given this ingestion log snapshot at 2026-01-08 10:00:00 UTC:

table_name,max_event_time
users,2026-01-08T09:30:00Z
orders,2026-01-08T08:55:00Z
inventory,2026-01-08T09:10:00Z
  • Write SQL or pseudocode to compute lag_minutes = now - max_event_time.
  • Output tables where lag_minutes > 45.
  • Create an example alert payload with fields: component, metric, observed, threshold, severity.

Tip

Use epoch difference in minutes. Assume now = 2026-01-08 10:00:00 UTC.

Exercise 2 — Alert policy design

For DAG orders_daily with tasks extract_api, transform_orders, dq_orders_rowcount, and load_dw:

  • Define thresholds, grace periods, and severity for each task.
  • Include owners, channels, and suppression during maintenance 00:30–00:45 UTC.
  • Represent the policy as a small JSON object.

Checklist — before you ship monitoring
  • Metrics emitted: freshness, row counts, durations, retries.
  • SLOs documented with on-call owner per DAG.
  • Alerts grouped and deduplicated; retries respected.
  • Dashboards show last 24h and 14d baselines.
  • Runbook exists with quick checks and rollback/backfill steps.

Common mistakes and self-check

  • Alerting on every failure even when retries succeed.
    Self-check: Verify alerts trigger only after retries are exhausted or total delay exceeds SLO.
  • Using fixed thresholds that ignore weekday seasonality.
    Self-check: Compare to same weekday baselines or use percentage-of-baseline thresholds.
  • Missing ownership in alerts.
    Self-check: Ensure each alert contains owner, escalation path, and a runbook section.
  • Only monitoring task status, not data quality.
    Self-check: Add checks for nulls, duplicates, referential integrity, and schema drift.

Practical projects

  • Instrument one DAG to emit freshness, completeness, and duration metrics; build a 14-day dashboard with SLO overlays.
  • Create an alert policy file for two pipelines, including grace windows and maintenance suppression.
  • Simulate an upstream outage, observe alerts, and document a runbook entry with mitigation steps.

Next steps

  • Automate baseline calculations (weekday-aware) and store them for comparisons.
  • Add cost and retry metrics to control resource usage.
  • Introduce event-correlation to reduce duplicate alerts across dependent DAGs.

Quick Test

You can take the Quick Test for free. Only logged-in users get saved progress.

Mini challenge

Your nightly sales table finished on time, but analysts report today’s BI dashboard is missing some regions. Propose a 6-step investigation: which metrics and logs do you check first, how do you decide severity, and what temporary mitigation can you apply while you fix the root cause?

Practice Exercises

2 exercises to complete

Instructions

Exercise 1 uses the same ingestion log snapshot as the Hands-on exercises section above (now = 2026-01-08T10:00:00Z): compute lag_minutes = now - max_event_time, return the tables where lag_minutes > 45, and create an example alert payload with component, metric, observed, threshold, and severity.

Expected Output
Tables flagged: orders, inventory. Example payload with severity=CRITICAL for orders (65m > 45m) and inventory (50m > 45m).
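
One way to reproduce this, assuming the snapshot is loaded into a hypothetical ops.ingestion_log table with columns table_name and max_event_time; the alert payload part of the exercise is left to you.

-- lag_minutes at now = 2026-01-08T10:00:00Z
SELECT
  table_name,
  EXTRACT(EPOCH FROM (TIMESTAMPTZ '2026-01-08 10:00:00+00' - max_event_time)) / 60 AS lag_minutes
FROM ops.ingestion_log
WHERE EXTRACT(EPOCH FROM (TIMESTAMPTZ '2026-01-08 10:00:00+00' - max_event_time)) / 60 > 45;
-- Returns orders (65) and inventory (50); users (30) stays under the threshold.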

Monitoring Pipeline Health — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
