
Data Quality And Observability

Learn Data Quality and Observability for Data Platform Engineers for free: roadmap, examples, subskills, and a skill exam.

Published: January 11, 2026 | Updated: January 11, 2026

Why this matters for a Data Platform Engineer

Data Quality and Observability make your platform trustworthy. As a Data Platform Engineer, you enable teams to ship reliable data products by measuring freshness, completeness, and volume, instrumenting pipelines, and setting clear SLIs/SLOs. You’ll reduce the time to detect and resolve incidents, prevent silent data failures, and build confidence for analytics, ML, and downstream services.

What good looks like
  • Every critical dataset has defined SLIs (freshness, volume, schema validity) and SLO targets.
  • Alerts are actionable, low-noise, and include run IDs, owners, and links to logs and traces.
  • Incidents follow a runbook; postmortems are blameless and lead to concrete prevention actions.
  • Quality checks run close to ingestion and fail fast; issues are visible in a central view.

Who this is for

  • Data Platform Engineers building and operating ingestion, transformation, and serving layers.
  • Data Engineers owning pipelines who need reliable, debuggable workflows.

Prerequisites

  • Comfort with SQL and at least one orchestration tool (e.g., Airflow, Dagster, Prefect).
  • Basic Python for pipeline utilities and check automation.
  • Familiarity with metrics/logging concepts (counters, gauges, histograms) and JSON logs.
  • Understanding of version-controlled transformations (e.g., dbt) is a plus.

Learning path

  1. Define SLIs/SLOs: Choose freshness, completeness, and volume SLIs per critical dataset; set initial SLOs.
  2. Instrument pipelines: Emit metrics, logs, and traces with run_id, dataset, and owner.
  3. Quality checks: Add freshness, completeness, and volume checks at ingestion and model outputs.
  4. Anomaly detection: Add baselines and outlier detection to catch silent drifts.
  5. Alerting: Route alerts with severity, context, and runbook links; tune to reduce noise.
  6. Incident management: Establish on-call, triage, escalation, and postmortems.
  7. Integrate frameworks: Standardize with a DQ framework (e.g., Great Expectations/dbt tests).

Worked examples

Example 1: SQL checks for freshness, completeness, and volume

Run near ingestion to fail fast. Replace table/column names as needed.

-- Freshness: alert if last event older than 30 minutes
SELECT NOW() - MAX(event_timestamp) AS freshness_lag
FROM raw_events;

-- Completeness: required columns non-null ratio
SELECT
  100.0 * SUM(CASE WHEN user_id IS NOT NULL THEN 1 ELSE 0 END) / COUNT(*) AS pct_user_id_present,
  100.0 * SUM(CASE WHEN event_type IS NOT NULL THEN 1 ELSE 0 END) / COUNT(*) AS pct_event_type_present
FROM raw_events
WHERE event_date = CURRENT_DATE;

-- Volume: expected daily row count >= baseline * 0.9
WITH baseline AS (
  SELECT AVG(daily_count) AS avg_count
  FROM (
    SELECT event_date, COUNT(*) AS daily_count
    FROM raw_events
    WHERE event_date BETWEEN CURRENT_DATE - INTERVAL '14 days' AND CURRENT_DATE - INTERVAL '1 day'
    GROUP BY event_date
  ) x
)
SELECT COUNT(*) AS today_count, b.avg_count
FROM raw_events
CROSS JOIN baseline b
WHERE event_date = CURRENT_DATE
GROUP BY b.avg_count;
Pass/Fail logic idea
  • freshness_lag ≤ 30 minutes
  • pct_user_id_present ≥ 99.5%
  • today_count ≥ 0.9 * avg_count
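
The pass/fail rules above can be wired into a small gate function. A minimal Python sketch (the function name and argument shape are assumptions; the thresholds match the list):

```python
from datetime import timedelta

# Thresholds from the pass/fail rules above
FRESHNESS_MAX = timedelta(minutes=30)
COMPLETENESS_MIN_PCT = 99.5
VOLUME_MIN_RATIO = 0.9

def evaluate_checks(freshness_lag, pct_user_id_present, today_count, avg_count):
    """Map each check result to True (pass) / False (fail)."""
    return {
        "freshness": freshness_lag <= FRESHNESS_MAX,
        "completeness_user_id": pct_user_id_present >= COMPLETENESS_MIN_PCT,
        "volume": today_count >= VOLUME_MIN_RATIO * avg_count,
    }

# Example: values as they might come back from the SQL checks
results = evaluate_checks(
    freshness_lag=timedelta(minutes=12),
    pct_user_id_present=99.8,
    today_count=95_000,
    avg_count=100_000,
)
failed = [name for name, ok in results.items() if not ok]
print("FAIL" if failed else "PASS", failed)
```

Failing the pipeline run when `failed` is non-empty keeps bad data from propagating downstream.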

Example 2: Python anomaly detection on pipeline metrics

Detect unusual drops in rows written per hour using z-score.

import pandas as pd
import numpy as np

# df has columns: hour_ts, rows_written
# Ensure hourly aggregation upstream
z_window = 48  # two days of hourly data

# Compute rolling stats
roll_mean = df['rows_written'].rolling(z_window, min_periods=24).mean()
roll_std = df['rows_written'].rolling(z_window, min_periods=24).std(ddof=0)

z = (df['rows_written'] - roll_mean) / (roll_std.replace(0, np.nan))

# Flag anomalies when significantly below expected
df['is_anomaly'] = (z < -3) & (df['rows_written'] < roll_mean * 0.8)

recent = df.tail(6)
recent_anomalies = recent[recent['is_anomaly']]
if not recent_anomalies.empty:
    print("ALERT: possible volume anomaly in last 6 hours")
Tuning tips
  • Use seasonal baselines (hour-of-day) for business cycles.
  • Apply minimum std-dev floor to avoid divide-by-zero.
  • Require consecutive anomalies before alerting to cut noise.

Example 3: SLIs/SLOs definition (YAML)

slo:
  product: "dim_users"
  owner: "data-platform"
  indicators:
    - name: "freshness_minutes"
      objective: { target: 99.5, window: "30d" }
      error_budget_policy:
        warn_at_percent_budget: 50
        page_at_percent_budget: 100
      thresholds:
        warn: 20   # minutes since last update
        critical: 60
    - name: "completeness_user_id_pct"
      objective: { target: 99.9, window: "30d" }
      thresholds:
        warn: 99.7
        critical: 99.5
    - name: "daily_volume_rows"
      objective: { target: 99.0, window: "30d" }
      baseline: "14d_moving_avg"
      thresholds:
        warn_drop_pct: 10
        critical_drop_pct: 20

Store alongside code and review with stakeholders.
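
One way such a spec could be consumed is to track error-budget burn. A sketch assuming minute-level good/total counts; the warn/page percentages mirror the YAML above but the function itself is not from any specific tool:

```python
def error_budget_status(target_pct, good_events, total_events,
                        warn_at=50, page_at=100):
    """Return (% of error budget consumed, action) for an SLO window."""
    allowed_bad = total_events * (100 - target_pct) / 100.0
    actual_bad = total_events - good_events
    consumed_pct = 100.0 * actual_bad / allowed_bad if allowed_bad else float("inf")
    if consumed_pct >= page_at:
        action = "page"
    elif consumed_pct >= warn_at:
        action = "warn"
    else:
        action = "ok"
    return consumed_pct, action

# 30d window of minutes (43200); freshness SLI was met in 43050 of them (target 99.5)
consumed, action = error_budget_status(99.5, good_events=43050, total_events=43200)
print(f"budget consumed: {consumed:.0f}% -> {action}")
```

Paging on budget burn rather than on every threshold breach keeps alert volume proportional to user impact.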

Example 4: Centralized logging + metrics

# JSON log (one line)
{"ts":"2026-01-11T02:34:11Z","severity":"INFO","pipeline":"ingest_users","event":"rows_written","count":12345,"dataset":"stg_users","run_id":"2026-01-11-0230","trace_id":"f8b2...","owner":"data-platform"}

# Prometheus-style metrics
pipeline_rows_written{pipeline="ingest_users",dataset="stg_users"} 12345
pipeline_run_duration_seconds{pipeline="ingest_users"} 312.7
pipeline_errors_total{pipeline="ingest_users",stage="load"} 0
Include these fields
  • run_id, trace_id, pipeline, dataset, stage
  • owner, environment, version (git SHA)
  • counts, durations, error messages
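
A helper emitting one-line JSON logs with those fields might look like this (a sketch; the function name and defaulting of run_id/trace_id are assumptions):

```python
import json
import time
import uuid

def make_log_record(pipeline, dataset, stage, event, owner, **extra):
    """Build a one-line JSON log carrying the correlation fields listed above."""
    record = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "severity": "INFO",
        "pipeline": pipeline,
        "dataset": dataset,
        "stage": stage,
        "event": event,
        "owner": owner,
        "run_id": extra.pop("run_id", str(uuid.uuid4())),
        "trace_id": extra.pop("trace_id", uuid.uuid4().hex),
    }
    record.update(extra)  # counts, durations, environment, version (git SHA), ...
    return json.dumps(record, separators=(",", ":"))

line = make_log_record(
    "ingest_users", "stg_users", "load", "rows_written",
    owner="data-platform", count=12345, environment="prod", version="abc1234",
)
print(line)
```

In a real pipeline, pass the orchestrator's run ID in rather than generating one, so logs, metrics, and traces share the same correlation keys.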

Example 5: Data quality framework integration (Great Expectations-like)

# great_expectations.yml style snippet
expectations:
  dataset: "dim_users"
  owner: "data-platform"
  tests:
    - expect_table_row_count_to_be_between:
        min_value: 0.9 * ref("dim_users", "rowcount_14d_avg")  # pseudo-expression; resolve to a number before running
    - expect_column_values_to_not_be_null:
        column: "user_id"
        mostly: 0.999
    - expect_column_values_to_be_in_set:
        column: "status"
        value_set: ["active","inactive","trial"]
    - expect_table_columns_to_match_ordered_list:
        column_list: ["user_id","email","status","created_at","updated_at"]

Run checks in CI for schema changes, and in production runs for data drift.
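
If you are not ready to adopt a framework, the same expectations can be approximated as plain functions over a DataFrame. A sketch (function names mimic, but are not, the Great Expectations API; the sample data is made up):

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 3, None],
    "email": ["a@x", "b@x", "c@x", "d@x"],
    "status": ["active", "trial", "inactive", "active"],
    "created_at": pd.to_datetime(["2026-01-01"] * 4),
    "updated_at": pd.to_datetime(["2026-01-02"] * 4),
})

def expect_row_count_between(df, min_value):
    return len(df) >= min_value

def expect_not_null(df, column, mostly=1.0):
    # fraction of non-null values must be at least `mostly`
    return df[column].notna().mean() >= mostly

def expect_in_set(df, column, value_set):
    return df[column].dropna().isin(value_set).all()

def expect_columns(df, column_list):
    return list(df.columns) == column_list

results = {
    "row_count": expect_row_count_between(df, min_value=3),
    "user_id_not_null": expect_not_null(df, "user_id", mostly=0.999),
    "status_in_set": expect_in_set(df, "status", {"active", "inactive", "trial"}),
    "schema": expect_columns(df, ["user_id", "email", "status", "created_at", "updated_at"]),
}
print(results)
```

Here the null `user_id` makes the completeness check fail while the others pass, which is exactly the signal you want surfaced in CI.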

Drills & exercises

  • Define three SLIs for a critical table and propose initial SLO targets. ☐
  • Write SQL to compute freshness lag and completeness for two required columns. ☐
  • Add run_id and trace_id to an existing pipeline’s logs. ☐
  • Create a 14-day rolling baseline for daily volume and flag a 15% drop. ☐
  • Draft an alert message template with owner, severity, and remediation steps. ☐
  • Simulate a failure; write a five-step incident runbook to triage and recover. ☐

Common mistakes

  • Measuring too much, alerting on everything: Pick a small set of SLIs tied to user impact.
  • No owners in alerts: Every dataset/pipeline should have a clear owner and on-call rotation.
  • Checks only at the warehouse: Add checks at ingestion and staging to fail fast.
  • Static thresholds in seasonal data: Use baselines by hour/day to avoid false positives.
  • Skipping postmortems: Without learning loops, the same incidents repeat.

Debugging playbook

Freshness breach
  1. Check orchestrator: did the job trigger? If not, inspect scheduler logs.
  2. If triggered, inspect source API/queue lag; compare source timestamps vs load time.
  3. Re-run incremental load with a safe backfill window.
  4. Add a guard: fail pipeline if source watermark doesn’t advance.
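
The guard in step 4 can be sketched as a watermark check; this version keeps state in memory for illustration (a real pipeline would persist the watermark in a state store):

```python
class WatermarkGuard:
    """Fail the run if the source high-water mark did not advance."""

    def __init__(self):
        self._last = {}  # dataset -> last seen watermark (in-memory for the sketch)

    def check_and_update(self, dataset, new_watermark):
        last = self._last.get(dataset)
        if last is not None and new_watermark <= last:
            raise RuntimeError(
                f"watermark for {dataset} did not advance: {new_watermark} <= {last}"
            )
        self._last[dataset] = new_watermark

guard = WatermarkGuard()
guard.check_and_update("raw_events", "2026-01-11T02:00:00Z")
guard.check_and_update("raw_events", "2026-01-11T03:00:00Z")  # advances: ok
try:
    guard.check_and_update("raw_events", "2026-01-11T03:00:00Z")  # stale: fails
except RuntimeError as e:
    print("GUARD:", e)
```

ISO-8601 timestamps compare correctly as strings, which keeps the sketch simple; use real datetimes in production.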
Volume drop
  1. Compare today vs 14-day baseline by hour-of-day.
  2. Check filters or schema changes that reduced joins.
  3. Validate deduplication logic didn’t over-trim rows.
  4. Sample raw files/messages for late arrivals.
Completeness drop
  1. Identify columns with rising nulls; trace back to upstream change.
  2. Add defaulting or backfill steps; notify data producers.
  3. Introduce a contract test to block future regressions.
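
The contract test in step 3 can start as a simple schema assertion run in CI. A sketch (the CONTRACT mapping and sample frames are assumptions):

```python
import pandas as pd

# Hypothetical contract: required columns and their expected dtypes
CONTRACT = {
    "user_id": "int64",
    "email": "object",
    "status": "object",
}

def check_contract(df, contract):
    """Return a list of violations: missing columns or dtype mismatches."""
    violations = []
    for col, dtype in contract.items():
        if col not in df.columns:
            violations.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            violations.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    return violations

good = pd.DataFrame({"user_id": [1], "email": ["a@x"], "status": ["active"]})
bad = good.astype({"user_id": "float64"})  # upstream change widened the type
print(check_contract(good, CONTRACT))  # []
print(check_contract(bad, CONTRACT))
```

Blocking merges on a non-empty violation list turns upstream schema drift into a reviewable change instead of a production incident.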

Mini project: Observability for a critical dataset

Goal: Make dim_users reliable and observable.

  1. Define SLIs/SLOs for freshness, completeness, and volume.
  2. Add SQL checks in staging and model outputs.
  3. Instrument your pipeline with logs and Prometheus-style metrics.
  4. Implement a simple anomaly detector for hourly rows written.
  5. Create two alert routes: warn (chat) and critical (paging).
  6. Write a 1-page incident runbook and a postmortem template.
Deliverables checklist
  • YAML with SLIs/SLOs and owners
  • SQL check scripts and pass/fail criteria
  • Metric emits and example logs
  • Anomaly detection code and thresholds
  • Alert templates and runbook

Subskills

  • Freshness Completeness Volume Checks — Build baseline checks at ingestion and model outputs; define pass/fail and ownership.
  • Anomaly Detection For Pipelines — Detect drift using rolling baselines and z-scores or seasonality-aware methods.
  • SLIs SLOs For Data Products — Turn data expectations into measurable service health targets.
  • Centralized Logging Metrics Tracing — Standardize structured logs, metrics, and traces with correlation IDs.
  • Incident Management And On Call — Establish triage, escalation, and communication patterns.
  • Root Cause Analysis And Postmortems — Run blameless reviews and track prevention actions.
  • Alert Tuning And Noise Reduction — Reduce flapping with hysteresis, deduping, and severity routing.
  • Data Quality Framework Integration — Use a framework to standardize checks and reporting.

Next steps

  • Pick one critical dataset and implement the full loop: SLIs/SLOs → checks → alerts → runbook.
  • Measure alert noise for a week and tune thresholds and routing.
  • Roll out a DQ framework to two more datasets; standardize owners and documentation.

Data Quality And Observability — Skill Exam

This exam checks practical understanding of data quality and observability for platform engineers. You can take it for free. Sign in to save progress and resume later. You can retry questions; your best score counts.

12 questions · 70% to pass
