Why this matters for a Data Platform Engineer
Data Quality and Observability make your platform trustworthy. As a Data Platform Engineer, you enable teams to ship reliable data products by measuring freshness, completeness, and volume, instrumenting pipelines, and setting clear SLIs/SLOs. You’ll reduce incident resolution time, prevent silent data failures, and build confidence for analytics, ML, and downstream services.
What good looks like
- Every critical dataset has defined SLIs (freshness, volume, schema validity) and SLO targets.
- Alerts are actionable, low-noise, and include run IDs, owners, and links to logs and traces.
- Incidents follow a runbook; postmortems are blameless and lead to concrete prevention actions.
- Quality checks run close to ingestion and fail fast; issues are visible in a central view.
Who this is for
- Data Platform Engineers building and operating ingestion, transformation, and serving layers.
- Data Engineers owning pipelines who need reliable, debuggable workflows.
Prerequisites
- Comfort with SQL and at least one orchestration tool (e.g., Airflow, Dagster, Prefect).
- Basic Python for pipeline utilities and check automation.
- Familiarity with metrics/logging concepts (counters, gauges, histograms) and JSON logs.
- Understanding of version-controlled transformations (e.g., dbt) is a plus.
Learning path
- Define SLIs/SLOs: Choose freshness, completeness, and volume SLIs per critical dataset; set initial SLOs.
- Instrument pipelines: Emit metrics, logs, and traces with run_id, dataset, and owner.
- Quality checks: Add freshness, completeness, and volume checks at ingestion and model outputs.
- Anomaly detection: Add baselines and outlier detection to catch silent drifts.
- Alerting: Route alerts with severity, context, and runbook links; tune to reduce noise.
- Incident management: Establish on-call, triage, escalation, and postmortems.
- Integrate frameworks: Standardize with a DQ framework (e.g., Great Expectations/dbt tests).
Worked examples
Example 1: SQL checks for freshness, completeness, and volume
Run near ingestion to fail fast. Replace table/column names as needed.
-- Freshness: alert if last event older than 30 minutes
SELECT NOW() - MAX(event_timestamp) AS freshness_lag
FROM raw_events;
-- Completeness: required columns non-null ratio
SELECT
  100.0 * SUM(CASE WHEN user_id IS NOT NULL THEN 1 ELSE 0 END) / COUNT(*) AS pct_user_id_present,
  100.0 * SUM(CASE WHEN event_type IS NOT NULL THEN 1 ELSE 0 END) / COUNT(*) AS pct_event_type_present
FROM raw_events
WHERE event_date = CURRENT_DATE;
-- Volume: expected daily row count >= baseline * 0.9
WITH baseline AS (
  SELECT AVG(daily_count) AS avg_count
  FROM (
    SELECT event_date, COUNT(*) AS daily_count
    FROM raw_events
    WHERE event_date BETWEEN CURRENT_DATE - INTERVAL '14 days' AND CURRENT_DATE - INTERVAL '1 day'
    GROUP BY event_date
  ) x
)
SELECT
  -- Scalar subquery keeps a result row even if today has zero rows, so a total outage still surfaces.
  (SELECT COUNT(*) FROM raw_events WHERE event_date = CURRENT_DATE) AS today_count,
  b.avg_count
FROM baseline b;
Pass/Fail logic idea
- freshness_lag <= 30 minutes
- pct_user_id_present ≥ 99.5%
- today_count ≥ 0.9 * avg_count
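The sketch below shows one way to automate that pass/fail logic in Python: collect the query outputs into a dict, compare them against thresholds, and raise so the orchestrator marks the run as failed. Metric names mirror the queries above; the example values and thresholds are illustrative.
# Sketch: evaluate check results against thresholds and fail fast.
from datetime import timedelta

THRESHOLDS = {
    "freshness_lag_max": timedelta(minutes=30),
    "pct_user_id_present_min": 99.5,
    "volume_ratio_min": 0.9,
}

def evaluate(metrics: dict) -> list[str]:
    """Return the names of failed checks; an empty list means all checks passed."""
    failures = []
    if metrics["freshness_lag"] > THRESHOLDS["freshness_lag_max"]:
        failures.append("freshness")
    if metrics["pct_user_id_present"] < THRESHOLDS["pct_user_id_present_min"]:
        failures.append("completeness_user_id")
    if metrics["today_count"] < THRESHOLDS["volume_ratio_min"] * metrics["avg_count"]:
        failures.append("volume")
    return failures

# Example values (illustrative); raising makes the orchestrator mark the run red.
failures = evaluate({
    "freshness_lag": timedelta(minutes=12),
    "pct_user_id_present": 99.8,
    "today_count": 11800,
    "avg_count": 12500,
})
if failures:
    raise RuntimeError(f"Data quality checks failed: {failures}")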
Example 2: Python anomaly detection on pipeline metrics
Detect unusual drops in rows written per hour using a rolling z-score.
import pandas as pd
import numpy as np
# df has columns: hour_ts, rows_written
# Ensure hourly aggregation upstream
z_window = 48 # two days of hourly data
# Compute rolling stats
roll_mean = df['rows_written'].rolling(z_window, min_periods=24).mean()
roll_std = df['rows_written'].rolling(z_window, min_periods=24).std(ddof=0)
z = (df['rows_written'] - roll_mean) / (roll_std.replace(0, np.nan))
# Flag anomalies when significantly below expected
df['is_anomaly'] = (z < -3) & (df['rows_written'] < roll_mean * 0.8)
recent = df.tail(6)  # look at the most recent 6 hours
recent_anomalies = recent[recent['is_anomaly']]
if not recent_anomalies.empty:
    print("ALERT: possible volume anomaly in last 6 hours")
Tuning tips
- Use seasonal baselines (hour-of-day) for business cycles.
- Apply minimum std-dev floor to avoid divide-by-zero.
- Require consecutive anomalies before alerting to cut noise.
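A sketch applying these tuning tips to the same df from Example 2 (column names as above; the std-dev floor and thresholds are illustrative):
# Sketch: hour-of-day seasonal baseline plus a consecutive-anomaly rule.
import pandas as pd

df['hour_of_day'] = pd.to_datetime(df['hour_ts']).dt.hour

# Baseline per hour-of-day from trailing history (everything except the most recent day).
history = df.iloc[:-24]
seasonal = history.groupby('hour_of_day')['rows_written'].agg(['mean', 'std'])
seasonal['std'] = seasonal['std'].clip(lower=1.0)  # std-dev floor avoids divide-by-zero

df = df.join(seasonal, on='hour_of_day')
df['z_seasonal'] = (df['rows_written'] - df['mean']) / df['std']
df['is_anomaly'] = df['z_seasonal'] < -3

# Require two consecutive anomalous hours before alerting to cut noise.
df['consecutive'] = df['is_anomaly'] & df['is_anomaly'].shift(1, fill_value=False)
if df['consecutive'].tail(6).any():
    print("ALERT: sustained volume drop vs hour-of-day baseline")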
Example 3: SLIs/SLOs definition (YAML)
slo:
  product: "dim_users"
  owner: "data-platform"
  indicators:
    - name: "freshness_minutes"
      objective: { target: 99.5, window: "30d" }
      error_budget_policy:
        warn_at_percent_budget: 50
        page_at_percent_budget: 100
      thresholds:
        warn: 20        # minutes since last update
        critical: 60
    - name: "completeness_user_id_pct"
      objective: { target: 99.9, window: "30d" }
      thresholds:
        warn: 99.7
        critical: 99.5
    - name: "daily_volume_rows"
      objective: { target: 99.0, window: "30d" }
      baseline: "14d_moving_avg"
      thresholds:
        warn_drop_pct: 10
        critical_drop_pct: 20
Store alongside code and review with stakeholders.
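A sketch of consuming this file and evaluating the freshness indicator, assuming PyYAML; the file name and measurement values are illustrative:
# Sketch: load the SLO file and evaluate thresholds and error budget for one indicator.
import yaml

with open("slo_dim_users.yml") as f:  # assumed file name
    slo = yaml.safe_load(f)["slo"]

freshness = next(i for i in slo["indicators"] if i["name"] == "freshness_minutes")

measured_lag_minutes = 25          # current freshness lag (illustrative)
good_minutes_in_window = 42_900    # minutes within threshold over the 30d window (illustrative)
total_minutes_in_window = 43_200   # 30 days * 24 h * 60 min

# Threshold check for the current measurement.
if measured_lag_minutes >= freshness["thresholds"]["critical"]:
    print("CRITICAL: freshness beyond critical threshold")
elif measured_lag_minutes >= freshness["thresholds"]["warn"]:
    print("WARN: freshness beyond warning threshold")

# Error budget: allowed bad minutes = (1 - target) * window.
target = freshness["objective"]["target"] / 100.0
budget = (1 - target) * total_minutes_in_window
burned = total_minutes_in_window - good_minutes_in_window
burn_pct = 100.0 * burned / budget
policy = freshness["error_budget_policy"]
if burn_pct >= policy["page_at_percent_budget"]:
    print("PAGE: error budget exhausted")
elif burn_pct >= policy["warn_at_percent_budget"]:
    print("WARN: error budget burn above policy threshold")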
Example 4: Centralized logging + metrics
// JSON log (one line)
{"ts":"2026-01-11T02:34:11Z","severity":"INFO","pipeline":"ingest_users","event":"rows_written","count":12345,"dataset":"stg_users","run_id":"2026-01-11-0230","trace_id":"f8b2...","owner":"data-platform"}
# Prometheus-style metrics
pipeline_rows_written{pipeline="ingest_users",dataset="stg_users"} 12345
pipeline_run_duration_seconds{pipeline="ingest_users"} 312.7
pipeline_errors_total{pipeline="ingest_users",stage="load"} 0
Include these fields
- run_id, trace_id, pipeline, dataset, stage
- owner, environment, version (git SHA)
- counts, durations, error messages
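A minimal Python sketch for emitting a log line like the one above with those fields; the logger setup and field values are illustrative:
# Sketch: emit one structured JSON log line per pipeline event with correlation fields.
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("ingest_users")

def log_event(event: str, **fields):
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "severity": "INFO",
        "pipeline": "ingest_users",
        "dataset": "stg_users",
        "stage": "load",
        "owner": "data-platform",
        "environment": "prod",
        "version": "a1b2c3d",  # git SHA of the deployed code (illustrative)
        "event": event,
        **fields,
    }
    log.info(json.dumps(record))

log_event("rows_written", count=12345, run_id="2026-01-11-0230", trace_id="f8b2...")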
Example 5: Data quality framework integration (Great Expectations-like)
# great_expectations.yml style snippet
expectations:
  dataset: "dim_users"
  owner: "data-platform"
  tests:
    - expect_table_row_count_to_be_between:
        min_value: 0.9 * ref("dim_users", "rowcount_14d_avg")
    - expect_column_values_to_not_be_null:
        column: "user_id"
        mostly: 0.999
    - expect_column_values_to_be_in_set:
        column: "status"
        value_set: ["active", "inactive", "trial"]
    - expect_table_columns_to_match_ordered_list:
        column_list: ["user_id", "email", "status", "created_at", "updated_at"]
Run checks in CI for schema changes, and in production runs for data drift.
Drills & exercises
- Define three SLIs for a critical table and propose initial SLO targets. ☐
- Write SQL to compute freshness lag and completeness for two required columns. ☐
- Add run_id and trace_id to an existing pipeline’s logs. ☐
- Create a 14-day rolling baseline for daily volume and flag a 15% drop. ☐
- Draft an alert message template with owner, severity, and remediation steps. ☐
- Simulate a failure; write a five-step incident runbook to triage and recover. ☐
Common mistakes
- Measuring too much, alerting on everything: Pick a small set of SLIs tied to user impact.
- No owners in alerts: Every dataset/pipeline should have a clear owner and on-call rotation.
- Checks only at the warehouse: Add checks at ingestion and staging to fail fast.
- Static thresholds in seasonal data: Use baselines by hour/day to avoid false positives.
- Skipping postmortems: Without learning loops, the same incidents repeat.
Debugging playbook
Freshness breach
- Check orchestrator: did the job trigger? If not, inspect scheduler logs.
- If triggered, inspect source API/queue lag; compare source timestamps vs load time.
- Re-run incremental load with a safe backfill window.
- Add a guard: fail the pipeline if the source watermark doesn’t advance (sketched below).
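A sketch of such a guard; the cursor and the watermark state handling are assumptions, so adapt it to your orchestrator and state store:
# Sketch: fail the run if the source watermark did not advance since the last successful run.
def guard_watermark(cur, last_watermark):
    cur.execute("SELECT MAX(event_timestamp) FROM raw_events")
    (current_watermark,) = cur.fetchone()
    if current_watermark is None or (last_watermark is not None and current_watermark <= last_watermark):
        raise RuntimeError(
            f"Source watermark did not advance (last={last_watermark}, current={current_watermark}); "
            "failing fast instead of loading stale data"
        )
    return current_watermark  # persist this as the new watermark after a successful load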
Volume drop
- Compare today vs 14-day baseline by hour-of-day.
- Check for new filters or schema changes that reduced join matches.
- Validate deduplication logic didn’t over-trim rows.
- Sample raw files/messages for late arrivals.
Completeness drop
- Identify columns with rising nulls; trace back to upstream change.
- Add defaulting or backfill steps; notify data producers.
- Introduce a contract test to block future regressions (sketched below).
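A minimal contract-test sketch in pytest style; the warehouse_cursor fixture and the expected column list are assumptions for illustration:
# Sketch: schema contract test that blocks deploys when upstream drops or renames a column.
EXPECTED_COLUMNS = ["user_id", "email", "status", "created_at", "updated_at"]

def test_dim_users_schema_contract(warehouse_cursor):
    warehouse_cursor.execute("""
        SELECT column_name
        FROM information_schema.columns
        WHERE table_name = 'dim_users'
        ORDER BY ordinal_position
    """)
    actual = [row[0] for row in warehouse_cursor.fetchall()]
    assert actual == EXPECTED_COLUMNS, f"Schema drift detected: {actual}"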
Mini project: Observability for a critical dataset
Goal: Make dim_users reliable and observable.
- Define SLIs/SLOs for freshness, completeness, and volume.
- Add SQL checks in staging and model outputs.
- Instrument your pipeline with logs and Prometheus-style metrics.
- Implement a simple anomaly detector for hourly rows written.
- Create two alert routes: warn (chat) and critical (paging); a routing sketch follows this list.
- Write a 1-page incident runbook and a postmortem template.
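A routing sketch for the warn/critical split; the webhook URLs, paging endpoint, and payload shape are assumptions for illustration:
# Sketch: route alerts by severity - warn goes to chat, critical also pages the on-call.
import json
import urllib.request

CHAT_WEBHOOK = "https://chat.example.com/hooks/data-platform"   # assumed
PAGING_WEBHOOK = "https://paging.example.com/v1/enqueue"        # assumed

def send(url, payload):
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

def route_alert(severity, dataset, message, run_id, owner, runbook_url):
    payload = {
        "severity": severity,
        "dataset": dataset,
        "message": message,
        "run_id": run_id,
        "owner": owner,
        "runbook": runbook_url,
    }
    if severity == "critical":
        send(PAGING_WEBHOOK, payload)  # page the on-call
    send(CHAT_WEBHOOK, payload)        # every alert also lands in chat for visibility

route_alert("warn", "dim_users", "Freshness lag 25 min (warn at 20)",
            run_id="2026-01-11-0230", owner="data-platform",
            runbook_url="https://wiki.example.com/runbooks/dim_users")  # assumed URL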
Deliverables checklist
- YAML with SLIs/SLOs and owners
- SQL check scripts and pass/fail criteria
- Metric emits and example logs
- Anomaly detection code and thresholds
- Alert templates and runbook
Subskills
- Freshness Completeness Volume Checks — Build baseline checks at ingestion and model outputs; define pass/fail and ownership.
- Anomaly Detection For Pipelines — Detect drift using rolling baselines and z-scores or seasonality-aware methods.
- SLIs SLOs For Data Products — Turn data expectations into measurable service health targets.
- Centralized Logging Metrics Tracing — Standardize structured logs, metrics, and traces with correlation IDs.
- Incident Management And On Call — Establish triage, escalation, and communication patterns.
- Root Cause Analysis And Postmortems — Run blameless reviews and track prevention actions.
- Alert Tuning And Noise Reduction — Reduce flapping with hysteresis, deduping, and severity routing.
- Data Quality Framework Integration — Use a framework to standardize checks and reporting.
Next steps
- Pick one critical dataset and implement the full loop: SLIs/SLOs → checks → alerts → runbook.
- Measure alert noise for a week and tune thresholds and routing.
- Roll out a DQ framework to two more datasets; standardize owners and documentation.