Why this matters for a Data Platform Engineer
Data Quality and Observability make your platform trustworthy. As a Data Platform Engineer, you enable teams to ship reliable data products by measuring freshness, completeness, and volume, instrumenting pipelines, and setting clear SLIs/SLOs. You’ll reduce incident resolution time, prevent silent data failures, and build confidence for analytics, ML, and downstream services.
What good looks like
- Every critical dataset has defined SLIs (freshness, volume, schema validity) and SLO targets.
- Alerts are actionable, low-noise, and include run IDs, owners, and links to logs and traces.
- Incidents follow a runbook; postmortems are blameless and lead to concrete prevention actions.
- Quality checks run close to ingestion and fail fast; issues are visible in a central view.
Who this is for
- Data Platform Engineers building and operating ingestion, transformation, and serving layers.
- Data Engineers owning pipelines who need reliable, debuggable workflows.
Prerequisites
- Comfort with SQL and at least one orchestration tool (e.g., Airflow, Dagster, Prefect).
- Basic Python for pipeline utilities and check automation.
- Familiarity with metrics/logging concepts (counters, gauges, histograms) and JSON logs.
- Understanding of version-controlled transformations (e.g., dbt) is a plus.
Learning path
- Define SLIs/SLOs: Choose freshness, completeness, and volume SLIs per critical dataset; set initial SLOs.
- Instrument pipelines: Emit metrics, logs, and traces with run_id, dataset, and owner.
- Quality checks: Add freshness, completeness, and volume checks at ingestion and model outputs.
- Anomaly detection: Add baselines and outlier detection to catch silent drifts.
- Alerting: Route alerts with severity, context, and runbook links; tune to reduce noise.
- Incident management: Establish on-call, triage, escalation, and postmortems.
- Integrate frameworks: Standardize with a DQ framework (e.g., Great Expectations/dbt tests).
Worked examples
Example 1: SQL checks for freshness, completeness, and volume
Run near ingestion to fail fast. Replace table/column names as needed.
-- Freshness: alert if last event older than 30 minutes
SELECT NOW() - MAX(event_timestamp) AS freshness_lag
FROM raw_events;
-- Completeness: required columns non-null ratio
SELECT
  100.0 * SUM(CASE WHEN user_id IS NOT NULL THEN 1 ELSE 0 END) / COUNT(*) AS pct_user_id_present,
  100.0 * SUM(CASE WHEN event_type IS NOT NULL THEN 1 ELSE 0 END) / COUNT(*) AS pct_event_type_present
FROM raw_events
WHERE event_date = CURRENT_DATE;
-- Volume: expected daily row count >= baseline * 0.9
WITH baseline AS (
  SELECT AVG(daily_count) AS avg_count
  FROM (
    SELECT event_date, COUNT(*) AS daily_count
    FROM raw_events
    WHERE event_date BETWEEN CURRENT_DATE - INTERVAL '14 days' AND CURRENT_DATE - INTERVAL '1 day'
    GROUP BY event_date
  ) x
)
SELECT
  -- Scalar subquery keeps a result row even if today has zero rows, so a total outage still surfaces.
  (SELECT COUNT(*) FROM raw_events WHERE event_date = CURRENT_DATE) AS today_count,
  b.avg_count
FROM baseline b;
Pass/Fail logic idea
- freshness_lag <= 30 minutes
- pct_user_id_present ≥ 99.5%
- today_count ≥ 0.9 * avg_count
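The sketch below shows one way to automate that pass/fail logic in Python: collect the query outputs into a dict, compare them against thresholds, and raise so the orchestrator marks the run as failed. Metric names mirror the queries above; the example values and thresholds are illustrative.
# Sketch: evaluate check results against thresholds and fail fast.
from datetime import timedelta

THRESHOLDS = {
    "freshness_lag_max": timedelta(minutes=30),
    "pct_user_id_present_min": 99.5,
    "volume_ratio_min": 0.9,
}

def evaluate(metrics: dict) -> list[str]:
    """Return the names of failed checks; an empty list means all checks passed."""
    failures = []
    if metrics["freshness_lag"] > THRESHOLDS["freshness_lag_max"]:
        failures.append("freshness")
    if metrics["pct_user_id_present"] < THRESHOLDS["pct_user_id_present_min"]:
        failures.append("completeness_user_id")
    if metrics["today_count"] < THRESHOLDS["volume_ratio_min"] * metrics["avg_count"]:
        failures.append("volume")
    return failures

# Example values (illustrative); raising makes the orchestrator mark the run red.
failures = evaluate({
    "freshness_lag": timedelta(minutes=12),
    "pct_user_id_present": 99.8,
    "today_count": 11800,
    "avg_count": 12500,
})
if failures:
    raise RuntimeError(f"Data quality checks failed: {failures}")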
Example 2: Python anomaly detection on pipeline metrics
Detect unusual drops in rows written per hour using a rolling z-score.
import pandas as pd
import numpy as np
# df has columns: hour_ts, rows_written
# Ensure hourly aggregation upstream
z_window = 48 # two days of hourly data
# Compute rolling stats
roll_mean = df['rows_written'].rolling(z_window, min_periods=24).mean()
roll_std = df['rows_written'].rolling(z_window, min_periods=24).std(ddof=0)
z = (df['rows_written'] - roll_mean) / (roll_std.replace(0, np.nan))
# Flag anomalies when significantly below expected
df['is_anomaly'] = (z < -3) & (df['rows_written'] < roll_mean * 0.8)
recent = df.tail(6)  # look at the most recent 6 hours
recent_anomalies = recent[recent['is_anomaly']]
if not recent_anomalies.empty:
    print("ALERT: possible volume anomaly in last 6 hours")
Tuning tips
- Use seasonal baselines (hour-of-day) for business cycles.
- Apply minimum std-dev floor to avoid divide-by-zero.
- Require consecutive anomalies before alerting to cut noise.
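A sketch applying these tuning tips to the same df from Example 2 (column names as above; the std-dev floor and thresholds are illustrative):
# Sketch: hour-of-day seasonal baseline plus a consecutive-anomaly rule.
import pandas as pd

df['hour_of_day'] = pd.to_datetime(df['hour_ts']).dt.hour

# Baseline per hour-of-day from trailing history (everything except the most recent day).
history = df.iloc[:-24]
seasonal = history.groupby('hour_of_day')['rows_written'].agg(['mean', 'std'])
seasonal['std'] = seasonal['std'].clip(lower=1.0)  # std-dev floor avoids divide-by-zero

df = df.join(seasonal, on='hour_of_day')
df['z_seasonal'] = (df['rows_written'] - df['mean']) / df['std']
df['is_anomaly'] = df['z_seasonal'] < -3

# Require two consecutive anomalous hours before alerting to cut noise.
df['consecutive'] = df['is_anomaly'] & df['is_anomaly'].shift(1, fill_value=False)
if df['consecutive'].tail(6).any():
    print("ALERT: sustained volume drop vs hour-of-day baseline")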
Example 3: SLIs/SLOs definition (YAML)
slo:
  product: "dim_users"
  owner: "data-platform"
  indicators:
    - name: "freshness_minutes"
      objective: { target: 99.5, window: "30d" }
      error_budget_policy:
        warn_at_percent_budget: 50
        page_at_percent_budget: 100
      thresholds:
        warn: 20        # minutes since last update
        critical: 60
    - name: "completeness_user_id_pct"
      objective: { target: 99.9, window: "30d" }
      thresholds:
        warn: 99.7
        critical: 99.5
    - name: "daily_volume_rows"
      objective: { target: 99.0, window: "30d" }
      baseline: "14d_moving_avg"
      thresholds:
        warn_drop_pct: 10
        critical_drop_pct: 20
Store alongside code and review with stakeholders.
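A sketch of consuming this file and evaluating the freshness indicator, assuming PyYAML; the file name and measurement values are illustrative:
# Sketch: load the SLO file and evaluate thresholds and error budget for one indicator.
import yaml

with open("slo_dim_users.yml") as f:  # assumed file name
    slo = yaml.safe_load(f)["slo"]

freshness = next(i for i in slo["indicators"] if i["name"] == "freshness_minutes")

measured_lag_minutes = 25          # current freshness lag (illustrative)
good_minutes_in_window = 42_900    # minutes within threshold over the 30d window (illustrative)
total_minutes_in_window = 43_200   # 30 days * 24 h * 60 min

# Threshold check for the current measurement.
if measured_lag_minutes >= freshness["thresholds"]["critical"]:
    print("CRITICAL: freshness beyond critical threshold")
elif measured_lag_minutes >= freshness["thresholds"]["warn"]:
    print("WARN: freshness beyond warning threshold")

# Error budget: allowed bad minutes = (1 - target) * window.
target = freshness["objective"]["target"] / 100.0
budget = (1 - target) * total_minutes_in_window
burned = total_minutes_in_window - good_minutes_in_window
burn_pct = 100.0 * burned / budget
policy = freshness["error_budget_policy"]
if burn_pct >= policy["page_at_percent_budget"]:
    print("PAGE: error budget exhausted")
elif burn_pct >= policy["warn_at_percent_budget"]:
    print("WARN: error budget burn above policy threshold")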
Example 4: Centralized logging + metrics
// JSON log (one line)
{"ts":"2026-01-11T02:34:11Z","severity":"INFO","pipeline":"ingest_users","event":"rows_written","count":12345,"dataset":"stg_users","run_id":"2026-01-11-0230","trace_id":"f8b2...","owner":"data-platform"}
# Prometheus-style metrics
pipeline_rows_written{pipeline="ingest_users",dataset="stg_users"} 12345
pipeline_run_duration_seconds{pipeline="ingest_users"} 312.7
pipeline_errors_total{pipeline="ingest_users",stage="load"} 0
Include these fields
- run_id, trace_id, pipeline, dataset, stage
- owner, environment, version (git SHA)
- counts, durations, error messages
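A minimal Python sketch for emitting a log line like the one above with those fields; the logger setup and field values are illustrative:
# Sketch: emit one structured JSON log line per pipeline event with correlation fields.
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("ingest_users")

def log_event(event: str, **fields):
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "severity": "INFO",
        "pipeline": "ingest_users",
        "dataset": "stg_users",
        "stage": "load",
        "owner": "data-platform",
        "environment": "prod",
        "version": "a1b2c3d",  # git SHA of the deployed code (illustrative)
        "event": event,
        **fields,
    }
    log.info(json.dumps(record))

log_event("rows_written", count=12345, run_id="2026-01-11-0230", trace_id="f8b2...")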
Example 5: Data quality framework integration (Great Expectations-like)
# great_expectations.yml style snippet
expectations:
  dataset: "dim_users"
  owner: "data-platform"
  tests:
    - expect_table_row_count_to_be_between:
        min_value: 0.9 * ref("dim_users", "rowcount_14d_avg")
    - expect_column_values_to_not_be_null:
        column: "user_id"
        mostly: 0.999
    - expect_column_values_to_be_in_set:
        column: "status"
        value_set: ["active", "inactive", "trial"]
    - expect_table_columns_to_match_ordered_list:
        column_list: ["user_id", "email", "status", "created_at", "updated_at"]
Run checks in CI for schema changes, and in production runs for data drift.
Drills & exercises
- Define three SLIs for a critical table and propose initial SLO targets. ☐
- Write SQL to compute freshness lag and completeness for two required columns. ☐
- Add run_id and trace_id to an existing pipeline’s logs. ☐
- Create a 14-day rolling baseline for daily volume and flag a 15% drop. ☐
- Draft an alert message template with owner, severity, and remediation steps. ☐
- Simulate a failure; write a five-step incident runbook to triage and recover. ☐
Common mistakes
- Measuring too much, alerting on everything: Pick a small set of SLIs tied to user impact.
- No owners in alerts: Every dataset/pipeline should have a clear owner and on-call rotation.
- Checks only at the warehouse: Add checks at ingestion and staging to fail fast.
- Static thresholds in seasonal data: Use baselines by hour/day to avoid false positives.
- Skipping postmortems: Without learning loops, the same incidents repeat.
Debugging playbook
Freshness breach
- Check orchestrator: did the job trigger? If not, inspect scheduler logs.
- If triggered, inspect source API/queue lag; compare source timestamps vs load time.
- Re-run incremental load with a safe backfill window.
- Add a guard: fail the pipeline if the source watermark doesn’t advance (sketched below).
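A sketch of such a guard; the cursor and the watermark state handling are assumptions, so adapt it to your orchestrator and state store:
# Sketch: fail the run if the source watermark did not advance since the last successful run.
def guard_watermark(cur, last_watermark):
    cur.execute("SELECT MAX(event_timestamp) FROM raw_events")
    (current_watermark,) = cur.fetchone()
    if current_watermark is None or (last_watermark is not None and current_watermark <= last_watermark):
        raise RuntimeError(
            f"Source watermark did not advance (last={last_watermark}, current={current_watermark}); "
            "failing fast instead of loading stale data"
        )
    return current_watermark  # persist this as the new watermark after a successful load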
Volume drop
- Compare today vs 14-day baseline by hour-of-day.
- Check for new filters or schema changes that reduced join matches.
- Validate deduplication logic didn’t over-trim rows.
- Sample raw files/messages for late arrivals.
Completeness drop
- Identify columns with rising nulls; trace back to upstream change.
- Add defaulting or backfill steps; notify data producers.
- Introduce a contract test to block future regressions (sketched below).
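A minimal contract-test sketch in pytest style; the warehouse_cursor fixture and the expected column list are assumptions for illustration:
# Sketch: schema contract test that blocks deploys when upstream drops or renames a column.
EXPECTED_COLUMNS = ["user_id", "email", "status", "created_at", "updated_at"]

def test_dim_users_schema_contract(warehouse_cursor):
    warehouse_cursor.execute("""
        SELECT column_name
        FROM information_schema.columns
        WHERE table_name = 'dim_users'
        ORDER BY ordinal_position
    """)
    actual = [row[0] for row in warehouse_cursor.fetchall()]
    assert actual == EXPECTED_COLUMNS, f"Schema drift detected: {actual}"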
Mini project: Observability for a critical dataset
Goal: Make dim_users reliable and observable.
- Define SLIs/SLOs for freshness, completeness, and volume.
- Add SQL checks in staging and model outputs.
- Instrument your pipeline with logs and Prometheus-style metrics.
- Implement a simple anomaly detector for hourly rows written.
- Create two alert routes: warn (chat) and critical (paging); a routing sketch follows this list.
- Write a 1-page incident runbook and a postmortem template.
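A routing sketch for the warn/critical split; the webhook URLs, paging endpoint, and payload shape are assumptions for illustration:
# Sketch: route alerts by severity - warn goes to chat, critical also pages the on-call.
import json
import urllib.request

CHAT_WEBHOOK = "https://chat.example.com/hooks/data-platform"   # assumed
PAGING_WEBHOOK = "https://paging.example.com/v1/enqueue"        # assumed

def send(url, payload):
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

def route_alert(severity, dataset, message, run_id, owner, runbook_url):
    payload = {
        "severity": severity,
        "dataset": dataset,
        "message": message,
        "run_id": run_id,
        "owner": owner,
        "runbook": runbook_url,
    }
    if severity == "critical":
        send(PAGING_WEBHOOK, payload)  # page the on-call
    send(CHAT_WEBHOOK, payload)        # every alert also lands in chat for visibility

route_alert("warn", "dim_users", "Freshness lag 25 min (warn at 20)",
            run_id="2026-01-11-0230", owner="data-platform",
            runbook_url="https://wiki.example.com/runbooks/dim_users")  # assumed URL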
Deliverables checklist
- YAML with SLIs/SLOs and owners
- SQL check scripts and pass/fail criteria
- Metric emits and example logs
- Anomaly detection code and thresholds
- Alert templates and runbook
Subskills
- Freshness Completeness Volume Checks — Build baseline checks at ingestion and model outputs; define pass/fail and ownership.
- Anomaly Detection For Pipelines — Detect drift using rolling baselines and z-scores or seasonality-aware methods.
- SLIs SLOs For Data Products — Turn data expectations into measurable service health targets.
- Centralized Logging Metrics Tracing — Standardize structured logs, metrics, and traces with correlation IDs.
- Incident Management And On Call — Establish triage, escalation, and communication patterns.
- Root Cause Analysis And Postmortems — Run blameless reviews and track prevention actions.
- Alert Tuning And Noise Reduction — Reduce flapping with hysteresis, deduping, and severity routing.
- Data Quality Framework Integration — Use a framework to standardize checks and reporting.
Next steps
- Pick one critical dataset and implement the full loop: SLIs/SLOs → checks → alerts → runbook.
- Measure alert noise for a week and tune thresholds and routing.
- Roll out a DQ framework to two more datasets; standardize owners and documentation.