Data Quality And Observability

Learn Data Quality And Observability for the Data Architect role for free: roadmap, worked examples, subskills, and a skill exam.

Published: January 18, 2026 | Updated: January 18, 2026

Why this skill matters for a Data Architect

Data quality and observability make your data platform trustworthy. As a Data Architect, you define how quality is measured, monitored, and remediated across ingestion, transformation, and serving layers. Strong observability unlocks reliable analytics, AI features, regulatory reporting, and faster incident recovery.

  • Prevent costly bad-data incidents and regulatory risk.
  • Shorten time-to-detect and time-to-resolve data issues.
  • Enable teams to ship changes confidently with measurable SLAs/SLOs.

What you will be able to do

  • Design a data quality framework with clear dimensions (freshness, completeness, accuracy, validity, consistency, uniqueness).
  • Implement checks, anomaly detection, and cross-system reconciliation.
  • Set up monitoring, alerting, SLAs/SLOs, and executive-quality reporting.
  • Create incident playbooks and a repeatable root cause analysis process.

Who this is for

  • Data Architects defining platform standards and governance.
  • Senior Data/Analytics Engineers building reliable pipelines.
  • Platform Engineers responsible for observability and SLOs.

Prerequisites

  • Comfortable with SQL and at least one data processing engine (e.g., Spark, Flink, or a warehouse).
  • Basic understanding of data modeling, pipelines/orchestration, and CI/CD.
  • Familiarity with metrics/monitoring concepts (dashboards, alerts, severity).

Learning path (practical roadmap)

1) Define the quality bar

  • Pick quality dimensions and draft SLAs/SLOs per critical dataset.
  • Choose thresholds for freshness, completeness, and accuracy; document owners and escalation policy (see the registry sketch below).
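
One lightweight way to make this step durable is a small SLO registry that checks, alerts, and dashboards can all read from. A minimal sketch in SQL, assuming a hypothetical dq.slo_registry table; names and columns are illustrative, not a prescribed schema:

-- Hypothetical SLO registry: one row per dataset and quality dimension
CREATE TABLE IF NOT EXISTS dq.slo_registry (
  dataset_name STRING NOT NULL,   -- e.g., 'mart.customer_daily'
  dimension    STRING NOT NULL,   -- 'freshness', 'completeness', 'accuracy', ...
  threshold    FLOAT64 NOT NULL,  -- e.g., 60 (minutes) or 1.0 (percent)
  unit         STRING NOT NULL,   -- 'minutes', 'percent', 'count'
  severity     STRING NOT NULL,   -- 'high', 'medium', 'low'
  owner        STRING NOT NULL,   -- owning team or on-call alias
  runbook      STRING,            -- escalation path or runbook link
  updated_at   TIMESTAMP
);

-- Example rows matching the thresholds used in the worked examples below
INSERT INTO dq.slo_registry VALUES
  ('mart.customer_daily', 'freshness', 60, 'minutes', 'high', 'data-platform', 'runbook: freshness', CURRENT_TIMESTAMP()),
  ('mart.orders', 'completeness', 1, 'percent', 'medium', 'analytics-engineering', 'runbook: null-rate', CURRENT_TIMESTAMP());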

2) Instrument baseline checks

  • Add freshness, completeness, and domain/accuracy tests close to the data (SQL/dbt/Great Expectations-style checks).
  • Store results and failures as events for trend analysis.

3) Establish reconciliation patterns

  • Compare record counts, sums, and hashes between source and warehouse by partition/date.
  • Automate daily drift detection jobs.

4) Add anomaly detection

  • Start with simple statistical baselines (rolling mean, z-score). Tune to reduce noise.
  • Graduate to seasonality-aware models if needed.

5) Monitoring and alerting

  • Publish metrics to your monitoring stack. Define severity levels and on-call rotation.
  • Route alerts with context (dataset, owner, runbook link, last healthy time).

6) Incident management and RCA

  • Create a playbook template. Practice drills. Track MTTR/MTTD.
  • Use lineage to localize the fault and prevent recurrences with action items.

7) Quality metrics and reporting

  • Build a dashboard: test coverage, failure rate, freshness SLO attainment, incident metrics.
  • Share summaries with stakeholders; adjust thresholds based on impact (a KPI query sketch follows).
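
If check outcomes are logged as events (see the results-log sketch in the worked examples), the headline KPIs can be computed directly in SQL. A minimal sketch, assuming a hypothetical dq.check_results table with check_name, dataset_name, dimension, passed, and run_ts columns:

-- Last-30-day pass rate and freshness SLO attainment per dataset
SELECT
  dataset_name,
  COUNT(*) AS checks_run,
  ROUND(100.0 * COUNTIF(passed) / COUNT(*), 2) AS pass_rate_pct,
  ROUND(100.0 * COUNTIF(passed AND dimension = 'freshness')
        / NULLIF(COUNTIF(dimension = 'freshness'), 0), 2) AS freshness_attainment_pct
FROM dq.check_results
WHERE run_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY dataset_name
ORDER BY pass_rate_pct;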

Milestone checklist

  • SLAs/SLOs defined and approved for top 10 critical tables.
  • Baseline checks deployed and logging results.
  • Daily reconciliation job running with alerts.
  • Anomaly rules tuned to low false-positive rate.
  • Runbooks published; on-call rota set.
  • Quality dashboard live with targets and trends.

Worked examples

1) Freshness, completeness, and accuracy checks (SQL)

-- Freshness (minutes since last update)
SELECT TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(updated_at), MINUTE) AS freshness_minutes
FROM mart.customer_daily;

-- Completeness (null rate for critical column)
SELECT 100.0 * SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) / COUNT(*) AS email_null_pct
FROM mart.customer_daily
WHERE load_date = CURRENT_DATE();

-- Accuracy / validity (domain and range checks)
SELECT COUNTIF(status NOT IN ('active','inactive','pending')) AS invalid_status_count,
       COUNTIF(amount < 0) AS negative_amount_count
FROM mart.orders
WHERE load_date = CURRENT_DATE();

How to act on results

  • Define thresholds (e.g., freshness <= 60 min, email_null_pct <= 1%).
  • On breach, include dataset name, owner, threshold, and last healthy time in the alert; persist each outcome as an event (see the sketch below).
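
To make trend analysis possible (step 2 above), persist every check execution as an event. A minimal sketch, assuming a hypothetical dq.check_results table; column names are illustrative:

-- Hypothetical results log: one row per check execution
CREATE TABLE IF NOT EXISTS dq.check_results (
  check_name   STRING,     -- e.g., 'customer_freshness'
  dataset_name STRING,     -- e.g., 'mart.customer_daily'
  dimension    STRING,     -- 'freshness', 'completeness', 'accuracy', ...
  measured     FLOAT64,    -- observed value, e.g., 75 (minutes) or 0.4 (percent)
  threshold    FLOAT64,    -- agreed SLO threshold
  passed       BOOL,       -- was the measurement within the threshold?
  run_ts       TIMESTAMP   -- when the check ran
);

-- Example: record today's freshness check for mart.customer_daily
INSERT INTO dq.check_results
SELECT
  'customer_freshness',
  'mart.customer_daily',
  'freshness',
  TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(updated_at), MINUTE),
  60,
  TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(updated_at), MINUTE) <= 60,
  CURRENT_TIMESTAMP()
FROM mart.customer_daily;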

2) Reconciliation across systems

-- Partition-level count check between source and warehouse
WITH src AS (
  SELECT order_date, COUNT(*) AS c FROM staging_src.orders GROUP BY order_date
), tgt AS (  -- 'tgt' avoids clashing with the wh dataset name
  SELECT order_date, COUNT(*) AS c FROM wh.orders GROUP BY order_date
)
SELECT COALESCE(src.order_date, tgt.order_date) AS order_date,
       src.c AS src_count, tgt.c AS wh_count,
       (tgt.c - src.c) AS diff
FROM src FULL OUTER JOIN tgt ON src.order_date = tgt.order_date
WHERE src.c IS NULL OR tgt.c IS NULL OR src.c != tgt.c;

-- Row-level hash comparison (stable business keys)
SELECT s.order_id
FROM (
  SELECT order_id,
         MD5(CONCAT(CAST(amount AS STRING), '|', status, '|', CAST(DATE(order_ts) AS STRING))) AS h
  FROM staging_src.orders WHERE order_date = CURRENT_DATE()
) s
LEFT JOIN (
  SELECT order_id,
         MD5(CONCAT(CAST(amount AS STRING), '|', status, '|', CAST(DATE(order_ts) AS STRING))) AS h
  FROM wh.orders WHERE order_date = CURRENT_DATE()
) w ON s.order_id = w.order_id
WHERE w.order_id IS NULL OR s.h != w.h;

Tip

Use deterministic hashes on business keys and normalized field formats to avoid false mismatches due to type/format differences.
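
A minimal sketch of that normalization, reusing the orders example above; the specific formats (two-decimal amounts, lowercase status, ISO dates) are assumptions to adapt to your data:

-- Normalize types and formats on both sides before hashing
SELECT
  order_id,
  TO_HEX(MD5(CONCAT(
    FORMAT('%.2f', CAST(amount AS FLOAT64)), '|',   -- fixed numeric precision
    LOWER(TRIM(status)), '|',                       -- case- and whitespace-insensitive
    FORMAT_DATE('%Y-%m-%d', DATE(order_ts))         -- explicit date format
  ))) AS row_hash
FROM staging_src.orders
WHERE order_date = CURRENT_DATE();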

3) Simple anomaly detection with rolling baseline (SQL)

-- Detect anomalous daily record count using z-score
WITH daily AS (
  SELECT DATE(load_ts) AS d, COUNT(*) AS c FROM wh.events GROUP BY 1
), stats AS (
  SELECT d, c,
         AVG(c) OVER (ORDER BY d ROWS BETWEEN 14 PRECEDING AND 1 PRECEDING) AS mean_c,
         STDDEV_SAMP(c) OVER (ORDER BY d ROWS BETWEEN 14 PRECEDING AND 1 PRECEDING) AS sd_c
  FROM daily
)
SELECT d, c, (c - mean_c)/NULLIF(sd_c,0) AS z
FROM stats
WHERE sd_c > 0 AND ABS((c - mean_c) / sd_c) >= 3;

Tuning guidance

  • Use a lookback window that covers typical cycles (e.g., 14 days).
  • Start with |z| ≥ 3; adjust per business impact and seasonality (a weekday-aware variant is sketched below).
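
A minimal seasonality-aware variant of the query above: computing the baseline per day of week compares each Monday only with previous Mondays, which suppresses routine weekday/weekend swings. The four-week lookback is an assumption to tune:

-- Weekday-aware z-score: baseline over the previous four same-weekday values
WITH daily AS (
  SELECT DATE(load_ts) AS d, EXTRACT(DAYOFWEEK FROM DATE(load_ts)) AS dow, COUNT(*) AS c
  FROM wh.events GROUP BY 1, 2
), stats AS (
  SELECT d, c,
         AVG(c) OVER (PARTITION BY dow ORDER BY d ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) AS mean_c,
         STDDEV_SAMP(c) OVER (PARTITION BY dow ORDER BY d ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) AS sd_c
  FROM daily
)
SELECT d, c, (c - mean_c) / NULLIF(sd_c, 0) AS z
FROM stats
WHERE sd_c > 0 AND ABS((c - mean_c) / sd_c) >= 3;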

4) Monitoring and alerting: config sketch

# Pseudo YAML for a dataset quality monitor
checks:
  - name: customer_freshness
    type: freshness
    query: SELECT TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(updated_at), MINUTE) FROM mart.customer_daily
    threshold: 60
    severity: high
    on_breach:
      notify: ["data-oncall"]
      include: ["dataset","owner","threshold","last_success_ts","runbook"]

  - name: orders_completeness_email
    type: null_rate
    query: |
      SELECT 100.0 * SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) / COUNT(*) FROM mart.orders
    threshold: 1
    severity: medium
    on_breach:
      notify: ["analytics-engineering"]
      ticket: true

Alert hygiene

  • Deduplicate repeated alerts; add cooldowns to avoid noise storms (see the cooldown sketch below).
  • Always link to a runbook and recent context (last pass, owner, change history).
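
One way to implement a cooldown is to consult an alert log before notifying. A minimal sketch, assuming a hypothetical dq.alert_log table (check_name, fired_at) and a four-hour cooldown:

-- Notify only if the same check has not alerted within the cooldown window
SELECT
  'customer_freshness' AS check_name,
  COUNT(*) = 0 AS should_notify
FROM dq.alert_log
WHERE check_name = 'customer_freshness'
  AND fired_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 4 HOUR);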

5) Incident playbook snippet

Incident: Data Freshness Breach (High Severity)
1. Triage (max 15 min): confirm breach, identify impacted tables and consumers.
2. Contain: pause downstream jobs if serving stale/incorrect data to critical outputs.
3. Diagnose: check last successful run, recent code/config changes, upstream source health.
4. Fix: rollback or hotfix; re-run idempotent steps from the minimal checkpoint.
5. Recover: validate with targeted checks; re-enable downstream.
6. Learn: RCA within 48h; create action items (tests, alerts, guards) with owners and due dates.

Message template:
"High-sev freshness breach in mart.customer_daily. Last healthy: 09:10 UTC, threshold: 60 min.
Potential impact: Marketing dashboards. Runbook: <internal path>. Owner: Data Platform On-call."

6) Root cause analysis cues

-- Example lineage probe (conceptual)
SELECT child_table, parent_table, last_run_ts, status
FROM lineage_runs
WHERE child_table = 'mart.customer_daily'
ORDER BY last_run_ts DESC
LIMIT 20;

RCA workflow

  • Reproduce locally or in staging; confirm breach signal.
  • Check change log (code, schema, data contracts, source outages) in the last 24–48 hours.
  • Trace lineage one hop upstream; compare inputs vs. expected contracts (a contract probe is sketched below).
  • Test hypotheses with targeted queries and backfills.
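
For the contract comparison, a targeted probe against the latest upstream partition is usually enough. A minimal sketch, assuming a hypothetical dq.dataset_contracts table that stores the agreed minimum volume per source table:

-- Does today's upstream partition meet the contracted minimum volume?
WITH upstream AS (
  SELECT COUNT(*) AS row_cnt
  FROM staging_src.orders
  WHERE order_date = CURRENT_DATE()
)
SELECT c.expected_min_rows,
       u.row_cnt,
       u.row_cnt < c.expected_min_rows AS contract_violated
FROM upstream u
CROSS JOIN (
  SELECT expected_min_rows
  FROM dq.dataset_contracts
  WHERE table_name = 'staging_src.orders'
) c;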

Drills / exercises

  • Create a freshness and completeness check for a table you own; choose thresholds and define an owner.
  • Implement a daily reconciliation between an ingestion table and a warehouse table using counts and a hash on a business key.
  • Add a rolling z-score anomaly detector for a metric of your choice; tune threshold to reduce false positives.
  • Draft an incident playbook for a high-severity data quality incident (e.g., a freshness breach) relevant to your org.
  • Build a small dashboard showing test pass rate, freshness SLO attainment, and MTTR over the last 30 days.

Common mistakes and how to avoid them

  • Only checking downstream marts: instrument quality as early as possible near ingestion.
  • Thresholds too strict or too loose: iterate using historical data and stakeholder input.
  • Noisy alerts: add severity levels, cooldowns, and context; prefer actionable alerts.
  • Non-idempotent re-runs: design pipelines with dedupe keys and safe retries (see the MERGE sketch below).
  • Skipping reconciliation: counts and hashes catch silent drift and schema quirks.
  • Unowned incidents: assign owners and escalation paths; include runbooks in every alert.
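
A minimal sketch of an idempotent re-run: a MERGE keyed on the business key means replaying the same partition converges to the same end state. Table and column names reuse the hypothetical orders schema from the worked examples:

-- Safe to re-run for the same partition: updates existing keys, inserts new ones
MERGE wh.orders AS t
USING (
  SELECT * FROM staging_src.orders
  WHERE order_date = CURRENT_DATE()
) AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN
  UPDATE SET amount = s.amount, status = s.status, order_ts = s.order_ts
WHEN NOT MATCHED THEN
  INSERT (order_id, amount, status, order_ts, order_date)
  VALUES (s.order_id, s.amount, s.status, s.order_ts, s.order_date);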

Debugging tips

  • Check whether the signal is real: validate with a direct query and a control table.
  • Validate clocks and time zones when diagnosing freshness breaches.
  • Compare a small partition/date; sample row-level diffs before full backfills.
  • Roll back to last known good artifacts if hotfix risk is high.

Mini project: Quality guardrails for a critical dataset

Goal: Add reliable guardrails to a high-impact table (e.g., orders_daily) and publish quality KPIs.

  • Design: define SLAs/SLOs, owners, and severity levels.
  • Build: implement freshness, completeness, and domain checks; add a rolling anomaly detector for daily volume.
  • Reconcile: schedule a daily source-vs-warehouse reconciliation job.
  • Operate: configure alerts with runbook references and a simple on-call rotation.
  • Report: create a dashboard with test coverage, pass rate, freshness attainment, incident count, and MTTR.

Stretch goals

  • Add seasonality-aware anomalies (weekday profiles).
  • Publish a weekly stakeholder summary with trends and action items.

Subskills

  • Data Quality Framework Design — Establish dimensions, SLAs/SLOs, ownership, and governance.
  • Freshness Completeness Accuracy Checks — Implement baseline tests near data.
  • Reconciliation Patterns — Detect drift across systems with counts and hashes.
  • Anomaly Detection Strategy — Catch outliers with rolling baselines and tuned thresholds.
  • Monitoring And Alerting Design — Route actionable alerts with severity and runbooks.
  • Incident Management Playbooks — Standardize triage, containment, fix, and learn steps.
  • Root Cause Analysis Process — Use lineage, contracts, and experiments to find causes.
  • Quality Metrics And Reporting — Track coverage, failure rates, freshness, MTTR.

Next steps

  • Pick one production dataset and implement at least three checks plus a reconciliation job.
  • Define alerting and publish a small dashboard with your first quality KPIs.
  • Schedule a monthly review to refine thresholds and update runbooks.
