Data Quality Framework Design

Learn Data Quality Framework Design for free with explanations, exercises, and a quick test (for the Data Architect role).

Published: January 18, 2026 | Updated: January 18, 2026

Why this matters

As a Data Architect, you define how an organization protects itself from bad data. A data quality (DQ) framework is the set of standards, controls, monitoring, and ownership that applies to every dataset and pipeline. You will use it to design data contracts, choose check types, set thresholds and SLOs/SLAs, wire alerts, and guide incident response.

  • Real tasks: define critical data elements, choose quality dimensions, set thresholds and SLAs, embed checks in pipelines, route alerts, and track improvements.
  • Impact: fewer broken dashboards, faster issue detection, clearer ownership, and safer decision-making.

Concept explained simply

A DQ framework is a repeatable recipe to make data fit for purpose. It answers: What to check, where to check, how to measure, who owns it, how to alert, and how to improve.

Mental model

Think of three Gs:

  • Guardrails: standards and controls (contracts, constraints, rules).
  • Gauges: metrics, SLOs, and monitors that show health in real time.
  • Garage: runbooks, roles, and post-incident learning to fix and prevent issues.

Core design components

  • Scope: datasets and critical data elements (CDEs) covered.
  • Dimensions: completeness, accuracy, validity, uniqueness, consistency, timeliness/freshness, and referential integrity.
  • Controls: schema constraints, rule checks, reconciliation, anomaly detection, and data contracts.
  • Observability: metrics (volume, nulls, distribution), lineage, logs, and alerts.
  • Targets: SLOs (internal) and SLAs (promise to consumers) with thresholds.
  • Ownership: data product owner, steward, on-call rotation, escalation path.
  • Lifecycle: versioning, exceptions, change approvals, and documentation.

See a compact DQ spec template
dataset: <name>
domain: <business domain>
purpose: <what decisions depend on it>
critical_data_elements:
  - field: <name>
    dimension_targets:
      completeness: >= 99.9%
      validity: allowed_values / regex
      accuracy: >= 99% (method: reconciliation)
checks:
  - type: schema_contract
    rule: non_null & primary_key
  - type: uniqueness
    key: ["id"]
  - type: referential_integrity
    foreign_key: customer_id -> customers.id
  - type: freshness
    max_delay: 1h
observability:
  metrics: [row_count, null_rate, distribution_drift]
  alerting:
    severity: [P1, P2, P3]
    channels: [on-call, email]
    routing: data_product_team
targets:
  SLOs:
    freshness: 99% of days <= 1h delay
    completeness: >= 99.9%
  SLAs:
    delivery: daily by 08:00 UTC on business days
ownership:
  product_owner: <name/role>
  steward: <name/role>
  on_call: <rotation>
runbooks:
  - freshness_breach.md
change_management:
  contract_versioning: semver
  approval: data_governance_review
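
As a rough illustration of how a spec like this can drive automation, the sketch below loads a filled-in copy of the template and dispatches its checks to an executor you supply. It assumes PyYAML is available and that a valid-YAML copy of the spec is saved as dq_spec.yaml (a hypothetical filename); the dispatch logic is illustrative, not any specific tool's API.
# Minimal sketch: load a DQ spec and run its checks through a pluggable executor.
# Assumes PyYAML is installed and a filled-in, valid-YAML copy of the template
# above is saved as dq_spec.yaml (hypothetical filename).
import yaml

def load_spec(path: str) -> dict:
    with open(path) as f:
        return yaml.safe_load(f)

def run_checks(spec: dict, run_check) -> list:
    """Run every declared check; run_check(check_type, check_config) -> bool is supplied by you."""
    failures = []
    for check in spec.get("checks", []):
        if not run_check(check["type"], check):
            failures.append(check)
    return failures

if __name__ == "__main__":
    spec = load_spec("dq_spec.yaml")
    # Stub executor that passes everything; replace with real SQL/pipeline checks.
    failures = run_checks(spec, run_check=lambda check_type, config: True)
    print(f"{len(failures)} failing checks for dataset {spec['dataset']}")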

Worked examples

Example 1: Customer master table (batch)

  • CDEs: customer_id, email, country_code, created_at.
  • Controls: primary key uniqueness on customer_id; email validity (regex); country_code in ISO list; referential integrity to countries table (sketched in code after this list).
  • Freshness: table updated by 03:00 UTC daily; SLO 99% of days.
  • Observability: watch null_rate(email), row_count trend, duplicate_rate(customer_id) <= 0.001%.
  • Runbook: on P1 duplicate spike, disable downstream publish, dedupe with deterministic rules, backfill, RCA.
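
A minimal sketch of Example 1's controls expressed with pandas. The DataFrame and column names, the email regex, and the source of the ISO list are assumptions for illustration, not a specific tool's API.
# Sketch of Example 1's batch controls (pandas; table/column names are assumed).
import pandas as pd

def check_customer_master(customers: pd.DataFrame, countries: pd.DataFrame) -> dict:
    results = {}
    # Uniqueness: duplicate rate on the primary key customer_id.
    results["duplicate_rate"] = customers["customer_id"].duplicated().mean()
    # Completeness: null rate on a critical field.
    results["null_rate_email"] = customers["email"].isna().mean()
    # Validity: share of emails matching a simple regex (nulls count as invalid).
    results["email_validity"] = (
        customers["email"].fillna("").str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$").mean()
    )
    # Referential integrity: every country_code must exist in the countries table.
    results["ref_integrity_country"] = (
        customers["country_code"].isin(set(countries["country_code"])).mean()
    )
    return results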

Example 2: Clickstream events (streaming)

  • Contract: fields event_id (UUID), user_id, event_time, event_type, payload.
  • Controls: idempotency via event_id; schema evolution with versioned contract; validity on event_type; lateness window 24h with watermark.
  • Freshness: 99.9% of events processed within 5 minutes; SLA to analytics: sessions table lag <= 10 minutes (see the sketch after this list).
  • Observability: volume anomaly detection vs 7-day baseline; distribution drift for event_type; dead-letter queue rate < 0.1%.
  • Runbook: if lag > 10 minutes, scale consumers, check broker health, drain DLQ, replay window.
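
A small sketch of the Example 2 gauges: per-event processing lag versus the 5-minute SLO, and the dead-letter queue (DLQ) rate. Thresholds mirror the example; the event shape and function names are assumptions.
# Sketch of the Example 2 streaming gauges (event shape and names are assumed).
from datetime import datetime

LAG_SLO_SECONDS = 5 * 60        # 99.9% of events processed within 5 minutes
DLQ_RATE_THRESHOLD = 0.001      # dead-letter queue rate < 0.1%

def processing_lag_seconds(event_time: datetime, processed_at: datetime) -> float:
    return (processed_at - event_time).total_seconds()

def evaluate_window(lags: list[float], dlq_count: int, total_count: int) -> dict:
    """Evaluate one monitoring window of per-event lags and DLQ counts."""
    within = sum(1 for lag in lags if lag <= LAG_SLO_SECONDS) / max(len(lags), 1)
    dlq_rate = dlq_count / max(total_count, 1)
    return {
        "pct_within_5m": within,
        "lag_slo_met": within >= 0.999,
        "dlq_rate": dlq_rate,
        "dlq_ok": dlq_rate < DLQ_RATE_THRESHOLD,
    }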

Example 3: Monthly revenue metric (warehouse)

  • Controls: reconciliation between orders.amount and payments.amount within ±0.5% (sketched in code after this list); validity: amounts >= 0; currency consistency.
  • Completeness: 100% of months since go-live; no missing partitions.
  • Freshness: monthly close by T+2; SLO: met in 99% of quarters.
  • Observability: anomaly check on ARPU trend; lineage documented from source to BI model.
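
The reconciliation control from Example 3 as a small helper: compare the two monthly totals and flag any relative difference beyond ±0.5%. The table and column names come from the example; the function name is an assumption.
# Sketch of the Example 3 reconciliation: orders vs payments totals within ±0.5%.
def reconcile(orders_total: float, payments_total: float, tolerance: float = 0.005) -> dict:
    """Relative difference between the two monthly totals, compared to the tolerance."""
    baseline = max(abs(payments_total), 1e-9)   # guard against division by zero
    rel_diff = abs(orders_total - payments_total) / baseline
    return {"rel_diff": rel_diff, "within_tolerance": rel_diff <= tolerance}

# Usage: reconcile(1_002_300.00, 1_000_000.00) -> rel_diff 0.0023, within tolerance
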
Scoring example: dataset health score
# Example scoring (weighting) used for dashboards
score = 0.3*completeness + 0.2*accuracy + 0.15*validity + 0.15*uniqueness + 0.2*freshness
# Each subscore is 0..1 based on SLO attainment over last 7 days
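
The same weighting as a small runnable helper. The weights come from the formula above; each subscore is the 0..1 SLO attainment over the last 7 days, and the input dictionary shape is an assumption.
# Health score helper using the weights above; missing dimensions count as 0.
WEIGHTS = {
    "completeness": 0.30,
    "accuracy": 0.20,
    "validity": 0.15,
    "uniqueness": 0.15,
    "freshness": 0.20,
}

def health_score(subscores: dict) -> float:
    """Weighted sum of 0..1 subscores (7-day SLO attainment per dimension)."""
    return sum(weight * subscores.get(dim, 0.0) for dim, weight in WEIGHTS.items())

# Usage: health_score({"completeness": 1.0, "accuracy": 0.98, "validity": 1.0,
#                      "uniqueness": 1.0, "freshness": 0.95}) -> 0.986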

Implementation steps (practical)

  1. Gather requirements: list consumers, CDEs, business risks, and delivery times.
  2. Select minimum viable dimensions: typically completeness, uniqueness, validity, and freshness.
  3. Define control points: at ingestion (contracts), at storage (constraints), and at transformation (rules, reconciliations); see the gate sketch after this list.
  4. Set targets: SLOs and any SLAs; choose alert severities and routing.
  5. Instrument observability: collect row_count, null_rate, drift metrics; wire alerts.
  6. Document runbooks: who responds, what to check first, rollback/backfill steps.
  7. Pilot on 1–2 datasets; adjust thresholds; scale by template.
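
One way to wire steps 3 and 4 together is a quality gate at the transformation stage: blocking failures stop publishing, lower severities only alert. The severity mapping and function names below are assumptions, not a prescribed standard.
# Sketch of a check gate: block publish on P1 failures, alert on P2/P3.
class DataQualityError(Exception):
    pass

def gate(check_results: list[dict]) -> None:
    """check_results items: {'name': str, 'passed': bool, 'severity': 'P1'|'P2'|'P3'}."""
    failed = [r for r in check_results if not r["passed"]]
    for r in failed:
        # Replace print with real routing (on-call page for P1, channel message otherwise).
        print(f"ALERT {r['severity']}: check '{r['name']}' failed")
    blocking = [r for r in failed if r["severity"] == "P1"]
    if blocking:
        raise DataQualityError(f"{len(blocking)} P1 check(s) failed; publish blocked")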

How to choose checks

  • Risk-driven: start with fields that break decisions if wrong.
  • Data shape: streams need freshness/lateness and schema evolution; warehouses need constraints and reconciliation.
  • Cost-aware: prefer low-noise, high-signal checks; avoid alert storms.
  • Automatable: template checks so every new dataset starts with a baseline.

Metrics and KPIs to track

  • Coverage: % of datasets with baseline checks and owners assigned.
  • MTTD/MTTR: mean time to detect and resolve quality incidents (computed as in the sketch below).
  • SLO attainment: % of periods meeting freshness/completeness targets.
  • Alert quality: false positive rate and repeat incidents rate.
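
A sketch of how MTTD and MTTR can be computed from incident records; the timestamp field names are assumptions about how incidents are logged.
# MTTD = mean(occurred -> detected), MTTR = mean(detected -> resolved), in hours.
from datetime import datetime

def mttd_mttr_hours(incidents: list[dict]) -> tuple[float, float]:
    """Each incident: {'occurred': datetime, 'detected': datetime, 'resolved': datetime}."""
    n = max(len(incidents), 1)
    mttd = sum((i["detected"] - i["occurred"]).total_seconds() for i in incidents) / n / 3600
    mttr = sum((i["resolved"] - i["detected"]).total_seconds() for i in incidents) / n / 3600
    return mttd, mttr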

Exercises

Note: Exercises and the quick test are available to everyone; only logged-in users get saved progress.

  1. Exercise 1 (ex1): Create a DQ spec for an Orders dataset. Include CDEs, dimensions with thresholds, observability, SLO/SLA, ownership, and runbooks.
  2. Exercise 2 (ex2): Design alerting and a runbook for a freshness SLO breach on a daily Sales feed.

Starter templates
# Orders DQ spec skeleton
dataset: orders
domain: commerce
purpose: revenue reporting, fulfillment
critical_data_elements:
  - field: order_id
  - field: customer_id
  - field: order_amount
  - field: order_date
checks: []
observability:
  metrics: []
  alerting: {}
targets:
  SLOs: {}
  SLAs: {}
ownership: {}
runbooks: []

Exercise checklist

  • [ ] CDEs identified and justified
  • [ ] At least 4 dimensions with numeric targets
  • [ ] Clear control points (ingest, storage, transform)
  • [ ] Alert routing with severity
  • [ ] Owner and on-call named
  • [ ] Runbook steps include diagnose, mitigate, fix, prevent

Common mistakes and self-check

  • Too many checks on non-critical fields. Fix: start with a baseline and grow by risk.
  • Only technical controls, no process. Fix: define owners and runbooks.
  • Static thresholds. Fix: use baselines and revisit after pilot.
  • No contract versioning. Fix: semver and approval steps for breaking changes.
  • Alert storms. Fix: deduplicate, route by severity, daily summary for low-priority.
  • Forgetting historical backfill checks. Fix: validate after reprocessing, not just on new data.

Self-check questions
  • Can you list CDEs and their targets for your top 3 datasets?
  • What is your MTTD/MTTR last month? Is it trending down?
  • Do alerts name an owner and a runbook?
  • How do you prevent schema-breaking changes from reaching consumers?

Practical projects

  • Project 1: Baseline DQ for a critical dataset using the template; ship a dashboard with health score.
  • Project 2: Add anomaly detection for volume and null rate; compare against a 14-day baseline (see the sketch after this list).
  • Project 3: Run an incident tabletop for a duplicate spike; update the runbook with learnings.
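
For Project 2, one simple baseline comparison is a z-score against the trailing window: flag today's row count or null rate if it deviates more than a few standard deviations from the last 14 days. The 3-sigma threshold and function name are assumptions.
# Sketch for Project 2: flag a metric as anomalous versus a trailing 14-day baseline.
from statistics import mean, stdev

def is_anomalous(history: list[float], today: float, sigmas: float = 3.0) -> bool:
    """history: the previous 14 daily values of the metric (row_count or null_rate)."""
    if len(history) < 2:
        return False  # not enough baseline to judge
    mu, sd = mean(history), stdev(history)
    if sd == 0:
        return today != mu
    return abs(today - mu) > sigmas * sd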

Who this is for

  • Data Architects defining standards across teams.
  • Data Engineers and Analytics Engineers implementing checks and monitoring.
  • Data Stewards owning data products and SLAs.

Prerequisites

  • Basic data modeling (keys, constraints, lineage).
  • Pipeline stages (ingest, transform, serve).
  • Familiarity with monitoring/alerting concepts.

Learning path

  1. Identify business-critical datasets and CDEs.
  2. Pick minimum dimensions and targets.
  3. Decide control points and checks.
  4. Instrument observability and alerts.
  5. Define ownership and runbooks.
  6. Pilot, measure, iterate, and scale via templates.

Mini challenge

Choose a dataset you know. In 10 minutes, write a one-page DQ spec: scope, 4 dimensions with targets, 3 checks, alert routing, and a 6-step runbook. Share it with a teammate and ask: What failure would still slip through? Adjust your design.

Next steps

  • Extend to data contracts across services and teams.
  • Add reconciliation checks between key systems.
  • Automate baseline templates in your project scaffolding.
  • Track monthly SLO attainment and incident trends.

Practice Exercises

2 exercises to complete

Instructions

Create a one-page DQ spec for dataset "orders". Include:

  • Purpose and consumers
  • 4+ CDEs with dimension targets (completeness, validity, uniqueness, freshness)
  • Controls at ingestion, storage, and transformation
  • Observability metrics and alerts with severity and routing
  • SLOs and any SLAs
  • Ownership (product owner, steward, on-call)
  • Runbook reference with high-level steps

Keep it concise and actionable.

Expected Output
A structured spec (YAML or bullet list) covering scope, CDEs with targets, checks, observability, SLO/SLA, ownership, and runbooks.

Data Quality Framework Design — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
