Why this skill matters for ETL Developers
Data pipelines only deliver value when the data is trustworthy. As an ETL Developer, you design, implement, and operate checks that keep data complete, accurate, consistent, and timely. Strong data quality (DQ) and validation practices help you:
- Catch issues before they hit analytics, ML models, or dashboards.
- Shorten incident resolution with clear quarantine, logging, and root-cause analysis.
- Build stakeholder trust with repeatable, auditable quality metrics.
- Reduce rework by preventing bad data from cascading across systems.
What DQ looks like in your day-to-day
- Row counts and hash totals after each load.
- Null/duplicate/range checks on critical columns.
- Referential integrity checks between facts and dimensions.
- Anomaly alerts when volumes, delays, or distributions drift.
- Quarantine and reprocessing workflows for rejects.
- Daily/weekly DQ reports with pass/fail rates and trend lines.
Who this is for
- ETL/ELT developers building batch or streaming pipelines.
- Data engineers responsible for data ingestion and transformations.
- Analytics engineers who need reliable warehouse layers.
- QA engineers and data stewards supporting data platforms.
Prerequisites
- Comfort with SQL (joins, window functions, aggregates).
- Basic ETL/ELT concepts (staging, transformations, target models).
- Familiarity with one programming or orchestration environment (e.g., Python/PySpark, dbt, Airflow, SQL-based ETL).
Learning path and milestones
- Foundations: Define quality dimensions (completeness, accuracy, consistency, uniqueness, timeliness). Implement row counts and simple null/duplicate checks.
- Constraints & integrity: Add referential integrity and range checks; implement reject/quarantine flow.
- Monitoring: Build a DQ metrics table with pass/fail rates, volumes, and freshness; schedule reporting.
- Anomaly basics: Add simple statistical baselines (rolling averages, z-scores) for volumes and lateness.
- Root cause analysis: Create a playbook to trace failures upstream, inspect diffs, and reproduce locally.
- Maintainability: Make checks metadata-driven and versioned; document ownership and SLAs.
Worked examples
1) Row counts and reconciliation
Goal: ensure we loaded everything we expected, and nothing extra.
-- Completeness: staging vs target row counts for today's load
SELECT 'orders' AS table_name,
SUM(CASE WHEN table_ref = 'staging' THEN cnt ELSE 0 END) AS staging_cnt,
SUM(CASE WHEN table_ref = 'target' THEN cnt ELSE 0 END) AS target_cnt
FROM (
SELECT 'staging' AS table_ref, COUNT(*) AS cnt
FROM staging.orders
WHERE load_date = CURRENT_DATE
UNION ALL
SELECT 'target' AS table_ref, COUNT(*) AS cnt
FROM dw.fact_orders
WHERE load_date = CURRENT_DATE
) t;
-- Accuracy proxy: hash totals on stable fields to detect silent drift
-- (hash functions are vendor-specific; e.g., FARM_FINGERPRINT in BigQuery, HASH in Snowflake/Databricks)
SELECT
SUM(ABS(HASH(order_id))) AS id_hash_total,
SUM(order_amount) AS amount_sum
FROM staging.orders
WHERE load_date = CURRENT_DATE;
When counts match but data differs
Compare business aggregates (e.g., SUM(order_amount) by day or region) between source and target. Mismatched sums with equal row counts usually signal transformation errors or duplicated/omitted keys.
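A sketch of such a reconciliation, reusing the staging/target tables from the example above (IS DISTINCT FROM is a NULL-safe comparison; use an equivalent on engines that lack it):

```sql
-- Same row counts but different daily sums usually means a transformation
-- or duplication bug rather than a missing load
SELECT COALESCE(s.load_date, t.load_date) AS load_date,
       s.amount_sum AS staging_sum,
       t.amount_sum AS target_sum
FROM (
  SELECT load_date, SUM(order_amount) AS amount_sum
  FROM staging.orders
  GROUP BY load_date
) s
FULL OUTER JOIN (
  SELECT load_date, SUM(order_amount) AS amount_sum
  FROM dw.fact_orders
  GROUP BY load_date
) t ON s.load_date = t.load_date
WHERE s.amount_sum IS DISTINCT FROM t.amount_sum;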
2) Null, duplicate, and range checks
-- Nulls & empties on critical columns
SELECT COUNT(*) AS null_customer_id
FROM staging.orders
WHERE customer_id IS NULL;
SELECT COUNT(*) AS empty_status
FROM staging.orders
WHERE COALESCE(TRIM(status), '') = '';
-- Duplicate business keys
SELECT order_id, COUNT(*) AS c
FROM staging.orders
GROUP BY order_id
HAVING COUNT(*) > 1;
-- Range checks (numeric & date)
SELECT COUNT(*) AS bad_amount
FROM staging.orders
WHERE order_amount < 0 OR order_amount > 100000;
SELECT COUNT(*) AS out_of_window
FROM staging.orders
WHERE order_date < CURRENT_DATE - INTERVAL '365 days' OR order_date > CURRENT_DATE;
Tip: avoid false positives
Use business-approved thresholds. Example: negative amounts may be valid for refunds; model them separately or exclude by status = 'REFUND'.
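A refund-aware version of the amount check might look like this (the 'REFUND' status value is illustrative; substitute your own status codes):

```sql
-- Negative amounts are legitimate for refunds; flag them only on non-refund rows
SELECT COUNT(*) AS bad_amount
FROM staging.orders
WHERE order_amount < 0
  AND COALESCE(TRIM(status), '') <> 'REFUND';
```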
3) Referential integrity checks
-- Orphan facts: orders referencing missing customers
SELECT o.order_id
FROM dw.fact_orders o
LEFT JOIN dw.dim_customer c ON o.customer_key = c.customer_key
WHERE c.customer_key IS NULL;
-- Pre-load check in staging by natural keys
SELECT s.order_id
FROM staging.orders s
LEFT JOIN dw.dim_customer d ON s.customer_id = d.customer_id
WHERE d.customer_id IS NULL;
Handling late-arriving dimensions
Create an 'Unknown' dimension member (e.g., customer_key = 0). Load facts with 0 when the dimension is late, then backfill keys when the dimension arrives. Track counts of 0-key references and alert if above threshold.
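A sketch of this pattern, assuming illustrative dim_customer columns (customer_key, customer_id, customer_name):

```sql
-- Seed the unknown member once
INSERT INTO dw.dim_customer (customer_key, customer_id, customer_name)
SELECT 0, 'UNKNOWN', 'Unknown customer'
WHERE NOT EXISTS (SELECT 1 FROM dw.dim_customer WHERE customer_key = 0);

-- Load facts, falling back to the unknown member when the dimension is late
INSERT INTO dw.fact_orders (order_id, customer_key, order_amount, load_date)
SELECT s.order_id,
       COALESCE(d.customer_key, 0),
       s.order_amount,
       s.load_date
FROM staging.orders s
LEFT JOIN dw.dim_customer d ON s.customer_id = d.customer_id;

-- Monitor the 0-key rate and alert when it exceeds your threshold
SELECT COUNT(*) AS unknown_refs
FROM dw.fact_orders
WHERE customer_key = 0 AND load_date = CURRENT_DATE;
```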
4) Simple anomaly detection for volumes
-- Daily volume z-score: flag days with unusually high/low counts
WITH daily AS (
SELECT load_date, COUNT(*) AS cnt
FROM dw.fact_orders
GROUP BY load_date
), stats AS (
SELECT load_date, cnt,
AVG(cnt) OVER (ORDER BY load_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg7,
STDDEV(cnt) OVER (ORDER BY load_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS sd7
FROM daily
), scored AS (
SELECT load_date, cnt,
CASE WHEN sd7 = 0 OR sd7 IS NULL THEN 0 ELSE (cnt - avg7) / sd7 END AS z
FROM stats
)
SELECT load_date, cnt, z
FROM scored
WHERE ABS(z) > 3;
When to alert
Only alert if volume anomaly coincides with freshness delay (e.g., latest event_ts < now() - 2 hours). Combining signals reduces noise.
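A combined gate can be expressed directly in SQL. This sketch assumes an event_ts column on the fact table and that the z-score check writes a 'volume_zscore' row into the dq.metrics table described later in this guide; both names are illustrative:

```sql
-- Fire only when staleness and a volume anomaly co-occur (names/thresholds illustrative)
SELECT 'orders_volume_and_freshness' AS alert_name
FROM (SELECT MAX(event_ts) AS latest_event FROM dw.fact_orders) f
WHERE f.latest_event < CURRENT_TIMESTAMP - INTERVAL '2 hours'
  AND EXISTS (
    SELECT 1
    FROM dq.metrics m
    WHERE m.metric_date = CURRENT_DATE
      AND m.check_name = 'volume_zscore'
      AND m.status = 'FAIL'
  );
```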
5) Reject handling and quarantine tables
-- Quarantine table
CREATE TABLE IF NOT EXISTS dq.quarantine_orders (
batch_id STRING,
record_id STRING,
error_code STRING,
error_message STRING,
raw_payload STRING,
detected_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Insert rejects during load
INSERT INTO dq.quarantine_orders (batch_id, record_id, error_code, error_message, raw_payload)
SELECT :batch_id,
CAST(order_id AS STRING),
'NULL_CUSTOMER',
'customer_id is NULL',
TO_JSON_STRING(t) -- BigQuery syntax; substitute your engine's row-to-JSON function
FROM staging.orders t
WHERE customer_id IS NULL;
Reprocessing strategy
Fix upstream (e.g., patch missing customer), then replay rejects by batch_id. Always make reruns idempotent: use MERGE/UPSERT semantics in targets.
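A replay sketch using the quarantine table and fact columns from the examples above (MERGE syntax varies slightly by engine):

```sql
-- Replay only the rejects from a fixed batch; MERGE keeps the rerun idempotent
MERGE INTO dw.fact_orders t
USING (
  SELECT s.order_id,
         COALESCE(d.customer_key, 0) AS customer_key,
         s.order_amount,
         s.load_date
  FROM staging.orders s
  JOIN dq.quarantine_orders q
    ON q.record_id = CAST(s.order_id AS STRING)
   AND q.batch_id = :batch_id
  LEFT JOIN dw.dim_customer d ON s.customer_id = d.customer_id
  WHERE s.customer_id IS NOT NULL  -- the condition that originally failed
) src
ON t.order_id = src.order_id
WHEN MATCHED THEN UPDATE SET
  customer_key = src.customer_key,
  order_amount = src.order_amount
WHEN NOT MATCHED THEN INSERT (order_id, customer_key, order_amount, load_date)
VALUES (src.order_id, src.customer_key, src.order_amount, src.load_date);
```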
6) Data quality reporting metrics
-- DQ metrics table
CREATE TABLE IF NOT EXISTS dq.metrics (
metric_date DATE,
dataset STRING,
check_name STRING,
status STRING, -- PASS/FAIL
expected BIGINT,
observed BIGINT,
pass_rate DOUBLE,
notes STRING,
recorded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Example: record null check result
INSERT INTO dq.metrics (metric_date, dataset, check_name, status, expected, observed, pass_rate, notes)
SELECT CURRENT_DATE, 'orders', 'null_customer_id',
CASE WHEN cnt = 0 THEN 'PASS' ELSE 'FAIL' END,
0, cnt,
CASE WHEN cnt = 0 THEN 1.0 ELSE 0.0 END,
'customer_id must not be NULL'
FROM (
SELECT COUNT(*) AS cnt
FROM staging.orders
WHERE customer_id IS NULL
) s;
Reporting tips
- Use consistent check names and severities (BLOCKER, WARN).
- Trend pass_rate week over week to prove improvements.
- Store batch_id/run_id so you can link metrics to pipeline runs.
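For the week-over-week trend, a query along these lines can feed the report (DATE_TRUNC syntax varies by engine):

```sql
-- Weekly pass rate per dataset from the dq.metrics table above
SELECT DATE_TRUNC('week', metric_date) AS week_start,
       dataset,
       AVG(CASE WHEN status = 'PASS' THEN 1.0 ELSE 0.0 END) AS pass_rate
FROM dq.metrics
GROUP BY DATE_TRUNC('week', metric_date), dataset
ORDER BY week_start, dataset;
```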
Drills and exercises
- [ ] Write row count, sum(amount), and distinct(order_id) checks for any dataset you own.
- [ ] Create duplicate detection by business key and decide a de-duplication rule.
- [ ] Implement three referential integrity checks between a fact table and two dims.
- [ ] Build a quarantine table and route at least two error types into it.
- [ ] Add a daily DQ metric insert for nulls, duplicates, and freshness.
- [ ] Implement a 7-day rolling z-score on volume; test with synthetic anomalies.
- [ ] Practice a rerun: fix a source error and reprocess only affected rejects.
Common mistakes and debugging tips
- Mistake: Matching only total row counts.
  Fix: Also reconcile business aggregates by key dimensions (e.g., sums per day/region).
- Mistake: Treating empty strings as valid non-null values.
  Fix: Normalize with TRIM and NULLIF before checks.
- Mistake: Missing late-arriving dimension strategy.
  Fix: Use unknown members (0 keys) and a backfill process; alert on elevated 0-key rates.
- Mistake: Non-idempotent reruns causing duplicates.
  Fix: Always load with MERGE/UPSERT on stable keys; log batch/run IDs.
- Mistake: Over-alerting on single signals.
  Fix: Combine volume, freshness, and distribution checks; add minimum volume thresholds.
- Debug tip: Diff samples of source vs target by key to locate transformation steps that diverge.
- Debug tip: Reproduce with a minimal batch locally; strip complex joins until the issue disappears.
- Debug tip: Validate time zones and data types (e.g., implicit string-to-int truncation).
Mini project: Retail Orders DQ Guardrails
Build a minimal but complete DQ layer for a retail Orders pipeline (staging → dimension/fact).
- Define critical fields and thresholds (customer_id not null, order_amount 0–100k, order_date within 365 days).
- Implement pre-load checks in staging: nulls, duplicates by order_id, and referential check to dim_customer by natural key.
- Route violations to dq.quarantine_orders with error codes and batch_id.
- Load into dw.dim_customer and dw.fact_orders with MERGE. Use unknown dimension rows for missing references.
- Insert DQ metrics for every check (PASS/FAIL, counts, pass_rate).
- Add anomaly detection for daily order counts and freshness.
- Document a runbook: how to investigate a fail, how to reprocess rejects.
Stretch goals
- Make checks metadata-driven (table, column, check_type, threshold).
- Add severity levels and stop-the-world behavior for BLOCKERs.
- Create weekly trend report summarizing pass rates and top error codes.
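One possible shape for the metadata-driven registry (columns and enum values are illustrative, not a standard):

```sql
-- Check definitions as data: a runner reads enabled rows, generates one
-- query per check, and writes results to dq.metrics
CREATE TABLE IF NOT EXISTS dq.check_definitions (
  check_id STRING,
  dataset STRING,
  column_name STRING,
  check_type STRING,   -- e.g. NOT_NULL, UNIQUE, RANGE, FK
  threshold STRING,    -- e.g. '0..100000' for RANGE
  severity STRING,     -- BLOCKER or WARN
  owner STRING,
  enabled BOOLEAN
);
```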
Subskills
- Row Counts And Reconciliation: Compare counts and business aggregates between stages; use hash totals to detect silent drift.
- Null Duplicate And Range Checks: Enforce non-null/required fields, unique keys, and valid numeric/date ranges.
- Referential Integrity Checks: Detect orphan facts; manage late-arriving dimensions with unknown members.
- Anomaly Detection Basics: Add rolling baselines and z-scores for volume and freshness to catch unexpected change.
- Reject Handling And Quarantine Tables: Capture invalid records with error codes; support reprocessing.
- Data Quality Reporting: Record check results to metrics tables; trend pass rates and alert on breaches.
- Root Cause Analysis For Load Issues: Trace failures upstream, isolate minimal reproductions, and verify schema/typing assumptions.
- Defining And Maintaining Data Checks: Specify checks as metadata; version, document ownership, and review regularly.
Practical projects
- Subscription events: Validate event ordering, deduplicate by event_id, enforce user foreign keys, report freshness.
- Payments pipeline: Range checks on amounts, currency codes whitelist, reconcile totals per settlement date.
- Product catalog: Enforce uniqueness on SKU, check referential links to categories, quarantine invalid attributes.
Next steps
- Expand checks to slowly changing dimensions (SCD) and historical consistency.
- Introduce severity-based gating to stop bad loads automatically.
- Adopt a metadata-driven check framework and add documentation with owners and SLAs.