Why this skill matters for ETL Developers
Data pipelines only deliver value when the data is trustworthy. As an ETL Developer, you design, implement, and operate checks that keep data complete, accurate, consistent, and timely. Strong data quality (DQ) and validation practices help you:
- Catch issues before they hit analytics, ML models, or dashboards.
- Shorten incident resolution with clear quarantine, logging, and root-cause analysis.
- Build stakeholder trust with repeatable, auditable quality metrics.
- Reduce rework by preventing bad data from cascading across systems.
What DQ looks like in your day-to-day
- Row counts and hash totals after each load.
- Null/duplicate/range checks on critical columns.
- Referential integrity checks between facts and dimensions.
- Anomaly alerts when volumes, delays, or distributions drift.
- Quarantine and reprocessing workflows for rejects.
- Daily/weekly DQ reports with pass/fail rates and trend lines.
Who this is for
- ETL/ELT developers building batch or streaming pipelines.
- Data engineers responsible for data ingestion and transformations.
- Analytics engineers who need reliable warehouse layers.
- QA engineers and data stewards supporting data platforms.
Prerequisites
- Comfort with SQL (joins, window functions, aggregates).
- Basic ETL/ELT concepts (staging, transformations, target models).
- Familiarity with one programming or orchestration environment (e.g., Python/PySpark, dbt, Airflow, SQL-based ETL).
Learning path and milestones
- Foundations: Define quality dimensions (completeness, accuracy, consistency, uniqueness, timeliness). Implement row counts and simple null/duplicate checks.
- Constraints & integrity: Add referential integrity and range checks; implement reject/quarantine flow.
- Monitoring: Build a DQ metrics table with pass/fail rates, volumes, and freshness; schedule reporting.
- Anomaly basics: Add simple statistical baselines (rolling averages, z-scores) for volumes and lateness.
- Root cause analysis: Create a playbook to trace failures upstream, inspect diffs, and reproduce locally.
- Maintainability: Make checks metadata-driven and versioned; document ownership and SLAs.
Worked examples
1) Row counts and reconciliation
Goal: ensure we loaded everything we expected, and nothing extra.
-- Completeness: staging vs target row counts for today's load
SELECT 'orders' AS table_name,
SUM(CASE WHEN table_ref = 'staging' THEN cnt ELSE 0 END) AS staging_cnt,
SUM(CASE WHEN table_ref = 'target' THEN cnt ELSE 0 END) AS target_cnt
FROM (
SELECT 'staging' AS table_ref, COUNT(*) AS cnt
FROM staging.orders
WHERE load_date = CURRENT_DATE
UNION ALL
SELECT 'target' AS table_ref, COUNT(*) AS cnt
FROM dw.fact_orders
WHERE load_date = CURRENT_DATE
) t;
-- Accuracy proxy: hash totals on stable fields to detect silent drift
-- (hash functions are vendor-specific; e.g., FARM_FINGERPRINT in BigQuery, HASH in Snowflake/Databricks)
SELECT
SUM(ABS(HASH(order_id))) AS id_hash_total,
SUM(order_amount) AS amount_sum
FROM staging.orders
WHERE load_date = CURRENT_DATE;
When counts match but data differs
Compare business aggregates (e.g., SUM(order_amount) by day or region) between source and target. Mismatched sums with equal row counts usually signal transformation errors or duplicated/omitted keys.
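A sketch of such a reconciliation, reusing the staging/target tables from the example above (IS DISTINCT FROM is a NULL-safe comparison; use an equivalent on engines that lack it):

```sql
-- Same row counts but different daily sums usually means a transformation
-- or duplication bug rather than a missing load
SELECT COALESCE(s.load_date, t.load_date) AS load_date,
       s.amount_sum AS staging_sum,
       t.amount_sum AS target_sum
FROM (
  SELECT load_date, SUM(order_amount) AS amount_sum
  FROM staging.orders
  GROUP BY load_date
) s
FULL OUTER JOIN (
  SELECT load_date, SUM(order_amount) AS amount_sum
  FROM dw.fact_orders
  GROUP BY load_date
) t ON s.load_date = t.load_date
WHERE s.amount_sum IS DISTINCT FROM t.amount_sum;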
2) Null, duplicate, and range checks
-- Nulls & empties on critical columns
SELECT COUNT(*) AS null_customer_id
FROM staging.orders
WHERE customer_id IS NULL;
SELECT COUNT(*) AS empty_status
FROM staging.orders
WHERE COALESCE(TRIM(status), '') = '';
-- Duplicate business keys
SELECT order_id, COUNT(*) AS c
FROM staging.orders
GROUP BY order_id
HAVING COUNT(*) > 1;
-- Range checks (numeric & date)
SELECT COUNT(*) AS bad_amount
FROM staging.orders
WHERE order_amount < 0 OR order_amount > 100000;
SELECT COUNT(*) AS out_of_window
FROM staging.orders
WHERE order_date < CURRENT_DATE - INTERVAL '365 days' OR order_date > CURRENT_DATE;
Tip: avoid false positives
Use business-approved thresholds. Example: negative amounts may be valid for refunds; model them separately or exclude by status = 'REFUND'.
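A refund-aware version of the amount check might look like this (the 'REFUND' status value is illustrative; substitute your own status codes):

```sql
-- Negative amounts are legitimate for refunds; flag them only on non-refund rows
SELECT COUNT(*) AS bad_amount
FROM staging.orders
WHERE order_amount < 0
  AND COALESCE(TRIM(status), '') <> 'REFUND';
```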
3) Referential integrity checks
-- Orphan facts: orders referencing missing customers
SELECT o.order_id
FROM dw.fact_orders o
LEFT JOIN dw.dim_customer c ON o.customer_key = c.customer_key
WHERE c.customer_key IS NULL;
-- Pre-load check in staging by natural keys
SELECT s.order_id
FROM staging.orders s
LEFT JOIN dw.dim_customer d ON s.customer_id = d.customer_id
WHERE d.customer_id IS NULL;
Handling late-arriving dimensions
Create an 'Unknown' dimension member (e.g., customer_key = 0). Load facts with 0 when the dimension is late, then backfill keys when the dimension arrives. Track counts of 0-key references and alert if above threshold.
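A sketch of this pattern, assuming illustrative dim_customer columns (customer_key, customer_id, customer_name):

```sql
-- Seed the unknown member once
INSERT INTO dw.dim_customer (customer_key, customer_id, customer_name)
SELECT 0, 'UNKNOWN', 'Unknown customer'
WHERE NOT EXISTS (SELECT 1 FROM dw.dim_customer WHERE customer_key = 0);

-- Load facts, falling back to the unknown member when the dimension is late
INSERT INTO dw.fact_orders (order_id, customer_key, order_amount, load_date)
SELECT s.order_id,
       COALESCE(d.customer_key, 0),
       s.order_amount,
       s.load_date
FROM staging.orders s
LEFT JOIN dw.dim_customer d ON s.customer_id = d.customer_id;

-- Monitor the 0-key rate and alert when it exceeds your threshold
SELECT COUNT(*) AS unknown_refs
FROM dw.fact_orders
WHERE customer_key = 0 AND load_date = CURRENT_DATE;
```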
4) Simple anomaly detection for volumes
-- Daily volume z-score: flag days with unusually high/low counts
WITH daily AS (
SELECT load_date, COUNT(*) AS cnt
FROM dw.fact_orders
GROUP BY load_date
), stats AS (
SELECT load_date, cnt,
AVG(cnt) OVER (ORDER BY load_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg7,
STDDEV(cnt) OVER (ORDER BY load_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS sd7
FROM daily
), scored AS (
SELECT load_date, cnt,
CASE WHEN sd7 = 0 OR sd7 IS NULL THEN 0 ELSE (cnt - avg7) / sd7 END AS z
FROM stats
)
SELECT load_date, cnt, z
FROM scored
WHERE ABS(z) > 3;
When to alert
Only alert if volume anomaly coincides with freshness delay (e.g., latest event_ts < now() - 2 hours). Combining signals reduces noise.
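A combined gate can be expressed directly in SQL. This sketch assumes an event_ts column on the fact table and that the z-score check writes a 'volume_zscore' row into the dq.metrics table described later in this guide; both names are illustrative:

```sql
-- Fire only when staleness and a volume anomaly co-occur (names/thresholds illustrative)
SELECT 'orders_volume_and_freshness' AS alert_name
FROM (SELECT MAX(event_ts) AS latest_event FROM dw.fact_orders) f
WHERE f.latest_event < CURRENT_TIMESTAMP - INTERVAL '2 hours'
  AND EXISTS (
    SELECT 1
    FROM dq.metrics m
    WHERE m.metric_date = CURRENT_DATE
      AND m.check_name = 'volume_zscore'
      AND m.status = 'FAIL'
  );
```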
5) Reject handling and quarantine tables
-- Quarantine table
CREATE TABLE IF NOT EXISTS dq.quarantine_orders (
batch_id STRING,
record_id STRING,
error_code STRING,
error_message STRING,
raw_payload STRING,
detected_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Insert rejects during load
INSERT INTO dq.quarantine_orders (batch_id, record_id, error_code, error_message, raw_payload)
SELECT :batch_id,
CAST(order_id AS STRING),
'NULL_CUSTOMER',
'customer_id is NULL',
TO_JSON_STRING(t) -- BigQuery syntax; substitute your engine's row-to-JSON function
FROM staging.orders t
WHERE customer_id IS NULL;
Reprocessing strategy
Fix upstream (e.g., patch missing customer), then replay rejects by batch_id. Always make reruns idempotent: use MERGE/UPSERT semantics in targets.
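A replay sketch using the quarantine table and fact columns from the examples above (MERGE syntax varies slightly by engine):

```sql
-- Replay only the rejects from a fixed batch; MERGE keeps the rerun idempotent
MERGE INTO dw.fact_orders t
USING (
  SELECT s.order_id,
         COALESCE(d.customer_key, 0) AS customer_key,
         s.order_amount,
         s.load_date
  FROM staging.orders s
  JOIN dq.quarantine_orders q
    ON q.record_id = CAST(s.order_id AS STRING)
   AND q.batch_id = :batch_id
  LEFT JOIN dw.dim_customer d ON s.customer_id = d.customer_id
  WHERE s.customer_id IS NOT NULL  -- the condition that originally failed
) src
ON t.order_id = src.order_id
WHEN MATCHED THEN UPDATE SET
  customer_key = src.customer_key,
  order_amount = src.order_amount
WHEN NOT MATCHED THEN INSERT (order_id, customer_key, order_amount, load_date)
VALUES (src.order_id, src.customer_key, src.order_amount, src.load_date);
```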
6) Data quality reporting metrics
-- DQ metrics table
CREATE TABLE IF NOT EXISTS dq.metrics (
metric_date DATE,
dataset STRING,
check_name STRING,
status STRING, -- PASS/FAIL
expected BIGINT,
observed BIGINT,
pass_rate DOUBLE,
notes STRING,
recorded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Example: record null check result
INSERT INTO dq.metrics (metric_date, dataset, check_name, status, expected, observed, pass_rate, notes)
SELECT CURRENT_DATE, 'orders', 'null_customer_id',
CASE WHEN cnt = 0 THEN 'PASS' ELSE 'FAIL' END,
0, cnt,
CASE WHEN cnt = 0 THEN 1.0 ELSE 0.0 END,
'customer_id must not be NULL'
FROM (
SELECT COUNT(*) AS cnt
FROM staging.orders
WHERE customer_id IS NULL
) s;
Reporting tips
- Use consistent check names and severities (BLOCKER, WARN).
- Trend pass_rate week over week to prove improvements.
- Store batch_id/run_id so you can link metrics to pipeline runs.
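For the week-over-week trend, a query along these lines can feed the report (DATE_TRUNC syntax varies by engine):

```sql
-- Weekly pass rate per dataset from the dq.metrics table above
SELECT DATE_TRUNC('week', metric_date) AS week_start,
       dataset,
       AVG(CASE WHEN status = 'PASS' THEN 1.0 ELSE 0.0 END) AS pass_rate
FROM dq.metrics
GROUP BY DATE_TRUNC('week', metric_date), dataset
ORDER BY week_start, dataset;
```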
Drills and exercises
- [ ] Write row count, sum(amount), and distinct(order_id) checks for any dataset you own.
- [ ] Create duplicate detection by business key and decide a de-duplication rule.
- [ ] Implement three referential integrity checks between a fact table and two dims.
- [ ] Build a quarantine table and route at least two error types into it.
- [ ] Add a daily DQ metric insert for nulls, duplicates, and freshness.
- [ ] Implement a 7-day rolling z-score on volume; test with synthetic anomalies.
- [ ] Practice a rerun: fix a source error and reprocess only affected rejects.
Common mistakes and debugging tips
- Mistake: Matching only total row counts.
  Fix: Also reconcile business aggregates by key dimensions (e.g., sums per day/region).
- Mistake: Treating empty strings as valid non-null values.
  Fix: Normalize with TRIM and NULLIF before checks.
- Mistake: Missing late-arriving dimension strategy.
  Fix: Use unknown members (0 keys) and a backfill process; alert on elevated 0-key rates.
- Mistake: Non-idempotent reruns causing duplicates.
  Fix: Always load with MERGE/UPSERT on stable keys; log batch/run IDs.
- Mistake: Over-alerting on single signals.
  Fix: Combine volume, freshness, and distribution checks; add minimum volume thresholds.
- Debug tip: Diff samples of source vs target by key to locate transformation steps that diverge.
- Debug tip: Reproduce with a minimal batch locally; strip complex joins until the issue disappears.
- Debug tip: Validate time zones and data types (e.g., implicit string-to-int truncation).
Mini project: Retail Orders DQ Guardrails
Build a minimal but complete DQ layer for a retail Orders pipeline (staging → dimension/fact).
- Define critical fields and thresholds (customer_id not null, order_amount 0–100k, order_date within 365 days).
- Implement pre-load checks in staging: nulls, duplicates by order_id, and referential check to dim_customer by natural key.
- Route violations to dq.quarantine_orders with error codes and batch_id.
- Load into dw.dim_customer and dw.fact_orders with MERGE. Use unknown dimension rows for missing references.
- Insert DQ metrics for every check (PASS/FAIL, counts, pass_rate).
- Add anomaly detection for daily order counts and freshness.
- Document a runbook: how to investigate a fail, how to reprocess rejects.
Stretch goals
- Make checks metadata-driven (table, column, check_type, threshold).
- Add severity levels and stop-the-world behavior for BLOCKERs.
- Create weekly trend report summarizing pass rates and top error codes.
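One possible shape for the metadata-driven registry (columns and enum values are illustrative, not a standard):

```sql
-- Check definitions as data: a runner reads enabled rows, generates one
-- query per check, and writes results to dq.metrics
CREATE TABLE IF NOT EXISTS dq.check_definitions (
  check_id STRING,
  dataset STRING,
  column_name STRING,
  check_type STRING,   -- e.g. NOT_NULL, UNIQUE, RANGE, FK
  threshold STRING,    -- e.g. '0..100000' for RANGE
  severity STRING,     -- BLOCKER or WARN
  owner STRING,
  enabled BOOLEAN
);
```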
Subskills
- Row Counts And Reconciliation: Compare counts and business aggregates between stages; use hash totals to detect silent drift.
- Null Duplicate And Range Checks: Enforce non-null/required fields, unique keys, and valid numeric/date ranges.
- Referential Integrity Checks: Detect orphan facts; manage late-arriving dimensions with unknown members.
- Anomaly Detection Basics: Add rolling baselines and z-scores for volume and freshness to catch unexpected change.
- Reject Handling And Quarantine Tables: Capture invalid records with error codes; support reprocessing.
- Data Quality Reporting: Record check results to metrics tables; trend pass rates and alert on breaches.
- Root Cause Analysis For Load Issues: Trace failures upstream, isolate minimal reproductions, and verify schema/typing assumptions.
- Defining And Maintaining Data Checks: Specify checks as metadata; version, document ownership, and review regularly.
Practical projects
- Subscription events: Validate event ordering, deduplicate by event_id, enforce user foreign keys, report freshness.
- Payments pipeline: Range checks on amounts, currency codes whitelist, reconcile totals per settlement date.
- Product catalog: Enforce uniqueness on SKU, check referential links to categories, quarantine invalid attributes.
Next steps
- Expand checks to slowly changing dimensions (SCD) and historical consistency.
- Introduce severity-based gating to stop bad loads automatically.
- Adopt a metadata-driven check framework and add documentation with owners and SLAs.