Why this matters
As an ETL Developer, you must prove that data moved correctly through each stage. Simple, fast checks like row counts and basic reconciliations catch broken jobs, missing partitions, and duplicate loads before stakeholders do.
- Validate that staging and warehouse tables have the expected number of rows.
- Confirm daily partitions landed (and only once).
- Verify CDC loads: inserts, updates, and deletes add up correctly.
- Spot schema or filter mistakes that silently drop data.
Concept explained simply
Row counts compare how many rows you expect versus how many you got at each step. Reconciliation verifies that source and target agree at the chosen grain (for example, per day, per key, per file) and within a tolerance.
Mental model
Imagine your pipeline as a funnel with checkpoints. At each checkpoint, measure totals and distincts. If counts drift or spike, you stop the flow and investigate. This prevents bad data from sliding further into reporting and ML.
What to measure
- Total row counts (overall and per partition/date/batch).
- Distinct business keys (for example, order_id, customer_id).
- Control totals like sums of amounts or quantities (see the sketch after this list).
- Rejected vs. accepted rows after validation rules.
- Orphans: foreign keys in facts with no matching dimensions.
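Several of these measures can come from a single per-table query. The sketch below assumes dwh.fact_sales carries sale_date, order_id, and an amount column; adapt the names to your schema.
-- Per-partition totals, distinct business keys, and a control sum (column names are assumptions)
SELECT sale_date,
       COUNT(*)                 AS total_rows,
       COUNT(DISTINCT order_id) AS distinct_orders,
       SUM(amount)              AS amount_total
FROM dwh.fact_sales
GROUP BY sale_date;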
Choosing tolerances
Some systems vary slightly day to day. Define per-check tolerances (for example, 0% for an exact match, or 1–2% for clickstream data). Document the rationale and apply it consistently.
Worked examples
Example 1: Full-load staging to warehouse
Goal: Ensure staging.customer rows match dwh.dim_customer after a full load.
-- Staging count
SELECT COUNT(*) AS src_cnt FROM staging.customer;
-- Target count
SELECT COUNT(*) AS tgt_cnt FROM dwh.dim_customer;
-- Reconcile
WITH s AS (SELECT COUNT(*) AS c FROM staging.customer),
t AS (SELECT COUNT(*) AS c FROM dwh.dim_customer)
SELECT s.c AS src_cnt, t.c AS tgt_cnt, (t.c - s.c) AS delta,
CASE WHEN t.c = s.c THEN 'PASS' ELSE 'FAIL' END AS status
FROM s, t;
If the check FAILs, investigate next: rejected rows, primary key violations, or transform filters.
Example 2: CDC (inserts/updates/deletes)
Goal: Validate that net target rows reflect CDC operations for a day.
-- Expected target change
-- expected = prior_target + inserts - deletes
-- Validate counts by day_key
WITH cdc AS (
SELECT day_key,
SUM(CASE WHEN op = 'I' THEN 1 ELSE 0 END) AS ins,
SUM(CASE WHEN op = 'D' THEN 1 ELSE 0 END) AS del
FROM staging.orders_cdc
GROUP BY day_key
), hist AS (
SELECT day_key, COUNT(*) AS tgt_rows
FROM dwh.orders
GROUP BY day_key
)
SELECT c.day_key, c.ins, c.del, h.tgt_rows
FROM cdc c
JOIN hist h ON h.day_key = c.day_key;
Cross-check yesterday vs. the day before to see if Δ target rows ≈ inserts − deletes. If updates exist, they should not change counts unless your logic includes type-2 rows.
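One way to apply the expected = prior_target + inserts − deletes rule is to compare today's total row count against a previously recorded total, for example one stored in the load_control table introduced later in this lesson. In the sketch below, the day_key literal, check_name, and run_id filters are illustrative assumptions.
-- Net-change check for one CDC load (a sketch; day_key, check_name, and run_id values are illustrative)
WITH cdc AS (
  SELECT SUM(CASE WHEN op = 'I' THEN 1 ELSE 0 END) AS ins,
         SUM(CASE WHEN op = 'D' THEN 1 ELSE 0 END) AS del
  FROM staging.orders_cdc
  WHERE day_key = 20260107                -- the load being validated; key format is an assumption
), prior AS (
  SELECT tgt_count AS prior_rows          -- total target rows recorded after the previous run
  FROM dwh.load_control
  WHERE check_name = 'orders_total_rows'
    AND run_id = '2026-01-06_daily'
), curr AS (
  SELECT COUNT(*) AS tgt_rows FROM dwh.orders
)
SELECT p.prior_rows, c.ins, c.del, t.tgt_rows,
       p.prior_rows + c.ins - c.del AS expected_rows,
       CASE WHEN t.tgt_rows = p.prior_rows + c.ins - c.del THEN 'PASS' ELSE 'FAIL' END AS status
FROM prior p, cdc c, curr t;
If updates are applied as SCD2 versions, add the count of new version rows to the expected total.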
Example 3: Partition validation with tolerance
Goal: Validate daily partitions of fact_sales with ±1% tolerance vs. source.
WITH src AS (
SELECT sale_date, COUNT(*) AS src_cnt
FROM src.sales
WHERE sale_date BETWEEN DATE '2026-01-01' AND DATE '2026-01-07'
GROUP BY sale_date
), tgt AS (
SELECT sale_date, COUNT(*) AS tgt_cnt
FROM dwh.fact_sales
WHERE sale_date BETWEEN DATE '2026-01-01' AND DATE '2026-01-07'
GROUP BY sale_date
)
SELECT s.sale_date, s.src_cnt, t.tgt_cnt,
(t.tgt_cnt - s.src_cnt) AS delta,
ROUND(100.0 * ABS(t.tgt_cnt - s.src_cnt) / NULLIF(s.src_cnt,0), 2) AS pct_diff,
CASE WHEN s.src_cnt = 0 AND t.tgt_cnt = 0 THEN 'PASS'
WHEN s.src_cnt = 0 THEN 'FAIL'
WHEN ABS(t.tgt_cnt - s.src_cnt) <= 0.01 * s.src_cnt THEN 'PASS'
ELSE 'FAIL' END AS status
FROM src s LEFT JOIN tgt t USING (sale_date);
How to implement (step-by-step)
- Define the grain: per file, per day, per batch, or per business key. Keep it consistent along the pipeline.
- Write source queries: totals, distincts, control sums. Record filters used in the job.
- Create a control table to store checks per run:
-- Example control table
CREATE TABLE dwh.load_control (
  run_id        VARCHAR(50),
  check_name    VARCHAR(200),
  grain_key     VARCHAR(200),
  src_count     BIGINT,
  tgt_count     BIGINT,
  delta_count   BIGINT,
  pct_diff      DECIMAL(9,2),
  tolerance_pct DECIMAL(5,2),
  status        VARCHAR(10),
  checked_at    TIMESTAMP
);
- Validate per stage: source → landing → staging → warehouse. Fail fast on mismatches.
- Reconcile exceptions: locate missing keys, duplicates, and rejected rows:
-- Missing keys
SELECT s.key
FROM staging.entity s
LEFT JOIN dwh.entity t ON t.key = s.key
WHERE t.key IS NULL;
-- Duplicates
SELECT key, COUNT(*)
FROM staging.entity
GROUP BY key
HAVING COUNT(*) > 1;
- Automate: run checks, insert results into load_control, and alert on FAIL (see the sketch after this list).
- Review: compare trend lines over time to detect drift.
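As a minimal sketch of the automate step, the statement below records one per-partition check into load_control; the run_id, check_name, and 1% tolerance are illustrative values that your orchestration tool would normally supply per run.
-- Record one per-partition row-count check (run_id, check_name, and tolerance are illustrative)
INSERT INTO dwh.load_control
  (run_id, check_name, grain_key, src_count, tgt_count, delta_count,
   pct_diff, tolerance_pct, status, checked_at)
SELECT
  '2026-01-07_daily'                AS run_id,
  'fact_sales_rowcount'             AS check_name,
  CAST(s.sale_date AS VARCHAR(200)) AS grain_key,
  s.src_cnt,
  t.tgt_cnt,
  t.tgt_cnt - s.src_cnt             AS delta_count,
  ROUND(100.0 * ABS(t.tgt_cnt - s.src_cnt) / NULLIF(s.src_cnt, 0), 2) AS pct_diff,
  1.00                              AS tolerance_pct,
  CASE WHEN ABS(t.tgt_cnt - s.src_cnt) <= 0.01 * s.src_cnt THEN 'PASS' ELSE 'FAIL' END AS status,
  CURRENT_TIMESTAMP                 AS checked_at
FROM (SELECT sale_date, COUNT(*) AS src_cnt FROM src.sales      GROUP BY sale_date) s
JOIN (SELECT sale_date, COUNT(*) AS tgt_cnt FROM dwh.fact_sales GROUP BY sale_date) t
  ON t.sale_date = s.sale_date;
A scheduled job can then alert whenever a row lands with status = 'FAIL', and the stored history feeds the trend review in the last step.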
Practice exercises
Complete these in SQL or your preferred query tool. They mirror the worked examples above and the Quick Test.
Exercise 1: Partition row-count reconciliation
Goal: Produce src_count, tgt_count, delta, and status for yesterday’s load_date across source and dwh tables. Tolerance 0%.
Expected output
A result set with columns: load_date, src_count, tgt_count, delta, status. One row for yesterday.
Hints
- Aggregate both tables by load_date.
- Join by load_date and compute delta = tgt − src.
- CASE for PASS/FAIL where delta = 0.
Solution
WITH s AS (
SELECT load_date, COUNT(*) AS src_count
FROM src.sales
WHERE load_date = CURRENT_DATE - INTERVAL '1 day'
GROUP BY load_date
), t AS (
SELECT load_date, COUNT(*) AS tgt_count
FROM dwh.fact_sales
WHERE load_date = CURRENT_DATE - INTERVAL '1 day'
GROUP BY load_date
)
SELECT COALESCE(s.load_date, t.load_date) AS load_date,
COALESCE(s.src_count, 0) AS src_count,
COALESCE(t.tgt_count, 0) AS tgt_count,
COALESCE(t.tgt_count, 0) - COALESCE(s.src_count, 0) AS delta,
CASE WHEN COALESCE(t.tgt_count, 0) = COALESCE(s.src_count, 0) THEN 'PASS' ELSE 'FAIL' END AS status
FROM s FULL OUTER JOIN t USING (load_date);
Exercise 2: Find missing dimension keys after load
Goal: After loading dwh.dim_customer, list customer_ids present in source but missing in target; also count orphans in dwh.fact_sales.
Expected output
- A list of missing customer_id values (or count).
- An integer count of orphan fact rows.
Hints
- LEFT JOIN source to target to find missing.
- LEFT JOIN fact to dim to find orphans.
- Add WHERE t.key IS NULL filter.
Solution
-- Missing in dim
SELECT s.customer_id
FROM src.customer s
LEFT JOIN dwh.dim_customer d ON d.customer_id = s.customer_id
WHERE d.customer_id IS NULL;
-- Orphan facts
SELECT COUNT(*) AS orphan_facts
FROM dwh.fact_sales f
LEFT JOIN dwh.dim_customer d ON d.customer_id = f.customer_id
WHERE d.customer_id IS NULL;
Checklist
- Grain defined and documented.
- Counts for both sides computed at the same filter level.
- Delta and status derived, not eyeballed.
- Edge cases handled (no rows, null keys).
Common mistakes and self-check
- Comparing different filters (for example, the source includes cancellations while the target excludes them). Self-check: print the exact WHERE clause used by the job.
- Ignoring null or default dates in partition checks. Self-check: add a bucket for unknown/NULL partitions (see the sketch after this list).
- Assuming updates don’t affect counts in SCD2. Self-check: track type-2 row creations separately.
- Not handling zero-source days. Self-check: define PASS rule when both sides are zero.
- Only checking totals, not distinct keys. Self-check: compare distinct id counts as well.
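For the NULL-partition self-check, one pattern is to bucket unknown dates under an explicit sentinel so they stay visible in reconciliation output instead of silently dropping out; the sentinel value below is an assumption.
-- Count rows with NULL partition dates under a sentinel bucket (sentinel value is an assumption)
SELECT COALESCE(sale_date, DATE '1900-01-01') AS sale_date_bucket,
       COUNT(*) AS row_cnt
FROM dwh.fact_sales
GROUP BY COALESCE(sale_date, DATE '1900-01-01');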
Practical projects
- Build a load_control framework: capture counts and status per batch, store to a table, and generate a daily reconciliation report.
- Add CDC-aware reconciliation: validate inserts, updates, deletes per day and flag anomalies (for example, negative counts, unexpected spikes).
- Dimension-fact integrity audit: nightly job that finds missing dimension keys and summarizes orphan counts by partition.
Mini challenge
You loaded a day with these metrics: source_count=100, insert_count=102, update_count=5, reject_count=2, target_count increased by 100. Tolerance is 0%. Pass or fail? What’s the likely issue?
Guidance
Expected net change should be inserts − rejects if updates don't add rows: 102 − 2 = +100, which matches the target increase despite the 5 updates. PASS if updates do not create new rows; otherwise verify that the updates are SCD2 changes that should add new versions. Also investigate why insert_count (102) exceeds source_count (100); a transform join fanning out rows or reprocessed records is a likely cause.
Who this is for
- ETL Developers and Data Engineers who need strong operational data quality checks.
- Analytics Engineers who validate staging-to-mart pipelines.
Prerequisites
- Comfortable with SQL (aggregations, joins).
- Basic understanding of ETL stages and partitioning.
- Awareness of CDC and SCD concepts.
Learning path
- Start: Row count basics and per-partition checks (this lesson).
- Next: Control totals (sums, hashes) and referential integrity.
- Advance: Trend monitoring, anomaly thresholds, and alerting.
- Capstone: Build a reusable reconciliation module integrated into orchestration.
Next steps
- Do the quick test below to check your understanding. The test is available to everyone; only logged-in users will have their progress saved.
- Apply the example queries to one of your actual pipelines today.
- Document your tolerance rules and share them with your team.