Why this matters
As an ETL Developer, you must prove that data moved correctly through each stage. Simple, fast checks like row counts and basic reconciliations catch broken jobs, missing partitions, and duplicate loads before stakeholders do.
- Validate that staging and warehouse tables have the expected number of rows.
- Confirm daily partitions landed (and only once).
- Verify CDC loads: inserts, updates, and deletes add up correctly.
- Spot schema or filter mistakes that silently drop data.
Concept explained simply
Row counts compare how many rows you expect versus how many you got at each step. Reconciliation verifies that source and target agree at the chosen grain (for example, per day, per key, per file) and within a tolerance.
Mental model
Imagine your pipeline as a funnel with checkpoints. At each checkpoint, measure totals and distincts. If counts drift or spike, you stop the flow and investigate. This prevents bad data from sliding further into reporting and ML.
What to measure
- Total row counts (overall and per partition/date/batch).
- Distinct business keys (for example, order_id, customer_id).
- Control totals like sums of amounts or quantities (see the sketch after this list).
- Rejected vs. accepted rows after validation rules.
- Orphans: foreign keys in facts with no matching dimensions.
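Several of these measures can come from a single per-table query. The sketch below assumes dwh.fact_sales carries sale_date, order_id, and an amount column; adapt the names to your schema.
-- Per-partition totals, distinct business keys, and a control sum (column names are assumptions)
SELECT sale_date,
       COUNT(*)                 AS total_rows,
       COUNT(DISTINCT order_id) AS distinct_orders,
       SUM(amount)              AS amount_total
FROM dwh.fact_sales
GROUP BY sale_date;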
Choosing tolerances
Some systems vary slightly day to day. Define per-check tolerances (for example, 0% for an exact match, or 1–2% for clickstream data). Document the rationale and apply it consistently.
Worked examples
Example 1: Full-load staging to warehouse
Goal: Ensure staging.customer rows match dwh.dim_customer after a full load.
-- Staging count
SELECT COUNT(*) AS src_cnt FROM staging.customer;
-- Target count
SELECT COUNT(*) AS tgt_cnt FROM dwh.dim_customer;
-- Reconcile
WITH s AS (SELECT COUNT(*) AS c FROM staging.customer),
t AS (SELECT COUNT(*) AS c FROM dwh.dim_customer)
SELECT s.c AS src_cnt, t.c AS tgt_cnt, (t.c - s.c) AS delta,
CASE WHEN t.c = s.c THEN 'PASS' ELSE 'FAIL' END AS status
FROM s, t;
If the check FAILs, investigate next: rejected rows, primary key violations, or transform filters.
Example 2: CDC (inserts/updates/deletes)
Goal: Validate that net target rows reflect CDC operations for a day.
-- Expected target change
-- expected = prior_target + inserts - deletes
-- Validate counts by day_key
WITH cdc AS (
SELECT day_key,
SUM(CASE WHEN op = 'I' THEN 1 ELSE 0 END) AS ins,
SUM(CASE WHEN op = 'D' THEN 1 ELSE 0 END) AS del
FROM staging.orders_cdc
GROUP BY day_key
), hist AS (
SELECT day_key, COUNT(*) AS tgt_rows
FROM dwh.orders
GROUP BY day_key
)
SELECT c.day_key, c.ins, c.del, h.tgt_rows
FROM cdc c
JOIN hist h ON h.day_key = c.day_key;
Cross-check yesterday vs. the day before to see if Δ target rows ≈ inserts − deletes. If updates exist, they should not change counts unless your logic includes type-2 rows.
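One way to apply the expected = prior_target + inserts − deletes rule is to compare today's total row count against a previously recorded total, for example one stored in the load_control table introduced later in this lesson. In the sketch below, the day_key literal, check_name, and run_id filters are illustrative assumptions.
-- Net-change check for one CDC load (a sketch; day_key, check_name, and run_id values are illustrative)
WITH cdc AS (
  SELECT SUM(CASE WHEN op = 'I' THEN 1 ELSE 0 END) AS ins,
         SUM(CASE WHEN op = 'D' THEN 1 ELSE 0 END) AS del
  FROM staging.orders_cdc
  WHERE day_key = 20260107                -- the load being validated; key format is an assumption
), prior AS (
  SELECT tgt_count AS prior_rows          -- total target rows recorded after the previous run
  FROM dwh.load_control
  WHERE check_name = 'orders_total_rows'
    AND run_id = '2026-01-06_daily'
), curr AS (
  SELECT COUNT(*) AS tgt_rows FROM dwh.orders
)
SELECT p.prior_rows, c.ins, c.del, t.tgt_rows,
       p.prior_rows + c.ins - c.del AS expected_rows,
       CASE WHEN t.tgt_rows = p.prior_rows + c.ins - c.del THEN 'PASS' ELSE 'FAIL' END AS status
FROM prior p, cdc c, curr t;
If updates are applied as SCD2 versions, add the count of new version rows to the expected total.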
Example 3: Partition validation with tolerance
Goal: Validate daily partitions of fact_sales with ±1% tolerance vs. source.
WITH src AS (
SELECT sale_date, COUNT(*) AS src_cnt
FROM src.sales
WHERE sale_date BETWEEN DATE '2026-01-01' AND DATE '2026-01-07'
GROUP BY sale_date
), tgt AS (
SELECT sale_date, COUNT(*) AS tgt_cnt
FROM dwh.fact_sales
WHERE sale_date BETWEEN DATE '2026-01-01' AND DATE '2026-01-07'
GROUP BY sale_date
)
SELECT s.sale_date, s.src_cnt, t.tgt_cnt,
(t.tgt_cnt - s.src_cnt) AS delta,
ROUND(100.0 * ABS(t.tgt_cnt - s.src_cnt) / NULLIF(s.src_cnt,0), 2) AS pct_diff,
CASE WHEN s.src_cnt = 0 AND t.tgt_cnt = 0 THEN 'PASS'
WHEN s.src_cnt = 0 THEN 'FAIL'
WHEN ABS(t.tgt_cnt - s.src_cnt) <= 0.01 * s.src_cnt THEN 'PASS'
ELSE 'FAIL' END AS status
FROM src s LEFT JOIN tgt t USING (sale_date);
How to implement (step-by-step)
- Define the grain: per file, per day, per batch, or per business key. Keep it consistent along the pipeline.
- Write source queries: totals, distincts, control sums. Record filters used in the job.
- Create a control table to store checks per run:
-- Example control table
CREATE TABLE dwh.load_control (
  run_id        VARCHAR(50),
  check_name    VARCHAR(200),
  grain_key     VARCHAR(200),
  src_count     BIGINT,
  tgt_count     BIGINT,
  delta_count   BIGINT,
  pct_diff      DECIMAL(9,2),
  tolerance_pct DECIMAL(5,2),
  status        VARCHAR(10),
  checked_at    TIMESTAMP
);
- Validate per stage: source → landing → staging → warehouse. Fail fast on mismatches.
- Reconcile exceptions: locate missing keys, duplicates, and rejected rows:
-- Missing keys
SELECT s.key
FROM staging.entity s
LEFT JOIN dwh.entity t ON t.key = s.key
WHERE t.key IS NULL;
-- Duplicates
SELECT key, COUNT(*)
FROM staging.entity
GROUP BY key
HAVING COUNT(*) > 1;
- Automate: run checks, insert results into load_control, and alert on FAIL (see the sketch after this list).
- Review: compare trend lines over time to detect drift.
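As a minimal sketch of the automate step, the statement below records one per-partition check into load_control; the run_id, check_name, and 1% tolerance are illustrative values that your orchestration tool would normally supply per run.
-- Record one per-partition row-count check (run_id, check_name, and tolerance are illustrative)
INSERT INTO dwh.load_control
  (run_id, check_name, grain_key, src_count, tgt_count, delta_count,
   pct_diff, tolerance_pct, status, checked_at)
SELECT
  '2026-01-07_daily'                AS run_id,
  'fact_sales_rowcount'             AS check_name,
  CAST(s.sale_date AS VARCHAR(200)) AS grain_key,
  s.src_cnt,
  t.tgt_cnt,
  t.tgt_cnt - s.src_cnt             AS delta_count,
  ROUND(100.0 * ABS(t.tgt_cnt - s.src_cnt) / NULLIF(s.src_cnt, 0), 2) AS pct_diff,
  1.00                              AS tolerance_pct,
  CASE WHEN ABS(t.tgt_cnt - s.src_cnt) <= 0.01 * s.src_cnt THEN 'PASS' ELSE 'FAIL' END AS status,
  CURRENT_TIMESTAMP                 AS checked_at
FROM (SELECT sale_date, COUNT(*) AS src_cnt FROM src.sales      GROUP BY sale_date) s
JOIN (SELECT sale_date, COUNT(*) AS tgt_cnt FROM dwh.fact_sales GROUP BY sale_date) t
  ON t.sale_date = s.sale_date;
A scheduled job can then alert whenever a row lands with status = 'FAIL', and the stored history feeds the trend review in the last step.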
Practice exercises
Complete these in SQL or your preferred query tool. They mirror the worked examples above and the Quick Test.
Exercise 1: Partition row-count reconciliation
Goal: Produce src_count, tgt_count, delta, and status for yesterday’s load_date across source and dwh tables. Tolerance 0%.
Expected output
A result set with columns: load_date, src_count, tgt_count, delta, status. One row for yesterday.
Hints
- Aggregate both tables by load_date.
- Join by load_date and compute delta = tgt − src.
- CASE for PASS/FAIL where delta = 0.
Solution
WITH s AS (
SELECT load_date, COUNT(*) AS src_count
FROM src.sales
WHERE load_date = CURRENT_DATE - INTERVAL '1 day'
GROUP BY load_date
), t AS (
SELECT load_date, COUNT(*) AS tgt_count
FROM dwh.fact_sales
WHERE load_date = CURRENT_DATE - INTERVAL '1 day'
GROUP BY load_date
)
SELECT COALESCE(s.load_date, t.load_date) AS load_date,
COALESCE(s.src_count, 0) AS src_count,
COALESCE(t.tgt_count, 0) AS tgt_count,
COALESCE(t.tgt_count, 0) - COALESCE(s.src_count, 0) AS delta,
CASE WHEN COALESCE(t.tgt_count, 0) = COALESCE(s.src_count, 0) THEN 'PASS' ELSE 'FAIL' END AS status
FROM s FULL OUTER JOIN t USING (load_date);
Exercise 2: Find missing dimension keys after load
Goal: After loading dwh.dim_customer, list customer_ids present in source but missing in target; also count orphans in dwh.fact_sales.
Expected output
- A list of missing customer_id values (or count).
- An integer count of orphan fact rows.
Hints
- LEFT JOIN source to target to find missing.
- LEFT JOIN fact to dim to find orphans.
- Add WHERE t.key IS NULL filter.
Solution
-- Missing in dim
SELECT s.customer_id
FROM src.customer s
LEFT JOIN dwh.dim_customer d ON d.customer_id = s.customer_id
WHERE d.customer_id IS NULL;
-- Orphan facts
SELECT COUNT(*) AS orphan_facts
FROM dwh.fact_sales f
LEFT JOIN dwh.dim_customer d ON d.customer_id = f.customer_id
WHERE d.customer_id IS NULL;
Checklist
- Grain defined and documented.
- Counts for both sides computed at the same filter level.
- Delta and status derived, not eyeballed.
- Edge cases handled (no rows, null keys).
Common mistakes and self-check
- Comparing different filters (for example, the source includes cancellations while the target excludes them). Self-check: print the exact WHERE clause used by the job.
- Ignoring null or default dates in partition checks. Self-check: add a bucket for unknown/NULL partitions (see the sketch after this list).
- Assuming updates don’t affect counts in SCD2. Self-check: track type-2 row creations separately.
- Not handling zero-source days. Self-check: define PASS rule when both sides are zero.
- Only checking totals, not distinct keys. Self-check: compare distinct id counts as well.
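For the NULL-partition self-check, one pattern is to bucket unknown dates under an explicit sentinel so they stay visible in reconciliation output instead of silently dropping out; the sentinel value below is an assumption.
-- Count rows with NULL partition dates under a sentinel bucket (sentinel value is an assumption)
SELECT COALESCE(sale_date, DATE '1900-01-01') AS sale_date_bucket,
       COUNT(*) AS row_cnt
FROM dwh.fact_sales
GROUP BY COALESCE(sale_date, DATE '1900-01-01');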
Practical projects
- Build a load_control framework: capture counts and status per batch, store to a table, and generate a daily reconciliation report.
- Add CDC-aware reconciliation: validate inserts, updates, deletes per day and flag anomalies (for example, negative counts, unexpected spikes).
- Dimension-fact integrity audit: nightly job that finds missing dimension keys and summarizes orphan counts by partition.
Mini challenge
You loaded a day with these metrics: source_count=100, insert_count=102, update_count=5, reject_count=2, target_count increased by 100. Tolerance is 0%. Pass or fail? What’s the likely issue?
Guidance
Expected net change should be inserts − rejects if updates don't add rows: 102 − 2 = +100, which matches the target increase despite the 5 updates. PASS if updates do not create new rows; otherwise verify that the updates are SCD2 changes that should add new versions. Also investigate why insert_count (102) exceeds source_count (100); a transform join fanning out rows or reprocessed records is a likely cause.
Who this is for
- ETL Developers and Data Engineers who need strong operational data quality checks.
- Analytics Engineers who validate staging-to-mart pipelines.
Prerequisites
- Comfortable with SQL (aggregations, joins).
- Basic understanding of ETL stages and partitioning.
- Awareness of CDC and SCD concepts.
Learning path
- Start: Row count basics and per-partition checks (this lesson).
- Next: Control totals (sums, hashes) and referential integrity.
- Advance: Trend monitoring, anomaly thresholds, and alerting.
- Capstone: Build a reusable reconciliation module integrated into orchestration.
Next steps
- Do the quick test below to check your understanding. The test is available to everyone; only logged-in users will have their progress saved.
- Apply the example queries to one of your actual pipelines today.
- Document your tolerance rules and share them with your team.