
Reconciliation Source To Target

Learn Reconciliation Source To Target for free with explanations, exercises, and a quick test (for Data Engineers).

Published: January 8, 2026 | Updated: January 8, 2026

Why this matters

Reconciliation (Source to Target) proves that data moved or transformed is complete and accurate. As a Data Engineer, you will:

  • Verify that daily warehouse loads match the raw source (counts and totals).
  • Check CDC pipelines so inserts, updates, and deletes in source appear correctly downstream.
  • Provide audit evidence for regulatory or financial reporting.
  • Catch drift, late-arriving events, duplicate loads, and mapping errors before stakeholders are affected.

Concept explained simply

Reconciliation compares the source dataset to the target dataset after ingestion or transformation. You confirm:

  • Completeness: every expected record made it to the target (no missing or extra rows).
  • Accuracy: key fields and aggregated values are preserved (within allowed tolerances).
  • Timeliness: the target reflects source changes within an agreed delay window.

Mental model

Think in three buckets after aligning by primary key and time window:

  • Same: rows present in both source and target with matching values.
  • Missing: rows in source not found in target (incompleteness).
  • Extra: rows in target not found in source (duplicates or stale data).

Then check control totals (sums, counts) and field-level equality for accuracy. For CDC, also confirm deletes, effective dates, and late-arriving data handling.
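The three buckets can be sketched with plain set operations. This is a minimal illustration assuming small in-memory snapshots keyed by primary key; a real pipeline would do the same alignment in SQL or a dataframe engine, and the function name is hypothetical.

```python
def reconcile_buckets(source: dict, target: dict) -> dict:
    """source/target map primary key -> row values (e.g., a tuple of columns)."""
    src_keys, tgt_keys = set(source), set(target)
    missing = src_keys - tgt_keys            # in source, not in target (incomplete)
    extra = tgt_keys - src_keys              # in target, not in source (dupes/stale)
    common = src_keys & tgt_keys
    same = {k for k in common if source[k] == target[k]}
    changed = common - same                  # present in both, values differ
    return {"same": same, "missing": missing, "extra": extra, "changed": changed}

# Toy snapshots: order_id -> (customer, amount)
src = {1: ("A", 10.0), 2: ("B", 20.0), 3: ("C", 30.0)}
tgt = {1: ("A", 10.0), 2: ("B", 25.0), 4: ("D", 40.0)}
buckets = reconcile_buckets(src, tgt)
```

The `changed` bucket is where accuracy checks (hash diffs, field comparisons) take over from completeness checks.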

Methods and patterns

  • Partitioned counts and totals: compare COUNT(*) and SUM(amount) by date or batch_id.
  • Set comparisons: source MINUS target (missing), target MINUS source (extra).
  • Hash diffs: compute a stable hash of selected columns per primary key; compare hashes to find changed rows.
  • Thresholds/tolerances: allow minor numeric differences from rounding or currency conversions (e.g., difference <= 0.01 per row).
  • Watermarks/windows: reconcile only up to a safe point-in-time (e.g., source updated_at <= T) to handle late events.
  • Referential checks: ensure foreign keys in facts match dimension surrogate keys (no orphans).
  • Schema drift detection: compare expected columns, data types, and nullability.
  • Idempotency checks: detect duplicate loads via batch_id, checksum files, or row-level uniqueness.
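The idempotency idea in the last bullet can be sketched by checksumming a batch and rejecting re-delivery. The names (`batch_checksum`, `is_duplicate_load`, the in-memory `seen` set) are illustrative; a real pipeline would persist checksums in a control table.

```python
import hashlib

def batch_checksum(rows) -> str:
    """Stable checksum over a batch: sort rows, join with a delimiter, hash."""
    payload = "\n".join("|".join(map(str, r)) for r in sorted(rows))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

def is_duplicate_load(rows, seen_checksums: set) -> bool:
    """Return True if this exact batch was already loaded; record it otherwise."""
    h = batch_checksum(rows)
    if h in seen_checksums:
        return True
    seen_checksums.add(h)
    return False

seen = set()
batch = [(1, "A", 10.0), (2, "B", 20.0)]
first = is_duplicate_load(batch, seen)    # new batch
second = is_duplicate_load(batch, seen)   # same batch re-delivered
```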

Worked examples

Example 1: Partitioned row counts and control totals

Goal: Confirm the Sales fact table for 2025-12-01 is complete.

-- Source control totals
SELECT sale_date, COUNT(*) AS src_rows, SUM(amount) AS src_sum
FROM src.sales
WHERE sale_date = DATE '2025-12-01'
GROUP BY sale_date;

-- Target control totals
SELECT sale_date, COUNT(*) AS tgt_rows, SUM(amount) AS tgt_sum
FROM dw.fct_sales
WHERE sale_date = DATE '2025-12-01'
GROUP BY sale_date;

If rows match but sums differ by a few cents, apply a tolerance rule (e.g., absolute difference <= 0.5 USD for the partition). If counts do not match, proceed to set comparisons to find missing/extra rows.
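The decision rule described above can be written as a small predicate: counts must match exactly, while sums may differ by up to a configured absolute tolerance. This is a sketch; the 0.50 default and the function name are illustrative choices, not a standard.

```python
def partition_passes(src_rows: int, tgt_rows: int,
                     src_sum: float, tgt_sum: float,
                     sum_tolerance: float = 0.50) -> bool:
    """PASS when counts match exactly and sums agree within the tolerance."""
    if src_rows != tgt_rows:
        return False                     # count mismatch: go find missing/extra rows
    return abs(src_sum - tgt_sum) <= sum_tolerance

ok = partition_passes(1000, 1000, 52340.10, 52340.55)   # sum diff 0.45 <= 0.50
bad = partition_passes(1000, 999, 52340.10, 52340.10)   # count mismatch
```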

Example 2: Set comparison to find missing and extra rows
-- Missing rows: in source, not in target
SELECT s.order_id
FROM src.orders s
LEFT JOIN dw.fct_orders t ON t.order_id = s.order_id
WHERE t.order_id IS NULL;

-- Extra rows: in target, not in source (potential duplicates or stale)
SELECT t.order_id
FROM dw.fct_orders t
LEFT JOIN src.orders s ON s.order_id = t.order_id
WHERE s.order_id IS NULL;

Investigate root causes: ingestion filters, CDC gaps, or incorrect joins.

Example 3: Hash-diff to detect changed records

Goal: Detect any row where selected business columns changed.

-- Example using a stable concatenation; adjust function names to your SQL dialect
WITH src_h AS (
  SELECT order_id,
         MD5(CONCAT_WS('|', customer_id, status, CAST(amount AS VARCHAR), TO_CHAR(updated_at, 'YYYY-MM-DD"T"HH24:MI:SS'))) AS h
  FROM src.orders
),
tgt_h AS (
  SELECT order_id,
         MD5(CONCAT_WS('|', customer_id, status, CAST(amount AS VARCHAR), TO_CHAR(updated_at, 'YYYY-MM-DD"T"HH24:MI:SS'))) AS h
  FROM dw.fct_orders
)
SELECT s.order_id
FROM src_h s
JOIN tgt_h t USING(order_id)
WHERE s.h <> t.h;

Review differences and verify the transformation logic (e.g., currency conversion, rounding) or late-arriving updates.
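The same hash-diff can be sketched in Python, which makes the two prerequisites explicit: normalize NULLs and fix numeric formatting before hashing. This mirrors the MD5(CONCAT_WS(...)) pattern above; the column order, the two-decimal scale, and the function name are assumptions that must be identical on both sides.

```python
import hashlib

def row_hash(customer_id, status, amount, updated_at_iso) -> str:
    """Stable per-row hash over selected business columns."""
    def s(v):
        return "" if v is None else str(v)           # normalize NULLs
    amount_s = "" if amount is None else f"{amount:.2f}"  # fixed numeric scale
    payload = "|".join([s(customer_id), s(status), amount_s, s(updated_at_iso)])
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

h1 = row_hash("C1", "SHIPPED", 19.9, "2025-12-01T10:00:00")
h2 = row_hash("C1", "SHIPPED", 19.90, "2025-12-01T10:00:00")  # same after scaling
h3 = row_hash("C1", "SHIPPED", 19.91, "2025-12-01T10:00:00")  # genuine change
```

Without the fixed scale, 19.9 and 19.90 would hash differently and produce false diffs.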

Example 4: CDC windowed reconciliation

When events can arrive late, reconcile only up to a watermark.

-- Choose watermark T (e.g., now - 2 hours)
WITH s AS (
  SELECT * FROM src.orders WHERE updated_at <= :T
), t AS (
  SELECT * FROM dw.fct_orders WHERE updated_at_source <= :T
)
SELECT (SELECT COUNT(*) FROM s) AS src_rows,
       (SELECT COUNT(*) FROM t) AS tgt_rows;
-- Then perform set and hash checks within the same T window.

This avoids false alarms caused by in-flight late events.
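The watermark itself is simple to compute and apply. In this sketch the 2-hour lag is an illustrative choice (tune it to your pipeline's observed lateness), and the helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

def watermark(now: datetime, lag: timedelta = timedelta(hours=2)) -> datetime:
    """T = now minus a safety lag; reconcile only rows at or before T."""
    return now - lag

def within_window(rows, ts_index: int, t: datetime):
    """Keep only rows whose timestamp column is <= the watermark T."""
    return [r for r in rows if r[ts_index] <= t]

now = datetime(2025, 12, 1, 12, 0, tzinfo=timezone.utc)
t = watermark(now)                                            # 10:00 UTC
rows = [
    (1, datetime(2025, 12, 1, 9, 30, tzinfo=timezone.utc)),   # inside window
    (2, datetime(2025, 12, 1, 11, 45, tzinfo=timezone.utc)),  # too recent, skip
]
stable = within_window(rows, 1, t)
```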

Who this is for

  • Data Engineers building batch or streaming pipelines.
  • Data Quality engineers implementing automated checks.

Prerequisites

  • Comfort with SQL joins, aggregates, and windowing.
  • Understanding of primary keys, foreign keys, and CDC basics.
  • Familiarity with partitioning (by date, batch_id, etc.).

Learning path

  1. Start with partitioned counts and sums checks.
  2. Learn set comparisons (source minus target, and vice versa).
  3. Add hash-diff checks for field-level accuracy.
  4. Implement tolerances and watermarks for practical reliability.
  5. Automate reports and alerts with clear, actionable outputs.

Step-by-step: a dependable reconciliation routine

  1. Scope: choose table(s), partition(s), and time window.
  2. Completeness: compare counts by partition and find missing/extra rows.
  3. Accuracy: compare control totals and hash-diffs on critical columns.
  4. Constraints: validate referential integrity and uniqueness.
  5. Windows: apply watermark for late data; recheck after window closes.
  6. Thresholds: define acceptable tolerances and alert conditions.
  7. Report: produce a concise pass/fail summary with diagnostics.
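The routine above, reduced to the step-7 pass/fail summary, might look like this sketch. The check names and report structure are illustrative, not a standard API; the inputs come from the count, sum, and set checks in steps 2-4.

```python
def reconciliation_report(src_rows: int, tgt_rows: int, abs_sum_diff: float,
                          missing: set, extra: set,
                          sum_tolerance: float = 0.50) -> dict:
    """Combine completeness and accuracy checks into one pass/fail report."""
    checks = {
        "counts_match": src_rows == tgt_rows,
        "sums_within_tolerance": abs_sum_diff <= sum_tolerance,
        "no_missing_keys": len(missing) == 0,
        "no_extra_keys": len(extra) == 0,
    }
    return {
        "status": "PASS" if all(checks.values()) else "FAIL",
        "checks": checks,
        "diagnostics": {"missing": sorted(missing), "extra": sorted(extra),
                        "abs_sum_diff": abs_sum_diff},
    }

good = reconciliation_report(100, 100, 0.12, missing=set(), extra=set())
bad = reconciliation_report(100, 98, 0.12, missing={7, 9}, extra=set())
```

Keeping the failed check names and the offending keys in the output makes the alert actionable rather than a bare FAIL.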

Exercises

Work through these hands-on tasks; hints and solutions are provided. The same exercises are repeated in the Practice Exercises section below for easy copy/paste.

Exercise 1: Counts, totals, and set comparison

Tables: src.orders(order_id, customer_id, amount, status, sale_date) and dw.fct_orders(order_id, customer_id, amount, status, sale_date). Reconcile sale_date = 2025-12-01:

  • Compare COUNT(*) and SUM(amount) by the date partition.
  • Find order_ids missing in target and extra in target.
  • Apply a tolerance of 0.50 on the total amount difference.

Hints
  • Use GROUP BY sale_date and aggregate functions.
  • Use LEFT JOIN with IS NULL to find missing/extra keys.
  • Compute absolute difference between sums to apply tolerance.
Solution
-- Control totals
WITH s AS (
  SELECT COUNT(*) AS c, SUM(amount) AS s
  FROM src.orders WHERE sale_date = DATE '2025-12-01'
), t AS (
  SELECT COUNT(*) AS c, SUM(amount) AS s
  FROM dw.fct_orders WHERE sale_date = DATE '2025-12-01'
)
SELECT s.c AS src_rows, t.c AS tgt_rows,
       s.s AS src_sum, t.s AS tgt_sum,
       ABS(s.s - t.s) AS abs_diff
FROM s, t;

-- Missing in target
SELECT s.order_id
FROM src.orders s
LEFT JOIN dw.fct_orders t ON t.order_id = s.order_id
WHERE s.sale_date = DATE '2025-12-01' AND t.order_id IS NULL;

-- Extra in target
SELECT t.order_id
FROM dw.fct_orders t
LEFT JOIN src.orders s ON s.order_id = t.order_id
WHERE t.sale_date = DATE '2025-12-01' AND s.order_id IS NULL;

Decision: If abs_diff <= 0.50 and both missing/extra sets are empty, mark as PASS.

Exercise 2: Hash-diff accuracy check

Compute a hash over (customer_id, status, amount) per order_id in both tables and list mismatches.

Hints
  • Use a stable concatenation order and delimiter.
  • Cast numbers to strings consistently before hashing.
  • Join on order_id and compare hashes.
Solution
WITH s AS (
  SELECT order_id,
         MD5(CONCAT_WS('|', customer_id, status, CAST(amount AS VARCHAR))) AS h
  FROM src.orders
), t AS (
  SELECT order_id,
         MD5(CONCAT_WS('|', customer_id, status, CAST(amount AS VARCHAR))) AS h
  FROM dw.fct_orders
)
SELECT s.order_id
FROM s
JOIN t USING(order_id)
WHERE s.h <> t.h;

Investigate listed order_id records to align business rules (e.g., rounding/scale).

Self-check checklist

  • I compared counts and control totals for the intended partition/window.
  • I checked both missing and extra keys with set comparisons.
  • I validated field-level accuracy with a hash or explicit column comparisons.
  • I applied a reasonable tolerance and documented it.
  • I considered late-arriving data via a watermark.

Common mistakes and how to self-check

  • Comparing unbounded data: always reconcile within a clear time window or partition.
  • Ignoring null semantics: normalize nulls in hash inputs to avoid false diffs.
  • Hashing PII directly: exclude sensitive fields or use safe derived values.
  • Not reconciling deletes: ensure source deletes or end-dates are reflected downstream.
  • One-sided checks: check both missing and extra keys.
  • Overly strict tolerances: allow minor numeric differences from formatting or conversions.
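The null-semantics pitfall in the list above is easy to demonstrate: a naive str() encoding makes a NULL collide with the literal string "None", producing a silent false match, while a tagged encoding keeps them distinct. The encoding scheme here is illustrative; any unambiguous convention works as long as both sides use it.

```python
import hashlib

def naive_encode(v) -> str:
    return str(v)                        # None -> "None": collides with the string

def tagged_encode(v) -> str:
    return "\x00NULL" if v is None else "S:" + str(v)   # unambiguous

def h(encode, *cols) -> str:
    return hashlib.md5("|".join(encode(c) for c in cols).encode()).hexdigest()

collision = h(naive_encode, None) == h(naive_encode, "None")   # false match
distinct = h(tagged_encode, None) != h(tagged_encode, "None")  # correctly differ
```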

Practical projects

  • Build a daily reconciliation notebook that outputs pass/fail per table, partition, and totals.
  • Create a hash-diff job for top 5 critical tables, with a mismatch report per column.
  • Implement a CDC watermark reconciliation: run T-2h and T-24h checks and compare drift.

Next steps

  • Automate reconciliation in your orchestration tool as a post-load step.
  • Publish concise dashboards showing counts, totals, and mismatch trends.
  • Document tolerances and SLAs for each pipeline.

Mini challenge

Your finance fact table shows matching counts but a total amount higher in target by 1.2%. List three targeted queries you would run to isolate the cause, and explain your expected findings. Keep your answer focused and actionable.


Practice Exercises


Instructions

Tables: src.orders(order_id, customer_id, amount, status, sale_date) and dw.fct_orders(order_id, customer_id, amount, status, sale_date).

  1. Compare COUNT(*) and SUM(amount) for sale_date = 2025-12-01.
  2. List order_ids missing in target and extra in target.
  3. Apply tolerance: PASS if |src_sum - tgt_sum| <= 0.50 and no missing/extra keys.

Expected Output
A small report: src_rows, tgt_rows, src_sum, tgt_sum, abs_diff. Two optional lists: missing_in_target, extra_in_target. Decision: PASS/FAIL.

Reconciliation Source To Target — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

