Why this matters
Analytics Engineers must prove that what landed in the warehouse matches what left the source. Reconciliation tests catch data loss, duplication, and transformation errors before they impact dashboards, finance closes, and ML models.
- Verify nightly loads: row counts match, distinct keys match, and money totals align within a tolerance.
- Prove business metrics integrity after refactors or migrations.
- Triage incidents fast: pinpoint whether issues are in extraction, transformation, or loading.
Concept explained simply
Think of reconciliation as checking two ledgers: the source ledger and the target ledger. If you sum the same columns and count the same rows under the same rules, the totals should align. If not, you investigate.
Mental model
- Level 1 (Completeness): row counts and distinct key counts match for the same time window.
- Level 2 (Uniqueness & Integrity): no duplicates on keys; referential integrity holds; min/max dates align.
- Level 3 (Accuracy): aggregates (sums, averages) and selected field-level samples match within tolerances.
Core checks you can automate
- Row count equality, both overall and for a filtered window (e.g., the last load date).
- Distinct primary key counts match; zero duplicates on target keys.
- Aggregate comparisons: SUM, COUNT, MIN, MAX across mapped fields.
- Nullability checks: null rates per column stay within allowed thresholds (sketched after this list).
- Referential integrity: keys present in the target exist in the parent dimension/source (sketched after this list).
- Hash-based or sampled row comparisons for critical columns.
- Tolerances: define absolute and/or relative (percent) thresholds to handle rounding and currencies.
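The null-rate and referential-integrity checks are not covered by the worked examples below, so here is a minimal sketch against the example tables used later in this section; wh.dim_customer and the specific columns are assumptions to adapt to your schema.
-- Null rate for a single column (report only; compare against your allowed threshold)
SELECT COUNT(*) AS total_rows,
       SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS null_customer_id,
       SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) * 1.0 / NULLIF(COUNT(*), 0) AS null_rate
FROM wh.fact_orders
WHERE order_date = :d;
-- Referential integrity: fact rows whose customer_id has no parent in the dimension
SELECT f.order_id, f.customer_id
FROM wh.fact_orders f
LEFT JOIN wh.dim_customer d ON d.customer_id = f.customer_id
WHERE f.order_date = :d
  AND f.customer_id IS NOT NULL
  AND d.customer_id IS NULL;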
Worked examples
Example 1: Orders fact — counts, distinct keys, duplicates
Goal: Ensure that yesterday's orders in the source match the target fact table.
-- Parameters
-- :d = load date (e.g., 2025-03-15)
-- Row count by date
SELECT 'src' AS side, COUNT(*) AS cnt
FROM src.orders
WHERE order_date = :d
UNION ALL
SELECT 'tgt', COUNT(*)
FROM wh.fact_orders
WHERE order_date = :d;
-- Distinct keys
SELECT 'src' AS side, COUNT(DISTINCT order_id) AS dkeys
FROM src.orders WHERE order_date = :d
UNION ALL
SELECT 'tgt', COUNT(DISTINCT order_id)
FROM wh.fact_orders WHERE order_date = :d;
-- Duplicates on target key
SELECT order_id, COUNT(*) AS c
FROM wh.fact_orders
WHERE order_date = :d
GROUP BY order_id
HAVING COUNT(*) > 1;
Expected: counts equal; distinct keys equal; zero duplicate rows.
Example 2: Amount totals with tolerance
Goal: Validate that financial totals match, allowing small rounding differences.
-- Compare SUM(amount) with relative tolerance 0.1% (0.001)
-- COALESCE guards against NULL sums when one side has no rows
WITH s AS (
SELECT COALESCE(SUM(amount), 0) AS s_sum
FROM src.orders WHERE order_date = :d
), t AS (
SELECT COALESCE(SUM(amount_usd), 0) AS t_sum
FROM wh.fact_orders WHERE order_date = :d
)
SELECT CASE
WHEN s.s_sum = 0 AND t.t_sum = 0 THEN 'PASS'
WHEN s.s_sum = 0 THEN 'FAIL: zero source, non-zero target'
ELSE CASE WHEN ABS(t.t_sum - s.s_sum) / NULLIF(ABS(s.s_sum), 0) <= 0.001
THEN 'PASS' ELSE 'FAIL' END
END AS result,
s.s_sum AS src_sum,
t.t_sum AS tgt_sum
FROM s CROSS JOIN t;
Tip: Use both absolute and relative thresholds for stability on very small sums.
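One way to combine the two is to pass when either threshold is met; a sketch where the 1.00 absolute floor and 0.1% relative limit are arbitrary example values:
-- PASS if the difference is within 1.00 absolutely OR within 0.1% of the source sum
WITH s AS (SELECT COALESCE(SUM(amount), 0) AS s_sum FROM src.orders WHERE order_date = :d),
t AS (SELECT COALESCE(SUM(amount_usd), 0) AS t_sum FROM wh.fact_orders WHERE order_date = :d)
SELECT s.s_sum AS src_sum, t.t_sum AS tgt_sum,
CASE WHEN ABS(t.t_sum - s.s_sum) <= 1.00
OR ABS(t.t_sum - s.s_sum) <= 0.001 * ABS(s.s_sum)
THEN 'PASS' ELSE 'FAIL' END AS result
FROM s CROSS JOIN t;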
Example 3: Hash-based sample comparison
Goal: Detect silent field-level mismatches on a sample of rows.
-- Replace MD5 with your warehouse hash function
WITH src_sample AS (
SELECT order_id,
MD5(CONCAT_WS('|', COALESCE(customer_id, ''), COALESCE(status, ''), COALESCE(CAST(amount AS VARCHAR), ''))) AS h
FROM src.orders
WHERE order_date = :d
ORDER BY order_id
LIMIT 1000
), tgt_sample AS (
SELECT order_id,
MD5(CONCAT_WS('|', COALESCE(customer_id, ''), COALESCE(status, ''), COALESCE(CAST(amount_usd AS VARCHAR), ''))) AS h
FROM wh.fact_orders
WHERE order_date = :d
ORDER BY order_id
LIMIT 1000
)
-- COALESCE keeps the key visible when a row exists on only one side
SELECT COALESCE(s.order_id, t.order_id) AS order_id
FROM src_sample s
FULL OUTER JOIN tgt_sample t
ON s.order_id = t.order_id
WHERE s.h IS DISTINCT FROM t.h;
Expected: zero mismatched hashes for sampled rows.
Step-by-step: design a reconciliation suite
- Define scope: time window (e.g., last load date) and business keys.
- Map columns: source to target, including transformations (e.g., currency, trimming, type casts).
- Write Level 1 checks: row counts, distinct key counts.
- Write Level 2 checks: duplicates, null thresholds, min/max date ranges, referential integrity.
- Write Level 3 checks: aggregates with tolerances; sampled or hash comparisons on critical fields.
- Parameterize: use variables (like :d) so queries run for any load date or batch id.
- Record results: write pass/fail and metrics into an audit table for trend monitoring (see the sketch after this list).
- Triage playbook: on failure, identify step (extract/transform/load) and affected dimensions/facts.
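A minimal audit table could look like the following sketch; the wh.recon_audit name, columns, and types are assumptions to adapt to your warehouse:
-- Audit table for reconciliation results
CREATE TABLE IF NOT EXISTS wh.recon_audit (
  run_ts TIMESTAMP,
  load_date DATE,
  check_name VARCHAR(100),
  src_value DECIMAL(18,2),
  tgt_value DECIMAL(18,2),
  passed BOOLEAN
);
-- Example: record the row-count check for one load date
INSERT INTO wh.recon_audit (run_ts, load_date, check_name, src_value, tgt_value, passed)
SELECT CURRENT_TIMESTAMP, :d, 'row_count', s.cnt, t.cnt, s.cnt = t.cnt
FROM (SELECT COUNT(*) AS cnt FROM src.orders WHERE order_date = :d) s
CROSS JOIN (SELECT COUNT(*) AS cnt FROM wh.fact_orders WHERE order_date = :d) t;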
Hands-on exercises
Do these to build muscle memory. Reproduce the results in your own environment. A solution follows each exercise.
Exercise 1 — Daily Orders reconciliation (counts, keys, sums)
Tables:
- Source: src.orders(order_id, order_date, customer_id, status, amount)
- Target: wh.fact_orders(order_id, order_date, customer_id, status, amount_usd)
Task: For a given :d (load date), write queries to:
- Compare row counts and distinct order_id counts.
- Find duplicates on target order_id.
- Compare SUM(amount) vs SUM(amount_usd) with relative tolerance 0.1%.
Solution
-- Counts
SELECT 'src' AS side, COUNT(*) AS cnt FROM src.orders WHERE order_date = :d
UNION ALL
SELECT 'tgt', COUNT(*) FROM wh.fact_orders WHERE order_date = :d;
-- Distinct keys
SELECT 'src' AS side, COUNT(DISTINCT order_id) AS dkeys FROM src.orders WHERE order_date = :d
UNION ALL
SELECT 'tgt', COUNT(DISTINCT order_id) FROM wh.fact_orders WHERE order_date = :d;
-- Duplicates on target
SELECT order_id, COUNT(*) AS c
FROM wh.fact_orders WHERE order_date = :d
GROUP BY order_id HAVING COUNT(*) > 1;
-- Sums with tolerance 0.1%
WITH s AS (SELECT COALESCE(SUM(amount), 0) AS s_sum FROM src.orders WHERE order_date = :d),
t AS (SELECT COALESCE(SUM(amount_usd), 0) AS t_sum FROM wh.fact_orders WHERE order_date = :d)
SELECT s.s_sum, t.t_sum,
CASE WHEN s.s_sum = 0 AND t.t_sum = 0 THEN 'PASS'
WHEN s.s_sum = 0 THEN 'FAIL'
WHEN ABS(t.t_sum - s.s_sum) / NULLIF(ABS(s.s_sum), 0) <= 0.001 THEN 'PASS'
ELSE 'FAIL' END AS result
FROM s CROSS JOIN t;
Exercise 2 — Hash-based sample check
Task: Compare 1000 sampled rows by hashing key fields to detect field-level mismatches.
- Pick fields: customer_id, status, amount/amount_usd.
- Use a hash function available in your warehouse (e.g., MD5).
- Return order_id values that mismatch.
Solution
WITH src_sample AS (
SELECT order_id,
MD5(CONCAT_WS('|', COALESCE(customer_id, ''), COALESCE(status, ''), COALESCE(CAST(amount AS VARCHAR), ''))) AS h
FROM src.orders
WHERE order_date = :d
ORDER BY order_id
LIMIT 1000
), tgt_sample AS (
SELECT order_id,
MD5(CONCAT_WS('|', COALESCE(customer_id, ''), COALESCE(status, ''), COALESCE(CAST(amount_usd AS VARCHAR), ''))) AS h
FROM wh.fact_orders
WHERE order_date = :d
ORDER BY order_id
LIMIT 1000
)
-- COALESCE keeps the key visible when a row exists on only one side
SELECT COALESCE(s.order_id, t.order_id) AS order_id
FROM src_sample s
FULL OUTER JOIN tgt_sample t
ON s.order_id = t.order_id
WHERE s.h IS DISTINCT FROM t.h;
Exercise checklist
- Counts compared for same window.
- Distinct key counts compared.
- Duplicate check written on the target.
- Aggregate comparison uses tolerance (absolute or relative).
- Sample/hash check returns IDs when mismatched.
Common mistakes and how to self-check
- Comparing different windows (e.g., source by event date vs target by load date). Self-check: print min/max dates on both sides of the same filter (see the sketch after this list).
- Ignoring duplicates. Self-check: always run a GROUP BY key HAVING COUNT(*) > 1 on the target.
- No tolerance on money totals. Self-check: define both absolute and relative thresholds.
- Hashing NULLs inconsistently. Self-check: coalesce nulls to a standard sentinel before hashing.
- Forgetting referential integrity. Self-check: anti-join fact foreign keys to dimension keys.
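For the window mistake, the self-check is one query per side; a sketch against the example tables:
-- Confirm both sides cover the same window before comparing anything else
SELECT 'src' AS side, MIN(order_date) AS min_d, MAX(order_date) AS max_d, COUNT(*) AS cnt
FROM src.orders WHERE order_date = :d
UNION ALL
SELECT 'tgt', MIN(order_date), MAX(order_date), COUNT(*)
FROM wh.fact_orders WHERE order_date = :d;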
Practical projects
- Build a reconciliation pack for your top 3 fact tables: counts, keys, sums, and hash samples.
- Create an audit table that stores metric values (counts, sums, null rates) per run, with pass/fail and timestamps.
- Migrate one pipeline and prove parity: run parallel loads for a week and compare daily deltas.
Who this is for
- Analytics Engineers and BI Developers responsible for reliable transformations.
- Data Engineers building extraction/loading jobs who need validation.
Prerequisites
- Comfort with SQL aggregations and joins.
- Understanding of business keys and time windows.
- Basic knowledge of your warehouse functions (hashing, casting).
Learning path
- Start with Level 1 checks (counts, keys) for one table.
- Add Level 2 integrity checks (duplicates, nulls, min/max dates, referential integrity).
- Implement Level 3 accuracy checks (aggregates with tolerance; sampled/hash comparisons).
- Parameterize and schedule checks; write results to an audit table.
- Expand coverage to more tables and critical metrics.
Next steps
- Introduce freshness and completeness monitors for upstream sources (a minimal freshness check is sketched after this list).
- Add downstream BI validation: totals on dashboards equal warehouse totals.
- Document your reconciliation SLA and triage playbook.
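A freshness monitor can start very small; this sketch assumes the target carries a loaded_at timestamp and uses a 26-hour SLA as an example (adjust the interval syntax to your warehouse):
-- Flag the feed as stale if nothing has landed within the assumed 26-hour SLA
SELECT MAX(loaded_at) AS last_load,
CASE WHEN MAX(loaded_at) >= CURRENT_TIMESTAMP - INTERVAL '26' HOUR
THEN 'FRESH' ELSE 'STALE' END AS status
FROM wh.fact_orders;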
Mini challenge
You migrated billing data to a new model. Yesterday: source had 50,120 rows and target had 50,190 rows. Distinct invoice_id matches at 50,120 on both sides. SUM(amount) differs by 0.04%.
- What check reveals the discrepancy type?
- Which follow-up query isolates the cause?
- What tolerance would you set for the amount comparison and why?
Suggested approach
A row count vs distinct key mismatch points to duplicates on the target: 50,190 rows but only 50,120 distinct invoice_id values means 70 duplicated rows. Isolate them with a GROUP BY invoice_id HAVING COUNT(*) > 1 query on the target. A 0.04% sum difference would pass a 0.1% relative tolerance (backed by an absolute threshold), but re-run the sum comparison after removing the duplicates, since they are the likely cause of the gap.
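The follow-up duplicate query could look like this; wh.fact_invoices and invoice_date are assumed names for the new billing model:
-- Each invoice_id returned here accounts for some of the 70 extra target rows
SELECT invoice_id, COUNT(*) AS c
FROM wh.fact_invoices
WHERE invoice_date = :d
GROUP BY invoice_id
HAVING COUNT(*) > 1
ORDER BY c DESC;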
Quick test
Take the quick test below to check your understanding.