Who this is for
Analytics Engineers, BI Developers, and Data Analysts who build or maintain data pipelines and need reliable, quick checks to catch data loss, duplication, and mismatches between systems.
Prerequisites
- Basic SQL (SELECT, GROUP BY, COUNT, SUM)
- Understanding of data pipelines (ingest → stage → transform → publish)
- Familiarity with incremental loads and partitions/dates
Why this matters
Row count and reconciliation checks are the fastest way to detect broken loads, duplicates, or missing records. In real projects, you will:
- Verify daily loads from source systems reached the warehouse
- Confirm that transformations didn’t drop or double rows
- Reconcile key metrics (orders, revenue, users) between systems
- Build early-warning monitors that catch issues before stakeholders do
Concept explained simply
Row count checks compare the number of rows between two steps in your pipeline (e.g., raw vs. staging) for the same time window. Reconciliation checks go further: they compare counts or sums between source-of-truth and destination for the same filters and grain (e.g., number of orders and total amount per day in source vs. warehouse).
Mental model
Think of your pipeline as a river with gates. Row counts are flow gauges at each gate: if the volume suddenly drops or spikes, something is wrong. Reconciliation is matching ledgers: both sides must agree on totals for the same period and rules.
Typical places to put checks
- After ingestion: source system vs. raw tables
- After staging transforms: raw vs. staging
- After modeling: staging vs. marts
- Between systems: source app/exports vs. warehouse
Worked examples
Example 1: Daily row count raw → staging
Goal: Ensure yesterday’s rows made it from raw to staging within a small tolerance.
-- Parameter: run_date = '2025-05-10'
WITH raw_counts AS (
  SELECT COUNT(*) AS c FROM raw.orders WHERE order_date = '2025-05-10'
), stg_counts AS (
  SELECT COUNT(*) AS c FROM stg.orders WHERE order_date = '2025-05-10'
)
SELECT r.c AS raw_c, s.c AS stg_c,
       ABS(r.c - s.c) AS diff,
       CASE WHEN s.c = 0 THEN NULL ELSE ABS(r.c - s.c) / s.c::decimal END AS pct_diff
FROM raw_counts r CROSS JOIN stg_counts s;

Interpretation: If diff <= 5 rows or pct_diff <= 1%, mark as OK. Otherwise, investigate upstream ingestion or filters.
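The same query can produce an automated verdict instead of raw numbers. A minimal sketch; the 5-row and 1% thresholds are just this example's assumed tolerances:

-- Same CTEs as above, with the tolerance rule folded into a pass/fail status
WITH raw_counts AS (
  SELECT COUNT(*) AS c FROM raw.orders WHERE order_date = '2025-05-10'
), stg_counts AS (
  SELECT COUNT(*) AS c FROM stg.orders WHERE order_date = '2025-05-10'
)
SELECT r.c AS raw_c, s.c AS stg_c,
       CASE
         WHEN ABS(r.c - s.c) <= 5
           OR ABS(r.c - s.c) <= 0.01 * NULLIF(s.c, 0)
         THEN 'PASS' ELSE 'FAIL'
       END AS status
FROM raw_counts r CROSS JOIN stg_counts s;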
Example 2: Detect duplicates via reconciliation (unique business key)
Goal: Ensure one order_id corresponds to one row post-transform.
SELECT order_id, COUNT(*) AS n
FROM mart.orders
WHERE order_date BETWEEN '2025-05-01' AND '2025-05-10'
GROUP BY order_id
HAVING COUNT(*) > 1;

No rows returned = good. If results appear, investigate join logic, incremental logic, or dedupe steps.
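If duplicates do appear, one common remediation in the transform layer is to keep a single row per business key. A hedged sketch using ROW_NUMBER(); the updated_at tiebreaker column is an assumption about your table:

-- Keep the most recent row per order_id (assumes an updated_at column exists)
SELECT *
FROM (
  SELECT o.*,
         ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) AS rn
  FROM mart.orders o
) t
WHERE rn = 1;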
Example 3: Amount reconciliation source → mart per day
Goal: Compare source totals vs. warehouse totals per day.
WITH src AS (
  SELECT order_date, COUNT(*) AS orders_src, SUM(total_amount) AS amt_src
  FROM src_export.orders
  WHERE order_date BETWEEN '2025-05-01' AND '2025-05-10'
  GROUP BY order_date
), wh AS (
  SELECT order_date, COUNT(*) AS orders_wh, SUM(total_amount) AS amt_wh
  FROM mart.orders
  WHERE order_date BETWEEN '2025-05-01' AND '2025-05-10'
  GROUP BY order_date
)
SELECT COALESCE(s.order_date, w.order_date) AS order_date,
       s.orders_src, w.orders_wh,
       s.amt_src, w.amt_wh,
       (w.orders_wh - s.orders_src) AS diff_orders,
       (w.amt_wh - s.amt_src) AS diff_amt
FROM src s
FULL OUTER JOIN wh w USING (order_date)
ORDER BY 1;

Interpretation: Expect near-zero differences after accounting for late-arriving records and currency/rounding rules.
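To make the rounding tolerance explicit, the final SELECT in Example 3 can be swapped for a filter that reports only material differences. A sketch; the 0.5% band is an assumed threshold:

-- Report missing days and days whose amount difference exceeds 0.5% of source
SELECT COALESCE(s.order_date, w.order_date) AS order_date,
       s.amt_src, w.amt_wh,
       (w.amt_wh - s.amt_src) AS diff_amt
FROM src s
FULL OUTER JOIN wh w USING (order_date)
WHERE s.order_date IS NULL
   OR w.order_date IS NULL
   OR ABS(w.amt_wh - s.amt_src) > 0.005 * ABS(s.amt_src)
ORDER BY 1;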
How to implement reliably
- Pick the grain and window: e.g., by order_date for yesterday.
- Align filters exactly: same date logic, status flags, soft-deletes.
- Decide tolerance: absolute rows (±5) and/or percentage (±1%).
- Automate: schedule queries and alert on threshold breaches.
- Log results: store check outcomes (date, counts, diff, pass/fail); a sketch of the logging step follows this list.
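A minimal sketch of the logging step, assuming a Postgres-style warehouse; the dq schema, table, and column names are illustrative, not a standard:

CREATE TABLE IF NOT EXISTS dq.check_results (
  check_name   text        NOT NULL,
  run_date     date        NOT NULL,
  side_a_count bigint,
  side_b_count bigint,
  diff         bigint,
  passed       boolean,
  checked_at   timestamptz DEFAULT now()
);

-- Each scheduled run inserts one outcome row, e.g. for Example 1:
WITH raw_counts AS (
  SELECT COUNT(*) AS c FROM raw.orders WHERE order_date = DATE '2025-05-10'
), stg_counts AS (
  SELECT COUNT(*) AS c FROM stg.orders WHERE order_date = DATE '2025-05-10'
)
INSERT INTO dq.check_results (check_name, run_date, side_a_count, side_b_count, diff, passed)
SELECT 'raw_vs_stg_orders', DATE '2025-05-10', r.c, s.c,
       ABS(r.c - s.c),
       COALESCE(ABS(r.c - s.c) <= 5
                OR ABS(r.c - s.c) <= 0.01 * NULLIF(s.c, 0), false)
FROM raw_counts r CROSS JOIN stg_counts s;

Storing one row per check per day makes the dashboard and alerting steps a simple query over this table.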
Useful variations
- Partition-level counts (per day/hour)
- Key-set reconciliation (list of IDs present on one side but not the other; see the sketch after this list)
- Metric reconciliation (counts + sums + averages)
- Null-rate checks on critical columns (also sketched below)
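A hedged sketch of the key-set and null-rate variations, assuming order_id is the business key on both sides; EXCEPT is widely supported (some engines spell it MINUS):

-- Key-set reconciliation: source IDs that never reached the mart
SELECT order_id FROM src_export.orders
WHERE order_date BETWEEN '2025-05-01' AND '2025-05-10'
EXCEPT
SELECT order_id FROM mart.orders
WHERE order_date BETWEEN '2025-05-01' AND '2025-05-10';
-- (Swap the two queries to find unexpected extras in the mart.)

-- Null-rate check on a critical column (customer_id is illustrative)
SELECT AVG(CASE WHEN customer_id IS NULL THEN 1.0 ELSE 0.0 END) AS null_rate
FROM mart.orders
WHERE order_date BETWEEN '2025-05-01' AND '2025-05-10';

Key-set checks pinpoint which records are missing, which count comparisons alone cannot do.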
Exercises
Do these now.
- Exercise ex1: Daily row count comparison raw → staging for a given date. See details below.
- Exercise ex2: Reconcile order totals (count and sum) per day between source and mart for a week. See details below.
- Checklist before running:
- Same date range on both sides
- Same status filters (e.g., exclude canceled if excluded elsewhere)
- Account for soft deletes (is_deleted = false)
- Decide a pass threshold
Common mistakes and how to self-check
- Mismatched filters: Different WHERE clauses create false alarms. Self-check by printing both sides’ WHERE conditions side-by-side.
- Timezone drift: Source in UTC, warehouse in local time. Self-check by reconciling on both UTC and local windows; see the sketch after this list.
- Late-arriving data: Yesterday’s data may trickle today. Use a two-day rolling window and tolerances.
- Soft deletes ignored: Excluding is_deleted rows on one side only. Standardize the rule.
- Double counting after joins: Many-to-many joins inflate counts. Validate uniqueness of keys before joining.
- Partial loads: Upstream job failed mid-run. Compare partition counts to historical medians to spot anomalies; a median-based sketch follows the self-check template below.
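A minimal sketch of the timezone self-check, assuming Postgres-style AT TIME ZONE and FILTER syntax, and a created_at timestamptz column (all assumptions about your stack):

-- Count the same business day in UTC and local windows; a material gap
-- means the reconciliation window itself is the suspect
SELECT
  COUNT(*) FILTER (WHERE (created_at AT TIME ZONE 'UTC')::date = DATE '2025-05-10') AS utc_day_count,
  COUNT(*) FILTER (WHERE (created_at AT TIME ZONE 'America/New_York')::date = DATE '2025-05-10') AS local_day_count
FROM src_export.orders
WHERE created_at >= TIMESTAMPTZ '2025-05-09 00:00:00+00'
  AND created_at <  TIMESTAMPTZ '2025-05-12 00:00:00+00';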
Self-check template
-- 1) Validate filters match
/* Paste both WHERE clauses here */
-- 2) Compare partition coverage
SELECT order_date, COUNT(*) FROM <table> GROUP BY 1 ORDER BY 1;
-- 3) Sanity vs history
SELECT order_date, c,
       c::decimal / NULLIF(LAG(c) OVER (ORDER BY order_date), 0) AS ratio_prev
FROM (
  SELECT order_date, COUNT(*) AS c FROM <table> GROUP BY 1
) t
ORDER BY order_date;
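For the historical-median comparison mentioned under common mistakes, PERCENTILE_CONT works in Postgres and several warehouses; a hedged sketch with an assumed ±50% band:

-- 4) Flag days whose row count strays far from the overall median
WITH daily AS (
  SELECT order_date, COUNT(*) AS c FROM <table> GROUP BY 1
), med AS (
  SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY c) AS median_c FROM daily
)
SELECT d.order_date, d.c, m.median_c
FROM daily d CROSS JOIN med m
WHERE d.c < 0.5 * m.median_c OR d.c > 1.5 * m.median_c
ORDER BY d.order_date;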
Practical projects
- Build a daily data quality dashboard: Store results from 3 checks (row count raw→staging, dedupe check, amount reconciliation) and visualize pass/fail per day.
- Create a reconciliation playbook: A parameterized SQL template that takes table names, date, and tolerance and returns a pass/fail row; a sketch follows this list.
- Late-arrival tolerant monitor: A job that rechecks the last 3 days and marks a check green once differences fall under tolerance.
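A hedged sketch of such a template; :run_date and :tolerance_pct are placeholder bind parameters that your scheduler or templating tool would supply, and the table names would be templated in the same way:

-- Returns exactly one row: check name, both counts, and a verdict
WITH a AS (
  SELECT COUNT(*) AS c FROM raw.orders WHERE order_date = :run_date
), b AS (
  SELECT COUNT(*) AS c FROM stg.orders WHERE order_date = :run_date
)
SELECT 'raw_vs_stg_orders' AS check_name,
       :run_date AS run_date,
       a.c AS count_a,
       b.c AS count_b,
       CASE WHEN ABS(a.c - b.c) <= :tolerance_pct * NULLIF(b.c, 0)
            THEN 'PASS' ELSE 'FAIL' END AS status
FROM a CROSS JOIN b;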
Learning path
- Start with row count checks on one pipeline.
- Add key-set comparisons (missing IDs) for critical tables.
- Include metric reconciliations (sums, averages).
- Automate scheduling and alerting.
- Centralize results into a quality report.
Next steps
- Parameterize your queries (date, table names, tolerances).
- Add schema drift checks (column names and types).
- Combine with freshness checks (max timestamp); see the sketch below.
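A minimal freshness sketch, assuming a loaded_at timestamp column and a daily load cadence (both assumptions about your tables):

-- Alert if the newest row is older than the expected cadence plus slack
SELECT MAX(loaded_at) AS last_load,
       now() - MAX(loaded_at) AS staleness,
       (now() - MAX(loaded_at)) > INTERVAL '26 hours' AS is_stale
FROM stg.orders;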
Mini challenge
Pick a critical metric (e.g., orders, revenue). Implement both a row count check and a per-day reconciliation for the last 7 days. Set a tolerance, run it daily for a week, and record outcomes.