Who this is for
Analytics Engineers, BI Developers, and Data Analysts who build or maintain data pipelines and need reliable, quick checks to catch data loss, duplication, and mismatches between systems.
Prerequisites
- Basic SQL (SELECT, GROUP BY, COUNT, SUM)
- Understanding of data pipelines (ingest → stage → transform → publish)
- Familiarity with incremental loads and partitions/dates
Why this matters
Row count and reconciliation checks are the fastest way to detect broken loads, duplicates, or missing records. In real projects, you will:
- Verify daily loads from source systems reached the warehouse
- Confirm that transformations didn’t drop or double rows
- Reconcile key metrics (orders, revenue, users) between systems
- Build early-warning monitors that catch issues before stakeholders do
Concept explained simply
Row count checks compare the number of rows between two steps in your pipeline (e.g., raw vs. staging) for the same time window. Reconciliation checks go further: they compare counts or sums between source-of-truth and destination for the same filters and grain (e.g., number of orders and total amount per day in source vs. warehouse).
Mental model
Think of your pipeline as a river with gates. Row counts are flow gauges at each gate: if the volume suddenly drops or spikes, something is wrong. Reconciliation is matching ledgers: both sides must agree on totals for the same period and rules.
Typical places to put checks
- After ingestion: source system vs. raw tables
- After staging transforms: raw vs. staging
- After modeling: staging vs. marts
- Between systems: source app/exports vs. warehouse
Worked examples
Example 1: Daily row count raw → staging
Goal: Ensure yesterday’s rows made it from raw to staging within a small tolerance.
-- Parameter: run_date = '2025-05-10'
WITH raw_counts AS (
  SELECT COUNT(*) AS c FROM raw.orders WHERE order_date = '2025-05-10'
), stg_counts AS (
  SELECT COUNT(*) AS c FROM stg.orders WHERE order_date = '2025-05-10'
)
SELECT r.c AS raw_c, s.c AS stg_c,
       ABS(r.c - s.c) AS diff,
       CASE WHEN s.c = 0 THEN NULL ELSE ABS(r.c - s.c) / s.c::decimal END AS pct_diff
FROM raw_counts r CROSS JOIN stg_counts s;

Interpretation: If diff <= 5 rows or pct_diff <= 1%, mark as OK. Otherwise, investigate upstream ingestion or filters.
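The same query can produce an automated verdict instead of raw numbers. A minimal sketch; the 5-row and 1% thresholds are just this example's assumed tolerances:

-- Same CTEs as above, with the tolerance rule folded into a pass/fail status
WITH raw_counts AS (
  SELECT COUNT(*) AS c FROM raw.orders WHERE order_date = '2025-05-10'
), stg_counts AS (
  SELECT COUNT(*) AS c FROM stg.orders WHERE order_date = '2025-05-10'
)
SELECT r.c AS raw_c, s.c AS stg_c,
       CASE
         WHEN ABS(r.c - s.c) <= 5
           OR ABS(r.c - s.c) <= 0.01 * NULLIF(s.c, 0)
         THEN 'PASS' ELSE 'FAIL'
       END AS status
FROM raw_counts r CROSS JOIN stg_counts s;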
Example 2: Detect duplicates via reconciliation (unique business key)
Goal: Ensure one order_id corresponds to one row post-transform.
SELECT order_id, COUNT(*) AS n
FROM mart.orders
WHERE order_date BETWEEN '2025-05-01' AND '2025-05-10'
GROUP BY order_id
HAVING COUNT(*) > 1;

No rows returned = good. If results appear, investigate join logic, incremental logic, or dedupe steps.
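If duplicates do appear, one common remediation in the transform layer is to keep a single row per business key. A hedged sketch using ROW_NUMBER(); the updated_at tiebreaker column is an assumption about your table:

-- Keep the most recent row per order_id (assumes an updated_at column exists)
SELECT *
FROM (
  SELECT o.*,
         ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) AS rn
  FROM mart.orders o
) t
WHERE rn = 1;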
Example 3: Amount reconciliation source → mart per day
Goal: Compare source totals vs. warehouse totals per day.
WITH src AS (
  SELECT order_date, COUNT(*) AS orders_src, SUM(total_amount) AS amt_src
  FROM src_export.orders
  WHERE order_date BETWEEN '2025-05-01' AND '2025-05-10'
  GROUP BY order_date
), wh AS (
  SELECT order_date, COUNT(*) AS orders_wh, SUM(total_amount) AS amt_wh
  FROM mart.orders
  WHERE order_date BETWEEN '2025-05-01' AND '2025-05-10'
  GROUP BY order_date
)
SELECT COALESCE(s.order_date, w.order_date) AS order_date,
       s.orders_src, w.orders_wh,
       s.amt_src, w.amt_wh,
       (w.orders_wh - s.orders_src) AS diff_orders,
       (w.amt_wh - s.amt_src) AS diff_amt
FROM src s
FULL OUTER JOIN wh w USING (order_date)
ORDER BY 1;

Interpretation: Expect near-zero differences after accounting for late-arriving records and currency/rounding rules.
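To make the rounding tolerance explicit, the final SELECT in Example 3 can be swapped for a filter that reports only material differences. A sketch; the 0.5% band is an assumed threshold:

-- Report missing days and days whose amount difference exceeds 0.5% of source
SELECT COALESCE(s.order_date, w.order_date) AS order_date,
       s.amt_src, w.amt_wh,
       (w.amt_wh - s.amt_src) AS diff_amt
FROM src s
FULL OUTER JOIN wh w USING (order_date)
WHERE s.order_date IS NULL
   OR w.order_date IS NULL
   OR ABS(w.amt_wh - s.amt_src) > 0.005 * ABS(s.amt_src)
ORDER BY 1;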
How to implement reliably
- Pick the grain and window: e.g., by order_date for yesterday.
- Align filters exactly: same date logic, status flags, soft-deletes.
- Decide tolerance: absolute rows (±5) and/or percentage (±1%).
- Automate: schedule queries and alert on threshold breaches.
- Log results: store check outcomes (date, counts, diff, pass/fail); a sketch of the logging step follows this list.
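A minimal sketch of the logging step, assuming a Postgres-style warehouse; the dq schema, table, and column names are illustrative, not a standard:

CREATE TABLE IF NOT EXISTS dq.check_results (
  check_name   text        NOT NULL,
  run_date     date        NOT NULL,
  side_a_count bigint,
  side_b_count bigint,
  diff         bigint,
  passed       boolean,
  checked_at   timestamptz DEFAULT now()
);

-- Each scheduled run inserts one outcome row, e.g. for Example 1:
WITH raw_counts AS (
  SELECT COUNT(*) AS c FROM raw.orders WHERE order_date = DATE '2025-05-10'
), stg_counts AS (
  SELECT COUNT(*) AS c FROM stg.orders WHERE order_date = DATE '2025-05-10'
)
INSERT INTO dq.check_results (check_name, run_date, side_a_count, side_b_count, diff, passed)
SELECT 'raw_vs_stg_orders', DATE '2025-05-10', r.c, s.c,
       ABS(r.c - s.c),
       COALESCE(ABS(r.c - s.c) <= 5
                OR ABS(r.c - s.c) <= 0.01 * NULLIF(s.c, 0), false)
FROM raw_counts r CROSS JOIN stg_counts s;

Storing one row per check per day makes the dashboard and alerting steps a simple query over this table.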
Useful variations
- Partition-level counts (per day/hour)
- Key-set reconciliation (list of IDs present on one side but not the other; see the sketch after this list)
- Metric reconciliation (counts + sums + averages)
- Null-rate checks on critical columns (also sketched below)
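A hedged sketch of the key-set and null-rate variations, assuming order_id is the business key on both sides; EXCEPT is widely supported (some engines spell it MINUS):

-- Key-set reconciliation: source IDs that never reached the mart
SELECT order_id FROM src_export.orders
WHERE order_date BETWEEN '2025-05-01' AND '2025-05-10'
EXCEPT
SELECT order_id FROM mart.orders
WHERE order_date BETWEEN '2025-05-01' AND '2025-05-10';
-- (Swap the two queries to find unexpected extras in the mart.)

-- Null-rate check on a critical column (customer_id is illustrative)
SELECT AVG(CASE WHEN customer_id IS NULL THEN 1.0 ELSE 0.0 END) AS null_rate
FROM mart.orders
WHERE order_date BETWEEN '2025-05-01' AND '2025-05-10';

Key-set checks pinpoint which records are missing, which count comparisons alone cannot do.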
Exercises
Do these now.
- Exercise ex1: Daily row count comparison raw → staging for a given date. See details below.
- Exercise ex2: Reconcile order totals (count and sum) per day between source and mart for a week. See details below.
- Checklist before running:
- Same date range on both sides
- Same status filters (e.g., exclude canceled if excluded elsewhere)
- Account for soft deletes (is_deleted = false)
- Decide a pass threshold
Common mistakes and how to self-check
- Mismatched filters: Different WHERE clauses create false alarms. Self-check by printing both sides’ WHERE conditions side-by-side.
- Timezone drift: Source in UTC, warehouse in local time. Self-check by reconciling on both UTC and local windows; see the sketch after this list.
- Late-arriving data: Yesterday’s data may trickle today. Use a two-day rolling window and tolerances.
- Soft deletes ignored: Excluding is_deleted rows on one side only. Standardize the rule.
- Double counting after joins: Many-to-many joins inflate counts. Validate uniqueness of keys before joining.
- Partial loads: Upstream job failed mid-run. Compare partition counts to historical medians to spot anomalies; a median-based sketch follows the self-check template below.
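A minimal sketch of the timezone self-check, assuming Postgres-style AT TIME ZONE and FILTER syntax, and a created_at timestamptz column (all assumptions about your stack):

-- Count the same business day in UTC and local windows; a material gap
-- means the reconciliation window itself is the suspect
SELECT
  COUNT(*) FILTER (WHERE (created_at AT TIME ZONE 'UTC')::date = DATE '2025-05-10') AS utc_day_count,
  COUNT(*) FILTER (WHERE (created_at AT TIME ZONE 'America/New_York')::date = DATE '2025-05-10') AS local_day_count
FROM src_export.orders
WHERE created_at >= TIMESTAMPTZ '2025-05-09 00:00:00+00'
  AND created_at <  TIMESTAMPTZ '2025-05-12 00:00:00+00';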
Self-check template
-- 1) Validate filters match
/* Paste both WHERE clauses here */
-- 2) Compare partition coverage
SELECT order_date, COUNT(*) FROM <table> GROUP BY 1 ORDER BY 1;
-- 3) Sanity vs history
SELECT order_date, c,
       c::decimal / NULLIF(LAG(c) OVER (ORDER BY order_date), 0) AS ratio_prev
FROM (
  SELECT order_date, COUNT(*) AS c FROM <table> GROUP BY 1
) t
ORDER BY order_date;
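For the historical-median comparison mentioned under common mistakes, PERCENTILE_CONT works in Postgres and several warehouses; a hedged sketch with an assumed ±50% band:

-- 4) Flag days whose row count strays far from the overall median
WITH daily AS (
  SELECT order_date, COUNT(*) AS c FROM <table> GROUP BY 1
), med AS (
  SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY c) AS median_c FROM daily
)
SELECT d.order_date, d.c, m.median_c
FROM daily d CROSS JOIN med m
WHERE d.c < 0.5 * m.median_c OR d.c > 1.5 * m.median_c
ORDER BY d.order_date;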
Practical projects
- Build a daily data quality dashboard: Store results from 3 checks (row count raw→staging, dedupe check, amount reconciliation) and visualize pass/fail per day.
- Create a reconciliation playbook: A parameterized SQL template that takes table names, date, and tolerance and returns a pass/fail row; a sketch follows this list.
- Late-arrival tolerant monitor: A job that rechecks the last 3 days and marks a check green once differences fall under tolerance.
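A hedged sketch of such a template; :run_date and :tolerance_pct are placeholder bind parameters that your scheduler or templating tool would supply, and the table names would be templated in the same way:

-- Returns exactly one row: check name, both counts, and a verdict
WITH a AS (
  SELECT COUNT(*) AS c FROM raw.orders WHERE order_date = :run_date
), b AS (
  SELECT COUNT(*) AS c FROM stg.orders WHERE order_date = :run_date
)
SELECT 'raw_vs_stg_orders' AS check_name,
       :run_date AS run_date,
       a.c AS count_a,
       b.c AS count_b,
       CASE WHEN ABS(a.c - b.c) <= :tolerance_pct * NULLIF(b.c, 0)
            THEN 'PASS' ELSE 'FAIL' END AS status
FROM a CROSS JOIN b;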
Learning path
- Start with row count checks on one pipeline.
- Add key-set comparisons (missing IDs) for critical tables.
- Include metric reconciliations (sums, averages).
- Automate scheduling and alerting.
- Centralize results into a quality report.
Next steps
- Parameterize your queries (date, table names, tolerances).
- Add schema drift checks (column names and types).
- Combine with freshness checks (max timestamp); see the sketch below.
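A minimal freshness sketch, assuming a loaded_at timestamp column and a daily load cadence (both assumptions about your tables):

-- Alert if the newest row is older than the expected cadence plus slack
SELECT MAX(loaded_at) AS last_load,
       now() - MAX(loaded_at) AS staleness,
       (now() - MAX(loaded_at)) > INTERVAL '26 hours' AS is_stale
FROM stg.orders;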
Mini challenge
Pick a critical metric (e.g., orders, revenue). Implement both a row count check and a per-day reconciliation for the last 7 days. Set a tolerance, run it daily for a week, and record outcomes.