Who this is for
BI Developers and Analytics Engineers who need to explain why numbers differ between dashboards, SQL queries, and source systems—and fix them fast.
Prerequisites
- Comfort writing SQL (SELECT, JOINs, GROUP BY, window functions)
- Basic understanding of data models (fact/dimension, data grain)
- Familiarity with your company’s KPI definitions
Why this matters
Data discrepancies are a daily reality. Common real-world tasks you’ll face:
- Executive dashboard KPIs don’t match Finance reports
- Yesterday’s revenue changed after a refresh
- Marketing’s user counts are higher than Product’s
- Teams use the same data but get different results
Being able to rapidly diagnose and resolve these mismatches builds trust, saves time, and prevents costly decisions based on wrong data.
Concept explained simply
A discrepancy is when two numbers that should match, don’t. The root causes typically fall into a few buckets:
- Source mismatches: comparing different systems of record (e.g., orders vs payments)
- Join/filter issues: inner joins dropping rows, or filters not aligned
- Grain issues: counting rows at different levels (order vs order_item)
- Aggregation differences: distinct vs non-distinct, daily vs monthly rollups
- Time boundaries: time zones, late-arriving data, windows vs calendar days
- Definition drift: different KPI rules or versions
Mental model: The 5W+H of a discrepancy
- Who: which users/customers/products differ?
- What: which metric (count, sum, avg) differs and by how much?
- Where: which tables, dashboards, or pipeline stages?
- When: which dates? Is it consistent or intermittent?
- Why/How: suspect joins, filters, time zones, duplicates, or definitions.
Then trace the value through the pipeline and compare at each step.
Debugging playbook (fast and reliable)
- Reproduce the issue
  - Get the exact filter set (date range, segments, status flags).
  - Freeze a small scope (e.g., one day, one region) to iterate quickly.
- Pick the source of truth and grain
  - Example: revenue should use the payments ledger at the payment grain.
  - Write the expected definition in one sentence. Keep it visible.
- Measure the gap
  - Compute the absolute and percent difference (see the sketch after this list). Track it as you narrow down.
- Compare step-by-step
  - Row counts and sums in raw → staging → marts → BI model
  - Slice by date, source, or key to localize where divergence starts
- Check joins
  - Look for dropped rows (LEFT JOIN + WHERE right.key IS NULL)
  - Check for many-to-many joins inflating counts (duplicates)
- Validate filters and time
  - Apply time zone conversions before date truncation
  - Account for late-arriving data and refresh windows
- Check grain and deduplicate
  - Aggregate to the correct grain before joining
  - Use window functions to pick one record per entity when needed
- Confirm KPI definitions
  - Same inclusive/exclusive rules? Same statuses? Same currency conversions?
- Document and guard
  - Write up the root cause and the fix
  - Add data tests or reconciliation checks so it doesn’t recur
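To make the “Measure the gap” step concrete, here is a minimal sketch; mart.revenue_daily is a hypothetical dashboard-side rollup standing in for whichever two stages you are comparing:
-- Quantify the gap between two candidate sources for one day
-- (mart.revenue_daily is an illustrative table name)
WITH dash AS (
SELECT SUM(amount_cents)/100.0 AS m FROM mart.revenue_daily WHERE d = {{d}}
), fin AS (
SELECT SUM(amount_cents)/100.0 AS m FROM raw.payments WHERE DATE(paid_ts) = {{d}} AND status = 'paid'
)
SELECT dash.m AS dash_m, fin.m AS fin_m,
fin.m - dash.m AS abs_diff,
100.0 * (fin.m - dash.m) / NULLIF(fin.m, 0) AS pct_diff
FROM dash CROSS JOIN fin;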
Worked examples (with SQL)
Example 1: Inner join dropping revenue
Symptom: Dashboard revenue (orders JOIN payments) is lower than Finance (payments ledger).
Diagnosis and fix
Check totals by source:
-- Totals by stage for a given date (example date parameterized as {{d}})
SELECT 'orders' AS stage, COUNT(*) AS row_count, SUM(amount_cents)/100.0 AS amount
FROM raw.orders
WHERE DATE(order_ts) = {{d}}
UNION ALL
SELECT 'payments', COUNT(*), SUM(amount_cents)/100.0
FROM raw.payments
WHERE DATE(paid_ts) = {{d}};
Find dropped rows from inner join:
-- Payments with no matching order (late ingest, soft-deletes, etc.)
SELECT p.order_id, SUM(p.amount_cents)/100.0 AS amt
FROM raw.payments p
LEFT JOIN raw.orders o ON o.id = p.order_id
WHERE DATE(p.paid_ts) = {{d}} AND o.id IS NULL
GROUP BY 1;
Fix: Use the payments ledger as truth or LEFT JOIN dimensions only:
-- Correct: revenue from payments ledger
SELECT SUM(CASE WHEN p.status = 'paid' THEN p.amount_cents ELSE 0 END)/100.0 AS revenue
FROM raw.payments p
WHERE DATE(p.paid_ts) = {{d}};
Example 2: Double counting from many-to-many
Symptom: Revenue is inflated after joining order_items to promotions (a product can have multiple promotions per date).
Diagnosis and fix
-- Baseline at item grain (the number a correct join must reproduce)
SELECT SUM(item_total_cents)/100.0 AS baseline
FROM mart.order_items
WHERE order_date = {{d}};
-- Bad: joining items directly to promotions fans out rows when a product
-- has multiple active promotions on the same date, inflating the sum
SELECT SUM(oi.item_total_cents)/100.0 AS inflated
FROM mart.order_items oi
JOIN dim.product_promotions pr
ON pr.product_id = oi.product_id
AND pr.valid_from <= oi.order_date
AND pr.valid_to >= oi.order_date
WHERE oi.order_date = {{d}};
Fix 1: Pre-aggregate promotions to one row per order_id before joining.
WITH item_rev AS (
SELECT order_id, SUM(item_total_cents) AS rev_cents
FROM mart.order_items
WHERE order_date = {{d}}
GROUP BY order_id
), promo_per_order AS (
SELECT oi.order_id, MAX(CASE WHEN pr.type = 'coupon' THEN 1 ELSE 0 END) AS had_coupon
FROM mart.order_items oi
LEFT JOIN dim.product_promotions pr
ON pr.product_id = oi.product_id
AND pr.valid_from <= oi.order_date
AND pr.valid_to >= oi.order_date
WHERE oi.order_date = {{d}}
GROUP BY oi.order_id
)
SELECT SUM(ir.rev_cents)/100.0 AS revenue
FROM item_rev ir
LEFT JOIN promo_per_order p ON p.order_id = ir.order_id;
Fix 2: Deduplicate bridge rows, or use DISTINCT at the correct grain before the join.
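A minimal sketch of Fix 2, assuming the goal is simply to flag whether an item’s product had any active promotion; a LEFT JOIN to a deduplicated bridge cannot fan out item rows:
-- Deduplicate the bridge to one row per product before joining
WITH promo_products AS (
SELECT DISTINCT product_id
FROM dim.product_promotions
WHERE valid_from <= {{d}} AND valid_to >= {{d}}
)
SELECT SUM(oi.item_total_cents)/100.0 AS revenue
FROM mart.order_items oi
LEFT JOIN promo_products pp ON pp.product_id = oi.product_id
WHERE oi.order_date = {{d}};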
Example 3: Time zone boundary
Symptom: Daily revenue differs between a UTC-sourced table and a dashboard using local time.
Diagnosis and fix
-- Wrong: truncating in UTC, before converting to the business time zone
SELECT DATE_TRUNC('day', paid_ts) AS d, SUM(amount_cents)/100.0 AS amount
FROM fact_payments
GROUP BY 1;
-- Right: convert to business time zone first, then truncate
SELECT DATE_TRUNC('day', paid_ts AT TIME ZONE 'America/New_York') AS d_local,
SUM(amount_cents)/100.0 AS amount
FROM fact_payments
GROUP BY 1;
Also check late-arriving payments; yesterday can change after settlements finalize.
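One way to quantify late arrivals, assuming your warehouse records an ingestion timestamp (the column name _loaded_at is illustrative):
-- Payments ingested more than a day after their business date
SELECT DATE(paid_ts) AS d,
COUNT(*) AS late_rows,
SUM(amount_cents)/100.0 AS late_amount
FROM fact_payments
WHERE _loaded_at > DATE(paid_ts) + INTERVAL '1 day'
GROUP BY 1
ORDER BY 1;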
Example 4: SCD2 dimension join
Symptom: Customer counts increase after joining fact_sessions to dim_customer (SCD2) on natural key.
Diagnosis and fix
-- Bad: natural key join creates duplicates when multiple versions exist
SELECT COUNT(*)
FROM fact_sessions s
JOIN dim_customer d
ON d.customer_id = s.customer_id; -- duplicates!
-- Fix: join on surrogate key present in fact OR use valid time window
SELECT COUNT(*)
FROM fact_sessions s
JOIN dim_customer d
ON d.customer_sk = s.customer_sk; -- preferred
-- Or date-bounded when only natural key is available
SELECT COUNT(*)
FROM fact_sessions s
JOIN dim_customer d
ON d.customer_id = s.customer_id
AND s.session_ts >= d.effective_from
AND s.session_ts < COALESCE(d.effective_to, '9999-12-31');
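If the date-bounded join still produces duplicates, the dimension itself may contain overlapping validity windows. A quick integrity check, using the same assumed columns:
-- Returns a row for every pair of overlapping SCD2 versions (fan-out risk)
SELECT d1.customer_id, d1.customer_sk, d2.customer_sk AS overlaps_with
FROM dim_customer d1
JOIN dim_customer d2
ON d2.customer_id = d1.customer_id
AND d2.customer_sk <> d1.customer_sk
AND d1.effective_from < COALESCE(d2.effective_to, '9999-12-31')
AND d2.effective_from < COALESCE(d1.effective_to, '9999-12-31');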
Ready-to-use diagnostic queries
Find rows dropped by an inner join
SELECT l.key
FROM left_table l
LEFT JOIN right_table r ON r.key = l.key
WHERE r.key IS NULL;
Detect double counting after a join
SELECT fact.key, COUNT(*) AS joined_rows
FROM fact
JOIN dim ON dim.key = fact.key
GROUP BY fact.key
HAVING COUNT(*) > 1;
Compare sums across stages
WITH a AS (
SELECT date_col, SUM(metric) AS m FROM stage_a GROUP BY 1
), b AS (
SELECT date_col, SUM(metric) AS m FROM stage_b GROUP BY 1
)
SELECT COALESCE(a.date_col, b.date_col) AS d,
a.m AS a_m, b.m AS b_m,
(b.m - a.m) AS diff,
100.0 * (b.m - a.m) / NULLIF(a.m, 0) AS pct_diff
FROM a
FULL OUTER JOIN b ON a.date_col = b.date_col
ORDER BY d;
Scan for nulls and unexpected values
SELECT
SUM(CASE WHEN col IS NULL THEN 1 ELSE 0 END) AS nulls,
COUNT(*) AS total,
SUM(CASE WHEN col NOT IN ('expected1','expected2') THEN 1 ELSE 0 END) AS unexpected
FROM t;
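Pick one row per entity with a window function (entity_id and updated_at are placeholder column names)
-- Keep only the most recent row per entity before joining
SELECT *
FROM (
SELECT t.*,
ROW_NUMBER() OVER (PARTITION BY entity_id ORDER BY updated_at DESC) AS rn
FROM t
) ranked
WHERE rn = 1;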
Exercises you can run
Work through the exercises in order. Use the checklist before submitting.
Exercise 1 — Trace missing revenue through joins
Tables: raw.orders(id, user_id, status, order_ts, currency, amount_cents), raw.payments(order_id, paid_ts, status, amount_cents). Scope: yesterday (use your SQL date function). Task: explain why the dashboard’s revenue (built from orders JOIN payments) is lower than Finance (payments ledger), and provide a corrected query.
- Deliverable 1: A breakdown by cause (e.g., payments missing orders, orders without paid payments, duplicate payments)
- Deliverable 2: A corrected revenue SQL aligned to the chosen source of truth
Tip: What to compare
- Row counts: orders vs payments
- Left anti-join to find dropped payments
- Group by status and hour to localize the gap
Exercise 2 — Eliminate double counting in a many-to-many join
Tables: mart.order_items(order_id, product_id, item_total_cents, order_date), dim.product_promotions(product_id, type, valid_from, valid_to). Scope: single day. Task: compute total revenue without inflation despite multiple promotions per product-date.
- Deliverable 1: Proof of duplication (orders with joined_rows > 1)
- Deliverable 2: Corrected SQL that returns the same total as the baseline at item grain
Tip: Two safe strategies
- Pre-aggregate to the target grain before joining
- Deduplicate bridge rows or compute a per-order rollup of promo flags
Submission checklist
- Scope stated (date, filters, statuses)
- Source of truth and grain stated
- Diagnostic queries included (anti-join, duplication check)
- Final corrected query provided
- Short note on root cause + prevention
Common mistakes and self-check
- Mixing grains: counting users at event grain but comparing to user grain. Fix: aggregate first.
- Inner joins acting as filters: valid rows silently dropped. Fix: use LEFT JOIN + WHERE right.key IS NULL to confirm the impact.
- Date truncation in the wrong time zone. Fix: convert to the business time zone, then truncate.
- Definition drift: assuming KPIs are identical across teams. Fix: write down the definition you are using and compare.
- Late-arriving data: yesterday’s numbers shift. Fix: use rolling refresh windows or finalization status flags.
Self-check prompts
- Did I compare row counts and sums at each stage?
- Did I prove or rule out join drops and duplicates?
- Is my time window and time zone identical across comparisons?
- Is the KPI definition documented and matched?
Practical projects
- Reconciliation notebook: Build a SQL script or notebook that compares counts/sums across raw, staging, and marts for core KPIs.
- Join health dashboard: Daily checks for dropped rows (anti-joins) and duplicate joins (HAVING COUNT(*) > 1) on critical joins.
- Time zone harness: A set of test queries verifying daily totals under different time zones and late-arriving windows.
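For the time zone harness, a starting point might be a query that finds rows landing on different calendar days in UTC versus local time (time zone and table names are illustrative):
SELECT DATE(paid_ts AT TIME ZONE 'UTC') AS d_utc,
DATE(paid_ts AT TIME ZONE 'America/New_York') AS d_local,
COUNT(*) AS moved_rows,
SUM(amount_cents)/100.0 AS moved_amount
FROM fact_payments
WHERE DATE(paid_ts AT TIME ZONE 'UTC') <> DATE(paid_ts AT TIME ZONE 'America/New_York')
GROUP BY 1, 2;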
Learning path
- Before: SQL basics, JOINs, GROUP BY, window functions
- Now: Debugging Data Discrepancies (this lesson)
- Next: Data testing and monitoring (automated reconciliation, data contracts)
Next steps
- Turn the playbook into reusable SQL snippets
- Add assertions to your BI model (e.g., row counts, null checks); see the sketch after this list
- Document KPI definitions and version them
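One convenient shape for an assertion is a query that returns rows only when something is wrong; the table name and 0.5% threshold below are illustrative:
-- "Fails" (returns rows) when two stages diverge by more than 0.5%
SELECT d, a_m, b_m, diff, pct_diff
FROM reconciliation_daily
WHERE ABS(pct_diff) > 0.5;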
Mini challenge
You have fact_orders and fact_payments. Yesterday’s revenue differs by 2.4%. Produce:
- A one-liner KPI definition for revenue
- Two diagnostic queries that localize the gap
- One final corrected query
Need a nudge?
- Start with a full outer join by day to compute the difference
- Use an anti-join to find dropped payments
- Check time zone and status filters