Who this is for
BI Developers and Analytics Engineers who need to explain why numbers differ between dashboards, SQL queries, and source systems—and fix them fast.
Prerequisites
- Comfort writing SQL (SELECT, JOINs, GROUP BY, window functions)
- Basic understanding of data models (fact/dimension, data grain)
- Familiarity with your company’s KPI definitions
Why this matters
Data discrepancies are a daily reality. Common real-world tasks you’ll face:
- Executive dashboard KPIs don’t match Finance reports
- Yesterday’s revenue changed after a refresh
- Marketing’s user counts are higher than Product’s
- Teams use the same data but get different results
Being able to rapidly diagnose and resolve these mismatches builds trust, saves time, and prevents costly decisions based on wrong data.
Concept explained simply
A discrepancy is when two numbers that should match, don’t. The root causes typically fall into a few buckets:
- Source mismatches: comparing different systems of record (e.g., orders vs payments)
- Join/filter issues: inner joins dropping rows, or filters not aligned
- Grain issues: counting rows at different levels (order vs order_item)
- Aggregation differences: distinct vs non-distinct, daily vs monthly rollups
- Time boundaries: time zones, late-arriving data, windows vs calendar days
- Definition drift: different KPI rules or versions
Mental model: The 5W+H of a discrepancy
- Who: which users/customers/products differ?
- What: which metric (count, sum, avg) differs and by how much?
- Where: which tables, dashboards, or pipeline stages?
- When: which dates? Is it consistent or intermittent?
- Why/How: suspect joins, filters, time zones, duplicates, or definitions.
Then trace the value through the pipeline and compare at each step.
Debugging playbook (fast and reliable)
- Reproduce the issue
  - Get the exact filter set (date range, segments, status flags).
  - Freeze a small scope (e.g., one day, one region) to iterate quickly.
- Pick the source of truth and grain
  - Example: revenue should use the payments ledger at the payment grain.
  - Write the expected definition in one sentence. Keep it visible.
- Measure the gap
  - Compute the absolute and percent difference (see the sketch after this list). Track it as you narrow down.
- Compare step-by-step
  - Row counts and sums in raw → staging → marts → BI model
  - Slice by date, source, or key to localize where divergence starts
- Check joins
  - Look for dropped rows (LEFT JOIN + WHERE right.key IS NULL)
  - Check for many-to-many joins inflating counts (duplicates)
- Validate filters and time
  - Apply time zone conversions before date truncation
  - Account for late-arriving data and refresh windows
- Check grain and deduplicate
  - Aggregate to the correct grain before joining
  - Use window functions to pick one record per entity when needed
- Confirm KPI definitions
  - Same inclusive/exclusive rules? Same statuses? Same currency conversions?
- Document and guard
  - Write up the root cause and the fix
  - Add data tests or reconciliation checks so it doesn’t recur
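To make the “Measure the gap” step concrete, here is a minimal sketch; mart.revenue_daily is a hypothetical dashboard-side rollup standing in for whichever two stages you are comparing:
-- Quantify the gap between two candidate sources for one day
-- (mart.revenue_daily is an illustrative table name)
WITH dash AS (
SELECT SUM(amount_cents)/100.0 AS m FROM mart.revenue_daily WHERE d = {{d}}
), fin AS (
SELECT SUM(amount_cents)/100.0 AS m FROM raw.payments WHERE DATE(paid_ts) = {{d}} AND status = 'paid'
)
SELECT dash.m AS dash_m, fin.m AS fin_m,
fin.m - dash.m AS abs_diff,
100.0 * (fin.m - dash.m) / NULLIF(fin.m, 0) AS pct_diff
FROM dash CROSS JOIN fin;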
Worked examples (with SQL)
Example 1: Inner join dropping revenue
Symptom: Dashboard revenue (orders JOIN payments) is lower than Finance (payments ledger).
Diagnosis and fix
Check totals by source:
-- Totals by stage for a given date (example date parameterized as {{d}})
SELECT 'orders' AS stage, COUNT(*) AS row_count, SUM(amount_cents)/100.0 AS amount
FROM raw.orders
WHERE DATE(order_ts) = {{d}}
UNION ALL
SELECT 'payments', COUNT(*), SUM(amount_cents)/100.0
FROM raw.payments
WHERE DATE(paid_ts) = {{d}};
Find dropped rows from inner join:
-- Payments with no matching order (late ingest, soft-deletes, etc.)
SELECT p.order_id, SUM(p.amount_cents)/100.0 AS amt
FROM raw.payments p
LEFT JOIN raw.orders o ON o.id = p.order_id
WHERE DATE(p.paid_ts) = {{d}} AND o.id IS NULL
GROUP BY 1;
Fix: Use the payments ledger as truth or LEFT JOIN dimensions only:
-- Correct: revenue from payments ledger
SELECT SUM(CASE WHEN p.status = 'paid' THEN p.amount_cents ELSE 0 END)/100.0 AS revenue
FROM raw.payments p
WHERE DATE(p.paid_ts) = {{d}};
Example 2: Double counting from many-to-many
Symptom: Revenue is inflated after joining order_items to promotions (a product can have multiple promotions per date).
Diagnosis and fix
-- Baseline at item grain (the number a correct join must reproduce)
SELECT SUM(item_total_cents)/100.0 AS baseline
FROM mart.order_items
WHERE order_date = {{d}};
-- Bad: joining items directly to promotions fans out rows when a product
-- has multiple active promotions on the same date, inflating the sum
SELECT SUM(oi.item_total_cents)/100.0 AS inflated
FROM mart.order_items oi
JOIN dim.product_promotions pr
ON pr.product_id = oi.product_id
AND pr.valid_from <= oi.order_date
AND pr.valid_to >= oi.order_date
WHERE oi.order_date = {{d}};
Fix 1: Pre-aggregate promotions to one row per order_id before joining.
WITH item_rev AS (
SELECT order_id, SUM(item_total_cents) AS rev_cents
FROM mart.order_items
WHERE order_date = {{d}}
GROUP BY order_id
), promo_per_order AS (
SELECT oi.order_id, MAX(CASE WHEN pr.type = 'coupon' THEN 1 ELSE 0 END) AS had_coupon
FROM mart.order_items oi
LEFT JOIN dim.product_promotions pr
ON pr.product_id = oi.product_id
AND pr.valid_from <= oi.order_date
AND pr.valid_to >= oi.order_date
WHERE oi.order_date = {{d}}
GROUP BY oi.order_id
)
SELECT SUM(ir.rev_cents)/100.0 AS revenue
FROM item_rev ir
LEFT JOIN promo_per_order p ON p.order_id = ir.order_id;
Fix 2: Deduplicate bridge rows, or use DISTINCT at the correct grain before the join.
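A minimal sketch of Fix 2, assuming the goal is simply to flag whether an item’s product had any active promotion; a LEFT JOIN to a deduplicated bridge cannot fan out item rows:
-- Deduplicate the bridge to one row per product before joining
WITH promo_products AS (
SELECT DISTINCT product_id
FROM dim.product_promotions
WHERE valid_from <= {{d}} AND valid_to >= {{d}}
)
SELECT SUM(oi.item_total_cents)/100.0 AS revenue
FROM mart.order_items oi
LEFT JOIN promo_products pp ON pp.product_id = oi.product_id
WHERE oi.order_date = {{d}};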
Example 3: Time zone boundary
Symptom: Daily revenue differs between a UTC-sourced table and a dashboard using local time.
Diagnosis and fix
-- Wrong: truncating in UTC, before converting to the business time zone
SELECT DATE_TRUNC('day', paid_ts) AS d, SUM(amount_cents)/100.0 AS amount
FROM fact_payments
GROUP BY 1;
-- Right: convert to business time zone first, then truncate
SELECT DATE_TRUNC('day', paid_ts AT TIME ZONE 'America/New_York') AS d_local,
SUM(amount_cents)/100.0 AS amount
FROM fact_payments
GROUP BY 1;
Also check late-arriving payments; yesterday can change after settlements finalize.
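One way to quantify late arrivals, assuming your warehouse records an ingestion timestamp (the column name _loaded_at is illustrative):
-- Payments ingested more than a day after their business date
SELECT DATE(paid_ts) AS d,
COUNT(*) AS late_rows,
SUM(amount_cents)/100.0 AS late_amount
FROM fact_payments
WHERE _loaded_at > DATE(paid_ts) + INTERVAL '1 day'
GROUP BY 1
ORDER BY 1;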
Example 4: SCD2 dimension join
Symptom: Customer counts increase after joining fact_sessions to dim_customer (SCD2) on natural key.
Diagnosis and fix
-- Bad: natural key join creates duplicates when multiple versions exist
SELECT COUNT(*)
FROM fact_sessions s
JOIN dim_customer d
ON d.customer_id = s.customer_id; -- duplicates!
-- Fix: join on surrogate key present in fact OR use valid time window
SELECT COUNT(*)
FROM fact_sessions s
JOIN dim_customer d
ON d.customer_sk = s.customer_sk; -- preferred
-- Or date-bounded when only natural key is available
SELECT COUNT(*)
FROM fact_sessions s
JOIN dim_customer d
ON d.customer_id = s.customer_id
AND s.session_ts >= d.effective_from
AND s.session_ts < COALESCE(d.effective_to, '9999-12-31');
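If the date-bounded join still produces duplicates, the dimension itself may contain overlapping validity windows. A quick integrity check, using the same assumed columns:
-- Returns a row for every pair of overlapping SCD2 versions (fan-out risk)
SELECT d1.customer_id, d1.customer_sk, d2.customer_sk AS overlaps_with
FROM dim_customer d1
JOIN dim_customer d2
ON d2.customer_id = d1.customer_id
AND d2.customer_sk <> d1.customer_sk
AND d1.effective_from < COALESCE(d2.effective_to, '9999-12-31')
AND d2.effective_from < COALESCE(d1.effective_to, '9999-12-31');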
Ready-to-use diagnostic queries
Find rows dropped by an inner join
SELECT l.key
FROM left_table l
LEFT JOIN right_table r ON r.key = l.key
WHERE r.key IS NULL;
Detect double counting after a join
SELECT fact.key, COUNT(*) AS joined_rows
FROM fact
JOIN dim ON dim.key = fact.key
GROUP BY fact.key
HAVING COUNT(*) > 1;
Compare sums across stages
WITH a AS (
SELECT date_col, SUM(metric) AS m FROM stage_a GROUP BY 1
), b AS (
SELECT date_col, SUM(metric) AS m FROM stage_b GROUP BY 1
)
SELECT COALESCE(a.date_col, b.date_col) AS d,
a.m AS a_m, b.m AS b_m,
(b.m - a.m) AS diff,
100.0 * (b.m - a.m) / NULLIF(a.m, 0) AS pct_diff
FROM a
FULL OUTER JOIN b ON a.date_col = b.date_col
ORDER BY d;
Scan for nulls and unexpected values
SELECT
SUM(CASE WHEN col IS NULL THEN 1 ELSE 0 END) AS nulls,
COUNT(*) AS total,
SUM(CASE WHEN col NOT IN ('expected1','expected2') THEN 1 ELSE 0 END) AS unexpected
FROM t;
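Pick one row per entity with a window function (entity_id and updated_at are placeholder column names)
-- Keep only the most recent row per entity before joining
SELECT *
FROM (
SELECT t.*,
ROW_NUMBER() OVER (PARTITION BY entity_id ORDER BY updated_at DESC) AS rn
FROM t
) ranked
WHERE rn = 1;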
Exercises you can run
Work through the exercises in order. Use the checklist before submitting.
Exercise 1 — Trace missing revenue through joins
Tables: raw.orders(id, user_id, status, order_ts, currency, amount_cents), raw.payments(order_id, paid_ts, status, amount_cents). Scope: yesterday (use your SQL date function). Task: explain why the dashboard’s revenue (built from orders JOIN payments) is lower than Finance (payments ledger), and provide a corrected query.
- Deliverable 1: A breakdown by cause (e.g., payments missing orders, orders without paid payments, duplicate payments)
- Deliverable 2: A corrected revenue SQL aligned to the chosen source of truth
Tip: What to compare
- Row counts: orders vs payments
- Left anti-join to find dropped payments
- Group by status and hour to localize the gap
Exercise 2 — Eliminate double counting in a many-to-many join
Tables: mart.order_items(order_id, product_id, item_total_cents, order_date), dim.product_promotions(product_id, type, valid_from, valid_to). Scope: single day. Task: compute total revenue without inflation despite multiple promotions per product-date.
- Deliverable 1: Proof of duplication (orders with joined_rows > 1)
- Deliverable 2: Corrected SQL that returns the same total as the baseline at item grain
Tip: Two safe strategies
- Pre-aggregate to the target grain before joining
- Deduplicate bridge rows or compute a per-order rollup of promo flags
Submission checklist
- Scope stated (date, filters, statuses)
- Source of truth and grain stated
- Diagnostic queries included (anti-join, duplication check)
- Final corrected query provided
- Short note on root cause + prevention
Common mistakes and self-check
- Mixing grains: counting users at event grain but comparing to user grain. Fix: aggregate first.
- Inner joins acting as filters: valid rows silently dropped. Fix: use LEFT JOIN + WHERE right.key IS NULL to confirm the impact.
- Date truncation in the wrong time zone. Fix: convert to the business time zone, then truncate.
- Definition drift: assuming KPIs are identical across teams. Fix: write down the definition you are using and compare.
- Late-arriving data: yesterday’s numbers shift. Fix: use rolling refresh windows or finalization status flags.
Self-check prompts
- Did I compare row counts and sums at each stage?
- Did I prove or rule out join drops and duplicates?
- Is my time window and time zone identical across comparisons?
- Is the KPI definition documented and matched?
Practical projects
- Reconciliation notebook: Build a SQL script or notebook that compares counts/sums across raw, staging, and marts for core KPIs.
- Join health dashboard: Daily checks for dropped rows (anti-joins) and duplicate joins (HAVING COUNT(*) > 1) on critical joins.
- Time zone harness: A set of test queries verifying daily totals under different time zones and late-arriving windows.
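For the time zone harness, a starting point might be a query that finds rows landing on different calendar days in UTC versus local time (time zone and table names are illustrative):
SELECT DATE(paid_ts AT TIME ZONE 'UTC') AS d_utc,
DATE(paid_ts AT TIME ZONE 'America/New_York') AS d_local,
COUNT(*) AS moved_rows,
SUM(amount_cents)/100.0 AS moved_amount
FROM fact_payments
WHERE DATE(paid_ts AT TIME ZONE 'UTC') <> DATE(paid_ts AT TIME ZONE 'America/New_York')
GROUP BY 1, 2;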
Learning path
- Before: SQL basics, JOINs, GROUP BY, window functions
- Now: Debugging Data Discrepancies (this lesson)
- Next: Data testing and monitoring (automated reconciliation, data contracts)
Next steps
- Turn the playbook into reusable SQL snippets
- Add assertions to your BI model (e.g., row counts, null checks); see the sketch after this list
- Document KPI definitions and version them
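One convenient shape for an assertion is a query that returns rows only when something is wrong; the table name and 0.5% threshold below are illustrative:
-- "Fails" (returns rows) when two stages diverge by more than 0.5%
SELECT d, a_m, b_m, diff, pct_diff
FROM reconciliation_daily
WHERE ABS(pct_diff) > 0.5;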
Mini challenge
You have fact_orders and fact_payments. Yesterday’s revenue differs by 2.4%. Produce:
- A one-liner KPI definition for revenue
- Two diagnostic queries that localize the gap
- One final corrected query
Need a nudge?
- Start with a full outer join by day to compute the difference
- Use an anti-join to find dropped payments
- Check time zone and status filters