Why this matters
As a Business Analyst, your insights drive decisions. A query can look solid and still produce wrong results: mis-joined tables, double-counted rows, incorrect filters, or a simple oversight. Sanity checks are fast, lightweight tests you run after writing SQL to ensure results are believable and safe to present.
- Verify that revenue totals match expectations after joins.
- Confirm that joins or filters didn’t multiply rows by comparing counts before and after.
- Ensure percentages sum to ~100% and no negatives appear in amounts that should be positive.
- Catch missing dates, unexpected spikes/drops, and odd null rates.
Concept explained simply
Sanity checks are quick, approximate validations you run right after a query. They don’t prove the query is perfect; they prove it’s not obviously wrong. Think of them as smoke alarms: simple but powerful.
Mental model
- Three-lens review: Input (source tables) → Transform (joins, filters, group-bys) → Output (numbers and shape). Run checks at each lens.
- Invariants: Things that should always hold true (e.g., order_id uniqueness, non-negative paid amounts).
- Reconciliation: Compare two ways of computing the same number, or compare to a known baseline.
Core sanity checks you can apply today
- Row count continuity: Compare counts before/after joins and filters.
Expect joined_cnt ≥ orders_cnt. If it is much larger, check for duplicate inflation.
-- Before join
SELECT COUNT(*) AS orders_cnt
FROM orders
WHERE status = 'paid';
-- After join
SELECT COUNT(*) AS joined_cnt
FROM orders o
JOIN order_items i ON o.order_id = i.order_id
WHERE o.status = 'paid';
- Primary key uniqueness: No rows should return if order_id is a true key.
SELECT order_id, COUNT(*) AS c
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1;
- Join duplication trap: If joining to a many-side table, aggregate first.
WITH items AS (
  SELECT order_id, SUM(item_amount) AS items_total
  FROM order_items
  GROUP BY order_id
)
SELECT o.order_id, o.total_amount, items.items_total
FROM orders o
LEFT JOIN items ON o.order_id = items.order_id;
- Range and sign checks:
-- Negative amounts that shouldn't be negative
SELECT *
FROM orders
WHERE status = 'paid' AND total_amount < 0;
- Percent-of-total sanity: Expect percent_sum ≈ 100 (a small rounding tolerance is okay).
WITH by_country AS (
  SELECT u.country, SUM(o.total_amount) AS amt
  FROM orders o
  JOIN users u ON o.user_id = u.user_id
  WHERE o.status = 'paid'
  GROUP BY u.country
), total AS (
  SELECT SUM(amt) AS t FROM by_country
)
SELECT SUM(ROUND(100.0 * amt / t, 2)) AS percent_sum
FROM by_country, total;
- Time continuity: Look for missing dates or extreme spikes; scan the output for gaps or shocking changes.
SELECT order_date::date AS d, SUM(total_amount) AS revenue
FROM orders
WHERE status = 'paid'
GROUP BY d
ORDER BY d;
- Null-rate inspection: Sudden jumps in null_rate can signal pipeline issues.
SELECT SUM(CASE WHEN coupon_code IS NULL THEN 1 ELSE 0 END) * 1.0 / COUNT(*) AS null_rate
FROM orders;
- Two-method reconciliation: The two sums should match (allow small rounding differences).
-- Method A: sum of order totals
SELECT SUM(total_amount) FROM orders WHERE status = 'paid';
-- Method B: sum of item amounts
SELECT SUM(i.item_amount)
FROM orders o
JOIN order_items i ON o.order_id = i.order_id
WHERE o.status = 'paid';
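The time-continuity scan above still relies on eyeballing the output for gaps. On PostgreSQL, missing dates can be surfaced explicitly by joining a generated calendar against daily revenue (generate_series is PostgreSQL-specific; the table and column names follow the lesson's schema):

```sql
-- Sketch, assuming PostgreSQL. Rows returned = dates with no paid revenue.
WITH calendar AS (
  SELECT d::date AS d
  FROM generate_series(
    (SELECT MIN(order_date)::date FROM orders),
    (SELECT MAX(order_date)::date FROM orders),
    INTERVAL '1 day'
  ) AS d
),
daily AS (
  SELECT order_date::date AS d, SUM(total_amount) AS revenue
  FROM orders
  WHERE status = 'paid'
  GROUP BY order_date::date
)
SELECT c.d AS missing_date
FROM calendar c
LEFT JOIN daily ON daily.d = c.d
WHERE daily.revenue IS NULL
ORDER BY c.d;
```

An empty result means the series is continuous; any rows returned are the exact dates to investigate.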
Tip: A quick review checklist
- Row counts before vs after joins look reasonable
- Keys unique where expected
- No unexpected negatives or out-of-range values
- Percentages sum to ~100%
- Time series has no gaps, spikes are explainable
- Recomputed totals match via a second method
- Null rates plausible; sudden shifts investigated
Worked examples
Example 1: Detect duplicate inflation after a join
Goal: Revenue by day from orders joined to order_items.
-- 1) Baseline count
SELECT COUNT(*) AS paid_orders
FROM orders
WHERE status = 'paid';
-- 2) After naive join (likely duplicates)
SELECT COUNT(*) AS rows_after_join
FROM orders o
JOIN order_items i ON o.order_id = i.order_id
WHERE o.status = 'paid';
-- 3) Diagnose order-level duplication
SELECT o.order_id, COUNT(*) AS row_c
FROM orders o JOIN order_items i ON o.order_id = i.order_id
WHERE o.status = 'paid'
GROUP BY o.order_id
HAVING COUNT(*) > 1
ORDER BY row_c DESC
LIMIT 10;
If many order_ids have row_c > 1, aggregate items first:
WITH items AS (
SELECT order_id, SUM(item_amount) AS items_total
FROM order_items
GROUP BY order_id
)
SELECT o.order_date::date AS d, SUM(items.items_total) AS revenue
FROM orders o
JOIN items ON o.order_id = items.order_id
WHERE o.status = 'paid'
GROUP BY d
ORDER BY d;
Example 2: Percent-of-total sanity
Goal: Country revenue shares sum to ~100%.
WITH by_country AS (
SELECT u.country, SUM(o.total_amount) AS amt
FROM orders o
JOIN users u ON o.user_id = u.user_id
WHERE o.status = 'paid'
GROUP BY u.country
), total AS (
SELECT SUM(amt) AS t FROM by_country
)
SELECT country,
ROUND(100.0 * amt / t, 2) AS pct
FROM by_country, total
ORDER BY pct DESC;
-- Sanity: do the parts sum to ~100?
WITH by_country AS (
SELECT u.country, SUM(o.total_amount) AS amt
FROM orders o
JOIN users u ON o.user_id = u.user_id
WHERE o.status = 'paid'
GROUP BY u.country
), total AS (
SELECT SUM(amt) AS t FROM by_country
)
SELECT SUM(ROUND(100.0 * amt / t, 2)) AS percent_sum
FROM by_country, total;
Investigate if percent_sum is far from 100 due to rounding or missing categories.
Example 3: Two-method revenue reconciliation
Goal: Sum by orders vs sum of items.
-- Method A
SELECT SUM(total_amount) AS a
FROM orders
WHERE status = 'paid';
-- Method B
SELECT SUM(i.item_amount) AS b
FROM orders o
JOIN order_items i ON o.order_id = i.order_id
WHERE o.status = 'paid';
-- Compare difference
WITH a AS (
SELECT SUM(total_amount) AS v
FROM orders WHERE status = 'paid'
), b AS (
SELECT SUM(i.item_amount) AS v
FROM orders o JOIN order_items i ON o.order_id = i.order_id
WHERE o.status = 'paid'
)
SELECT (SELECT v FROM a) - (SELECT v FROM b) AS diff;
Expect diff ≈ 0. If not, check missing items, returns, or rounding.
Common mistakes and how to self-check
- Counting rows after joining to a many-side table. Fix: Aggregate first or count DISTINCT keys.
- Filtering on the wrong side of a LEFT JOIN (turns it into INNER). Fix: Put child-table filters in the ON clause.
- Ignoring rounding tolerances. Fix: Allow a small epsilon when comparing sums.
- Overlooking time gaps. Fix: Always scan daily counts for missing dates or sharp anomalies.
- Assuming keys are unique. Fix: Explicitly test for duplicates with GROUP BY ... HAVING COUNT(*) > 1.
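The LEFT JOIN pitfall above is easiest to see side by side. A sketch using the lesson's orders and order_items tables (the i.item_amount > 0 condition is just an illustrative child-table filter):

```sql
-- Wrong: a WHERE filter on the child table turns the LEFT JOIN into an INNER JOIN,
-- because unmatched orders have i.item_amount = NULL and fail the predicate.
SELECT o.order_id, SUM(i.item_amount) AS items_total
FROM orders o
LEFT JOIN order_items i ON o.order_id = i.order_id
WHERE i.item_amount > 0
GROUP BY o.order_id;

-- Right: the child-table filter lives in the ON clause, so orders without
-- qualifying items survive with items_total = NULL instead of disappearing.
SELECT o.order_id, SUM(i.item_amount) AS items_total
FROM orders o
LEFT JOIN order_items i
  ON o.order_id = i.order_id AND i.item_amount > 0
GROUP BY o.order_id;
```

A quick sanity check: the second query's row count should equal SELECT COUNT(*) FROM orders; the first query's usually will not.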
Quick self-check routine
- Run key uniqueness checks.
- Compare pre- and post-join counts.
- Check negative/out-of-range values.
- Scan time series for gaps/spikes.
- Reconcile a main KPI via two methods.
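For the reconciliation step, the tolerance rule can be encoded in the query itself rather than compared by eye. A sketch reusing the lesson's two-method revenue queries (the 0.5% threshold is an arbitrary choice, not a standard; pick one that fits your data):

```sql
WITH a AS (
  SELECT SUM(total_amount) AS v
  FROM orders WHERE status = 'paid'
), b AS (
  SELECT SUM(i.item_amount) AS v
  FROM orders o
  JOIN order_items i ON o.order_id = i.order_id
  WHERE o.status = 'paid'
)
SELECT a.v AS order_total,
       b.v AS item_total,
       ABS(a.v - b.v) AS diff,
       -- Boolean flag instead of eyeballing: totals agree within 0.5%.
       ABS(a.v - b.v) <= 0.005 * a.v AS within_tolerance
FROM a, b;
```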
Exercises
Use tables: orders(order_id, user_id, order_date, status, total_amount), order_items(order_id, product_id, quantity, item_amount), users(user_id, country).
- Exercise ex1 — Stop duplicate inflation after a join
Write queries to: (a) count paid orders; (b) count rows after joining to order_items; (c) find order_ids with row duplication; (d) fix by aggregating items before the join.
- Exercise ex2 — Range, gaps, and reconciliation
For the last 30 days of paid orders: (a) find negative totals; (b) detect missing dates in daily revenue; (c) compare order totals vs item totals and return the difference in cents.
Completion checklist:
- I verified row counts pre/post join
- I checked key uniqueness
- I validated percentages and ranges
- I reconciled a KPI via two methods
Practical projects
- Sales dashboard QA: Build daily revenue and units dashboards and embed a query cell that prints sanity flags (e.g., gaps, negative totals, duplicates found).
- Cohort retention validator: Compute weekly retention and auto-calc percent sum to 100% per cohort with anomaly notes.
- Marketing attribution double-check: Compare attribution sums to total paid orders revenue; alert if mismatch exceeds 1%.
Who this is for
Business Analysts and aspiring analysts who run SQL and need to trust their numbers before sharing.
Prerequisites
- Basic SELECT, WHERE, GROUP BY, JOIN
- Comfort reading result sets and simple aggregates
Learning path
- Master core SELECT/JOIN (done or in progress)
- Apply sanity checks on every query (this lesson)
- Introduce re-usable QA CTEs/macros in your queries
- Learn window functions to compare against baselines
- Automate a daily “QA panel” query for key dashboards
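As a first step toward reusable QA queries, several of the lesson's checks can be bundled into one query that returns a single row of counters. A minimal sketch against the lesson's schema (the output column names are illustrative, not a convention):

```sql
-- One-row QA panel: all three counters should be 0 or plausible.
SELECT
  (SELECT COUNT(*)
   FROM (SELECT order_id FROM orders
         GROUP BY order_id HAVING COUNT(*) > 1) dup) AS duplicate_order_ids,
  (SELECT COUNT(*)
   FROM orders
   WHERE status = 'paid' AND total_amount < 0)      AS negative_paid_amounts,
  (SELECT SUM(CASE WHEN coupon_code IS NULL THEN 1 ELSE 0 END) * 1.0 / COUNT(*)
   FROM orders)                                     AS coupon_null_rate;
```

Running a panel like this after every refresh turns ad hoc sanity checks into a repeatable habit.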
Next steps
- Repeat today’s checks on another KPI (e.g., orders, active users, refunds).
- Add a small tolerance rule you’ll reuse (e.g., sums must match within 0.5%).
- Document two anomalies you found and how you resolved them.
Mini challenge
In one query, return yesterday’s paid revenue along with flags: duplicated_orders (bool), negative_amounts (bool), missing_yesterday (bool if no orders). Keep it compact and readable.
Quick Test
Take the quick test to check your understanding.