Why this matters
Data Visualization Engineers rely on trustworthy data to build accurate dashboards. Data validation checks in SQL catch issues early: missing rows, broken joins, wrong aggregations, and unexpected values. With a few repeatable queries, you can prevent misleading charts, reduce rework, and build stakeholder confidence.
Who this is for
- Data Visualization Engineers and BI developers building dashboards and metrics.
- Analysts who prepare SQL datasets for reporting.
- Anyone needing a practical checklist to verify data correctness before publishing visuals.
Prerequisites
- Basic SQL: SELECT, WHERE, GROUP BY, JOIN.
- Comfort with reading table schemas and primary/foreign keys.
Concept explained simply
Data validation checks are small SQL queries that test assumptions about your dataset. Think of them as unit tests for your tables: do counts match expectations, are keys unique, and are values within valid ranges?
Mental model
- Shape: Does the dataset have the expected number of rows and columns?
- Keys: Are identifiers unique and do foreign keys match dimensions?
- Ranges: Are dates, numbers, and categories within valid bounds?
- Aggregates: Do totals, averages, and proportions reconcile with sources?
- Freshness: Is the latest data present and on time?
Practical validation checklist
- Row count comparison (source vs. report table)
- NULL and empties check for critical fields (see the sketch after this list)
- Uniqueness of primary keys
- Referential integrity (no orphaned foreign keys)
- Range checks (dates, amounts, percentages)
- Distribution shifts (mean/median vs. last period)
- Aggregation reconciliation (totals vs. authoritative source)
- Freshness (max date within expected window)
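Most of these checks appear in the worked examples below; the NULL/empties item does not, so here is a minimal sketch, assuming the mart.orders_report table used throughout (the status column is a hypothetical critical field):
-- Count NULLs and empty strings in critical fields
SELECT
  SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS null_customer_id,
  SUM(CASE WHEN order_amount IS NULL THEN 1 ELSE 0 END) AS null_order_amount,
  SUM(CASE WHEN status IS NULL OR status = '' THEN 1 ELSE 0 END) AS missing_status
FROM mart.orders_report;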
Worked examples
1) Row count and total reconciliation
Compare yesterday's orders loaded to your reporting table against the source.
-- Expected: counts and totals match, or differ only by known, documented amounts
WITH src AS (
  SELECT COUNT(*) AS c, SUM(order_amount) AS s
  FROM raw.orders
  WHERE order_date = CURRENT_DATE - INTERVAL '1 day'
), rpt AS (
  SELECT COUNT(*) AS c, SUM(order_amount) AS s
  FROM mart.orders_report
  WHERE order_date = CURRENT_DATE - INTERVAL '1 day'
)
SELECT src.c AS src_count, rpt.c AS rpt_count,
  src.s AS src_sum, rpt.s AS rpt_sum,
  (rpt.c - src.c) AS diff_count,
  (rpt.s - src.s) AS diff_sum
FROM src, rpt;
Why it works
Row count checks catch missing or duplicated rows; total reconciliation reveals partial loads or aggregation errors.
2) Referential integrity (orphan keys)
Ensure each fact order references an existing customer.
SELECT f.customer_id, COUNT(*) AS orphan_rows
FROM mart.orders_report f
LEFT JOIN dim.customers d ON f.customer_id = d.customer_id
WHERE d.customer_id IS NULL
GROUP BY f.customer_id
ORDER BY orphan_rows DESC;
Why it works
LEFT JOIN + NULL filter surfaces foreign keys with no matching dimension row.
3) Uniqueness and range checks
Verify primary key uniqueness and reasonable value ranges.
-- Duplicate primary keys
SELECT order_id, COUNT(*) AS n
FROM mart.orders_report
GROUP BY order_id
HAVING COUNT(*) > 1;
-- Numeric ranges (no negatives, sane upper bound)
SELECT
SUM(CASE WHEN order_amount < 0 THEN 1 ELSE 0 END) AS negatives,
SUM(CASE WHEN order_amount > 100000 THEN 1 ELSE 0 END) AS extreme_high
FROM mart.orders_report;
-- Freshness (latest record date)
SELECT MAX(order_date) AS max_date FROM mart.orders_report;
Why it works
Uniqueness protects against double-counting, range checks catch data entry or pipeline errors, and freshness flags late or missing loads.
Step-by-step: add checks to your workflow
- List critical metrics, keys, and expected ranges for your dashboard.
- Write small SQL queries for each assumption (count, sum, uniqueness, FK joins).
- Run checks on a known-good day to baseline expected results.
- Automate: save queries and run before publishing or scheduling refreshes.
- Document known differences and thresholds for alerts.
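One way to implement the automation step is to collapse the core checks into a single result row. This is a sketch assuming the tables from the worked examples; adapt the column list to your own critical metrics:
-- One-row health check: run before each publish or scheduled refresh
SELECT
  (SELECT COUNT(*) FROM mart.orders_report) AS row_count,
  (SELECT SUM(order_amount) FROM mart.orders_report) AS total_amount,
  (SELECT MAX(order_date) FROM mart.orders_report) AS max_date,
  (SELECT COUNT(*)
   FROM (SELECT order_id FROM mart.orders_report
         GROUP BY order_id HAVING COUNT(*) > 1) dup) AS duplicate_keys;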
Exercises
Do these in your SQL editor. Compare your results with the expected outputs. Use the hints if you get stuck.
Exercise 1 — Daily validation across source and report (mirrors worked example 1)
You maintain two tables: raw.orders (source) and mart.orders_report (dashboard-ready). Validate yesterday's load.
- Compare row counts and total order_amount for order_date = CURRENT_DATE - 1.
- Return differences for both count and sum in one result row.
Exercise 2 — Find orphaned orders (mirrors worked example 2)
Check that all orders in mart.orders_report have a matching customer_id in dim.customers.
- Write a query to list customer_id values in facts that do not exist in the dimension.
- Include a count of orphan rows per customer_id, sorted by count desc.
Self-check checklist
- I can validate row counts and totals between source and report tables.
- I can detect orphaned foreign keys using a LEFT JOIN.
- I can check uniqueness of primary keys and investigate duplicates.
- I can verify ranges and freshness for key fields.
Common mistakes and how to catch them
Comparing different filters
Ensure both sides use the same date filter and status flags. Mismatched WHERE clauses create false alarms.
Counting after JOINs
A JOIN can duplicate rows. For reconciliation, compare base tables first, then separately validate joins for FKs.
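A quick fan-out check, assuming the worked-example tables: if joined_count exceeds base_count, the join itself is duplicating fact rows (usually a non-unique key in the dimension).
-- Compare fact row counts before and after the dimension join
SELECT
  (SELECT COUNT(*) FROM mart.orders_report) AS base_count,
  (SELECT COUNT(*)
   FROM mart.orders_report f
   LEFT JOIN dim.customers d ON f.customer_id = d.customer_id) AS joined_count;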
Ignoring NULL semantics
Use COALESCE where appropriate and explicitly count NULLs in critical fields to spot missing data.
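For example, grouping with COALESCE turns NULLs into a visible, countable bucket instead of letting them disappear; the region column here is hypothetical:
-- NULL regions show up as an explicit 'UNKNOWN' group
SELECT COALESCE(region, 'UNKNOWN') AS region, COUNT(*) AS n
FROM mart.orders_report
GROUP BY COALESCE(region, 'UNKNOWN')
ORDER BY n DESC;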
Forgetting distinctness
When checking uniqueness, use GROUP BY with HAVING COUNT(*) > 1 rather than SELECT DISTINCT, which hides duplicates by collapsing them.
No baseline or thresholds
Record a baseline from a known-good day and define acceptable deltas (e.g., ±1%). Not all differences indicate errors.
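A thresholded version of the worked example 1 sum comparison might look like this; the ±1% tolerance is an illustrative choice, not a rule:
WITH src AS (
  SELECT SUM(order_amount) AS s FROM raw.orders
  WHERE order_date = CURRENT_DATE - INTERVAL '1 day'
), rpt AS (
  SELECT SUM(order_amount) AS s FROM mart.orders_report
  WHERE order_date = CURRENT_DATE - INTERVAL '1 day'
)
-- within_threshold is false when the delta exceeds 1% of the source total
SELECT src.s AS src_sum, rpt.s AS rpt_sum,
  ABS(rpt.s - src.s) / NULLIF(src.s, 0) AS pct_diff,
  ABS(rpt.s - src.s) / NULLIF(src.s, 0) <= 0.01 AS within_threshold
FROM src, rpt;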
Practical projects
- Build a reusable SQL validation script that outputs a single row of key checks (counts, sums, max date, duplicate count).
- Create a small "data quality" table where you insert daily check results with timestamps for easy monitoring (see the sketch after this list).
- Design a dashboard tile that displays validation KPIs (row diff, sum diff, freshness) from your quality table.
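A minimal sketch of the quality-table project, assuming PostgreSQL; the table and column names are illustrative:
-- Store one row of check results per run
CREATE TABLE IF NOT EXISTS mart.data_quality_log (
  checked_at timestamp DEFAULT CURRENT_TIMESTAMP,
  row_count bigint,
  total_amount numeric,
  max_order_date date,
  duplicate_keys bigint
);
INSERT INTO mart.data_quality_log (row_count, total_amount, max_order_date, duplicate_keys)
SELECT
  COUNT(*),
  SUM(order_amount),
  MAX(order_date),
  COUNT(*) - COUNT(DISTINCT order_id) -- surplus rows beyond distinct keys
FROM mart.orders_report;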
Learning path
- Start: Row counts and sums → Keys and referential integrity → Ranges and freshness → Distribution checks (optional) → Automate.
- Next skill: Aggregations and window functions to deepen reconciliation techniques.
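As a preview of that next skill, a window function makes the distribution-shift check from the checklist straightforward. This sketch compares each day's revenue to its trailing 28-day average; the window length is an arbitrary choice:
WITH daily AS (
  SELECT order_date, SUM(order_amount) AS revenue
  FROM mart.orders_report
  GROUP BY order_date
)
-- Revenue far from trailing_avg suggests a shift worth investigating
SELECT order_date, revenue,
  AVG(revenue) OVER (ORDER BY order_date
    ROWS BETWEEN 28 PRECEDING AND 1 PRECEDING) AS trailing_avg
FROM daily
ORDER BY order_date DESC;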
Mini challenge
Your product manager reports a sudden drop in daily revenue on your dashboard. In 10 minutes, run three checks to isolate the issue:
- Row count and sum reconciliation (source vs. report) for yesterday.
- Orphan check on customer_id and product_id.
- Freshness check on both dimension and fact tables.
Suggested approach
Start with base counts and sums. If they match, check joins for orphans. If the joins are clean, verify max dates to confirm that every table refreshed on schedule.
Next steps
- Turn your best checks into saved queries or views and run before publishing dashboards.
- Document known differences and acceptable thresholds in your team wiki or dataset description.
- Practice on a new dataset each week to build speed and reliability.
Quick Test
Take the quick test to check your understanding. It is free for everyone; results are saved only for logged-in users.