Why this matters
Data Visualization Engineers rely on trustworthy data to build accurate dashboards. Data validation checks in SQL catch issues early: missing rows, broken joins, wrong aggregations, and unexpected values. With a few repeatable queries, you can prevent misleading charts, reduce rework, and build stakeholder confidence.
Who this is for
- Data Visualization Engineers and BI developers building dashboards and metrics.
- Analysts who prepare SQL datasets for reporting.
- Anyone needing a practical checklist to verify data correctness before publishing visuals.
Prerequisites
- Basic SQL: SELECT, WHERE, GROUP BY, JOIN.
- Comfort with reading table schemas and primary/foreign keys.
Concept explained simply
Data validation checks are small SQL queries that test assumptions about your dataset. Think of them as unit tests for your tables: do counts match expectations, are keys unique, and are values within valid ranges?
Mental model
- Shape: Does the dataset have the expected number of rows and columns?
- Keys: Are identifiers unique and do foreign keys match dimensions?
- Ranges: Are dates, numbers, and categories within valid bounds?
- Aggregates: Do totals, averages, and proportions reconcile with sources?
- Freshness: Is the latest data present and on time?
Practical validation checklist
- Row count comparison (source vs. report table)
- NULL and empties check for critical fields (see the sketch after this list)
- Uniqueness of primary keys
- Referential integrity (no orphaned foreign keys)
- Range checks (dates, amounts, percentages)
- Distribution shifts (mean/median vs. last period)
- Aggregation reconciliation (totals vs. authoritative source)
- Freshness (max date within expected window)
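Most of these checks appear in the worked examples below; the NULL/empties item does not, so here is a minimal sketch, assuming the mart.orders_report table used throughout (the status column is a hypothetical critical field):
-- Count NULLs and empty strings in critical fields
SELECT
  SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS null_customer_id,
  SUM(CASE WHEN order_amount IS NULL THEN 1 ELSE 0 END) AS null_order_amount,
  SUM(CASE WHEN status IS NULL OR status = '' THEN 1 ELSE 0 END) AS missing_status
FROM mart.orders_report;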
Worked examples
1) Row count and total reconciliation
Compare yesterday's orders loaded to your reporting table against the source.
-- Expected: counts and totals match, or differ only by known, documented amounts
WITH src AS (
  SELECT COUNT(*) AS c, SUM(order_amount) AS s
  FROM raw.orders
  WHERE order_date = CURRENT_DATE - INTERVAL '1 day'
), rpt AS (
  SELECT COUNT(*) AS c, SUM(order_amount) AS s
  FROM mart.orders_report
  WHERE order_date = CURRENT_DATE - INTERVAL '1 day'
)
SELECT src.c AS src_count, rpt.c AS rpt_count,
  src.s AS src_sum, rpt.s AS rpt_sum,
  (rpt.c - src.c) AS diff_count,
  (rpt.s - src.s) AS diff_sum
FROM src, rpt;
Why it works
Row count checks catch missing or duplicated rows; total reconciliation reveals partial loads or aggregation errors.
2) Referential integrity (orphan keys)
Ensure each fact order references an existing customer.
SELECT f.customer_id, COUNT(*) AS orphan_rows
FROM mart.orders_report f
LEFT JOIN dim.customers d ON f.customer_id = d.customer_id
WHERE d.customer_id IS NULL
GROUP BY f.customer_id
ORDER BY orphan_rows DESC;
Why it works
LEFT JOIN + NULL filter surfaces foreign keys with no matching dimension row.
3) Uniqueness and range checks
Verify primary key uniqueness and reasonable value ranges.
-- Duplicate primary keys
SELECT order_id, COUNT(*) AS n
FROM mart.orders_report
GROUP BY order_id
HAVING COUNT(*) > 1;
-- Numeric ranges (no negatives, sane upper bound)
SELECT
SUM(CASE WHEN order_amount < 0 THEN 1 ELSE 0 END) AS negatives,
SUM(CASE WHEN order_amount > 100000 THEN 1 ELSE 0 END) AS extreme_high
FROM mart.orders_report;
-- Freshness (latest record date)
SELECT MAX(order_date) AS max_date FROM mart.orders_report;
Why it works
Uniqueness protects against double-counting, range checks catch data entry or pipeline errors, and freshness flags late or missing loads.
Step-by-step: add checks to your workflow
- List critical metrics, keys, and expected ranges for your dashboard.
- Write small SQL queries for each assumption (count, sum, uniqueness, FK joins).
- Run checks on a known-good day to baseline expected results.
- Automate: save queries and run before publishing or scheduling refreshes.
- Document known differences and thresholds for alerts.
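One way to implement the automation step is to collapse the core checks into a single result row. This is a sketch assuming the tables from the worked examples; adapt the column list to your own critical metrics:
-- One-row health check: run before each publish or scheduled refresh
SELECT
  (SELECT COUNT(*) FROM mart.orders_report) AS row_count,
  (SELECT SUM(order_amount) FROM mart.orders_report) AS total_amount,
  (SELECT MAX(order_date) FROM mart.orders_report) AS max_date,
  (SELECT COUNT(*)
   FROM (SELECT order_id FROM mart.orders_report
         GROUP BY order_id HAVING COUNT(*) > 1) dup) AS duplicate_keys;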
Exercises
Do these in your SQL editor. Compare your results with the expected outputs. Use the hints if you get stuck.
Exercise 1 — Daily validation across source and report (mirrors worked example 1)
You maintain two tables: raw.orders (source) and mart.orders_report (dashboard-ready). Validate yesterday's load.
- Compare row counts and total order_amount for order_date = CURRENT_DATE - 1.
- Return differences for both count and sum in one result row.
Exercise 2 — Find orphaned orders (mirrors worked example 2)
Check that all orders in mart.orders_report have a matching customer_id in dim.customers.
- Write a query to list customer_id values in facts that do not exist in the dimension.
- Include a count of orphan rows per customer_id, sorted by count desc.
Self-check checklist
- I can validate row counts and totals between source and report tables.
- I can detect orphaned foreign keys using a LEFT JOIN.
- I can check uniqueness of primary keys and investigate duplicates.
- I can verify ranges and freshness for key fields.
Common mistakes and how to catch them
Comparing different filters
Ensure both sides use the same date filter and status flags. Mismatched WHERE clauses create false alarms.
Counting after JOINs
A JOIN can duplicate rows. For reconciliation, compare base tables first, then separately validate joins for FKs.
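A quick fan-out check, assuming the worked-example tables: if joined_count exceeds base_count, the join itself is duplicating fact rows (usually a non-unique key in the dimension).
-- Compare fact row counts before and after the dimension join
SELECT
  (SELECT COUNT(*) FROM mart.orders_report) AS base_count,
  (SELECT COUNT(*)
   FROM mart.orders_report f
   LEFT JOIN dim.customers d ON f.customer_id = d.customer_id) AS joined_count;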
Ignoring NULL semantics
Use COALESCE where appropriate and explicitly count NULLs in critical fields to spot missing data.
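For example, grouping with COALESCE turns NULLs into a visible, countable bucket instead of letting them disappear; the region column here is hypothetical:
-- NULL regions show up as an explicit 'UNKNOWN' group
SELECT COALESCE(region, 'UNKNOWN') AS region, COUNT(*) AS n
FROM mart.orders_report
GROUP BY COALESCE(region, 'UNKNOWN')
ORDER BY n DESC;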
Forgetting distinctness
When checking uniqueness, use GROUP BY with HAVING COUNT(*) > 1 rather than SELECT DISTINCT, which hides duplicates by collapsing them.
No baseline or thresholds
Record a baseline from a known-good day and define acceptable deltas (e.g., ±1%). Not all differences indicate errors.
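A thresholded version of the worked example 1 sum comparison might look like this; the ±1% tolerance is an illustrative choice, not a rule:
WITH src AS (
  SELECT SUM(order_amount) AS s FROM raw.orders
  WHERE order_date = CURRENT_DATE - INTERVAL '1 day'
), rpt AS (
  SELECT SUM(order_amount) AS s FROM mart.orders_report
  WHERE order_date = CURRENT_DATE - INTERVAL '1 day'
)
-- within_threshold is false when the delta exceeds 1% of the source total
SELECT src.s AS src_sum, rpt.s AS rpt_sum,
  ABS(rpt.s - src.s) / NULLIF(src.s, 0) AS pct_diff,
  ABS(rpt.s - src.s) / NULLIF(src.s, 0) <= 0.01 AS within_threshold
FROM src, rpt;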
Practical projects
- Build a reusable SQL validation script that outputs a single row of key checks (counts, sums, max date, duplicate count).
- Create a small "data quality" table where you insert daily check results with timestamps for easy monitoring (see the sketch after this list).
- Design a dashboard tile that displays validation KPIs (row diff, sum diff, freshness) from your quality table.
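A minimal sketch of the quality-table project, assuming PostgreSQL; the table and column names are illustrative:
-- Store one row of check results per run
CREATE TABLE IF NOT EXISTS mart.data_quality_log (
  checked_at timestamp DEFAULT CURRENT_TIMESTAMP,
  row_count bigint,
  total_amount numeric,
  max_order_date date,
  duplicate_keys bigint
);
INSERT INTO mart.data_quality_log (row_count, total_amount, max_order_date, duplicate_keys)
SELECT
  COUNT(*),
  SUM(order_amount),
  MAX(order_date),
  COUNT(*) - COUNT(DISTINCT order_id) -- surplus rows beyond distinct keys
FROM mart.orders_report;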
Learning path
- Start: Row counts and sums → Keys and referential integrity → Ranges and freshness → Distribution checks (optional) → Automate.
- Next skill: Aggregations and window functions to deepen reconciliation techniques.
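As a preview of that next skill, a window function makes the distribution-shift check from the checklist straightforward. This sketch compares each day's revenue to its trailing 28-day average; the window length is an arbitrary choice:
WITH daily AS (
  SELECT order_date, SUM(order_amount) AS revenue
  FROM mart.orders_report
  GROUP BY order_date
)
-- Revenue far from trailing_avg suggests a shift worth investigating
SELECT order_date, revenue,
  AVG(revenue) OVER (ORDER BY order_date
    ROWS BETWEEN 28 PRECEDING AND 1 PRECEDING) AS trailing_avg
FROM daily
ORDER BY order_date DESC;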
Mini challenge
Your product manager reports a sudden drop in daily revenue on your dashboard. In 10 minutes, run three checks to isolate the issue:
- Row count and sum reconciliation (source vs. report) for yesterday.
- Orphan check on customer_id and product_id.
- Freshness check on both dimension and fact tables.
Suggested approach
Start with base counts and sums. If they match, check joins for orphans. If the joins are clean, verify max dates to confirm that every table refreshed on schedule.
Next steps
- Turn your best checks into saved queries or views and run before publishing dashboards.
- Document known differences and acceptable thresholds in your team wiki or dataset description.
- Practice on a new dataset each week to build speed and reliability.
Quick Test
Take the quick test to check your understanding. It is free for everyone; results are saved only for logged-in users.