Data Quality Checks In SQL

Learn Data Quality Checks In SQL for free with explanations, exercises, and a quick test (for Data Scientists).

Published: January 1, 2026 | Updated: January 1, 2026

Why this matters

As a Data Scientist, your models, dashboards, and decisions are only as good as the data feeding them. Data quality checks in SQL catch issues early—before a forecast is wrong, an A/B test is misread, or a model drifts due to unexpected nulls and duplicates.

Typical real-world scenarios you’ll face
  • Validate training data: detect null labels, outliers, and duplicated records.
  • Audit pipelines: ensure daily loads are complete and fresh.
  • Verify joins: find orphan rows before building features.
  • Enforce business rules: only allowed statuses, sensible ranges for amounts.

Concept explained simply

Data quality checks are small SQL queries that measure whether data fits expectations. Think of them as unit tests for your tables.
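
A check is usually just a query that returns violations, so an empty result (or a zero count) means the check passes. A minimal sketch against a hypothetical orders table:

-- Passes when failing_rows = 0
SELECT COUNT(*) AS failing_rows
FROM orders
WHERE order_id IS NULL;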

Mental model

  • Completeness: Are required fields present? (NULL checks)
  • Uniqueness: Are keys unique? (duplicate detection)
  • Validity: Do values fall within allowed sets/ranges? (domain checks)
  • Referential integrity: Do foreign keys match a parent table? (orphan detection)
  • Consistency: Do columns agree with each other? (cross-field rules)
  • Timeliness/Freshness: Is the data up-to-date? (max timestamp vs today)

Core SQL patterns you can reuse

1) Completeness (NULLs)

-- Count nulls per column
SELECT
  SUM(CASE WHEN col1 IS NULL THEN 1 ELSE 0 END) AS null_col1,
  SUM(CASE WHEN col2 IS NULL THEN 1 ELSE 0 END) AS null_col2
FROM your_table;

-- List offending rows
SELECT *
FROM your_table
WHERE col1 IS NULL OR col2 IS NULL;
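
A useful variant reports completeness as a rate rather than a raw count, which makes thresholds easier to set across tables of different sizes (a sketch against the same hypothetical your_table):

-- Null rate per column (1.0 * avoids integer division)
SELECT
  1.0 * SUM(CASE WHEN col1 IS NULL THEN 1 ELSE 0 END) / COUNT(*) AS null_rate_col1
FROM your_table;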

2) Uniqueness (duplicate detection)

-- Find duplicate keys
SELECT key_col, COUNT(*) AS cnt
FROM your_table
GROUP BY key_col
HAVING COUNT(*) > 1;

-- Flag duplicates, keep the first (for de-duplication)
WITH ranked AS (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY key_col ORDER BY created_at) AS rn
  FROM your_table
)
SELECT * FROM ranked WHERE rn > 1;  -- duplicates (filter rn = 1 instead to keep only the survivors)

3) Validity (allowed sets & ranges)

-- Allowed values
SELECT *
FROM orders
WHERE status NOT IN ('new','paid','shipped','cancelled') OR status IS NULL;

-- Ranges (inclusive bounds)
SELECT *
FROM orders
WHERE amount < 0 OR amount > 100000; -- adjust to your domain

4) Referential integrity (orphans)

-- Find child rows without a matching parent
SELECT o.*
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;  -- orphans
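
An equivalent NOT EXISTS form, which many teams find more readable (same orders and customers tables):

-- Same orphan check written with NOT EXISTS
SELECT o.*
FROM orders o
WHERE NOT EXISTS (
  SELECT 1
  FROM customers c
  WHERE c.customer_id = o.customer_id
);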

5) Consistency (cross-column rules)

-- Example: shipped orders must have a shipped_at timestamp
SELECT *
FROM orders
WHERE status = 'shipped' AND shipped_at IS NULL;

6) Timeliness/Freshness

-- Days since last event (DATE_DIFF here is Presto/Trino syntax; e.g., Snowflake uses DATEDIFF)
SELECT DATE_DIFF('day', MAX(event_time), CURRENT_DATE) AS days_since_last
FROM events;

-- Row count by date to detect missing load
SELECT event_date, COUNT(*) AS rows_loaded
FROM events
GROUP BY event_date
ORDER BY event_date DESC;

Worked examples

Example 1: Missing values and invalid domains
-- Table: orders(order_id, customer_id, order_date, amount, status)
-- Task: Find rows with NULL amount or invalid status
SELECT *
FROM orders
WHERE amount IS NULL
   OR status NOT IN ('new','paid','shipped','cancelled')
   OR status IS NULL;

Interpretation: any returned row needs a fix at the source, or an explicit imputation rule, before analysis.

Example 2: Duplicates by natural key
-- Natural key: (customer_id, order_date)
SELECT customer_id, order_date, COUNT(*) AS cnt
FROM orders
GROUP BY customer_id, order_date
HAVING COUNT(*) > 1;  -- investigate reasons (retries, idempotency issues, etc.)

-- Keep first occurrence and flag the rest
WITH r AS (
  SELECT *, ROW_NUMBER() OVER (
    PARTITION BY customer_id, order_date
    ORDER BY created_at
  ) AS rn
  FROM orders
)
SELECT * FROM r WHERE rn > 1;  -- duplicates to remove or consolidate

Example 3: Orphan detection
-- Tables: customers(customer_id), orders(customer_id)
SELECT o.order_id, o.customer_id
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;  -- orphans indicate join/ETL issues

Example 4: Freshness and completeness
-- Is yesterday's partition present and roughly the expected size?
WITH by_day AS (
  SELECT CAST(order_date AS DATE) AS d, COUNT(*) AS n
  FROM orders
  GROUP BY CAST(order_date AS DATE)
)
SELECT d, n
FROM by_day
ORDER BY d DESC;

-- Gap detection pattern (missing dates); the calendar CTE below uses
-- Snowflake's GENERATOR/seq4, so build the date spine per your engine
WITH calendar AS (
  SELECT DATEADD('day', seq4(), DATEADD('day', -30, CURRENT_DATE)) AS d
  FROM TABLE(GENERATOR(ROWCOUNT => 31))
),
obs AS (
  SELECT CAST(order_date AS DATE) AS d
  FROM orders
  GROUP BY CAST(order_date AS DATE)
)
SELECT c.d
FROM calendar c
LEFT JOIN obs o ON c.d = o.d
WHERE o.d IS NULL;  -- missing partitions

Practical workflow: build a lightweight data quality suite

  1. List assumptions: keys, required columns, allowed values, expected daily row counts.
  2. Write small SQL checks for each assumption using patterns above.
  3. Create summary metrics: counts of nulls, duplicates, orphans, invalids, days since last event.
  4. Set thresholds: what counts as a failure (e.g., any orphan > 0, duplicates > 0, freshness > 1 day).
  5. Schedule and log: run checks after each load; store results in an audit table for trend monitoring (see the sketch after this list).
  6. Close the loop: when a check fails, file an issue and add a regression test.
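
A minimal sketch of steps 3 through 5, assuming a hypothetical dq_results audit table alongside the orders and customers tables used above:

-- Audit table for check results (adjust types to your warehouse)
CREATE TABLE IF NOT EXISTS dq_results (
  run_at     TIMESTAMP,
  check_name VARCHAR(100),
  metric     BIGINT
);

-- Log one metric per check after each load
INSERT INTO dq_results (run_at, check_name, metric)
SELECT CURRENT_TIMESTAMP, 'orders_null_amount', COUNT(*)
FROM orders
WHERE amount IS NULL;

INSERT INTO dq_results (run_at, check_name, metric)
SELECT CURRENT_TIMESTAMP, 'orders_orphans', COUNT(*)
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;

-- With zero-tolerance thresholds, any metric > 0 today is a failure
SELECT check_name, metric
FROM dq_results
WHERE run_at >= CURRENT_DATE AND metric > 0;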

Practice: Exercises

Try these in your SQL environment. Then compare with the provided solutions.

Exercise 1 — Find NULLs and out-of-range values in orders

Tables: orders(order_id INT, customer_id INT, order_date DATE, amount NUMERIC, status TEXT)

  • Task A: List all rows where amount IS NULL or amount < 0.
  • Task B: List all rows where status NOT IN ('new','paid','shipped','cancelled') OR status IS NULL.
Hints
  • Use IS NULL, OR conditions, and NOT IN for domain checks.
  • Consider separating numeric and categorical checks into two queries.
Expected output: two result sets: (1) rows with a NULL or negative amount; (2) rows with a NULL or unexpected status.
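
One possible solution sketch (try it yourself first):

-- Task A: NULL or negative amounts
SELECT *
FROM orders
WHERE amount IS NULL OR amount < 0;

-- Task B: NULL or unexpected statuses
SELECT *
FROM orders
WHERE status NOT IN ('new','paid','shipped','cancelled')
   OR status IS NULL;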

Exercise 2 — Detect duplicates and orphans across orders and customers

Tables: customers(customer_id INT PRIMARY KEY), orders(order_id INT PRIMARY KEY, customer_id INT, created_at TIMESTAMP)

  • Task A: Find duplicate customer_id + DATE(created_at) combinations in orders and show counts.
  • Task B: Return all orders whose customer_id does not exist in customers (orphans).
  • Task C: Produce a de-duplicated view that keeps the earliest row per (customer_id, DATE(created_at)).
Hints
  • Use GROUP BY ... HAVING COUNT(*) > 1 for duplicates.
  • Use LEFT JOIN and a NULL filter on the parent key to find orphans.
  • Use ROW_NUMBER() OVER (PARTITION BY ...) to keep one row per key.
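
One possible solution sketch (assuming created_at casts cleanly to a date):

-- Task A: duplicate (customer_id, day) combinations
SELECT customer_id, CAST(created_at AS DATE) AS order_day, COUNT(*) AS cnt
FROM orders
GROUP BY customer_id, CAST(created_at AS DATE)
HAVING COUNT(*) > 1;

-- Task B: orphan orders
SELECT o.*
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;

-- Task C: de-duplicated view keeping the earliest row per key
CREATE VIEW orders_dedup AS
WITH r AS (
  SELECT *, ROW_NUMBER() OVER (
    PARTITION BY customer_id, CAST(created_at AS DATE)
    ORDER BY created_at, order_id  -- order_id breaks ties deterministically
  ) AS rn
  FROM orders
)
SELECT order_id, customer_id, created_at
FROM r
WHERE rn = 1;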

Self-check checklist

  • Your queries return only the rows that violate each rule.
  • Duplicate detection reports keys, not individual rows, when counting.
  • Orphan queries use LEFT JOIN (or NOT EXISTS) and filter on NULL parent keys.
  • De-duplication uses ROW_NUMBER() and retains exactly one survivor per key.
  • Validity checks cover both NULLs and unexpected values.

Common mistakes and how to self-check

  • NOT IN with NULLs: when the column is NULL, the NOT IN predicate evaluates to unknown and the row silently drops out of the result. Add explicit IS NULL conditions (see the demo after this list).
  • Forgetting composite keys: checking uniqueness on one column when the real key is two or more columns.
  • BETWEEN misunderstandings: BETWEEN is inclusive; adjust bounds if you need exclusive limits.
  • Inner joining for orphans: an INNER JOIN hides missing parents. Use LEFT JOIN + parent IS NULL.
  • Windowing without ORDER BY: ROW_NUMBER() without a deterministic ORDER BY yields non-reproducible survivors.
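
A quick demo of the NOT IN pitfall (inline VALUES syntax, supported by most engines):

-- Returns only 'bogus': the NULL row makes NOT IN evaluate to unknown
SELECT *
FROM (VALUES ('paid'), ('bogus'), (NULL)) AS t(status)
WHERE status NOT IN ('new','paid','shipped','cancelled');
-- add "OR status IS NULL" to catch the NULL row as well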

Practical projects

  • Build a data quality dashboard: daily counts of nulls, duplicates, orphans, invalids across 3–5 critical tables.
  • Create a regression test suite: store SQL checks and their thresholds in a control table; log results after each pipeline run.
  • Design a de-duplication strategy: write SQL to cluster near-duplicates and keep a canonical record with clear rules.

Who this is for

  • Data Scientists who prepare datasets for modeling and analytics.
  • Analytics Engineers and Data Analysts responsible for reliable dashboards.
  • Anyone owning data pipelines that feed ML/BI systems.

Prerequisites

  • Comfort with SELECT, WHERE, GROUP BY, HAVING, JOINs.
  • Basic window functions (ROW_NUMBER, RANK) recommended.

Learning path

  1. Review SQL JOINs and NULL semantics.
  2. Practice duplicate detection and de-duplication with window functions.
  3. Implement domain and range checks for key datasets.
  4. Add referential integrity and freshness checks.
  5. Automate: schedule checks and log results.

Mini challenge

Given a table payments(payment_id, order_id, amount, currency, paid_at), write two checks: (1) invalid currency (anything not in ('USD','EUR','GBP')), (2) payments with amount <= 0 OR paid_at IS NULL. Then add a freshness query reporting hours since MAX(paid_at). A solution sketch follows.
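
One possible solution sketch (DATE_DIFF syntax varies by engine, as noted earlier):

-- (1) Invalid or missing currency
SELECT *
FROM payments
WHERE currency NOT IN ('USD','EUR','GBP') OR currency IS NULL;

-- (2) Non-positive amounts or missing payment timestamps
SELECT *
FROM payments
WHERE amount <= 0 OR paid_at IS NULL;

-- (3) Hours since the most recent payment (Presto/Trino form)
SELECT DATE_DIFF('hour', MAX(paid_at), CURRENT_TIMESTAMP) AS hours_since_last
FROM payments;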

Next steps

  • Turn your best checks into reusable views so teams can monitor them daily (see the sketch below).
  • Work with upstream owners to fix root causes—not just symptoms.
  • Document assumptions and thresholds so they’re easy to review and update.
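
A minimal sketch of one check packaged as a view (the view name is illustrative):

-- Anyone can SELECT from this view to see current orphans
CREATE VIEW dq_orders_orphans AS
SELECT o.*
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;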

Take the quick test

Test your knowledge with 7 questions; a score of 70% or higher passes. The test is available to everyone; if you log in, your progress and score are saved for next time.
