
Null, Duplicate, and Range Checks

Learn null, duplicate, and range checks for free with explanations, exercises, and a quick test (for ETL Developers).

Published: January 11, 2026 | Updated: January 11, 2026

Why this matters

As an ETL Developer, you ship reliable datasets. Three fast checks catch most issues early:

  • Null checks: catch missing keys, dates, and amounts that would break downstream joins and reports.
  • Duplicate checks: stop inflated metrics by keeping one canonical record per business key (e.g., email, order_id).
  • Range checks: enforce sensible numeric and date bounds (e.g., age 0–120, amount 0–10,000).

These checks fit naturally into staging/validation steps and protect dashboards, ML features, and financial reporting.

Concept explained simply

  • Null check: Are required fields present? If not, the row is incomplete.
  • Duplicate check: Do multiple rows represent the same business entity? If yes, keep one canonical row.
  • Range check: Are values within allowed bounds or valid sets? If not, flag them.

Mental model

Imagine three gates before data enters the warehouse:

  • Gate 1 (Nulls): Missing essentials get stopped.
  • Gate 2 (Duplicates): Only one representative of each entity proceeds.
  • Gate 3 (Ranges): Out-of-bounds values are turned back.

If a row fails any gate, you either fix it, quarantine it, or reject it.

Implementation fundamentals

  1. Define rules: Column-level requirements (NOT NULL), uniqueness keys, allowed value sets, numeric/date bounds.
  2. Write detection queries: SQL patterns to count and isolate offending rows.
  3. Decide severity: Error (stop the load) vs. Warning (quarantine and continue). Set thresholds.
  4. Remediate: Standardize, deduplicate, apply defaults, or route rows to an error table for review (see the sketch after this list).
  5. Monitor: Track daily counts, percentages, and trends.
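
Steps 2–4 can be wired together with an error table that records each offending row and why it failed. A minimal sketch, assuming a hypothetical my_table_errors table alongside the my_table used in the patterns below; adjust names, types, and rules to your schema:

-- Error table for quarantined rows (step 4); the schema is illustrative
CREATE TABLE IF NOT EXISTS my_table_errors (
  business_key  VARCHAR(100),
  reason        VARCHAR(50),
  detected_at   TIMESTAMP
);

-- Detection (step 2) feeding remediation (step 4): route offending rows with a reason
INSERT INTO my_table_errors (business_key, reason, detected_at)
SELECT business_key,
       CASE
         WHEN business_key IS NULL THEN 'business_key_null'
         WHEN amount < 0 OR amount > 10000 THEN 'amount_out_of_range'
       END AS reason,
       CURRENT_TIMESTAMP
FROM my_table
WHERE business_key IS NULL
   OR amount < 0 OR amount > 10000;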

Reusable SQL patterns
-- Nulls per column (profile)
SELECT 
  SUM(CASE WHEN col1 IS NULL THEN 1 ELSE 0 END) AS nulls_col1,
  SUM(CASE WHEN col2 IS NULL THEN 1 ELSE 0 END) AS nulls_col2
FROM my_table;

-- Duplicate keys (count and isolate)
SELECT business_key, COUNT(*) AS cnt
FROM my_table
GROUP BY business_key
HAVING COUNT(*) > 1;

-- Canonical row with tie-breakers (keep latest)
WITH ranked AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY business_key
           ORDER BY updated_at DESC, id ASC
         ) AS rn
  FROM my_table
)
SELECT * FROM ranked WHERE rn = 1; -- keepers

-- Range and allowed set checks
SELECT *
FROM my_table
WHERE amount < 0 OR amount > 10000
   OR currency IS NULL OR currency NOT IN ('USD','EUR','GBP');

-- Cross-field range example
SELECT *
FROM orders
WHERE ship_date < order_date;  -- invalid if shipping before ordering

In an ETL pipeline (simple flow)
  1. Load raw data into a staging table as-is.
  2. Run validation queries to produce two sets: valid_rows, error_rows (with reason).
  3. Insert valid_rows into the model; store error_rows in an error table for review.
  4. Emit metrics: counts and percentages of nulls, duplicates, and range violations (a sketch follows this list).
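
Step 4 can be a single aggregate over the staging table plus a separate duplicate count. A sketch, assuming a hypothetical staging_transactions table with the columns used in the exercises below; date casting and percentage arithmetic may need warehouse-specific tweaks:

-- Daily counts and percentages of rule violations (nulls and ranges)
SELECT
  CAST(created_at AS DATE) AS load_date,
  COUNT(*) AS total_rows,
  SUM(CASE WHEN user_id IS NULL OR amount IS NULL OR currency IS NULL
           THEN 1 ELSE 0 END) AS null_violations,
  SUM(CASE WHEN amount < 0 OR amount > 10000
           THEN 1 ELSE 0 END) AS range_violations,
  100.0 * SUM(CASE WHEN user_id IS NULL OR amount IS NULL OR currency IS NULL
                   THEN 1 ELSE 0 END) / COUNT(*) AS null_pct
FROM staging_transactions
GROUP BY CAST(created_at AS DATE)
ORDER BY load_date;

-- Duplicate business keys, counted separately
SELECT COUNT(*) AS duplicated_ids
FROM (
  SELECT id
  FROM staging_transactions
  GROUP BY id
  HAVING COUNT(*) > 1
) d;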

Worked examples

Example 1 — Null checks on required columns

Suppose users(id PK, email, created_at) must have non-null id and email.

-- Detect rows with nulls in required columns
SELECT id, email, created_at,
  CASE 
    WHEN id IS NULL THEN 'id_null'
    WHEN email IS NULL THEN 'email_null'
  END AS reason
FROM staging_users
WHERE id IS NULL OR email IS NULL;

-- Quick profile
SELECT 
  SUM(CASE WHEN id IS NULL THEN 1 ELSE 0 END) AS null_id,
  SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) AS null_email
FROM staging_users;
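
A possible remediation step after detection (a sketch; users_clean is an assumed target table, and defaulting created_at is just one policy choice):

-- Load rows with the required fields present; default the optional timestamp
INSERT INTO users_clean (id, email, created_at)
SELECT id,
       email,
       COALESCE(created_at, CURRENT_TIMESTAMP)  -- default an optional field
FROM staging_users
WHERE id IS NOT NULL
  AND email IS NOT NULL;                        -- required fields must be present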

Example 2 — Duplicate detection and canonicalization

Keep one row per normalized email (LOWER(TRIM(email))), preferring the latest updated_at and breaking ties with the lowest id.

WITH normalized AS (
  SELECT *, LOWER(TRIM(email)) AS email_n
  FROM users_raw
), ranked AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY email_n
           ORDER BY updated_at DESC, id ASC
         ) AS rn
  FROM normalized
)
SELECT * FROM ranked WHERE rn = 1;  -- canonical keepers
-- Duplicates to review: SELECT * FROM ranked WHERE rn > 1;
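
If the duplicates should actually be removed from users_raw rather than just filtered in a SELECT, one option is a delete of the non-canonical rows (a sketch; it assumes id is non-null and unique per physical row, and DELETE-with-subquery support varies by warehouse):

-- Keep only the canonical row per normalized email; delete the rest
DELETE FROM users_raw
WHERE id NOT IN (
  SELECT id
  FROM (
    SELECT id,
           ROW_NUMBER() OVER (
             PARTITION BY LOWER(TRIM(email))
             ORDER BY updated_at DESC, id ASC
           ) AS rn
    FROM users_raw
  ) ranked
  WHERE rn = 1
);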

Example 3 — Range and cross-field checks

Orders must have 0 ≤ amount ≤ 10000, currency in {USD,EUR,GBP}, and ship_date ≥ order_date.

SELECT order_id, amount, currency, order_date, ship_date,
  CASE 
    WHEN amount < 0 OR amount > 10000 THEN 'amount_out_of_range'
    WHEN currency IS NULL OR currency NOT IN ('USD','EUR','GBP') THEN 'currency_invalid'
    WHEN ship_date < order_date THEN 'ship_before_order'
  END AS reason
FROM orders
WHERE (amount < 0 OR amount > 10000)
   OR (currency IS NULL OR currency NOT IN ('USD','EUR','GBP'))
   OR (ship_date < order_date);
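
The CASE ladder above reports only the first matching reason per row. To capture every failed rule, one option is to concatenate per-rule CASE expressions (a sketch; CONCAT_WS skips NULL arguments in PostgreSQL, MySQL, and SQL Server 2017+, but is not available everywhere):

-- One row per offender, with all failing rules joined by commas
SELECT order_id, amount, currency, order_date, ship_date,
  CONCAT_WS(',',
    CASE WHEN amount < 0 OR amount > 10000 THEN 'amount_out_of_range' END,
    CASE WHEN currency IS NULL OR currency NOT IN ('USD','EUR','GBP') THEN 'currency_invalid' END,
    CASE WHEN ship_date < order_date THEN 'ship_before_order' END
  ) AS reasons
FROM orders
WHERE (amount < 0 OR amount > 10000)
   OR (currency IS NULL OR currency NOT IN ('USD','EUR','GBP'))
   OR (ship_date < order_date);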

Defining solid rules

  • Required columns: id, business keys, foreign keys, timestamps.
  • Uniqueness keys: email, order_id, external_id, composite keys if needed.
  • Allowed sets: enumerations for status/currency/country.
  • Ranges: numeric bounds, date windows, cross-field relations (start ≤ end).
  • Severity and thresholds: error if there are any invalid values in key fields; warning if ≤ 0.5% of values in optional fields are invalid (a sketch follows this list).
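
Severity rules like these can be evaluated with one aggregate per table. A sketch, assuming a hypothetical staging_transactions table where user_id is a key field and created_at is optional:

-- Hard error on any invalid key field; warning below 0.5% invalids in an optional field
SELECT
  CASE WHEN SUM(CASE WHEN user_id IS NULL THEN 1 ELSE 0 END) > 0
       THEN 'error' ELSE 'ok' END AS key_field_status,
  CASE WHEN SUM(CASE WHEN created_at IS NULL THEN 1 ELSE 0 END) = 0
       THEN 'ok'
       WHEN 100.0 * SUM(CASE WHEN created_at IS NULL THEN 1 ELSE 0 END) / COUNT(*) <= 0.5
       THEN 'warning'
       ELSE 'error' END AS optional_field_status
FROM staging_transactions;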

Exercises

Work through these hands-on tasks, then use the checklists to verify your approach. The expected output for Exercise 1 appears near the end of this page.

Exercise 1 — Flag bad rows (nulls, invalid sets, ranges)

Table: transactions(id, user_id, amount, currency, created_at)

Rules:

  • user_id, amount, currency must be NOT NULL
  • amount range: 0 to 10000 inclusive
  • currency allowed: USD, EUR, GBP

Sample data:

id  user_id  amount    currency  created_at
1   10       50.00     USD       2023-01-05
2   11       -5.00     USD       2023-01-06
3   12       12500.00  EUR       2023-01-07
4   13       null      GBP       2023-01-10
5   null     20.00     USD       2023-01-11
6   14       15.00     null      2023-01-12
7   15       30.00     AUD       2023-01-13
8   16       0.00      USD       2023-01-14

Task: Write one SQL that returns rows failing any rule with a reason column.

Checklist
  • One WHERE clause covering all rules
  • Reason is specific (e.g., amount_out_of_range)
  • Correct handling of NULL with allowed sets
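
One possible solution that satisfies the checklist (a sketch; the reason labels are one reasonable naming choice):

SELECT id, user_id, amount, currency,
  CASE
    WHEN user_id IS NULL THEN 'user_id_null'
    WHEN amount IS NULL THEN 'amount_null'
    WHEN currency IS NULL THEN 'currency_null'
    WHEN amount < 0 OR amount > 10000 THEN 'amount_out_of_range'
    WHEN currency NOT IN ('USD','EUR','GBP') THEN 'currency_invalid'
  END AS reason
FROM transactions
WHERE user_id IS NULL
   OR amount IS NULL
   OR currency IS NULL
   OR amount < 0 OR amount > 10000
   OR currency NOT IN ('USD','EUR','GBP');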

Exercise 2 — Detect and keep one canonical user per email

Table: users_raw(id, email, updated_at)

Rule: Keep one row per normalized email = lower(trim(email)). Prefer latest updated_at; break ties by lowest id. Return two result sets: keepers and duplicates.

Sample data:

id  email              updated_at
1   ALICE@example.com  2023-01-01
2   alice@example.com  2023-01-05
3   bob@example.com    2023-01-03
4   bob@example.com    2023-01-02
5   carol@example.com  2023-01-04
6   carol@example.com  2023-01-04
7   dave@example.com   2023-01-06

Task: Produce a query that adds rn and then selects keepers (rn=1) and duplicates (rn>1).

Checklist
  • Normalize email (lower/trim)
  • ROW_NUMBER partitioned by normalized email
  • ORDER BY updated_at DESC, id ASC

Common mistakes and self-check

  • Using NOT IN with a column that may be NULL. Self-check: add an explicit "OR col IS NULL", or apply NOT IN to COALESCE(col, '') with care (see the sketch after this list).
  • Deleting all duplicates instead of keeping one canonical record. Self-check: verify rn=1 row remains per key.
  • Relying only on COUNT to detect duplicates, without isolating the rows. Self-check: produce a ranked result set.
  • Ignoring cross-field rules (e.g., end_date before start_date). Self-check: add at least one cross-field predicate.
  • Hard-coding bounds without business sign-off. Self-check: document each rule with owner and rationale.
  • Letting a few invalids block the whole pipeline. Self-check: define thresholds and quarantine logic for warnings.
  • Not tracking trends. Self-check: log daily invalid counts/percentages.
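
To make the first pitfall concrete, a small sketch against the my_table/currency columns used in the patterns above:

-- NOT IN and NULL: when currency is NULL, the predicate evaluates to NULL
-- (not TRUE), so the row silently slips through
SELECT *
FROM my_table
WHERE currency NOT IN ('USD','EUR','GBP');   -- misses rows with NULL currency

SELECT *
FROM my_table
WHERE currency IS NULL
   OR currency NOT IN ('USD','EUR','GBP');   -- catches them explicitly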

Practical projects

  • Build a validation layer for an orders feed: null/duplicate/range checks, error table with reasons, and a daily metrics summary.
  • Contact list dedup: normalize emails and phone numbers, select canonical contacts, and generate a merge map for downstream systems.
  • IoT sanity checks: enforce timestamp windows, value ranges per sensor type, and cross-sensor consistency rules.

Who this is for

  • ETL Developers and Data Engineers building staging and transformation layers.
  • Analytics Engineers adding data quality guardrails.
  • QA/Ops teams monitoring data pipelines.

Prerequisites

  • Comfort with SQL SELECT, WHERE, GROUP BY, window functions.
  • Basic understanding of ETL staging vs. curated layers.
  • Familiarity with your warehouse’s NULL and comparison semantics.

Learning path

  1. Write detection queries for nulls, duplicates, and ranges on a small table.
  2. Add reasons and counts; document rules and owners.
  3. Automate: produce valid vs. error datasets in your pipeline.
  4. Add thresholds (error vs warning) and daily metrics.
  5. Extend to cross-field and cross-table checks (FK existence).

Next steps

  • Add referential integrity checks (FK presence in dimension tables); a minimal anti-join sketch follows this list.
  • Introduce schema constraints or contracts so producers know the rules.
  • Build dashboards for data quality KPIs and alerts.
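
For the first item, a referential-integrity check is usually an anti-join. A minimal sketch, assuming hypothetical orders.customer_id and dim_customers.customer_id columns:

-- Orders whose customer_id has no matching dimension row
SELECT o.order_id, o.customer_id
FROM orders o
LEFT JOIN dim_customers c
  ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;   -- no match found in the dimension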

Mini challenge

Pick one of your staging tables. Define 3–5 rules mixing null, duplicate, and range checks. Implement one SQL that returns offending rows with a reason. Bonus: add a ranking to keep canonical rows for one business key.

Expected output (Exercise 1)

Rows with id 2, 3, 4, 5, 6, and 7 are returned. Reasons: amount_out_of_range (2, 3), amount_null (4), user_id_null (5), currency_null (6), currency_invalid (7).

Null, Duplicate, and Range Checks — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

