Why this matters
As an ETL Developer, you ship reliable datasets. Three fast checks catch most issues early:
- Null checks: prevent missing keys, dates, and amounts that break downstream joins and reports.
- Duplicate checks: stop inflated metrics by keeping one canonical record per business key (e.g., email, order_id).
- Range checks: enforce sensible numeric and date bounds (e.g., age 0–120, amount 0–10,000).
These checks fit naturally into staging/validation steps and protect dashboards, ML features, and financial reporting.
Concept explained simply
- Null check: Are required fields present? If not, the row is incomplete.
- Duplicate check: Do multiple rows represent the same business entity? If yes, keep one canonical row.
- Range check: Are values within allowed bounds or valid sets? If not, flag them.
Mental model
Imagine three gates before data enters the warehouse:
- Gate 1 (Nulls): Missing essentials get stopped.
- Gate 2 (Duplicates): Only one representative of each entity proceeds.
- Gate 3 (Ranges): Out-of-bounds values are turned back.
If a row fails any gate, you either fix it, quarantine it, or reject it.
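The three gates can be sketched as one small row-validation function. This is a minimal Python sketch, not a production implementation; the field names (`user_id`, `amount`, `currency`) and bounds are illustrative, not tied to a specific schema:

```python
# Minimal sketch of the three gates. Field names, the business key,
# and the bounds below are illustrative assumptions.

REQUIRED = {"user_id", "amount", "currency"}   # Gate 1: required fields
ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}     # Gate 3: allowed set

def validate_row(row, seen_keys):
    """Return a list of failure reasons; an empty list means the row passes."""
    reasons = []
    # Gate 1: required fields must be present
    for field in REQUIRED:
        if row.get(field) is None:
            reasons.append(f"{field}_null")
    # Gate 2: only one representative per business key (here: user_id)
    key = row.get("user_id")
    if key is not None:
        if key in seen_keys:
            reasons.append("duplicate_key")
        seen_keys.add(key)
    # Gate 3: ranges and allowed sets
    amount = row.get("amount")
    if amount is not None and not (0 <= amount <= 10000):
        reasons.append("amount_out_of_range")
    currency = row.get("currency")
    if currency is not None and currency not in ALLOWED_CURRENCIES:
        reasons.append("currency_invalid")
    return reasons
```

Rows that come back with reasons are candidates for quarantine or rejection; rows with an empty list proceed.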
Implementation fundamentals
- Define rules: Column-level requirements (NOT NULL), uniqueness keys, allowed value sets, numeric/date bounds.
- Write detection queries: SQL patterns to count and isolate offending rows.
- Decide severity: Error (stop load) vs Warning (quarantine and continue). Set thresholds.
- Remediate: Standardize, deduplicate, default, or route to error table for review.
- Monitor: Track daily counts, percentages, and trends.
Reusable SQL patterns
-- Nulls per column (profile)
SELECT
  SUM(CASE WHEN col1 IS NULL THEN 1 ELSE 0 END) AS nulls_col1,
  SUM(CASE WHEN col2 IS NULL THEN 1 ELSE 0 END) AS nulls_col2
FROM my_table;
-- Duplicate keys (count and isolate)
SELECT business_key, COUNT(*) AS cnt
FROM my_table
GROUP BY business_key
HAVING COUNT(*) > 1;
-- Canonical row with tie-breakers (keep latest)
WITH ranked AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY business_key
           ORDER BY updated_at DESC, id ASC
         ) AS rn
  FROM my_table
)
SELECT * FROM ranked WHERE rn = 1; -- keepers
-- Range and allowed set checks
SELECT *
FROM my_table
WHERE amount < 0 OR amount > 10000
OR currency IS NULL OR currency NOT IN ('USD','EUR','GBP');
-- Cross-field range example
SELECT *
FROM orders
WHERE ship_date < order_date; -- invalid if shipping before ordering
In an ETL pipeline (simple flow)
- Load raw data into a staging table as-is.
- Run validation queries to produce two sets: valid_rows, error_rows (with reason).
- Insert valid_rows into the model; store error_rows in an error table for review.
- Emit metrics: counts of nulls, duplicates, range violations.
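The staging → validate → split flow above can be sketched end to end with SQLite. Table names, columns, and the two rules shown are illustrative assumptions, chosen only to show the routing pattern:

```python
import sqlite3

# Sketch of the staging flow; schema and rule thresholds are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging_orders (order_id INTEGER, amount REAL, currency TEXT);
    CREATE TABLE orders        (order_id INTEGER, amount REAL, currency TEXT);
    CREATE TABLE order_errors  (order_id INTEGER, amount REAL, currency TEXT, reason TEXT);
    INSERT INTO staging_orders VALUES
        (1, 50.0, 'USD'),
        (2, -5.0, 'USD'),   -- amount out of range
        (3, 30.0, NULL);    -- currency missing
""")

# Route invalid rows, with a reason, to the error table...
conn.execute("""
    INSERT INTO order_errors
    SELECT order_id, amount, currency,
           CASE
             WHEN amount < 0 OR amount > 10000 THEN 'amount_out_of_range'
             WHEN currency IS NULL THEN 'currency_null'
           END AS reason
    FROM staging_orders
    WHERE amount < 0 OR amount > 10000 OR currency IS NULL
""")
# ...and load only the valid rows into the model.
conn.execute("""
    INSERT INTO orders
    SELECT order_id, amount, currency FROM staging_orders
    WHERE amount BETWEEN 0 AND 10000 AND currency IS NOT NULL
""")

# Emit simple metrics for monitoring.
valid = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
errors = conn.execute("SELECT COUNT(*) FROM order_errors").fetchone()[0]
print(f"valid={valid} errors={errors}")  # prints: valid=1 errors=2
```

The same two-query pattern (one INSERT for errors with reasons, one for valid rows) scales to any warehouse; only the rule predicates change.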
Worked examples
Example 1 — Null checks on required columns
Suppose users(id PK, email, created_at) must have non-null id and email.
-- Detect rows with nulls in required columns
SELECT id, email, created_at,
       CASE
         WHEN id IS NULL THEN 'id_null'
         WHEN email IS NULL THEN 'email_null'
       END AS reason
FROM staging_users
WHERE id IS NULL OR email IS NULL;
-- Quick profile
SELECT
  SUM(CASE WHEN id IS NULL THEN 1 ELSE 0 END) AS null_id,
  SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) AS null_email
FROM staging_users;
Example 2 — Duplicate detection and canonicalization
Keep one row per normalized email (lower-trim), prefer latest updated_at, then lowest id.
WITH normalized AS (
  SELECT *, LOWER(TRIM(email)) AS email_n
  FROM users_raw
),
ranked AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY email_n
           ORDER BY updated_at DESC, id ASC
         ) AS rn
  FROM normalized
)
SELECT * FROM ranked WHERE rn = 1; -- canonical keepers
-- Duplicates to review: SELECT * FROM ranked WHERE rn > 1;
Example 3 — Range and cross-field checks
Orders must have 0 ≤ amount ≤ 10000, currency in {USD,EUR,GBP}, and ship_date ≥ order_date.
SELECT order_id, amount, currency, order_date, ship_date,
       CASE
         WHEN amount < 0 OR amount > 10000 THEN 'amount_out_of_range'
         WHEN currency IS NULL OR currency NOT IN ('USD','EUR','GBP') THEN 'currency_invalid'
         WHEN ship_date < order_date THEN 'ship_before_order'
       END AS reason
FROM orders
WHERE (amount < 0 OR amount > 10000)
   OR (currency IS NULL OR currency NOT IN ('USD','EUR','GBP'))
   OR (ship_date < order_date);
Defining solid rules
- Required columns: id, business keys, foreign keys, timestamps.
- Uniqueness keys: email, order_id, external_id, composite keys if needed.
- Allowed sets: enumerations for status/currency/country.
- Ranges: numeric bounds, date windows, cross-field relations (start ≤ end).
- Severity and thresholds: error (stop the load) if any key field is invalid; warning (quarantine and continue) if invalids in optional fields stay at or below 0.5%.
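Rules like these can be kept as data rather than hard-coded in queries, so the owner, severity, and rationale travel with each check. A minimal sketch; the rule names, owners, and thresholds here are hypothetical:

```python
# Sketch: rules as data, each carrying an owner and a severity, so that
# sign-off and thresholds are documented alongside the check itself.
# All names, sets, and bounds below are illustrative assumptions.

RULES = [
    {"name": "amount_out_of_range", "column": "amount", "severity": "error",
     "owner": "finance", "check": lambda v: v is not None and 0 <= v <= 10000},
    {"name": "currency_invalid", "column": "currency", "severity": "error",
     "owner": "finance", "check": lambda v: v in {"USD", "EUR", "GBP"}},
    {"name": "status_invalid", "column": "status", "severity": "warning",
     "owner": "ops", "check": lambda v: v in {"new", "shipped", "cancelled"}},
]

def apply_rules(row):
    """Return (errors, warnings): lists of failed rule names for one row."""
    errors, warnings = [], []
    for rule in RULES:
        if not rule["check"](row.get(rule["column"])):
            (errors if rule["severity"] == "error" else warnings).append(rule["name"])
    return errors, warnings
```

Keeping rules in one structure also makes the daily metrics step trivial: iterate the same list and count failures per rule.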
Exercises
Work through these hands-on tasks, then use the checklist under each exercise to verify your approach.
Exercise 1 — Flag bad rows (nulls, invalid sets, ranges)
Table: transactions(id, user_id, amount, currency, created_at)
Rules:
- user_id, amount, currency must be NOT NULL
- amount range: 0 to 10000 inclusive
- currency allowed: USD, EUR, GBP
Sample data:
| id | user_id | amount | currency | created_at |
|---|---|---|---|---|
| 1 | 10 | 50.00 | USD | 2023-01-05 |
| 2 | 11 | -5.00 | USD | 2023-01-06 |
| 3 | 12 | 12500.00 | EUR | 2023-01-07 |
| 4 | 13 | null | GBP | 2023-01-10 |
| 5 | null | 20.00 | USD | 2023-01-11 |
| 6 | 14 | 15.00 | null | 2023-01-12 |
| 7 | 15 | 30.00 | AUD | 2023-01-13 |
| 8 | 16 | 0.00 | USD | 2023-01-14 |
Task: Write one SQL that returns rows failing any rule with a reason column.
Checklist
- One WHERE clause covering all rules
- Reason is specific (e.g., amount_out_of_range)
- Correct handling of NULL with allowed sets
Exercise 2 — Detect and keep one canonical user per email
Table: users_raw(id, email, updated_at)
Rule: Keep one row per normalized email = lower(trim(email)). Prefer latest updated_at; break ties by lowest id. Return two result sets: keepers and duplicates.
Sample data:
| id | email | updated_at |
|---|---|---|
| 1 | ALICE@example.com | 2023-01-01 |
| 2 | alice@example.com | 2023-01-05 |
| 3 | bob@example.com | 2023-01-03 |
| 4 | bob@example.com | 2023-01-02 |
| 5 | carol@example.com | 2023-01-04 |
| 6 | carol@example.com | 2023-01-04 |
| 7 | dave@example.com | 2023-01-06 |
Task: Produce a query that adds rn and then selects keepers (rn=1) and duplicates (rn>1).
Checklist
- Normalize email (lower/trim)
- ROW_NUMBER partitioned by normalized email
- ORDER BY updated_at DESC, id ASC
Common mistakes and self-check
- Using NOT IN with a column that may be NULL. Self-check: add explicit "OR col IS NULL" or use NOT IN on COALESCE(col,'') with care.
- Deleting all duplicates instead of keeping one canonical record. Self-check: verify rn=1 row remains per key.
- Relying only on COUNT to detect duplicates, without isolating the rows. Self-check: produce a ranked result set.
- Ignoring cross-field rules (e.g., end_date before start_date). Self-check: add at least one cross-field predicate.
- Hard-coding bounds without business sign-off. Self-check: document each rule with owner and rationale.
- Letting a few invalids block the whole pipeline. Self-check: define thresholds and quarantine logic for warnings.
- Not tracking trends. Self-check: log daily invalid counts/percentages.
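The first pitfall (NOT IN against a nullable column) is easy to reproduce: `NULL NOT IN (...)` evaluates to NULL under three-valued logic, which is not true, so the WHERE clause silently skips the NULL row. A quick SQLite demonstration with an illustrative table:

```python
import sqlite3

# Reproduces the NOT IN / NULL pitfall: the naive check never
# matches the NULL row, so it goes unflagged.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (id INTEGER, currency TEXT);
    INSERT INTO t VALUES (1, 'USD'), (2, 'AUD'), (3, NULL);
""")

# Naive check: row 3 (NULL currency) is silently missing.
naive = conn.execute(
    "SELECT id FROM t WHERE currency NOT IN ('USD','EUR','GBP') ORDER BY id"
).fetchall()

# Correct check: handle NULL explicitly.
correct = conn.execute(
    "SELECT id FROM t "
    "WHERE currency IS NULL OR currency NOT IN ('USD','EUR','GBP') ORDER BY id"
).fetchall()

print(naive)    # [(2,)]          -- the NULL row went unflagged
print(correct)  # [(2,), (3,)]
```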
Practical projects
- Build a validation layer for an orders feed: null/duplicate/range checks, error table with reasons, and a daily metrics summary.
- Contact list dedup: normalize emails and phone numbers, select canonical contacts, and generate a merge map for downstream systems.
- IoT sanity checks: enforce timestamp windows, value ranges per sensor type, and cross-sensor consistency rules.
Who this is for
- ETL Developers and Data Engineers building staging and transformation layers.
- Analytics Engineers adding data quality guardrails.
- QA/Ops teams monitoring data pipelines.
Prerequisites
- Comfort with SQL SELECT, WHERE, GROUP BY, window functions.
- Basic understanding of ETL staging vs. curated layers.
- Familiarity with your warehouse’s NULL and comparison semantics.
Learning path
- Write detection queries for nulls, duplicates, and ranges on a small table.
- Add reasons and counts; document rules and owners.
- Automate: produce valid vs. error datasets in your pipeline.
- Add thresholds (error vs warning) and daily metrics.
- Extend to cross-field and cross-table checks (FK existence).
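The cross-table (FK existence) check mentioned above follows the same detection-query pattern: LEFT JOIN the fact table to the dimension and keep the rows with no match. A SQLite sketch with illustrative table names:

```python
import sqlite3

# Sketch of an FK-existence check: find fact rows whose key has no
# matching dimension row. Schema and data are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_users (user_id INTEGER PRIMARY KEY);
    CREATE TABLE fact_orders (order_id INTEGER, user_id INTEGER);
    INSERT INTO dim_users VALUES (10), (11);
    INSERT INTO fact_orders VALUES (1, 10), (2, 99), (3, NULL);
""")

# LEFT JOIN and test for the missing dimension row; a NULL fact key
# also fails the join, so it is caught by the same predicate.
orphans = conn.execute("""
    SELECT f.order_id
    FROM fact_orders f
    LEFT JOIN dim_users d ON f.user_id = d.user_id
    WHERE d.user_id IS NULL
    ORDER BY f.order_id
""").fetchall()
print(orphans)  # [(2,), (3,)] -- unknown user and NULL key
```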
Next steps
- Add referential integrity checks (FK presence in dimension tables).
- Introduce schema constraints or contracts so producers know the rules.
- Build dashboards for data quality KPIs and alerts.
Mini challenge
Pick one of your staging tables. Define 3–5 rules mixing null, duplicate, and range checks. Implement one SQL that returns offending rows with a reason. Bonus: add a ranking to keep canonical rows for one business key.