Why this matters
As an ETL Developer, you ship reliable datasets. Three fast checks catch most issues early:
- Null checks: prevent missing keys, dates, and amounts that break downstream joins and reports.
- Duplicate checks: stop inflated metrics by keeping one canonical record per business key (e.g., email, order_id).
- Range checks: enforce sensible numeric and date bounds (e.g., age 0–120, amount 0–10,000).
These checks fit naturally into staging/validation steps and protect dashboards, ML features, and financial reporting.
Concept explained simply
- Null check: Are required fields present? If not, the row is incomplete.
- Duplicate check: Do multiple rows represent the same business entity? If yes, keep one canonical row.
- Range check: Are values within allowed bounds or valid sets? If not, flag them.
Mental model
Imagine three gates before data enters the warehouse:
- Gate 1 (Nulls): Missing essentials get stopped.
- Gate 2 (Duplicates): Only one representative of each entity proceeds.
- Gate 3 (Ranges): Out-of-bounds values are turned back.
If a row fails any gate, you either fix it, quarantine it, or reject it.
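The three gates can be sketched as one small row-validation function. This is a minimal Python sketch, not a production implementation; the field names (`user_id`, `amount`, `currency`) and bounds are illustrative, not tied to a specific schema:

```python
# Minimal sketch of the three gates. Field names, the business key,
# and the bounds below are illustrative assumptions.

REQUIRED = {"user_id", "amount", "currency"}   # Gate 1: required fields
ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}     # Gate 3: allowed set

def validate_row(row, seen_keys):
    """Return a list of failure reasons; an empty list means the row passes."""
    reasons = []
    # Gate 1: required fields must be present
    for field in REQUIRED:
        if row.get(field) is None:
            reasons.append(f"{field}_null")
    # Gate 2: only one representative per business key (here: user_id)
    key = row.get("user_id")
    if key is not None:
        if key in seen_keys:
            reasons.append("duplicate_key")
        seen_keys.add(key)
    # Gate 3: ranges and allowed sets
    amount = row.get("amount")
    if amount is not None and not (0 <= amount <= 10000):
        reasons.append("amount_out_of_range")
    currency = row.get("currency")
    if currency is not None and currency not in ALLOWED_CURRENCIES:
        reasons.append("currency_invalid")
    return reasons
```

Rows that come back with reasons are candidates for quarantine or rejection; rows with an empty list proceed.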
Implementation fundamentals
- Define rules: Column-level requirements (NOT NULL), uniqueness keys, allowed value sets, numeric/date bounds.
- Write detection queries: SQL patterns to count and isolate offending rows.
- Decide severity: Error (stop load) vs Warning (quarantine and continue). Set thresholds.
- Remediate: Standardize, deduplicate, default, or route to error table for review.
- Monitor: Track daily counts, percentages, and trends.
Reusable SQL patterns
-- Nulls per column (profile)
SELECT
  SUM(CASE WHEN col1 IS NULL THEN 1 ELSE 0 END) AS nulls_col1,
  SUM(CASE WHEN col2 IS NULL THEN 1 ELSE 0 END) AS nulls_col2
FROM my_table;
-- Duplicate keys (count and isolate)
SELECT business_key, COUNT(*) AS cnt
FROM my_table
GROUP BY business_key
HAVING COUNT(*) > 1;
-- Canonical row with tie-breakers (keep latest)
WITH ranked AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY business_key
           ORDER BY updated_at DESC, id ASC
         ) AS rn
  FROM my_table
)
SELECT * FROM ranked WHERE rn = 1; -- keepers
-- Range and allowed set checks
SELECT *
FROM my_table
WHERE amount < 0 OR amount > 10000
OR currency IS NULL OR currency NOT IN ('USD','EUR','GBP');
-- Cross-field range example
SELECT *
FROM orders
WHERE ship_date < order_date; -- invalid if shipping before ordering
In an ETL pipeline (simple flow)
- Load raw data into a staging table as-is.
- Run validation queries to produce two sets: valid_rows, error_rows (with reason).
- Insert valid_rows into the model; store error_rows in an error table for review.
- Emit metrics: counts of nulls, duplicates, range violations.
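The staging → validate → split flow above can be sketched end to end with SQLite. Table names, columns, and the two rules shown are illustrative assumptions, chosen only to show the routing pattern:

```python
import sqlite3

# Sketch of the staging flow; schema and rule thresholds are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging_orders (order_id INTEGER, amount REAL, currency TEXT);
    CREATE TABLE orders        (order_id INTEGER, amount REAL, currency TEXT);
    CREATE TABLE order_errors  (order_id INTEGER, amount REAL, currency TEXT, reason TEXT);
    INSERT INTO staging_orders VALUES
        (1, 50.0, 'USD'),
        (2, -5.0, 'USD'),   -- amount out of range
        (3, 30.0, NULL);    -- currency missing
""")

# Route invalid rows, with a reason, to the error table...
conn.execute("""
    INSERT INTO order_errors
    SELECT order_id, amount, currency,
           CASE
             WHEN amount < 0 OR amount > 10000 THEN 'amount_out_of_range'
             WHEN currency IS NULL THEN 'currency_null'
           END AS reason
    FROM staging_orders
    WHERE amount < 0 OR amount > 10000 OR currency IS NULL
""")
# ...and load only the valid rows into the model.
conn.execute("""
    INSERT INTO orders
    SELECT order_id, amount, currency FROM staging_orders
    WHERE amount BETWEEN 0 AND 10000 AND currency IS NOT NULL
""")

# Emit simple metrics for monitoring.
valid = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
errors = conn.execute("SELECT COUNT(*) FROM order_errors").fetchone()[0]
print(f"valid={valid} errors={errors}")  # prints: valid=1 errors=2
```

The same two-query pattern (one INSERT for errors with reasons, one for valid rows) scales to any warehouse; only the rule predicates change.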
Worked examples
Example 1 — Null checks on required columns
Suppose users(id PK, email, created_at) must have non-null id and email.
-- Detect rows with nulls in required columns
SELECT id, email, created_at,
       CASE
         WHEN id IS NULL THEN 'id_null'
         WHEN email IS NULL THEN 'email_null'
       END AS reason
FROM staging_users
WHERE id IS NULL OR email IS NULL;
-- Quick profile
SELECT
  SUM(CASE WHEN id IS NULL THEN 1 ELSE 0 END) AS null_id,
  SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) AS null_email
FROM staging_users;
Example 2 — Duplicate detection and canonicalization
Keep one row per normalized email (lower-trim), prefer latest updated_at, then lowest id.
WITH normalized AS (
  SELECT *, LOWER(TRIM(email)) AS email_n
  FROM users_raw
),
ranked AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY email_n
           ORDER BY updated_at DESC, id ASC
         ) AS rn
  FROM normalized
)
SELECT * FROM ranked WHERE rn = 1; -- canonical keepers
-- Duplicates to review: SELECT * FROM ranked WHERE rn > 1;
Example 3 — Range and cross-field checks
Orders must have 0 ≤ amount ≤ 10000, currency in {USD,EUR,GBP}, and ship_date ≥ order_date.
SELECT order_id, amount, currency, order_date, ship_date,
       CASE
         WHEN amount < 0 OR amount > 10000 THEN 'amount_out_of_range'
         WHEN currency IS NULL OR currency NOT IN ('USD','EUR','GBP') THEN 'currency_invalid'
         WHEN ship_date < order_date THEN 'ship_before_order'
       END AS reason
FROM orders
WHERE (amount < 0 OR amount > 10000)
   OR (currency IS NULL OR currency NOT IN ('USD','EUR','GBP'))
   OR (ship_date < order_date);
Defining solid rules
- Required columns: id, business keys, foreign keys, timestamps.
- Uniqueness keys: email, order_id, external_id, composite keys if needed.
- Allowed sets: enumerations for status/currency/country.
- Ranges: numeric bounds, date windows, cross-field relations (start ≤ end).
- Severity and thresholds: error (stop the load) if any key field is invalid; warning (quarantine and continue) if invalids in optional fields stay at or below 0.5%.
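Rules like these can be kept as data rather than hard-coded in queries, so the owner, severity, and rationale travel with each check. A minimal sketch; the rule names, owners, and thresholds here are hypothetical:

```python
# Sketch: rules as data, each carrying an owner and a severity, so that
# sign-off and thresholds are documented alongside the check itself.
# All names, sets, and bounds below are illustrative assumptions.

RULES = [
    {"name": "amount_out_of_range", "column": "amount", "severity": "error",
     "owner": "finance", "check": lambda v: v is not None and 0 <= v <= 10000},
    {"name": "currency_invalid", "column": "currency", "severity": "error",
     "owner": "finance", "check": lambda v: v in {"USD", "EUR", "GBP"}},
    {"name": "status_invalid", "column": "status", "severity": "warning",
     "owner": "ops", "check": lambda v: v in {"new", "shipped", "cancelled"}},
]

def apply_rules(row):
    """Return (errors, warnings): lists of failed rule names for one row."""
    errors, warnings = [], []
    for rule in RULES:
        if not rule["check"](row.get(rule["column"])):
            (errors if rule["severity"] == "error" else warnings).append(rule["name"])
    return errors, warnings
```

Keeping rules in one structure also makes the daily metrics step trivial: iterate the same list and count failures per rule.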
Exercises
Work through these hands-on tasks, then use the checklist under each exercise to verify your approach.
Exercise 1 — Flag bad rows (nulls, invalid sets, ranges)
Table: transactions(id, user_id, amount, currency, created_at)
Rules:
- user_id, amount, currency must be NOT NULL
- amount range: 0 to 10000 inclusive
- currency allowed: USD, EUR, GBP
Sample data:
| id | user_id | amount | currency | created_at |
|---|---|---|---|---|
| 1 | 10 | 50.00 | USD | 2023-01-05 |
| 2 | 11 | -5.00 | USD | 2023-01-06 |
| 3 | 12 | 12500.00 | EUR | 2023-01-07 |
| 4 | 13 | null | GBP | 2023-01-10 |
| 5 | null | 20.00 | USD | 2023-01-11 |
| 6 | 14 | 15.00 | null | 2023-01-12 |
| 7 | 15 | 30.00 | AUD | 2023-01-13 |
| 8 | 16 | 0.00 | USD | 2023-01-14 |
Task: Write one SQL that returns rows failing any rule with a reason column.
Checklist
- One WHERE clause covering all rules
- Reason is specific (e.g., amount_out_of_range)
- Correct handling of NULL with allowed sets
Exercise 2 — Detect and keep one canonical user per email
Table: users_raw(id, email, updated_at)
Rule: Keep one row per normalized email = lower(trim(email)). Prefer latest updated_at; break ties by lowest id. Return two result sets: keepers and duplicates.
Sample data:
| id | email | updated_at |
|---|---|---|
| 1 | ALICE@example.com | 2023-01-01 |
| 2 | alice@example.com | 2023-01-05 |
| 3 | bob@example.com | 2023-01-03 |
| 4 | bob@example.com | 2023-01-02 |
| 5 | carol@example.com | 2023-01-04 |
| 6 | carol@example.com | 2023-01-04 |
| 7 | dave@example.com | 2023-01-06 |
Task: Produce a query that adds rn and then selects keepers (rn=1) and duplicates (rn>1).
Checklist
- Normalize email (lower/trim)
- ROW_NUMBER partitioned by normalized email
- ORDER BY updated_at DESC, id ASC
Common mistakes and self-check
- Using NOT IN with a column that may be NULL. Self-check: add explicit "OR col IS NULL" or use NOT IN on COALESCE(col,'') with care.
- Deleting all duplicates instead of keeping one canonical record. Self-check: verify rn=1 row remains per key.
- Relying only on COUNT to detect duplicates, without isolating the rows. Self-check: produce a ranked result set.
- Ignoring cross-field rules (e.g., end_date before start_date). Self-check: add at least one cross-field predicate.
- Hard-coding bounds without business sign-off. Self-check: document each rule with owner and rationale.
- Letting a few invalids block the whole pipeline. Self-check: define thresholds and quarantine logic for warnings.
- Not tracking trends. Self-check: log daily invalid counts/percentages.
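The first pitfall (NOT IN against a nullable column) is easy to reproduce: `NULL NOT IN (...)` evaluates to NULL under three-valued logic, which is not true, so the WHERE clause silently skips the NULL row. A quick SQLite demonstration with an illustrative table:

```python
import sqlite3

# Reproduces the NOT IN / NULL pitfall: the naive check never
# matches the NULL row, so it goes unflagged.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (id INTEGER, currency TEXT);
    INSERT INTO t VALUES (1, 'USD'), (2, 'AUD'), (3, NULL);
""")

# Naive check: row 3 (NULL currency) is silently missing.
naive = conn.execute(
    "SELECT id FROM t WHERE currency NOT IN ('USD','EUR','GBP') ORDER BY id"
).fetchall()

# Correct check: handle NULL explicitly.
correct = conn.execute(
    "SELECT id FROM t "
    "WHERE currency IS NULL OR currency NOT IN ('USD','EUR','GBP') ORDER BY id"
).fetchall()

print(naive)    # [(2,)]          -- the NULL row went unflagged
print(correct)  # [(2,), (3,)]
```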
Practical projects
- Build a validation layer for an orders feed: null/duplicate/range checks, error table with reasons, and a daily metrics summary.
- Contact list dedup: normalize emails and phone numbers, select canonical contacts, and generate a merge map for downstream systems.
- IoT sanity checks: enforce timestamp windows, value ranges per sensor type, and cross-sensor consistency rules.
Who this is for
- ETL Developers and Data Engineers building staging and transformation layers.
- Analytics Engineers adding data quality guardrails.
- QA/Ops teams monitoring data pipelines.
Prerequisites
- Comfort with SQL SELECT, WHERE, GROUP BY, window functions.
- Basic understanding of ETL staging vs. curated layers.
- Familiarity with your warehouse’s NULL and comparison semantics.
Learning path
- Write detection queries for nulls, duplicates, and ranges on a small table.
- Add reasons and counts; document rules and owners.
- Automate: produce valid vs. error datasets in your pipeline.
- Add thresholds (error vs warning) and daily metrics.
- Extend to cross-field and cross-table checks (FK existence).
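The cross-table (FK existence) check mentioned above follows the same detection-query pattern: LEFT JOIN the fact table to the dimension and keep the rows with no match. A SQLite sketch with illustrative table names:

```python
import sqlite3

# Sketch of an FK-existence check: find fact rows whose key has no
# matching dimension row. Schema and data are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_users (user_id INTEGER PRIMARY KEY);
    CREATE TABLE fact_orders (order_id INTEGER, user_id INTEGER);
    INSERT INTO dim_users VALUES (10), (11);
    INSERT INTO fact_orders VALUES (1, 10), (2, 99), (3, NULL);
""")

# LEFT JOIN and test for the missing dimension row; a NULL fact key
# also fails the join, so it is caught by the same predicate.
orphans = conn.execute("""
    SELECT f.order_id
    FROM fact_orders f
    LEFT JOIN dim_users d ON f.user_id = d.user_id
    WHERE d.user_id IS NULL
    ORDER BY f.order_id
""").fetchall()
print(orphans)  # [(2,), (3,)] -- unknown user and NULL key
```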
Next steps
- Add referential integrity checks (FK presence in dimension tables).
- Introduce schema constraints or contracts so producers know the rules.
- Build dashboards for data quality KPIs and alerts.
Mini challenge
Pick one of your staging tables. Define 3–5 rules mixing null, duplicate, and range checks. Implement one SQL that returns offending rows with a reason. Bonus: add a ranking to keep canonical rows for one business key.