
Null, Duplicate, and Range Checks

Learn null, duplicate, and range checks for free with explanations, exercises, and a quick test (for Data Engineers).

Published: January 8, 2026 | Updated: January 8, 2026

Why this matters

Null, duplicate, and range checks are the fastest, highest-impact data quality gates. They catch broken joins, bad extractions, schema drift, and unexpected values before they poison dashboards, ML features, and downstream jobs.

  • Daily tasks: validate new ingestion tables, protect primary keys, enforce business rules (e.g., non-negative prices).
  • On-call: isolate bad batches quickly by comparing today’s null/duplicate/range metrics to baselines.
  • Governance: prove data fitness with simple, auditable checks and metrics.

Concept explained simply

Think of your dataset like a shipping manifest:

  • Null checks: missing fields on packages (no address).
  • Duplicate checks: the same package listed twice.
  • Range checks: impossible weights or delivery dates (negative weight, date in year 1900).

Mental model

Use three lenses on every table:

  • Completeness: required fields are present (null checks).
  • Uniqueness: identifiers are unique (duplicate checks).
  • Validity: values fit allowed domains (range checks).

Tip: Where to run these checks
  • Ingestion (bronze/raw): verify schema and gross errors early.
  • Transform (silver/curated): enforce business rules.
  • Serve (gold/marts): block promotions to production if checks fail.

Core checks and patterns

Null checks

  • Required columns must be non-null (e.g., order_id, order_date).
  • Conditional null rules (e.g., cancelled_at is required when status = 'cancelled').

-- Count nulls per column
SELECT 'order_id' AS column_name, COUNT(*) FILTER (WHERE order_id IS NULL) AS nulls FROM orders
UNION ALL
SELECT 'order_date', COUNT(*) FILTER (WHERE order_date IS NULL) FROM orders;
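
COUNT(*) FILTER (WHERE ...) is supported by PostgreSQL and DuckDB but not by every engine; a portable equivalent, as a minimal sketch:
-- Portable null count without FILTER
SELECT
  SUM(CASE WHEN order_id IS NULL THEN 1 ELSE 0 END) AS null_order_id,
  SUM(CASE WHEN order_date IS NULL THEN 1 ELSE 0 END) AS null_order_date
FROM orders;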

Duplicate checks

  • Duplicate keys: same primary/business key appears more than once.
  • Near-duplicates: same customer_id + order_date + amount within a small window (see the sketch after the query below).

-- Find duplicate primary keys
SELECT order_id, COUNT(*) AS cnt
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1;
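
The same GROUP BY pattern extends to near-duplicates by grouping on the composite business key; a minimal sketch, assuming exact matches on customer_id, order_date, and amount stand in for the "small window":
-- Near-duplicate candidates: same customer, date, and amount more than once
SELECT customer_id, order_date, amount, COUNT(*) AS cnt
FROM orders
GROUP BY customer_id, order_date, amount
HAVING COUNT(*) > 1;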

Range checks

  • Numeric ranges (0 ≤ amount ≤ 10,000).
  • Date ranges (order_date not in the future, not too far in the past).
  • Enumerations (status IN ('new','shipped','cancelled')).

-- Out-of-range examples
SELECT *
FROM orders
WHERE amount < 0 OR amount > 10000
   OR order_date > CURRENT_DATE
   OR status NOT IN ('new','shipped','cancelled');

Worked examples

Example 1 — Null checks with conditional logic (SQL)
-- Required fields must be present
SELECT
  COUNT(*) FILTER (WHERE order_id IS NULL) AS null_order_id,
  COUNT(*) FILTER (WHERE customer_id IS NULL) AS null_customer_id,
  COUNT(*) FILTER (WHERE order_date IS NULL) AS null_order_date
FROM orders;

-- If status = 'cancelled', cancelled_at must be not null
SELECT COUNT(*) AS violations
FROM orders
WHERE status = 'cancelled' AND cancelled_at IS NULL;

Expected: all required fields have 0 nulls; conditional rule has 0 violations.

Example 2 — Duplicate detection and safe dedup (SQL)
-- Identify duplicate primary keys
SELECT order_id, COUNT(*) AS cnt
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1;

-- Safe dedup: rank every row per key, most recent update first
WITH ranked AS (
  SELECT o.*,
         ROW_NUMBER() OVER (
           PARTITION BY order_id
           ORDER BY updated_at DESC NULLS LAST
         ) AS rn
  FROM orders o
)
-- Keep the most recent record per order_id; use the result to build a clean table
SELECT * FROM ranked WHERE rn = 1;

Pattern: pick a deterministic rule (e.g., latest updated_at) and keep rn = 1.
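
On engines that support QUALIFY (for example Snowflake, BigQuery, or DuckDB), the same dedup fits in one statement; a sketch under that assumption:
-- Equivalent dedup using QUALIFY where the engine supports it
SELECT *
FROM orders
QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) = 1;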

Example 3 — Range checks at write-time (constraints)
-- Example with constraints (syntax may vary by database)
ALTER TABLE orders
  ADD CONSTRAINT chk_amount_range CHECK (amount >= 0 AND amount <= 10000),
  ADD CONSTRAINT chk_valid_status CHECK (status IN ('new','shipped','cancelled'));

-- Optional: soft constraints via data validation queries before inserts

If you cannot use constraints (e.g., data lake), run validation queries before publishing to the serving layer.
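
A minimal pre-publish gate along those lines, assuming a staging table named orders_stage (a hypothetical name):
-- Gate query: publish only if every count below is 0 (orders_stage is hypothetical)
SELECT
  COUNT(*) FILTER (WHERE amount < 0 OR amount > 10000) AS bad_amount,
  COUNT(*) FILTER (WHERE order_date > CURRENT_DATE) AS bad_date,
  COUNT(*) FILTER (WHERE status NOT IN ('new','shipped','cancelled')) AS bad_status
FROM orders_stage;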

Step-by-step: Add checks to a pipeline

  1. List required columns and expected ranges (document assumptions).
  2. Implement queries/notebooks to compute null, duplicate, and range violations.
  3. Emit metrics (counts, rates) and store sample offending rows for debugging.
  4. Fail or quarantine the batch if violations exceed thresholds (a quarantine sketch follows this list).
  5. Schedule checks near ingestion and before publishing to consumers.
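
For step 4, a minimal quarantine sketch; the orders_quarantine table and its schema (the orders columns plus a quarantined_on date) are assumptions:
-- Move offending rows aside instead of dropping them (orders_quarantine is hypothetical)
INSERT INTO orders_quarantine
SELECT o.*, CURRENT_DATE AS quarantined_on
FROM orders o
WHERE amount < 0 OR amount > 10000
   OR order_date > CURRENT_DATE
   OR status NOT IN ('new','shipped','cancelled');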

Minimal metric set to track (a combined query sketch follows this list)
  • Null rate per required column.
  • Duplicate key count.
  • Out-of-range count per rule.
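
A one-pass sketch of these metrics on the orders table from the examples above (metric names are illustrative):
-- Null rate and out-of-range count in one scan
SELECT
  CURRENT_DATE AS run_date,
  AVG(CASE WHEN order_date IS NULL THEN 1.0 ELSE 0.0 END) AS null_rate_order_date,
  COUNT(*) FILTER (WHERE amount < 0 OR amount > 10000) AS out_of_range_amount
FROM orders;

-- Duplicate key count needs a grouped subquery
SELECT COUNT(*) AS duplicate_key_count
FROM (SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) > 1) d;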

Exercises

These mirror the interactive exercises below. Try them now, then compare with solutions.

  1. Exercise 1 (SQL): Inspect a tiny orders table for nulls, duplicates, and range issues. Produce counts and the list of offending rows.
  2. Exercise 2 (PySpark): Validate a DataFrame with the same rules. Return three DataFrames: null_violations, duplicate_keys, range_violations.

Checklist before you submit

  • [ ] You counted nulls for each required column.
  • [ ] You identified duplicate order_id values.
  • [ ] You flagged out-of-range amounts and invalid statuses.
  • [ ] You can explain how you would quarantine bad rows.

Common mistakes and how to self-check

  • Only checking nulls on a subset of columns. Self-check: list all required fields and verify each is tested.
  • Removing duplicates without a deterministic rule. Self-check: is your dedup rule stable (e.g., latest updated_at) and reproducible?
  • Range rules too strict. Self-check: review historical distributions; set ranges that match reality plus a safety margin (see the percentile sketch after this list).
  • Ignoring conditional rules (e.g., cancelled_at). Self-check: add at least one conditional null rule per table with statuses.
  • Not surfacing metrics. Self-check: can you see today vs. last 7 days violation counts?
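
For the range-rule bullet above, one way to ground ranges in history is to query extreme percentiles; a sketch using PERCENTILE_CONT, which PostgreSQL and several other engines support:
-- Candidate bounds from historical data; widen by a safety margin before enforcing
SELECT
  PERCENTILE_CONT(0.001) WITHIN GROUP (ORDER BY amount) AS p_low,
  PERCENTILE_CONT(0.999) WITHIN GROUP (ORDER BY amount) AS p_high
FROM orders;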

Practical projects

  • Build a daily validation job that reads a raw orders table, computes null/duplicate/range metrics, writes a violations table, and stops promotion if thresholds are exceeded.
  • Create a reusable SQL snippet or notebook template for these three checks. Parameterize the table name, key column, and ranges.
  • Add a small dashboard with three sparkline metrics: null rate (order_date), duplicate key count, out-of-range amount count.

Who this is for

  • Data Engineers who maintain ingestion and transformation pipelines.
  • Analytics Engineers who publish marts and want reliable joins.
  • Data Scientists who depend on clean feature tables.

Prerequisites

  • Basic SQL (SELECT, GROUP BY, window functions).
  • Familiarity with your storage/compute environment (warehouse or lakehouse).
  • Optional: PySpark or pandas for DataFrame-based checks.

Learning path

  • Start: Null, duplicate, and range checks (this subskill).
  • Next: Schema drift detection and type validation.
  • Then: Freshness/volume monitoring and anomaly detection.
  • Finally: Quarantine strategies and automated remediation.

Next steps

  • Automate today’s checks in one critical table.
  • Add thresholds and make the pipeline fail fast on severe violations.
  • Document rules in your repo so others can extend them.

Mini challenge

You discover a surge in duplicate keys today, but nulls and ranges look normal. In one paragraph, outline how you would triage: where you look, which queries you run, and how you decide whether to block the publish step.

Practice Exercises

2 exercises to complete

Instructions

You have a small orders table:

CREATE TEMP TABLE orders (
  order_id     INT,
  customer_id  INT,
  order_date   DATE,
  amount       DECIMAL(10,2),
  status       TEXT,
  updated_at   TIMESTAMP,
  cancelled_at TIMESTAMP
);
INSERT INTO orders VALUES
  (1, 10, DATE '2025-01-01', 49.99, 'new',        TIMESTAMP '2025-01-01 10:00', NULL),
  (2, 11, NULL,               15.00, 'shipped',    TIMESTAMP '2025-01-02 09:00', NULL),
  (2, 11, DATE '2025-01-02',  15.00, 'shipped',    TIMESTAMP '2025-01-02 12:00', NULL),
  (3, 12, DATE '2025-01-03', -5.00,  'new',        TIMESTAMP '2025-01-03 08:00', NULL),
  (4, NULL, DATE '2025-01-04', 8.00, 'cancelled',  TIMESTAMP '2025-01-04 11:00', NULL),
  (5, 13, DATE '2099-01-01', 12.00, 'unknown',     TIMESTAMP '2025-01-05 11:00', NULL);

Write SQL that returns:

  1. Counts of nulls per required column (order_id, customer_id, order_date).
  2. Duplicate order_id values with their counts.
  3. All rows violating range/domain rules: amount < 0 OR amount > 10000, order_date > current_date, or status NOT IN ('new','shipped','cancelled').
  4. Conditional-rule violations: status = 'cancelled' but cancelled_at IS NULL.

Expected Output
  • Null counts: order_id=0, customer_id=1, order_date=1.
  • Duplicates: order_id=2 has 2 rows.
  • Range/domain violations: order_id 3 (amount -5.00) and order_id 5 (future date and invalid status 'unknown').
  • Conditional-rule violations: order_id=4 (cancelled but cancelled_at is NULL).

Null, Duplicate, and Range Checks — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
