Why this matters
Null, duplicate, and range checks are the fastest, highest-impact data quality gates. They catch broken joins, bad extractions, schema drift, and unexpected values before they poison dashboards, ML features, and downstream jobs.
- Daily tasks: validate new ingestion tables, protect primary keys, enforce business rules (e.g., non-negative prices).
- On-call: isolate bad batches quickly by comparing today’s null/duplicate/range metrics to baselines.
- Governance: prove data fitness with simple, auditable checks and metrics.
Concept explained simply
Think of your dataset like a shipping manifest:
- Null checks: missing fields on packages (no address).
- Duplicate checks: the same package listed twice.
- Range checks: impossible weights or delivery dates (a negative weight, a delivery date in the year 1900).
Mental model
Use three lenses on every table:
- Completeness: required fields are present (null checks).
- Uniqueness: identifiers are unique (duplicate checks).
- Validity: values fit allowed domains (range checks).
Tip: Where to run these checks
- Ingestion (bronze/raw): verify schema and gross errors early.
- Transform (silver/curated): enforce business rules.
- Serve (gold/marts): block promotions to production if checks fail.
Core checks and patterns
Null checks
- Required columns must be non-null (e.g., order_id, order_date).
- Conditional null rules (e.g., cancelled_at is required when status = 'cancelled').
-- Count nulls per column
SELECT 'order_id' AS column_name, COUNT(*) FILTER (WHERE order_id IS NULL) AS nulls FROM orders
UNION ALL
SELECT 'order_date', COUNT(*) FILTER (WHERE order_date IS NULL) FROM orders;
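The FILTER clause is not supported by every engine (PostgreSQL has it; MySQL and SQL Server do not). A portable sketch of the same counts using CASE:
-- Portable per-column null counts using CASE (works in most SQL dialects)
SELECT
  SUM(CASE WHEN order_id IS NULL THEN 1 ELSE 0 END) AS null_order_id,
  SUM(CASE WHEN order_date IS NULL THEN 1 ELSE 0 END) AS null_order_date
FROM orders;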
Duplicate checks
- Duplicate keys: same primary/business key appears more than once.
- Near-duplicates: rows that share customer_id, order_date, and amount within a small time window (a sketch follows the query below).
-- Find duplicate primary keys
SELECT order_id, COUNT(*) AS cnt
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1;
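A minimal near-duplicate sketch for the bullet above, grouping on business fields instead of the key. The field combination is an assumption; adjust it to your business rule:
-- Find near-duplicates: same customer, date, and amount under different order_ids
SELECT customer_id, order_date, amount, COUNT(*) AS cnt
FROM orders
GROUP BY customer_id, order_date, amount
HAVING COUNT(*) > 1;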
Range checks
- Numeric ranges (0 ≤ amount ≤ 10,000).
- Date ranges (order_date not in the future, not too far in the past).
- Enumerations (status IN ('new','shipped','cancelled')).
-- Out-of-range examples
SELECT *
FROM orders
WHERE amount < 0 OR amount > 10000
OR order_date > CURRENT_DATE
OR status NOT IN ('new','shipped','cancelled');
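Note: NOT IN uses three-valued logic, so rows with a NULL status are never flagged by the last condition. If status is required, rely on the null checks above or add OR status IS NULL to this query.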
Worked examples
Example 1 — Null checks with conditional logic (SQL)
-- Required fields must be present
SELECT
COUNT(*) FILTER (WHERE order_id IS NULL) AS null_order_id,
COUNT(*) FILTER (WHERE customer_id IS NULL) AS null_customer_id,
COUNT(*) FILTER (WHERE order_date IS NULL) AS null_order_date
FROM orders;
-- If status = 'cancelled', cancelled_at must be not null
SELECT COUNT(*) AS violations
FROM orders
WHERE status = 'cancelled' AND cancelled_at IS NULL;
Expected: all required fields have 0 nulls; conditional rule has 0 violations.
Example 2 — Duplicate detection and safe dedup (SQL)
-- Identify duplicates by primary key
WITH dup AS (
SELECT order_id
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1
), ranked AS (
SELECT o.*, ROW_NUMBER() OVER (PARTITION BY o.order_id ORDER BY updated_at DESC NULLS LAST) AS rn
FROM orders o
JOIN dup d USING (order_id)
)
-- Keep the most recent record per duplicated order_id
SELECT * FROM ranked WHERE rn = 1; -- One row per duplicated key; combine with the non-duplicated rows (or rank the whole table) to build a clean copy
Pattern: pick a deterministic rule (e.g., latest updated_at) and keep rn = 1.
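If the goal is a clean copy of the whole table rather than just the duplicated keys, a minimal sketch is to rank every row directly (still assuming updated_at as the tiebreaker):
-- Deduplicate the whole table in one pass
SELECT *
FROM (
  SELECT o.*, ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC NULLS LAST) AS rn
  FROM orders o
) ranked_all
WHERE rn = 1;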
Example 3 — Range checks at write-time (constraints)
-- Example with constraints (syntax may vary by database)
ALTER TABLE orders
ADD CONSTRAINT chk_amount_range CHECK (amount >= 0 AND amount <= 10000),
ADD CONSTRAINT chk_valid_status CHECK (status IN ('new','shipped','cancelled'));
-- Optional: soft constraints via data validation queries before inserts
If you cannot use constraints (e.g., in a data lake), run validation queries before publishing to the serving layer.
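A minimal pre-publish gate, sketched as one query the pipeline can test before promoting the table (column names and thresholds are illustrative):
-- Gate query: promote only if total_violations is 0 (or below an agreed threshold)
SELECT
    COUNT(*) FILTER (WHERE order_id IS NULL OR customer_id IS NULL OR order_date IS NULL)
  + COUNT(*) FILTER (WHERE amount < 0 OR amount > 10000)
  + COUNT(*) FILTER (WHERE status NOT IN ('new','shipped','cancelled')) AS total_violations
FROM orders;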
Step-by-step: Add checks to a pipeline
- List required columns and expected ranges (document assumptions).
- Implement queries/notebooks to compute null, duplicate, and range violations.
- Emit metrics (counts, rates) and store sample offending rows for debugging.
- Fail or quarantine the batch if violations exceed thresholds (a quarantine sketch follows this list).
- Schedule checks near ingestion and before publishing to consumers.
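A minimal sketch of the quarantine step, assuming a hypothetical orders_quarantine table that shares the orders schema plus a quarantined_on column; the orchestrator reads the final count to decide whether to fail the run:
-- Quarantine offending rows (table name and rules are illustrative)
INSERT INTO orders_quarantine
SELECT o.*, CURRENT_DATE AS quarantined_on
FROM orders o
WHERE o.order_id IS NULL
   OR o.amount < 0 OR o.amount > 10000
   OR o.order_date > CURRENT_DATE
   OR o.status NOT IN ('new','shipped','cancelled');

-- Fail the batch in the orchestrator if this exceeds the agreed threshold
SELECT COUNT(*) AS violations_today
FROM orders_quarantine
WHERE quarantined_on = CURRENT_DATE;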
Minimal metric set to track
- Null rate per required column.
- Duplicate key count.
- Out-of-range count per rule.
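All three can come from a single pass over the table; a minimal sketch (ranges and column names are illustrative):
-- One-pass summary of the minimal metric set
SELECT
  AVG(CASE WHEN order_date IS NULL THEN 1.0 ELSE 0 END) AS null_rate_order_date,
  COUNT(*) - COUNT(DISTINCT order_id) AS duplicate_key_count, -- rows in excess of distinct non-null keys
  COUNT(*) FILTER (WHERE amount < 0 OR amount > 10000) AS out_of_range_amount
FROM orders;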
Exercises
Work through these now, then compare your approach with the solutions.
- Exercise 1 (SQL): Inspect a tiny orders table for nulls, duplicates, and range issues. Produce counts and the list of offending rows.
- Exercise 2 (PySpark): Validate a DataFrame with the same rules. Return three DataFrames: null_violations, duplicate_keys, range_violations.
Checklist before you submit
- [ ] You counted nulls for each required column.
- [ ] You identified duplicate order_id values.
- [ ] You flagged out-of-range amounts and invalid statuses.
- [ ] You can explain how you would quarantine bad rows.
Common mistakes and how to self-check
- Only checking nulls on a subset of columns. Self-check: list all required fields and verify each is tested.
- Removing duplicates without a deterministic rule. Self-check: is your dedup rule stable (e.g., latest updated_at) and reproducible?
- Range rules too strict. Self-check: review historical distributions; set ranges that match reality plus a safety margin.
- Ignoring conditional rules (e.g., cancelled_at). Self-check: add at least one conditional null rule to every table that has a status column.
- Not surfacing metrics. Self-check: can you compare today's violation counts against the last 7 days' baseline?
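A minimal comparison sketch, assuming a hypothetical dq_metrics table with (check_name, metric_value, run_date) populated by the daily job; date arithmetic varies by database:
-- Today's violation counts vs. the trailing 7-day average
SELECT t.check_name,
       t.metric_value AS today_value,
       b.avg_value    AS baseline_7d
FROM dq_metrics t
JOIN (
  SELECT check_name, AVG(metric_value) AS avg_value
  FROM dq_metrics
  WHERE run_date >= CURRENT_DATE - 7 AND run_date < CURRENT_DATE
  GROUP BY check_name
) b USING (check_name)
WHERE t.run_date = CURRENT_DATE;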
Practical projects
- Build a daily validation job that reads a raw orders table, computes null/duplicate/range metrics, writes a violations table, and stops promotion if thresholds are exceeded.
- Create a reusable SQL snippet or notebook template for these three checks. Parameterize the table name, key column, and ranges.
- Add a small dashboard with three sparkline metrics: null rate (order_date), duplicate key count, out-of-range amount count.
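For the reusable-template project above, a hypothetical Jinja-style sketch (as used by tools like dbt); every parameter name here is illustrative:
-- Parameterized checks: {{ table_name }}, {{ key_column }}, {{ range_column }}, {{ min_value }}, {{ max_value }}
SELECT 'null_{{ key_column }}' AS check_name, COUNT(*) AS violations
FROM {{ table_name }}
WHERE {{ key_column }} IS NULL
UNION ALL
SELECT 'duplicate_{{ key_column }}', COUNT(*)
FROM (
  SELECT {{ key_column }}
  FROM {{ table_name }}
  GROUP BY {{ key_column }}
  HAVING COUNT(*) > 1
) d
UNION ALL
SELECT 'range_{{ range_column }}', COUNT(*)
FROM {{ table_name }}
WHERE {{ range_column }} NOT BETWEEN {{ min_value }} AND {{ max_value }};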
Who this is for
- Data Engineers who maintain ingestion and transformation pipelines.
- Analytics Engineers who publish marts and want reliable joins.
- Data Scientists who depend on clean feature tables.
Prerequisites
- Basic SQL (SELECT, GROUP BY, window functions).
- Familiarity with your storage/compute environment (warehouse or lakehouse).
- Optional: PySpark or pandas for DataFrame-based checks.
Learning path
- Start: Null, duplicate, and range checks (this subskill).
- Next: Schema drift detection and type validation.
- Then: Freshness/volume monitoring and anomaly detection.
- Finally: Quarantine strategies and automated remediation.
Next steps
- Automate today’s checks in one critical table.
- Add thresholds and make the pipeline fail fast on severe violations.
- Document rules in your repo so others can extend them.
Mini challenge
You discover a surge in duplicate keys today, but nulls and ranges look normal. In one paragraph, outline how you would triage: where you look, which queries you run, and how you decide whether to block the publish step.