Why this matters
Null, duplicate, and range checks are the fastest, highest-impact data quality gates. They catch broken joins, bad extractions, schema drift, and unexpected values before they poison dashboards, ML features, and downstream jobs.
- Daily tasks: validate new ingestion tables, protect primary keys, enforce business rules (e.g., non-negative prices).
- On-call: isolate bad batches quickly by comparing today’s null/duplicate/range metrics to baselines.
- Governance: prove data fitness with simple, auditable checks and metrics.
Concept explained simply
Think of your dataset like a shipping manifest:
- Null checks: missing fields on packages (no address).
- Duplicate checks: the same package listed twice.
- Range checks: impossible weights or delivery dates (a negative weight, a delivery date in the year 1900).
Mental model
Use three lenses on every table:
- Completeness: required fields are present (null checks).
- Uniqueness: identifiers are unique (duplicate checks).
- Validity: values fit allowed domains (range checks).
Tip: Where to run these checks
- Ingestion (bronze/raw): verify schema and gross errors early.
- Transform (silver/curated): enforce business rules.
- Serve (gold/marts): block promotions to production if checks fail.
Core checks and patterns
Null checks
- Required columns must be non-null (e.g., order_id, order_date).
- Conditional null rules (e.g., cancelled_at is required when status = 'cancelled').
-- Count nulls per column
SELECT 'order_id' AS column_name, COUNT(*) FILTER (WHERE order_id IS NULL) AS nulls FROM orders
UNION ALL
SELECT 'order_date', COUNT(*) FILTER (WHERE order_date IS NULL) FROM orders;
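The FILTER clause is not supported by every engine (PostgreSQL has it; MySQL and SQL Server do not). A portable sketch of the same counts using CASE:
-- Portable per-column null counts using CASE (works in most SQL dialects)
SELECT
  SUM(CASE WHEN order_id IS NULL THEN 1 ELSE 0 END) AS null_order_id,
  SUM(CASE WHEN order_date IS NULL THEN 1 ELSE 0 END) AS null_order_date
FROM orders;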
Duplicate checks
- Duplicate keys: same primary/business key appears more than once.
- Near-duplicates: rows that share customer_id, order_date, and amount within a small time window (a sketch follows the query below).
-- Find duplicate primary keys
SELECT order_id, COUNT(*) AS cnt
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1;
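A minimal near-duplicate sketch for the bullet above, grouping on business fields instead of the key. The field combination is an assumption; adjust it to your business rule:
-- Find near-duplicates: same customer, date, and amount under different order_ids
SELECT customer_id, order_date, amount, COUNT(*) AS cnt
FROM orders
GROUP BY customer_id, order_date, amount
HAVING COUNT(*) > 1;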
Range checks
- Numeric ranges (0 ≤ amount ≤ 10,000).
- Date ranges (order_date not in the future, not too far in the past).
- Enumerations (status IN ('new','shipped','cancelled')).
-- Out-of-range examples
SELECT *
FROM orders
WHERE amount < 0 OR amount > 10000
OR order_date > CURRENT_DATE
OR status NOT IN ('new','shipped','cancelled');
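Note: NOT IN uses three-valued logic, so rows with a NULL status are never flagged by the last condition. If status is required, rely on the null checks above or add OR status IS NULL to this query.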
Worked examples
Example 1 — Null checks with conditional logic (SQL)
-- Required fields must be present
SELECT
COUNT(*) FILTER (WHERE order_id IS NULL) AS null_order_id,
COUNT(*) FILTER (WHERE customer_id IS NULL) AS null_customer_id,
COUNT(*) FILTER (WHERE order_date IS NULL) AS null_order_date
FROM orders;
-- If status = 'cancelled', cancelled_at must be not null
SELECT COUNT(*) AS violations
FROM orders
WHERE status = 'cancelled' AND cancelled_at IS NULL;
Expected: all required fields have 0 nulls; conditional rule has 0 violations.
Example 2 — Duplicate detection and safe dedup (SQL)
-- Identify duplicates by primary key
WITH dup AS (
SELECT order_id
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1
), ranked AS (
SELECT o.*, ROW_NUMBER() OVER (PARTITION BY o.order_id ORDER BY updated_at DESC NULLS LAST) AS rn
FROM orders o
JOIN dup d USING (order_id)
)
-- Keep the most recent record per duplicated order_id
SELECT * FROM ranked WHERE rn = 1; -- One row per duplicated key; combine with the non-duplicated rows (or rank the whole table) to build a clean copy
Pattern: pick a deterministic rule (e.g., latest updated_at) and keep rn = 1.
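If the goal is a clean copy of the whole table rather than just the duplicated keys, a minimal sketch is to rank every row directly (still assuming updated_at as the tiebreaker):
-- Deduplicate the whole table in one pass
SELECT *
FROM (
  SELECT o.*, ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC NULLS LAST) AS rn
  FROM orders o
) ranked_all
WHERE rn = 1;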
Example 3 — Range checks at write-time (constraints)
-- Example with constraints (syntax may vary by database)
ALTER TABLE orders
ADD CONSTRAINT chk_amount_range CHECK (amount >= 0 AND amount <= 10000),
ADD CONSTRAINT chk_valid_status CHECK (status IN ('new','shipped','cancelled'));
-- Optional: soft constraints via data validation queries before inserts
If you cannot use constraints (e.g., in a data lake), run validation queries before publishing to the serving layer.
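A minimal pre-publish gate, sketched as one query the pipeline can test before promoting the table (column names and thresholds are illustrative):
-- Gate query: promote only if total_violations is 0 (or below an agreed threshold)
SELECT
    COUNT(*) FILTER (WHERE order_id IS NULL OR customer_id IS NULL OR order_date IS NULL)
  + COUNT(*) FILTER (WHERE amount < 0 OR amount > 10000)
  + COUNT(*) FILTER (WHERE status NOT IN ('new','shipped','cancelled')) AS total_violations
FROM orders;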
Step-by-step: Add checks to a pipeline
- List required columns and expected ranges (document assumptions).
- Implement queries/notebooks to compute null, duplicate, and range violations.
- Emit metrics (counts, rates) and store sample offending rows for debugging.
- Fail or quarantine the batch if violations exceed thresholds (a quarantine sketch follows this list).
- Schedule checks near ingestion and before publishing to consumers.
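A minimal sketch of the quarantine step, assuming a hypothetical orders_quarantine table that shares the orders schema plus a quarantined_on column; the orchestrator reads the final count to decide whether to fail the run:
-- Quarantine offending rows (table name and rules are illustrative)
INSERT INTO orders_quarantine
SELECT o.*, CURRENT_DATE AS quarantined_on
FROM orders o
WHERE o.order_id IS NULL
   OR o.amount < 0 OR o.amount > 10000
   OR o.order_date > CURRENT_DATE
   OR o.status NOT IN ('new','shipped','cancelled');

-- Fail the batch in the orchestrator if this exceeds the agreed threshold
SELECT COUNT(*) AS violations_today
FROM orders_quarantine
WHERE quarantined_on = CURRENT_DATE;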
Minimal metric set to track
- Null rate per required column.
- Duplicate key count.
- Out-of-range count per rule.
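All three can come from a single pass over the table; a minimal sketch (ranges and column names are illustrative):
-- One-pass summary of the minimal metric set
SELECT
  AVG(CASE WHEN order_date IS NULL THEN 1.0 ELSE 0 END) AS null_rate_order_date,
  COUNT(*) - COUNT(DISTINCT order_id) AS duplicate_key_count, -- rows in excess of distinct non-null keys
  COUNT(*) FILTER (WHERE amount < 0 OR amount > 10000) AS out_of_range_amount
FROM orders;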
Exercises
Work through these now, then compare your approach with the solutions.
- Exercise 1 (SQL): Inspect a tiny orders table for nulls, duplicates, and range issues. Produce counts and the list of offending rows.
- Exercise 2 (PySpark): Validate a DataFrame with the same rules. Return three DataFrames: null_violations, duplicate_keys, range_violations.
Checklist before you submit
- [ ] You counted nulls for each required column.
- [ ] You identified duplicate order_id values.
- [ ] You flagged out-of-range amounts and invalid statuses.
- [ ] You can explain how you would quarantine bad rows.
Common mistakes and how to self-check
- Only checking nulls on a subset of columns. Self-check: list all required fields and verify each is tested.
- Removing duplicates without a deterministic rule. Self-check: is your dedup rule stable (e.g., latest updated_at) and reproducible?
- Range rules too strict. Self-check: review historical distributions; set ranges that match reality plus a safety margin.
- Ignoring conditional rules (e.g., cancelled_at). Self-check: add at least one conditional null rule to every table that has a status column.
- Not surfacing metrics. Self-check: can you compare today's violation counts against the last 7 days' baseline?
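A minimal comparison sketch, assuming a hypothetical dq_metrics table with (check_name, metric_value, run_date) populated by the daily job; date arithmetic varies by database:
-- Today's violation counts vs. the trailing 7-day average
SELECT t.check_name,
       t.metric_value AS today_value,
       b.avg_value    AS baseline_7d
FROM dq_metrics t
JOIN (
  SELECT check_name, AVG(metric_value) AS avg_value
  FROM dq_metrics
  WHERE run_date >= CURRENT_DATE - 7 AND run_date < CURRENT_DATE
  GROUP BY check_name
) b USING (check_name)
WHERE t.run_date = CURRENT_DATE;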
Practical projects
- Build a daily validation job that reads a raw orders table, computes null/duplicate/range metrics, writes a violations table, and stops promotion if thresholds are exceeded.
- Create a reusable SQL snippet or notebook template for these three checks. Parameterize the table name, key column, and ranges.
- Add a small dashboard with three sparkline metrics: null rate (order_date), duplicate key count, out-of-range amount count.
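For the reusable-template project above, a hypothetical Jinja-style sketch (as used by tools like dbt); every parameter name here is illustrative:
-- Parameterized checks: {{ table_name }}, {{ key_column }}, {{ range_column }}, {{ min_value }}, {{ max_value }}
SELECT 'null_{{ key_column }}' AS check_name, COUNT(*) AS violations
FROM {{ table_name }}
WHERE {{ key_column }} IS NULL
UNION ALL
SELECT 'duplicate_{{ key_column }}', COUNT(*)
FROM (
  SELECT {{ key_column }}
  FROM {{ table_name }}
  GROUP BY {{ key_column }}
  HAVING COUNT(*) > 1
) d
UNION ALL
SELECT 'range_{{ range_column }}', COUNT(*)
FROM {{ table_name }}
WHERE {{ range_column }} NOT BETWEEN {{ min_value }} AND {{ max_value }};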
Who this is for
- Data Engineers who maintain ingestion and transformation pipelines.
- Analytics Engineers who publish marts and want reliable joins.
- Data Scientists who depend on clean feature tables.
Prerequisites
- Basic SQL (SELECT, GROUP BY, window functions).
- Familiarity with your storage/compute environment (warehouse or lakehouse).
- Optional: PySpark or pandas for DataFrame-based checks.
Learning path
- Start: Null, duplicate, and range checks (this subskill).
- Next: Schema drift detection and type validation.
- Then: Freshness/volume monitoring and anomaly detection.
- Finally: Quarantine strategies and automated remediation.
Next steps
- Automate today’s checks in one critical table.
- Add thresholds and make the pipeline fail fast on severe violations.
- Document rules in your repo so others can extend them.
Mini challenge
You discover a surge in duplicate keys today, but nulls and ranges look normal. In one paragraph, outline how you would triage: where you look, which queries you run, and how you decide whether to block the publish step.