Why this matters
Data validation rules are the guardrails that stop bad data from entering your analysis. In real projects, they help you:
- Prevent broken dashboards by catching impossible values (e.g., negative quantities).
- Protect KPIs by enforcing uniqueness and referential integrity (e.g., no duplicate order IDs, every category exists in a lookup).
- Cut rework time by flagging issues at ingestion instead of after modeling.
- Build trust: stakeholders rely on consistently validated numbers.
Who this is for
New and working Data Analysts who need reliable, repeatable checks for spreadsheets, SQL datasets, and Python/R pipelines.
Prerequisites
- Basic data types (string, number, date).
- Comfort with spreadsheets and/or SQL/Pandas.
- Understanding of your domain fields (e.g., what a valid order looks like).
Note: The quick test is available to everyone. Only logged-in users have their progress saved.
Concept explained simply
Think of validation rules as bouncers at a club. Each rule checks an entry: if it fails, the row is rejected or flagged.
Mental model
- Field-level: checks a single column (type, range, format, allowed values).
- Row-level: checks combinations within a row (end_date ≥ start_date).
- Dataset-level: checks across rows (unique IDs, duplicate detection).
- Relational: checks across tables/lookups (foreign keys, category mapping).
- Severity: error (block), warning (flag), info (log).
Core rule types (with quick examples)
Field-level
- Type: quantity is integer.
- Range: discount between 0 and 0.5 inclusive.
- Allowed set: country in {US, CA, GB}.
- Format: order_id matches pattern O-\d+; email looks like user@domain.tld.
Row-level
- Cross-field: ship_date ≥ order_date.
- Derived: total_amount ≈ quantity * unit_price * (1 - discount) within tolerance (e.g., ±0.02).
Dataset-level
- Uniqueness: order_id unique and not null.
- Completeness: required fields not null.
Relational
- Referential integrity: category exists in categories lookup.
- Conditional mapping: currency matches country (US→USD, CA→CAD, GB→GBP).
Worked examples
Example 1: Spreadsheet data validation
- Select quantity column → set whole number → minimum 1.
- Select discount column → decimal → between 0 and 0.5.
- country column → list → US, CA, GB.
- order_id column → use custom formula to match O-\d+ (spreadsheet-specific regex option) or enforce via data entry template.
- Create a calculated check cell: total_check = ABS(total_amount - quantity * unit_price * (1 - discount)) ≤ 0.02, then conditional format rows with total_check = FALSE.
Example 2: SQL rules
-- Table with constraints (syntax may vary by RDBMS)
CREATE TABLE orders (
order_id VARCHAR(20) PRIMARY KEY,
customer_id VARCHAR(20) NOT NULL,
order_date DATE NOT NULL,
ship_date DATE NOT NULL,
country VARCHAR(2) NOT NULL CHECK (country IN ('US','CA','GB')),
currency VARCHAR(3) NOT NULL CHECK (currency IN ('USD','CAD','GBP')),
quantity INT NOT NULL CHECK (quantity >= 1),
unit_price NUMERIC(10,2) NOT NULL CHECK (unit_price > 0),
discount NUMERIC(5,2) NOT NULL CHECK (discount BETWEEN 0 AND 0.5),
total_amount NUMERIC(12,2) NOT NULL,
email VARCHAR(200)
);
-- Row-level check via generated column or view comparison
-- Example tolerance check in a view:
SELECT *,
CASE WHEN ABS(total_amount - quantity*unit_price*(1-discount)) <= 0.02 THEN 'OK' ELSE 'MISMATCH' END AS total_ok,
CASE WHEN ship_date >= order_date THEN 'OK' ELSE 'BAD_DATE' END AS date_ok
FROM orders;
-- Relational mapping example
-- currency-country mapping enforced via lookup + FK
CREATE TABLE country_currency (
country VARCHAR(2) PRIMARY KEY,
currency VARCHAR(3) NOT NULL
);
-- Then join in quality checks
Example 3: Python (pandas) checks
import pandas as pd
# df columns: order_id, country, currency, quantity, unit_price, discount, total_amount, order_date, ship_date
issues = []
# Field-level
issues.append((~df['quantity'].ge(1)).sum())
issues.append((~df['discount'].between(0, 0.5)).sum())
issues.append((~df['country'].isin(['US','CA','GB'])).sum())
# Row-level
calc = df['quantity']*df['unit_price']*(1 - df['discount'])
issues.append(( (df['total_amount'] - calc).abs() > 0.02 ).sum())
issues.append((df['ship_date'] < df['order_date']).sum())
# Dataset-level
issues.append(df['order_id'].isna().sum())
issues.append(df['order_id'].duplicated().sum())
print('Problems by check:', issues)
Practical setup checklist
- List your required fields (not null).
- Define types and ranges for numeric/date columns.
- Create allowed sets for categorical fields.
- Add cross-field rules (dates, totals, conditional mappings).
- Set uniqueness rules for IDs and natural keys.
- Decide severity: error vs warning.
- Add tolerance to computed checks (rounding).
- Document rules in a simple table: name, scope, logic, severity, owner.
Exercises
Do these in any tool you like (spreadsheet, SQL, or pandas).
Exercise 1 — Design validation rules for an Orders dataset (ex1)
Dataset columns: order_id, customer_id, order_date, ship_date, country, currency, quantity, unit_price, discount, total_amount, email.
- Define at least 10 rules: include field-level, row-level, dataset-level, and at least one relational rule (e.g., country→currency).
- For each rule, set severity (error/warning) and a short rationale.
Show a possible solution
- order_id: not null, pattern O-\d+, unique (error).
- customer_id: not null (error).
- order_date, ship_date: valid dates (error).
- ship_date ≥ order_date (error).
- country in {US, CA, GB} (error).
- currency in {USD, CAD, GBP} (error) and currency matches country mapping (error).
- quantity integer ≥ 1 (error).
- unit_price > 0 (error).
- discount between 0 and 0.5 inclusive (error).
- total_amount within ±0.02 of quantity*unit_price*(1-discount) (error).
- email basic format check if not null (warning).
Rationale examples: prevents duplicates; ensures positive money; enforces realistic business logic.
Exercise 2 — Spot the broken rules (ex2)
Use these rows and identify which rules fail for each row.
Rules to apply (short):
- order_id: unique, pattern O-\d+, not null
- ship_date ≥ order_date
- country in {US, CA, GB}
- currency matches country (US→USD, CA→CAD, GB→GBP)
- quantity ≥ 1
- unit_price > 0
- discount 0..0.5
- total_amount within ±0.02 of quantity*unit_price*(1-discount)
- email basic format user@domain.tld if present
Rows:
1) O-1001, C-501, 2025-03-10, 2025-03-15, US, USD, 3, 19.99, 0.10, 53.97, a@b.com
2) O-1002, C-502, 2025-03-12, 2025-03-11, CA, CAD, 2, 50.00, 0.00, 100.00, x@y.ca
3) O-1003, C-503, 2025-03-12, 2025-03-20, GB, USD, 1, 100.00, 0.05, 95.00, foo@example.co.uk
4) O-1004, C-504, 2025-03-13, 2025-03-16, US, USD, 5, 10.00, 0.20, 39.00, good@ok.com
5) O-1004, C-505, 2025-03-14, 2025-03-18, US, USD, 1, 0.00, 0.00, 0.00, hi@ok.com
6) O-1006, C-506, 2025-03-10, 2025-03-12, US, USD, 0, 5.00, 0.60, 0.00, not-an-email
Show a possible solution
- Row 1: Passes all checks (total = 3*19.99*0.9=53.973 → 53.97 within tolerance).
- Row 2: ship_date < order_date (error).
- Row 3: currency-country mismatch (GB requires GBP) (error).
- Row 4: total mismatch (expected 40.00; got 39.00) (error).
- Row 5: duplicate order_id (error); unit_price must be > 0 (error); total derived is incorrect (error).
- Row 6: quantity < 1 (error); discount > 0.5 (error); email format invalid (warning); total derived incorrect (error).
Common mistakes and how to self-check
- Missing tolerance on calculated fields → false positives.
Self-check
Compare failure rates with ±0.00 vs ±0.02 tolerance; if many errors disappear with small tolerance, keep the tolerance.
- Over-strict regex for emails → legitimate emails rejected.
Self-check
Use a simple pattern like ^.+@.+\..+$ for business data. Treat advanced validation as warning.
- Rules without owners → rules rot over time.
Self-check
Add an owner column in your rule log and review monthly.
- Introducing hard errors too early in legacy systems.
Self-check
Start as warnings, measure impact, then upgrade to errors.
Practical projects
- Set up a validation layer for a sales dataset: create 12 rules, run weekly, and produce a one-page quality report.
- Build a spreadsheet template with data validation for a small team intake form (drop-downs, numeric ranges, conditional formatting).
- Write a Python or SQL script that checks rules and outputs a fail log with row_id, rule_name, severity, reason.
Learning path
- Before: Basic data types, spreadsheet validation tools, simple SQL filters.
- Now: Define and implement field, row, dataset, and relational rules with severities and tolerances.
- Next: Automate checks in pipelines, add monitoring dashboards, and integrate with data catalogs.
Mini challenge
You receive an event_log dataset with columns: event_id, user_id, ts, event_type, revenue. Write 7 rules (mix of field, row, dataset, relational), mark each as error or warning, and explain why. Keep it under 10 minutes.
Next steps
- Document your rules in a shared sheet or repo.
- Promote 2–3 high-impact warnings to errors after a trial period.
- Schedule a monthly quality review with stakeholders.