
Data Validation Rules

Learn Data Validation Rules for free with explanations, exercises, and a quick test, aimed at Data Analysts.

Published: December 19, 2025 | Updated: December 19, 2025

Why this matters

Data validation rules are the guardrails that stop bad data from entering your analysis. In real projects, they help you:

  • Prevent broken dashboards by catching impossible values (e.g., negative quantities).
  • Protect KPIs by enforcing uniqueness and referential integrity (e.g., no duplicate order IDs, every category exists in a lookup).
  • Cut rework time by flagging issues at ingestion instead of after modeling.
  • Build trust: stakeholders rely on consistently validated numbers.

Who this is for

New and working Data Analysts who need reliable, repeatable checks for spreadsheets, SQL datasets, and Python/R pipelines.

Prerequisites
  • Basic data types (string, number, date).
  • Comfort with spreadsheets and/or SQL/Pandas.
  • Understanding of your domain fields (e.g., what a valid order looks like).

Note: The quick test is available to everyone. Only logged-in users have their progress saved.

Concept explained simply

Think of validation rules as bouncers at a club. Each rule checks an entry: if it fails, the row is rejected or flagged.

Mental model

  • Field-level: checks a single column (type, range, format, allowed values).
  • Row-level: checks combinations within a row (end_date ≥ start_date).
  • Dataset-level: checks across rows (unique IDs, duplicate detection).
  • Relational: checks across tables/lookups (foreign keys, category mapping).
  • Severity: error (block), warning (flag), info (log).
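One lightweight way to capture this mental model in code is a small rule object that carries its scope and severity alongside its check. This is an illustrative sketch, not any library's API; the `Rule` class and its field names are invented for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Illustrative sketch, not a library API: one object per rule,
# carrying the scope and severity from the mental model above.
@dataclass
class Rule:
    name: str                      # e.g. "quantity_positive"
    scope: str                     # "field", "row", "dataset", or "relational"
    severity: str                  # "error", "warning", or "info"
    check: Callable[[Any], bool]   # True means the value passes

quantity_positive = Rule(
    name="quantity_positive",
    scope="field",
    severity="error",
    check=lambda v: isinstance(v, int) and v >= 1,
)

print(quantity_positive.check(3))   # True
print(quantity_positive.check(-2))  # False
```

Keeping scope and severity on the rule itself makes it easy to decide later whether a failure should block a load or merely be logged.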

Core rule types (with quick examples)

Field-level

  • Type: quantity is integer.
  • Range: discount between 0 and 0.5 inclusive.
  • Allowed set: country in {US, CA, GB}.
  • Format: order_id matches pattern O-\d+; email looks like user@domain.tld.

Row-level

  • Cross-field: ship_date ≥ order_date.
  • Derived: total_amount ≈ quantity * unit_price * (1 - discount) within tolerance (e.g., ±0.02).

Dataset-level

  • Uniqueness: order_id unique and not null.
  • Completeness: required fields not null.

Relational

  • Referential integrity: category exists in categories lookup.
  • Conditional mapping: currency matches country (US→USD, CA→CAD, GB→GBP).
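The conditional mapping above can be expressed as a tiny lookup-based check. A minimal sketch, where the mapping values come from the examples and the helper name is illustrative:

```python
# The mapping comes from the examples above; the helper name is illustrative.
COUNTRY_CURRENCY = {"US": "USD", "CA": "CAD", "GB": "GBP"}

def currency_matches_country(country: str, currency: str) -> bool:
    # Fails closed: a country missing from the lookup counts as a mismatch.
    return COUNTRY_CURRENCY.get(country) == currency

print(currency_matches_country("US", "USD"))  # True
print(currency_matches_country("GB", "USD"))  # False
```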

Worked examples

Example 1: Spreadsheet data validation

  1. Select quantity column → set whole number → minimum 1.
  2. Select discount column → decimal → between 0 and 0.5.
  3. country column → list → US, CA, GB.
  4. order_id column → custom formula that matches O-\d+ (e.g., REGEXMATCH in Google Sheets), or enforce the format via a data entry template.
  5. Create a calculated check cell: total_check = ABS(total_amount - quantity * unit_price * (1 - discount)) ≤ 0.02, then conditional format rows with total_check = FALSE.

Example 2: SQL rules

-- Table with constraints (syntax may vary by RDBMS)
CREATE TABLE orders (
  order_id      VARCHAR(20) PRIMARY KEY,
  customer_id   VARCHAR(20) NOT NULL,
  order_date    DATE NOT NULL,
  ship_date     DATE NOT NULL,
  country       VARCHAR(2) NOT NULL CHECK (country IN ('US','CA','GB')),
  currency      VARCHAR(3) NOT NULL CHECK (currency IN ('USD','CAD','GBP')),
  quantity      INT NOT NULL CHECK (quantity >= 1),
  unit_price    NUMERIC(10,2) NOT NULL CHECK (unit_price > 0),
  discount      NUMERIC(5,2) NOT NULL CHECK (discount BETWEEN 0 AND 0.5),
  total_amount  NUMERIC(12,2) NOT NULL,
  email         VARCHAR(200)
);

-- Row-level check via generated column or view comparison
-- Example tolerance check in a view:
SELECT *,
  CASE WHEN ABS(total_amount - quantity*unit_price*(1-discount)) <= 0.02 THEN 'OK' ELSE 'MISMATCH' END AS total_ok,
  CASE WHEN ship_date >= order_date THEN 'OK' ELSE 'BAD_DATE' END AS date_ok
FROM orders;

-- Relational mapping example
-- currency-country mapping enforced via lookup + FK
CREATE TABLE country_currency (
  country VARCHAR(2) PRIMARY KEY,
  currency VARCHAR(3) NOT NULL
);
-- Then join in quality checks

Example 3: Python (pandas) checks

import pandas as pd

# df columns: order_id, country, currency, quantity, unit_price, discount, total_amount, order_date, ship_date

# Assumes df is already loaded with the columns above
checks = {}

# Field-level
checks['quantity >= 1'] = ~df['quantity'].ge(1)
checks['discount in [0, 0.5]'] = ~df['discount'].between(0, 0.5)
checks['country in allowed set'] = ~df['country'].isin(['US', 'CA', 'GB'])

# Row-level
calc = df['quantity'] * df['unit_price'] * (1 - df['discount'])
checks['total within ±0.02'] = (df['total_amount'] - calc).abs() > 0.02
checks['ship_date >= order_date'] = df['ship_date'] < df['order_date']

# Dataset-level
checks['order_id not null'] = df['order_id'].isna()
checks['order_id unique'] = df['order_id'].duplicated()

# Each value is a boolean mask of failing rows
for name, failed in checks.items():
    print(f'{name}: {failed.sum()} failing row(s)')

Practical setup checklist

  • List your required fields (not null).
  • Define types and ranges for numeric/date columns.
  • Create allowed sets for categorical fields.
  • Add cross-field rules (dates, totals, conditional mappings).
  • Set uniqueness rules for IDs and natural keys.
  • Decide severity: error vs warning.
  • Add tolerance to computed checks (rounding).
  • Document rules in a simple table: name, scope, logic, severity, owner.

Exercises

Do these in any tool you like (spreadsheet, SQL, or pandas).

Exercise 1 — Design validation rules for an Orders dataset (ex1)

Dataset columns: order_id, customer_id, order_date, ship_date, country, currency, quantity, unit_price, discount, total_amount, email.

  • Define at least 10 rules: include field-level, row-level, dataset-level, and at least one relational rule (e.g., country→currency).
  • For each rule, set severity (error/warning) and a short rationale.
A possible solution (sample rules, abbreviated):
  • order_id: not null, pattern O-\d+, unique (error).
  • customer_id: not null (error).
  • order_date, ship_date: valid dates (error).
  • ship_date ≥ order_date (error).
  • country in {US, CA, GB} (error).
  • currency in {USD, CAD, GBP} (error) and currency matches country mapping (error).
  • quantity integer ≥ 1 (error).
  • unit_price > 0 (error).
  • discount between 0 and 0.5 inclusive (error).
  • total_amount within ±0.02 of quantity*unit_price*(1-discount) (error).
  • email basic format check if not null (warning).

Rationale examples: prevents duplicates; ensures positive money; enforces realistic business logic.

Exercise 2 — Spot the broken rules (ex2)

Use these rows and identify which rules fail for each row.

Rules to apply (short):
- order_id: unique, pattern O-\d+, not null
- ship_date ≥ order_date
- country in {US, CA, GB}
- currency matches country (US→USD, CA→CAD, GB→GBP)
- quantity ≥ 1
- unit_price > 0
- discount 0..0.5
- total_amount within ±0.02 of quantity*unit_price*(1-discount)
- email basic format user@domain.tld if present

Rows:
1) O-1001, C-501, 2025-03-10, 2025-03-15, US, USD, 3, 19.99, 0.10, 53.97, a@b.com
2) O-1002, C-502, 2025-03-12, 2025-03-11, CA, CAD, 2, 50.00, 0.00, 100.00, x@y.ca
3) O-1003, C-503, 2025-03-12, 2025-03-20, GB, USD, 1, 100.00, 0.05, 95.00, foo@example.co.uk
4) O-1004, C-504, 2025-03-13, 2025-03-16, US, USD, 5, 10.00, 0.20, 39.00, good@ok.com
5) O-1004, C-505, 2025-03-14, 2025-03-18, US, USD, 1, 0.00, 0.00, 0.00, hi@ok.com
6) O-1006, C-506, 2025-03-10, 2025-03-12, US, USD, 0, 5.00, 0.60, 0.00, not-an-email
A possible solution:
  • Row 1: Passes all checks (total = 3*19.99*0.9=53.973 → 53.97 within tolerance).
  • Row 2: ship_date < order_date (error).
  • Row 3: currency-country mismatch (GB requires GBP) (error).
  • Row 4: total mismatch (expected 40.00; got 39.00) (error).
  • Row 5: duplicate order_id (O-1004 repeats) (error); unit_price must be > 0 (error).
  • Row 6: quantity < 1 (error); discount > 0.5 (error); email format invalid (warning).

Common mistakes and how to self-check

  • Missing tolerance on calculated fields → false positives.
    Self-check: compare failure rates with ±0.00 vs ±0.02 tolerance; if many errors disappear with the small tolerance, keep it.

  • Over-strict regex for emails → legitimate emails rejected.
    Self-check: use a simple pattern like ^.+@.+\..+$ for business data; treat stricter validation as a warning.

  • Rules without owners → rules rot over time.
    Self-check: add an owner column to your rule log and review it monthly.

  • Introducing hard errors too early in legacy systems.
    Self-check: start with warnings, measure the impact, then upgrade to errors.
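The lenient email pattern mentioned in the self-check above can be applied like this; `looks_like_email` is an illustrative helper name.

```python
import re

# The lenient pattern from the self-check; stricter validation is
# better treated as a warning than an error.
EMAIL_RE = re.compile(r'^.+@.+\..+$')

def looks_like_email(value: str) -> bool:   # illustrative helper name
    return bool(EMAIL_RE.match(value))

print(looks_like_email("user@domain.tld"))  # True
print(looks_like_email("not-an-email"))     # False
```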

Practical projects

  • Set up a validation layer for a sales dataset: create 12 rules, run weekly, and produce a one-page quality report.
  • Build a spreadsheet template with data validation for a small team intake form (drop-downs, numeric ranges, conditional formatting).
  • Write a Python or SQL script that checks rules and outputs a fail log with row_id, rule_name, severity, reason.
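For the third project, one possible shape for the fail log is sketched below; the rule names, severities, and toy data are illustrative, not prescribed.

```python
import pandas as pd

# Toy input; rule names and severities are illustrative.
df = pd.DataFrame({
    "order_id": ["O-1", "O-2", None],
    "quantity": [3, 0, 2],
})

# Each rule: (name, severity, boolean mask of failing rows)
rules = [
    ("order_id_not_null", "error", df["order_id"].isna()),
    ("quantity_ge_1",     "error", ~df["quantity"].ge(1)),
]

# Build one log row per (failing row, rule) pair
log_rows = []
for name, severity, failed in rules:
    for row_id in df[failed].index:
        log_rows.append({
            "row_id": row_id,
            "rule_name": name,
            "severity": severity,
            "reason": f"{name} failed",
        })

fail_log = pd.DataFrame(log_rows)
print(fail_log)
```

A long, narrow log like this (one row per failure) is easy to filter by severity or aggregate into a weekly quality report.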

Learning path

  • Before: Basic data types, spreadsheet validation tools, simple SQL filters.
  • Now: Define and implement field, row, dataset, and relational rules with severities and tolerances.
  • Next: Automate checks in pipelines, add monitoring dashboards, and integrate with data catalogs.

Mini challenge

You receive an event_log dataset with columns: event_id, user_id, ts, event_type, revenue. Write 7 rules (mix of field, row, dataset, relational), mark each as error or warning, and explain why. Keep it under 10 minutes.

Next steps

  • Document your rules in a shared sheet or repo.
  • Promote 2–3 high-impact warnings to errors after a trial period.
  • Schedule a monthly quality review with stakeholders.

Practice Exercises

Exercise 1 — Instructions
Define at least 10 validation rules for this schema: order_id, customer_id, order_date, ship_date, country, currency, quantity, unit_price, discount, total_amount, email. Include:

  • At least 3 field-level rules.
  • At least 2 row-level rules.
  • At least 2 dataset-level rules.
  • At least 1 relational rule (country→currency mapping).
  • Severity (error/warning) and a short rationale for each.

Expected Output
A list of ≥10 rules covering field, row, dataset, and relational scopes; each with severity and rationale.

Data Validation Rules — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

