How to learn Data Validation Rules for Data Cleaning in Data Analyst for free

Why this matters

Data validation rules are the guardrails that stop bad data from entering your analysis. In real projects, they help you:

Prevent broken dashboards by catching impossible values (e.g., negative quantities).
Protect KPIs by enforcing uniqueness and referential integrity (e.g., no duplicate order IDs, every category exists in a lookup).
Cut rework time by flagging issues at ingestion instead of after modeling.
Build trust: stakeholders rely on consistently validated numbers.

Who this is for

New and working Data Analysts who need reliable, repeatable checks for spreadsheets, SQL datasets, and Python/R pipelines.

Prerequisites

Basic data types (string, number, date).
Comfort with spreadsheets and/or SQL/Pandas.
Understanding of your domain fields (e.g., what a valid order looks like).

Note: The quick test is available to everyone. Only logged-in users have their progress saved.

Concept explained simply

Think of validation rules as bouncers at a club. Each rule checks an entry: if it fails, the row is rejected or flagged.

Mental model

Field-level: checks a single column (type, range, format, allowed values).
Row-level: checks combinations within a row (end_date ≥ start_date).
Dataset-level: checks across rows (unique IDs, duplicate detection).
Relational: checks across tables/lookups (foreign keys, category mapping).
Severity: error (block), warning (flag), info (log).

Core rule types (with quick examples)

Field-level

Type: quantity is integer.
Range: discount between 0 and 0.5 inclusive.
Allowed set: country in {US, CA, GB}.
Format: order_id matches pattern O-\d+; email looks like user@domain.tld.

Row-level

Cross-field: ship_date ≥ order_date.
Derived: total_amount ≈ quantity * unit_price * (1 - discount) within tolerance (e.g., ±0.02).

Dataset-level

Uniqueness: order_id unique and not null.
Completeness: required fields not null.

Relational

Referential integrity: category exists in categories lookup.
Conditional mapping: currency matches country (US→USD, CA→CAD, GB→GBP).

Worked examples

Example 1: Spreadsheet data validation

Select quantity column → set whole number → minimum 1.
Select discount column → decimal → between 0 and 0.5.
country column → list → US, CA, GB.
order_id column → use custom formula to match O-\d+ (spreadsheet-specific regex option) or enforce via data entry template.
Create a calculated check cell: total_check = ABS(total_amount - quantity * unit_price * (1 - discount)) ≤ 0.02, then conditional format rows with total_check = FALSE.

Example 2: SQL rules

-- Table with constraints (syntax may vary by RDBMS)
CREATE TABLE orders (
  order_id      VARCHAR(20) PRIMARY KEY,
  customer_id   VARCHAR(20) NOT NULL,
  order_date    DATE NOT NULL,
  ship_date     DATE NOT NULL,
  country       VARCHAR(2) NOT NULL CHECK (country IN ('US','CA','GB')),
  currency      VARCHAR(3) NOT NULL CHECK (currency IN ('USD','CAD','GBP')),
  quantity      INT NOT NULL CHECK (quantity >= 1),
  unit_price    NUMERIC(10,2) NOT NULL CHECK (unit_price > 0),
  discount      NUMERIC(5,2) NOT NULL CHECK (discount BETWEEN 0 AND 0.5),
  total_amount  NUMERIC(12,2) NOT NULL,
  email         VARCHAR(200)
);

-- Row-level check via generated column or view comparison
-- Example tolerance check in a view:
SELECT *,
  CASE WHEN ABS(total_amount - quantity*unit_price*(1-discount)) <= 0.02 THEN 'OK' ELSE 'MISMATCH' END AS total_ok,
  CASE WHEN ship_date >= order_date THEN 'OK' ELSE 'BAD_DATE' END AS date_ok
FROM orders;

-- Relational mapping example
-- currency-country mapping enforced via lookup + FK
CREATE TABLE country_currency (
  country VARCHAR(2) PRIMARY KEY,
  currency VARCHAR(3) NOT NULL
);
-- Then join in quality checks

Example 3: Python (pandas) checks

import pandas as pd

# df columns: order_id, country, currency, quantity, unit_price, discount, total_amount, order_date, ship_date

issues = []

# Field-level
issues.append((~df['quantity'].ge(1)).sum())
issues.append((~df['discount'].between(0, 0.5)).sum())
issues.append((~df['country'].isin(['US','CA','GB'])).sum())

# Row-level
calc = df['quantity']*df['unit_price']*(1 - df['discount'])
issues.append(( (df['total_amount'] - calc).abs() > 0.02 ).sum())
issues.append((df['ship_date'] < df['order_date']).sum())

# Dataset-level
issues.append(df['order_id'].isna().sum())
issues.append(df['order_id'].duplicated().sum())

print('Problems by check:', issues)

Practical setup checklist

List your required fields (not null).
Define types and ranges for numeric/date columns.
Create allowed sets for categorical fields.
Add cross-field rules (dates, totals, conditional mappings).
Set uniqueness rules for IDs and natural keys.
Decide severity: error vs warning.
Add tolerance to computed checks (rounding).
Document rules in a simple table: name, scope, logic, severity, owner.

Exercises

Do these in any tool you like (spreadsheet, SQL, or pandas).

Exercise 1 — Design validation rules for an Orders dataset (ex1)

Dataset columns: order_id, customer_id, order_date, ship_date, country, currency, quantity, unit_price, discount, total_amount, email.

Define at least 10 rules: include field-level, row-level, dataset-level, and at least one relational rule (e.g., country→currency).
For each rule, set severity (error/warning) and a short rationale.

Show a possible solution

Sample rules (abbrev.):

order_id: not null, pattern O-\d+, unique (error).
customer_id: not null (error).
order_date, ship_date: valid dates (error).
ship_date ≥ order_date (error).
country in {US, CA, GB} (error).
currency in {USD, CAD, GBP} (error) and currency matches country mapping (error).
quantity integer ≥ 1 (error).
unit_price > 0 (error).
discount between 0 and 0.5 inclusive (error).
total_amount within ±0.02 of quantity*unit_price*(1-discount) (error).
email basic format check if not null (warning).

Rationale examples: prevents duplicates; ensures positive money; enforces realistic business logic.

Exercise 2 — Spot the broken rules (ex2)

Use these rows and identify which rules fail for each row.

Rules to apply (short):
- order_id: unique, pattern O-\d+, not null
- ship_date ≥ order_date
- country in {US, CA, GB}
- currency matches country (US→USD, CA→CAD, GB→GBP)
- quantity ≥ 1
- unit_price > 0
- discount 0..0.5
- total_amount within ±0.02 of quantity*unit_price*(1-discount)
- email basic format user@domain.tld if present

Rows:
1) O-1001, C-501, 2025-03-10, 2025-03-15, US, USD, 3, 19.99, 0.10, 53.97, a@b.com
2) O-1002, C-502, 2025-03-12, 2025-03-11, CA, CAD, 2, 50.00, 0.00, 100.00, x@y.ca
3) O-1003, C-503, 2025-03-12, 2025-03-20, GB, USD, 1, 100.00, 0.05, 95.00, foo@example.co.uk
4) O-1004, C-504, 2025-03-13, 2025-03-16, US, USD, 5, 10.00, 0.20, 39.00, good@ok.com
5) O-1004, C-505, 2025-03-14, 2025-03-18, US, USD, 1, 0.00, 0.00, 0.00, hi@ok.com
6) O-1006, C-506, 2025-03-10, 2025-03-12, US, USD, 0, 5.00, 0.60, 0.00, not-an-email

Show a possible solution

Row 1: Passes all checks (total = 3*19.99*0.9=53.973 → 53.97 within tolerance).
Row 2: ship_date < order_date (error).
Row 3: currency-country mismatch (GB requires GBP) (error).
Row 4: total mismatch (expected 40.00; got 39.00) (error).
Row 5: duplicate order_id (error); unit_price must be > 0 (error); total derived is incorrect (error).
Row 6: quantity < 1 (error); discount > 0.5 (error); email format invalid (warning); total derived incorrect (error).

Common mistakes and how to self-check

Missing tolerance on calculated fields → false positives.
Self-check
Compare failure rates with ±0.00 vs ±0.02 tolerance; if many errors disappear with small tolerance, keep the tolerance.
Over-strict regex for emails → legitimate emails rejected.
Self-check
Use a simple pattern like ^.+@.+\..+$ for business data. Treat advanced validation as warning.
Rules without owners → rules rot over time.
Self-check
Add an owner column in your rule log and review monthly.
Introducing hard errors too early in legacy systems.
Self-check
Start as warnings, measure impact, then upgrade to errors.

Practical projects

Set up a validation layer for a sales dataset: create 12 rules, run weekly, and produce a one-page quality report.
Build a spreadsheet template with data validation for a small team intake form (drop-downs, numeric ranges, conditional formatting).
Write a Python or SQL script that checks rules and outputs a fail log with row_id, rule_name, severity, reason.

Learning path

Before: Basic data types, spreadsheet validation tools, simple SQL filters.
Now: Define and implement field, row, dataset, and relational rules with severities and tolerances.
Next: Automate checks in pipelines, add monitoring dashboards, and integrate with data catalogs.

Mini challenge

You receive an event_log dataset with columns: event_id, user_id, ts, event_type, revenue. Write 7 rules (mix of field, row, dataset, relational), mark each as error or warning, and explain why. Keep it under 10 minutes.

Next steps

Document your rules in a shared sheet or repo.
Promote 2–3 high-impact warnings to errors after a trial period.
Schedule a monthly quality review with stakeholders.

Menu

Data Validation Rules

Table of Contents