Why this matters
Data validation and quality checks protect your machine learning pipelines from silent failures. In real work, you will:
- Stop training on corrupted or shifted data before models degrade.
- Catch schema changes from upstream teams early (added/removed columns, type changes).
- Enforce business rules (e.g., amounts cannot be negative; user IDs must exist).
- Monitor freshness and volume so dashboards and predictions stay reliable.
Real tasks you might handle
- Define checks for a new feature pipeline feeding a recommender system.
- Set drift alerts comparing production data to training baselines.
- Quarantine and reprocess a bad batch without breaking SLAs.
Concept explained simply
Data validation confirms that incoming data matches what you expect. Quality checks measure whether the data is fit for purpose. Think of them as gates your data must pass through before it’s used.
Common data quality dimensions
- Schema: columns, types, and allowed values match.
- Completeness: acceptable missing rates.
- Validity: ranges, regex, categories, and business rules.
- Uniqueness: keys are not duplicated.
- Consistency/Integrity: cross-table relationships hold (e.g., foreign keys exist).
- Freshness/Timeliness: data is recent enough.
- Volume: expected number of rows per period.
- Drift/Distribution: stats similar to baseline or within safe bounds.
Mental model
Imagine an airport: multiple checkpoints verify identity, luggage, and timing. Your data passes through similar checkpoints. Each check either passes, warns, or fails. Critical failures stop the pipeline; warnings alert you but allow cautious continuation.
Worked examples
Example 1 — Schema and type checks
Scenario: A transactions table with columns: transaction_id (string), user_id (string), amount (float), currency (string), event_time (timestamp).
Checks:
1. Columns must be exactly: [transaction_id, user_id, amount, currency, event_time]
2. Types: amount is numeric; event_time is parseable timestamp.
3. currency in {"USD","EUR","GBP"}.
4. transaction_id unique (no duplicates).
5. amount >= 0.
Actions:
- Fail (stop) if columns/types mismatch or duplicates in primary key.
- Warn if the share of invalid currencies is above 0% and at most 0.5%.
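Below is a minimal pandas sketch of these five checks, assuming the data arrives as a DataFrame with the columns named in the scenario. The function name validate_transactions and the message strings are illustrative, not from any particular validation library; the 0.5% warn threshold mirrors the rule above.

```python
import pandas as pd

EXPECTED_COLUMNS = ["transaction_id", "user_id", "amount", "currency", "event_time"]
ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}

def validate_transactions(df: pd.DataFrame) -> list:
    """Return FAIL/WARN messages; an empty list means every check passed."""
    issues = []

    # 1. Exact column set (return early if required columns are missing entirely)
    missing = set(EXPECTED_COLUMNS) - set(df.columns)
    if missing:
        return [f"FAIL: missing columns {sorted(missing)}"]
    if list(df.columns) != EXPECTED_COLUMNS:
        issues.append(f"FAIL: columns are {list(df.columns)}, expected {EXPECTED_COLUMNS}")

    # 2. Types: amount numeric, event_time parseable as a timestamp
    amount_is_numeric = pd.api.types.is_numeric_dtype(df["amount"])
    if not amount_is_numeric:
        issues.append("FAIL: amount is not numeric")
    if pd.to_datetime(df["event_time"], errors="coerce").isna().any():
        issues.append("FAIL: event_time contains unparseable timestamps")

    # 3. Allowed currencies: warn up to 0.5% invalid, fail above that
    invalid_rate = (~df["currency"].isin(ALLOWED_CURRENCIES)).mean()
    if invalid_rate > 0.005:
        issues.append(f"FAIL: {invalid_rate:.2%} invalid currencies")
    elif invalid_rate > 0:
        issues.append(f"WARN: {invalid_rate:.2%} invalid currencies")

    # 4. Primary key uniqueness
    if df["transaction_id"].duplicated().any():
        issues.append("FAIL: duplicate transaction_id values")

    # 5. Business rule: amounts must be non-negative (only meaningful if numeric)
    if amount_is_numeric and (df["amount"] < 0).any():
        issues.append("FAIL: negative amounts found")

    return issues
```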
Example 2 — Distribution/drift checks
Scenario: Feature price in training vs last day’s production.
Baseline (training): mean ~ 48, std ~ 12.
Production (today): mean ~ 61, std ~ 28.
Rules:
- |mean_prod - mean_base| / std_base > 1.5 → DRIFT WARNING.
- PSI (10 bins) > 0.25 → DRIFT ALERT.
Actions:
- On ALERT, hold model scoring and route to canary or fallback model.
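A small NumPy sketch of both rules follows. The psi function is a straightforward hand-rolled implementation using equal-width bins fit on the baseline, not a call into a drift library, and the synthetic samples only approximate the means and standard deviations quoted above. On a sample like this, the mean-shift ratio stays under 1.5 while PSI comfortably exceeds 0.25 and raises the alert.

```python
import numpy as np

def mean_shift(baseline: np.ndarray, production: np.ndarray) -> float:
    """Absolute mean shift measured in baseline standard deviations."""
    return float(abs(production.mean() - baseline.mean()) / baseline.std())

def psi(baseline: np.ndarray, production: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index over equal-width bins computed on the baseline."""
    edges = np.histogram_bin_edges(baseline, bins=bins)
    # Clip production values into the baseline range so outliers land in the edge bins
    prod_clipped = np.clip(production, edges[0], edges[-1])
    base_pct = np.histogram(baseline, bins=edges)[0] / len(baseline)
    prod_pct = np.histogram(prod_clipped, bins=edges)[0] / len(production)
    # A small floor avoids log(0) and division by zero in sparse bins
    base_pct = np.clip(base_pct, 1e-6, None)
    prod_pct = np.clip(prod_pct, 1e-6, None)
    return float(np.sum((prod_pct - base_pct) * np.log(prod_pct / base_pct)))

# Synthetic samples shaped like the scenario above (mean 48/std 12 vs mean 61/std 28)
rng = np.random.default_rng(0)
baseline = rng.normal(48, 12, 10_000)
production = rng.normal(61, 28, 10_000)

shift = mean_shift(baseline, production)
stability = psi(baseline, production, bins=10)
print(f"mean shift = {shift:.2f} baseline std devs -> {'DRIFT WARNING' if shift > 1.5 else 'OK'}")
print(f"PSI = {stability:.2f} -> {'DRIFT ALERT' if stability > 0.25 else 'OK'}")
```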
Example 3 — Referential integrity and business rules
Scenario: orders.user_id must exist in users.id; cancel_time must be after create_time; status in an allowed set.
Checks:
- Anti-join orders.user_id to users.id → expect 0 orphan rows.
- cancel_time >= create_time (or null if not canceled).
- status in {"created","paid","shipped","canceled"}.
Actions:
- Orphans > 0 → Fail.
- Invalid times → Warn if <= 0.1%, else Fail.
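A hedged pandas sketch of these three checks, assuming orders and users arrive as DataFrames with the column names from the scenario. The anti-join is done with a left merge and indicator=True, and the 0.1% tolerance matches the warn rule above; the function name check_orders is illustrative.

```python
import pandas as pd

ALLOWED_STATUSES = {"created", "paid", "shipped", "canceled"}

def check_orders(orders: pd.DataFrame, users: pd.DataFrame) -> list:
    """Referential integrity and business rules for the orders table."""
    issues = []

    # Anti-join: orders whose user_id has no match in users.id should not exist
    merged = orders.merge(users[["id"]], left_on="user_id", right_on="id",
                          how="left", indicator=True)
    orphan_count = int((merged["_merge"] == "left_only").sum())
    if orphan_count > 0:
        issues.append(f"FAIL: {orphan_count} orders reference missing users")

    # cancel_time must be on or after create_time (null means not canceled)
    bad_times = orders["cancel_time"].notna() & (orders["cancel_time"] < orders["create_time"])
    bad_rate = bad_times.mean()
    if bad_rate > 0.001:
        issues.append(f"FAIL: {bad_rate:.3%} rows with cancel_time before create_time")
    elif bad_rate > 0:
        issues.append(f"WARN: {bad_rate:.3%} rows with cancel_time before create_time")

    # status must come from the allowed set
    bad_status = ~orders["status"].isin(ALLOWED_STATUSES)
    if bad_status.any():
        issues.append(f"FAIL: {int(bad_status.sum())} rows with unexpected status values")

    return issues
```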
Example 4 — Freshness and volume
Scenario: Daily partitioned events.
Checks:
- Max(event_time) within 3 hours of now for streaming; within 26 hours for daily batch.
- Row count within [avg_7d ± 25%].
Actions:
- If freshness or volume out of bounds → Alert and switch to cached snapshot.
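One way this could look in Python, assuming event_time is parseable and you keep a short history of recent daily row counts. The 26-hour default and the ±25% band mirror the batch rules above (pass max_lag_hours=3 for the streaming case); the function and parameter names are illustrative.

```python
import pandas as pd

def check_freshness_and_volume(df: pd.DataFrame,
                               recent_daily_counts: list,
                               max_lag_hours: float = 26.0) -> list:
    """Freshness: max(event_time) within max_lag_hours of now.
    Volume: today's row count within ±25% of the recent daily average."""
    issues = []

    # Freshness: how stale is the newest event?
    now = pd.Timestamp.now(tz="UTC")
    latest = pd.to_datetime(df["event_time"], utc=True).max()
    lag_hours = (now - latest).total_seconds() / 3600
    if lag_hours > max_lag_hours:
        issues.append(f"ALERT: data is {lag_hours:.1f}h old (limit {max_lag_hours}h)")

    # Volume: compare today's row count to the trailing average (e.g., last 7 days)
    avg = sum(recent_daily_counts) / len(recent_daily_counts)
    low, high = 0.75 * avg, 1.25 * avg
    if not (low <= len(df) <= high):
        issues.append(f"ALERT: row count {len(df)} outside [{low:.0f}, {high:.0f}]")

    return issues
```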
Minimal implementation plan
- Step 1: Define expectations with owners and severities (fail/warn/info).
- Step 2: Implement checks close to the data (SQL/Python) and version-control them.
- Step 3: Run checks before consuming or training; fail fast on critical rules.
- Step 4: Log detailed results (metric values, sample bad rows).
- Step 5: Alert and auto-remediate (quarantine, reprocess, fallback).
- Step 6: Review thresholds monthly based on historical distributions.
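As a sketch of what Steps 1 to 3 might look like in code, here is one way to keep expectations, owners, and severities together in a version-controlled Python module. The Expectation dataclass and run_expectations helper are illustrative, not part of any specific framework, and the sample rules reuse the transactions table from Example 1.

```python
from dataclasses import dataclass
from typing import Callable
import pandas as pd

@dataclass
class Expectation:
    name: str
    owner: str        # who gets paged and reviews threshold changes
    severity: str     # "fail", "warn", or "info"
    check: Callable[[pd.DataFrame], bool]

# Illustrative registry for the transactions table; keep this file in version control
EXPECTATIONS = [
    Expectation("amount_non_negative", "payments-team", "fail",
                lambda df: bool((df["amount"] >= 0).all())),
    Expectation("transaction_id_unique", "payments-team", "fail",
                lambda df: not df["transaction_id"].duplicated().any()),
    Expectation("currency_allowed", "payments-team", "warn",
                lambda df: bool(df["currency"].isin({"USD", "EUR", "GBP"}).all())),
]

def run_expectations(df: pd.DataFrame) -> None:
    """Run checks before training/scoring; fail fast on critical rules."""
    for exp in EXPECTATIONS:
        ok = exp.check(df)
        print(f"{exp.severity.upper():5} {exp.name:25} owner={exp.owner} -> "
              f"{'pass' if ok else 'violated'}")
        if not ok and exp.severity == "fail":
            raise RuntimeError(f"Critical expectation failed: {exp.name}")
```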
Starter checklist (copy/paste)
- Columns and types match.
- Primary key uniqueness.
- Required fields completeness ≥ threshold.
- Value ranges and allowed categories.
- Referential integrity across tables.
- Freshness and volume within bounds.
- Numeric distribution vs baseline (mean/std and PSI/KS).
- Categorical distribution vs baseline (top-k share changes).
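For the last item, a tiny sketch of one categorical drift check: PSI computed directly over category shares. The function and the example shares are illustrative (they are not the exercise data below), and the 0.1 warn / 0.25 alert bands follow the PSI guidance used elsewhere in this section.

```python
import math

def categorical_psi(base_shares: dict, prod_shares: dict) -> float:
    """PSI over category shares; unseen categories get a small floor to avoid log(0)."""
    total = 0.0
    for cat in set(base_shares) | set(prod_shares):
        b = max(base_shares.get(cat, 0.0), 1e-6)
        p = max(prod_shares.get(cat, 0.0), 1e-6)
        total += (p - b) * math.log(p / b)
    return total

# Illustrative shares for a hypothetical "channel" feature
baseline_shares = {"mobile": 0.60, "web": 0.30, "api": 0.10}
production_shares = {"mobile": 0.48, "web": 0.34, "api": 0.18}
value = categorical_psi(baseline_shares, production_shares)
print(f"categorical PSI = {value:.3f}")  # compare against the 0.1 warn / 0.25 alert bands
```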
Exercises
Work hands-on with two targeted tasks. Complete them here, then compare with the provided solutions.
Exercise 1: Build a lightweight data validation plan for a CSV
Scenario: You receive monthly transactions.csv with columns transaction_id, user_id, amount, currency, event_time. Define checks across schema, completeness, ranges, uniqueness, categories, and freshness. Aim for 8+ clear rules with pass/fail thresholds.
- Deliverable: a short checklist of rules and what happens on fail/warn.
- Tip: include at least one business rule (e.g., amount ≥ 0).
Exercise 2: Compare production vs training distributions and set alert thresholds
Given training vs production samples for numeric and categorical features, compute simple stats and decide pass/warn/fail using mean shift and PSI.
Data to use
Numeric feature (price)
Training: [10, 11, 9, 12, 10, 11, 13, 10, 12, 9]
Production: [16, 17, 15, 18, 16, 17, 16, 19, 15, 18]
Categorical feature (currency)
Training: USD:70%, EUR:20%, GBP:10%
Production: USD:45%, EUR:35%, GBP:20%
- Deliverable: a short drift report with metrics and decision.
- Tip: Use 3–10 bins for PSI; warn at 0.1–0.25, alert at > 0.25.
Common mistakes (and how to self-check)
- Mistake: Only schema checks, no statistical checks. Self-check: Do you monitor mean/std or PSI on key features?
- Mistake: Hard zero thresholds causing noisy alerts. Self-check: Are thresholds based on recent history with tolerances?
- Mistake: Checks run after consumption. Self-check: Do checks run before training/scoring?
- Mistake: No ownership/severity. Self-check: Does each rule name an owner and a fail/warn action?
- Mistake: Missing sample rows in reports. Self-check: Do your logs include examples of failures?
- Mistake: Not testing the tests. Self-check: Have you injected bad data to confirm alerts fire?
Practical projects
- Project 1: Build a validation suite for a user events pipeline (10–15 rules, with severities). Include a daily report summary.
- Project 2: Create a drift monitor comparing production vs training for 5 features, with alerts based on PSI/KS and category share shifts.
- Project 3: Implement a quarantine-and-reprocess workflow when critical checks fail, with a fallback cache for consumers.
Who this is for
- Machine Learning Engineers building or operating data pipelines.
- Data Scientists deploying models to production.
- Data/Analytics Engineers responsible for reliable datasets.
Prerequisites
- Basic SQL for querying and joins.
- Python or similar for simple metrics calculations.
- Understanding of your domain’s key business rules.
Learning path
- Start: Define minimal checks on one critical dataset (schema, completeness, uniqueness).
- Next: Add distribution/drift checks with baselines and thresholds.
- Then: Add cross-table integrity and business rules.
- Scale: Automate reporting, alerting, and quarantining; review thresholds monthly.
- Advance: Add canary scoring and automated rollback when drift alerts trigger.
Next steps
- Finish the exercises and take the quick test below to confirm understanding.
- Integrate your checks into one live pipeline and monitor for a week.
- Schedule a monthly threshold review using historical metrics.
Mini challenge
Pick one high-impact feature used by your model. Create 3 checks: a schema/type check, a distribution check (mean shift or PSI), and a business rule. Run them on yesterday’s data and write one paragraph on results and actions.
Quick Test
Take the test to check your understanding.