Why this matters
In MLOps, most production failures come from bad or unexpected data, not from model code. Adding data validation to Continuous Integration (CI) prevents deployments that would break training pipelines, silently degrade model performance, or leak sensitive information. Real tasks you will face:
- Blocking a merge when a critical column disappears or changes type.
- Failing the build when nulls in a key column exceed a threshold.
- Catching data drift before retraining schedules fire.
- Ensuring no personal data or target leakage sneaks into features.
- Publishing data validation reports as build artifacts for reviewers.
Concept explained simply
Data validation in CI is an automated gate that checks your incoming datasets or feature tables against rules (schema, quality, distribution) every time a change is introduced (code, pipeline, or data contract). If a rule fails, the CI job fails and the change cannot be merged.
Mental model
Think of it like a turnstile with sensors:
- Schema sensors: Are the expected columns present? Are types correct?
- Quality sensors: Are null/dup rates acceptable? Are ranges sane?
- Behavior sensors: Does the distribution look similar to a trusted baseline?
- Safety sensors: Any PII or label leakage?
If any sensor trips, the turnstile locks and the build stops.
Core checks to include in CI
- Schema & contracts: column names, dtypes, allowed enums, mandatory columns, uniqueness.
- Quality thresholds: max null %, duplicate row % or key uniqueness, valid ranges/patterns.
- Statistical drift: compare current sample vs baseline using PSI or KS; set per-feature thresholds.
- Leakage prevention: ensure no features are derived from the target; enforce time-aware splits (see the sketch after this list).
- PII safety: detect patterns (e.g., 16-digit numbers, emails) in non-PII fields; block if found.
- Data volume & freshness: minimum row count, timestamp recency windows.
- Artifact reporting: generate a human-readable validation report for reviewers.
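
Leakage prevention is the least mechanical of these checks, so here is a minimal sketch of two heuristics it can reduce to in CI: a near-perfect correlation between a feature and the target, and a "time-aware split" check that every training row precedes every test row. The file name features.csv, the event_time column, and the churned target are illustrative assumptions, not part of any real pipeline, and the sketch assumes a numeric 0/1 target.

# Minimal leakage sketch: flag features that correlate almost perfectly with the
# target, and confirm that all training rows precede all test rows in time.
import sys

import pandas as pd

def leakage_issues(df: pd.DataFrame, target: str, corr_limit: float = 0.98) -> list:
    issues = []
    numeric = df.select_dtypes("number").drop(columns=[target], errors="ignore")
    for col in numeric.columns:
        corr = df[col].corr(df[target])            # assumes a numeric 0/1 target
        if pd.notna(corr) and abs(corr) >= corr_limit:
            issues.append(f"{col} correlates {corr:.2f} with {target}: possible leakage")
    return issues

def split_issues(train: pd.DataFrame, test: pd.DataFrame, time_col: str) -> list:
    # Time-aware split: the newest training row must be older than the oldest test row.
    if train[time_col].max() >= test[time_col].min():
        return [f"training rows overlap or follow test rows on {time_col}"]
    return []

if __name__ == "__main__":
    df = pd.read_csv("features.csv", parse_dates=["event_time"])   # hypothetical feature table
    cutoff = df["event_time"].quantile(0.8)
    train, test = df[df["event_time"] <= cutoff], df[df["event_time"] > cutoff]
    problems = leakage_issues(df, target="churned") + split_issues(train, test, "event_time")
    if problems:
        print("\n".join(problems))
        sys.exit(1)   # critical violation: block the merge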
Workflow: add data validation to your CI pipeline
1. Define a data contract: list required columns, dtypes, constraints, and acceptable ranges. Store it versioned with the code.
2. Capture a baseline: save a clean sample or a statistics snapshot (e.g., feature means/quantiles) to compare against for drift.
3. Sample deterministically: pull a representative sample (small but stable). Prefer deterministic sampling (a fixed date window or seeded sampling) to avoid flaky tests; a sketch follows this list.
4. Run the validators: execute checks for schema, quality, drift, and safety. Fail fast on critical violations (e.g., missing columns, PII, leakage).
5. Publish the results: upload validation reports and metrics as build artifacts for code reviewers.
6. Gate the merge: if any critical rule fails, the CI job fails and the PR cannot be merged.
7. Maintain the baseline: if distribution shifts are expected (e.g., product changes), update the baseline in a reviewed PR.
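
The sketch below illustrates the deterministic-sampling step. The source path, the event_time column, and the window/seed values are assumptions; the key point is that the window is anchored on the data's own maximum timestamp and the sample is seeded, so reruns on the same data see identical rows.

# Sketch of deterministic sampling: fixed date window plus a seeded sample,
# so the same commit always validates the same rows (no flaky CI).
import pandas as pd

def sample_for_ci(source: str, out: str, days: int = 30, n: int = 50_000, seed: int = 42) -> None:
    df = pd.read_csv(source, parse_dates=["event_time"])   # assumed timestamp column
    end = df["event_time"].max().normalize()               # anchor on the data, not the wall clock
    window = df[df["event_time"] >= end - pd.Timedelta(days=days)]
    sample = window.sample(n=min(n, len(window)), random_state=seed)
    sample.to_csv(out, index=False)

if __name__ == "__main__":
    sample_for_ci("data/raw/transactions.csv", ".cache/sample.csv")   # hypothetical source file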
Example CI job (generic pseudo-YAML)
jobs:
  validate_data:
    runs-on: linux
    steps:
      - checkout code
      - install deps
      - run: python tools/sample_data.py --source data/raw --out .cache/sample.csv
      - run: python tools/validate.py \
              --data .cache/sample.csv \
              --contract contracts/contract.yaml \
              --baseline stats/baseline.json \
              --report artifacts/data_validation_report.html
      - save-artifact: artifacts/data_validation_report.html
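
tools/validate.py is not spelled out here; one possible skeleton is sketched below, assuming pandas and PyYAML are available. The contract keys (required_columns, max_null_pct), the JSON report format, and the unused --baseline flag are assumptions rather than a prescribed interface; the essential behaviour is exiting non-zero on critical violations so the CI job fails.

# One possible skeleton for tools/validate.py: load the contract, run checks,
# write a report, and exit non-zero if any critical rule failed.
import argparse
import json
import sys

import pandas as pd
import yaml  # PyYAML

def run_checks(df: pd.DataFrame, contract: dict) -> list:
    results = []
    for col in contract.get("required_columns", []):              # assumed contract key
        results.append({"rule": f"column {col} present", "ok": col in df.columns, "critical": True})
    for col, limit in contract.get("max_null_pct", {}).items():   # assumed contract key
        if col in df.columns:
            pct = 100 * df[col].isna().mean()
            results.append({"rule": f"{col} nulls <= {limit}%", "ok": pct <= limit, "critical": False})
    return results

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--data")
    parser.add_argument("--contract")
    parser.add_argument("--baseline")    # unused in this sketch; drift checks would consume it
    parser.add_argument("--report")
    args = parser.parse_args()

    df = pd.read_csv(args.data)
    with open(args.contract) as f:
        contract = yaml.safe_load(f)
    results = run_checks(df, contract)             # drift, PII, and leakage checks would slot in here
    with open(args.report, "w") as f:
        json.dump(results, f, indent=2)            # JSON stands in for the HTML report
    if any(r["critical"] and not r["ok"] for r in results):
        sys.exit(1)                                # critical failure: the CI job fails

if __name__ == "__main__":
    main()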
Worked examples
Example 1 — Schema and null thresholds
Scenario: The upstream team renamed customer_id to client_id and nulls in age increased.
- Rule: customer_id must exist and be unique; age nulls <= 2%.
- Observed: customer_id missing; age nulls = 6%.
- Outcome: CI fails. Action: coordinate schema rename via contract update, fix age pipeline.
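
A pandas sketch of how these two rules might be enforced; the input file name is assumed, and the thresholds are taken from the example.

# Sketch of Example 1's rules: customer_id present and unique, age nulls <= 2%.
import pandas as pd

df = pd.read_csv("customers.csv")   # hypothetical input

errors = []
if "customer_id" not in df.columns:
    errors.append("customer_id column missing")            # the upstream rename trips this
elif df["customer_id"].duplicated().any():
    errors.append("customer_id is not unique")

age_null_pct = 100 * df["age"].isna().mean()
if age_null_pct > 2.0:
    errors.append(f"age nulls {age_null_pct:.1f}% exceed the 2% threshold")

if errors:
    raise SystemExit("; ".join(errors))   # non-zero exit fails the CI step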
Example 2 — Drift detection with PSI
Scenario: avg_transaction_amount shifted due to a new pricing tier.
- Rule: PSI < 0.2 for this feature.
- Observed: PSI = 0.27.
- Outcome: CI fails. Action: confirm expected shift, retrain model, then update baseline in a PR.
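
PSI itself is not defined above: over quantile bins of the baseline, it sums (actual share - expected share) * ln(actual share / expected share). A minimal NumPy sketch follows; the bin count and the small clipping floor are conventional choices, not requirements.

# Sketch of PSI (Population Stability Index) between a baseline and a current sample.
# Bins come from the baseline's quantiles; 0.2 is the conventional "significant shift" cut-off.
import numpy as np

def psi(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf          # catch values outside the baseline range
    edges = np.unique(edges)                       # guard against duplicate quantile edges
    expected = np.histogram(baseline, bins=edges)[0] / len(baseline)
    actual = np.histogram(current, bins=edges)[0] / len(current)
    expected = np.clip(expected, 1e-6, None)       # avoid log(0) and division by zero
    actual = np.clip(actual, 1e-6, None)
    return float(np.sum((actual - expected) * np.log(actual / expected)))

# Usage: fail the build if psi(baseline_amounts, current_amounts) >= 0.2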
Example 3 — PII pattern safety
Scenario: A free-text feature contains email addresses.
- Rule: No email pattern allowed in feature_text.
- Observed: 14 matches.
- Outcome: CI fails. Action: add PII scrubbing upstream or drop offending field.
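
A sketch of such a pattern gate, assuming the feature lives in a column named feature_text of a CSV file; the regexes are illustrative and deliberately simple.

# Sketch of a PII pattern gate: scan a free-text feature for e-mail addresses
# and 16-digit card-like numbers; any match is treated as a critical failure.
import re
import sys

import pandas as pd

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
CARD = re.compile(r"\b(?:\d[ -]?){16}\b")

df = pd.read_csv("features.csv")   # hypothetical input with a feature_text column
text = df["feature_text"].astype(str)
hits = int(text.str.contains(EMAIL).sum() + text.str.contains(CARD).sum())

if hits:
    print(f"{hits} rows in feature_text match PII patterns")
    sys.exit(1)   # block the merge until the field is scrubbed or dropped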
Who this is for
- MLOps Engineers integrating ML pipelines into CI/CD.
- Data/ML Engineers maintaining reliable feature stores.
- Data Scientists contributing to production-grade pipelines.
Prerequisites
- Comfort with CI systems and writing basic CI jobs.
- Intermediate Python and data manipulation (pandas/SQL).
- Basic statistics: distributions, quantiles, PSI/KS intuition.
Learning path
- Start: Define a minimal data contract for one dataset.
- Add: Schema and quality checks in CI; publish a validation report.
- Extend: Add drift checks with a controlled baseline.
- Harden: Add PII/leakage checks; make failures block merges.
- Scale: Apply the same pattern to each critical dataset feeding your models.
Exercises
The hands-on tasks below are short and practical; do them now.
Exercise 1 — Define a robust data schema gate
Dataset: transactions.csv with columns [transaction_id, customer_id, amount, currency, event_time].
- Require transaction_id to be unique, customer_id not null, and amount in the range [0, 10000].
- Allow currency in {USD, EUR, GBP} only.
- Max nulls: amount ≤ 1%, currency ≤ 0.5%.
- event_time must be within the last 60 days.
Deliverable: Write rules (YAML or pseudo-code) that a validator could run in CI, and decide which failures block the build.
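
If you want a starting point, the partial sketch below expresses a few of the rules as a plain Python dict a validator could consume; every key name is invented, and the remaining constraints are left for you to fill in.

# Partial starter for Exercise 1 (key names are invented).
rules = {
    "required_columns": ["transaction_id", "customer_id", "amount", "currency", "event_time"],
    "unique": ["transaction_id"],
    "enums": {"currency": ["USD", "EUR", "GBP"]},
    # TODO: add not-null, range, max-null-percentage, and freshness rules,
    # then decide which rule families should block the build.
    "blocking": ["required_columns", "unique"],
}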
Exercise 2 — Add drift and leakage guards
Model: churn prediction with features [tenure_days, monthly_spend, support_tickets, plan_tier] and target churned.
- Drift: PSI thresholds — tenure_days < 0.2, monthly_spend < 0.15, support_tickets < 0.2.
- Leakage: Ensure no feature uses churned or future information; enforce a time-aware split.
Deliverable: Describe CI steps that compute drift vs a baseline, fail on threshold, and a leakage rule that blocks merges.
Checklist: before merging
- Contract file exists and is versioned with the code.
- Deterministic sampling strategy documented.
- Critical checks (schema, PII, leakage) fail fast.
- Drift thresholds are per-feature and justified.
- Validation report is uploaded as a build artifact.
- Baseline updates require a reviewed PR.
Common mistakes and self-check
- Flaky tests: random samples change results. Fix by using fixed windows or seeded sampling.
- Global drift threshold: hides important shifts. Use per-feature thresholds.
- Silent contract changes: schema changes merged without review. Enforce contract bumps in CI.
- Ignoring volume/freshness: a dataset with 10 rows passes. Add minimum row count and recency checks.
- PII checks only in production: run the same safety checks in CI to block issues early.
Self-check prompts
- Can you trace which rule would have blocked the last data incident your team saw?
- If drift is expected tomorrow, do you know the approved process to update baselines?
- Are your checks fast enough to run on every PR (< 3–5 minutes)?
Practical projects
- Project 1: Add a data contract and schema checks to one production dataset; publish an HTML report artifact.
- Project 2: Create a drift baseline and thresholds for three key features; block merges on PSI/KS violations.
- Project 3: Implement PII and leakage guards; demonstrate a PR failing due to detected issues, then fix and pass.
Mini challenge
Your nightly retrain job failed last week due to a missing column. Propose a minimal CI validation suite (3–5 rules) that would have prevented it, and describe how you would keep it fast and deterministic.
Next steps
- Roll out the same validation pattern to every dataset that feeds a production model.
- Automate baseline refresh with an approval workflow.
- Track validation metrics over time to spot trends before they become incidents.