
Data Validation In CI

Learn Data Validation in CI for free, with explanations, exercises, and a quick test (aimed at MLOps Engineers).

Published: January 4, 2026 | Updated: January 4, 2026

Why this matters

In MLOps, most production failures come from bad or unexpected data, not from model code. Adding data validation into Continuous Integration (CI) prevents deploys that would break training pipelines, silently degrade model performance, or leak sensitive information. Real tasks you will face:

  • Blocking a merge when a critical column disappears or changes type.
  • Failing the build when nulls in a key column exceed a threshold.
  • Catching data drift before retraining schedules fire.
  • Ensuring no personal data or target leakage sneaks into features.
  • Publishing data validation reports as build artifacts for reviewers.

Concept explained simply

Data validation in CI is an automated gate that checks your incoming datasets or feature tables against rules (schema, quality, distribution) every time a change is introduced (code, pipeline, or data contract). If a rule fails, the CI job fails and the change cannot be merged.

Mental model

Think of it like a turnstile with sensors:

  • Schema sensors: Are the expected columns present? Are types correct?
  • Quality sensors: Are null/dup rates acceptable? Are ranges sane?
  • Behavior sensors: Does the distribution look similar to a trusted baseline?
  • Safety sensors: Any PII or label leakage?

If any sensor trips, the turnstile locks and the build stops.

Core checks to include in CI

  • Schema & contracts: column names, dtypes, allowed enums, mandatory columns, uniqueness.
  • Quality thresholds: max null %, duplicate row % or key uniqueness, valid ranges/patterns.
  • Statistical drift: compare current sample vs baseline using PSI or KS; set per-feature thresholds.
  • Leakage prevention: ensure no features derived from the target; enforce time-aware splits.
  • PII safety: detect patterns (e.g., 16-digit numbers, emails) in non-PII fields; block if found.
  • Data volume & freshness: minimum row count, timestamp recency windows.
  • Artifact reporting: generate human-readable validation report for reviewers.
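
To make the quality, volume, and freshness checks above concrete, here is a minimal pandas sketch (the function name, thresholds, and the `event_time` column are illustrative assumptions, not a specific library's API):

```python
import pandas as pd

def quality_checks(df: pd.DataFrame, max_null_pct: float = 1.0,
                   min_rows: int = 1000, max_age_days: int = 60) -> list[str]:
    """Return human-readable failures; an empty list means the gate passes."""
    failures = []
    # Volume: too few rows usually means a broken upstream extract.
    if len(df) < min_rows:
        failures.append(f"row count {len(df)} below minimum {min_rows}")
    # Quality: per-column null percentage against a threshold.
    null_pct = df.isna().mean() * 100
    for col, pct in null_pct.items():
        if pct > max_null_pct:
            failures.append(f"{col}: {pct:.1f}% nulls exceeds {max_null_pct}%")
    # Freshness: the newest event must fall inside the recency window.
    if "event_time" in df.columns:
        newest = pd.to_datetime(df["event_time"], utc=True).max()
        age = (pd.Timestamp.now(tz="UTC") - newest).days
        if age > max_age_days:
            failures.append(f"data is {age} days old (limit {max_age_days})")
    return failures
```

In CI, a non-empty return value would fail the job and surface the messages in the build log.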

Workflow: add data validation to your CI pipeline

Step 1 — Define a data contract
List required columns, dtypes, constraints, and acceptable ranges. Store it versioned with code.
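
As a sketch, a contract for a transactions table (the file path, field names, and constraint keys are illustrative) might look like:

```yaml
# contracts/transactions.yaml (illustrative names and keys)
dataset: transactions
columns:
  transaction_id: {dtype: string,    required: true, unique: true}
  customer_id:    {dtype: string,    required: true, nullable: false}
  amount:         {dtype: float,     required: true, min: 0, max: 10000, max_null_pct: 1.0}
  currency:       {dtype: string,    required: true, enum: [USD, EUR, GBP], max_null_pct: 0.5}
  event_time:     {dtype: timestamp, required: true, max_age_days: 60}
```
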
Step 2 — Prepare a baseline
Save a clean sample or statistics snapshot (e.g., feature means/quantiles) to compare drift.
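
A baseline snapshot can be as simple as per-feature means and quantiles serialized to JSON. A minimal sketch (function name and file format are assumptions):

```python
import json

import pandas as pd

def save_baseline(df: pd.DataFrame, path: str) -> None:
    """Snapshot per-feature statistics for later drift comparison."""
    stats = {}
    for col in df.select_dtypes("number").columns:
        s = df[col].dropna()
        stats[col] = {
            "mean": float(s.mean()),
            # Deciles capture enough shape for PSI binning without storing raw rows.
            "quantiles": [float(q) for q in s.quantile([i / 10 for i in range(11)])],
        }
    with open(path, "w") as f:
        json.dump(stats, f, indent=2)
```

Commit the resulting file (e.g., under stats/) so baseline changes go through code review.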
Step 3 — Sample data in CI
Pull a representative sample (small but stable). Prefer deterministic sampling (fixed date window or seeded sampling) to avoid flaky tests.
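
Seeded sampling keeps CI runs reproducible: the same input always yields the same rows. A minimal pandas sketch (names and defaults are illustrative):

```python
import pandas as pd

def deterministic_sample(df: pd.DataFrame, n: int = 10_000, seed: int = 42) -> pd.DataFrame:
    """Seeded sample so every CI run validates the same rows."""
    if len(df) <= n:
        return df
    # A fixed random_state makes the selection repeatable across runs.
    return df.sample(n=n, random_state=seed)
```
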
Step 4 — Run validators
Execute validators for schema, quality, drift, and safety. Fail fast on critical violations (e.g., missing columns, PII, leakage).
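
Fail-fast gating can be expressed as a severity check that turns critical findings into a non-zero exit code, which is what makes the CI job fail. A sketch (severity labels are illustrative):

```python
import sys

def gate(failures):
    """failures: (severity, message) pairs; severity is 'critical' or 'warning'.
    Critical failures block the merge; warnings only appear in the report."""
    blocking = [msg for sev, msg in failures if sev == "critical"]
    for msg in blocking:
        print(f"CRITICAL: {msg}", file=sys.stderr)
    # A non-zero exit code makes the CI job, and therefore the merge gate, fail.
    return 1 if blocking else 0
```
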
Step 5 — Publish artifacts
Upload validation reports and metrics as build artifacts for code reviewers.
Step 6 — Gate the merge
If any critical rule fails, the CI job fails and the PR cannot be merged.
Step 7 — Update baselines (when appropriate)
If distribution shifts are expected (e.g., product changes), update the baseline in a reviewed PR.

Example CI job (generic pseudo-YAML)
jobs:
  validate_data:
    runs-on: linux
    steps:
      - checkout code
      - install deps
      - run: python tools/sample_data.py --source data/raw --out .cache/sample.csv
      - run: python tools/validate.py \
             --data .cache/sample.csv \
             --contract contracts/contract.yaml \
             --baseline stats/baseline.json \
             --report artifacts/data_validation_report.html
      - save-artifact: artifacts/data_validation_report.html

Worked examples

Example 1 — Schema and null thresholds

Scenario: The upstream team renamed customer_id to client_id and nulls in age increased.

  • Rule: customer_id must exist and be unique; age nulls <= 2%.
  • Observed: customer_id missing; age nulls = 6%.
  • Outcome: CI fails. Action: coordinate schema rename via contract update, fix age pipeline.
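
A sketch of these two rules in pandas (column names follow the example, thresholds as stated above):

```python
import pandas as pd

def check_example_1(df: pd.DataFrame) -> list[str]:
    """Schema presence/uniqueness plus a null-rate threshold."""
    failures = []
    # Schema: the key column must exist and be unique.
    if "customer_id" not in df.columns:
        failures.append("missing required column: customer_id")
    elif df["customer_id"].duplicated().any():
        failures.append("customer_id is not unique")
    # Quality: age nulls must stay at or below 2%.
    if "age" in df.columns:
        null_pct = df["age"].isna().mean() * 100
        if null_pct > 2.0:
            failures.append(f"age nulls {null_pct:.1f}% exceed 2%")
    return failures
```

Run against the renamed upstream table, this returns both failures and the build stops.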

Example 2 — Drift detection with PSI

Scenario: avg_transaction_amount shifted due to a new pricing tier.

  • Rule: PSI < 0.2 for this feature.
  • Observed: PSI = 0.27.
  • Outcome: CI fails. Action: confirm expected shift, retrain model, then update baseline in a PR.
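
A common way to compute PSI is to bin the baseline into deciles and compare bin proportions. A sketch assuming roughly continuous features (heavy ties in the data could produce duplicate bin edges, which a production implementation should handle):

```python
import numpy as np

def psi(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a baseline and a current sample."""
    # Bin edges come from baseline quantiles, so each bin holds ~10% of baseline.
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # cover out-of-range current values
    b_counts, _ = np.histogram(baseline, bins=edges)
    c_counts, _ = np.histogram(current, bins=edges)
    # A small floor avoids log(0) / division by zero in empty bins.
    b_pct = np.clip(b_counts / b_counts.sum(), 1e-6, None)
    c_pct = np.clip(c_counts / c_counts.sum(), 1e-6, None)
    return float(np.sum((c_pct - b_pct) * np.log(c_pct / b_pct)))
```

With the usual rule of thumb, PSI below 0.1 means little change, 0.1 to 0.2 moderate change, and above 0.2 significant drift, matching the threshold in this example.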

Example 3 — PII pattern safety

Scenario: A free-text feature contains email addresses.

  • Rule: No email pattern allowed in feature_text.
  • Observed: 14 matches.
  • Outcome: CI fails. Action: add PII scrubbing upstream or drop offending field.
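
A sketch of naive pattern checks (these regexes are illustrative; real PII scanners use much broader rule sets and validation such as Luhn checks for card numbers):

```python
import re

# Illustrative patterns only; production scanners are far more thorough.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD = re.compile(r"\b\d{16}\b")  # naive 16-digit card-number pattern

def pii_matches(texts) -> int:
    """Count values containing an email or card-like pattern."""
    return sum(bool(EMAIL.search(t) or CARD.search(t)) for t in texts)
```

In CI, any match count above zero on a non-PII field would be treated as a critical failure.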

Who this is for

  • MLOps Engineers integrating ML pipelines into CI/CD.
  • Data/ML Engineers maintaining reliable feature stores.
  • Data Scientists contributing to production-grade pipelines.

Prerequisites

  • Comfort with CI systems and writing basic CI jobs.
  • Intermediate Python and data manipulation (pandas/SQL).
  • Basic statistics: distributions, quantiles, PSI/KS intuition.

Learning path

  • Start: Define a minimal data contract for one dataset.
  • Add: Schema and quality checks in CI; publish a validation report.
  • Extend: Add drift checks with a controlled baseline.
  • Harden: Add PII/leakage checks; make failures block merges.
  • Scale: Apply the same pattern to each critical dataset feeding your models.

Exercises

Do these hands-on tasks now; they are short and practical.

  1. Exercise 1 — Define a robust data schema gate

    Dataset: transactions.csv with columns [transaction_id, customer_id, amount, currency, event_time].

    • Require transaction_id unique, customer_id not null, amount in [0, 10000].
    • Allow currency in {USD, EUR, GBP} only.
    • Max nulls: amount ≤ 1%, currency ≤ 0.5%.
    • event_time must be within the last 60 days.

    Deliverable: Write rules (YAML or pseudo) that a validator could run in CI and choose which failures block the build.

  2. Exercise 2 — Add drift and leakage guards

    Model: churn prediction with features [tenure_days, monthly_spend, support_tickets, plan_tier] and target churned.

    • Drift: PSI thresholds — tenure_days < 0.2, monthly_spend < 0.15, support_tickets < 0.2.
    • Leakage: Ensure no feature uses churned or future info; enforce time-aware split.

    Deliverable: Describe CI steps that compute drift vs a baseline, fail on threshold, and a leakage rule that blocks merges.

Checklist: before merging

  • Contract file exists and is versioned with the code.
  • Deterministic sampling strategy documented.
  • Critical checks (schema, PII, leakage) fail fast.
  • Drift thresholds are per-feature and justified.
  • Validation report is uploaded as a build artifact.
  • Baseline updates require a reviewed PR.

Common mistakes and self-check

  • Flaky tests: random samples change results. Fix by using fixed windows or seeded sampling.
  • Global drift threshold: hides important shifts. Use per-feature thresholds.
  • Silent contract changes: schema changes merged without review. Enforce contract bumps in CI.
  • Ignoring volume/freshness: a dataset with 10 rows passes. Add min row and recency checks.
  • PII only in prod: add the same safety checks in CI to block early.

Self-check prompts
  • Can you trace which rule would have blocked the last data incident your team saw?
  • If drift is expected tomorrow, do you know the approved process to update baselines?
  • Are your checks fast enough to run on every PR (< 3–5 minutes)?

Practical projects

  • Project 1: Add a data contract and schema checks to one production dataset; publish an HTML report artifact.
  • Project 2: Create a drift baseline and thresholds for three key features; block merges on PSI/KS violations.
  • Project 3: Implement PII and leakage guards; demonstrate a PR failing due to detected issues, then fix and pass.

Quick Test

Test your knowledge with 10 questions; score 70% or higher to pass.

Mini challenge

Your nightly retrain job failed last week due to a missing column. Propose a minimal CI validation suite (3–5 rules) that would have prevented it, and describe how you would keep it fast and deterministic.

Next steps

  • Roll out the same validation pattern to every dataset that feeds a production model.
  • Automate baseline refresh with an approval workflow.
  • Track validation metrics over time to spot trends before they become incidents.

Expected output (Exercise 1)

A set of machine-checkable rules covering schema, uniqueness, ranges, enums, null thresholds, and a time recency rule, with critical (build-blocking) failures identified.
