Why this matters
As an MLOps Engineer, you turn raw data into reliable training sets. Real work looks like this: pulling data from multiple sources on a schedule, catching broken schemas before they break training, deduplicating and handling late arrivals, and producing reproducible, partitioned datasets with clear audit trails. Mastering ingestion and validation lets you:
- Protect training runs from bad data with automated checks.
- Make pipelines idempotent so reruns don’t duplicate or corrupt data.
- Enable fast backfills and consistent historical training windows.
Concept explained simply
Data ingestion moves data from sources into your storage in a structured way. Validation checks that the data matches your expectations. Think of ingestion as the conveyor belt and validation as quality control gates. Only data that passes the gates moves on to model training.
Mental model: the gated conveyor
Raw → Staging → Validated → Curated
- Raw: just landed, no assumptions.
- Staging: normalized formats and basic parsing.
- Validated: schema, types, ranges, uniqueness, domains, null ratios checked.
- Curated: partitioned, deduped, ready for training with metadata and versioning.
Each gate has rules and severity (fail-fast vs warn-and-log). Idempotency ensures you can re-run without changing correct results.
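To make the gate-and-severity idea concrete, here is a minimal sketch in Python. The Rule dataclass, the severity labels, and the example field names are illustrative assumptions, not a prescribed framework:

```python
import logging
from dataclasses import dataclass
from typing import Callable

log = logging.getLogger("ingestion.gates")

@dataclass
class Rule:
    name: str
    check: Callable[[list[dict]], bool]  # returns True when the batch passes
    severity: str                        # "critical" blocks publish; "warn" logs and proceeds

def run_gate(rows: list[dict], rules: list[Rule]) -> bool:
    """Apply every rule; return False if any critical rule fails."""
    publishable = True
    for rule in rules:
        if rule.check(rows):
            continue
        if rule.severity == "critical":
            log.error("critical rule failed: %s", rule.name)
            publishable = False
        else:
            log.warning("warning rule failed: %s", rule.name)
    return publishable

# Illustrative rules: thresholds and field names are examples only.
rules = [
    Rule("non_empty_batch", lambda rows: len(rows) > 0, "critical"),
    Rule("amount_positive", lambda rows: all(r.get("amount", 0) > 0 for r in rows), "critical"),
    Rule("device_null_ratio_le_5pct",
         lambda rows: sum(r.get("device") is None for r in rows) <= 0.05 * max(len(rows), 1),
         "warn"),
]
```

Because the gate only reads its inputs and reports outcomes, rerunning it on the same batch always yields the same decision, which is what makes it safe inside an idempotent pipeline.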
Who this is for
- MLOps Engineers building batch training pipelines.
- Data/ML Engineers who own feature pipelines and model retraining.
- Data Scientists who need dependable, reproducible datasets.
Prerequisites
- Comfort with SQL and a scripting language (e.g., Python).
- Basic understanding of files/formats (CSV, JSON, Parquet) and partitioning.
- Familiarity with batch scheduling concepts (daily/hourly windows).
Core building blocks
- Data sources and formats: CSV/JSON for interchange; Parquet/Avro for efficient analytics; consistent UTF-8 and line endings.
- Schemas and contracts: explicit field names, types, nullability, and domain rules (enums, ranges). Version them (a contract sketch follows this list).
- Batch windows and partitioning: organize by event date (e.g., dt=YYYY-MM-DD). Use watermarks to distinguish late data.
- Idempotency: stage, validate, dedup, and merge using primary keys and checksums; never append blindly.
- Validation rule types: schema/type, required fields, ranges, domain enums, regex, uniqueness, referential checks, null ratio, row count expectations, drift metrics.
- Severity and gating: critical failures block publish; warnings log and proceed; all issues are captured in audit logs.
- Metadata and lineage: keep run id, source files, counts, rejected-row samples, and rule results.
- Reproducibility: pin data versions and code; store rule configs; record run parameters.
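As one possible shape for such a versioned contract, here is a sketch expressed as a Python dict; the field names and thresholds mirror Example 1 below, and everything else (key names, the optional source column) is an assumption rather than a standard:

```python
# A versioned data contract expressed as a plain Python dict; in practice it would
# usually live as JSON/YAML under version control next to the pipeline code.
TRANSACTIONS_CONTRACT_V1 = {
    "dataset": "transactions",
    "version": 1,
    "primary_key": ["transaction_id"],
    "partitioning": "dt",  # event-date partition, dt=YYYY-MM-DD
    "fields": {
        "transaction_id":  {"type": "string",    "nullable": False, "unique": True},
        "amount":          {"type": "decimal",   "nullable": False, "range": (0, 10000)},  # (0, 10000]
        "currency":        {"type": "string",    "nullable": False, "enum": ["USD", "EUR", "GBP"]},
        "ts":              {"type": "timestamp", "nullable": False},
        "last_updated_at": {"type": "timestamp", "nullable": False},
        "source":          {"type": "string",    "nullable": True},
    },
    "batch_rules": [
        {"name": "row_count_min",  "threshold": 1,     "severity": "critical"},
        {"name": "null_ratio_max", "column": "source", "threshold": 0.05, "severity": "warn"},
        {"name": "ts_window_days", "threshold": 3,     "severity": "critical"},
    ],
}
```

Keeping the contract as data rather than code means validation logic, documentation, and audit logs can all reference the same pinned version.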
Worked examples
Example 1: Daily CSV to Curated Parquet with validation
Scenario: Ingest transactions_2025-06-01.csv; publish to curated/transactions/dt=2025-06-01 in Parquet.
- Staging: normalize CSV (delimiter, header), parse types.
- Validate (critical): schema/types match contract; transaction_id unique; amount in (0, 10000]; currency in {USD, EUR, GBP}; ts within ±3 days of batch date.
- Validate (warn): null ratio of the source column <= 5%.
- Dedup: keep the latest record per transaction_id, ordered by last_updated_at.
- Publish: write Parquet partition; write audit JSON with rule outcomes and row counts.
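A minimal end-to-end sketch of these steps, assuming pandas plus pyarrow and a local directory layout; the file paths, the optional source column, and the exact ordering of checks are illustrative assumptions:

```python
import json
from pathlib import Path
import pandas as pd  # plus pyarrow (or fastparquet) for the Parquet write

BATCH_DATE = pd.Timestamp("2025-06-01", tz="UTC")
ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}
EXPECTED_COLUMNS = {"transaction_id", "amount", "currency", "ts", "last_updated_at", "source"}

# Staging: load the normalized CSV.
df = pd.read_csv("staging/transactions_2025-06-01.csv", dtype={"transaction_id": "string"})

# Schema gate (fail fast): the column set must match the contract exactly.
if set(df.columns) != EXPECTED_COLUMNS:
    raise ValueError(f"schema mismatch on columns: {sorted(set(df.columns) ^ EXPECTED_COLUMNS)}")

# Parse types after the schema gate so missing columns fail with a clear message.
df["ts"] = pd.to_datetime(df["ts"], utc=True)
df["last_updated_at"] = pd.to_datetime(df["last_updated_at"], utc=True)

# Critical checks: any failure blocks publish.
failures = []
if df["transaction_id"].duplicated().any():
    # Critical per the contract; relax to a warning if upstream re-sends are expected
    # and you prefer to rely on the keep-latest dedup below.
    failures.append("transaction_id not unique")
if not df["amount"].between(0, 10000, inclusive="right").all():
    failures.append("amount outside (0, 10000]")
if not df["currency"].isin(ALLOWED_CURRENCIES).all():
    failures.append("currency outside allowed enum")
if ((df["ts"] - BATCH_DATE).abs() > pd.Timedelta(days=3)).any():
    failures.append("ts outside +/- 3 days of batch date")

# Warning checks: recorded in the audit, do not block publish.
warnings = []
if df["source"].isna().mean() > 0.05:
    warnings.append(f"source null ratio {df['source'].isna().mean():.2%} exceeds 5%")

if failures:
    raise ValueError(f"validation failed, not publishing: {failures}")

# Dedup: keep the latest record per transaction_id.
df = df.sort_values("last_updated_at").drop_duplicates(subset="transaction_id", keep="last")

# Publish: curated Parquet partition plus an audit artifact.
out_dir = Path("curated/transactions/dt=2025-06-01")
out_dir.mkdir(parents=True, exist_ok=True)
df.to_parquet(out_dir / "part-000.parquet", index=False)
Path("audit").mkdir(exist_ok=True)
audit = {"rows_out": int(len(df)), "failures": failures, "warnings": warnings}
(Path("audit") / "transactions_2025-06-01_audit.json").write_text(json.dumps(audit, indent=2))
```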
Edge cases handled
- Extra columns: fail with an actionable message.
- Missing currency column: fail the schema gate.
- Late record with a ts two weeks old: drop it or route it to a late-data bucket for controlled backfill.
Example 2: Idempotent re-runs and watermarking
Scenario: Rerun 2025-06-01 due to upstream fix.
- Detect identical source files via checksums; if unchanged, short-circuit.
- Recreate staging; apply the exact same validation config (versioned).
- Merge into curated using the primary key transaction_id and last_updated_at recency logic (upsert).
- The watermark for the training window remains event-time based, not ingestion-time based, so the model sees consistent date slices.
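One way to sketch the checksum short-circuit and the keep-latest upsert described above, again assuming pandas and local files; the helper names and the checksum-log location are assumptions:

```python
import hashlib
from pathlib import Path
import pandas as pd

def file_sha256(path: Path) -> str:
    """Checksum used to detect whether a source file changed between runs."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def already_published(source: Path, checksum_log: Path) -> bool:
    """True if this exact file content was already published successfully."""
    if not checksum_log.exists():
        return False
    return file_sha256(source) in checksum_log.read_text().splitlines()

def record_success(source: Path, checksum_log: Path) -> None:
    """Append the checksum only after the curated partition and audit are written."""
    checksum_log.parent.mkdir(parents=True, exist_ok=True)
    with checksum_log.open("a") as f:
        f.write(file_sha256(source) + "\n")

def upsert_partition(new_rows: pd.DataFrame, partition_file: Path) -> pd.DataFrame:
    """Merge new rows into the existing partition, keeping the newest row per key."""
    if partition_file.exists():
        new_rows = pd.concat([pd.read_parquet(partition_file), new_rows], ignore_index=True)
    return (new_rows.sort_values("last_updated_at")
                    .drop_duplicates(subset="transaction_id", keep="last"))

# Usage shape: if already_published(src, log): stop; else validate, upsert, publish, record_success.
```

Recording the checksum only after a successful publish means a failed run can be retried without the short-circuit skipping it.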
Outcome
A rerun produces identical curated data and identical audit metrics (idempotency). If the source changed, only legitimate upserts occur.
Example 3: Schema evolution with a new nullable column
Scenario: Upstream adds field channel (nullable, enum {web, app}).
- Bump the contract version (v1 → v2). During validation, accept either v1 or v2 for a grace period.
- If channel is missing, set it to null; if present, enforce the enum.
- Optional backfill: populate channel from source logs later.
Risk control
Fail only if the new column is non-nullable or breaks downstream consumers; otherwise warn and proceed while tracking the adoption rate.
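A sketch of the v1/v2 grace-period check described in Example 3, reusing the column set from Example 1; the function name and column sets are illustrative assumptions:

```python
import pandas as pd

CHANNEL_ENUM = {"web", "app"}
V1_COLUMNS = {"transaction_id", "amount", "currency", "ts", "last_updated_at", "source"}
V2_COLUMNS = V1_COLUMNS | {"channel"}

def apply_contract_grace_period(df: pd.DataFrame) -> pd.DataFrame:
    """Accept v1 or v2 batches during the migration window."""
    cols = set(df.columns)
    if cols == V1_COLUMNS:
        # v1 batch: the new nullable column is absent, so fill it with nulls.
        return df.assign(channel=pd.NA)
    if cols == V2_COLUMNS:
        # v2 batch: enforce the enum on non-null values only.
        non_null = df["channel"].dropna()
        bad = non_null[~non_null.isin(CHANNEL_ENUM)]
        if not bad.empty:
            raise ValueError(f"channel values outside enum: {sorted(bad.unique())}")
        return df
    raise ValueError(f"columns match neither v1 nor v2 of the contract: {sorted(cols)}")
```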
Hands-on exercises
These mirror the exercises below. Do them now; the quick test at the end helps you self-check.
- Exercise 1: Draft a validation rule set for a training transactions dataset (schema provided).
- Exercise 2: Write an idempotent batch ingestion plan (pseudocode/steps) with checkpoints and audit logging.
Tips before you start
- Write rules as if they’re code-owned artifacts: explicit, versioned, testable.
- Separate critical fail rules from warnings.
- Prefer event-time partitioning and upsert merges for idempotency.
Common mistakes and how to self-check
- Using ingestion time for partitions instead of event time. Self-check: Can you reproduce a training window if files arrive late?
- Appending without dedup. Self-check: Rerun yesterday’s job and verify counts and unique keys remain stable.
- Loose schemas. Self-check: Intentionally add an extra column to input; does your pipeline fail fast with a clear message?
- No audit trail. Self-check: Can you answer “which files were used and which rows were rejected” for a given run id?
- Overfitting validation. Self-check: Are your thresholds (e.g., null ratios) realistic across seasonal changes?
Quick sanity checklist before publish
- Contract version pinned and logged
- Row counts in/out make sense
- Unique keys verified
- Late data policy applied
- Audit artifacts written
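One way to encode this checklist as a final publish gate; the audit-dictionary keys here are assumptions, not a standard schema:

```python
def pre_publish_problems(audit: dict) -> list[str]:
    """Return a list of problems; publish only when the list is empty."""
    problems = []
    if not audit.get("contract_version"):
        problems.append("contract version not pinned and logged")
    rows_in, rows_out = audit.get("rows_in", 0), audit.get("rows_out", 0)
    if rows_out == 0 or rows_out > rows_in:
        problems.append(f"row counts look wrong: {rows_in} in, {rows_out} out")
    if audit.get("duplicate_keys", 0) > 0:
        problems.append("unique keys not verified")
    if "late_rows_routed" not in audit:
        problems.append("late data policy not applied or not recorded")
    if not audit.get("audit_path"):
        problems.append("audit artifacts not written")
    return problems
```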
Practical projects
- Project 1: Build a daily transaction ingestion pipeline. Deliverables: rule config (JSON/YAML), audit log per run, curated Parquet partitioned by dt, and a one-page runbook.
- Project 2: Implement a late-arrival backfill tool. Deliverables: script to select late records by watermark, safe merge into partitions, and before/after metrics.
- Project 3: Add drift checks to a training dataset. Deliverables: weekly job computing distribution metrics (e.g., PSI) for key features with alert thresholds and a dashboard JSON artifact.
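For Project 3, a minimal PSI computation might look like the following sketch; the binning strategy and epsilon handling are simplifications, and the thresholds in the comment are only the common rule of thumb:

```python
import numpy as np

def population_stability_index(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI between a baseline sample and a current sample of one numeric feature."""
    # Bin edges come from the baseline so both samples are bucketed identically;
    # values outside the baseline range are ignored in this simplified sketch.
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_counts, _ = np.histogram(expected, bins=edges)
    a_counts, _ = np.histogram(actual, bins=edges)
    eps = 1e-6  # avoids division by zero and log(0) for empty buckets
    e_prop = np.clip(e_counts / max(e_counts.sum(), 1), eps, None)
    a_prop = np.clip(a_counts / max(a_counts.sum(), 1), eps, None)
    return float(np.sum((a_prop - e_prop) * np.log(a_prop / e_prop)))

# Common rule of thumb: PSI < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 significant shift.
```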
Learning path
- Next: Feature engineering pipelines (transformations after validation).
- Then: Training orchestration and dataset versioning.
- Later: Model monitoring and data drift in production.
Next steps
- Write your initial data contract for one dataset.
- Add 6–10 validation rules with severities.
- Implement idempotent staging + merge.
- Run once, then rerun to prove identical results.
- Document run parameters and artifacts.
Mini challenge
You ingest user_events with fields: event_id, user_id, event_type in {click, view, purchase}, ts, device, country (ISO-2). Draft three critical validation rules and two warnings, and describe your late-data policy in two sentences. Keep it concise and deployable.