Who this is for
Machine Learning Engineers, Data Engineers, and Analytics Engineers who build or maintain pipelines where upstream data structures evolve over time.
Prerequisites
- Comfort with SQL and a scripting language (e.g., Python or Scala).
- Basic understanding of batch/stream processing and data formats (CSV/JSON/Parquet/Avro/Protobuf).
- Familiarity with version control and CI/CD concepts.
Why this matters
Real pipelines change constantly: new fields are added, types evolve, columns get split or deprecated. Poorly managed schema changes cause broken jobs, bad features, and outages. As an ML Engineer you will:
- Keep model training robust when upstream teams add or rename fields.
- Ensure feature stores and batch jobs remain compatible across versions.
- Migrate data safely (zero or minimal downtime) with clear rollback paths.
Concept explained simply
A schema is the shape of your data: field names, types, and constraints. Managing schema changes means planning how producers and consumers evolve that shape without breaking each other. The golden rule: make changes in small, compatible steps, monitor impact, then clean up once consumers have adapted.
Mental model
Think of a schema change like renovating a kitchen while still cooking daily. You:
- Expand: Bring in the new cabinet (add new fields, keep old ones working).
- Migrate: Move dishes gradually (dual-write/dual-read, backfill, validate).
- Contract: Remove the old cabinet after everyone has adapted (drop old fields).
This is the expand → migrate → contract pattern.
Core strategies that actually work
1) Schema versioning and contracts
- Version your schemas (v1, v2, ...). Embed version in topic, path, or metadata.
- Define a contract: required fields, allowed types, nullability, and defaults.
- Automate checks in CI to reject breaking changes.
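A minimal sketch of such a CI check, assuming schemas are tracked as plain name-to-type mappings (real setups would lean on a schema registry or Avro/Protobuf tooling):

# CI-style contract check (sketch): reject removals, type changes, and new
# required fields. Schemas here are dicts of field -> {type, nullable}.
def check_compatibility(old_schema, new_schema):
    errors = []
    for field, spec in old_schema.items():
        if field not in new_schema:
            errors.append(f"removed field: {field}")
        elif new_schema[field]["type"] != spec["type"]:
            errors.append(f"type change on {field}: {spec['type']} -> {new_schema[field]['type']}")
    for field, spec in new_schema.items():
        if field not in old_schema and not spec.get("nullable", False):
            errors.append(f"new field {field} must be nullable or have a default")
    return errors

v1 = {"id": {"type": "long", "nullable": False}}
v2 = {"id": {"type": "long", "nullable": False},
      "country_code": {"type": "string", "nullable": True}}
assert check_compatibility(v1, v2) == []  # additive and nullable: allowed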
2) Compatibility modes
- Backward compatible: New data can be read by old consumers (e.g., adding nullable fields).
- Forward compatible: Old data can be read by new consumers (defaults and optional fields help).
- Fully compatible: Both backward and forward. Aim for this when practical.
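Both directions, sketched with plain Python dicts standing in for real (de)serializers:

# Backward: an old consumer reads new data by ignoring fields it does not know.
new_record = {"id": 1, "country_code": "DE"}
old_fields = {"id"}
old_view = {k: v for k, v in new_record.items() if k in old_fields}

# Forward: a new consumer reads old data by applying a default for the missing field.
old_record = {"id": 2}
country_code = old_record.get("country_code")  # None acts as the default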
3) Expand → Migrate → Contract (zero/minimal downtime)
- Expand: Add new field/structure without removing old.
- Migrate: Dual-produce or backfill; consumers dual-read; validate (a dual-write sketch follows this list).
- Contract: Remove deprecated fields once adoption is confirmed.
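A hedged sketch of the dual-produce step, borrowing the amount/amount_cents fields from Example 2 below; the event dict stands in for whatever record your producer actually emits:

from decimal import Decimal, InvalidOperation

# Dual-write sketch: produce the old and new fields side by side during migration.
def build_event(raw_amount: str) -> dict:
    event = {"amount": raw_amount}  # old field, kept so existing consumers keep working
    try:
        # New field: integer cents, computed exactly with Decimal.
        event["amount_cents"] = int((Decimal(raw_amount) * 100).to_integral_value())
    except InvalidOperation:
        event["amount_cents"] = None  # flagged by validation rather than silently dropped
    return event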
4) Validation and monitoring
- Use data validation at ingestion (type, range, null checks, enum domains).
- Add canary readers/writers during migration and compare metrics.
- Alert on schema registry updates or metadata changes.
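One way to express such ingestion checks in plain Python; the field names and rules are illustrative, and libraries like Great Expectations or pandera cover the same ground:

# Row-level ingestion checks: type, range, null, and enum-domain validation.
ALLOWED_COUNTRIES = {"US", "DE", "FR", "GB"}

def validate_row(row: dict) -> list[str]:
    problems = []
    if row.get("order_id") is None:
        problems.append("order_id is null")
    amount = row.get("amount_cents")
    if amount is not None and (not isinstance(amount, int) or amount < 0):
        problems.append("amount_cents must be a non-negative integer")
    country = row.get("country_code")
    if country is not None and country not in ALLOWED_COUNTRIES:
        problems.append(f"country_code {country!r} outside the allowed domain")
    return problems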
5) Safe type changes
- Widen types (int → long, float → double) rather than narrowing.
- For type changes (string → int), add a new field with the target type; migrate; then deprecate the old field.
6) Renames and splits
- Never hard-rename in-place. Add the new fields and populate from old.
- Maintain both while consumers switch; then drop old.
7) Backfill plans
- Backfill missing values with safe defaults and clear lineage notes.
- Backfill in batches with progress checks and idempotent jobs.
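A batched, idempotent driver might look like this sketch; it assumes a sqlite3-style DB-API connection and an integer id column, and lookup_country plus the batch bounds are placeholders:

# Batched, idempotent backfill: each batch touches only rows still missing the
# value, so rerunning after a failure is safe.
def backfill(conn, batch_size=10_000, max_id=1_000_000):
    for start in range(0, max_id, batch_size):
        with conn:  # one transaction per batch
            cur = conn.execute(
                "UPDATE customers SET country_code = lookup_country(ip_addr) "
                "WHERE country_code IS NULL AND id >= ? AND id < ?",
                (start, start + batch_size),
            )
            print(f"batch starting at {start}: {cur.rowcount} rows updated")  # progress check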
8) Stream and batch nuances
- Streams: Introduce new topics or fields with compatibility guarantees; keep consumers tolerant of unknown fields.
- Batch: Partition-aware backfills and reprocessing with checkpointing.
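A stream consumer tolerant of unknown fields, sketched for JSON events with illustrative field names:

import json

# Tolerant consumer: read the fields you know, apply defaults for fields older
# producers have not written yet, and ignore anything new.
def handle_message(payload: bytes):
    event = json.loads(payload)
    user_id = event["user_id"]                      # required by the contract
    country = event.get("country_code", "unknown")  # optional, with a default
    return user_id, country                         # extra fields are simply ignored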
Worked examples
Example 1: Add a nullable column (backward compatible)
Scenario: Add country_code to customers.
- Expand: ALTER TABLE to add country_code VARCHAR NULL with default NULL.
- Migrate: Producer starts writing; backfill historical rows using IP-to-country lookup where possible.
- Consumers: Update to read country_code if present; fall back safely (sketched after the SQL below).
- Contract: After two release cycles and stable monitoring, enforce NOT NULL if the business requires it and all rows are filled.
-- Expand
ALTER TABLE customers ADD COLUMN country_code VARCHAR(2);
-- Backfill (run in batches, e.g., by id range; lookup_country is a placeholder)
UPDATE customers SET country_code = lookup_country(ip_addr)
WHERE country_code IS NULL AND id BETWEEN :batch_start AND :batch_end;
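The consumer fallback from the steps above, as a tiny Python sketch (the "unknown" default is an assumption; pick whatever your features tolerate):

# Consumer fallback: prefer the new column, degrade safely while backfill runs.
def country_for(row: dict) -> str:
    return row.get("country_code") or "unknown"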
Example 2: Change type safely (string → integer)
Scenario: orders.amount is currently STRING; move to INT cents.
- Expand: Add amount_cents INT NULL.
- Migrate: Producer writes both amount (string) and amount_cents (int). Backfill from amount for historical data.
- Consumers: Switch to amount_cents with validation (non-negative, not null for new records).
- Contract: Stop writing amount string, then drop it after adoption.
-- Expand
ALTER TABLE orders ADD COLUMN amount_cents INT;
-- Backfill example logic (prefer TRY_CAST or equivalent where available to skip malformed values)
UPDATE orders SET amount_cents = CAST(ROUND(CAST(amount AS DECIMAL(10,2)) * 100) AS INT)
WHERE amount_cents IS NULL AND amount IS NOT NULL;
Example 3: Rename and split field (full_name → first_name, last_name)
Scenario: events has full_name; need first_name and last_name.
- Expand: Add first_name, last_name as NULLable.
- Migrate: Producer derives and writes all three fields. Backfill historical rows from full_name with best-effort splitting.
- Consumers: Read the new fields when present; fall back to full_name.
- Contract: After all consumers updated, stop writing full_name and later drop it.
-- Expand
ALTER TABLE events ADD COLUMN first_name STRING, ADD COLUMN last_name STRING;
-- Backfill (pseudo; single-token names leave last_name NULL, see EX2)
UPDATE events
SET first_name = SPLIT(full_name, ' ')[0],
    last_name = SPLIT(full_name, ' ')[1]
WHERE full_name IS NOT NULL AND first_name IS NULL;
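The same split with the edge cases handled, as a Python sketch: single-token names keep last_name NULL, and middle names fold into last_name by assumption:

# Best-effort name split: first token becomes first_name, the remainder last_name.
def split_name(full_name):
    if not full_name or not full_name.strip():
        return None, None              # keep both NULL when there is no name
    first, _, rest = full_name.strip().partition(" ")
    return first, rest or None         # "Prince" -> ("Prince", None)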
Step-by-step playbook
- Define the change: which fields, types, and constraints, and why.
- Classify compatibility: backward, forward, or full.
- Plan rollout: expand → migrate → contract; include rollback.
- Validate: add schema checks and sample data tests.
- Migrate: dual-read/write, backfill, compare metrics.
- Contract: deprecate and remove with clear communication and monitoring.
Rollout checklist
- Versioned schema defined
- Nullable or defaulted fields for additions
- Dual-write/dual-read plan documented
- Backfill approach and dry run completed
- Monitoring and alerts configured
- Deprecation window communicated
Exercises
These mirror the exercise pack below. Do them in your environment or on paper. Use the hints if you get stuck.
- EX1: Add a new optional field safely to a widely used table.
- EX2: Rename a field by introducing new fields and planning the migration.
Exercise pack (EX1 and EX2)
EX1: Add country_code to customers (zero-downtime)
Goal: Draft an expand → migrate → contract plan.
- Write the exact DDL for expand.
- Describe the backfill logic and safety checks.
- Define consumer fallback behavior.
- Propose monitoring metrics and a deprecation window.
EX2: Rename full_name → first_name, last_name
- List producer changes (dual-write plan).
- Write a batch backfill approach (consider edge cases like single names).
- Describe consumer switch steps.
- Specify when to remove full_name and how to verify readiness.
Self-check checklist
- Your plan avoids in-place destructive changes.
- Backfill is idempotent and batched.
- Consumers have clear fallback logic.
- You defined success metrics and rollback triggers.
- Contract step only after adoption is verified.
Common mistakes and how to self-check
- Hard renames: If you changed a name directly, roll back and re-do via expand → migrate.
- Breaking types: If consumers fail to parse, introduce a new field with the new type instead of converting in-place.
- No defaults: For additions, use NULLable fields or safe defaults to maintain backward compatibility.
- Skipping validation: Add schema and data-quality checks before and after rollout.
- Rushing contraction: Keep old fields until you confirm all consumers have switched and metrics are stable.
Practical projects
- Pipeline sandbox: Create a small pipeline that reads a JSON file, adds a field, backfills, and writes Parquet. Practice the full expand → migrate → contract cycle.
- Compatibility harness: Build a tiny test that loads two schema versions and verifies old and new consumers can parse both.
- Monitoring drill: Simulate a migration and set alerts on null rates, parse errors, and consumer lag.
Learning path
- Review data formats and type systems (CSV, JSON, Parquet, Avro/Protobuf basics).
- Practice the expand → migrate → contract pattern with small scenarios.
- Add schema validation to your ingestion step.
- Design a zero-downtime plan for both batch and streaming contexts.
- Automate: add CI checks to prevent breaking schema changes.
Next steps
- Templatize your migration playbook so your team can reuse it.
- Add schema change checks to your CI/CD pipeline.
- Schedule periodic deprecation sweeps to remove dead fields.
Mini challenge
Your product team wants to change event_time from STRING to TIMESTAMP and also add timezone. Sketch a three-step plan covering expand, migrate (including backfill), and contract. Include validation rules and a rollback signal.