Who this is for
Data Architects and senior data engineers who design pipelines, contracts, and storage layers that must survive changing source schemas without breaking downstream analytics or ML.
Prerequisites
- Comfort with ETL/ELT patterns and batch/streaming pipelines
- Familiarity with file/serialization formats (Parquet, Avro, JSON, Protobuf)
- Basic SQL and data modeling (star schema, data vault or lakehouse medallion)
Why this matters
Real systems evolve: teams add fields, rename attributes, change data types, split/merge entities. Your job is to design pipelines that handle these changes gracefully, so that:
- Dashboards keep working during source changes
- Streaming consumers don’t crash on a new field
- Warehouses/lakehouses maintain historical consistency
- Data contracts and governance remain enforceable
Typical tasks you will face
- Choosing compatibility modes in a schema registry
- Rolling out a column rename without breaking SQL
- Allowing new nested fields in event payloads
- Coordinating backfills and reprocessing windows
- Versioning tables and views to support phased migration
Concept explained simply
Schema evolution means your data’s shape can change while older data still exists. Your system must read both old and new shapes safely.
Key levers:
- Compatibility: backward (consumers on the new schema can read data written with the old schema), forward (consumers on the old schema can read data written with the new schema), and full (both). These rules determine whether producers and consumers on different schema versions can interoperate.
- Defaults and nullability: adding a field is safe if it has a default or is nullable.
- Schema-on-write vs schema-on-read: either enforce schema at ingestion (write) or allow flexible ingest and enforce at consumption (read).
- Contracts and registries: describe payloads, validate them, and version changes.
Mental model
Think of evolution like adding new rooms to a house without blocking existing doors. Additions should be optional, with clear signage. Removals require detours (aliases/views) until everyone moves to the new path.
Compatibility cheat sheet
- Add optional field (nullable or has default): usually backward compatible
- Remove field: breaking; deprecate first and keep a computed alias for a deprecation period
- Rename field: treat as add new + backfill + alias old to new + later remove
- Type widening (int to long/decimal): typically safe if every consumer and sink can read the wider type
- Type narrowing (long to int): breaking; avoid or gate behind new versioned topic/table
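The cheat sheet boils down to a few mechanical checks. Below is a minimal sketch of the backward-compatibility rule (consumers on the new schema must still read old data), using a simplified field-spec dict rather than real Avro/Protobuf schemas; names like is_backward_compatible are illustrative.

```python
# Minimal sketch: each schema is {field_name: {"has_default": bool}}.
# Real registries apply these rules to full Avro/Protobuf/JSON Schema definitions.

def is_backward_compatible(old_schema: dict, new_schema: dict) -> bool:
    """Consumers using new_schema must be able to read data written with old_schema."""
    for name, spec in new_schema.items():
        added = name not in old_schema
        if added and not spec.get("has_default", False):
            # Old records lack this field and the new reader has no default to fill in.
            return False
    # Fields removed in the new schema are simply ignored by the new reader.
    return True

old = {"user_id": {"has_default": False}, "ts": {"has_default": False}}
safe_add = {**old, "device_os": {"has_default": True}}     # add optional field
unsafe_add = {**old, "device_os": {"has_default": False}}  # add required field

print(is_backward_compatible(old, safe_add))    # True
print(is_backward_compatible(old, unsafe_add))  # False
```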
Design patterns and options
- Schema registry with compatibility rules (for Avro/Protobuf/JSON Schema); a registry configuration sketch follows this list
- Versioned tables/views (v1, v2) with a controlled cutover date
- Transformation layer: expose stable views while raw tables evolve
- Soft renames: keep both columns (new and old alias) during transition
- Medallion/layered approach: Bronze (as-is), Silver (cleaned/normalized), Gold (consumption) to isolate changes
- CDC-aware pipelines: detect DDL changes and map them to evolution actions
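As one concrete instance of the registry pattern above, here is a hedged sketch of pinning a subject's compatibility mode, assuming a Confluent-compatible Schema Registry; the PUT /config/{subject} endpoint and the "compatibility" body field follow the Confluent Schema Registry REST API, and the URL and subject name are placeholders for your environment.

```python
# Assumption: a Confluent-compatible Schema Registry is reachable at REGISTRY_URL.
import requests

REGISTRY_URL = "http://schema-registry:8081"   # placeholder host
SUBJECT = "login_event-value"                  # placeholder subject name

resp = requests.put(
    f"{REGISTRY_URL}/config/{SUBJECT}",
    json={"compatibility": "BACKWARD"},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # e.g. {"compatibility": "BACKWARD"}
```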
Worked examples
1) Adding a new optional column in a streaming event
Scenario: Producer adds field device_os to login_event.
- Registry: set backward compatibility. Add device_os as nullable or with default "unknown".
- Stream ETL: pass-through in Bronze; in Silver, coalesce null to "unknown".
- Gold views: keep queries stable; new analyses can reference the new column.
- Monitoring: track null rate for device_os.
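A minimal sketch of the Silver-layer coalesce described above, assuming events are already deserialized into plain dicts; the function name and default value are illustrative.

```python
# Old events have no device_os; new events may carry it (possibly null).
DEVICE_OS_DEFAULT = "unknown"

def normalize_login_event(event: dict) -> dict:
    normalized = dict(event)  # Bronze stays as-is; Silver works on a copy
    # Coalesce: a missing or null device_os becomes the documented default.
    if normalized.get("device_os") is None:
        normalized["device_os"] = DEVICE_OS_DEFAULT
    return normalized

old_event = {"user_id": 42, "ts": "2024-05-01T10:00:00Z"}
new_event = {"user_id": 43, "ts": "2024-05-01T10:01:00Z", "device_os": "ios"}

print(normalize_login_event(old_event)["device_os"])  # unknown
print(normalize_login_event(new_event)["device_os"])  # ios
```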
What could break?
- A consumer that assumes fixed field order/position (avoid this; access fields by name)
- A downstream table or view that has not been refreshed to include the new column; ensure schema propagation
2) Renaming a column safely
Scenario: customer_lastname -> last_name.
- Add last_name; keep customer_lastname for now.
- Backfill last_name from customer_lastname.
- Create a view exposing last_name while also projecting customer_lastname as an alias for a deprecation window.
- Notify consumers; after migration, drop customer_lastname.
Key idea
Renames are add + backfill + alias + remove later.
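A self-contained sketch of the add + backfill + alias sequence, using SQLite as a stand-in for the warehouse; table and column names mirror the scenario, and warehouse-specific details (grants, cutover automation) are omitted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, customer_lastname TEXT);
    INSERT INTO customers VALUES (1, 'Ng'), (2, 'Alvarez');

    -- Step 1: add the new column alongside the old one.
    ALTER TABLE customers ADD COLUMN last_name TEXT;

    -- Step 2: backfill the new column from the old one.
    UPDATE customers SET last_name = customer_lastname;

    -- Step 3: expose a view carrying both names during the deprecation window.
    CREATE VIEW customers_v AS
    SELECT customer_id,
           last_name,
           last_name AS customer_lastname   -- alias for not-yet-migrated consumers
    FROM customers;
""")

print(conn.execute(
    "SELECT customer_id, last_name, customer_lastname FROM customers_v"
).fetchall())
# Step 4 (later): drop customer_lastname from the base table and the alias from the view.
```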
3) Widening a type
Scenario: order_amount int -> decimal(18,2).
- Create new column order_amount_dec decimal(18,2).
- Populate from int column; validate no precision loss.
- Expose order_amount_dec in views as order_amount.
- Deprecate old int column after all consumers upgrade.
Check
- Ensure all BI tools/drivers support decimal type
- Validate aggregates match before/after
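A small sketch of the precision-loss and aggregate checks, assuming the old column stores whole currency units as integers and the new column is decimal(18,2); the in-memory rows stand in for the real table.

```python
from decimal import Decimal

rows = [
    {"order_id": 1, "order_amount": 199, "order_amount_dec": Decimal("199.00")},
    {"order_id": 2, "order_amount": 75,  "order_amount_dec": Decimal("75.00")},
]

# Row-level check: widening int -> decimal must be exact.
for row in rows:
    assert Decimal(row["order_amount"]) == row["order_amount_dec"], row["order_id"]

# Aggregate check: totals must match before and after the migration.
old_total = sum(row["order_amount"] for row in rows)
new_total = sum(row["order_amount_dec"] for row in rows)
assert Decimal(old_total) == new_total
print("widening validated:", new_total)
```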
4) Nested events evolving
Scenario: address becomes a nested structure (address.street, address.city) that replaces the flat columns.
- Allow both shapes in Bronze (schema-on-read).
- In Silver, populate address.* from the old flat columns when the nested structure is missing.
- Gold: expose a stable view that returns address fields consistently.
- Plan a cutoff date to remove flat columns once adoption completes.
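A sketch of the Silver-layer rule for this restructure: derive the nested address from the flat columns when it is missing, otherwise keep it as-is. The flat column names (address_street, address_city) are assumptions for illustration.

```python
def normalize_address(record: dict) -> dict:
    out = dict(record)
    if not isinstance(out.get("address"), dict):
        # Old flat shape: derive the nested structure from the flat columns.
        out["address"] = {
            "street": out.pop("address_street", None),
            "city": out.pop("address_city", None),
        }
    return out

old_shape = {"customer_id": 1, "address_street": "1 Main St", "address_city": "Lisbon"}
new_shape = {"customer_id": 2, "address": {"street": "5 Rua Nova", "city": "Porto"}}

print(normalize_address(old_shape)["address"])  # {'street': '1 Main St', 'city': 'Lisbon'}
print(normalize_address(new_shape)["address"])  # {'street': '5 Rua Nova', 'city': 'Porto'}
```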
Step-by-step playbook
- Classify the change: add, remove, rename, type change, nested restructure.
- Choose strategy: optional additions, soft rename, versioned view/table, or new stream/topic.
- Set/verify compatibility mode in registry or DDL policies.
- Add defaults/nullability and backfill plan.
- Deploy in layers: Bronze first, then Silver normalization, then Gold views.
- Run tests: schema validation, DQ checks, query snapshots.
- Monitor: null rates, error counts, consumer lag, schema version adoption (a monitoring sketch follows this list)
- Communicate deprecation timelines and cutover windows.
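A small sketch of the monitoring step, assuming Silver records are plain dicts carrying a schema_version field; it reports the null rate of a newly added column and schema-version adoption per batch. The record shape and field names are assumptions.

```python
def rollout_metrics(records: list[dict], new_column: str) -> dict:
    total = len(records)
    nulls = sum(1 for r in records if r.get(new_column) is None)
    versions: dict[str, int] = {}
    for r in records:
        v = str(r.get("schema_version", "unknown"))
        versions[v] = versions.get(v, 0) + 1
    return {
        "null_rate": nulls / total if total else 0.0,
        "schema_version_adoption": versions,
    }

batch = [
    {"schema_version": 1},                          # old producer, no device_os
    {"schema_version": 2, "device_os": "android"},
    {"schema_version": 2, "device_os": None},       # new producer, explicit null
]
print(rollout_metrics(batch, "device_os"))
# {'null_rate': 0.666..., 'schema_version_adoption': {'1': 1, '2': 2}}
```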
Common mistakes and how to self-check
- Hard renames without aliases. Self-check: Does every consumer know the new name today? If not, keep an alias.
- Adding non-null fields without defaults. Self-check: Can old data be read? If not, make it nullable or set a default.
- Ignoring type compatibility. Self-check: Are all sinks/tools able to read the widened type?
- Skipping backfills. Self-check: Will queries mixing old/new data behave consistently?
- Forgetting contract tests. Self-check: Do CI tests validate old and new schemas against shared samples?
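A sketch of such a contract test, assuming the jsonschema package and two JSON Schema versions of the same event; shared sample payloads must validate against both versions so old and new consumers keep working. The schemas and samples here are illustrative.

```python
from jsonschema import validate

schema_v1 = {
    "type": "object",
    "required": ["user_id"],
    "properties": {"user_id": {"type": "integer"}},
}
schema_v2 = {
    "type": "object",
    "required": ["user_id"],
    "properties": {
        "user_id": {"type": "integer"},
        "marketing_opt_in": {"type": "boolean", "default": False},  # optional addition
    },
}

shared_samples = [
    {"user_id": 1},                              # old-shape payload
    {"user_id": 2, "marketing_opt_in": True},    # new-shape payload
]

for sample in shared_samples:
    validate(instance=sample, schema=schema_v1)  # old consumers still accept it
    validate(instance=sample, schema=schema_v2)  # new consumers accept it too
print("contract samples validate against v1 and v2")
```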
Exercises
Exercise 1: Add a field to an event stream
Scenario: Kafka topic user_profile_updated (Avro). New optional field marketing_opt_in (boolean) is introduced.
- Pick a registry compatibility mode and justify.
- Define the field with default or nullability.
- Describe Silver-layer transformation behavior on missing values.
- List monitoring metrics for rollout.
Write your plan in 5–8 bullet points.
Exercise 2: Safe column rename in the warehouse
Scenario: Warehouse table sales.order_lines has column sku_code to be renamed to product_sku.
- Propose a step-by-step migration with views to avoid breaking queries.
- Define tests to ensure no regressions.
- State a deprecation timeline.
Write your migration plan in steps.
Exercise checklist
- Compatibility mode chosen and justified
- Defaults/nullability specified
- Backfill or derivation plan
- Views/aliases for transition
- Testing and monitoring defined
Practical projects
- Build a mini lakehouse: Ingest JSON events to Bronze, normalize to Silver with schema evolution handling, and publish Gold views. Simulate an add, rename, and type widening.
- Create a data contract suite: Define JSON Schemas for two event versions, write validation tests, and enforce backward compatibility.
- Warehouse rename drill: Implement soft rename via views, run query snapshots before/after, and automate cutover with a feature flag.
Learning path
- Start: Review compatibility modes and defaults/nullability
- Next: Practice soft renames and versioned views
- Then: Implement layered evolution (Bronze/Silver/Gold)
- Finally: Add monitoring and CI contract tests
Mini challenge
Your product team wants to drop column legacy_score next week. Design a safe plan that avoids breaking dashboards and completes the deprecation within 60 days. Include views, backfills (if needed), tests, and a cutover date.
Next steps
- Finish the exercises and run the quick test below
- Apply the patterns to one live pipeline in your environment
- Document your team’s default evolution policy (compatibility, views, timelines)