Who this is for
Data Architects and senior data engineers who design pipelines, contracts, and storage layers that must survive changing source schemas without breaking downstream analytics or ML.
Prerequisites
- Comfort with ETL/ELT patterns and batch/streaming pipelines
- Familiarity with file/serialization formats (Parquet, Avro, JSON, Protobuf)
- Basic SQL and data modeling (star schema, data vault or lakehouse medallion)
Why this matters
Real systems evolve: teams add fields, rename attributes, change data types, split/merge entities. Your job is to design pipelines that handle these changes gracefully, so that:
- Dashboards keep working during source changes
- Streaming consumers don’t crash on a new field
- Warehouses/lakehouses maintain historical consistency
- Data contracts and governance remain enforceable
Typical tasks you will face
- Choosing compatibility modes in a schema registry
- Rolling out a column rename without breaking SQL
- Allowing new nested fields in event payloads
- Coordinating backfills and reprocessing windows
- Versioning tables and views to support phased migration
Concept explained simply
Schema evolution means your data’s shape can change while older data still exists. Your system must read both old and new shapes safely.
Key levers:
- Compatibility: backward (consumers on the new schema can read data written with the old schema), forward (consumers on the old schema can read data written with the new schema), and full (both). These rules determine whether producers and consumers on different schema versions can interoperate.
- Defaults and nullability: adding a field is safe if it has a default or is nullable.
- Schema-on-write vs schema-on-read: either enforce schema at ingestion (write) or allow flexible ingest and enforce at consumption (read).
- Contracts and registries: describe payloads, validate them, and version changes.
Mental model
Think of evolution like adding new rooms to a house without blocking existing doors. Additions should be optional, with clear signage. Removals require detours (aliases/views) until everyone moves to the new path.
Compatibility cheat sheet
- Add optional field (nullable or has default): usually backward compatible
- Remove field: breaking; deprecate first and keep a computed alias for a deprecation period
- Rename field: treat as add new + backfill + alias old to new + later remove
- Type widening (int to long/decimal): typically safe if every consumer and sink can read the wider type
- Type narrowing (long to int): breaking; avoid or gate behind new versioned topic/table
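The cheat sheet boils down to a few mechanical checks. Below is a minimal sketch of the backward-compatibility rule (consumers on the new schema must still read old data), using a simplified field-spec dict rather than real Avro/Protobuf schemas; names like is_backward_compatible are illustrative.

```python
# Minimal sketch: each schema is {field_name: {"has_default": bool}}.
# Real registries apply these rules to full Avro/Protobuf/JSON Schema definitions.

def is_backward_compatible(old_schema: dict, new_schema: dict) -> bool:
    """Consumers using new_schema must be able to read data written with old_schema."""
    for name, spec in new_schema.items():
        added = name not in old_schema
        if added and not spec.get("has_default", False):
            # Old records lack this field and the new reader has no default to fill in.
            return False
    # Fields removed in the new schema are simply ignored by the new reader.
    return True

old = {"user_id": {"has_default": False}, "ts": {"has_default": False}}
safe_add = {**old, "device_os": {"has_default": True}}     # add optional field
unsafe_add = {**old, "device_os": {"has_default": False}}  # add required field

print(is_backward_compatible(old, safe_add))    # True
print(is_backward_compatible(old, unsafe_add))  # False
```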
Design patterns and options
- Schema registry with compatibility rules (for Avro/Protobuf/JSON Schema); a registry configuration sketch follows this list
- Versioned tables/views (v1, v2) with a controlled cutover date
- Transformation layer: expose stable views while raw tables evolve
- Soft renames: keep both columns (new and old alias) during transition
- Medallion/layered approach: Bronze (as-is), Silver (cleaned/normalized), Gold (consumption) to isolate changes
- CDC-aware pipelines: detect DDL changes and map them to evolution actions
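As one concrete instance of the registry pattern above, here is a hedged sketch of pinning a subject's compatibility mode, assuming a Confluent-compatible Schema Registry; the PUT /config/{subject} endpoint and the "compatibility" body field follow the Confluent Schema Registry REST API, and the URL and subject name are placeholders for your environment.

```python
# Assumption: a Confluent-compatible Schema Registry is reachable at REGISTRY_URL.
import requests

REGISTRY_URL = "http://schema-registry:8081"   # placeholder host
SUBJECT = "login_event-value"                  # placeholder subject name

resp = requests.put(
    f"{REGISTRY_URL}/config/{SUBJECT}",
    json={"compatibility": "BACKWARD"},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # e.g. {"compatibility": "BACKWARD"}
```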
Worked examples
1) Adding a new optional column in a streaming event
Scenario: Producer adds field device_os to login_event.
- Registry: set backward compatibility. Add device_os as nullable or with default "unknown".
- Stream ETL: pass-through in Bronze; in Silver, coalesce null to "unknown".
- Gold views: keep queries stable; new analyses can reference the new column.
- Monitoring: track null rate for device_os.
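A minimal sketch of the Silver-layer coalesce described above, assuming events are already deserialized into plain dicts; the function name and default value are illustrative.

```python
# Old events have no device_os; new events may carry it (possibly null).
DEVICE_OS_DEFAULT = "unknown"

def normalize_login_event(event: dict) -> dict:
    normalized = dict(event)  # Bronze stays as-is; Silver works on a copy
    # Coalesce: a missing or null device_os becomes the documented default.
    if normalized.get("device_os") is None:
        normalized["device_os"] = DEVICE_OS_DEFAULT
    return normalized

old_event = {"user_id": 42, "ts": "2024-05-01T10:00:00Z"}
new_event = {"user_id": 43, "ts": "2024-05-01T10:01:00Z", "device_os": "ios"}

print(normalize_login_event(old_event)["device_os"])  # unknown
print(normalize_login_event(new_event)["device_os"])  # ios
```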
What could break?
- A consumer that assumes fixed field order/position (avoid this; access fields by name)
- A downstream table or view that has not been refreshed to include the new column; ensure schema propagation
2) Renaming a column safely
Scenario: customer_lastname -> last_name.
- Add last_name; keep customer_lastname for now.
- Backfill last_name from customer_lastname.
- Create a view exposing last_name while also projecting customer_lastname as an alias for a deprecation window.
- Notify consumers; after migration, drop customer_lastname.
Key idea
Renames are add + backfill + alias + remove later.
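A self-contained sketch of the add + backfill + alias sequence, using SQLite as a stand-in for the warehouse; table and column names mirror the scenario, and warehouse-specific details (grants, cutover automation) are omitted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, customer_lastname TEXT);
    INSERT INTO customers VALUES (1, 'Ng'), (2, 'Alvarez');

    -- Step 1: add the new column alongside the old one.
    ALTER TABLE customers ADD COLUMN last_name TEXT;

    -- Step 2: backfill the new column from the old one.
    UPDATE customers SET last_name = customer_lastname;

    -- Step 3: expose a view carrying both names during the deprecation window.
    CREATE VIEW customers_v AS
    SELECT customer_id,
           last_name,
           last_name AS customer_lastname   -- alias for not-yet-migrated consumers
    FROM customers;
""")

print(conn.execute(
    "SELECT customer_id, last_name, customer_lastname FROM customers_v"
).fetchall())
# Step 4 (later): drop customer_lastname from the base table and the alias from the view.
```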
3) Widening a type
Scenario: order_amount int -> decimal(18,2).
- Create new column order_amount_dec decimal(18,2).
- Populate from int column; validate no precision loss.
- Expose order_amount_dec in views as order_amount.
- Deprecate old int column after all consumers upgrade.
Check
- Ensure all BI tools/drivers support decimal type
- Validate aggregates match before/after
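A small sketch of the precision-loss and aggregate checks, assuming the old column stores whole currency units as integers and the new column is decimal(18,2); the in-memory rows stand in for the real table.

```python
from decimal import Decimal

rows = [
    {"order_id": 1, "order_amount": 199, "order_amount_dec": Decimal("199.00")},
    {"order_id": 2, "order_amount": 75,  "order_amount_dec": Decimal("75.00")},
]

# Row-level check: widening int -> decimal must be exact.
for row in rows:
    assert Decimal(row["order_amount"]) == row["order_amount_dec"], row["order_id"]

# Aggregate check: totals must match before and after the migration.
old_total = sum(row["order_amount"] for row in rows)
new_total = sum(row["order_amount_dec"] for row in rows)
assert Decimal(old_total) == new_total
print("widening validated:", new_total)
```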
4) Nested events evolving
Scenario: address becomes a nested structure (address.street, address.city) that replaces the flat columns.
- Allow both shapes in Bronze (schema-on-read).
- In Silver, populate address.* from the old flat columns when the nested structure is missing.
- Gold: expose a stable view that returns address fields consistently.
- Plan a cutoff date to remove flat columns once adoption completes.
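A sketch of the Silver-layer rule for this restructure: derive the nested address from the flat columns when it is missing, otherwise keep it as-is. The flat column names (address_street, address_city) are assumptions for illustration.

```python
def normalize_address(record: dict) -> dict:
    out = dict(record)
    if not isinstance(out.get("address"), dict):
        # Old flat shape: derive the nested structure from the flat columns.
        out["address"] = {
            "street": out.pop("address_street", None),
            "city": out.pop("address_city", None),
        }
    return out

old_shape = {"customer_id": 1, "address_street": "1 Main St", "address_city": "Lisbon"}
new_shape = {"customer_id": 2, "address": {"street": "5 Rua Nova", "city": "Porto"}}

print(normalize_address(old_shape)["address"])  # {'street': '1 Main St', 'city': 'Lisbon'}
print(normalize_address(new_shape)["address"])  # {'street': '5 Rua Nova', 'city': 'Porto'}
```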
Step-by-step playbook
- Classify the change: add, remove, rename, type change, nested restructure.
- Choose strategy: optional additions, soft rename, versioned view/table, or new stream/topic.
- Set/verify compatibility mode in registry or DDL policies.
- Add defaults/nullability and backfill plan.
- Deploy in layers: Bronze first, then Silver normalization, then Gold views.
- Run tests: schema validation, DQ checks, query snapshots.
- Monitor: null rates, error counts, consumer lag, schema version adoption (a monitoring sketch follows this list)
- Communicate deprecation timelines and cutover windows.
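A small sketch of the monitoring step, assuming Silver records are plain dicts carrying a schema_version field; it reports the null rate of a newly added column and schema-version adoption per batch. The record shape and field names are assumptions.

```python
def rollout_metrics(records: list[dict], new_column: str) -> dict:
    total = len(records)
    nulls = sum(1 for r in records if r.get(new_column) is None)
    versions: dict[str, int] = {}
    for r in records:
        v = str(r.get("schema_version", "unknown"))
        versions[v] = versions.get(v, 0) + 1
    return {
        "null_rate": nulls / total if total else 0.0,
        "schema_version_adoption": versions,
    }

batch = [
    {"schema_version": 1},                          # old producer, no device_os
    {"schema_version": 2, "device_os": "android"},
    {"schema_version": 2, "device_os": None},       # new producer, explicit null
]
print(rollout_metrics(batch, "device_os"))
# {'null_rate': 0.666..., 'schema_version_adoption': {'1': 1, '2': 2}}
```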
Common mistakes and how to self-check
- Hard renames without aliases. Self-check: Does every consumer know the new name today? If not, keep an alias.
- Adding non-null fields without defaults. Self-check: Can old data be read? If not, make it nullable or set a default.
- Ignoring type compatibility. Self-check: Are all sinks/tools able to read the widened type?
- Skipping backfills. Self-check: Will queries mixing old/new data behave consistently?
- Forgetting contract tests. Self-check: Do CI tests validate old and new schemas against shared samples?
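A sketch of such a contract test, assuming the jsonschema package and two JSON Schema versions of the same event; shared sample payloads must validate against both versions so old and new consumers keep working. The schemas and samples here are illustrative.

```python
from jsonschema import validate

schema_v1 = {
    "type": "object",
    "required": ["user_id"],
    "properties": {"user_id": {"type": "integer"}},
}
schema_v2 = {
    "type": "object",
    "required": ["user_id"],
    "properties": {
        "user_id": {"type": "integer"},
        "marketing_opt_in": {"type": "boolean", "default": False},  # optional addition
    },
}

shared_samples = [
    {"user_id": 1},                              # old-shape payload
    {"user_id": 2, "marketing_opt_in": True},    # new-shape payload
]

for sample in shared_samples:
    validate(instance=sample, schema=schema_v1)  # old consumers still accept it
    validate(instance=sample, schema=schema_v2)  # new consumers accept it too
print("contract samples validate against v1 and v2")
```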
Exercises
Exercise 1: Add a field to an event stream
Scenario: Kafka topic user_profile_updated (Avro). New optional field marketing_opt_in (boolean) is introduced.
- Pick a registry compatibility mode and justify.
- Define the field with default or nullability.
- Describe Silver-layer transformation behavior on missing values.
- List monitoring metrics for rollout.
Write your plan in 5–8 bullet points.
Exercise 2: Safe column rename in the warehouse
Scenario: Warehouse table sales.order_lines has column sku_code to be renamed to product_sku.
- Propose a step-by-step migration with views to avoid breaking queries.
- Define tests to ensure no regressions.
- State a deprecation timeline.
Write your migration plan in steps.
Exercise checklist
- Compatibility mode chosen and justified
- Defaults/nullability specified
- Backfill or derivation plan
- Views/aliases for transition
- Testing and monitoring defined
Practical projects
- Build a mini lakehouse: Ingest JSON events to Bronze, normalize to Silver with schema evolution handling, and publish Gold views. Simulate an add, rename, and type widening.
- Create a data contract suite: Define JSON Schemas for two event versions, write validation tests, and enforce backward compatibility.
- Warehouse rename drill: Implement soft rename via views, run query snapshots before/after, and automate cutover with a feature flag.
Learning path
- Start: Review compatibility modes and defaults/nullability
- Next: Practice soft renames and versioned views
- Then: Implement layered evolution (Bronze/Silver/Gold)
- Finally: Add monitoring and CI contract tests
Mini challenge
Your product team wants to drop column legacy_score next week. Design a safe plan that avoids breaking dashboards and completes the deprecation within 60 days. Include views, backfills (if needed), tests, and a cutover date.
Next steps
- Finish the exercises and run the quick test below
- Apply the patterns to one live pipeline in your environment
- Document your team’s default evolution policy (compatibility, views, timelines)