Why this matters
Real data changes under your feet: teams add columns, rename fields, change types, and split tables. As an ETL Developer, you must keep pipelines reliable through these shifts without breaking dashboards or downstream jobs.
- Keep SLAs when upstream contracts change.
- Roll out new features safely (e.g., new metrics/columns).
- Avoid data loss by capturing unexpected fields and maintaining data lineage.
Who this is for
- ETL/ELT developers building SQL- or Spark-based pipelines.
- Data engineers maintaining warehouses/lakes (e.g., columnar formats).
- Analytics engineers dealing with model breakage from schema drift.
Prerequisites
- Comfort with SQL DDL/DML (CREATE/ALTER/INSERT/UPDATE).
- Basic understanding of batch pipelines and staging layers.
- Familiarity with data types (string, numeric, timestamp, struct/array).
Concept explained simply
A schema is the shape of your data. A schema change is when that shape changes (new column, rename, type change). Some are safe (add a nullable column). Others break code (rename a column used by dashboards). Your job: detect, evaluate, and implement changes safely, preserving history and service continuity.
Mental model: Contract – Pipeline – Storage triangle
- Contract (source producers): What they promise to deliver.
- Pipeline (transform/load): How you map, validate, and evolve.
- Storage (warehouse/lake): Where compatibility rules and historical versions live.
Any change touches the triangle. Make adjustments in a controlled order so each side remains consistent.
Types of schema changes
- Additive (usually non-breaking): Add a nullable column; widen VARCHAR length; add a nullable field to a nested struct.
- Modifying (potentially breaking): Rename a column; change data type or semantics; change nullability; split/merge fields.
- Structural: Partitioning changes, table splits, nested field reorganizations.
Rule of thumb: Additions are typically safe. Renames, type changes, and nullability changes are risky and require a plan.
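A concrete contrast, reusing names from the worked examples below (Postgres-style syntax is assumed; adapt to your warehouse):
-- Usually safe: additive and nullable
ALTER TABLE stg_orders ADD COLUMN coupon_code VARCHAR;
-- Risky: every reader of the old name breaks at once
ALTER TABLE stg_customers RENAME COLUMN customer_phone TO phone_number;
-- Risky: re-typing in place changes the contract for every reader; prefer the add-and-backfill pattern in Example 3
ALTER TABLE stg_orders ALTER COLUMN total_amount TYPE DECIMAL(12,2);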
Runbook: handling a schema change
1. Detect. How: ingestion validation, schema diff checks, data quality tests (e.g., new columns detected, parse errors). A detection sketch follows this runbook.
2. Assess. Is it additive (non-breaking) or breaking (rename/type change)? Does it affect critical transformations or downstream contracts?
3. Decide.
- Non-breaking: allow and propagate with defaults.
- Breaking: introduce a compatibility layer (views/mappings), dual-write, and schedule a migration window.
4. Implement in a controlled order.
- Staging first, then core models, then derived models.
- Keep backfills idempotent (reruns produce the same result).
- Use views or mapping layers to avoid downtime.
5. Backfill. Migrate history in batches with checkpoints and data validation.
6. Monitor and clean up. Track null rates, parse errors, freshness, and downstream failures. Remove deprecated columns after a deprecation window.
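A minimal detection sketch for step 1, assuming your warehouse exposes information_schema and that you maintain a hypothetical expected_columns table (table_name, column_name, data_type) describing the agreed contract:
-- Columns observed in the warehouse but absent from the contract
SELECT o.column_name, o.data_type
FROM information_schema.columns o
LEFT JOIN expected_columns e
  ON e.table_name = o.table_name
 AND e.column_name = o.column_name
WHERE o.table_name = 'stg_orders'
  AND e.column_name IS NULL;
-- Reverse the join direction to catch columns that disappeared.
Logging these diffs to a table gives you the drift detector described under Practical projects below.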
Rollback playbook
- Keep raw landing data and previous table/view version.
- Feature-flag transformations; revert the flag if metrics regress.
- Document the exact DDL/DML applied for quick reversal.
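For the last point, one lightweight convention (a sketch, reusing the coupon_code change from Example 1 below) is to record every forward statement next to its rollback:
-- Forward: applied as part of the change
ALTER TABLE dwh_orders ADD COLUMN coupon_code VARCHAR;
-- Rollback: kept ready, run only if the change must be reverted
ALTER TABLE dwh_orders DROP COLUMN coupon_code;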
Worked examples
Example 1 — Add a nullable column (non-breaking)
Upstream adds coupon_code to orders.
-- Staging (first)
ALTER TABLE stg_orders ADD COLUMN coupon_code VARCHAR;
-- Transform (be tolerant)
SELECT order_id, customer_id, total_amount,
       coupon_code  -- NULL for historical rows until backfilled
FROM stg_orders;
-- Warehouse/core
ALTER TABLE dwh_orders ADD COLUMN coupon_code VARCHAR;
-- Optional backfill from history
UPDATE dwh_orders d
SET coupon_code = s.coupon_code
FROM stg_orders s
WHERE d.order_id = s.order_id AND d.coupon_code IS NULL;
Downstream remains compatible, new data appears gradually, history is backfilled.
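On large tables, the same backfill can be chunked so every rerun stays idempotent; a sketch assuming orders carry an order_date column (not shown above) to window on:
-- Backfill one window per run; the NULL guard makes reruns no-ops
UPDATE dwh_orders d
SET coupon_code = s.coupon_code
FROM stg_orders s
WHERE d.order_id = s.order_id
  AND d.coupon_code IS NULL
  AND d.order_date >= DATE '2024-01-01'
  AND d.order_date <  DATE '2024-02-01';
-- Record the completed window in a checkpoint table, then advance to the next range.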
Example 2 — Rename without breaking consumers
Upstream renames customer_phone to phone_number. Keep downstream stable with a compatibility view.
-- In staging, keep both fields during transition
ALTER TABLE stg_customers ADD COLUMN phone_number VARCHAR;
-- Fill both columns (dual populate)
UPDATE stg_customers SET phone_number = customer_phone WHERE phone_number IS NULL;
-- Compatibility view for downstream (exposes old name)
CREATE OR REPLACE VIEW vw_customers AS
SELECT customer_id,
phone_number AS customer_phone, -- old alias
phone_number
FROM stg_customers;
-- After downstream migrates, drop the alias and deprecate old name.
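Once the deprecation window closes, cleanup might look like this (a sketch; run only after a code search confirms no consumer still references customer_phone):
-- 1) Remove the old alias from the compatibility view
CREATE OR REPLACE VIEW vw_customers AS
SELECT customer_id,
       phone_number
FROM stg_customers;
-- 2) Drop the deprecated column from staging
ALTER TABLE stg_customers DROP COLUMN customer_phone;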
Example 3 — Type change with safe cast
total_amount changes from INT to DECIMAL(12,2).
-- Add new column and backfill with safe cast
ALTER TABLE stg_orders ADD COLUMN total_amount_dec DECIMAL(12,2);
UPDATE stg_orders SET total_amount_dec = CAST(total_amount AS DECIMAL(12,2));
-- Downstream transformation uses new column with fallback
SELECT order_id,
COALESCE(total_amount_dec, CAST(total_amount AS DECIMAL(12,2))) AS total_amount
FROM stg_orders;
-- After full backfill and validation, retire the old column.
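A validation sketch to support that call: row counts and sums for the old and new columns should agree (within rounding) before the INT column is dropped.
SELECT COUNT(*) AS total_rows,
       SUM(CASE WHEN total_amount_dec IS NULL THEN 1 ELSE 0 END) AS missing_new_values,
       SUM(total_amount) AS old_sum,
       SUM(total_amount_dec) AS new_sum
FROM stg_orders;
-- Expect missing_new_values = 0 and old_sum equal to new_sum within rounding.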
Example 4 — Nested JSON adds a field
payload.customer now includes preferred_language.
-- Landing: store raw JSON always
-- Staging: extract if exists; default NULL
SELECT payload:customer:first_name AS first_name,
payload:customer:preferred_language::STRING AS preferred_language
FROM raw_events;
-- Warehouse: add nullable column and populate
ALTER TABLE dwh_customers ADD COLUMN preferred_language VARCHAR;
INSERT ... SELECT ... preferred_language ...;
Validation and monitoring
- Schema diffs: alert on new/removed/renamed fields.
- Null-rate thresholds: spikes indicate bad casts or missing data.
- Parse/ingestion error rate: sudden increases suggest type/format changes.
- Freshness and row-count deltas: catch silent failures.
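A minimal daily check for the null-rate and row-count signals above, assuming the target table carries a load_date column (the name is illustrative):
SELECT load_date,
       COUNT(*) AS row_count,
       AVG(CASE WHEN coupon_code IS NULL THEN 1.0 ELSE 0 END) AS coupon_code_null_rate
FROM dwh_orders
WHERE load_date >= CURRENT_DATE - 7
GROUP BY load_date
ORDER BY load_date;
-- Alert when row_count drops sharply or the null rate spikes against the recent trend.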
Self-check: is this change safe?
- Is it additive and nullable? Likely safe.
- Will any existing query break? Search code for the column name.
- Can you deploy a view alias for compatibility?
- Do you have a reversible DDL plan?
Exercises
Do these hands-on tasks. Then compare with the solutions below and validate with the checklist.
Exercise 1 — Add a new nullable column safely
Daily orders source adds coupon_code. Update staging and warehouse, keep pipelines idempotent, and backfill recent data.
Exercise 2 — Handle a breaking rename with zero downtime
Upstream renames customer_phone to phone_number and changes structure. Keep downstream stable using a compatibility layer and a phased plan.
Checklist
- [ ] Staging updated before core models.
- [ ] Backfill plan is idempotent and chunked.
- [ ] Compatibility view/alias shields downstream.
- [ ] Monitoring covers null-rate and parse errors.
- [ ] Documented deprecation window and rollback.
Common mistakes and how to self-check
- Dropping or renaming directly in core tables. Fix: add alias/view first, migrate, then clean up.
- Changing nullability to NOT NULL without a backfill. Fix: backfill, validate, then enforce constraints.
- One-shot backfills on large tables. Fix: batch with checkpoints and metrics.
- Ignoring raw capture. Fix: always land raw payloads to enable reprocessing.
- Assuming additions are always safe. Fix: still validate downstream transformations for assumptions.
Quick self-audit
- Do you know which teams consume the changed fields?
- Can you revert the last two DDLs quickly?
- Are lineage and code search updated for the new/old names?
Practical projects
- Build a schema-evolution demo: simulate weekly upstream changes and keep dashboards green using views and staged rollouts.
- Implement a schema drift detector: compare expected vs observed columns and send alerts; log diffs to a table.
- Migrate an INT amount to DECIMAL with a backfill job and validation report (row counts, sums, and discrepancy thresholds).
Learning path
- Start: Understand additive vs breaking changes and compatibility patterns.
- Practice: Implement two non-breaking and one breaking change with rollbacks.
- Automate: Add schema diff checks and alerting to your pipeline.
- Harden: Add idempotent backfills and validation queries.
- Document: Create a team playbook with runbooks and SLAs.
Next steps
- Turn the runbook into reusable templates (DDL, view patterns, validation queries).
- Add monitoring to catch future drifts early.
- Schedule periodic cleanup of deprecated columns and views.
Mini challenge
Your product team will split full_name into first_name and last_name next week. Draft a no-downtime plan: staging changes, compatibility view, backfill, validation, and cleanup timeline. Keep it under 10 steps.