Why this matters
As an ETL Developer, you will frequently load and update dimension tables where business attributes change over time (customer address, product category, employee role). Choosing the right Slowly Changing Dimension (SCD) pattern directly affects reporting accuracy, auditability, and storage costs.
- Customer analytics: preserve address history to analyze moves by time (Type 2).
- Compliance/audit: show what was true at the time of a transaction (Type 2/Type 6).
- Simple corrections: fix bad data without keeping history (Type 1).
Concept explained simply
A dimension describes entities (Customer, Product, Employee). Slowly Changing Dimensions are strategies to store how those descriptions change over time.
- Type 0: Never change after first load (retain original).
- Type 1: Overwrite values (no history).
- Type 2: Add a new row per change with dates/flags (full history).
- Type 3: Track a limited number of prior values in extra columns (e.g., current_region, previous_region).
- Type 4: Keep current row in the dimension; move history to a separate history table.
- Type 6 (1+2+3 hybrid): Combine Type 1 (current overwrite fields), Type 2 (row versioning), and Type 3 (previous attribute columns).
Mental model
Think of a bookshelf of entity "snapshots." Type 1 replaces the photo, Type 2 adds a new photo and keeps the old, Type 3 pins only the latest two photos side by side, Type 4 moves old photos into an archive box, and Type 6 does all three: replaces some details on the latest photo, keeps the old photos, and stores a favorite previous detail on the latest one.
Core columns for SCD2 (typical)
- surrogate_key (e.g., customer_sk): Integer identity for joins from facts.
- natural_key (e.g., customer_id): Stable business identifier.
- business attributes: name, email, city, etc.
- effective_start_date / effective_end_date: Validity window.
- is_current: 1 for the latest active row, 0 for historical rows.
- version: Incremented per natural_key change (optional but helpful).
- hash_diff: Hash of business attributes to detect changes (optional).
Worked examples
Example 1 — SCD Type 1 (overwrite)
Use when you do not need history (e.g., typo fixes).
dim_customer (Type 1) before customer_id | name | email ----------- | --------- | ---------------- C001 | Ana Brown | ana@old.com Incoming change: email=ana@new.com After overwrite customer_id | name | email ----------- | --------- | ---------------- C001 | Ana Brown | ana@new.com
Type 1 upsert idea
MERGE dim_customer d USING stg_customer s ON d.customer_id = s.customer_id WHEN MATCHED THEN UPDATE SET d.email = s.email, d.name = s.name WHEN NOT MATCHED THEN INSERT (customer_id, name, email) VALUES (s.customer_id, s.name, s.email);
Example 2 — SCD Type 2 (full history)
Track changes over time with dated rows.
dim_customer (Type 2) before customer_sk | customer_id | city | start_date | end_date | is_current | version ----------- | ----------- | ------- | ---------- | ---------- | ---------- | ------- 101 | C001 | Boston | 2023-01-01 | 9999-12-31 | 1 | 1 Incoming change on 2024-04-10: city=Chicago After Type 2 customer_sk | customer_id | city | start_date | end_date | is_current | version ----------- | ----------- | ------- | ---------- | ---------- | ---------- | ------- 101 | C001 | Boston | 2023-01-01 | 2024-04-09 | 0 | 1 202 | C001 | Chicago | 2024-04-10 | 9999-12-31 | 1 | 2
Type 2 key steps
- Detect change (compare attributes or hash).
- Expire the current row (set end_date to yesterday, is_current=0).
- Insert a new row with the new attributes (start_date=today, end_date=9999-12-31, is_current=1, version+1).
Example 3 — SCD Type 3 (limited history)
Keep current and previous value in fixed columns.
dim_employee (Type 3) employee_id | current_title | previous_title ----------- | ------------- | -------------- E777 | Analyst | (null) Change: title=Senior Analyst After update employee_id | current_title | previous_title ----------- | ------------- | -------------- E777 | Senior Analyst| Analyst
Type 3 update idea
UPDATE dim_employee d
SET previous_title = d.current_title,
current_title = s.new_title
FROM stg_employee_title s
WHERE d.employee_id = s.employee_id
AND d.current_title <> s.new_title;
Design choices and trade-offs
- Type 1: simplest, low storage, but loses history.
- Type 2: accurate time travel; more rows, careful joins from facts.
- Type 3: simple limited history; only a few prior values.
- Type 4: isolates history; two tables to manage.
- Type 6: flexible reporting; highest complexity.
How facts link to SCD2
When loading facts, look up the dimension row version valid at the fact event time and store its surrogate_key in the fact. This preserves historical correctness.
Step-by-step: a reliable SCD Type 2 pattern
- Stage data: Cleanse and deduplicate incoming rows by natural key.
- Compute hash_diff on business attributes for quick change detection.
- Split staging rows into: new (no match in dim), changed (match by natural key but different hash), unchanged.
- Expire changed rows in dim (set end_date, is_current=0).
- Insert new SCD2 rows for all new and changed records with start_date=today, end_date=9999-12-31, is_current=1, version handled per natural key.
- Ensure unique constraint: (natural_key, effective_start_date) or use surrogate key identity/sequence.
- Test with edge cases: multiple changes same day, reverts to previous value, late-arriving data.
Exercises
Do these in order. Aim for correctness first, then performance.
Exercise 1 (matches ex1)
Create an SCD Type 2 load for a customer dimension using a staging table. Include start/end dates and is_current logic.
Exercise 2 (matches ex2)
Implement a Type 3 update that moves current_city to previous_city and sets a new current_city.
Self-check checklist
- Does a change create a new SCD2 row and expire the previous one?
- Are unchanged rows left untouched?
- Do fact lookups return the correct surrogate key by event date?
- Can the process handle two changes for the same key on the same day?
- Are natural keys unique in staging before merge?
Common mistakes and how to self-check
- Overwriting history by accident (Type 1 instead of Type 2). Self-check: After change, do you still have both old and new rows?
- Missing end_date updates. Self-check: Only one row per natural key should have is_current=1 with end_date=9999-12-31.
- Incorrect fact-to-dimension join. Self-check: Use the surrogate_key captured at load time, or join by date ranges during lookup, not at query time.
- Not deduplicating staging. Self-check: Ensure one latest record per natural key in staging before merge.
- Ignoring time zone/clock issues. Self-check: Derive dates consistently (e.g., UTC) and avoid overlaps in validity windows.
Practical projects
- Build a Customer SCD2 dimension from CSV files and simulate daily loads with changed addresses.
- Create a Product Type 3 dimension with current_category and previous_category; update it from a small change feed.
- Design a Type 6 dimension for Employee with SCD2 versioning plus a previous_title column and one attribute overwritten Type 1 style.
Learning path
- Prerequisites: SQL joins, primary/foreign keys, basic ETL staging.
- Then: SCD types (0/1/2/3/4/6) and their use cases.
- Next: Implement SCD2 with MERGE/upserts, hashing, and date windows.
- Advance: Handle late-arriving data and same-day multi-change handling.
- Optimize: Indexing, constraints, and batch vs. micro-batch patterns.
Who this is for
- ETL/ELT Developers and Data Engineers building star schemas.
- Analysts transitioning into data warehousing.
- Anyone maintaining dimension tables that change slowly over time.
Prerequisites
- Intermediate SQL (INSERT/UPDATE/MERGE, joins).
- Understanding of dimensions vs. facts in star schemas.
- Basic ETL workflow knowledge (staging, cleansing, deduplication).
Mini challenge
You must support customer email corrections (no history) and address changes (with history). Propose a combined approach using Type 1 for email and Type 2 for address, listing the columns you need and the update rules in one paragraph.
Hint
Think: which attributes overwrite in place vs. which trigger new rows and date windows.
Next steps
- Review your Exercise 1 and 2 solutions against the checklist.
- Run edge-case tests: duplicate staging rows, reversion to a previous value, and late-arriving records.
- Take the quick test to confirm understanding. Everyone can take the test; logged-in users will have progress saved.
Quick Test
Start the quick test below when you are ready. Aim for 70% or higher to pass.