Why this matters
As a Data Architect, you decide how analytical systems remember changes to business entities like customers, products, and employees. The right Slowly Changing Dimension (SCD) pattern ensures accurate historical reporting, point-in-time analysis, and trustworthy dashboards.
- Regulatory and audit needs: prove what was true at a past date.
- Attribution: tie revenue to the customer segment valid at the time of purchase.
- Operations: show a sales rep’s current book vs. historical assignments.
Concept explained simply
A Slowly Changing Dimension is a way to record how descriptive attributes change over time. You choose whether to overwrite, add versioned rows, or keep snapshots elsewhere.
- Type 0 (Fixed): never change after initial load. Good for immutable attributes like birthdate.
- Type 1 (Overwrite): update in place; no history. Good for corrections and non-historical KPIs.
- Type 2 (Row versioning): add a new row with date ranges and a current flag. Good for time-travel analysis.
- Type 3 (Previous value): keep limited history in extra columns (e.g., previous_department).
- Type 4 (History table): current table + separate history table for full changes.
- Type 6 (Hybrid 1+2+3): versioned rows plus current and previous columns for convenience.
- Type 7 (Dual key): the fact carries both the surrogate key (SK) and the durable business key (BK), so queries can join for either historical (as-was) or current (as-is) context.
Mental model
Think of a dimension as a passport:
- Type 1: you edit the passport—only the latest details exist.
- Type 2: you get a new passport each change, with validity dates.
- Type 3: you keep the current passport and note the previous detail on a sticky note.
- Type 4: the archive office stores all old passports; you carry only the current one.
Pattern selection quick helper
- Need accurate point-in-time reporting? Choose Type 2 or 6.
- Only need the latest value and corrections? Choose Type 1.
- Need current plus “what it used to be” for one attribute? Choose Type 3.
- Worried about dimension size but need history? Consider Type 4.
- Need both BK-based and SK-based joins? Consider Type 7.
Schema building blocks
- Surrogate key (e.g., customer_sk): integer identity for each versioned row.
- Business key (e.g., customer_id): stable natural identifier from source.
- Effective dating: valid_from_date, valid_to_date (often use high date like 9999-12-31 for open-ended).
- Current flag: is_current = 1 for the active row.
- Audit columns: inserted_at, updated_at, source_system, change_reason.
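A minimal DDL sketch that puts these building blocks together (illustrative only; the table and column names follow the customer example used later, the dialect is MySQL-style to match the SQL skeleton later in this section, and identity/default syntax varies by engine):
CREATE TABLE dim_customer (
    customer_sk     BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,  -- surrogate key, one per version
    customer_id     VARCHAR(50) NOT NULL,                        -- business key from source
    name            VARCHAR(200),
    address         VARCHAR(200),
    city            VARCHAR(100),
    region          VARCHAR(100),
    valid_from_date DATE NOT NULL,
    valid_to_date   DATE NOT NULL DEFAULT '9999-12-31',          -- open-ended high date
    is_current      TINYINT NOT NULL DEFAULT 1,
    inserted_at     TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    source_system   VARCHAR(50),
    change_reason   VARCHAR(200)
);
-- Indexes that support change detection and point-in-time joins (see implementation steps below)
CREATE INDEX ix_dim_customer_bk_current ON dim_customer (customer_id, is_current);
CREATE INDEX ix_dim_customer_bk_from    ON dim_customer (customer_id, valid_from_date);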
Worked examples
Example 1: Customer address changes
Need: Report historical sales by the region that was true at the time of purchase.
- Choose: Type 2.
- Why: Region is derived from address and affects historical attribution.
- Design: customer_sk (PK), customer_id (BK), address, city, region, valid_from_date, valid_to_date, is_current.
- Impact: Facts resolve customer_sk via a point-in-time lookup on the transaction date (see the join sketch below).
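A sketch of the point-in-time lookup used to stamp customer_sk onto the fact at load time (stg_fact_sales is an assumed staging table that still carries the business key; order_id and order_date are illustrative names, not part of the design above):
-- Resolve each staged sale to the customer version valid on the order date
SELECT f.order_id,
       f.order_date,
       d.customer_sk,      -- stamped onto the fact row
       d.region            -- region as it was at the time of purchase
FROM stg_fact_sales f
JOIN dim_customer d
  ON d.customer_id = f.customer_id
 AND f.order_date BETWEEN d.valid_from_date AND d.valid_to_date;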
Example 2: Product name correction (typo)
Need: Fix display names; history not required.
- Choose: Type 1.
- Why: It’s a correction; no business need to analyze past misspellings.
- Design: Update product_name in place; no new rows.
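For illustration, such a correction is a single in-place UPDATE (dim_product, its columns, and the literal values shown are hypothetical):
-- Overwrite the misspelled name on the existing row; no new version is created
UPDATE dim_product
SET product_name = 'Espresso Machine 2000',
    updated_at   = CURRENT_TIMESTAMP
WHERE product_id = 'P-1001';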
Example 3: Employee department reassignment
Need: Show historical department for audits and also quickly filter current and previous department.
- Choose: Type 6 (1+2+3 hybrid).
- Why: Combines full history (Type 2) with convenience columns (current/previous dept).
- Design: employee_sk, employee_id, department, prev_department, valid_from_date, valid_to_date, is_current; overwrite the convenience columns (Type 1) while adding a new row (Type 2).
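A sketch of a Type 6 change under the design above; the @-prefixed values stand in for parameters supplied by the load job, and current_department is an optional Type 1 convenience column some teams layer on top of this design:
-- 1) Close the open row for the employee being reassigned
UPDATE dim_employee
SET valid_to_date = DATE_SUB(@change_date, INTERVAL 1 DAY),
    is_current    = 0
WHERE employee_id = @employee_id
  AND is_current  = 1;
-- 2) Insert the new version: department carries the new value (Type 2),
--    prev_department carries the value just superseded (Type 3)
INSERT INTO dim_employee
    (employee_id, department, prev_department, valid_from_date, valid_to_date, is_current)
VALUES
    (@employee_id, @new_department, @old_department, @change_date, DATE('9999-12-31'), 1);
-- 3) Optional Type 1 step: overwrite a current_department convenience column on all of the employee's rows
-- UPDATE dim_employee SET current_department = @new_department WHERE employee_id = @employee_id;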
Example 4: Customer marketing preferences history offloaded
Need: Keep the dimension slim, but store full opt-in/opt-out history.
- Choose: Type 4.
- Why: Current table stays lean; history table stores changes.
- Design: dim_customer_current + dim_customer_history with effective dating.
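A sketch of how a preference change flows through the Type 4 pair (the @-prefixed values and the last_changed_date column on the current table are assumptions for illustration):
-- 1) Archive the outgoing preference into the history table with its validity window
INSERT INTO dim_customer_history
    (customer_id, marketing_opt_in, valid_from_date, valid_to_date)
SELECT c.customer_id,
       c.marketing_opt_in,
       c.last_changed_date,                      -- when the outgoing value took effect
       DATE_SUB(@change_date, INTERVAL 1 DAY)    -- day before the new value starts
FROM dim_customer_current c
WHERE c.customer_id = @customer_id;
-- 2) Overwrite the slim current table with the new preference
UPDATE dim_customer_current
SET marketing_opt_in  = @new_opt_in,
    last_changed_date = @change_date
WHERE customer_id = @customer_id;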
Implementation steps (Type 2 core)
- Stage changes: Land source deltas with a change hash for tracked attributes.
- Detect new vs. changed: Compare by business key; if hash differs, it’s a change.
- Close old row: Set valid_to_date = change_date - 1 day, is_current = 0.
- Insert new row: Copy attributes, new surrogate key, valid_from_date = change_date, valid_to_date = 9999-12-31, is_current = 1.
- Maintain indexes: BK + is_current; BK + valid_from_date for PIT joins.
Sample SQL skeleton (MySQL-flavored; adapt the hash and date functions to your engine)
-- Assume staging_customer has the latest snapshot for the day, one row per customer_id.
-- dim_customer is Type 2.
-- 1) Close existing current rows where a change is detected (run before step 2).
--    COALESCE guards against NULLs, which CONCAT_WS would otherwise silently skip.
UPDATE dim_customer d
JOIN staging_customer s
  ON d.customer_id = s.customer_id
 AND d.is_current = 1
SET d.valid_to_date = DATE_SUB(s.extract_date, INTERVAL 1 DAY),
    d.is_current = 0
WHERE MD5(CONCAT_WS('|', COALESCE(d.name,''), COALESCE(d.address,''), COALESCE(d.city,''), COALESCE(d.region,'')))
   <> MD5(CONCAT_WS('|', COALESCE(s.name,''), COALESCE(s.address,''), COALESCE(s.city,''), COALESCE(s.region,'')));
-- 2) Insert new versions for changed or new customers.
--    Rows closed in step 1 no longer have a current version, so they fall into the "new" branch below.
INSERT INTO dim_customer (
    customer_id, name, address, city, region, valid_from_date, valid_to_date, is_current, source_system
)
SELECT s.customer_id, s.name, s.address, s.city, s.region,
       s.extract_date, DATE('9999-12-31'), 1, s.source_system
FROM staging_customer s
LEFT JOIN dim_customer d
  ON d.customer_id = s.customer_id
 AND d.is_current = 1
WHERE d.customer_id IS NULL -- new customer (or just closed in step 1)
   OR MD5(CONCAT_WS('|', COALESCE(d.name,''), COALESCE(d.address,''), COALESCE(d.city,''), COALESCE(d.region,'')))
   <> MD5(CONCAT_WS('|', COALESCE(s.name,''), COALESCE(s.address,''), COALESCE(s.city,''), COALESCE(s.region,'')));
Common mistakes and self-check
- Forgetting to close the prior row (leads to overlapping date ranges). Self-check: For a given business key, there must be at most one row with is_current = 1 and no overlapping valid_from/to ranges (a validation query sketch follows this list).
- Using natural keys as surrogate keys. Self-check: Facts should join to SK; BK remains for change detection.
- Versioning non-analytic attributes (e.g., phone formatting). Self-check: Only version attributes that drive analysis or compliance.
- Not handling late-arriving data. Self-check: Ensure point-in-time logic can backfill effective dates correctly.
- Ignoring null semantics. Self-check: Normalize null vs. empty strings before hashing/comparison.
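The overlap and single-current-row self-checks can be automated; a sketch against the dim_customer design used earlier (both queries should return zero rows):
-- More than one current row for the same business key
SELECT customer_id, COUNT(*) AS current_rows
FROM dim_customer
WHERE is_current = 1
GROUP BY customer_id
HAVING COUNT(*) > 1;
-- Overlapping validity ranges for the same business key
SELECT a.customer_id, a.customer_sk AS row_a, b.customer_sk AS row_b
FROM dim_customer a
JOIN dim_customer b
  ON a.customer_id = b.customer_id
 AND a.customer_sk < b.customer_sk
 AND a.valid_from_date <= b.valid_to_date
 AND b.valid_from_date <= a.valid_to_date;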
Who this is for
- Data Architects designing marts and governance for historical accuracy.
- Data Engineers implementing CDC and dimension upserts.
- Analytics Engineers modeling semantic layers for BI.
Prerequisites
- Relational modeling basics and star schemas.
- Intermediate SQL (JOIN, MERGE/UPSERT, window functions).
- High-level understanding of CDC or batch snapshots.
Exercises you can do now
Do these before the quick test.
- Exercise 1: Choose SCD types and design columns for multiple attributes.
- Exercise 2: Write a Type 2 merge for a customer dimension.
Exercise 1: Pattern selection and schema sketch
You own dim_customer. For each attribute, pick an SCD type and list the necessary columns or handling:
- customer_status (Active, Suspended): analysts need point-in-time churn reporting.
- email_address: corrections only; no need to analyze past emails.
- loyalty_tier: need current tier and the previous tier for retention dashboards; also need time-travel sales attribution.
Deliverable: Short justification per attribute and a column list for the dimension (include keys, dates, flags).
Exercise 2: Implement a Type 2 upsert
Given staging_customer_daily (one row per customer_id per run) with columns: customer_id, name, city, region, extract_date. Implement SQL to:
- Detect changes in name, city, or region.
- Close old versions and insert new versions with valid_from/to and is_current.
- Ensure no overlapping date ranges.
Self-checklist
- I can explain when to use Types 0, 1, 2, 3, 4, 6, and 7.
- I can design a dimension with surrogate keys and effective dating.
- I can implement a Type 2 merge without overlapping validity.
- I can justify pattern choice with business questions.
Mini challenge
Your product dimension has attributes: product_name, brand, category, subcategory. Marketing wants only the latest names, but Finance needs historical category/subcategory for margin analysis. Propose a modeling approach (single dimension vs. hybrid), the SCD types per attribute, and how facts will point to the correct version. Keep it to 5-7 sentences.
Learning path
- Review star schema fundamentals: dimensions vs. facts and surrogate keys.
- Learn SCD types 0–7 and typical use cases.
- Practice Type 2 with effective dates, current flags, and hashing for change detection.
- Handle late-arriving and early-arriving dimensions; backfill dates safely.
- Integrate with CDC pipelines; reconcile deletes vs. soft-deletes.
- Add governance: data quality checks for overlaps, duplicate BKs, and null handling.
Practical projects
- Build a Type 2 customer dimension and join a sales fact table for point-in-time region reporting.
- Implement a Type 6 employee dimension and create a dashboard showing current and previous department headcount.
- Migrate a Type 1 product dimension to Type 2 for category changes; validate metrics before and after.
- Add automated checks: no overlapping validity, one current row per BK, and deterministic hashing.
Next steps
- Finish the exercises above.
- Take the quick test to confirm understanding.
- Apply one pattern to a real dataset in your environment and measure impact on a key KPI.