Why this matters
Analytics Engineers decide how changes to dimension attributes are stored. This affects trending, regulatory reporting, customer analytics, and how facts join to dimensions. You will routinely:
- Design customer/product dimensions that track history safely.
- Choose when to overwrite a value (Type 1) vs. add a new version (Type 2).
- Write MERGE/UPSERT logic that detects changes, sets validity dates, and flags current rows.
- Ensure fact tables consistently reference the correct dimension version.
Who this is for
- Analytics Engineers and BI Developers building star schemas.
- Data Analysts moving into data modeling and ELT/ETL ownership.
- Anyone responsible for reliable historical reporting.
Prerequisites
- Basic SQL (SELECT, JOIN, GROUP BY) and understanding of MERGE/UPSERT patterns.
- Familiarity with star schema terms: dimension, fact, surrogate key, natural key.
- Comfort with date/time fields and null handling.
Concept explained simply
Dimensions hold descriptive attributes (like customer city or product category). When those attributes change over time, we choose how to store the change.
- SCD Type 1: Overwrite the old value. No history is kept. Use for corrections and non-analytical changes (e.g., typo fix).
- SCD Type 2: Keep history by adding a new row with new values and validity dates. Prior row is closed. Facts can be analyzed against the attribute that was true at the time.
Mental model
Think of a dimension row as a versioned biography:
- Type 1: You replace a paragraph in the biography. Past references now see the updated text.
- Type 2: You add a new chapter. Past chapters remain intact, and the latest chapter is marked as current.
What columns are typical in a Type 2 dimension?
- surrogate_key (e.g., customer_sk)
- natural_key/business_key (e.g., customer_id)
- tracked attributes (e.g., city, email)
- valid_from, valid_to (dates or timestamps)
- is_current (boolean or 0/1)
- optional: hashdiff of tracked attributes for change detection
Worked examples
- Address change (keep history):
Customer 42 moves from Boston to Austin on 2024-04-15.
- Type 2: Close the Boston row (valid_to = 2024-04-14, is_current = 0), insert a new row with Austin (valid_from = 2024-04-15, is_current = 1).
- Result: Facts dated before 2024-04-15 show Boston; after show Austin.
- Typo correction (no history):
Customer name "Jonh" should be "John".
- Type 1: Overwrite the name on the current row.
- Result: Historical reports see the corrected name; no previous state preserved.
- Product reclassification (business decision):
Product P9 moves from category "Home" to "Garden" on 2025-02-01.
- If you need historical category analysis: Type 2 with valid_from = 2025-02-01.
- If you always want latest category on all reports: Type 1 overwrite (beware of back-in-time distortion).
Design choices and rules of thumb
- Regulatory or time-accurate analysis needs Type 2 for relevant attributes.
- Data quality fixes (typos, formatting) are typically Type 1.
- Choose the SCD type per attribute if needed (hybrid). For example, email = Type 2, name = Type 1.
- Always include a default/unknown member (e.g., surrogate_key = 0) for unmatched facts.
- Use a hash of tracked attributes to detect changes efficiently during MERGE.
SQL patterns (pseudo)
Type 1 MERGE (overwrite)
MERGE dim_customer d USING stage_customer s ON d.customer_id = s.customer_id WHEN MATCHED THEN UPDATE SET d.name = s.name, d.city = s.city, d.updated_at = CURRENT_TIMESTAMP WHEN NOT MATCHED THEN INSERT ( customer_sk, customer_id, name, city, valid_from, valid_to, is_current ) VALUES ( NEXTVAL, s.customer_id, s.name, s.city, CURRENT_DATE, NULL, 1 );
Type 2 MERGE (history)
-- detect changes via hashdiff across tracked attributes
MERGE dim_customer d
USING (
SELECT *, HASH(city, email, status) AS hashdiff FROM stage_customer
) s
ON d.customer_id = s.customer_id AND d.is_current = 1
WHEN MATCHED AND d.hashdiff <> s.hashdiff THEN
-- close old row
UPDATE SET d.valid_to = CURRENT_DATE - 1, d.is_current = 0
WHEN NOT MATCHED BY TARGET THEN
-- new business key
INSERT (customer_sk, customer_id, city, email, status, hashdiff,
valid_from, valid_to, is_current)
VALUES (NEXTVAL, s.customer_id, s.city, s.email, s.status, s.hashdiff,
CURRENT_DATE, NULL, 1);
-- then insert new versions for those we just closed
INSERT INTO dim_customer (
customer_sk, customer_id, city, email, status, hashdiff,
valid_from, valid_to, is_current
)
SELECT NEXTVAL, s.customer_id, s.city, s.email, s.status, s.hashdiff,
CURRENT_DATE, NULL, 1
FROM stage_customer s
JOIN dim_customer d_old
ON d_old.customer_id = s.customer_id
WHERE d_old.valid_to = CURRENT_DATE - 1; -- just closed
Common mistakes and self-checks
- Forgetting validity dates: Ensure valid_from and valid_to properly bound each version. Self-check: no overlapping windows for the same business key.
- Not marking current row: Always keep exactly one is_current = 1 per business key. Self-check: COUNT where is_current = 1 must be 1 per key.
- Changing natural keys retroactively: Do not rewrite historical natural keys; create new Type 2 row if attributes tracked by Type 2 change.
- Mismatched fact joins: Facts should join on surrogate keys captured at load time. Self-check: pick a sample fact and confirm it maps to the dimension version valid on its date.
- Overusing Type 2: Not all changes need history. Confirm with stakeholders which attributes drive analysis.
Exercises
Practice here, then take the Quick Test. Note: anyone can complete the test; only logged-in users will have their progress saved.
- Exercise 1 — Choose SCD Type and output rows
Staging records for customer_id = 100:
2024-03-01: name=Alex Chen, city=Denver, email=alex@example.com 2024-06-10: name=Alex Chen, city=Boulder, email=alex@example.com 2024-08-05: name=Alex Chen, city=Boulder, email=alex.chen@example.com (typo fix -> actual correct address)Rules: city is Type 2, name is Type 1, email is Type 2.
Create the resulting dim_customer rows with columns: customer_sk (placeholder OK), customer_id, name, city, email, valid_from, valid_to, is_current.
- Exercise 2 — Write the change detection logic
Given tracked Type 2 attributes: (city, email). Propose a change detection expression and outline the MERGE steps that ensure:
- Close old row when hashdiff changes.
- Insert a new current row with valid_from = CURRENT_DATE.
- Leave row untouched when no change.
- Checklist:
- One current row per business key.
- No overlapping validity windows.
- Only Type 2 attributes trigger new versions.
- Type 1 attributes overwrite within the current row.
Practical projects
- Build a Type 2 customer dimension that tracks city and email, but overwrites name (Type 1). Load three months of staged data and validate windows.
- Create a product dimension with Type 2 category and Type 1 description. Join a sales fact table and prove that historical reports reflect category at sale date.
- Add an unknown member row to both dimensions and verify fact loads never fail on missing keys.
Learning path
- Start: SCD Type 1 and Type 2 basics (this lesson).
- Next: Attribute-level SCD rules (hybrid models) and handling late arriving data.
- Then: SCD Type 3 and audit columns (who/when changed) for lineage.
- Finally: Performance patterns (partitioned MERGE, incremental loads, change data capture).
Mini challenge
Your marketing team wants to always see the latest customer status on all historical reports, but finance needs status as-of the transaction date. Propose a design that satisfies both with minimal duplication. Hint: Hybrid SCD by attribute, or separate views.
Next steps
- Finish the exercises, then take the Quick Test below.
- Implement a small Type 2 dimension in your sandbox and run a backfill + incremental load.
- Review with stakeholders which attributes truly need history.
Quick Test
Take the test to check your understanding. Anyone can attempt it; only logged-in users will have progress saved.