Who this is for
You are a BI Analyst or aspiring analyst who designs reports, models data for dashboards, or validates warehouse outputs. You need reliable joins across facts and dimensions.
Prerequisites
- Basic understanding of star schemas (facts and dimensions)
- Comfort with SQL joins and primary/foreign keys
- Awareness of Slowly Changing Dimensions (Type 1 and Type 2)
Why this matters
Real tasks you will face:
- Joining sales facts to the correct customer version after an address change
- Merging two CRMs where customer IDs overlap
- Handling late-arriving facts that reference old business identifiers
- Producing consistent, auditable KPIs even when source systems rename/reuse codes
Quick example of impact
If a product code is reused for a different item, reports joined by that code will mix two products. A surrogate key prevents this by assigning a new, unique warehouse key per product version.
Concept explained simply
Natural key (NK): A business identifier that already exists in source data, like customer_number, email, SKU, or ISO country code.
Surrogate key (SK): A warehouse-generated, meaningless identifier (often an integer) used to join facts to the right dimension row, regardless of how source identifiers change.
Mental model
- Natural key = "name on the door" (useful to humans, can change)
- Surrogate key = "internal badge ID" (used for secure, stable access)
- Facts carry the badge ID (SK) so they always point to the exact dimension row/version you intended
When to use which
- Use SK for most dimensions, especially with Type 2 history. This guarantees stable joins to the correct row over time.
- Use NK for lookup/search and deduplication logic. Keep it as an attribute and often as a business key in staging.
- Sometimes the NK is perfect as the key (e.g., Date dimension with yyyymmdd). In that case, you can use the natural date key directly.
Decision guide
- If the business identifier can change, be reused, or vary across sources → Create an SK.
- If you maintain history (Type 2) → Facts must reference SK.
- If the identifier is global, stable, and tiny (e.g., ISO country code) → NK can work; SK still acceptable.
Worked examples
Example 1: Customer dimension (Type 2)
Source provides customer_number and attributes. When address changes, a new dimension row is created with a new SK. Facts for a given date join to the SK that was active on that date. The customer_number (NK) is kept as a descriptive attribute.
Example 2: Product code reuse
Supplier reuses PRODUCT_CODE "A100" for a new product. With NK-only joins, historical sales mix old and new items. With SK, the warehouse creates a new SK for the new product row. Historical facts still point to the old SK; new facts point to the new SK.
Example 3: Merging two CRMs
CRM1 and CRM2 both have customer_id = 123 but refer to different people. A conformed Customer dimension uses a matching process to identify unique persons, assigns one SK per person-version, and stores original IDs as attributes. Facts from both CRMs join via SK, avoiding collisions.
Example 4: Date dimension
Using a natural key like 20250131 is fine. It is stable, compact, and meaningful. This is a common exception to the SK rule.
Hands-on: Exercises
These mirror the exercises below. Try them now; solutions are provided in expandable sections.
Exercise 1: Choose keys and map facts
- Given the Customer source data:
customer_number | email | city | effective_from | effective_to C001 | a@x.com | Boston | 2024-01-01 | 2024-06-30 C001 | a@x.com | Chicago | 2024-07-01 | 9999-12-31 C002 | b@example.org | Miami | 2024-02-10 | 9999-12-31
- Design the dimension keys and explain why SK is needed.
- Map these facts to SKs based on transaction_date:
transaction_id | customer_number | amount | transaction_date T1 | C001 | 120 | 2024-03-15 T2 | C001 | 95 | 2024-09-01 T3 | C002 | 40 | 2024-03-01
Exercise 2: Late-arriving fact and NK change
- Suppose product_code P9 was renamed to PX9 on 2024-05-01. The fact arrives late, referencing product_code P9 for a sale on 2024-05-20.
- Decide how to resolve the foreign key so the fact points to the correct SK.
- Explain what to do if no matching dimension row exists at fact load time.
Self-check checklist
- Facts always carry SK (except deliberate NK cases like Date)
- Type 2 changes create new SKs
- NKs are preserved as attributes for search and audit
- Late-arriving facts use effective-dated NK lookups to pick the right SK
- Unknown or missing NKs map to a standard unknown SK row
Common mistakes and how to self-check
- Joining facts on NK instead of SK for Type 2 dimensions. Self-check: Does your fact table store the dimension SK? If not, fix ETL to resolve SK at load time.
- Assuming NKs are globally unique across sources. Self-check: Do multiple systems use the same code space? If yes, include source-system qualifiers or use matching to a conformed SK.
- Not handling NK changes or reuse. Self-check: Do you maintain effective_from/to in the dimension and create new SKs upon change?
- No unknown or inferred members. Self-check: Do you have a default SK (e.g., -1) for unknown? Late facts should not fail the load.
- Forgetting that Date dimension can use an NK. Self-check: Are you overcomplicating dates with unnecessary SKs?
Practical projects
- Build a small Customer dimension with Type 2 history and load a Sales fact table. Verify joins return the correct city per sale date.
- Conform two Product files with overlapping codes. Create SKs, track Type 2 changes, and show a report that separates old-vs-new reused codes.
- Implement late-arriving fact handling with an inferred Product member that is later updated when full attributes arrive.
Learning path
- Start: Surrogate vs Natural keys (this lesson)
- Next: Slowly Changing Dimensions (Type 1/2/3) mechanics
- Then: Late-arriving facts and inferred members
- Finally: Conformed dimensions across multiple source systems
Next steps
- Complete the exercises above, then take the Quick Test below.
- Note: The quick test is available to everyone; only logged-in users get saved progress.
- Apply the patterns in your next data model: pick SKs for change-prone dimensions, and use NKs for business search.
Mini challenge
Your supplier introduces a new color variant but keeps the same product_code for marketing continuity. Design how your Product dimension handles this. Specify when a new SK is created, how historical facts remain intact, and what attribute changes trigger Type 2 vs Type 1 updates.