luvv to helpDiscover the Best Free Online Tools
Topic 5 of 8

Surrogate Keys And Natural Keys

Learn Surrogate Keys And Natural Keys for free with explanations, exercises, and a quick test (for Analytics Engineer).

Published: December 23, 2025 | Updated: December 23, 2025

Why this matters

Surrogate keys and natural keys determine how your fact tables reliably join to dimensions over time. As an Analytics Engineer you will:

  • Build star schemas where dimension rows can change (SCD Type 2) without breaking historical facts.
  • Integrate multiple sources that use different business identifiers.
  • Keep joins stable even when business attributes (like email or SKU) are edited or reused.
  • Design incremental models that upsert rows safely without duplication.
Real tasks you might do
  • Choose a surrogate key for dim_customer so facts still point to the right historical record after a customer changes email.
  • Merge product catalogs from two systems with conflicting product codes.
  • Refactor a model that used a natural key and now shows broken joins.

Concept explained simply

Natural key: a business identifier that already exists in source data (email, SKU, order_number). Users recognize it. It may change or be reused.

Surrogate key: an internally created, meaningless identifier (e.g., integer sequence) used purely for data modeling and joins. It never changes once assigned to a specific row.

Rule of thumb: Use a surrogate key for dimension primary keys. Keep the natural key too as a business_key to help match new data.

Mental model

Think of a license plate (natural key) vs the car's internal chip ID (surrogate key). People recognize plates and they can change. The chip ID is stable and used by systems to track the car precisely.

How to choose keys (decision guide)

  1. List candidate natural keys. What does the business use today (SKU, email, account_number)?
  2. Check stability. Has it ever changed or been reused? If yes or unknown, do not use it as the dimension primary key.
  3. Create a surrogate key. Use an auto-increment, sequence, or generated integer. Keep the natural key as a unique business key if it is stable enough for matching.
  4. For SCD Type 2 dimensions, the surrogate key changes per version; the business key stays the same.
  5. For facts, store the dimension surrogate key (foreign key), not the natural key.
Quick checks
  • If attribute updates should NOT relink history, you need a surrogate key.
  • If multiple systems send different IDs, use a surrogate key and map all natural keys to it.
  • If joining on the natural key would merge different people/products over time, never do it.

Worked examples

Example 1: Customer dimension (email changes)

Problem: Customers can change email. Facts should keep pointing to the correct historical profile.

  • Natural key candidates: email, external_customer_id.
  • Approach: Use surrogate key customer_sk as the PK. Keep external_customer_id as business_key. On email change (SCD2), insert a new row with a new customer_sk, same business_key.
  • Result: Facts from before the change still reference the old customer_sk.
Example 2: Product dimension (SKU reuse)

Problem: A SKU might be retired and later reused for a different product line.

  • Natural key candidate: sku.
  • Approach: Use product_sk as PK. sku remains a descriptive attribute or a constrained business_key only if guaranteed unique and non-reused. For SCD2, each change to attributes like brand or size gets a new product_sk.
  • Result: Historical orders still connect to the correct historical product definition.
Example 3: Date dimension

Problem: A date like 2025-12-31 is inherently stable.

  • Natural key: calendar_date (YYYY-MM-DD).
  • Approach: For dim_date, you can use a deterministic integer surrogate like date_key = 20251231. Here the natural key is stable enough, so a surrogate is simply a standardized representation.
  • Result: Efficient joins and clear readability.

Implementation patterns (SQL)

Surrogate key with identity/sequence
-- Example generic SQL (adjust to your warehouse)
create table dim_customer (
  customer_sk      bigint generated always as identity primary key,
  business_key     varchar not null,         -- e.g., external_customer_id
  email            varchar,
  first_name       varchar,
  last_name        varchar,
  is_current       boolean default true,
  valid_from       timestamp not null,
  valid_to         timestamp
);
create unique index uq_dim_customer_bk_current
  on dim_customer(business_key, is_current) where is_current = true;

Notes: business_key is the stable natural key from the source used to match new records. Surrogate key is the join target from facts.

Deterministic hash key for staging
-- Use for de-dup or change detection, not as the dimension PK if you need SCD2 versions
select md5(coalesce(business_key, '')) as bk_hash,
       md5(coalesce(business_key,'')||'|'||coalesce(email,'')) as scd2_fingerprint
from staging_customer;

Use hashes to detect changes and to match across systems. For the dimension primary key, prefer an integer surrogate. Hash collisions are possible but extremely rare with strong hashes; still, treat as a practical, not perfect, approach.

Fact loading pattern
-- Lookup current dimension row to fetch surrogate key
insert into fct_orders (order_id, order_ts, customer_sk, product_sk, amount)
select s.order_id,
       s.order_ts,
       d_c.customer_sk,
       d_p.product_sk,
       s.amount
from staging_orders s
join dim_customer d_c
  on d_c.business_key = s.external_customer_id and d_c.is_current = true
join dim_product d_p
  on d_p.business_key = s.product_code and d_p.is_current = true;

Facts reference surrogate keys. When a dimension changes, new facts automatically link to the new surrogate key version.

Exercises you can practice

These mirror the exercises below. Try them first, then open solutions.

Exercise 1: Identify keys for a star schema

You have orders with customers and products. Choose the surrogate and natural keys for dim_customer and dim_product. Explain how facts should join.

  • I selected a surrogate key for customer and product.
  • I kept a natural business key for matching.
  • I ensured facts reference surrogate keys only.

Exercise 2: Write a safe SCD2 upsert

Write SQL to insert a new dim_customer row when the email changes for the same business_key. Set is_current flags and valid_to correctly.

  • Existing current row closed with valid_to set.
  • New row inserted with is_current = true.
  • Surrogate key is new; business_key unchanged.

Exercise 3: Debug a broken join

A report shows fewer orders after a customer email migration. Hypothesize the key mistake and propose a fix.

  • I identified the wrong join on email (natural key).
  • I proposed joining on surrogate key via business_key lookup.

Common mistakes and self-checks

  • Using a mutable natural key (email, phone) as the dimension primary key. Self-check: Could this value change? If yes, use a surrogate key.
  • Not storing the business_key. Self-check: Can you match new rows to existing entities deterministically?
  • Linking facts with natural keys. Self-check: If the natural key changes, do old facts relink incorrectly?
  • Reusing surrogate keys across SCD2 versions. Self-check: Does each version have a unique surrogate key?
  • Failing to close is_current rows. Self-check: Are there overlapping validity ranges for the same business_key?
Quick self-audit
  • Pick any dimension row. If an attribute changes, would facts from last month remain correct?
  • Verify you can map any source record to exactly one current dimension row.
  • Ensure all fact tables store only surrogate keys for dimension references.

Practical projects

  • Project 1: Build dim_customer (SCD2) with customer_sk, business_key, is_current, valid_from/valid_to. Load a week of mocked changes and verify joins from fct_orders remain stable.
  • Project 2: Merge two product catalogs with conflicting SKUs into a single dim_product using product_sk. Prove that historical facts still point to the correct product definitions.
What good looks like
  • No duplicate current rows per business_key.
  • Facts resolve to exactly one current dimension row.
  • Historical facts keep their original meaning after dimension attribute changes.

Who this is for

Analytics Engineers, BI Developers, and Data Analysts building star schemas and preparing robust semantic layers.

Prerequisites

  • Basic SQL (joins, constraints, window functions).
  • Understanding of star schemas and SCD concepts.
  • ETL/ELT fundamentals.

Learning path

  1. Understand natural vs surrogate keys and when to use each.
  2. Design dimensions with business_key and surrogate primary key.
  3. Implement SCD2 with safe upserts and validity windows.
  4. Load facts using surrogate key lookups.
  5. Audit joins and backfill safely.

Mini challenge

You inherit a model where dim_customer uses email as the primary key, and facts join on email. The business is adding SSO, and emails will change for 20% of users. Describe the exact steps to refactor to a surrogate key without breaking reports.

Next steps

  • Complete the exercises and then take the quick test below.
  • If you pass, apply the patterns to one of your production dimensions.
  • If you struggle, rework Example 1 and Project 1 with smaller datasets.

Quick Test

Anyone can take the test for free. Only logged-in users will have their progress saved.

Practice Exercises

3 exercises to complete

Instructions

You have these source fields:

  • Customers: external_customer_id, email, first_name, last_name
  • Products: product_code, sku, name, category

Design dim_customer and dim_product. Pick the primary key, the business_key, and list the columns used for SCD2 change detection. Explain how facts will join.

Expected Output
A short design describing surrogate primary keys, chosen business keys, and a sentence on fact joins via surrogate keys.

Surrogate Keys And Natural Keys — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

8 questions70% to pass

Have questions about Surrogate Keys And Natural Keys?

AI Assistant

Ask questions about this tool