Why this matters
Surrogate keys and relationship design are the backbone of reliable BI data models. They determine how facts connect to dimensions, how history is tracked, and how queries stay fast and correct. In day-to-day BI Developer work, you will:
- Design star schemas where facts reference dimensions with stable keys.
- Implement slowly changing dimensions (Type 2) without breaking historical reports.
- Model many-to-many relationships (e.g., products and promotions) with bridge tables.
- Prevent orphan records and ensure consistent joins across datasets.
Who this is for
- BI Developers and Analytics Engineers building star/snowflake schemas.
- Data Analysts moving from report-writing to data modeling.
- Engineers responsible for dimensional conformance across marts.
Prerequisites
- Comfort with SQL SELECT/JOIN, and basic DDL (CREATE TABLE, constraints).
- Understanding of star schema basics: fact tables vs dimension tables.
- Awareness of Slowly Changing Dimensions (Type 1 vs Type 2) helps.
Concept explained simply
Surrogate keys are artificial identifiers (often integers or hashes) used to uniquely identify a dimension row. They have no business meaning. Natural keys come from the business (e.g., customer_id from CRM, email, SKU). In BI modeling, surrogate keys make joins stable even when business data changes or when multiple source systems disagree.
Relationship design is how tables connect: one-to-many (dimension to fact), many-to-many (use bridge tables), or role-playing relationships (same dimension used multiple times, like order_date and ship_date). Good design ensures facts at a clear grain, with foreign keys that always match valid dimension rows.
Mental model
Imagine your warehouse as a map:
- Nodes (dimensions) are stable landmarks with surrogate keys as their coordinates.
- Edges (facts) are events that point to landmarks via foreign keys.
- When a landmark changes (Type 2), it gets a new coordinate (new surrogate key) without moving the old one—so historical edges still point to the correct version.
When to prefer surrogate keys vs natural keys
- Prefer surrogate keys when: natural keys can change, multiple systems have different keys, you track SCD2 history, or you need narrow, fast joins.
- Natural keys are acceptable when: a degenerate dimension stores an event ID in a fact, or when a dimension is tiny and immutable (rare).
Core patterns and rules
- Use surrogate keys in dimensions. Keep the business key as an attribute (often called business_key or natural_key).
- Facts reference dimension surrogate keys. For SCD2, the fact row points to the specific historical version.
- Unknown and Not Applicable rows: Preload dimensions with special rows (e.g., -1 Unknown, -2 Not Applicable) and use them instead of NULL FKs.
- Many-to-many needs a bridge: Use a bridge table or a factless fact to model M:N (e.g., product_promotion_bridge).
- Role-playing dimensions: One date dimension can appear multiple times in a fact via different foreign keys (order_date_sk, ship_date_sk).
- Hash keys vs identity keys: Identity/sequence keys are simple in a single DB. Hash keys allow deterministic keying across environments or pipelines. Be mindful of collisions and data type lengths.
Worked examples
Example 1: Customer SCD2 with sales fact
-- Dimension: dim_customer (Type 2)
customer_sk BIGINT PRIMARY KEY, -- surrogate key
customer_bk VARCHAR, -- natural/business key from CRM
first_name VARCHAR,
last_name VARCHAR,
segment VARCHAR,
valid_from DATE,
valid_to DATE,
is_current BOOLEAN
-- Fact: fact_sales
sale_id BIGINT,
sale_date_sk INT, -- FK to dim_date
customer_sk BIGINT, -- FK to dim_customer (SCD2 version)
product_sk BIGINT,
quantity INT,
sales_amount NUMERIC(12,2)
If a customer changes segment, insert a new row in dim_customer with a new customer_sk, update valid_to/is_current on the old row. Facts keep pointing to the version that was current at the time of the sale.
Example 2: Product–Promotion many-to-many bridge
-- Bridge: product_promotion_bridge
product_sk BIGINT,
promotion_sk BIGINT,
bridge_weight NUMERIC(5,2) DEFAULT 1.0 -- optional allocation weight
-- Fact: fact_sales (add optional promotion link)
promotion_sk BIGINT NULL -- may be NULL if not applicable; prefer -2 for N/A if standardized
To analyze impact of promotions, you can join fact_sales to the bridge by product_sk and date range rules, or capture the specific promotion_sk on the fact at load time if known. For true M:N at query time, use the bridge and allocate measures with bridge_weight if a sale relates to multiple promotions.
Example 3: Unknown and Not Applicable dimension rows
-- dim_customer preload
customer_sk customer_bk first_name last_name ...
-1 NULL 'Unknown' 'Unknown' ...
-2 NULL 'N/A' 'N/A' ...
When a fact arrives before the customer dimension is available, load customer_sk = -1 (Unknown). If a concept doesn't apply (e.g., no shipping address for digital goods), use -2 (Not Applicable). This preserves referential integrity and keeps queries simple.
Example 4: Role-playing date dimension
-- dim_date used multiple times
fact_orders (
order_date_sk INT NOT NULL,
ship_date_sk INT NOT NULL,
invoice_date_sk INT NULL
)
All these FKs point to the same dim_date table but represent different roles. Create views or aliases (dim_order_date, dim_ship_date) for usability if desired.
Implementing surrogate keys
Option A: Sequence/Identity (recommended in a single warehouse)
-- Example (syntax varies by database)
CREATE TABLE dim_customer (
customer_sk BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
customer_bk VARCHAR NOT NULL,
...
);
Pros: simple, fast. Cons: not deterministic across environments or parallel loaders without coordination.
Option B: Hash keys (deterministic across systems)
-- Hash a stable key set that defines the row version
-- Example: concat business key + versioning attributes
customer_sk_hash = MD5(CONCAT(customer_bk, '|', segment, '|', effective_date))
Pros: deterministic and portable. Cons: risk of collisions (low but real), larger key size, careful canonicalization needed (trimming, case, delimiters).
Design checklist
- Have you defined the grain of each fact table clearly?
- Do all fact FKs point to valid dimension surrogate keys (including Unknown/N/A)?
- Are dimension surrogate keys stable, unique, and narrow?
- Is SCD2 implemented with correct valid_from/valid_to/is_current logic?
- Are many-to-many relationships handled via a bridge (with optional weights)?
- Are role-playing dimensions modeled with separate FKs?
Common mistakes and self-check
- Mistake: Using natural keys as fact FKs with SCD2 dimensions. Self-check: Does a customer attribute change create broken joins? If yes, switch to surrogate FKs.
- Mistake: Allowing NULL foreign keys. Self-check: Do counts drop unexpectedly after joins? Add Unknown/N/A rows and backfill.
- Mistake: No grain definition. Self-check: Can you describe one row of your fact in one sentence? If not, define the grain before modeling.
- Mistake: Bridge tables missing for M:N. Self-check: Do you duplicate rows or use DISTINCT to fix double-counting? Add a proper bridge with weights.
- Mistake: Inconsistent hash construction. Self-check: Are you trimming, lowercasing, and delimiting inputs consistently to avoid accidental changes?
Exercises
Do these to solidify the concepts. See solutions in the expandable sections. Your progress is saved if you are logged in; otherwise, you can still complete everything for practice.
-
Exercise 1: Model a sales star schema with SCD2 customers and promotions. Decide where surrogate keys are needed, define Unknown/N/A keys, and outline the bridge for product-promotion.
- Deliverable: a short DDL-style sketch and bullet list of relationships.
-
Exercise 2: Write SQL (pseudocode acceptable) to upsert an SCD2 dimension row for customer using a surrogate key strategy of your choice (identity or hash). Include logic for closing the old row and inserting the new version.
- Deliverable: merge/upsert logic and example rows before/after.
Exercise readiness checklist
- I can explain why facts should reference surrogate keys for SCD2 dimensions.
- I know when to use a bridge table for many-to-many relationships.
- I can implement Unknown/N/A dimension rows and use them instead of NULLs.
- I can choose between identity and hash surrogates and explain the trade-offs.
Practical projects
- Refactor an existing reporting dataset into a star schema: introduce surrogate keys, add SCD2 to one dimension, and measure query performance before/after.
- Implement a product-promotion bridge with allocation weights and demonstrate impact on revenue attribution.
- Create a role-playing date setup for orders (order_date, ship_date, invoice_date) and build a small dashboard using them.
Mini challenge
Given an OLTP orders schema (orders, order_items, customers, promotions), sketch a star schema:
- Which dimensions need surrogate keys?
- Where do you need a bridge table?
- What Unknown/N/A rows will you preload?
- How do you ensure facts reference the correct SCD2 version?
Quick Test and progress note
Take the Quick Test below to check your understanding. Available to everyone; only logged-in users get saved progress.
Learning path
- Before this: Star schema fundamentals, Fact vs Dimension modeling basics.
- Now: Surrogate keys and relationship design (this lesson).
- Next: Conformed dimensions and cross-mart consistency; SCD types in depth; Bridge table patterns and allocation methods.
Next steps
- Apply surrogate keys to one real table in your environment this week.
- Add Unknown/N/A rows to at least two dimensions and update loaders.
- Build a small report that demonstrates historical correctness with SCD2.