Why this matters
As an ETL Developer, you constantly convert raw keys and codes into consistent, meaningful values. Properly managing reference data (countries, currencies, product categories, statuses) and lookups lets you:
- Resolve business keys to surrogate keys for reliable joins.
- Maintain history when reference values change (SCD patterns).
- Handle late-arriving facts without losing referential integrity.
- Keep pipelines fast using efficient lookup and caching strategies.
Real tasks you will face:
- Building a dimension for product categories and joining it to facts with a stable surrogate key.
- Merging code lists from multiple sources into a conformed reference table.
- Designing safe defaults for missing or invalid reference values.
Concept explained simply
Reference data is the small, relatively stable set of values used to classify or interpret facts. Lookups are how your ETL finds the right reference row and attaches its surrogate key to a fact.
- Business key: The real-world identifier (e.g., ISO country code).
- Surrogate key (SK): Internal integer key for speed and stability.
- Validity: Reference values may change; track from/to dates if you need history.
Mental model
Think of reference data as a dictionary. Your facts contain words (codes). The dictionary maps them to definitions (attributes) and an index number (surrogate key). You attach the index number to each fact so future changes won’t break joins.
Core patterns and key decisions
Pattern: Preload and cache reference data
- Small and static: load into memory at job start for instant lookups.
- Medium and slowly changing: query once per batch and cache with TTL.
- Large or frequently changing: rely on indexed DB joins or key-value store.
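The middle case above (query once per batch, cache with TTL) can be sketched as a small wrapper; the loader callable and the one-hour TTL are illustrative assumptions, not a prescribed API:

```python
import time

class TtlCache:
    """Hold a reference table in memory, reloading after ttl_seconds."""

    def __init__(self, loader, ttl_seconds):
        self._loader = loader        # callable returning {business_key: row}
        self._ttl = ttl_seconds
        self._data = None
        self._loaded_at = 0.0

    def get(self, key, default=None):
        now = time.monotonic()
        if self._data is None or now - self._loaded_at > self._ttl:
            self._data = self._loader()   # at most one query per TTL window
            self._loaded_at = now
        return self._data.get(key, default)

# Usage sketch: the loader would normally run a SELECT against the reference table.
cache = TtlCache(lambda: {"US": 1, "DE": 2}, ttl_seconds=3600)
```

For the small-and-static case, the same shape with an effectively infinite TTL amounts to "load once at job start".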
Pattern: Surrogate key resolution
- Standardize incoming business keys (trim, case fold, validate).
- Upsert new/changed reference rows (MERGE).
- Lookup surrogate keys by business key (and validity, if SCD 2).
- Attach SKs to facts; default to an Unknown row if missing.
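The four steps above can be sketched end to end with SQLite; the table, column names, and the Unknown row at SK 0 are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_country (
        country_sk   INTEGER PRIMARY KEY,
        country_code TEXT UNIQUE
    );
    INSERT INTO dim_country VALUES (0, '??');  -- Unknown/default row
    INSERT INTO dim_country VALUES (1, 'US');
""")

def resolve_sk(raw_code):
    # 1. Standardize the incoming business key.
    code = raw_code.strip().upper()
    # 2. Upsert: insert the code only if it has never been seen.
    conn.execute(
        "INSERT OR IGNORE INTO dim_country (country_code) VALUES (?)", (code,)
    )
    # 3. Look up the surrogate key by business key.
    row = conn.execute(
        "SELECT country_sk FROM dim_country WHERE country_code = ?", (code,)
    ).fetchone()
    # 4. Default to the Unknown row if resolution still fails.
    return row[0] if row else 0
```

In a warehouse the upsert and lookup would be set-based (MERGE plus a join), not row-at-a-time; the per-row function just makes the sequence explicit.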
Pattern: SCD choice for reference data
- Type 0: Never change (rare; for truly static values).
- Type 1: Overwrite values (latest only; no history).
- Type 2: New row per change with valid_from/valid_to and current flag (historical accuracy).
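The Type 2 close/open step can be sketched as a pure function over dimension rows; field names follow the valid_from/valid_to/current-flag convention above, with None standing in for an open-ended valid_to:

```python
from datetime import date

def apply_scd2_change(rows, business_key, new_name, change_date):
    """Close the current row for business_key and open a new version.

    rows: list of dicts with keys
          business_key, name, valid_from, valid_to, is_current.
    """
    for row in rows:
        if row["business_key"] == business_key and row["is_current"]:
            if row["name"] == new_name:
                return rows                  # no real change; nothing to do
            row["valid_to"] = change_date    # close the old version
            row["is_current"] = False
    rows.append({
        "business_key": business_key,
        "name": new_name,
        "valid_from": change_date,
        "valid_to": None,                    # open-ended
        "is_current": True,
    })
    return rows
```

The early return on an unchanged name is what keeps a rerun from creating duplicate current rows.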
Pattern: Late-arriving facts
- Use validity dates to find the correct historical reference row.
- If no match, attach Unknown SK, then reprocess once the reference appears.
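A minimal sketch of the attach-then-reprocess flow, assuming facts carry a code field and a quarantine list holds the unmatched ones until the reference lands:

```python
UNKNOWN_SK = 0

def attach_sks(facts, lookup):
    """First pass: attach SKs, collecting unmatched facts for reprocessing."""
    quarantine = []
    for fact in facts:
        sk = lookup.get(fact["code"])
        if sk is None:
            fact["sk"] = UNKNOWN_SK        # load the fact anyway
            quarantine.append(fact)        # retry once the reference appears
        else:
            fact["sk"] = sk
    return quarantine

def reprocess(quarantine, lookup):
    """Later pass: re-resolve facts once the reference data has arrived."""
    still_unmatched = []
    for fact in quarantine:
        sk = lookup.get(fact["code"])
        if sk is None:
            still_unmatched.append(fact)
        else:
            fact["sk"] = sk
    return still_unmatched
```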
Pattern: Conformed reference data
- Unify multiple source code lists into a single canonical table.
- Keep a mapping table: source_code → canonical_code (+ validity/version).
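The mapping-table idea reduces to a keyed translation; the source systems and codes below are invented for illustration:

```python
# Hypothetical per-source code lists mapped onto one canonical code.
# In a warehouse this is a mapping table: (source, source_code) -> canonical_code.
MAPPING = {
    ("erp_a", "ELEC"): "ELECTRONICS",
    ("erp_b", "EL"):   "ELECTRONICS",
    ("erp_a", "TOY"):  "TOYS",
}

def to_canonical(source, code, default="UNKNOWN"):
    """Translate a source-specific code to the canonical business key."""
    return MAPPING.get((source, code.strip().upper()), default)
```

Facts then resolve their surrogate keys against the single canonical dimension, not against each source's code list.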
Pattern: Safe defaults and data quality
- Always create an Unknown/default reference row with SK = 0 or 1.
- Log and quarantine unexpected codes; alert upstream owners.
- Track match rate and unmatched rate as quality KPIs.
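The match-rate KPI is a one-liner once facts carry their resolved SKs; the field name "sk" and Unknown SK of 0 are assumptions matching the examples here:

```python
def match_rate(facts, unknown_sk=0):
    """Share of facts that resolved to a real reference row (1.0 = all matched)."""
    if not facts:
        return 1.0
    matched = sum(1 for f in facts if f["sk"] != unknown_sk)
    return matched / len(facts)
```

Alerting on a sudden drop in this number is usually the fastest way to catch an upstream code-list change.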
Worked examples
Example 1: Country code lookup (Type 1)
Input facts have country_code values. You maintain a Country dimension with SKs and names.
- Standardize code to uppercase 2-letter format.
- Upsert any new country codes.
- Left join facts to Country on business key; use Unknown SK if no match.
Sample SQL sketch
-- Upsert reference
MERGE INTO dim_country d
USING (SELECT DISTINCT UPPER(TRIM(country_code)) AS code FROM stg_orders) s
ON d.country_code = s.code
WHEN NOT MATCHED THEN
  INSERT (country_code, country_name) VALUES (s.code, 'Unknown Name');

-- Resolve SKs
SELECT f.*, COALESCE(d.country_sk, 0) AS country_sk
FROM stg_orders f
LEFT JOIN dim_country d
  ON d.country_code = UPPER(TRIM(f.country_code));
Example 2: Product category with history (Type 2)
Category names can be renamed. You need historical accuracy in reports.
- dim_category has category_sk, business_key, name, valid_from, valid_to, is_current.
- On change, close current row (set valid_to) and insert new row with new name.
- When loading facts, join using business_key plus event_date between valid_from and valid_to.
Key join condition
LEFT JOIN dim_category d
  ON d.business_key = f.category_code
 AND f.event_date >= d.valid_from
 AND f.event_date < COALESCE(d.valid_to, '9999-12-31');
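The same half-open validity test (valid_from inclusive, valid_to exclusive) can be sketched in code, with None standing in for the open-ended '9999-12-31':

```python
from datetime import date

def lookup_as_of(rows, business_key, event_date, unknown_sk=0):
    """Find the SK of the dimension row valid at event_date (back-dated facts included)."""
    for row in rows:
        if (row["business_key"] == business_key
                and row["valid_from"] <= event_date
                and (row["valid_to"] is None or event_date < row["valid_to"])):
            return row["sk"]
    return unknown_sk
```

The half-open interval matters: an event landing exactly on a change date matches the new version only, never both.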
Example 3: Fast email domain lookup with cache
You tag customers by email domain (e.g., gmail.com → Personal). The mapping updates daily.
- Load mapping table at job start into memory.
- Cache with a daily TTL; refresh if job spans days.
- Fallback to Unknown category for unseen domains and log them.
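A minimal sketch of the domain lookup with the Unknown fallback and logging of unseen domains; the mapping contents are illustrative, and in a real job the dict would be reloaded on the daily TTL:

```python
UNSEEN = set()   # unseen domains collected for review/upstream alerting

DOMAIN_CATEGORY = {   # loaded from the mapping table at job start
    "gmail.com":   "Personal",
    "outlook.com": "Personal",
}

def tag_customer(email):
    """Classify a customer by email domain, defaulting to Unknown."""
    domain = email.rsplit("@", 1)[-1].strip().lower()
    category = DOMAIN_CATEGORY.get(domain)
    if category is None:
        UNSEEN.add(domain)        # log it instead of failing the row
        return "Unknown"
    return category
```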
How to implement (step-by-step)
- Profile and standardize: Trim, case-normalize, validate code formats, deduplicate.
- Define the reference model: Choose SCD type, set Unknown row, add indexes on business key (and validity columns for Type 2).
- Build the upsert: MERGE/UPSERT new business keys, close/open SCD 2 rows when changes occur.
- Resolve SKs: Join facts to reference by business key (+ date range if Type 2).
- Optimize lookups: Small tables → in-memory cache; otherwise ensure proper DB indexes and batch joins.
- Monitor quality: Track unmatched and late-match rates; alert on spikes.
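The first step (profile and standardize) often reduces to one small normalization function; the 2-letter pattern below is an assumption fitting the country-code example, not a general rule:

```python
import re

CODE_PATTERN = re.compile(r"[A-Z]{2}")   # assumed format: ISO 3166-1 alpha-2

def standardize_code(raw):
    """Trim, case-fold, and validate a business key; None signals invalid."""
    if raw is None:
        return None
    code = raw.strip().upper()
    return code if CODE_PATTERN.fullmatch(code) else None
```

Applying the same function in staging and in the join predicate is what prevents the false-miss problem called out under common mistakes.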
Checklist: ready to ship
- Unknown/default row exists and is used when needed.
- Business keys are standardized before lookup.
- MERGE logic covers insert, update (Type 1), and close/open (Type 2) correctly.
- Indexes present on business key and date validity columns.
- Metrics collected for unmatched lookups.
Common mistakes and self-check
- Joining before standardizing: leads to false misses. Self-check: Is trim/case-fold applied in both staging and joins?
- No Unknown row: facts get dropped or cause referential errors. Self-check: Do unmatched facts still load with a safe SK?
- Wrong SCD type: historical reports disagree over time. Self-check: Does your business need to see old names or only the latest?
- Unindexed lookups: slow pipelines. Self-check: Explain plan shows index usage on business key and date ranges?
- Cache staleness: incorrect mappings after updates. Self-check: Do you have a cache TTL or versioning signal?
Exercises
Do these hands-on tasks.
- Exercise 1: Build a country reference upsert and surrogate key resolution for orders. Expected: All facts loaded with a valid country_sk, with Unknown for invalid codes.
- Exercise 2: Implement a Type 2 dimension for customer status and join facts using valid_from/valid_to. Expected: Facts reflect status as of the event date.
Self-check checklist
- Do unmatched codes route to Unknown without dropping rows?
- Do you see correct SKs for back-dated facts?
- Does MERGE logic avoid duplicate current rows in SCD 2?
- Are performance metrics acceptable after adding indexes?
Practical projects
- Conform two product category lists from different ERPs into one canonical dimension with a mapping bridge table.
- Build a reference service table for currencies with daily FX rates and implement a time-aware lookup for transactions.
- Create a data quality dashboard tracking unmatched rates, late-arriving facts, and cache hit ratios.
Who this is for
- ETL Developers and Data Engineers working on warehouse or lakehouse pipelines.
- Analysts/BI Developers who maintain dimensions and mapping tables.
Prerequisites
- Comfort with SQL (joins, MERGE/UPSERT, window functions).
- Basic understanding of dimensional modeling (facts/dimensions, SCD types).
Learning path
- Review dimensional basics and SCD types.
- Implement a simple Type 1 reference table with surrogate keys.
- Add Type 2 validity for one dimension and practice late-arriving facts.
- Introduce caching strategies and monitor unmatched rates.
Next steps
- Harden your MERGE logic and add unit tests on SCD transitions.
- Automate alerts for unmatched spikes and cache refresh failures.
Mini challenge
Your source changes from 2-letter to 3-letter country codes mid-year. Design a mapping strategy so historical facts still resolve correctly and no rows are lost. Hint: maintain a bridge from old to new codes with validity dates and keep a single business key in the dimension.