Why this matters
As an ETL Developer, you often combine a base dataset with reference, master, or external data to make it more useful. Examples you will face:
- Enriching transactions with customer segments and product attributes for analytics.
- Mapping IPs or ZIP codes to regions for reporting and routing.
- Merging internal IDs with external partner IDs to activate ads or notifications.
- Joining events with Slowly Changing Dimensions (SCD) to get the attributes that were valid at event time.
Good enrichment boosts downstream accuracy, reduces manual patching, and keeps metrics consistent across teams.
Concept explained simply
Data enrichment means adding helpful attributes to a base record by looking them up in other sources. Think of it like adding labels to a box: the box (base record) stays the same, but you add more useful tags (attributes) by looking at trusted lists (dimensions) or external sources.
Mental model
- Base facts: the main table you must not duplicate or lose (e.g., transactions).
- Lookup sources: smaller, authoritative tables (e.g., product catalog, customer CRM) or APIs.
- Keys: how rows connect (customer_id, product_id, date ranges for SCD).
- Contracts: data types, uniqueness, and freshness expectations for each source.
Process: select base rows → standardize keys → join with the cleanest lookup → resolve conflicts (duplicates, timing) → add QA checks → publish.
Core enrichment patterns
- Left join on stable keys: keep all base records even if no match; pre-aggregate or deduplicate the lookup side to avoid row multiplication.
- Choose-one from many: when a lookup has multiple matches (e.g., SCD Type 2), pick the valid one via effective_date ranges or a window function (see the sketch after this list).
- Code mapping: convert raw values to canonical codes via mapping tables.
- Hierarchy lookup: roll up to category, region, or parent entity.
- API-based enrichment: use batch endpoints, caching, retries, and rate-limit aware scheduling.
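A minimal sketch of the choose-one pattern (window-function variant). The table and column names here are illustrative: it assumes a lookup crm_customers with an updated_at column and a base table base_facts. ROW_NUMBER keeps one deterministic row per customer before the left join:
WITH crm_one AS (
  SELECT customer_id, segment,
         ROW_NUMBER() OVER (
           PARTITION BY customer_id
           ORDER BY updated_at DESC    -- latest row wins; add tie-breakers if needed
         ) AS rn
  FROM crm_customers
)
SELECT b.customer_id, c.segment
FROM base_facts b
LEFT JOIN crm_one c
  ON b.customer_id = c.customer_id
 AND c.rn = 1;                         -- base rows are kept even when there is no match
Filtering on rn inside the join condition (rather than a WHERE clause) preserves unmatched base rows, which is the point of the left join.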
Tip: join safety checklist
- Confirm base record cardinality (1 row in, 1 row out).
- Enforce uniqueness of lookup keys or pick a single record deterministically.
- Measure match_rate = matched_rows / total_base_rows (see the query sketch after this list).
- Normalize keys before joining (trim, lowercase, type-cast) and make the join conditions null-safe.
- Add provenance columns: source_system, match_status, as_of_date.
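A sketch of the match_rate metric, assuming the enriched output carries the match_status provenance column described above; the table name enriched_orders is illustrative:
SELECT COUNT(*) AS total_base_rows,
       SUM(CASE WHEN match_status = 'MATCH' THEN 1 ELSE 0 END) AS matched_rows,
       1.0 * SUM(CASE WHEN match_status = 'MATCH' THEN 1 ELSE 0 END)
           / COUNT(*) AS match_rate     -- multiply by 1.0 to force decimal division
FROM enriched_orders;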
Worked examples (3)
1) Transactions + CRM + Product Catalog
Goal: add customer segment and product category to each transaction.
- Base: transactions(tx_id, customer_id, product_id, amount, tx_ts)
- Lookup A (CRM, SCD2): crm_customers(customer_id, segment, vip_flag, valid_from, valid_to)
- Lookup B: products(product_id, category, brand)
Approach:
- Normalize keys (trim, case).
- Pick the CRM row valid at tx_ts via date-range filter.
- Left join products (assume unique product_id).
SQL logic idea:
WITH crm_pick AS (
  -- Normalize CRM keys once so the join below compares clean values
  SELECT LOWER(TRIM(customer_id)) AS customer_id,
         segment, vip_flag, valid_from, valid_to
  FROM crm_customers
)
SELECT t.tx_id, t.customer_id, t.product_id, t.amount,
       c.segment, c.vip_flag,
       p.category, p.brand
FROM transactions t
LEFT JOIN crm_pick c
  ON LOWER(TRIM(t.customer_id)) = c.customer_id
 AND t.tx_ts >= c.valid_from
 AND (c.valid_to IS NULL OR t.tx_ts < c.valid_to)   -- row valid at tx_ts; open-ended rows have NULL valid_to
LEFT JOIN products p
  ON t.product_id = p.product_id;
Notes: ensure crm_customers has non-overlapping date ranges per customer; if ranges can overlap, use ROW_NUMBER to pick one row deterministically. A sketch of an overlap check follows.
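A minimal sketch of that overlap check, assuming the column names above and NULL valid_to for the open-ended current row; LAG compares each version with the previous one per customer:
WITH ordered AS (
  SELECT customer_id, valid_from, valid_to,
         LAG(valid_to)   OVER (PARTITION BY customer_id ORDER BY valid_from) AS prev_valid_to,
         LAG(valid_from) OVER (PARTITION BY customer_id ORDER BY valid_from) AS prev_valid_from
  FROM crm_customers
)
SELECT customer_id, valid_from, valid_to, prev_valid_from, prev_valid_to
FROM ordered
WHERE prev_valid_from IS NOT NULL          -- there is an earlier version
  AND (prev_valid_to IS NULL               -- earlier version was never closed
       OR valid_from < prev_valid_to);     -- or the validity windows overlap
Any rows returned indicate overlapping SCD ranges that need a deterministic pick rule.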
2) Addresses to Regions (ZIP mapping)
Goal: add region for reporting; unknown ZIPs become UNKNOWN.
- Base: orders(order_id, zipcode)
- Lookup: zip_dim(zipcode, city, region)
SQL logic idea:
SELECT o.order_id,
o.zipcode,
COALESCE(z.region, 'UNKNOWN') AS region,
CASE WHEN z.zipcode IS NULL THEN 'NO_MATCH' ELSE 'MATCH' END AS match_status
FROM orders o
LEFT JOIN zip_dim z
ON LPAD(TRIM(o.zipcode), 5, '0') = z.zipcode;
Notes: add a match-rate metric and alert if it drops suddenly.
3) Multi-source attribute with precedence
Goal: pick product_category from the highest-priority source available.
- Base: sales(product_id)
- Lookups: vendor_catalog(category), internal_catalog(category)
SQL logic idea:
WITH merged AS (
SELECT s.product_id,
ic.category AS cat_internal,
vc.category AS cat_vendor
FROM sales s
LEFT JOIN internal_catalog ic USING (product_id)
LEFT JOIN vendor_catalog vc USING (product_id)
)
SELECT product_id,
COALESCE(cat_internal, cat_vendor, 'UNCLASSIFIED') AS product_category,
CASE
WHEN cat_internal IS NOT NULL THEN 'INTERNAL'
WHEN cat_vendor IS NOT NULL THEN 'VENDOR'
ELSE 'NONE'
END AS category_source
FROM merged;
Step-by-step playbook
- Define the contract: base table, required attributes, acceptable match_rate, allowed nulls.
- Profile sources: key uniqueness, date coverage, freshness, value ranges.
- Standardize keys: trim, case, type, padding, timezone normalization for time-based joins.
- Pick a join strategy: left join; pick-one-from-many via windows or date ranges.
- Add provenance: source_system, match_status, as_of_date/effective_ts.
- Validate: match_rate, duplicate check on base keys, null checks on critical fields.
- Publish atomically and idempotently (e.g., overwrite partition or use merge/upsert; a sketch follows this list).
- Monitor: alerts on match_rate, null spikes, and row-count drifts.
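A minimal sketch of the merge/upsert publish step, assuming a target table enriched_transactions keyed by tx_id and a staging table enriched_transactions_stage produced by the enrichment query (both names are illustrative; exact MERGE syntax differs slightly across warehouses):
MERGE INTO enriched_transactions AS tgt
USING enriched_transactions_stage AS src
  ON tgt.tx_id = src.tx_id              -- matching on the base key makes reruns idempotent
WHEN MATCHED THEN UPDATE SET
     segment      = src.segment,
     category     = src.category,
     match_status = src.match_status,
     as_of_date   = src.as_of_date
WHEN NOT MATCHED THEN INSERT
     (tx_id, customer_id, product_id, amount, tx_ts,
      segment, category, match_status, as_of_date)
     VALUES
     (src.tx_id, src.customer_id, src.product_id, src.amount, src.tx_ts,
      src.segment, src.category, src.match_status, src.as_of_date);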
Quality checks you should include
- Match rate threshold: e.g., >= 95% for stable dimensions.
- Duplicate detection: base key remains unique post-join (query sketches for this and the freshness check follow the list).
- Not-null on critical attributes (or controlled null with fallback default).
- Freshness: lookup max(updated_at) within expected window.
- Schema drifts: unexpected new columns or type changes.
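Two of these checks as query sketches, assuming the enriched output is called enriched_transactions and the lookup carries an updated_at column; interval syntax varies by dialect:
-- Duplicate detection: the base key must stay unique after enrichment (expect zero rows)
SELECT tx_id, COUNT(*) AS row_count
FROM enriched_transactions
GROUP BY tx_id
HAVING COUNT(*) > 1;
-- Freshness: flag the lookup as stale if it has not been updated within the expected window
SELECT MAX(updated_at) AS last_update,
       CASE WHEN MAX(updated_at) < CURRENT_TIMESTAMP - INTERVAL '24' HOUR
            THEN 'STALE' ELSE 'FRESH' END AS freshness_status
FROM crm_customers;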
Self-check mini list
- [ ] Did any base row multiply? If yes, why?
- [ ] Is match_rate acceptable and explained?
- [ ] Are time-bound joins correct around boundary moments?
- [ ] Are defaults documented for unmatched rows?
Exercises
Do these to cement the skill.
Exercise 1: Enrich transactions with CRM + products
Data sample:
- transactions: (tx_id, customer_id, product_id, amount, tx_ts)
- crm_customers: (customer_id, segment, vip_flag, updated_at) — may have multiple rows per customer; pick the latest updated_at
- products: (product_id, category, brand)
Task:
- Standardize keys (trim and case).
- Pick latest CRM row per customer.
- Left join both lookups; keep one row per tx_id.
- Add match_status_crm and match_status_prod flags.
Checklist before you run
- [ ] Window function used to pick latest CRM row.
- [ ] No row-multiplication on tx_id.
- [ ] Flags correctly reflect matches.
- [ ] Null-safe key handling.
Exercise 2: Region enrichment with fallback + KPI
Data sample:
- orders: (order_id, zipcode)
- zip_dim: (zipcode, city, region)
Task:
- Normalize zipcode to 5 digits.
- Left join; default region to 'UNKNOWN'.
- Add match_status; compute match_rate KPI.
Checklist before you run
- [ ] Zipcode padded to 5 digits.
- [ ] Region fallback applied.
- [ ] Match rate computed as a query.
Common mistakes and how to self-check
- Unintended row multiplication: fix by pre-aggregating or selecting a single lookup row with ROW_NUMBER.
- Key mismatches (types/case/whitespace): standardize before join.
- Ignoring time validity: use effective_from/effective_to or latest-as-of logic.
- No provenance: add source fields and match flags to debug quickly.
- Silent nulls on critical attributes: set thresholds and alerts.
Quick self-audit
- [ ] For any base key, count(rows_out) == 1.
- [ ] Documented rules for ties and nulls.
- [ ] Tests exist for match_rate, duplicates, and nulls.
Practical projects
- Retail mini-mart: enrich sales with products and customers, publish a daily star-schema table with QA metrics.
- Geo routing: enrich orders with regions and create a weekly dashboard showing regional coverage and match_rate trends.
- Attribution prep: enrich web events with campaign metadata (SCD) valid on event_time; verify using boundary tests.
Mini challenge
Given events(event_id, user_id, event_ts) and users_scd(user_id, plan, valid_from, valid_to), produce one row per event with the correct plan at event_ts. Add match_status and a metric table with daily match_rate. What edge cases could break your join?
Who this is for
- ETL Developers building analytics or activation datasets.
- Data Engineers maintaining dimensional models and pipelines.
- Analysts who need reliable enriched facts.
Prerequisites
- Comfort with SQL joins and window functions.
- Basic understanding of SCD Type 2 concepts.
- Familiarity with data quality checks (nulls, duplicates, freshness).
Learning path
- Before: Keys and joins, SCD handling, data profiling.
- Now: Multi-source enrichment (this lesson).
- Next: Idempotent publishing, partitioning, and incremental loads; monitoring and alerting.
Next steps
- Implement the exercises on a sample warehouse.
- Add automated tests for match_rate and duplicates.
- Schedule a daily enrichment job and track metrics in a dashboard.
Quick Test
Take the quick test below.