Who this is for
Analytics Engineers and BI developers who need fast, reliable joins and sensible strategies for very wide tables (hundreds of columns). If your queries shuffle huge datasets, time out, or produce duplicate rows, this lesson is for you.
Prerequisites
- Comfortable with SQL SELECT, JOIN, GROUP BY, and window functions
- Basic understanding of star schemas (facts and dimensions)
- Know your warehouse basics: columnar storage, partitions/clustering concepts
Why this matters
In real projects you will:
- Join large facts (orders, events) with multiple dimensions (customers, products, marketing sources)
- Decide between a convenient wide mart table vs. a normalized star schema
- Fix slow joins caused by mismatched data types, nulls, or skewed keys
- Control scan costs by selecting fewer columns and prefiltering before joins
Good join keys and sensible handling of wide tables can reduce runtime from minutes to seconds and avoid incorrect metrics.
Concept explained simply
A wide table is a table with many columns, usually made by flattening several entities into one. It is convenient for end users but can be expensive to store and maintain. Columnar warehouses read only the columns you select, but wide tables still increase maintenance risk, null sparsity, and schema churn.
Join keys are the fields you use to connect tables. Fast joins depend on keys that are stable, unique, consistently typed, cleaned (trim, lowercased), and not null. When in doubt, create a surrogate key (an integer ID or a hash) built from the natural business keys.
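A minimal sketch of that idea, assuming an illustrative raw_customers table (any deterministic hash works in place of MD5):
-- Normalize the natural key the same way everywhere, then hash it into a short surrogate key
SELECT
  customer_id,
  MD5(LOWER(TRIM(email))) AS customer_sk
FROM raw_customers;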
Mental model
- Think of joins as roads: clean keys are smooth highways; dirty keys are potholes causing reroutes and delays.
- Before joining, make each road straight: project only necessary columns, standardize types and formats, remove duplicates.
- Distribute data by the join key where possible (via clustering/partitioning choices), so cars meet faster (see the example below).
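Exact syntax depends on your warehouse. As one hedged illustration, BigQuery lets you partition a slim copy of an events table by day and cluster it by the join key (dataset, table, and column names are placeholders):
-- BigQuery example: partition by event date, cluster by the join key
CREATE TABLE analytics.events_slim
PARTITION BY DATE(event_ts)
CLUSTER BY user_id AS
SELECT user_id, session_id, event_ts, event_name
FROM analytics.events_wide;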
Key principles
- Consistency first: same data type, same case/whitespace rules, and same timezone for timestamps used in composite keys.
- Never join on nullable fields directly. COALESCE to a sentinel or use a surrogate key for unknowns.
- Prefer surrogate keys (integers or hashes) over long strings for performance and stability.
- Handle composite keys by hashing normalized components or by safe concatenation with escaping.
- Reduce row width before joins: select only needed columns and pre-aggregate when possible.
- Watch cardinality and skew: very common key values can cause hotspots; consider salting heavy hitters during aggregations (see the sketch after this list).
- Enforce uniqueness: use window functions to deduplicate source keys before joining.
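Salting is the least familiar of these principles, so here is a hedged sketch of a two-stage aggregation that spreads a heavy-hitter key across 8 random buckets before the final roll-up (fact_events and its columns are illustrative; salting matters most in distributed engines where one hot key overloads a single worker):
-- Stage 1: aggregate by (key, salt) so no single group dominates one worker
WITH salted AS (
  SELECT
    user_id,
    CAST(FLOOR(RANDOM() * 8) AS INT) AS salt,   -- any even spread across buckets works
    amount
  FROM fact_events
), partial AS (
  SELECT user_id, salt, SUM(amount) AS amount_partial, COUNT(*) AS events_partial
  FROM salted
  GROUP BY 1, 2
)
-- Stage 2: collapse the salt to get the true per-key totals
SELECT user_id, SUM(amount_partial) AS amount, SUM(events_partial) AS events
FROM partial
GROUP BY 1;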
Worked examples
Example 1 — Stable customer join key (natural to surrogate)
Problem: Orders join to customers by email or phone, but they are messy (case differences, spaces, nulls). Build a stable surrogate key and deduplicate.
-- Normalize customer attributes and create a surrogate key
WITH customers_norm AS (
SELECT
customer_id,
LOWER(TRIM(email)) AS email_n,
REGEXP_REPLACE(phone, '[^0-9]', '') AS phone_n,
updated_at
FROM raw_customers
), customers_dedup AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY COALESCE(email_n, phone_n)
ORDER BY updated_at DESC
) AS rn
FROM customers_norm
), customers_sk AS (
SELECT
customer_id,
COALESCE(email_n, phone_n) AS join_natural,
MD5(COALESCE(email_n, phone_n)) AS customer_sk
FROM customers_dedup
WHERE rn = 1
)
SELECT o.order_id, c.customer_sk, o.amount
FROM fact_orders o
LEFT JOIN customers_sk c
ON MD5(COALESCE(LOWER(TRIM(o.email)), REGEXP_REPLACE(o.phone, '[^0-9]', ''))) = c.customer_sk;
Result: a consistent, deduplicated join on a short, fixed-length key built the same way on both sides.
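If fact_orders is large, a hedged refinement is to precompute the same hash on the fact side in a staging step, so the ON clause compares stored values instead of evaluating functions per row at join time (shown here as a CTE; customers_sk from above is assumed to be materialized as a table or view):
-- Precompute the surrogate key on the fact side once, then join on plain equality
WITH orders_keyed AS (
  SELECT
    order_id,
    amount,
    MD5(COALESCE(LOWER(TRIM(email)), REGEXP_REPLACE(phone, '[^0-9]', ''))) AS customer_sk
  FROM fact_orders
)
SELECT o.order_id, c.customer_id, o.customer_sk, o.amount
FROM orders_keyed o
LEFT JOIN customers_sk c USING (customer_sk);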
Example 2 — Slim a wide event table before joining
Problem: events_wide has 300 columns, but the analysis only needs user_id, session_id, event_ts, event_name. First project only needed columns, then join.
WITH events_slim AS (
SELECT user_id, session_id, event_ts, event_name
FROM events_wide
WHERE event_ts >= DATE_TRUNC('day', CURRENT_DATE) - INTERVAL '30 day'
), users_sk AS (
SELECT user_id, MD5(LOWER(TRIM(email))) AS user_sk
FROM dim_users_clean
)
SELECT e.user_id, u.user_sk, e.session_id, e.event_name, COUNT(*) AS cnt
FROM events_slim e
LEFT JOIN users_sk u USING (user_id)
GROUP BY 1,2,3,4;
Result: reduced scans and faster join because we avoided touching hundreds of unused columns.
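If the analysis only needs per-user event counts, a hedged variation is to aggregate the slim fact before the join, so the dimension joins to far fewer rows (reusing events_slim and users_sk from the example above, which in practice would be staging views):
-- Aggregate first, then join the much smaller result to the dimension
WITH events_agg AS (
  SELECT user_id, event_name, COUNT(*) AS cnt
  FROM events_slim
  GROUP BY 1, 2
)
SELECT a.user_id, u.user_sk, a.event_name, a.cnt
FROM events_agg a
LEFT JOIN users_sk u USING (user_id);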
Example 3 — Composite product key via safe concatenation
Problem: Products are uniquely identified by (vendor_code, product_code). Implement a safe composite key for joins.
WITH products_clean AS (
SELECT
vendor_code,
product_code,
REPLACE(LOWER(TRIM(vendor_code)), '|', '\\|') AS v,
REPLACE(LOWER(TRIM(product_code)), '|', '\\|') AS p
FROM dim_products
), products_keyed AS (
SELECT *, MD5(CONCAT(v, '|', p)) AS product_sk
FROM products_clean
)
SELECT f.order_id, p.product_sk, f.qty
FROM fact_order_items f
JOIN products_keyed p
ON MD5(CONCAT(
REPLACE(LOWER(TRIM(f.vendor_code)), '|', '\\|'),
'|',
REPLACE(LOWER(TRIM(f.product_code)), '|', '\\|')
)) = p.product_sk;
Result: deterministic composite key, safe from delimiter collisions, and shorter to join on.
Practical steps you can follow
- Identify business keys for each dimension and fact. Decide what should be unique.
- Normalize text fields: TRIM, LOWER, remove punctuation where appropriate.
- Build surrogate keys (integer or MD5 hash) for natural or composite keys.
- Deduplicate each dimension by the natural key, keeping the most recent valid record.
- Project only necessary columns from wide tables before any join.
- Ensure same data types on both join sides; add explicit CASTs in staging if needed.
- Validate: uniqueness of keys, null rates, and post-join row counts (see the checks below).
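A hedged set of validation queries using the customer example's names (adapt table and column names to your own models):
-- 1) Key uniqueness: should return zero rows
SELECT customer_sk, COUNT(*) AS n
FROM customers_sk
GROUP BY 1
HAVING COUNT(*) > 1;

-- 2) Null rate on the join key inputs: should be at or near 0
SELECT AVG(CASE WHEN email IS NULL AND phone IS NULL THEN 1.0 ELSE 0.0 END) AS null_key_rate
FROM fact_orders;

-- 3) Post-join row count: should equal the fact row count (no fan-out from duplicate dimension keys)
SELECT
  (SELECT COUNT(*) FROM fact_orders) AS fact_rows,
  (SELECT COUNT(*)
   FROM fact_orders o
   LEFT JOIN customers_sk c
     ON MD5(COALESCE(LOWER(TRIM(o.email)), REGEXP_REPLACE(o.phone, '[^0-9]', ''))) = c.customer_sk) AS joined_rows;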
Common mistakes and how to self-check
- Mistake: Joining on nullable fields. Fix: COALESCE to a sentinel or use a surrogate key; measure the null rate before joining (see the staging sketch after this list).
- Mistake: Implicit type casts (e.g., string vs. integer). Fix: Explicit CAST in staging; compare data types with DESCRIBE.
- Mistake: Duplicate rows after join. Fix: Deduplicate dimension by key using ROW_NUMBER; verify key uniqueness.
- Mistake: Selecting all columns from a wide table. Fix: Only select needed columns; pre-aggregate where possible.
- Mistake: Skew from heavy-hitter key values (like nulls or a default). Fix: Reduce nulls, consider salting during aggregations, or prefilter.
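A hedged staging sketch that addresses the first two mistakes at once: a sentinel for nullable keys and an explicit cast so both join sides share a type (raw_orders, the BIGINT target type, and the 'unknown' sentinel are all illustrative):
-- Fix nullable keys and mismatched types in staging, before any join
CREATE VIEW stg_orders AS
SELECT
  order_id,
  CAST(customer_id AS BIGINT) AS customer_id,            -- explicit cast to match the dimension's key type
  COALESCE(LOWER(TRIM(email)), 'unknown') AS email_key   -- sentinel value instead of a NULL join key
FROM raw_orders;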
Self-check mini audit
- My join fields are the same type on both sides
- Null rate on join keys is near 0% (or handled)
- Dimension keys are unique after dedup
- I project only necessary columns from wide tables
- I can explain the business meaning of the key
Exercises
Do these now, then compare your answers with the solutions in the Exercise solutions section further down.
Exercise 1 — Build a stable customer join key and join orders
Instructions
Create a surrogate key for customers using email or phone (whichever is available). Then join orders to customers using that key and deduplicate customers by the latest updated_at.
-- Sample data
CREATE TEMP TABLE raw_customers(customer_id INT, email TEXT, phone TEXT, updated_at TIMESTAMP);
INSERT INTO raw_customers VALUES
  (1, ' Alice@example.com ', '(555) 123-4567', '2024-10-01'),
  (2, NULL, '5551234567', '2024-10-02'),
  (3, 'alice@example.com', NULL, '2024-10-03');

CREATE TEMP TABLE fact_orders(order_id INT, email TEXT, phone TEXT, amount NUMERIC);
INSERT INTO fact_orders VALUES
  (101, 'alice@example.com', NULL, 120.00),
  (102, NULL, '555-123-4567', 80.00),
  (103, 'ALICE@EXAMPLE.COM', NULL, 50.00);

Task: output order_id, customer_sk, amount. Expect 3 rows: orders 101 and 103 share the same email-based customer_sk, while order 102 matches by phone and gets a different customer_sk.
Exercise 2 — Slim a wide events table and aggregate
Instructions
Assume events_wide(user_id, session_id, event_ts, event_name, 280 other columns...). Create a CTE that selects only user_id, session_id, event_ts, event_name for the last 14 days, then aggregate to get event counts by user_id and event_name.
Task: output user_id, event_name, cnt where cnt >= 5 for the period.
Exercise solutions
Exercise 1 solution
WITH customers_norm AS (
SELECT
customer_id,
LOWER(TRIM(email)) AS email_n,
REGEXP_REPLACE(phone, '[^0-9]', '') AS phone_n,
updated_at
FROM raw_customers
), customers_dedup AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY COALESCE(email_n, phone_n)
ORDER BY updated_at DESC
) AS rn
FROM customers_norm
), customers_sk AS (
SELECT
customer_id,
COALESCE(email_n, phone_n) AS join_natural,
MD5(COALESCE(email_n, phone_n)) AS customer_sk
FROM customers_dedup
WHERE rn = 1
)
SELECT o.order_id,
c.customer_sk,
o.amount
FROM fact_orders o
LEFT JOIN customers_sk c
ON MD5(COALESCE(LOWER(TRIM(o.email)), REGEXP_REPLACE(o.phone,'[^0-9]',''))) = c.customer_sk
ORDER BY o.order_id;
Exercise 2 solution
WITH events_slim AS (
SELECT user_id, session_id, event_ts, event_name
FROM events_wide
WHERE event_ts >= CURRENT_DATE - INTERVAL '14 day'
)
SELECT user_id, event_name, COUNT(*) AS cnt
FROM events_slim
GROUP BY 1,2
HAVING COUNT(*) >= 5
ORDER BY cnt DESC;
Practical projects
- Customer 360 mart: Build a deduplicated dim_customer with a stable customer_sk and a wide convenience mart that joins orders, tickets, and marketing touches using that key.
- Event performance audit: Create a slim view on your wide events table and compare query runtimes and scanned columns vs. the full table on 3 typical BI questions.
- Composite product catalog: Implement a product_sk based on (vendor_code, product_code); add tests to enforce uniqueness and non-null join keys.
Learning path
- Clean and standardize natural keys for 1–2 dimensions.
- Add surrogate keys (hash or integer) and enforce uniqueness.
- Refactor one wide table into a slim staging view plus a documented wide mart.
- Benchmark: compare join performance before vs. after cleanup.
- Automate checks: uniqueness, null rate, and post-join row count validations.
Mini challenge
You have fact_sessions with user_id and device_id, and dim_users where some rows miss user_id but have email. Design a join strategy that maximizes match rate without duplicates. Explain:
- How you will derive a user_sk from available attributes
- How to deduplicate dim_users to enforce uniqueness
- How to handle sessions with no match (sentinel or null strategy)
Tip
Unify to a surrogate key built from the best available natural key (email, then user_id), deduplicate, then left join. Track and report unmatched rates.
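A hedged sketch of that strategy, assuming dim_users(user_id, email, updated_at) and fact_sessions(session_id, user_id, device_id); the 'unmatched' sentinel is illustrative:
WITH users_keyed AS (
  SELECT
    user_id,
    MD5(COALESCE(LOWER(TRIM(email)), CAST(user_id AS TEXT))) AS user_sk,  -- best available natural key: email, then user_id
    updated_at
  FROM dim_users
), users_dedup AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY user_sk
           ORDER BY (user_id IS NOT NULL) DESC, updated_at DESC           -- prefer rows that can join on user_id
         ) AS rn
  FROM users_keyed
)
SELECT
  s.session_id,
  COALESCE(u.user_sk, 'unmatched') AS user_sk,   -- sentinel keeps unmatched sessions visible
  (u.user_sk IS NULL)              AS is_unmatched   -- track and report this rate over time
FROM fact_sessions s
LEFT JOIN users_dedup u
  ON s.user_id = u.user_id
 AND u.rn = 1;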
Next steps
- Harden your staging: explicit CASTs for all join fields
- Add uniqueness tests for each dimension key
- Create a slim view for your widest table used in BI dashboards (a minimal sketch below)
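A minimal sketch of such a slim view, assuming the four columns below cover the common dashboard questions:
-- Expose only the columns dashboards actually use
CREATE VIEW events_slim_v AS
SELECT user_id, session_id, event_ts, event_name
FROM events_wide;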