Why this matters
As a BI Analyst, most of your queries combine a central fact table (transactions, events) with descriptive dimensions (customer, product, date). Getting these joins right ensures your metrics are correct, repeatable, and fast to compute.
- Build KPI dashboards: revenue by product category and customer segment.
- Analyze cohorts: first order date vs. ship date to measure fulfillment speed.
- Explain performance: identify which regions or segments drive growth.
Concept explained simply
Star schemas place a big fact table in the middle, surrounded by small dimension tables. The fact table stores numeric measures (quantity, price) and foreign keys to each dimension. Dimensions store human-readable attributes (names, categories, dates).
Mental model
- Fact = events at a grain (one row per order line, per day, etc.).
- Dimension = lookup details for each key (product, customer, date).
- Join = attach readable attributes to facts using the keys.
Picture a hub-and-spoke wheel: the hub is your fact, the spokes are joins to dimensions.
Core patterns and join types
- Inner join: keeps only fact rows with matching dimension keys. Good for clean, conformed data.
- Left join: keeps all fact rows; unmatched dimension columns come back as NULL (or roll up into an "Unknown" record if you map them). Safer for totals.
- Role-playing date dimensions: join the same dim_date multiple times (e.g., order_date_key and ship_date_key) using aliases.
- Surrogate keys: integers in facts that reference dimension primary keys. Avoid joining on names.
When to choose INNER vs LEFT
- Use INNER when you are sure every fact has a dimension match and you want to exclude invalid data.
- Use LEFT when you must preserve totals or investigate data quality gaps (NULLs reveal missing lookups); the sketch after this list shows how to compare the two.
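A quick way to see the difference in practice is to count the rows each join type keeps, using fact_sales and dim_product from the sample schema in the next section. This is a minimal sketch, assuming dim_product has at most one row per product_key; a gap between the counts means some fact rows carry product keys with no dimension match.

-- Compare how many fact rows survive each join type.
SELECT
  (SELECT COUNT(*) FROM fact_sales) AS fact_rows,
  (SELECT COUNT(*)
     FROM fact_sales f
     JOIN dim_product p ON f.product_key = p.product_key) AS inner_join_rows,
  (SELECT COUNT(*)
     FROM fact_sales f
     LEFT JOIN dim_product p ON f.product_key = p.product_key) AS left_join_rows;
-- Expect left_join_rows = fact_rows; inner_join_rows < fact_rows signals missing keys.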
Sample schema (for examples and exercises)
We will use a simplified star schema:
fact_sales
- order_id (int)
- order_line_id (int)
- order_date_key (int) -- e.g., 20230115
- ship_date_key (int)
- customer_key (int)
- product_key (int)
- quantity (int)
- unit_price (decimal)
- status (text) -- e.g., 'shipped', 'cancelled'

dim_customer
- customer_key (int PK)
- customer_id (text)
- segment (text) -- e.g., 'Consumer', 'Corporate', 'SMB'
- region (text)

dim_product
- product_key (int PK)
- product_id (text)
- category (text)
- subcategory (text)

dim_date
- date_key (int PK) -- yyyymmdd
- date (date)
- year (int)
- month (int)
- day (int)
Minimal example rows (conceptual):
dim_customer
(1,'C-001','Consumer','East')
(2,'C-002','Corporate','West')

dim_product
(10,'P-10','Furniture','Chairs')
(11,'P-11','Technology','Phones')

dim_date
(20230101,'2023-01-01',2023,1,1)
(20230102,'2023-01-02',2023,1,2)
(20230103,'2023-01-03',2023,1,3)

fact_sales
(100,1,20230101,20230102,1,10,2,100.00,'shipped')
(100,2,20230101,20230103,1,11,1,500.00,'shipped')
(101,1,20230102,20230103,2,10,1,120.00,'shipped')
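If you want to run the examples yourself, the following setup script is one possible way to create and populate these tables. The table and column names match the schema above; the exact types and literal syntax are assumptions (e.g., DECIMAL(10,2) for unit_price) and may need small adjustments for your warehouse.

-- Dimensions: one row per surrogate key.
CREATE TABLE dim_customer (customer_key INT PRIMARY KEY, customer_id TEXT, segment TEXT, region TEXT);
CREATE TABLE dim_product  (product_key INT PRIMARY KEY, product_id TEXT, category TEXT, subcategory TEXT);
CREATE TABLE dim_date     (date_key INT PRIMARY KEY, date DATE, year INT, month INT, day INT);

-- Fact: one row per order line, with foreign keys to each dimension.
CREATE TABLE fact_sales (
  order_id INT, order_line_id INT,
  order_date_key INT, ship_date_key INT,
  customer_key INT, product_key INT,
  quantity INT, unit_price DECIMAL(10,2), status TEXT
);

INSERT INTO dim_customer VALUES (1, 'C-001', 'Consumer', 'East'), (2, 'C-002', 'Corporate', 'West');
INSERT INTO dim_product  VALUES (10, 'P-10', 'Furniture', 'Chairs'), (11, 'P-11', 'Technology', 'Phones');
INSERT INTO dim_date VALUES
  (20230101, '2023-01-01', 2023, 1, 1),
  (20230102, '2023-01-02', 2023, 1, 2),
  (20230103, '2023-01-03', 2023, 1, 3);
INSERT INTO fact_sales VALUES
  (100, 1, 20230101, 20230102, 1, 10, 2, 100.00, 'shipped'),
  (100, 2, 20230101, 20230103, 1, 11, 1, 500.00, 'shipped'),
  (101, 1, 20230102, 20230103, 2, 10, 1, 120.00, 'shipped');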
Worked examples
Example 1 — Revenue by product category and customer segment (INNER JOIN)
SELECT
  p.category,
  c.segment,
  SUM(f.quantity * f.unit_price) AS revenue
FROM fact_sales f
JOIN dim_product p ON f.product_key = p.product_key
JOIN dim_customer c ON f.customer_key = c.customer_key
GROUP BY p.category, c.segment
ORDER BY revenue DESC;
Why it works: fact_sales holds measures; dimensions add labels. Grain stays at order line, but we aggregate safely by dimension attributes.
Example 2 — Preserve totals with LEFT JOIN and Unknowns
SELECT
  COALESCE(p.category, 'Unknown') AS category,
  SUM(f.quantity) AS units,
  SUM(f.quantity * f.unit_price) AS revenue
FROM fact_sales f
LEFT JOIN dim_product p ON f.product_key = p.product_key
GROUP BY COALESCE(p.category, 'Unknown');
Why it matters: the LEFT JOIN ensures you do not lose fact rows when a product_key is missing in the dimension, and the "Unknown" bucket exposes data quality issues.
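A natural follow-up, sketched here rather than taken from the lesson text, is an anti-join that lists the orphaned keys behind the Unknown bucket so the data team can fix the lookup:

SELECT
  f.product_key,
  COUNT(*) AS orphan_rows,
  SUM(f.quantity * f.unit_price) AS orphan_revenue
FROM fact_sales f
LEFT JOIN dim_product p ON f.product_key = p.product_key
WHERE p.product_key IS NULL   -- no matching dimension row
GROUP BY f.product_key
ORDER BY orphan_rows DESC;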
Example 3 — Role-playing date dimension to compute days to ship
SELECT
  o.year AS order_year,
  AVG(DATEDIFF(day, o.date, s.date)) AS avg_days_to_ship
FROM fact_sales f
JOIN dim_date o ON f.order_date_key = o.date_key
JOIN dim_date s ON f.ship_date_key = s.date_key
WHERE f.status = 'shipped'
GROUP BY o.year
ORDER BY o.year;
Key idea: join dim_date twice under different aliases (o and s) because the fact references dates in different roles. Note that DATEDIFF(day, start_date, end_date) is SQL Server-style syntax; in MySQL it is DATEDIFF(end_date, start_date), and in PostgreSQL you would subtract the dates directly (s.date - o.date).
Common mistakes and self-checks
- Mixing grains: joining a fact at order line grain to a dimension that has multiple rows per key (e.g., Type 2 history) without filtering to the current row leads to duplicates. Self-check: count rows before and after joins (see the sketch after this list).
- Using names instead of keys: product_id names can change; always use surrogate keys. Self-check: review ON clauses.
- Accidentally using an INNER join: unmatched fact rows get dropped, so totals are undercounted. Self-check: compare LEFT vs INNER row counts.
- Forgetting role-playing aliases: joining dim_date once and reusing columns for multiple date roles yields nonsense. Self-check: each date_key → its own dim_date alias.
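For the grain check in particular, a before/after row count makes fan-out from a many-rows-per-key dimension visible. A minimal sketch against the sample schema (swap in your own fact and dimension names):

SELECT
  (SELECT COUNT(*) FROM fact_sales) AS rows_before_join,
  (SELECT COUNT(*)
     FROM fact_sales f
     LEFT JOIN dim_customer c ON f.customer_key = c.customer_key) AS rows_after_join;
-- rows_after_join > rows_before_join means dim_customer has multiple rows
-- per customer_key (e.g., Type 2 history) and the join duplicates fact rows.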
Exercises
Use the sample schema above. Aim to write queries, verify row counts, and sanity-check outputs.
- Exercise 1: Revenue by category and segment for shipped orders in 2023.
- Exercise 2: On-time shipping rate (orders shipped within 2 days of order date).
Checklist before you run your queries
- Are you joining on surrogate keys (e.g., product_key) and not names?
- Does the join type match your intent (LEFT to preserve totals)?
- Are date dimensions correctly aliased per role?
- Did you confirm row counts before/after joins?
Practical projects
- Build a weekly sales dashboard: revenue, units, average order value by region and category. Use LEFT joins and a date spine from dim_date (see the sketch after this list).
- Create a fulfillment report: average and percentile days-to-ship by segment and category. Use role-playing date dimension joins.
- Data quality notebook: counts of unknown keys by dimension and week; alert when unknowns exceed a threshold.
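For the date-spine idea in the first project, one possible shape is to drive the query from dim_date so periods with no sales still appear with zero revenue. The sample dim_date has no week column, so this sketch groups by year and month; in a real warehouse you would group by your week attribute and add the region and category joins.

SELECT
  d.year,
  d.month,
  COALESCE(SUM(f.quantity), 0) AS units,
  COALESCE(SUM(f.quantity * f.unit_price), 0) AS revenue
FROM dim_date d
LEFT JOIN fact_sales f ON f.order_date_key = d.date_key
GROUP BY d.year, d.month
ORDER BY d.year, d.month;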
Learning path
- Review star schema basics: facts, dimensions, surrogate keys.
- Practice INNER vs LEFT joins and confirm row counts.
- Master role-playing date joins and date differences.
- Add aggregations safely (GROUP BY at the correct grain).
- Harden queries with Unknown buckets and data quality checks.
Who this is for and prerequisites
Who this is for
- Aspiring and junior BI Analysts working with star schemas.
- Data-savvy PMs and analysts who need trustworthy metrics.
Prerequisites
- Basic SQL: SELECT, WHERE, GROUP BY.
- Understanding of primary/foreign keys.
Next steps
- Do the exercises above, then take the quick test. Note: anyone can take the test for free; only logged-in users will have their progress saved.
- Apply the patterns to your real warehouse tables.
Mini challenge
Find the top 3 product subcategories by revenue for Corporate segment in 2023, and include the average days to ship for those subcategories.
Hint
Join fact_sales to dim_customer, dim_product, and two aliases of dim_date. Filter year = 2023 on the order date alias. Use DATEDIFF and GROUP BY subcategory.
One possible approach
WITH joined AS (
SELECT
p.subcategory,
SUM(f.quantity * f.unit_price) AS revenue,
AVG(DATEDIFF(day, o.date, s.date)) AS avg_days_to_ship
FROM fact_sales f
JOIN dim_customer c ON f.customer_key = c.customer_key
JOIN dim_product p ON f.product_key = p.product_key
JOIN dim_date o ON f.order_date_key = o.date_key
JOIN dim_date s ON f.ship_date_key = s.date_key
WHERE c.segment = 'Corporate' AND o.year = 2023 AND f.status = 'shipped'
GROUP BY p.subcategory
)
SELECT *
FROM joined
ORDER BY revenue DESC
FETCH FIRST 3 ROWS ONLY; -- or LIMIT 3