luvv to helpDiscover the Best Free Online Tools
Topic 5 of 8

Building Facts And Dimensions In SQL

Learn Building Facts And Dimensions In SQL for free with explanations, exercises, and a quick test (for Analytics Engineer).

Published: December 23, 2025 | Updated: December 23, 2025

Why this matters

As an Analytics Engineer, you turn raw application tables into clean, trustworthy star schemas that power BI dashboards, forecasting models, and business decisions. Common tasks include:

  • Designing star schemas with clear grain and conformed dimensions
  • Building Type 1 and Type 2 dimensions with surrogate keys
  • Creating fact tables with accurate measures and foreign keys to dimensions
  • Handling incremental loads and late-arriving data
  • Ensuring data quality with self-checks and reconciliation
Real-world examples
  • E-commerce: customer, product, and date dimensions; orders and payments facts
  • SaaS: account, plan, and user dimensions; subscriptions and usage facts
  • Finance: chart of accounts dimension; journal entries fact

Concept explained simply

A dimension table describes things (who, what, when, where). A fact table records events or measurements at a defined granularity (grain).

Mental model: store receipts

Each shopping receipt line is a fact (one row per item bought). The product catalog, customers, and calendar are dimensions. Facts point to dimensions via keys and store quantities and amounts.

Who this is for

  • Aspiring and junior Analytics Engineers
  • Data Analysts moving toward modeling
  • BI Developers who need robust back-end models

Prerequisites

  • Solid SQL basics (SELECT, JOIN, GROUP BY, window functions)
  • Comfort with data types, NULLs, and casting
  • Understanding of business domain terms (orders, customers, etc.)

Learning path

  1. Define star schema and grains
  2. Build core dimensions (Type 1), then add Type 2 where history matters
  3. Create fact tables with measures and foreign keys
  4. Add incremental logic and QA checks
  5. Optimize joins and document grain/columns

Key steps to build dimensions

  1. Define the grain: what makes one row unique? (e.g., one row per customer)
  2. Select attributes: stable, descriptive columns (avoid volatile metrics)
  3. Create a surrogate key: customer_sk integer or a stable hash
  4. Deduplicate: use ROW_NUMBER() by natural key, keep the latest
  5. Choose SCD type:
    • Type 1: overwrite changes (no history)
    • Type 2: track history with effective dates or an is_current flag
Type 2 details
  • Columns: effective_from, effective_to, is_current
  • Join facts to the correct historical row using event timestamp between effective_from and effective_to

Key steps to build facts

  1. Pick the grain: e.g., one row per order line
  2. Measures: amounts and counts, computed at the same grain
  3. Foreign keys: join to dimensions using surrogate keys
  4. Filters: exclude test data, canceled rows if business requires
  5. Incremental: load only new/changed data using watermarks
Performance tips
  • Prefer integer surrogate keys for joins
  • Pre-aggregate to the needed grain for dashboards
  • Partition or cluster large facts by date

Worked examples

Example 1: Customer Dimension (Type 1)

-- Keep latest version per natural key (customer_id), overwrite changes
WITH src AS (
  SELECT customer_id, email, first_name, last_name, country, updated_at,
         ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY updated_at DESC) AS rn
  FROM stg_customers
)
SELECT
  CAST(ABS(CAST(MD5(COALESCE(LOWER(email), CAST(customer_id AS VARCHAR))) AS BIGINT)) AS BIGINT) AS customer_sk,
  customer_id AS customer_id_nk,
  email,
  first_name,
  last_name,
  country,
  CURRENT_DATE AS record_loaded_at
FROM src
WHERE rn = 1;
Why this works

ROW_NUMBER() keeps the latest row per customer. A surrogate key is generated deterministically from a natural identifier; in production, many warehouses use sequences/identity columns instead.

Example 2: Product Dimension (Type 2)

-- Build SCD2 for products tracking price and category changes
WITH staged AS (
  SELECT * FROM stg_products
), dedup AS (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY product_id, valid_from ORDER BY updated_at DESC) AS rn
  FROM staged
), base AS (
  SELECT product_id, product_name, category, price, valid_from, valid_to
  FROM dedup WHERE rn = 1
), with_bounds AS (
  SELECT *,
         COALESCE(valid_to, DATE '9999-12-31') AS effective_to,
         valid_from AS effective_from
  FROM base
)
SELECT
  CAST(ABS(CAST(MD5(CAST(product_id AS VARCHAR) || '|' || CAST(effective_from AS VARCHAR)) AS BIGINT)) AS BIGINT) AS product_sk,
  product_id AS product_id_nk,
  product_name,
  category,
  price,
  effective_from,
  effective_to,
  CASE WHEN effective_to = DATE '9999-12-31' THEN TRUE ELSE FALSE END AS is_current
FROM with_bounds;
Notes
  • Use business validity dates when available; otherwise derive with window functions on change points.
  • Facts must time-travel join to the correct row.

Example 3: Orders Fact (one row per order line)

WITH order_lines AS (
  SELECT ol.order_id, ol.line_id, ol.product_id, ol.quantity, ol.unit_price,
         o.customer_id, o.order_timestamp::DATE AS order_date
  FROM stg_order_lines ol
  JOIN stg_orders o ON o.order_id = ol.order_id
  WHERE o.is_test = FALSE
), product_dim AS (
  SELECT * FROM dim_product  -- SCD2 with effective dates
), customer_dim AS (
  SELECT * FROM dim_customer -- Type 1 is fine for many cases
)
SELECT
  ol.order_id,
  ol.line_id,
  cd.customer_sk,
  pd.product_sk,
  CAST(ol.order_date AS DATE) AS order_date,
  ol.quantity,
  ol.unit_price,
  (ol.quantity * ol.unit_price) AS line_amount
FROM order_lines ol
LEFT JOIN customer_dim cd
  ON cd.customer_id_nk = ol.customer_id
LEFT JOIN product_dim pd
  ON pd.product_id_nk = ol.product_id
 AND CAST(ol.order_date AS DATE) BETWEEN pd.effective_from AND pd.effective_to;
Why LEFT JOIN?

To avoid dropping fact rows when a dimension row is temporarily missing. You can later track missing keys via data quality checks.

Common mistakes and self-checks

  • Mistake: Undefined grain. Fix: Write a one-sentence grain statement for every table.
  • Mistake: Joining facts to Type 2 dims without date filters. Fix: Join using event timestamp between effective dates.
  • Mistake: Using natural keys in facts. Fix: Map to surrogate keys for stable joins.
  • Mistake: Measures not at the fact grain. Fix: Recompute or pre-aggregate appropriately.
  • Mistake: Duplicates in dimensions. Fix: Deduplicate using ROW_NUMBER() by natural key.
Quick self-check queries
  • Fact row count vs source: sums should match after filters.
  • Check for NULL foreign keys in facts and track rates.
  • Ensure effective_from <= effective_to and exactly one current row per natural key for Type 2.

Exercises

These mirror the exercises below. Try them before peeking.

  1. Exercise 1: Build a Type 1 dim_customer from stg_customers with dedup and a surrogate key.
  2. Exercise 2: Create an orders_fact table at line grain and map to product and customer dimensions.
  • Checklist to complete each exercise:
    • Grain statement written in a comment
    • All joins explicit and justified
    • NULL handling and filters are documented
    • Foreign keys map to surrogate keys

Practical projects

  • E-commerce star schema: build dim_customer, dim_product, dim_date, and fact_orders with incremental loads.
  • SaaS subscriptions: dim_account (Type 2 on plan), fact_subscriptions, fact_usage_daily.
  • Finance mini-warehouse: dim_account, dim_date, fact_journal_entries with debits/credits validation.

Next steps

  • Add data quality tests (row counts, null rates, uniqueness)
  • Document tables: grain, columns, and business rules
  • Introduce incremental patterns and partitioning for large facts

Mini challenge

Your PM wants historical product category reporting. Convert your product dimension from Type 1 to Type 2 and update the fact join to time-travel. Validate that last quarter's category revenue matches after the change.

Quick Test access

The test is available to everyone. If you log in, your progress will be saved.

Practice Exercises

2 exercises to complete

Instructions

Create dim_customer (Type 1) from stg_customers. Keep the latest record per customer_id. Include:

  • customer_sk (surrogate), customer_id_nk, email, first_name, last_name, country, record_loaded_at
  • Deduplicate using ROW_NUMBER() by customer_id, ordered by updated_at DESC
  • Generate a deterministic surrogate key from email or id (hash or sequence)
Expected Output
A dimension table with one row per customer_id, latest attributes only, unique customer_sk, no duplicates.

Building Facts And Dimensions In SQL — Quick Test

Test your knowledge with 7 questions. Pass with 70% or higher.

7 questions70% to pass

Have questions about Building Facts And Dimensions In SQL?

AI Assistant

Ask questions about this tool