Why this matters
As an Analytics Engineer, you turn raw application tables into clean, trustworthy star schemas that power BI dashboards, forecasting models, and business decisions. Common tasks include:
- Designing star schemas with clear grain and conformed dimensions
- Building Type 1 and Type 2 dimensions with surrogate keys
- Creating fact tables with accurate measures and foreign keys to dimensions
- Handling incremental loads and late-arriving data
- Ensuring data quality with self-checks and reconciliation
Real-world examples
- E-commerce: customer, product, and date dimensions; orders and payments facts
- SaaS: account, plan, and user dimensions; subscriptions and usage facts
- Finance: chart of accounts dimension; journal entries fact
Concept explained simply
A dimension table describes things (who, what, when, where). A fact table records events or measurements at a defined granularity (grain).
Mental model: store receipts
Each shopping receipt line is a fact (one row per item bought). The product catalog, customers, and calendar are dimensions. Facts point to dimensions via keys and store quantities and amounts.
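The receipt analogy maps directly onto table shapes. A minimal sketch (table and column names here are illustrative, not tied to a specific warehouse):

```sql
-- One row per product (dimension: describes things)
CREATE TABLE dim_product (
    product_sk   INTEGER PRIMARY KEY,  -- surrogate key
    product_id   VARCHAR,              -- natural key from the source system
    product_name VARCHAR,
    category     VARCHAR
);

-- One row per receipt line (fact: records events at a defined grain)
CREATE TABLE fact_sales (
    order_id    VARCHAR,
    line_id     INTEGER,
    product_sk  INTEGER REFERENCES dim_product (product_sk),
    date_sk     INTEGER,               -- points at a calendar dimension
    quantity    INTEGER,
    line_amount DECIMAL(12, 2)
);
```

Note that the fact table stores keys and numbers only; all descriptive text lives in the dimensions.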
Who this is for
- Aspiring and junior Analytics Engineers
- Data Analysts moving toward modeling
- BI Developers who need robust back-end models
Prerequisites
- Solid SQL basics (SELECT, JOIN, GROUP BY, window functions)
- Comfort with data types, NULLs, and casting
- Understanding of business domain terms (orders, customers, etc.)
Learning path
- Define star schema and grains
- Build core dimensions (Type 1), then add Type 2 where history matters
- Create fact tables with measures and foreign keys
- Add incremental logic and QA checks
- Optimize joins and document grain/columns
Key steps to build dimensions
- Define the grain: what makes one row unique? (e.g., one row per customer)
- Select attributes: stable, descriptive columns (avoid volatile metrics)
- Create a surrogate key: customer_sk as an integer or a stable hash
- Deduplicate: use ROW_NUMBER() by natural key and keep the latest row
- Choose SCD type:
  - Type 1: overwrite changes (no history)
  - Type 2: track history with effective dates or an is_current flag
Type 2 details
- Columns: effective_from, effective_to, is_current
- Join facts to the correct historical row using the event timestamp between effective_from and effective_to
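This historical lookup is a range join. A sketch, assuming a staging table of fact events with an event_timestamp column (both names are placeholders):

```sql
SELECT f.order_id, d.product_sk, d.category
FROM stg_order_events f
JOIN dim_product d
  ON d.product_id_nk = f.product_id
 AND f.event_timestamp >= d.effective_from
 AND f.event_timestamp <  d.effective_to;  -- half-open interval
```

Using a half-open interval (`>= from`, `< to`) avoids double-matching an event that lands exactly on a version boundary; if your validity dates are inclusive on both ends, BETWEEN works too.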
Key steps to build facts
- Pick the grain: e.g., one row per order line
- Measures: amounts and counts, computed at the same grain
- Foreign keys: join to dimensions using surrogate keys
- Filters: exclude test data and, if the business requires it, canceled rows
- Incremental: load only new/changed data using watermarks
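The watermark pattern from the last step can be sketched as follows; stg_order_lines and the etl_watermarks bookkeeping table are assumptions for illustration:

```sql
-- Pull only rows changed since the last successful load
INSERT INTO fact_orders (order_id, line_id, quantity, unit_price, line_amount)
SELECT ol.order_id, ol.line_id, ol.quantity, ol.unit_price,
       ol.quantity * ol.unit_price
FROM stg_order_lines ol
WHERE ol.updated_at > (SELECT last_loaded_at
                       FROM etl_watermarks
                       WHERE table_name = 'fact_orders');

-- Advance the watermark only after the load succeeds
UPDATE etl_watermarks
SET last_loaded_at = (SELECT MAX(updated_at) FROM stg_order_lines)
WHERE table_name = 'fact_orders';
```

In practice you would wrap both statements in a transaction (or use your orchestrator's state) so a failed load does not advance the watermark.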
Performance tips
- Prefer integer surrogate keys for joins
- Pre-aggregate to the needed grain for dashboards
- Partition or cluster large facts by date
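As an example of pre-aggregation, a dashboard that only needs daily totals never has to scan line-level facts. A sketch, assuming a line-grain fact_orders table:

```sql
-- Rollup to daily grain for dashboard queries
CREATE TABLE fact_orders_daily AS
SELECT
    order_date,
    COUNT(DISTINCT order_id) AS order_count,
    SUM(line_amount)         AS revenue
FROM fact_orders
GROUP BY order_date;
```

Rebuild or incrementally refresh the rollup after each fact load so the two grains stay reconciled.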
Worked examples
Example 1: Customer Dimension (Type 1)
-- Keep latest version per natural key (customer_id), overwrite changes
WITH src AS (
SELECT customer_id, email, first_name, last_name, country, updated_at,
ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY updated_at DESC) AS rn
FROM stg_customers
)
SELECT
MD5(COALESCE(LOWER(email), CAST(customer_id AS VARCHAR))) AS customer_sk, -- deterministic hash surrogate key
customer_id AS customer_id_nk,
email,
first_name,
last_name,
country,
CURRENT_DATE AS record_loaded_at
FROM src
WHERE rn = 1;

Why this works
ROW_NUMBER() keeps the latest row per customer. A surrogate key is generated deterministically from a natural identifier; in production, many warehouses use sequences/identity columns instead.
Example 2: Product Dimension (Type 2)
-- Build SCD2 for products tracking price and category changes
WITH staged AS (
SELECT * FROM stg_products
), dedup AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY product_id, valid_from ORDER BY updated_at DESC) AS rn
FROM staged
), base AS (
SELECT product_id, product_name, category, price, valid_from, valid_to
FROM dedup WHERE rn = 1
), with_bounds AS (
SELECT *,
COALESCE(valid_to, DATE '9999-12-31') AS effective_to,
valid_from AS effective_from
FROM base
)
SELECT
MD5(CAST(product_id AS VARCHAR) || '|' || CAST(effective_from AS VARCHAR)) AS product_sk, -- hash of natural key plus validity start
product_id AS product_id_nk,
product_name,
category,
price,
effective_from,
effective_to,
CASE WHEN effective_to = DATE '9999-12-31' THEN TRUE ELSE FALSE END AS is_current
FROM with_bounds;

Notes
- Use business validity dates when available; otherwise derive with window functions on change points.
- Facts must join to the historically correct dimension row using the event timestamp (a time-travel join).
Example 3: Orders Fact (one row per order line)
WITH order_lines AS (
SELECT ol.order_id, ol.line_id, ol.product_id, ol.quantity, ol.unit_price,
o.customer_id, o.order_timestamp::DATE AS order_date
FROM stg_order_lines ol
JOIN stg_orders o ON o.order_id = ol.order_id
WHERE o.is_test = FALSE
), product_dim AS (
SELECT * FROM dim_product -- SCD2 with effective dates
), customer_dim AS (
SELECT * FROM dim_customer -- Type 1 is fine for many cases
)
SELECT
ol.order_id,
ol.line_id,
cd.customer_sk,
pd.product_sk,
CAST(ol.order_date AS DATE) AS order_date,
ol.quantity,
ol.unit_price,
(ol.quantity * ol.unit_price) AS line_amount
FROM order_lines ol
LEFT JOIN customer_dim cd
ON cd.customer_id_nk = ol.customer_id
LEFT JOIN product_dim pd
ON pd.product_id_nk = ol.product_id
AND CAST(ol.order_date AS DATE) BETWEEN pd.effective_from AND pd.effective_to;

Why LEFT JOIN?
To avoid dropping fact rows when a dimension row is temporarily missing. You can later track missing keys via data quality checks.
Common mistakes and self-checks
- Mistake: Undefined grain. Fix: Write a one-sentence grain statement for every table.
- Mistake: Joining facts to Type 2 dims without date filters. Fix: Join using event timestamp between effective dates.
- Mistake: Using natural keys in facts. Fix: Map to surrogate keys for stable joins.
- Mistake: Measures not at the fact grain. Fix: Recompute or pre-aggregate appropriately.
- Mistake: Duplicates in dimensions. Fix: Deduplicate using ROW_NUMBER() by natural key.
Quick self-check queries
- Fact row count vs source: sums should match after filters.
- Check for NULL foreign keys in facts and track rates.
- Ensure effective_from <= effective_to and exactly one current row per natural key for Type 2.
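The checks above can be written directly against the example tables (dim_product as the SCD2 dimension, fact_orders as the fact; adapt names to your schema):

```sql
-- NULL foreign keys in the fact: should trend toward zero
SELECT COUNT(*) AS missing_product_keys
FROM fact_orders
WHERE product_sk IS NULL;

-- SCD2 validity: no inverted date ranges
SELECT COUNT(*) AS bad_ranges
FROM dim_product
WHERE effective_from > effective_to;

-- SCD2 uniqueness: exactly one current row per natural key
-- (any rows returned here indicate a problem)
SELECT product_id_nk, COUNT(*) AS current_rows
FROM dim_product
WHERE is_current
GROUP BY product_id_nk
HAVING COUNT(*) <> 1;
```

Run these after every load and alert on nonzero counts rather than inspecting manually.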
Exercises
Try these before looking at any solutions.
- Exercise 1: Build a Type 1 dim_customer from stg_customers with deduplication and a surrogate key.
- Exercise 2: Create an orders_fact table at order-line grain and map it to the product and customer dimensions.
- Checklist to complete each exercise:
- Grain statement written in a comment
- All joins explicit and justified
- NULL handling and filters are documented
- Foreign keys map to surrogate keys
Practical projects
- E-commerce star schema: build dim_customer, dim_product, dim_date, and fact_orders with incremental loads.
- SaaS subscriptions: dim_account (Type 2 on plan), fact_subscriptions, fact_usage_daily.
- Finance mini-warehouse: dim_account, dim_date, fact_journal_entries with debits/credits validation.
Next steps
- Add data quality tests (row counts, null rates, uniqueness)
- Document tables: grain, columns, and business rules
- Introduce incremental patterns and partitioning for large facts
Mini challenge
Your PM wants historical product category reporting. Convert your product dimension from Type 1 to Type 2 and update the fact join to time-travel. Validate that last quarter's category revenue matches after the change.
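One way to validate the challenge: run the same revenue-by-category query before and after the migration and diff the results. A sketch, with placeholder dates standing in for last quarter:

```sql
-- Revenue by category for a fixed window; results should match
-- before and after the Type 1 -> Type 2 conversion
SELECT pd.category, SUM(f.line_amount) AS revenue
FROM fact_orders f
JOIN dim_product pd ON pd.product_sk = f.product_sk
WHERE f.order_date >= DATE '2024-01-01'
  AND f.order_date <  DATE '2024-04-01'
GROUP BY pd.category
ORDER BY pd.category;
```

Any difference means the fact is now resolving to different historical product rows, which is exactly what the time-travel join should prevent for past periods.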