Who this is for
You work with data (or plan to) and want to turn messy raw tables into stable, analysis-ready marts. Ideal for aspiring Analytics Engineers, BI Developers, and SQL-savvy Data Analysts.
Prerequisites
- Comfort with SELECT, JOIN, GROUP BY, and window functions
- Basic understanding of star schemas (facts and dimensions)
- Familiarity with timestamps and data types
Why this matters
Analytics Engineers ship reliable, repeatable pipelines. Your job is to convert raw logs and app tables into:
- Clean staging tables (consistent names, types, and deduplicated rows)
- Dimensions (business entities like customers and products)
- Facts (transactions and events at a clearly defined grain)
These transformations power dashboards, experiments, finance reports, and machine learning features.
Concept explained simply
Think of raw data as ingredients straight from the market. Staging is washing and chopping. Dimensions are labeled containers (customers, products). Facts are the actual meals (orders, payments) that reference those containers. Marts are the clean, ready-to-serve buffet for analysts.
Mental model
- Raw: source-shaped, noisy, often duplicated
- Staging (stg_): clean and standardized, one source at a time
- Core: conformed dimensions and facts with clear keys and grain
- Mart: curated, business-ready aggregates (e.g., daily_revenue_by_channel)
Core workflow: raw → staging → dimensions/facts → marts
- Define the grain: Exactly one row represents what? (e.g., one order item)
- Standardize in staging: Rename, cast, trim, dedupe with clear rules.
- Build dimensions: Add surrogate keys, manage history (SCD1 or SCD2).
- Build facts: Join to dimensions via keys, compute metrics, enforce the grain.
- Aggregate into marts: Daily/weekly/monthly rollups for BI.
- Make runs incremental: Watermarks and idempotent merges.
- Add tests: Uniqueness, not null, accepted values, and referential integrity.
Worked examples
We will use four raw tables: raw.customers, raw.products, raw.orders, raw.order_items.
Raw schemas (for reference)
raw.customers(customer_id, email, first_name, last_name, phone, country, updated_at)
raw.products(product_id, sku, name, category, price, updated_at)
raw.orders(order_id, customer_id, order_ts, status, updated_at)
raw.order_items(order_item_id, order_id, product_id, quantity, unit_price, discount, updated_at)
Example 1: Staging customers (clean + dedupe)
Goal: One row per customer_id, most recent data, tidy columns and types.
-- stg_customers
-- Grain: one row per customer_id (latest record wins)
WITH ranked AS (
  SELECT
    customer_id,
    LOWER(TRIM(email)) AS email,
    INITCAP(TRIM(first_name)) AS first_name,
    INITCAP(TRIM(last_name)) AS last_name,
    CAST(phone AS VARCHAR) AS phone,
    UPPER(TRIM(country)) AS country_code,
    updated_at,
    -- Most recent row per customer ranks first
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS rn
  FROM raw.customers
)
SELECT
  customer_id,
  email,
  first_name,
  last_name,
  phone,
  country_code,
  updated_at
FROM ranked
WHERE rn = 1;
Checks:
- customer_id is unique
- emails are lowercase; country codes uppercase
- the row with the latest updated_at is kept (add a deterministic tiebreaker to ORDER BY if updated_at can repeat)
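A quick self-check for the first item (a sketch; run it against the stg_customers output above):
-- Returns zero rows when customer_id is unique
SELECT customer_id, COUNT(*) AS row_count
FROM stg_customers
GROUP BY customer_id
HAVING COUNT(*) > 1;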
Example 2: Product dimension with SCD2
We track changes in product name and category (history preserved). Use a surrogate key product_sk and validity windows.
-- dim_product (SCD2)
-- Assume stg_products holds deduplicated rows, one per (product_id, updated_at);
-- updated_at orders the versions of each product
WITH changes AS (
  SELECT
    product_id,
    sku,
    name,
    category,
    price,
    updated_at,
    -- Detect change vs previous version
    LAG(name) OVER (PARTITION BY product_id ORDER BY updated_at) AS prev_name,
    LAG(category) OVER (PARTITION BY product_id ORDER BY updated_at) AS prev_category
  FROM stg_products
), scd2 AS (
  SELECT
    product_id,
    sku,
    name,
    category,
    price,
    updated_at AS effective_from,
    -- The next change (if any) closes this version's validity window
    LEAD(updated_at) OVER (PARTITION BY product_id ORDER BY updated_at) AS effective_to
  FROM changes
  -- Keep the first version plus any row where name or category changed
  WHERE prev_name IS NULL
     OR prev_name IS DISTINCT FROM name
     OR prev_category IS DISTINCT FROM category
)
SELECT
  -- Surrogate key, stable per version. FARM_FINGERPRINT is BigQuery-specific;
  -- substitute your warehouse's hash function as needed.
  ABS(MOD(
    FARM_FINGERPRINT(CONCAT(
      CAST(product_id AS STRING), '|',
      COALESCE(name, ''), '|',
      COALESCE(category, ''), '|',
      CAST(effective_from AS STRING)
    )),
    9223372036854775807
  )) AS product_sk,
  product_id,
  sku,
  name,
  category,
  price,
  effective_from,
  COALESCE(effective_to, TIMESTAMP '9999-12-31 00:00:00') AS effective_to,
  (effective_to IS NULL) AS is_current
FROM scd2;
Notes:
- effective_from/effective_to create non-overlapping validity windows
- is_current marks the latest active version
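Two checks worth automating for SCD2 tables (a sketch against the dim_product output above):
-- Returns zero rows when no product has more than one current version
SELECT product_id, COUNT(*) AS current_versions
FROM dim_product
WHERE is_current
GROUP BY product_id
HAVING COUNT(*) > 1;

-- Returns zero rows when no two validity windows overlap
SELECT a.product_id
FROM dim_product a
JOIN dim_product b
  ON a.product_id = b.product_id
 AND a.product_sk <> b.product_sk
 AND a.effective_from < b.effective_to
 AND b.effective_from < a.effective_to;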
Example 3: Order item fact with dimensional joins
Grain: one row per order item. Join to customer and product dimensions via natural keys + validity.
-- fact_order_item
-- Grain: one row per order_item_id
WITH items AS (
  SELECT
    oi.order_item_id,
    oi.order_id,
    oi.product_id,
    oi.quantity,
    COALESCE(oi.unit_price, p.price) AS unit_price,
    COALESCE(oi.discount, 0) AS discount,  -- assumed to be a per-unit amount
    o.customer_id,
    o.order_ts
  FROM stg_order_items oi
  JOIN stg_orders o ON oi.order_id = o.order_id
  LEFT JOIN stg_products p ON oi.product_id = p.product_id
), customer_v AS (
  SELECT
    -- One current row per customer (SCD1-style key)
    ABS(MOD(FARM_FINGERPRINT(CAST(customer_id AS STRING)), 9223372036854775807)) AS customer_sk,
    customer_id
  FROM stg_customers
)
SELECT
  i.order_item_id,
  DATE(i.order_ts) AS order_date,
  c.customer_sk,
  pv.product_sk,
  i.quantity,
  i.unit_price,
  (i.quantity * i.unit_price) AS gross_amount,
  (i.quantity * i.discount) AS discount_amount,
  (i.quantity * (i.unit_price - i.discount)) AS net_amount
FROM items i
LEFT JOIN customer_v c ON i.customer_id = c.customer_id
-- Pick the product version that was valid at order time
LEFT JOIN dim_product pv
  ON i.product_id = pv.product_id
 AND i.order_ts >= pv.effective_from
 AND i.order_ts < pv.effective_to;
Checks:
- Grain is exactly one row per order_item_id
- Foreign keys (customer_sk, product_sk) are not null for valid records
- Measures compute correctly (gross, discount, net)
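Sketches for the first two checks (against the fact_order_item output above):
-- Grain: returns zero rows when order_item_id is unique
SELECT order_item_id, COUNT(*) AS row_count
FROM fact_order_item
GROUP BY order_item_id
HAVING COUNT(*) > 1;

-- Foreign keys: COUNT(column) skips NULLs, so these are the failed lookups
SELECT
  COUNT(*) - COUNT(customer_sk) AS missing_customer_sk,
  COUNT(*) - COUNT(product_sk) AS missing_product_sk
FROM fact_order_item;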
Incremental and idempotent runs
Most tables should process only new/changed data. Use a watermark (e.g., updated_at) and write idempotent merges.
-- Pseudo-ANSI example for an incremental upsert into fact_order_item
-- source_items stands for the Example 3 query, extended with an updated_at column
MERGE INTO mart.fact_order_item t
USING (
  SELECT *
  FROM source_items
  WHERE updated_at >= @last_success_watermark
) s
ON t.order_item_id = s.order_item_id
WHEN MATCHED THEN UPDATE SET
  order_date = s.order_date,
  customer_sk = s.customer_sk,
  product_sk = s.product_sk,
  quantity = s.quantity,
  unit_price = s.unit_price,
  gross_amount = s.gross_amount,
  discount_amount = s.discount_amount,
  net_amount = s.net_amount,
  updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (
  order_item_id, order_date, customer_sk, product_sk,
  quantity, unit_price, gross_amount, discount_amount, net_amount, updated_at
) VALUES (
  s.order_item_id, s.order_date, s.customer_sk, s.product_sk,
  s.quantity, s.unit_price, s.gross_amount, s.discount_amount, s.net_amount, s.updated_at
);
Idempotent means re-running the same load produces the same final table state: no duplicates, no drift.
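The watermark itself needs bookkeeping. One common pattern is a small control table, read before the MERGE and advanced only after it commits (a sketch; the etl_watermarks table and its columns are hypothetical):
-- Read the watermark before the MERGE
SELECT last_success_watermark
FROM etl_watermarks
WHERE table_name = 'fact_order_item';

-- Advance it only after the MERGE commits, so a failed run repeats its window
UPDATE etl_watermarks
SET last_success_watermark = (SELECT MAX(updated_at) FROM mart.fact_order_item)
WHERE table_name = 'fact_order_item';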
Data quality and self-checks
- Not null: keys and critical columns
- Unique: natural keys in staging, surrogate keys in dimensions, grain keys in facts
- Accepted values: order status within an allowed set
- Referential integrity: each fact foreign key resolves to a dimension key
- Row-count deltas: investigate sudden drops or spikes
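For example, the accepted-values and referential-integrity checks can look like this (a sketch; the allowed status set is an assumption):
-- Accepted values: returns zero rows when every status is in the allowed set
SELECT status, COUNT(*) AS row_count
FROM stg_orders
WHERE status NOT IN ('pending', 'completed', 'cancelled')
GROUP BY status;

-- Referential integrity: fact keys that resolve to no dimension row
SELECT f.product_sk
FROM fact_order_item f
LEFT JOIN dim_product d ON f.product_sk = d.product_sk
WHERE d.product_sk IS NULL;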
Common mistakes
- Unclear grain: leads to double-counting. Fix: write the grain in a comment at the top of your SQL (see the sketch after this list).
- Mixing business rules in staging: keep staging source-shaped and clean, not business-derived.
- Dropping history accidentally: choose SCD type deliberately (SCD1 overwrite vs SCD2 history).
- Non-idempotent updates: avoid UPDATEs without deterministic filters; prefer MERGE.
- Key drift: joining facts to dimensions on names instead of IDs. Use stable keys.
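A minimal header pattern for stating the grain (the exact wording is a convention, not a standard):
-- Model: fact_order_item
-- Grain: one row per order_item_id
-- Sources: stg_order_items, stg_orders
-- Keys: customer_sk -> dim_customer, product_sk -> dim_product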
Exercises (practice inside your SQL warehouse)
The tasks below mirror the worked examples above. Try each one yourself first, then compare your SQL against the corresponding example.
Exercise ex1: Build stg_customers
- One row per customer_id (latest updated_at)
- email lowercase, country uppercase, trimmed names
- Output columns: customer_id, email, first_name, last_name, phone, country_code, updated_at
Exercise ex2: Build dim_customer (SCD1)
- Surrogate key customer_sk (hash of customer_id)
- Overwrite attributes on change (SCD1)
- Ensure customer_sk is unique and stable
Exercise ex3: Build fact_order_item (incremental)
- Grain: one row per order_item_id
- Measures: quantity, unit_price, gross_amount, discount_amount, net_amount
- Join to dim_customer and dim_product, write a MERGE for incremental loads
Self-check checklist
- Did you state the grain at the top of each model?
- Are keys unique (tested) and not null?
- Can you re-run without duplicates or drift?
- Do foreign keys in facts resolve to dimensions?
Practical projects
- Sales mart: daily revenue by channel and product category (7-day rolling net revenue)
- Customer 360: dim_customer with first_order_date, last_order_date, lifetime_value
- Product performance: mart with conversion rate from views to purchases per category
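For the 7-day rolling metric in the sales mart, a window frame over a gap-free daily series does the work (a sketch; daily_revenue is a hypothetical one-row-per-date-per-channel rollup):
-- 7-day rolling net revenue per channel
-- The 6-row frame equals 7 calendar days only if no dates are missing
SELECT
  date,
  channel,
  net_revenue,
  SUM(net_revenue) OVER (
    PARTITION BY channel
    ORDER BY date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS net_revenue_7d
FROM daily_revenue;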
Mini challenge
Design a mart table daily_customer_net_spend with columns (date, customer_sk, net_spend). Rules:
- Source: fact_order_item
- Aggregate net_amount by date and customer_sk
- Include only completed orders (assume stg_orders.status = 'completed')
- Write it as an incremental upsert, using the orders' updated_at as the watermark
Peek at an outline
WITH base AS (
  SELECT
    foi.order_date AS d,
    foi.customer_sk,
    SUM(foi.net_amount) AS net_spend,
    GREATEST(MAX(o.updated_at), MAX(foi.updated_at)) AS updated_at
  FROM fact_order_item foi
  -- fact_order_item carries no order_id, so route through stg_order_items
  JOIN stg_order_items oi ON foi.order_item_id = oi.order_item_id
  JOIN stg_orders o ON oi.order_id = o.order_id AND o.status = 'completed'
  WHERE o.updated_at >= @last_success_watermark
  GROUP BY 1, 2
)
MERGE INTO mart.daily_customer_net_spend t
USING base s
ON t.date = s.d AND t.customer_sk = s.customer_sk
WHEN MATCHED THEN UPDATE SET
  t.net_spend = s.net_spend,
  t.updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (date, customer_sk, net_spend, updated_at)
VALUES (s.d, s.customer_sk, s.net_spend, s.updated_at);
Learning path
- Practice staging patterns: renaming, casting, trimming, deduping
- Model dimensions: surrogate keys, SCD1 vs SCD2
- Model facts: define grain, join to dims, compute measures
- Make it incremental: watermarks and MERGE
- Add tests and basic monitoring
Next steps
- Complete the exercises above, then take the Quick Test at the end of the page
- Refactor your SQL into reusable CTEs and consistent naming
- Add simple data quality checks to your warehouse
Note on progress: The Quick Test is available to everyone. Sign in to save your results and track progress over time.