Who this is for
You work with data (or plan to) and want to turn messy raw tables into stable, analysis-ready marts. Ideal for aspiring Analytics Engineers, BI Developers, and SQL-savvy Data Analysts.
Prerequisites
- Comfort with SELECT, JOIN, GROUP BY, and window functions
- Basic understanding of star schemas (facts and dimensions)
- Familiarity with timestamps and data types
Why this matters
Analytics Engineers ship reliable, repeatable pipelines. Your job is to convert raw logs and app tables into:
- Clean staging tables (consistent names, types, and deduplicated rows)
- Dimensions (business entities like customers and products)
- Facts (transactions and events at a clearly defined grain)
These transformations power dashboards, experiments, finance reports, and machine learning features.
Concept explained simply
Think of raw data as ingredients straight from the market. Staging is washing and chopping. Dimensions are labeled containers (customers, products). Facts are the actual meals (orders, payments) that reference those containers. Marts are the clean, ready-to-serve buffet for analysts.
Mental model
- Raw: source-shaped, noisy, often duplicated
- Staging (stg_): clean and standardized, one source at a time
- Core: conformed dimensions and facts with clear keys and grain
- Mart: curated, business-ready aggregates (e.g., daily_revenue_by_channel)
Core workflow: raw → staging → dimensions/facts → marts
- Define the grain: Exactly one row represents what? (e.g., one order item)
- Standardize in staging: Rename, cast, trim, dedupe with clear rules.
- Build dimensions: Add surrogate keys, manage history (SCD1 or SCD2).
- Build facts: Join to dimensions via keys, compute metrics, enforce the grain.
- Aggregate into marts: Daily/weekly/monthly rollups for BI.
- Make runs incremental: Watermarks and idempotent merges.
- Add tests: Uniqueness, not null, accepted values, and referential integrity.
Worked examples
We will use four raw tables: raw.customers, raw.products, raw.orders, raw.order_items.
Raw schemas (for reference)
raw.customers(customer_id, email, first_name, last_name, phone, country, updated_at)
raw.products(product_id, sku, name, category, price, updated_at)
raw.orders(order_id, customer_id, order_ts, status, updated_at)
raw.order_items(order_item_id, order_id, product_id, quantity, unit_price, discount, updated_at)
Example 1: Staging customers (clean + dedupe)
Goal: One row per customer_id, most recent data, tidy columns and types.
-- stg_customers
-- Grain: one row per customer_id (latest record wins)
WITH ranked AS (
  SELECT
    customer_id,
    LOWER(TRIM(email)) AS email,
    INITCAP(TRIM(first_name)) AS first_name,
    INITCAP(TRIM(last_name)) AS last_name,
    CAST(phone AS VARCHAR) AS phone,
    UPPER(TRIM(country)) AS country_code,
    updated_at,
    -- Most recent row per customer ranks first
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS rn
  FROM raw.customers
)
SELECT
  customer_id,
  email,
  first_name,
  last_name,
  phone,
  country_code,
  updated_at
FROM ranked
WHERE rn = 1;
Checks:
- customer_id is unique
- emails are lowercase; country codes uppercase
- the row with the latest updated_at is kept (add a deterministic tiebreaker to ORDER BY if updated_at can repeat)
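A quick self-check for the first item (a sketch; run it against the stg_customers output above):
-- Returns zero rows when customer_id is unique
SELECT customer_id, COUNT(*) AS row_count
FROM stg_customers
GROUP BY customer_id
HAVING COUNT(*) > 1;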
Example 2: Product dimension with SCD2
We track changes in product name and category (history preserved). Use a surrogate key product_sk and validity windows.
-- dim_product (SCD2)
-- Assume stg_products holds deduplicated rows, one per (product_id, updated_at);
-- updated_at orders the versions of each product
WITH changes AS (
  SELECT
    product_id,
    sku,
    name,
    category,
    price,
    updated_at,
    -- Detect change vs previous version
    LAG(name) OVER (PARTITION BY product_id ORDER BY updated_at) AS prev_name,
    LAG(category) OVER (PARTITION BY product_id ORDER BY updated_at) AS prev_category
  FROM stg_products
), scd2 AS (
  SELECT
    product_id,
    sku,
    name,
    category,
    price,
    updated_at AS effective_from,
    -- The next change (if any) closes this version's validity window
    LEAD(updated_at) OVER (PARTITION BY product_id ORDER BY updated_at) AS effective_to
  FROM changes
  -- Keep the first version plus any row where name or category changed
  WHERE prev_name IS NULL
     OR prev_name IS DISTINCT FROM name
     OR prev_category IS DISTINCT FROM category
)
SELECT
  -- Surrogate key, stable per version. FARM_FINGERPRINT is BigQuery-specific;
  -- substitute your warehouse's hash function as needed.
  ABS(MOD(
    FARM_FINGERPRINT(CONCAT(
      CAST(product_id AS STRING), '|',
      COALESCE(name, ''), '|',
      COALESCE(category, ''), '|',
      CAST(effective_from AS STRING)
    )),
    9223372036854775807
  )) AS product_sk,
  product_id,
  sku,
  name,
  category,
  price,
  effective_from,
  COALESCE(effective_to, TIMESTAMP '9999-12-31 00:00:00') AS effective_to,
  (effective_to IS NULL) AS is_current
FROM scd2;
Notes:
- effective_from/effective_to create non-overlapping validity windows
- is_current marks the latest active version
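Two checks worth automating for SCD2 tables (a sketch against the dim_product output above):
-- Returns zero rows when no product has more than one current version
SELECT product_id, COUNT(*) AS current_versions
FROM dim_product
WHERE is_current
GROUP BY product_id
HAVING COUNT(*) > 1;

-- Returns zero rows when no two validity windows overlap
SELECT a.product_id
FROM dim_product a
JOIN dim_product b
  ON a.product_id = b.product_id
 AND a.product_sk <> b.product_sk
 AND a.effective_from < b.effective_to
 AND b.effective_from < a.effective_to;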
Example 3: Order item fact with dimensional joins
Grain: one row per order item. Join to customer and product dimensions via natural keys + validity.
-- fact_order_item
-- Grain: one row per order_item_id
WITH items AS (
  SELECT
    oi.order_item_id,
    oi.order_id,
    oi.product_id,
    oi.quantity,
    COALESCE(oi.unit_price, p.price) AS unit_price,
    COALESCE(oi.discount, 0) AS discount,  -- assumed to be a per-unit amount
    o.customer_id,
    o.order_ts
  FROM stg_order_items oi
  JOIN stg_orders o ON oi.order_id = o.order_id
  LEFT JOIN stg_products p ON oi.product_id = p.product_id
), customer_v AS (
  SELECT
    -- One current row per customer (SCD1-style key)
    ABS(MOD(FARM_FINGERPRINT(CAST(customer_id AS STRING)), 9223372036854775807)) AS customer_sk,
    customer_id
  FROM stg_customers
)
SELECT
  i.order_item_id,
  DATE(i.order_ts) AS order_date,
  c.customer_sk,
  pv.product_sk,
  i.quantity,
  i.unit_price,
  (i.quantity * i.unit_price) AS gross_amount,
  (i.quantity * i.discount) AS discount_amount,
  (i.quantity * (i.unit_price - i.discount)) AS net_amount
FROM items i
LEFT JOIN customer_v c ON i.customer_id = c.customer_id
-- Pick the product version that was valid at order time
LEFT JOIN dim_product pv
  ON i.product_id = pv.product_id
 AND i.order_ts >= pv.effective_from
 AND i.order_ts < pv.effective_to;
Checks:
- Grain is exactly one row per order_item_id
- Foreign keys (customer_sk, product_sk) are not null for valid records
- Measures compute correctly (gross, discount, net)
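Sketches for the first two checks (against the fact_order_item output above):
-- Grain: returns zero rows when order_item_id is unique
SELECT order_item_id, COUNT(*) AS row_count
FROM fact_order_item
GROUP BY order_item_id
HAVING COUNT(*) > 1;

-- Foreign keys: COUNT(column) skips NULLs, so these are the failed lookups
SELECT
  COUNT(*) - COUNT(customer_sk) AS missing_customer_sk,
  COUNT(*) - COUNT(product_sk) AS missing_product_sk
FROM fact_order_item;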
Incremental and idempotent runs
Most tables should process only new/changed data. Use a watermark (e.g., updated_at) and write idempotent merges.
-- Pseudo-ANSI example for an incremental upsert into fact_order_item
-- source_items stands for the Example 3 query, extended with an updated_at column
MERGE INTO mart.fact_order_item t
USING (
  SELECT *
  FROM source_items
  WHERE updated_at >= @last_success_watermark
) s
ON t.order_item_id = s.order_item_id
WHEN MATCHED THEN UPDATE SET
  order_date = s.order_date,
  customer_sk = s.customer_sk,
  product_sk = s.product_sk,
  quantity = s.quantity,
  unit_price = s.unit_price,
  gross_amount = s.gross_amount,
  discount_amount = s.discount_amount,
  net_amount = s.net_amount,
  updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (
  order_item_id, order_date, customer_sk, product_sk,
  quantity, unit_price, gross_amount, discount_amount, net_amount, updated_at
) VALUES (
  s.order_item_id, s.order_date, s.customer_sk, s.product_sk,
  s.quantity, s.unit_price, s.gross_amount, s.discount_amount, s.net_amount, s.updated_at
);
Idempotent means re-running the same load produces the same final table state: no duplicates, no drift.
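The watermark itself needs bookkeeping. One common pattern is a small control table, read before the MERGE and advanced only after it commits (a sketch; the etl_watermarks table and its columns are hypothetical):
-- Read the watermark before the MERGE
SELECT last_success_watermark
FROM etl_watermarks
WHERE table_name = 'fact_order_item';

-- Advance it only after the MERGE commits, so a failed run repeats its window
UPDATE etl_watermarks
SET last_success_watermark = (SELECT MAX(updated_at) FROM mart.fact_order_item)
WHERE table_name = 'fact_order_item';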
Data quality and self-checks
- Not null: keys and critical columns
- Unique: natural keys in staging, surrogate keys in dimensions, grain keys in facts
- Accepted values: order status within an allowed set
- Referential integrity: each fact foreign key resolves to a dimension key
- Row-count deltas: investigate sudden drops or spikes
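For example, the accepted-values and referential-integrity checks can look like this (a sketch; the allowed status set is an assumption):
-- Accepted values: returns zero rows when every status is in the allowed set
SELECT status, COUNT(*) AS row_count
FROM stg_orders
WHERE status NOT IN ('pending', 'completed', 'cancelled')
GROUP BY status;

-- Referential integrity: fact keys that resolve to no dimension row
SELECT f.product_sk
FROM fact_order_item f
LEFT JOIN dim_product d ON f.product_sk = d.product_sk
WHERE d.product_sk IS NULL;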
Common mistakes
- Unclear grain: leads to double-counting. Fix: write the grain in a comment at the top of your SQL (see the sketch after this list).
- Mixing business rules in staging: keep staging source-shaped and clean, not business-derived.
- Dropping history accidentally: choose SCD type deliberately (SCD1 overwrite vs SCD2 history).
- Non-idempotent updates: avoid UPDATEs without deterministic filters; prefer MERGE.
- Key drift: joining facts to dimensions on names instead of IDs. Use stable keys.
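A minimal header pattern for stating the grain (the exact wording is a convention, not a standard):
-- Model: fact_order_item
-- Grain: one row per order_item_id
-- Sources: stg_order_items, stg_orders
-- Keys: customer_sk -> dim_customer, product_sk -> dim_product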
Exercises (practice inside your SQL warehouse)
The tasks below mirror the worked examples above. Try each one yourself first, then compare your SQL against the corresponding example.
Exercise ex1: Build stg_customers
- One row per customer_id (latest updated_at)
- email lowercase, country uppercase, trimmed names
- Output columns: customer_id, email, first_name, last_name, phone, country_code, updated_at
Exercise ex2: Build dim_customer (SCD1)
- Surrogate key customer_sk (hash of customer_id)
- Overwrite attributes on change (SCD1)
- Ensure customer_sk is unique and stable
Exercise ex3: Build fact_order_item (incremental)
- Grain: one row per order_item_id
- Measures: quantity, unit_price, gross_amount, discount_amount, net_amount
- Join to dim_customer and dim_product, write a MERGE for incremental loads
Self-check checklist
- Did you state the grain at the top of each model?
- Are keys unique (tested) and not null?
- Can you re-run without duplicates or drift?
- Do foreign keys in facts resolve to dimensions?
Practical projects
- Sales mart: daily revenue by channel and product category (7-day rolling net revenue)
- Customer 360: dim_customer with first_order_date, last_order_date, lifetime_value
- Product performance: mart with conversion rate from views to purchases per category
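For the 7-day rolling metric in the sales mart, a window frame over a gap-free daily series does the work (a sketch; daily_revenue is a hypothetical one-row-per-date-per-channel rollup):
-- 7-day rolling net revenue per channel
-- The 6-row frame equals 7 calendar days only if no dates are missing
SELECT
  date,
  channel,
  net_revenue,
  SUM(net_revenue) OVER (
    PARTITION BY channel
    ORDER BY date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS net_revenue_7d
FROM daily_revenue;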
Mini challenge
Design a mart table daily_customer_net_spend with columns (date, customer_sk, net_spend). Rules:
- Source: fact_order_item
- Aggregate net_amount by date and customer_sk
- Include only completed orders (assume stg_orders.status = 'completed')
- Write it as an incremental upsert, using the orders' updated_at as the watermark
Peek at an outline
WITH base AS (
  SELECT
    foi.order_date AS d,
    foi.customer_sk,
    SUM(foi.net_amount) AS net_spend,
    GREATEST(MAX(o.updated_at), MAX(foi.updated_at)) AS updated_at
  FROM fact_order_item foi
  -- fact_order_item carries no order_id, so route through stg_order_items
  JOIN stg_order_items oi ON foi.order_item_id = oi.order_item_id
  JOIN stg_orders o ON oi.order_id = o.order_id AND o.status = 'completed'
  WHERE o.updated_at >= @last_success_watermark
  GROUP BY 1, 2
)
MERGE INTO mart.daily_customer_net_spend t
USING base s
ON t.date = s.d AND t.customer_sk = s.customer_sk
WHEN MATCHED THEN UPDATE SET
  t.net_spend = s.net_spend,
  t.updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (date, customer_sk, net_spend, updated_at)
VALUES (s.d, s.customer_sk, s.net_spend, s.updated_at);
Learning path
- Practice staging patterns: renaming, casting, trimming, deduping
- Model dimensions: surrogate keys, SCD1 vs SCD2
- Model facts: define grain, join to dims, compute measures
- Make it incremental: watermarks and MERGE
- Add tests and basic monitoring
Next steps
- Complete the exercises above, then take the Quick Test at the end of the page
- Refactor your SQL into reusable CTEs and consistent naming
- Add simple data quality checks to your warehouse
Note on progress: The Quick Test is available to everyone. Sign in to save your results and track progress over time.