
Building Transformations From Raw To Mart

Learn to build transformations from raw to mart for free, with explanations, exercises, and a quick test (for Analytics Engineers).

Published: December 23, 2025 | Updated: December 23, 2025

Who this is for

You work with data (or plan to) and want to turn messy raw tables into stable, analysis-ready marts. Ideal for aspiring Analytics Engineers, BI Developers, and SQL-savvy Data Analysts.

Prerequisites

  • Comfort with SELECT, JOIN, GROUP BY, and window functions
  • Basic understanding of star schemas (facts and dimensions)
  • Familiarity with timestamps and data types

Why this matters

Analytics Engineers ship reliable, repeatable pipelines. Your job is to convert raw logs and app tables into:

  • Clean staging tables (consistent names, types, and deduplicated rows)
  • Dimensions (business entities like customers and products)
  • Facts (transactions and events at a clearly defined grain)

These transformations power dashboards, experiments, finance reports, and machine learning features.

Concept explained simply

Think of raw data as ingredients straight from the market. Staging is washing and chopping. Dimensions are labeled containers (customers, products). Facts are the actual meals (orders, payments) that reference those containers. Marts are the clean, ready-to-serve buffet for analysts.

Mental model

  • Raw: source-shaped, noisy, often duplicated
  • Staging (stg_): clean and standardized, one source at a time
  • Core: conformed dimensions and facts with clear keys and grain
  • Mart: curated, business-ready aggregates (e.g., daily_revenue_by_channel)

Core workflow: raw → staging → dimensions/facts → marts

  1. Define the grain: decide exactly what one row represents (e.g., one order item).
  2. Standardize in staging: Rename, cast, trim, dedupe with clear rules.
  3. Build dimensions: Add surrogate keys, manage history (SCD1 or SCD2).
  4. Build facts: Join to dimensions via keys, compute metrics, enforce the grain.
  5. Aggregate into marts: Daily/weekly/monthly rollups for BI.
  6. Make runs incremental: Watermarks and idempotent merges.
  7. Add tests: Uniqueness, not null, accepted values, and referential integrity.
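The first two steps can be sketched end to end. The snippet below uses Python's built-in sqlite3 as a stand-in warehouse; table and column names are illustrative, not from any real system:

```python
import sqlite3

# In-memory database as a stand-in for a warehouse (illustrative only).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE raw_customers (customer_id INT, email TEXT, updated_at TEXT);
INSERT INTO raw_customers VALUES
  (1, ' Ada@Example.com ', '2025-01-01'),
  (1, 'ada@example.com',   '2025-02-01'),  -- newer duplicate: this row wins
  (2, 'bob@example.com',   '2025-01-15');
""")

# Step 1: the grain of stg_customers is one row per customer_id.
# Step 2: standardize (lower/trim email) and dedupe (latest updated_at wins).
rows = con.execute("""
WITH ranked AS (
  SELECT customer_id,
         LOWER(TRIM(email)) AS email,
         updated_at,
         ROW_NUMBER() OVER (PARTITION BY customer_id
                            ORDER BY updated_at DESC) AS rn
  FROM raw_customers
)
SELECT customer_id, email, updated_at FROM ranked WHERE rn = 1
ORDER BY customer_id
""").fetchall()
print(rows)  # [(1, 'ada@example.com', '2025-02-01'), (2, 'bob@example.com', '2025-01-15')]
```

The same ROW_NUMBER pattern appears in Example 1 below; only the warehouse dialect changes.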

Worked examples

We will use four raw tables: raw.customers, raw.products, raw.orders, raw.order_items.

Raw schemas (for reference)
raw.customers(customer_id, email, first_name, last_name, phone, country, updated_at)
raw.products(product_id, sku, name, category, price, updated_at)
raw.orders(order_id, customer_id, order_ts, status, updated_at)
raw.order_items(order_item_id, order_id, product_id, quantity, unit_price, discount, updated_at)

Example 1: Staging customers (clean + dedupe)

Goal: One row per customer_id, most recent data, tidy columns and types.

-- stg_customers
WITH ranked AS (
  SELECT
    customer_id,
    LOWER(TRIM(email)) AS email,
    INITCAP(TRIM(first_name)) AS first_name,
    INITCAP(TRIM(last_name))  AS last_name,
    CAST(phone AS VARCHAR)    AS phone,
    UPPER(TRIM(country))      AS country_code,
    updated_at,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS rn
  FROM raw.customers
)
SELECT
  customer_id,
  email,
  first_name,
  last_name,
  phone,
  country_code,
  updated_at
FROM ranked
WHERE rn = 1;

Checks:

  • customer_id is unique
  • emails are lowercase; country codes uppercase
  • latest updated_at wins ties
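Checks like these are easiest to keep honest as queries that must return zero rows. A minimal sketch using sqlite3, with illustrative table contents:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE stg_customers (customer_id INT, email TEXT, updated_at TEXT);
INSERT INTO stg_customers VALUES
  (1, 'ada@example.com', '2025-02-01'),
  (2, 'bob@example.com', '2025-01-15');
""")

# Uniqueness check: any customer_id appearing more than once is a failure.
dupes = con.execute("""
SELECT customer_id, COUNT(*) AS n
FROM stg_customers
GROUP BY customer_id
HAVING COUNT(*) > 1
""").fetchall()
assert dupes == [], f"duplicate customer_ids: {dupes}"
print("uniqueness check passed")
```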

Example 2: Product dimension with SCD2

We track changes in product name and category (history preserved). Use a surrogate key product_sk and validity windows.

-- dim_product (SCD2)
-- Assume stg_products carries one row per (product_id, updated_at) change event
WITH changes AS (
  SELECT
    product_id,
    sku,
    name,
    category,
    price,
    updated_at,
    -- Detect change vs previous snapshot
    LAG(name)     OVER (PARTITION BY product_id ORDER BY updated_at) AS prev_name,
    LAG(category) OVER (PARTITION BY product_id ORDER BY updated_at) AS prev_category
  FROM stg_products
), scd2 AS (
  SELECT
    product_id,
    sku,
    name,
    category,
    price,
    updated_at AS effective_from,
    LEAD(updated_at) OVER (PARTITION BY product_id ORDER BY updated_at) AS effective_to
  FROM changes
  WHERE (prev_name IS DISTINCT FROM name) OR (prev_category IS DISTINCT FROM category) OR prev_name IS NULL
)
SELECT
  -- Surrogate key, stable per version. FARM_FINGERPRINT is BigQuery-specific;
  -- substitute your warehouse's hash function elsewhere.
  ABS(MOD(FARM_FINGERPRINT(CONCAT(CAST(product_id AS STRING), '|', COALESCE(name, ''), '|', COALESCE(category, ''), '|', CAST(effective_from AS STRING))), 9223372036854775807)) AS product_sk,
  product_id,
  sku,
  name,
  category,
  price,
  effective_from,
  COALESCE(effective_to, TIMESTAMP '9999-12-31 00:00:00') AS effective_to,
  (effective_to IS NULL) AS is_current
FROM scd2;

Notes:

  • effective_from/effective_to create non-overlapping validity windows
  • is_current marks the latest active version
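To see why the validity windows matter, here is a point-in-time lookup against a hypothetical two-version dim_product, sketched in sqlite3:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Hypothetical dim_product with two versions of product 10.
CREATE TABLE dim_product (
  product_sk INT, product_id INT, category TEXT,
  effective_from TEXT, effective_to TEXT, is_current INT
);
INSERT INTO dim_product VALUES
  (101, 10, 'Toys',  '2025-01-01', '2025-03-01', 0),
  (102, 10, 'Games', '2025-03-01', '9999-12-31', 1);
""")

# Point-in-time lookup: an order placed on 2025-02-15 should see the
# 'Toys' version, even though 'Games' is the current one.
row = con.execute("""
SELECT product_sk, category
FROM dim_product
WHERE product_id = 10
  AND effective_from <= '2025-02-15'
  AND effective_to   >  '2025-02-15'
""").fetchone()
print(row)  # (101, 'Toys')
```

Because the windows are non-overlapping, exactly one version matches any timestamp.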

Example 3: Order item fact with dimensional joins

Grain: one row per order item. Join to customer and product dimensions via natural keys. For simplicity this example joins to the current dimension versions; a point-in-time join would match order_ts against the SCD2 validity windows instead.

-- fact_order_item
WITH items AS (
  SELECT oi.order_item_id, oi.order_id, oi.product_id, oi.quantity,
         COALESCE(oi.unit_price, p.price) AS unit_price,
         COALESCE(oi.discount, 0) AS discount,
         o.customer_id, o.order_ts
  FROM stg_order_items oi
  JOIN stg_orders o      ON oi.order_id = o.order_id
  LEFT JOIN stg_products p ON oi.product_id = p.product_id
), product_v AS (
  SELECT d.* FROM dim_product d WHERE d.is_current = TRUE
), customer_v AS (
  SELECT
    -- One current row per customer
    ABS(MOD(FARM_FINGERPRINT(CAST(customer_id AS STRING)), 9223372036854775807)) AS customer_sk,
    customer_id
  FROM stg_customers
)
SELECT
  i.order_item_id,
  DATE(i.order_ts) AS order_date,
  c.customer_sk,
  pv.product_sk,
  i.quantity,
  i.unit_price,
  (i.quantity * i.unit_price) AS gross_amount,
  (i.quantity * i.discount)   AS discount_amount,
  (i.quantity * (i.unit_price - i.discount)) AS net_amount
FROM items i
LEFT JOIN customer_v c ON i.customer_id = c.customer_id
LEFT JOIN product_v  pv ON i.product_id = pv.product_id;

Checks:

  • Grain is exactly one row per order_item_id
  • Foreign keys (customer_sk, product_sk) are not null for valid records
  • Measures compute correctly (gross, discount, net)
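The grain check can be automated by comparing the total row count with the distinct key count; if they differ, a dimension join fanned out the fact. A sketch in sqlite3 with illustrative data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fact_order_item (order_item_id INT, net_amount REAL);
INSERT INTO fact_order_item VALUES (1, 10.0), (2, 5.0), (3, 7.5);
""")

# Grain check: total rows must equal distinct order_item_id values,
# otherwise a join duplicated fact rows.
total, distinct = con.execute("""
SELECT COUNT(*), COUNT(DISTINCT order_item_id) FROM fact_order_item
""").fetchone()
assert total == distinct, "fact grain violated: duplicate order_item_id"
print(total, distinct)  # 3 3
```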

Incremental and idempotent runs

Most tables should process only new/changed data. Use a watermark (e.g., updated_at) and write idempotent merges.

-- Pseudo-ANSI example for incremental upsert into fact_order_item
MERGE INTO mart.fact_order_item t
USING (
  SELECT * FROM source_items WHERE updated_at >= @last_success_watermark
) s
ON t.order_item_id = s.order_item_id
WHEN MATCHED THEN UPDATE SET
  order_date = s.order_date,
  customer_sk = s.customer_sk,
  product_sk = s.product_sk,
  quantity = s.quantity,
  unit_price = s.unit_price,
  gross_amount = s.gross_amount,
  discount_amount = s.discount_amount,
  net_amount = s.net_amount,
  updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (
  order_item_id, order_date, customer_sk, product_sk,
  quantity, unit_price, gross_amount, discount_amount, net_amount, updated_at
) VALUES (
  s.order_item_id, s.order_date, s.customer_sk, s.product_sk,
  s.quantity, s.unit_price, s.gross_amount, s.discount_amount, s.net_amount, s.updated_at
);

Idempotent means: re-running produces the same final table state.
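The same idempotency can be demonstrated in miniature. SQLite has no MERGE, but INSERT ... ON CONFLICT keyed on the grain is the equivalent pattern; names below are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE fact_order_item (
  order_item_id INT PRIMARY KEY, net_amount REAL, updated_at TEXT
)
""")

def upsert(rows):
    # Idempotent upsert keyed on the grain (order_item_id):
    # matched rows are updated in place, new rows are inserted.
    con.executemany("""
    INSERT INTO fact_order_item (order_item_id, net_amount, updated_at)
    VALUES (?, ?, ?)
    ON CONFLICT (order_item_id) DO UPDATE SET
      net_amount = excluded.net_amount,
      updated_at = excluded.updated_at
    """, rows)

batch = [(1, 10.0, '2025-01-01'), (2, 5.0, '2025-01-01')]
upsert(batch)
upsert(batch)  # re-run: same final state, no duplicates
rows = con.execute(
    "SELECT order_item_id, net_amount FROM fact_order_item ORDER BY 1"
).fetchall()
print(rows)  # [(1, 10.0), (2, 5.0)]
```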

Data quality and self-checks

  • Not null: keys and critical columns
  • Unique: natural keys in staging, surrogate keys in dimensions, grain keys in facts
  • Accepted values: order status within an allowed set
  • Referential integrity: each fact foreign key resolves to a dimension key
  • Row-count deltas: investigate sudden drops or spikes
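Referential integrity is typically checked with an anti-join: fact keys with no matching dimension row are failures. A sketch in sqlite3 with one deliberately orphaned fact row:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_customer (customer_sk INT);
CREATE TABLE fact_order_item (order_item_id INT, customer_sk INT);
INSERT INTO dim_customer VALUES (1), (2);
INSERT INTO fact_order_item VALUES (10, 1), (11, 3);  -- customer_sk 3 is an orphan
""")

# Anti-join: fact rows whose customer_sk resolves to no dimension row.
orphans = con.execute("""
SELECT f.order_item_id, f.customer_sk
FROM fact_order_item f
LEFT JOIN dim_customer d ON f.customer_sk = d.customer_sk
WHERE d.customer_sk IS NULL
""").fetchall()
print(orphans)  # [(11, 3)]
```

In production, a non-empty result should fail the pipeline run or raise an alert.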

Common mistakes

  • Unclear grain: leads to double-counting. Fix: write the grain in a comment at the top of your SQL.
  • Mixing business rules in staging: keep staging source-shaped and clean, not business-derived.
  • Dropping history accidentally: choose SCD type deliberately (SCD1 overwrite vs SCD2 history).
  • Non-idempotent updates: avoid UPDATEs without deterministic filters; prefer MERGE.
  • Key drift: joining facts to dimensions on names instead of IDs. Use stable keys.

Exercises (practice inside your SQL warehouse)

These exercises mirror the worked examples above, and each one states its expected output. Try each task yourself first; then compare against the solution.

Exercise ex1: Build stg_customers

  • One row per customer_id (latest updated_at)
  • email lowercase, country uppercase, trimmed names
  • Output columns: customer_id, email, first_name, last_name, phone, country_code, updated_at

Exercise ex2: Build dim_customer (SCD1)

  • Surrogate key customer_sk (hash of customer_id)
  • Overwrite attributes on change (SCD1)
  • Ensure customer_sk is unique and stable

Exercise ex3: Build fact_order_item (incremental)

  • Grain: one row per order_item_id
  • Measures: quantity, unit_price, gross_amount, discount_amount, net_amount
  • Join to dim_customer and dim_product, write a MERGE for incremental loads

Self-check checklist

  • Did you state the grain at the top of each model?
  • Are keys unique (tested) and not null?
  • Can you re-run without duplicates or drift?
  • Do foreign keys in facts resolve to dimensions?

Practical projects

  • Sales mart: daily revenue by channel and product category (7-day rolling net revenue)
  • Customer 360: dim_customer with first_order_date, last_order_date, lifetime_value
  • Product performance: mart with conversion rate from views to purchases per category
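The 7-day rolling net revenue from the sales mart idea can be expressed with a window frame. A sketch in sqlite3 with illustrative data; note that ROWS counts rows, not calendar days, so a table with date gaps needs a date spine first:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE daily_revenue (d TEXT, net REAL);
INSERT INTO daily_revenue VALUES
  ('2025-01-01', 100), ('2025-01-02', 50), ('2025-01-03', 25);
""")

# Rolling 7-day sum: the current row plus up to six preceding rows.
rows = con.execute("""
SELECT d,
       SUM(net) OVER (ORDER BY d
                      ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_7d
FROM daily_revenue
""").fetchall()
print(rows)  # [('2025-01-01', 100.0), ('2025-01-02', 150.0), ('2025-01-03', 175.0)]
```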

Mini challenge

Design a mart table daily_customer_net_spend with columns (date, customer_sk, net_spend). Rules:

  • Source: fact_order_item
  • Aggregate net_amount by date and customer_sk
  • Include only completed orders (assume stg_orders.status = 'completed')
  • Write it as an incremental upsert using order updated_at as a watermark
Outline (one possible solution):

-- Assumes fact_order_item carries updated_at (as in the MERGE above) and that
-- order_id is recovered via stg_order_items to filter on order status.
WITH base AS (
  SELECT foi.order_date AS d, foi.customer_sk, SUM(foi.net_amount) AS net_spend,
         GREATEST(MAX(o.updated_at), MAX(foi.updated_at)) AS updated_at
  FROM mart.fact_order_item foi
  JOIN stg_order_items oi ON foi.order_item_id = oi.order_item_id
  JOIN stg_orders o       ON oi.order_id = o.order_id AND o.status = 'completed'
  WHERE o.updated_at >= @last_success_watermark
  GROUP BY 1, 2
)
MERGE INTO mart.daily_customer_net_spend t
USING base s
ON t.date = s.d AND t.customer_sk = s.customer_sk
WHEN MATCHED THEN UPDATE SET t.net_spend = s.net_spend, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (date, customer_sk, net_spend, updated_at)
VALUES (s.d, s.customer_sk, s.net_spend, s.updated_at);

Learning path

  1. Practice staging patterns: renaming, casting, trimming, deduping
  2. Model dimensions: surrogate keys, SCD1 vs SCD2
  3. Model facts: define grain, join to dims, compute measures
  4. Make it incremental: watermarks and MERGE
  5. Add tests and basic monitoring

Next steps

  • Complete the exercises below, then take the Quick Test at the end of the page
  • Refactor your SQL into reusable CTEs and consistent naming
  • Add simple data quality checks to your warehouse


Practice Exercises

3 exercises to complete

Instructions

Create a staging table stg_customers from raw.customers with these rules:

  • One row per customer_id (keep the row with the latest updated_at)
  • email lowercase; country uppercase
  • Trimmed first_name/last_name; phone cast to string
  • Columns: customer_id, email, first_name, last_name, phone, country_code, updated_at
Expected Output
A result set with unique customer_id and standardized columns/types; no duplicates.

Building Transformations From Raw To Mart — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

