Topic 1 of 8

Transform Layers: Raw, Staging, Mart

Learn the Raw, Staging, and Mart transform layers for free with explanations, exercises, and a quick test (for Data Engineers).

Published: January 8, 2026 | Updated: January 8, 2026

Why this matters

Data engineers ship reliable analytics by shaping data through three clear layers: Raw, Staging, and Mart. This layering makes pipelines auditable, re-runnable, and fast for business queries.

  • Real tasks you will do:
    • Design schemas for landing (raw), cleaning (staging), and analytics (mart).
    • Implement incremental, idempotent transforms that can reprocess safely.
    • Model dimensions and facts (including SCD Type 2) for consistent reporting.
    • Add data quality checks and audit columns at each hop.
    • Partition and cluster data for performance and cost control.

Concept explained simply

Think of your data like drinking water:

  • Raw (ingest): water straight from the source. Immutable, exactly as received.
  • Staging (refine): filtered, standardized, and conformed to business keys.
  • Mart (serve): bottled for specific consumers (sales, marketing, finance) with business logic applied.

Mental model: Every layer answers a question.

  • Raw: What exactly arrived and when?
  • Staging: What is the clean, deduplicated truth keyed by our business IDs?
  • Mart: What does the business want to ask quickly and repeatedly?

ETL vs ELT: Either compute before load (ETL) or load then compute in the warehouse (ELT). The layering stays the same; only where the compute happens changes.

Core design choices

  • Idempotency: Rerunning a job should not create duplicates. Use MERGE/UPSERT patterns, deduping windows, and stable keys.
  • Incremental loads: Track watermarks by event time or ingestion time; avoid full refresh unless necessary.
  • Partitioning: Partition by event date or load date; cluster/sort by keys used in joins and filters.
  • Auditability: Add columns like _ingested_at, _source_file, _load_id, _processed_at.
  • Contracts: Define required fields and types at staging. Raw can be schemaless; staging is not.
  • Modeling: Use stars (dimensions + facts) for marts. Consider SCD Type 2 for slowly changing attributes.
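
The watermark idea above can be sketched in SQL. The etl.watermarks table and job name here are hypothetical, and exact syntax varies by warehouse:

```sql
-- One row per job recording how far we have processed (hypothetical table)
CREATE TABLE IF NOT EXISTS etl.watermarks (
  job_name STRING,
  last_ts  TIMESTAMP
);

-- Incremental pull: only rows ingested since the last successful run
SELECT *
FROM raw.clickstream
WHERE _ingested_at > (SELECT last_ts FROM etl.watermarks WHERE job_name = 'stg_clickstream');

-- After the load commits, advance the watermark
UPDATE etl.watermarks
SET last_ts = (SELECT MAX(_ingested_at) FROM raw.clickstream)
WHERE job_name = 'stg_clickstream';
```

Tracking the watermark by ingestion time rather than event time avoids silently skipping late-arriving events.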

Worked examples

Example 1 — Raw to Staging: Deduplicate and standardize

Scenario: Clickstream events in JSON at raw.clickstream with fields: event_id, user_id, event_ts (string ISO), url, _ingested_at.

  1. Parse types and deduplicate by event_id, keeping the latest _ingested_at.
  2. Store in staging.clickstream, partitioned by event_date.

-- Idempotent staging load
CREATE TABLE IF NOT EXISTS staging.clickstream (
  event_id STRING,
  user_id STRING,
  event_ts TIMESTAMP,
  event_date DATE,
  url STRING,
  _ingested_at TIMESTAMP,
  _processed_at TIMESTAMP
) PARTITION BY (event_date);

MERGE INTO staging.clickstream t
USING (
  -- Keep only the latest copy of each event before merging, so the target
  -- never matches more than one source row (many engines reject that).
  SELECT * FROM (
    SELECT
      event_id,
      user_id,
      TIMESTAMP(event_ts) AS event_ts,
      DATE(TIMESTAMP(event_ts)) AS event_date,
      url,
      _ingested_at,
      CURRENT_TIMESTAMP() AS _processed_at,
      ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY _ingested_at DESC) AS rn
    FROM raw.clickstream
  ) deduped
  WHERE rn = 1
) s
ON t.event_id = s.event_id
WHEN MATCHED THEN UPDATE SET
  user_id = s.user_id,
  event_ts = s.event_ts,
  event_date = s.event_date,
  url = s.url,
  _ingested_at = s._ingested_at,
  _processed_at = s._processed_at
WHEN NOT MATCHED THEN INSERT (
  event_id, user_id, event_ts, event_date, url, _ingested_at, _processed_at
) VALUES (
  s.event_id, s.user_id, s.event_ts, s.event_date, s.url, s._ingested_at, s._processed_at
);

Example 2 — Staging to Dimensions (SCD2) and Fact

Scenario: staging.customers with customer_id, email, country, _ingested_at. Build dim_customer SCD2 and fact_orders.

-- Dimension with SCD2
CREATE TABLE IF NOT EXISTS mart.dim_customer (
  customer_sk BIGINT GENERATED ALWAYS AS IDENTITY,
  customer_id STRING,
  email STRING,
  country STRING,
  effective_from TIMESTAMP,
  effective_to TIMESTAMP,
  is_current BOOLEAN,
  _processed_at TIMESTAMP
);

-- Upsert changes (SCD2): one pass closes the old row AND inserts the new
-- version. The NULL merge_key rows deliberately fail the ON clause, so a
-- changed customer also gets a fresh current row; without this second branch
-- the MERGE would close history but never open it again.
MERGE INTO mart.dim_customer d
USING (
  SELECT customer_id AS merge_key, customer_id, email, country, _ingested_at AS change_ts
  FROM staging.customers
  UNION ALL
  SELECT NULL AS merge_key, s.customer_id, s.email, s.country, s._ingested_at
  FROM staging.customers s
  JOIN mart.dim_customer c
    ON c.customer_id = s.customer_id AND c.is_current = TRUE
  WHERE c.email <> s.email OR c.country <> s.country
) s
ON d.customer_id = s.merge_key AND d.is_current = TRUE
WHEN MATCHED AND (d.email <> s.email OR d.country <> s.country) THEN
  UPDATE SET effective_to = s.change_ts, is_current = FALSE
WHEN NOT MATCHED THEN
  INSERT (customer_id, email, country, effective_from, effective_to, is_current, _processed_at)
  VALUES (s.customer_id, s.email, s.country, s.change_ts, TIMESTAMP '9999-12-31 00:00:00', TRUE, CURRENT_TIMESTAMP());

-- Fact table referencing current dimension row
CREATE TABLE IF NOT EXISTS mart.fact_orders (
  order_id STRING,
  customer_sk BIGINT,
  order_ts TIMESTAMP,
  order_date DATE,
  amount NUMERIC(18,2),
  status STRING,
  _processed_at TIMESTAMP
);

-- Note: a plain INSERT is not idempotent; guard reruns (e.g., MERGE on order_id).
INSERT INTO mart.fact_orders
SELECT
  o.order_id,
  d.customer_sk,
  o.order_ts,
  DATE(o.order_ts) AS order_date,
  o.amount,
  o.status,
  CURRENT_TIMESTAMP()
FROM staging.orders o
JOIN mart.dim_customer d
  ON d.customer_id = o.customer_id AND d.is_current = TRUE;
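
The join above attributes every order to the customer's current attributes. If orders should instead reflect attributes as of the order time, join on the SCD2 effective window rather than is_current — a sketch under the same schema:

```sql
SELECT
  o.order_id,
  d.customer_sk    -- the version of the customer valid at order time
FROM staging.orders o
JOIN mart.dim_customer d
  ON d.customer_id = o.customer_id
 AND o.order_ts >= d.effective_from
 AND o.order_ts <  d.effective_to;
```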

Example 3 — Build an Aggregated Sales Mart

Daily sales summary by country for dashboard performance.

CREATE TABLE IF NOT EXISTS mart.sales_daily_country (
  order_date DATE,
  country STRING,
  orders INT64,
  gross_revenue NUMERIC(18,2),
  _processed_at TIMESTAMP
) PARTITION BY (order_date);

INSERT INTO mart.sales_daily_country
SELECT
  f.order_date,
  d.country,
  COUNT(*) AS orders,
  SUM(f.amount) AS gross_revenue,
  CURRENT_TIMESTAMP()
FROM mart.fact_orders f
JOIN mart.dim_customer d ON f.customer_sk = d.customer_sk AND d.is_current = TRUE
GROUP BY 1, 2;
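
The INSERT above duplicates rows if the job reruns. One common rerun-safe pattern (syntax varies; some warehouses offer INSERT OVERWRITE or partition replacement instead) is to delete the affected partitions before reinserting:

```sql
-- Rebuild only the dates present in the current fact load
DELETE FROM mart.sales_daily_country
WHERE order_date IN (SELECT DISTINCT order_date FROM mart.fact_orders);

INSERT INTO mart.sales_daily_country
SELECT
  f.order_date,
  d.country,
  COUNT(*) AS orders,
  SUM(f.amount) AS gross_revenue,
  CURRENT_TIMESTAMP()
FROM mart.fact_orders f
JOIN mart.dim_customer d
  ON f.customer_sk = d.customer_sk AND d.is_current = TRUE
GROUP BY 1, 2;
```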

Quality guardrails and self-checks

  • Raw: Row counts vs source; capture source file/path; never mutate.
  • Staging: Dedup by stable keys; enforce types and not-null on required columns.
  • Mart: Referential integrity to dimensions; total revenue within expected bounds.

Minimal checks to add
  • Count drift check between staging and mart within tolerance.
  • Null checks on business keys.
  • Freshness check: max(event_ts) within SLA.
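
These checks are easy to express as queries that should return zero rows (or a value inside tolerance); the tables match the examples above, and any SLA threshold is an assumption you set per pipeline:

```sql
-- Null check on the business key: expect 0
SELECT COUNT(*) AS null_keys
FROM staging.orders
WHERE order_id IS NULL;

-- Uniqueness after dedup: expect 0 rows
SELECT order_id
FROM staging.orders
GROUP BY order_id
HAVING COUNT(*) > 1;

-- Freshness: compare the latest event against your SLA
SELECT MAX(event_ts) AS latest_event
FROM staging.clickstream;
```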

Who this is for and prerequisites

Who this is for

  • Aspiring and junior data engineers building their first production pipelines.
  • Analysts moving into engineering who need reliable marts.
  • Software engineers owning analytics endpoints.

Prerequisites

  • Comfort with SQL (SELECT, JOIN, GROUP BY, window functions, MERGE).
  • Basic understanding of data warehouses/lakehouses and partitions.
  • Familiarity with dimensional modeling terms (fact, dimension, SCD).

Learning path

  1. Map your layers
    • Define raw, staging, and mart locations/schemas.
    • Choose partition keys and clustering/sorting strategy.
  2. Make staging deterministic
    • Implement deduplication, type casting, and column contracts.
    • Add audit columns and idempotent MERGE logic.
  3. Model marts
    • Create core dimensions (customers, products) and facts (orders, events).
    • Implement SCD2 for slowly changing attributes.
  4. Optimize and validate
    • Add indexes/clustering where supported; confirm partition pruning.
    • Install freshness and volume checks.

Exercises

Complete the tasks below. The quick test is available to everyone; only logged-in users will see saved progress.

  1. Exercise 1: Raw to Staging — Orders

    Transform raw.orders (JSON) into staging.orders with deduplication and typing.

    • Deduplicate by order_id, keep latest _ingested_at.
    • Cast order_ts to TIMESTAMP and derive order_date.
    • Include audit columns: _ingested_at, _processed_at.
    Mini checklist
    • MERGE is idempotent
    • Types are enforced
    • Partition by order_date
  2. Exercise 2: Staging to Mart — Customers and Sales

    Build mart.dim_customer (SCD2) and mart.fact_orders; then aggregate daily sales by country.

    • Use customer_id as business key; create customer_sk.
    • Mark current rows with is_current and bound history with effective_to.
    • Produce mart.sales_daily_country: date, country, orders, revenue.
    Mini checklist
    • SCD2 closes and opens rows correctly
    • Facts join to current dimension row
    • Aggregation matches sum of facts

Common mistakes and how to self-check

  • Mutating raw data: Raw must be append-only. Self-check: Verify no UPDATE/DELETE on raw.
  • Non-idempotent loads: INSERT-only creates duplicates on rerun. Self-check: Rerun job; counts should not inflate.
  • Missing business keys: Staging without stable keys breaks joins. Self-check: Ensure keys are NOT NULL and unique after dedup.
  • SCD2 errors: Overlapping histories. Self-check: For each key, no overlaps between effective windows.
  • Over-aggregating in staging: Keep business logic in marts. Self-check: Staging should resemble clean source, not reports.
  • Poor partition choice: Partition on high-cardinality columns hurts performance. Self-check: Confirm partition pruning in query plans.
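
The SCD2 overlap self-check above can be written directly against mart.dim_customer; any row in the result means two versions of the same customer overlap in time:

```sql
SELECT customer_id, effective_from, effective_to, next_from
FROM (
  SELECT
    customer_id,
    effective_from,
    effective_to,
    LEAD(effective_from) OVER (PARTITION BY customer_id ORDER BY effective_from) AS next_from
  FROM mart.dim_customer
) v
WHERE next_from < effective_to;  -- expect 0 rows
```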

Practical projects

  • Mini retail warehouse: Orders, customers, products from CSV/JSON into a star schema with daily sales mart.
  • Event analytics: Clickstream to sessionized fact with user dimension and a funnel mart.
  • Finance snapshot: Daily balances with SCD2 account attributes and a P&L mart by month.

Mini challenge

Your PM asks for a new KPI: repeat customers in the last 90 days by country. Sketch the tables and SQL you would add in the mart layer. Identify the partitioning and any SCD concerns.

Next steps

  • Finish the exercises, then take the quick test below to check understanding.
  • Remember: the test is available to everyone; only logged-in users will have progress saved.
  • Extend your pipeline with one additional dimension and one new mart metric.

Practice Exercises

2 exercises to complete

Instructions (Exercise 1)

From raw.orders (JSON) create staging.orders that is deduplicated, typed, and partitioned by order_date.

  • Deduplicate by order_id, keeping the row with the latest _ingested_at.
  • Cast order_ts to TIMESTAMP and derive order_date.
  • Include _ingested_at and _processed_at audit columns.
  • Make the load idempotent using MERGE.

Expected Output

Table staging.orders with columns: order_id STRING, customer_id STRING, order_ts TIMESTAMP, order_date DATE, amount NUMERIC(18,2), status STRING, _ingested_at TIMESTAMP, _processed_at TIMESTAMP. Running the job twice yields identical row counts and values.

Transform Layers Raw Staging Mart — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.

