Why this matters
Data engineers ship reliable analytics by shaping data through three clear layers: Raw, Staging, and Mart. This layering makes pipelines auditable, re-runnable, and fast for business queries.
Real tasks you will do:
- Design schemas for landing (raw), cleaning (staging), and analytics (mart).
- Implement incremental, idempotent transforms that can reprocess safely.
- Model dimensions and facts (including SCD Type 2) for consistent reporting.
- Add data quality checks and audit columns at each hop.
- Partition and cluster data for performance and cost control.
Concept explained simply
Think of your data like drinking water:
- Raw (ingest): water straight from the source. Immutable, exactly as received.
- Staging (refine): filtered, standardized, and conformed to business keys.
- Mart (serve): bottled for specific consumers (sales, marketing, finance) with business logic applied.
Mental model: Every layer answers a question.
- Raw: What exactly arrived and when?
- Staging: What is the clean, deduplicated truth keyed by our business IDs?
- Mart: What does the business want to ask quickly and repeatedly?
ETL vs ELT: Either compute before load (ETL) or load then compute in the warehouse (ELT). The layering stays the same; only where the compute happens changes.
Core design choices
- Idempotency: Rerunning a job should not create duplicates. Use MERGE/UPSERT patterns, deduping windows, and stable keys.
- Incremental loads: Track watermarks by event time or ingestion time; avoid full refresh unless necessary.
- Partitioning: Partition by event date or load date; cluster/sort by keys used in joins and filters.
- Auditability: Add columns like _ingested_at, _source_file, _load_id, _processed_at.
- Contracts: Define required fields and types at staging. Raw can be schemaless; staging is not.
- Modeling: Use stars (dimensions + facts) for marts. Consider SCD Type 2 for slowly changing attributes.
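The watermark-plus-upsert combination above can be sketched end to end. The snippet below uses Python's sqlite3 as a stand-in warehouse; the table names (raw_events, staging_events, load_state) are illustrative, not from the examples that follow:

```python
import sqlite3

# Sketch of a watermark-driven, idempotent incremental load.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw_events (event_id TEXT, ingested_at TEXT);
CREATE TABLE staging_events (event_id TEXT PRIMARY KEY, ingested_at TEXT);
CREATE TABLE load_state (job TEXT PRIMARY KEY, watermark TEXT);
INSERT INTO load_state VALUES ('events', '1970-01-01T00:00:00');
INSERT INTO raw_events VALUES ('e1', '2024-01-01T10:00:00');
INSERT INTO raw_events VALUES ('e2', '2024-01-01T11:00:00');
""")

def incremental_load(conn):
    # Read the last watermark, pull only newer rows, upsert, then advance it.
    (wm,) = conn.execute(
        "SELECT watermark FROM load_state WHERE job = 'events'").fetchone()
    conn.execute("""
        INSERT INTO staging_events
        SELECT event_id, ingested_at FROM raw_events WHERE ingested_at > ?
        ON CONFLICT(event_id) DO UPDATE SET ingested_at = excluded.ingested_at
    """, (wm,))
    conn.execute("""
        UPDATE load_state
        SET watermark = (SELECT COALESCE(MAX(ingested_at), watermark)
                         FROM staging_events)
        WHERE job = 'events'
    """)
    conn.commit()

incremental_load(conn)
incremental_load(conn)  # rerun: watermark + upsert keep this idempotent
count = conn.execute("SELECT COUNT(*) FROM staging_events").fetchone()[0]
print(count)  # 2 rows, not 4
```

The watermark stops already-processed rows from being reread, and the upsert keeps any overlap from duplicating rows; together they make reruns safe.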
Worked examples
Example 1 — Raw to Staging: Deduplicate and standardize
Scenario: Clickstream events in JSON at raw.clickstream with fields: event_id, user_id, event_ts (string ISO), url, _ingested_at.
- Parse types and deduplicate by event_id, keeping the latest _ingested_at.
- Store in staging.clickstream, partitioned by event_date.
-- Idempotent staging load
CREATE TABLE IF NOT EXISTS staging.clickstream (
event_id STRING,
user_id STRING,
event_ts TIMESTAMP,
event_date DATE,
url STRING,
_ingested_at TIMESTAMP,
_processed_at TIMESTAMP
) PARTITION BY (event_date);
MERGE INTO staging.clickstream t
USING (
-- Dedup before the MERGE join: if duplicates reach the ON clause, most
-- engines reject the MERGE because one target row matches several source rows.
SELECT * FROM (
SELECT
event_id,
user_id,
TIMESTAMP(event_ts) AS event_ts,
DATE(TIMESTAMP(event_ts)) AS event_date,
url,
_ingested_at,
CURRENT_TIMESTAMP() AS _processed_at,
ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY _ingested_at DESC) AS rn
FROM raw.clickstream
) WHERE rn = 1
) s
ON t.event_id = s.event_id
WHEN MATCHED THEN UPDATE SET
user_id = s.user_id,
event_ts = s.event_ts,
event_date = s.event_date,
url = s.url,
_ingested_at = s._ingested_at,
_processed_at = s._processed_at
WHEN NOT MATCHED THEN INSERT (
event_id, user_id, event_ts, event_date, url, _ingested_at, _processed_at
) VALUES (
s.event_id, s.user_id, s.event_ts, s.event_date, s.url, s._ingested_at, s._processed_at
);
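Before shipping logic like this, it helps to replay the keep-latest rule on a tiny sample. A sketch using Python's sqlite3, which supports the same ROW_NUMBER pattern (the data is made up):

```python
import sqlite3

# Verify the keep-latest-per-event_id dedup rule on a tiny sample.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_clickstream (event_id TEXT, url TEXT, _ingested_at TEXT)")
conn.executemany(
    "INSERT INTO raw_clickstream VALUES (?, ?, ?)",
    [
        ("e1", "/home", "2024-01-01T10:00:00"),
        ("e1", "/home", "2024-01-01T12:00:00"),  # late duplicate: should win
        ("e2", "/cart", "2024-01-01T11:00:00"),
    ],
)
rows = conn.execute("""
    SELECT event_id, _ingested_at FROM (
        SELECT event_id, _ingested_at,
               ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY _ingested_at DESC) AS rn
        FROM raw_clickstream
    ) WHERE rn = 1
    ORDER BY event_id
""").fetchall()
print(rows)  # [('e1', '2024-01-01T12:00:00'), ('e2', '2024-01-01T11:00:00')]
```

Exactly one row survives per event_id, and it is the most recently ingested one.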
Example 2 — Staging to Dimensions (SCD2) and Fact
Scenario: staging.customers with customer_id, email, country, _ingested_at. Build dim_customer SCD2 and fact_orders.
-- Dimension with SCD2
CREATE TABLE IF NOT EXISTS mart.dim_customer (
customer_sk BIGINT GENERATED ALWAYS AS IDENTITY,
customer_id STRING,
email STRING,
country STRING,
effective_from TIMESTAMP,
effective_to TIMESTAMP,
is_current BOOLEAN,
_processed_at TIMESTAMP
);
-- Apply changes (SCD2). A single MERGE cannot both close the old row and
-- insert its replacement, so use two statements.
-- Step 1: close current rows whose tracked attributes changed.
MERGE INTO mart.dim_customer d
USING (
SELECT customer_id, email, country, _ingested_at AS change_ts FROM staging.customers
) s
ON d.customer_id = s.customer_id AND d.is_current = TRUE
WHEN MATCHED AND (d.email <> s.email OR d.country <> s.country) THEN
UPDATE SET d.effective_to = s.change_ts, d.is_current = FALSE;
-- Step 2: open a row for new customers and for the new version of changed ones
-- (their current row was just closed, so they no longer match is_current = TRUE).
INSERT INTO mart.dim_customer (customer_id, email, country, effective_from, effective_to, is_current, _processed_at)
SELECT s.customer_id, s.email, s.country, s._ingested_at, TIMESTAMP '9999-12-31 00:00:00', TRUE, CURRENT_TIMESTAMP()
FROM staging.customers s
LEFT JOIN mart.dim_customer d
ON d.customer_id = s.customer_id AND d.is_current = TRUE
WHERE d.customer_id IS NULL;
-- Fact table referencing current dimension row
CREATE TABLE IF NOT EXISTS mart.fact_orders (
order_id STRING,
customer_sk BIGINT,
order_ts TIMESTAMP,
order_date DATE,
amount NUMERIC(18,2),
status STRING,
_processed_at TIMESTAMP
);
-- Note: a plain INSERT is not idempotent; on reruns, MERGE on order_id or replace the affected partitions.
INSERT INTO mart.fact_orders
SELECT
o.order_id,
d.customer_sk,
o.order_ts,
DATE(o.order_ts) AS order_date,
o.amount,
o.status,
CURRENT_TIMESTAMP()
FROM staging.orders o
JOIN mart.dim_customer d
ON d.customer_id = o.customer_id AND d.is_current = TRUE;
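The close-then-open SCD2 mechanics can be exercised in miniature. A Python/sqlite3 sketch with invented data, mirroring the dimension above in simplified form (only country is tracked):

```python
import sqlite3

# SCD2 in miniature: close the changed current row, then open the new version.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE dim_customer (
    customer_id TEXT, country TEXT,
    effective_from TEXT, effective_to TEXT, is_current INTEGER)""")

def apply_change(conn, customer_id, country, change_ts):
    cur = conn.execute(
        "SELECT country FROM dim_customer WHERE customer_id=? AND is_current=1",
        (customer_id,)).fetchone()
    if cur and cur[0] == country:
        return  # no attribute change: nothing to do
    if cur:  # close the old version at the change timestamp
        conn.execute(
            "UPDATE dim_customer SET effective_to=?, is_current=0 "
            "WHERE customer_id=? AND is_current=1", (change_ts, customer_id))
    conn.execute(  # open the new version with an open-ended effective_to
        "INSERT INTO dim_customer VALUES (?, ?, ?, '9999-12-31', 1)",
        (customer_id, country, change_ts))

apply_change(conn, "c1", "DE", "2024-01-01")
apply_change(conn, "c1", "FR", "2024-06-01")  # country change -> new version
history = conn.execute(
    "SELECT country, effective_from, effective_to, is_current "
    "FROM dim_customer WHERE customer_id='c1' ORDER BY effective_from").fetchall()
print(history)
# [('DE', '2024-01-01', '2024-06-01', 0), ('FR', '2024-06-01', '9999-12-31', 1)]
```

Note the invariant worth asserting in production: per key, exactly one is_current row, and each row's effective_to equals the next row's effective_from.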
Example 3 — Build an Aggregated Sales Mart
Daily sales summary by country for dashboard performance.
CREATE TABLE IF NOT EXISTS mart.sales_daily_country (
order_date DATE,
country STRING,
orders BIGINT,
gross_revenue NUMERIC(18,2),
_processed_at TIMESTAMP
) PARTITION BY (order_date);
-- Rebuild per order_date partition (delete-then-insert or INSERT OVERWRITE) so reruns stay idempotent.
INSERT INTO mart.sales_daily_country
SELECT
f.order_date,
d.country,
COUNT(*) AS orders,
SUM(f.amount) AS gross_revenue,
CURRENT_TIMESTAMP()
FROM mart.fact_orders f
JOIN mart.dim_customer d ON f.customer_sk = d.customer_sk AND d.is_current = TRUE
GROUP BY 1, 2;
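A cheap validation for an aggregate mart like this is reconciling it back to the fact table it was built from. A sketch with toy data in Python's sqlite3:

```python
import sqlite3

# Reconcile the daily aggregate against its source fact table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_orders (order_date TEXT, country TEXT, amount REAL);
INSERT INTO fact_orders VALUES
  ('2024-01-01', 'DE', 10.0),
  ('2024-01-01', 'DE', 5.0),
  ('2024-01-01', 'FR', 7.5);
CREATE TABLE sales_daily_country AS
  SELECT order_date, country, COUNT(*) AS orders, SUM(amount) AS gross_revenue
  FROM fact_orders GROUP BY order_date, country;
""")
fact_total = conn.execute("SELECT SUM(amount) FROM fact_orders").fetchone()[0]
mart_total = conn.execute(
    "SELECT SUM(gross_revenue) FROM sales_daily_country").fetchone()[0]
# Totals must match: an aggregate should never gain or lose revenue.
assert abs(fact_total - mart_total) < 1e-9, "mart does not reconcile with facts"
print(mart_total)  # 22.5
```

Run this after every build; a mismatch usually points at a join that drops or duplicates fact rows.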
Quality guardrails and self-checks
- Raw: Row counts vs source; capture source file/path; never mutate.
- Staging: Dedup by stable keys; enforce types and not-null on required columns.
- Mart: Referential integrity to dimensions; total revenue within expected bounds.
Minimal checks to add
- Count drift check between staging and mart within tolerance.
- Null checks on business keys.
- Freshness check: max(event_ts) within SLA.
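These minimal checks translate directly into a few assertions. A sketch (the table name, SLA threshold, and sample data are all illustrative):

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Minimal data-quality checks as plain assertions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_orders (order_id TEXT, event_ts TEXT)")
conn.executemany("INSERT INTO staging_orders VALUES (?, ?)",
                 [("o1", "2024-01-01T23:00:00"), ("o2", "2024-01-01T22:00:00")])
now = datetime(2024, 1, 2, tzinfo=timezone.utc)  # fixed "now" for the example

# Null check on the business key.
nulls = conn.execute(
    "SELECT COUNT(*) FROM staging_orders WHERE order_id IS NULL").fetchone()[0]
assert nulls == 0, f"{nulls} rows missing order_id"

# Freshness check: the newest event must be within the SLA window.
(max_ts,) = conn.execute("SELECT MAX(event_ts) FROM staging_orders").fetchone()
lag = now - datetime.fromisoformat(max_ts).replace(tzinfo=timezone.utc)
assert lag <= timedelta(hours=24), f"stale data: {lag} behind"
print("checks passed")
```

In production these assertions would run as a post-load step that fails the pipeline run, rather than letting stale or keyless data reach the mart.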
Who this is for and prerequisites
Who this is for
- Aspiring and junior data engineers building their first production pipelines.
- Analysts moving into engineering who need reliable marts.
- Software engineers owning analytics endpoints.
Prerequisites
- Comfort with SQL (SELECT, JOIN, GROUP BY, window functions, MERGE).
- Basic understanding of data warehouses/lakehouses and partitions.
- Familiarity with dimensional modeling terms (fact, dimension, SCD).
Learning path
- Map your layers
  - Define raw, staging, and mart locations/schemas.
  - Choose partition keys and clustering/sorting strategy.
- Make staging deterministic
  - Implement deduplication, type casting, and column contracts.
  - Add audit columns and idempotent MERGE logic.
- Model marts
  - Create core dimensions (customers, products) and facts (orders, events).
  - Implement SCD2 for slowly changing attributes.
- Optimize and validate
  - Add indexes/clustering where supported; confirm partition pruning.
  - Install freshness and volume checks.
Exercises
Complete the tasks below. The quick test is available to everyone; only logged-in users will see saved progress.
- Exercise 1: Raw to Staging — Orders
  Transform raw.orders (JSON) into staging.orders with deduplication and typing.
  - Deduplicate by order_id, keep latest _ingested_at.
  - Cast order_ts to TIMESTAMP and derive order_date.
  - Include audit columns: _ingested_at, _processed_at.
  Mini checklist
  - MERGE is idempotent
  - Types are enforced
  - Partition by order_date
- Exercise 2: Staging to Mart — Customers and Sales
  Build mart.dim_customer (SCD2) and mart.fact_orders; then aggregate daily sales by country.
  - Use customer_id as the business key; create customer_sk.
  - Mark current rows with is_current and bound history with effective_to.
  - Produce mart.sales_daily_country: date, country, orders, revenue.
  Mini checklist
  - SCD2 closes and opens rows correctly
  - Facts join to current dimension row
  - Aggregation matches sum of facts
Common mistakes and how to self-check
- Mutating raw data: Raw must be append-only. Self-check: Verify no UPDATE/DELETE on raw.
- Non-idempotent loads: INSERT-only creates duplicates on rerun. Self-check: Rerun job; counts should not inflate.
- Missing business keys: Staging without stable keys breaks joins. Self-check: Ensure keys are NOT NULL and unique after dedup.
- SCD2 errors: Overlapping histories. Self-check: For each key, no overlaps between effective windows.
- Over-aggregating in staging: Keep business logic in marts. Self-check: Staging should resemble clean source, not reports.
- Poor partition choice: Partition on high-cardinality columns hurts performance. Self-check: Confirm partition pruning in query plans.
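The SCD2 overlap self-check is easy to automate: sort each key's effective windows and flag any window that starts before the previous one ends. A small Python sketch with invented rows:

```python
from collections import defaultdict

def find_overlaps(rows):
    """rows: list of (key, effective_from, effective_to) with ISO date strings."""
    by_key = defaultdict(list)
    for key, start, end in rows:
        by_key[key].append((start, end))
    bad = []
    for key, windows in by_key.items():
        windows.sort()  # ISO dates sort correctly as strings
        for (s1, e1), (s2, e2) in zip(windows, windows[1:]):
            if s2 < e1:  # next window starts before the previous one ends
                bad.append((key, (s1, e1), (s2, e2)))
    return bad

rows = [
    ("c1", "2024-01-01", "2024-06-01"),
    ("c1", "2024-06-01", "9999-12-31"),   # clean hand-off
    ("c2", "2024-01-01", "2024-08-01"),
    ("c2", "2024-05-01", "9999-12-31"),   # overlaps the previous window
]
print(find_overlaps(rows))  # flags only the c2 pair
```

Running this over dim_customer after each load catches the classic SCD2 failure mode of closing a row late (or not at all) while opening its successor.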
Practical projects
- Mini retail warehouse: Orders, customers, products from CSV/JSON into a star schema with daily sales mart.
- Event analytics: Clickstream to sessionized fact with user dimension and a funnel mart.
- Finance snapshot: Daily balances with SCD2 account attributes and a P&L mart by month.
Mini challenge
Your PM asks for a new KPI: repeat customers in the last 90 days by country. Sketch the tables and SQL you would add in the mart layer. Identify the partitioning and any SCD concerns.
Next steps
- Finish the exercises, then take the quick test below to check understanding.
- Extend your pipeline with one additional dimension and one new mart metric.