Why this matters
Data engineers ship reliable analytics by shaping data through three clear layers: Raw, Staging, and Mart. This layering makes pipelines auditable, re-runnable, and fast for business queries.
Real tasks you will do:
- Design schemas for landing (raw), cleaning (staging), and analytics (mart).
- Implement incremental, idempotent transforms that can reprocess safely.
- Model dimensions and facts (including SCD Type 2) for consistent reporting.
- Add data quality checks and audit columns at each hop.
- Partition and cluster data for performance and cost control.
Concept explained simply
Think of your data like drinking water:
- Raw (ingest): water straight from the source. Immutable, exactly as received.
- Staging (refine): filtered, standardized, and conformed to business keys.
- Mart (serve): bottled for specific consumers (sales, marketing, finance) with business logic applied.
Mental model: Every layer answers a question.
- Raw: What exactly arrived and when?
- Staging: What is the clean, deduplicated truth keyed by our business IDs?
- Mart: What does the business want to ask quickly and repeatedly?
ETL vs ELT: Either compute before load (ETL) or load then compute in the warehouse (ELT). The layering stays the same; only where the compute happens changes.
Core design choices
- Idempotency: Rerunning a job should not create duplicates. Use MERGE/UPSERT patterns, deduping windows, and stable keys.
- Incremental loads: Track watermarks by event time or ingestion time; avoid full refresh unless necessary.
- Partitioning: Partition by event date or load date; cluster/sort by keys used in joins and filters.
- Auditability: Add columns like _ingested_at, _source_file, _load_id, _processed_at.
- Contracts: Define required fields and types at staging. Raw can be schemaless; staging is not.
- Modeling: Use stars (dimensions + facts) for marts. Consider SCD Type 2 for slowly changing attributes.
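The watermark-plus-upsert combination above can be sketched end to end. The snippet below uses Python's sqlite3 as a stand-in warehouse; the table names (raw_events, staging_events, load_state) are illustrative, not from the examples that follow:

```python
import sqlite3

# Sketch of a watermark-driven, idempotent incremental load.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw_events (event_id TEXT, ingested_at TEXT);
CREATE TABLE staging_events (event_id TEXT PRIMARY KEY, ingested_at TEXT);
CREATE TABLE load_state (job TEXT PRIMARY KEY, watermark TEXT);
INSERT INTO load_state VALUES ('events', '1970-01-01T00:00:00');
INSERT INTO raw_events VALUES ('e1', '2024-01-01T10:00:00');
INSERT INTO raw_events VALUES ('e2', '2024-01-01T11:00:00');
""")

def incremental_load(conn):
    # Read the last watermark, pull only newer rows, upsert, then advance it.
    (wm,) = conn.execute(
        "SELECT watermark FROM load_state WHERE job = 'events'").fetchone()
    conn.execute("""
        INSERT INTO staging_events
        SELECT event_id, ingested_at FROM raw_events WHERE ingested_at > ?
        ON CONFLICT(event_id) DO UPDATE SET ingested_at = excluded.ingested_at
    """, (wm,))
    conn.execute("""
        UPDATE load_state
        SET watermark = (SELECT COALESCE(MAX(ingested_at), watermark)
                         FROM staging_events)
        WHERE job = 'events'
    """)
    conn.commit()

incremental_load(conn)
incremental_load(conn)  # rerun: watermark + upsert keep this idempotent
count = conn.execute("SELECT COUNT(*) FROM staging_events").fetchone()[0]
print(count)  # 2 rows, not 4
```

The watermark stops already-processed rows from being reread, and the upsert keeps any overlap from duplicating rows; together they make reruns safe.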
Worked examples
Example 1 — Raw to Staging: Deduplicate and standardize
Scenario: Clickstream events in JSON at raw.clickstream with fields: event_id, user_id, event_ts (string ISO), url, _ingested_at.
- Parse types and deduplicate by event_id, keeping the latest _ingested_at.
- Store in staging.clickstream, partitioned by event_date.
-- Idempotent staging load
CREATE TABLE IF NOT EXISTS staging.clickstream (
event_id STRING,
user_id STRING,
event_ts TIMESTAMP,
event_date DATE,
url STRING,
_ingested_at TIMESTAMP,
_processed_at TIMESTAMP
) PARTITION BY (event_date);
MERGE INTO staging.clickstream t
USING (
-- Dedup before the MERGE join: if duplicates reach the ON clause, most
-- engines reject the MERGE because one target row matches several source rows.
SELECT * FROM (
SELECT
event_id,
user_id,
TIMESTAMP(event_ts) AS event_ts,
DATE(TIMESTAMP(event_ts)) AS event_date,
url,
_ingested_at,
CURRENT_TIMESTAMP() AS _processed_at,
ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY _ingested_at DESC) AS rn
FROM raw.clickstream
) WHERE rn = 1
) s
ON t.event_id = s.event_id
WHEN MATCHED THEN UPDATE SET
user_id = s.user_id,
event_ts = s.event_ts,
event_date = s.event_date,
url = s.url,
_ingested_at = s._ingested_at,
_processed_at = s._processed_at
WHEN NOT MATCHED THEN INSERT (
event_id, user_id, event_ts, event_date, url, _ingested_at, _processed_at
) VALUES (
s.event_id, s.user_id, s.event_ts, s.event_date, s.url, s._ingested_at, s._processed_at
);
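Before shipping logic like this, it helps to replay the keep-latest rule on a tiny sample. A sketch using Python's sqlite3, which supports the same ROW_NUMBER pattern (the data is made up):

```python
import sqlite3

# Verify the keep-latest-per-event_id dedup rule on a tiny sample.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_clickstream (event_id TEXT, url TEXT, _ingested_at TEXT)")
conn.executemany(
    "INSERT INTO raw_clickstream VALUES (?, ?, ?)",
    [
        ("e1", "/home", "2024-01-01T10:00:00"),
        ("e1", "/home", "2024-01-01T12:00:00"),  # late duplicate: should win
        ("e2", "/cart", "2024-01-01T11:00:00"),
    ],
)
rows = conn.execute("""
    SELECT event_id, _ingested_at FROM (
        SELECT event_id, _ingested_at,
               ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY _ingested_at DESC) AS rn
        FROM raw_clickstream
    ) WHERE rn = 1
    ORDER BY event_id
""").fetchall()
print(rows)  # [('e1', '2024-01-01T12:00:00'), ('e2', '2024-01-01T11:00:00')]
```

Exactly one row survives per event_id, and it is the most recently ingested one.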
Example 2 — Staging to Dimensions (SCD2) and Fact
Scenario: staging.customers with customer_id, email, country, _ingested_at. Build dim_customer SCD2 and fact_orders.
-- Dimension with SCD2
CREATE TABLE IF NOT EXISTS mart.dim_customer (
customer_sk BIGINT GENERATED ALWAYS AS IDENTITY,
customer_id STRING,
email STRING,
country STRING,
effective_from TIMESTAMP,
effective_to TIMESTAMP,
is_current BOOLEAN,
_processed_at TIMESTAMP
);
-- Apply changes (SCD2). A single MERGE cannot both close the old row and
-- insert its replacement, so use two statements.
-- Step 1: close current rows whose tracked attributes changed.
MERGE INTO mart.dim_customer d
USING (
SELECT customer_id, email, country, _ingested_at AS change_ts FROM staging.customers
) s
ON d.customer_id = s.customer_id AND d.is_current = TRUE
WHEN MATCHED AND (d.email <> s.email OR d.country <> s.country) THEN
UPDATE SET d.effective_to = s.change_ts, d.is_current = FALSE;
-- Step 2: open a row for new customers and for the new version of changed ones
-- (their current row was just closed, so they no longer match is_current = TRUE).
INSERT INTO mart.dim_customer (customer_id, email, country, effective_from, effective_to, is_current, _processed_at)
SELECT s.customer_id, s.email, s.country, s._ingested_at, TIMESTAMP '9999-12-31 00:00:00', TRUE, CURRENT_TIMESTAMP()
FROM staging.customers s
LEFT JOIN mart.dim_customer d
ON d.customer_id = s.customer_id AND d.is_current = TRUE
WHERE d.customer_id IS NULL;
-- Fact table referencing current dimension row
CREATE TABLE IF NOT EXISTS mart.fact_orders (
order_id STRING,
customer_sk BIGINT,
order_ts TIMESTAMP,
order_date DATE,
amount NUMERIC(18,2),
status STRING,
_processed_at TIMESTAMP
);
-- Note: a plain INSERT is not idempotent; on reruns, MERGE on order_id or replace the affected partitions.
INSERT INTO mart.fact_orders
SELECT
o.order_id,
d.customer_sk,
o.order_ts,
DATE(o.order_ts) AS order_date,
o.amount,
o.status,
CURRENT_TIMESTAMP()
FROM staging.orders o
JOIN mart.dim_customer d
ON d.customer_id = o.customer_id AND d.is_current = TRUE;
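The close-then-open SCD2 mechanics can be exercised in miniature. A Python/sqlite3 sketch with invented data, mirroring the dimension above in simplified form (only country is tracked):

```python
import sqlite3

# SCD2 in miniature: close the changed current row, then open the new version.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE dim_customer (
    customer_id TEXT, country TEXT,
    effective_from TEXT, effective_to TEXT, is_current INTEGER)""")

def apply_change(conn, customer_id, country, change_ts):
    cur = conn.execute(
        "SELECT country FROM dim_customer WHERE customer_id=? AND is_current=1",
        (customer_id,)).fetchone()
    if cur and cur[0] == country:
        return  # no attribute change: nothing to do
    if cur:  # close the old version at the change timestamp
        conn.execute(
            "UPDATE dim_customer SET effective_to=?, is_current=0 "
            "WHERE customer_id=? AND is_current=1", (change_ts, customer_id))
    conn.execute(  # open the new version with an open-ended effective_to
        "INSERT INTO dim_customer VALUES (?, ?, ?, '9999-12-31', 1)",
        (customer_id, country, change_ts))

apply_change(conn, "c1", "DE", "2024-01-01")
apply_change(conn, "c1", "FR", "2024-06-01")  # country change -> new version
history = conn.execute(
    "SELECT country, effective_from, effective_to, is_current "
    "FROM dim_customer WHERE customer_id='c1' ORDER BY effective_from").fetchall()
print(history)
# [('DE', '2024-01-01', '2024-06-01', 0), ('FR', '2024-06-01', '9999-12-31', 1)]
```

Note the invariant worth asserting in production: per key, exactly one is_current row, and each row's effective_to equals the next row's effective_from.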
Example 3 — Build an Aggregated Sales Mart
Daily sales summary by country for dashboard performance.
CREATE TABLE IF NOT EXISTS mart.sales_daily_country (
order_date DATE,
country STRING,
orders BIGINT,
gross_revenue NUMERIC(18,2),
_processed_at TIMESTAMP
) PARTITION BY (order_date);
-- Rebuild per order_date partition (delete-then-insert or INSERT OVERWRITE) so reruns stay idempotent.
INSERT INTO mart.sales_daily_country
SELECT
f.order_date,
d.country,
COUNT(*) AS orders,
SUM(f.amount) AS gross_revenue,
CURRENT_TIMESTAMP()
FROM mart.fact_orders f
JOIN mart.dim_customer d ON f.customer_sk = d.customer_sk AND d.is_current = TRUE
GROUP BY 1, 2;
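A cheap validation for an aggregate mart like this is reconciling it back to the fact table it was built from. A sketch with toy data in Python's sqlite3:

```python
import sqlite3

# Reconcile the daily aggregate against its source fact table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_orders (order_date TEXT, country TEXT, amount REAL);
INSERT INTO fact_orders VALUES
  ('2024-01-01', 'DE', 10.0),
  ('2024-01-01', 'DE', 5.0),
  ('2024-01-01', 'FR', 7.5);
CREATE TABLE sales_daily_country AS
  SELECT order_date, country, COUNT(*) AS orders, SUM(amount) AS gross_revenue
  FROM fact_orders GROUP BY order_date, country;
""")
fact_total = conn.execute("SELECT SUM(amount) FROM fact_orders").fetchone()[0]
mart_total = conn.execute(
    "SELECT SUM(gross_revenue) FROM sales_daily_country").fetchone()[0]
# Totals must match: an aggregate should never gain or lose revenue.
assert abs(fact_total - mart_total) < 1e-9, "mart does not reconcile with facts"
print(mart_total)  # 22.5
```

Run this after every build; a mismatch usually points at a join that drops or duplicates fact rows.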
Quality guardrails and self-checks
- Raw: Row counts vs source; capture source file/path; never mutate.
- Staging: Dedup by stable keys; enforce types and not-null on required columns.
- Mart: Referential integrity to dimensions; total revenue within expected bounds.
Minimal checks to add
- Count drift check between staging and mart within tolerance.
- Null checks on business keys.
- Freshness check: max(event_ts) within SLA.
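These minimal checks translate directly into a few assertions. A sketch (the table name, SLA threshold, and sample data are all illustrative):

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Minimal data-quality checks as plain assertions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_orders (order_id TEXT, event_ts TEXT)")
conn.executemany("INSERT INTO staging_orders VALUES (?, ?)",
                 [("o1", "2024-01-01T23:00:00"), ("o2", "2024-01-01T22:00:00")])
now = datetime(2024, 1, 2, tzinfo=timezone.utc)  # fixed "now" for the example

# Null check on the business key.
nulls = conn.execute(
    "SELECT COUNT(*) FROM staging_orders WHERE order_id IS NULL").fetchone()[0]
assert nulls == 0, f"{nulls} rows missing order_id"

# Freshness check: the newest event must be within the SLA window.
(max_ts,) = conn.execute("SELECT MAX(event_ts) FROM staging_orders").fetchone()
lag = now - datetime.fromisoformat(max_ts).replace(tzinfo=timezone.utc)
assert lag <= timedelta(hours=24), f"stale data: {lag} behind"
print("checks passed")
```

In production these assertions would run as a post-load step that fails the pipeline run, rather than letting stale or keyless data reach the mart.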
Who this is for and prerequisites
Who this is for
- Aspiring and junior data engineers building their first production pipelines.
- Analysts moving into engineering who need reliable marts.
- Software engineers owning analytics endpoints.
Prerequisites
- Comfort with SQL (SELECT, JOIN, GROUP BY, window functions, MERGE).
- Basic understanding of data warehouses/lakehouses and partitions.
- Familiarity with dimensional modeling terms (fact, dimension, SCD).
Learning path
- Map your layers
  - Define raw, staging, and mart locations/schemas.
  - Choose partition keys and clustering/sorting strategy.
- Make staging deterministic
  - Implement deduplication, type casting, and column contracts.
  - Add audit columns and idempotent MERGE logic.
- Model marts
  - Create core dimensions (customers, products) and facts (orders, events).
  - Implement SCD2 for slowly changing attributes.
- Optimize and validate
  - Add indexes/clustering where supported; confirm partition pruning.
  - Install freshness and volume checks.
Exercises
Complete the tasks below. The quick test is available to everyone; only logged-in users will see saved progress.
- Exercise 1: Raw to Staging — Orders
  Transform raw.orders (JSON) into staging.orders with deduplication and typing.
  - Deduplicate by order_id, keep latest _ingested_at.
  - Cast order_ts to TIMESTAMP and derive order_date.
  - Include audit columns: _ingested_at, _processed_at.
  Mini checklist
  - MERGE is idempotent
  - Types are enforced
  - Partition by order_date
- Exercise 2: Staging to Mart — Customers and Sales
  Build mart.dim_customer (SCD2) and mart.fact_orders; then aggregate daily sales by country.
  - Use customer_id as the business key; create customer_sk.
  - Mark current rows with is_current and bound history with effective_to.
  - Produce mart.sales_daily_country: date, country, orders, revenue.
  Mini checklist
  - SCD2 closes and opens rows correctly
  - Facts join to current dimension row
  - Aggregation matches sum of facts
Common mistakes and how to self-check
- Mutating raw data: Raw must be append-only. Self-check: Verify no UPDATE/DELETE on raw.
- Non-idempotent loads: INSERT-only creates duplicates on rerun. Self-check: Rerun job; counts should not inflate.
- Missing business keys: Staging without stable keys breaks joins. Self-check: Ensure keys are NOT NULL and unique after dedup.
- SCD2 errors: Overlapping histories. Self-check: For each key, no overlaps between effective windows.
- Over-aggregating in staging: Keep business logic in marts. Self-check: Staging should resemble clean source, not reports.
- Poor partition choice: Partition on high-cardinality columns hurts performance. Self-check: Confirm partition pruning in query plans.
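The SCD2 overlap self-check is easy to automate: sort each key's effective windows and flag any window that starts before the previous one ends. A small Python sketch with invented rows:

```python
from collections import defaultdict

def find_overlaps(rows):
    """rows: list of (key, effective_from, effective_to) with ISO date strings."""
    by_key = defaultdict(list)
    for key, start, end in rows:
        by_key[key].append((start, end))
    bad = []
    for key, windows in by_key.items():
        windows.sort()  # ISO dates sort correctly as strings
        for (s1, e1), (s2, e2) in zip(windows, windows[1:]):
            if s2 < e1:  # next window starts before the previous one ends
                bad.append((key, (s1, e1), (s2, e2)))
    return bad

rows = [
    ("c1", "2024-01-01", "2024-06-01"),
    ("c1", "2024-06-01", "9999-12-31"),   # clean hand-off
    ("c2", "2024-01-01", "2024-08-01"),
    ("c2", "2024-05-01", "9999-12-31"),   # overlaps the previous window
]
print(find_overlaps(rows))  # flags only the c2 pair
```

Running this over dim_customer after each load catches the classic SCD2 failure mode of closing a row late (or not at all) while opening its successor.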
Practical projects
- Mini retail warehouse: Orders, customers, products from CSV/JSON into a star schema with daily sales mart.
- Event analytics: Clickstream to sessionized fact with user dimension and a funnel mart.
- Finance snapshot: Daily balances with SCD2 account attributes and a P&L mart by month.
Mini challenge
Your PM asks for a new KPI: repeat customers in the last 90 days by country. Sketch the tables and SQL you would add in the mart layer. Identify the partitioning and any SCD concerns.
Next steps
- Finish the exercises, then take the quick test below to check understanding.
- Extend your pipeline with one additional dimension and one new mart metric.