Why Data Warehousing matters for Data Engineers
Data Warehousing turns raw data into reliable, query-ready datasets for analytics, BI, and machine learning. As a Data Engineer, you design schemas, build incremental pipelines, control costs, and ensure fast, concurrent access across teams. Mastering it unlocks trusted reporting, stable pipelines, and efficient spend.
What you will be able to do
- Design star/snowflake schemas and choose table types for performance.
- Implement partitioning and clustering for fast scans.
- Build incremental models with MERGE and materializations.
- Isolate workloads and manage concurrency for stable SLAs.
- Control costs with pruning, caching, and storage/compute choices.
- Optimize queries using plans, statistics, and pre-aggregations.
Who this is for
- Junior–mid Data Engineers wanting production-grade analytics models.
- Analytics Engineers building performant transformations.
- Software Engineers supporting BI/ML datasets.
Prerequisites
- Comfortable with SQL (SELECT, JOIN, GROUP BY, window functions).
- Basic data modeling concepts (entities, relationships, keys).
- Familiarity with a cloud warehouse or MPP database concepts.
Learning path
- Foundations: Warehouse concepts
- Differentiate OLTP vs OLAP; star vs snowflake; ELT vs ETL.
- Define fact table grain and surrogate keys.
Mini task
Write one-sentence grains for 3 facts: orders, pageviews, payments.
- Design for performance
- Choose column types, surrogate keys, and dimension SCD strategy.
- Select clustering/partitioning aligned to query filters.
Mini task
Pick a partition key for a 2-year sales fact queried by date and region. Explain why.
- Build incremental models
- Implement MERGE-based upserts and watermarks.
- Create materialized views for heavy aggregations.
Mini task
Sketch an incremental load plan with a last_modified watermark and late-arriving data handling.
- Cost management
- Use pruning, clustering, and pre-aggregations.
- Right-size compute; auto-suspend/auto-resume if available.
Mini task
List 3 queries you can rewrite to scan fewer bytes without losing accuracy.
- Concurrency and isolation
- Separate ETL vs BI compute; set query limits/timeouts where supported.
- Schedule heavy jobs off-peak.
Mini task
Propose three workload classes (ETL, BI-Prod, BI-Adhoc) with priorities.
- Optimize and validate
- Read plans; confirm partition pruning; remove unnecessary columns/joins.
- Backfill safely with checkpoints.
Mini task
Take an expensive query and list 3 changes to cut scan time in half.
Worked examples
1) Star schema for Orders Analytics
Goal: Keep dimensions lightly normalized, define a clear fact grain, and track customer history with SCD Type 2.
-- Dimensions (SCD2 example for customer)
CREATE TABLE dim_customer (
customer_sk BIGINT PRIMARY KEY,
customer_id VARCHAR,
name VARCHAR,
region VARCHAR,
valid_from TIMESTAMP,
valid_to TIMESTAMP,
is_current BOOLEAN
);
-- Product dimension (static)
CREATE TABLE dim_product (
product_sk BIGINT PRIMARY KEY,
product_id VARCHAR,
category VARCHAR,
price_usd DECIMAL(10,2)
);
-- Fact table grain: one row per order line item
CREATE TABLE fct_order_item (
order_item_id BIGINT,
order_id BIGINT,
order_ts TIMESTAMP,
customer_sk BIGINT,
product_sk BIGINT,
quantity INT,
unit_price_usd DECIMAL(10,2),
total_amount_usd DECIMAL(12,2) -- populated as quantity * unit_price_usd during load
);
Why this works: the fact grain is explicit (one row per order line item), dimensions are resolved through surrogate keys, and the valid_from/valid_to/is_current columns let dim_customer keep history without ever rewriting fact rows.
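For illustration, a typical read against this schema joins the fact to its dimensions through the surrogate keys. The query below is a minimal sketch using only the tables defined above; date arithmetic varies by engine.
-- Revenue by customer region (as recorded at order time) for the last 30 days
SELECT
  c.region,
  SUM(f.total_amount_usd) AS revenue_usd
FROM fct_order_item f
JOIN dim_customer c ON c.customer_sk = f.customer_sk -- SK points at the SCD2 version in effect when the order was placed
WHERE f.order_ts >= CURRENT_DATE - INTERVAL '30' DAY
GROUP BY c.region;
-- For "current region" analysis instead, re-join through customer_id to rows where is_current = TRUE.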
2) Partitioning and Clustering
Goal: Speed up time-based queries and common filters.
-- Generic pattern; adapt syntax to your warehouse
CREATE TABLE fct_order_item_partitioned (
order_item_id BIGINT,
order_ts TIMESTAMP,
customer_sk BIGINT,
region VARCHAR,
product_sk BIGINT,
quantity INT,
total_amount_usd DECIMAL(12,2)
)
PARTITION BY DATE(order_ts) -- enables pruning by date
CLUSTER BY region, customer_sk; -- improves locality for filters/grouping
Notes: syntax differs by engine. BigQuery supports PARTITION BY and CLUSTER BY as written; Snowflake relies on automatic micro-partitions plus CLUSTER BY; Redshift uses DISTKEY/SORTKEY. Partition on the column most queries filter by, usually the event date, and cluster on the next most common, selective filters.
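To see why the partition column should appear directly in the WHERE clause, compare the two sketches below against the table just defined; on most engines the first prunes partitions while the second forces a wider scan (exact function names and pruning behavior vary by engine).
-- Prunes: the raw partition column is compared to a constant range
SELECT SUM(total_amount_usd)
FROM fct_order_item_partitioned
WHERE order_ts >= DATE '2024-01-01'
  AND order_ts <  DATE '2024-02-01';

-- Often does NOT prune: wrapping the partition column in a function hides it from the planner on many engines
SELECT SUM(total_amount_usd)
FROM fct_order_item_partitioned
WHERE TO_CHAR(order_ts, 'YYYY-MM') = '2024-01';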
3) Incremental upsert with watermark
Goal: Load only changed data from staging using MERGE.
-- Assume staging_order_items carries late-arriving updates and a last_modified column
MERGE INTO fct_order_item t
USING (
SELECT *
FROM staging_order_items
WHERE last_modified >= (SELECT COALESCE(MAX(processed_at), TIMESTAMP '1970-01-01') FROM etl_checkpoints)
) s
ON t.order_item_id = s.order_item_id
WHEN MATCHED THEN UPDATE SET
t.quantity = s.quantity,
t.unit_price_usd = s.unit_price_usd,
t.total_amount_usd = s.quantity * s.unit_price_usd,
t.order_ts = s.order_ts
WHEN NOT MATCHED THEN INSERT (
order_item_id, order_ts, customer_sk, product_sk, quantity, unit_price_usd, total_amount_usd
) VALUES (
s.order_item_id, s.order_ts, s.customer_sk, s.product_sk, s.quantity, s.unit_price_usd, s.quantity * s.unit_price_usd
);
-- Update checkpoint
INSERT INTO etl_checkpoints(processed_at) VALUES (CURRENT_TIMESTAMP);
Tip: overlap the watermark (for example, re-read anything modified in the last 24 hours) and deduplicate by business key before the MERGE so reruns and late-arriving rows stay idempotent.
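A minimal sketch of the supporting pieces, assuming the same table and column names as above: the etl_checkpoints table the MERGE reads, and a deduplicated source query that applies the overlap-and-dedupe tip.
-- Checkpoint table referenced by the MERGE above; one row is appended per successful run
CREATE TABLE etl_checkpoints (
  processed_at TIMESTAMP
);

-- Deduplicated, watermark-filtered source for the MERGE: keep only the newest row per business key.
-- The 1-day overlap re-reads late arrivals; because MERGE matches on order_item_id, reruns stay idempotent.
SELECT d.*
FROM (
  SELECT
    s.*,
    ROW_NUMBER() OVER (PARTITION BY s.order_item_id ORDER BY s.last_modified DESC) AS rn
  FROM staging_order_items s
  WHERE s.last_modified >= (
    SELECT COALESCE(MAX(processed_at), TIMESTAMP '1970-01-01') - INTERVAL '1' DAY
    FROM etl_checkpoints
  )
) d
WHERE d.rn = 1;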
4) Materialized pre-aggregation
Goal: Speed up dashboards with a daily sales summary.
-- Create a materialized view (adapt syntax per engine)
CREATE MATERIALIZED VIEW mv_daily_sales AS
SELECT
DATE(order_ts) AS order_date,
region,
SUM(total_amount_usd) AS sales_usd,
COUNT(*) AS line_items
FROM fct_order_item_partitioned
GROUP BY 1,2;
-- Dashboard queries read from mv_daily_sales instead of the raw fact
SELECT order_date, region, sales_usd
FROM mv_daily_sales
WHERE order_date >= CURRENT_DATE - INTERVAL '30' DAY;
When to use: heavy aggregations that many dashboards repeat. For rarely-run or ad-hoc queries, the refresh and storage cost of a materialized view usually outweighs the benefit.
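Refresh behavior is engine-specific, so treat the statement below as a placeholder: some engines maintain materialized views automatically, while others expect an explicit refresh scheduled after each load.
-- Explicit refresh (PostgreSQL/Redshift-style syntax); schedule it after the daily fact load
REFRESH MATERIALIZED VIEW mv_daily_sales;
-- Engines that auto-maintain MVs need no refresh statement; a scheduled CREATE OR REPLACE of a summary table is a portable fallback.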
5) Query plan and pruning check
Goal: Confirm the engine uses partition pruning and clustering.
-- Inspect plan (keyword varies: EXPLAIN, EXPLAIN ANALYZE, PROFILE)
EXPLAIN
SELECT SUM(total_amount_usd)
FROM fct_order_item_partitioned
WHERE order_ts >= CURRENT_DATE - INTERVAL '7' DAY
AND region = 'EMEA';
What to look for: the plan should show that only the last seven days of partitions are read (partitions/bytes scanned well below the table total) and that the date and region filters are applied at scan time rather than after a full read.
Drills and exercises
- Define the grain for three new facts: shipments, returns, and invoice payments.
- Choose partition and cluster columns for a 3-year web events table with country and user_id filters.
- Write a MERGE for a dimension with SCD2, including valid_from, valid_to, is_current.
- Create a daily revenue materialized view and list its refresh strategy.
- Rewrite one query to scan fewer columns and use a partition-friendly date filter.
- Propose a workload isolation plan with at least two compute pools and quotas.
Common mistakes and debugging tips
- Vague fact grain — Leads to duplicates and wrong totals. Tip: Write a one-sentence grain; enforce unique keys in tests (a sample check follows this list).
- Wrong partition key — Too high cardinality or rarely filtered. Tip: Prefer event date; validate with query logs.
- Missing late-arriving handling — Data drifts. Tip: Overlap watermark, MERGE by business key, and dedupe.
- Over-normalized analytics schema — Excessive joins slow queries. Tip: Use star schemas; snowflake only when necessary.
- Materialized view overload — Too many MV refreshes. Tip: Reserve MVs for heavy, repeated aggregations.
- Assuming LIMIT reduces cost — Often scans full partitions anyway. Tip: Use selective WHERE filters and partition pruning.
- No workload isolation — BI users blocked by ETL. Tip: Separate compute, set timeouts, and schedule heavy jobs off-peak.
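As a minimal grain test for the first pitfall, the check below runs against the fact table from the worked examples and should return zero rows when the declared grain key is truly unique.
-- Any rows returned mean the grain is violated (duplicate rows per order_item_id)
SELECT order_item_id, COUNT(*) AS cnt
FROM fct_order_item
GROUP BY order_item_id
HAVING COUNT(*) > 1;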
Mini project: Sales Analytics Warehouse
Build a small analytics warehouse for orders, customers, and products, supporting a revenue dashboard.
- 3 dimensions: customer (SCD2), product, date
- 1 fact: order_item (daily partition)
- Incremental pipeline with MERGE and watermark
- Materialized daily sales view
- Model schemas: define grains and keys; sketch ER diagram.
- Create tables with partitioning and clustering.
- Load initial full data (staging -> dimensions/fact).
- Implement incremental MERGE from staging using last_modified.
- Create materialized view mv_daily_sales and test refresh.
- Validate with 5 queries: revenue by day, by region, top products, returning customers, average order value.
- Document cost controls and workload isolation plan (a compute-isolation sketch follows this list).
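For the isolation step, a hedged sketch using Snowflake-style virtual warehouses is shown below; the warehouse names, sizes, and timeout are illustrative, and other engines offer similar controls via resource groups, WLM queues, or slot reservations.
-- Separate compute for ETL and BI so heavy loads never block dashboards (Snowflake-style; adapt to your engine)
CREATE WAREHOUSE etl_wh     WITH WAREHOUSE_SIZE = 'MEDIUM' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;
CREATE WAREHOUSE bi_prod_wh WITH WAREHOUSE_SIZE = 'SMALL'  AUTO_SUSPEND = 60 AUTO_RESUME = TRUE
  STATEMENT_TIMEOUT_IN_SECONDS = 300; -- cap runaway dashboard queries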
Acceptance checklist
- Fact table enforces unique grain key.
- Partition pruning observed for date filters.
- Incremental loads are idempotent and handle late data.
- Dashboards read from pre-aggregations where applicable.
- Separate compute (or equivalent) defined for ETL vs BI.
Practical roadmap milestones
- M1: Schema clarity — All facts have explicit grain; SCD strategy chosen.
- M2: Performance-ready tables — Partitioned and clustered; stats collected where applicable.
- M3: Incremental reliability — Watermarks, MERGE, dedupe logic proven with backfill.
- M4: Cost guardrails — Pre-aggregations defined; bytes scanned reduced; compute right-sized.
- M5: Production readiness — Workload isolation, query limits, monitoring checks in place.
Subskills
- Warehouse Concepts And Architectures — Core OLAP ideas, star vs snowflake, ELT patterns, layers (staging, core, marts).
- Table Design For Performance — Data types, keys, SCD strategies, column order, wide vs narrow tables.
- Clustering And Partitioning — Choosing keys to maximize pruning and locality.
- Incremental Models And Materializations — Watermarks, MERGE/UPSERT, materialized views.
- Cost Management Basics — Pruning, pre-aggregation, right-sizing compute, scheduling.
- Concurrency And Workload Isolation — Separate compute, queues, timeouts, SLAs.
- Data Sharing And Access Patterns — Secure views, row/column-level access, cross-team sharing without duplication (a minimal sharing sketch follows this list).
- Query Optimization Basics — Reading plans, sargable filters, reducing scanned columns, pre-joins.
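To make the data-sharing subskill concrete, the sketch below exposes a filtered slice through a view plus a grant; the view name and grantee are invented for illustration, and some engines add a SECURE keyword or row access policies for stronger guarantees.
-- Expose only EMEA rows and a reduced column set to a partner team
CREATE VIEW v_sales_emea AS
SELECT order_date, region, sales_usd
FROM mv_daily_sales
WHERE region = 'EMEA';

GRANT SELECT ON v_sales_emea TO analyst_emea; -- grantee syntax (user vs role) varies by engine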
Practical projects
- Marketing attribution mart: session, touchpoint, and conversion facts with daily partitions and a 7-day lookback MV.
- Inventory snapshot fact: daily snapshot with SCD2 product hierarchy and backfill process.
- Streaming ClickHouse-style events to the warehouse with micro-batch upserts and clustering by user_id.
Next steps
- Finish the mini project and run performance checks (pruning, plan review).
- Compare costs before/after pre-aggregations and clustering.
- When ready, take the skill exam on this page. Everyone can try it; sign in to save your progress.