Why Data Warehousing matters for Data Engineers
Data Warehousing turns raw data into reliable, query-ready datasets for analytics, BI, and machine learning. As a Data Engineer, you design schemas, build incremental pipelines, control costs, and ensure fast, concurrent access across teams. Mastering it unlocks trusted reporting, stable pipelines, and efficient spend.
What you will be able to do
- Design star/snowflake schemas and choose table types for performance.
- Implement partitioning and clustering for fast scans.
- Build incremental models with MERGE and materializations.
- Isolate workloads and manage concurrency for stable SLAs.
- Control costs with pruning, caching, and storage/compute choices.
- Optimize queries using plans, statistics, and pre-aggregations.
Who this is for
- Junior–mid Data Engineers wanting production-grade analytics models.
- Analytics Engineers building performant transformations.
- Software Engineers supporting BI/ML datasets.
Prerequisites
- Comfortable with SQL (SELECT, JOIN, GROUP BY, window functions).
- Basic data modeling concepts (entities, relationships, keys).
- Familiarity with a cloud warehouse or MPP database concepts.
Learning path
- Foundations: Warehouse concepts
- Differentiate OLTP vs OLAP; star vs snowflake; ELT vs ETL.
- Define fact table grain and surrogate keys.
Mini task
Write one-sentence grains for 3 facts: orders, pageviews, payments.
- Design for performance
- Choose column types, surrogate keys, and dimension SCD strategy.
- Select clustering/partitioning aligned to query filters.
Mini task
Pick a partition key for a 2-year sales fact queried by date and region. Explain why.
- Build incremental models
- Implement MERGE-based upserts and watermarks.
- Create materialized views for heavy aggregations.
Mini task
Sketch an incremental load plan with a last_modified watermark and late-arriving data handling.
- Cost management
- Use pruning, clustering, and pre-aggregations.
- Right-size compute; auto-suspend/auto-resume if available.
Mini task
List 3 queries you can rewrite to scan fewer bytes without losing accuracy.
- Concurrency and isolation
- Separate ETL vs BI compute; set query limits/timeouts where supported.
- Schedule heavy jobs off-peak.
Mini task
Propose three workload classes (ETL, BI-Prod, BI-Adhoc) with priorities.
- Optimize and validate
- Read plans; confirm partition pruning; remove unnecessary columns/joins.
- Backfill safely with checkpoints.
Mini task
Take an expensive query and list 3 changes to cut scan time in half.
Worked examples
1) Star schema for Orders Analytics
Goal: Keep dimensions lightly normalized, define a clear fact grain, and track customer history with SCD Type 2.
-- Dimensions (SCD2 example for customer)
CREATE TABLE dim_customer (
customer_sk BIGINT PRIMARY KEY,
customer_id VARCHAR,
name VARCHAR,
region VARCHAR,
valid_from TIMESTAMP,
valid_to TIMESTAMP,
is_current BOOLEAN
);
-- Product dimension (static)
CREATE TABLE dim_product (
product_sk BIGINT PRIMARY KEY,
product_id VARCHAR,
category VARCHAR,
price_usd DECIMAL(10,2)
);
-- Fact table grain: one row per order line item
CREATE TABLE fct_order_item (
order_item_id BIGINT,
order_id BIGINT,
order_ts TIMESTAMP,
customer_sk BIGINT,
product_sk BIGINT,
quantity INT,
unit_price_usd DECIMAL(10,2),
total_amount_usd DECIMAL(12,2) -- populated as quantity * unit_price_usd during load
);
Why this works: the fact grain is explicit (one row per order line item), dimensions are resolved through surrogate keys, and the valid_from/valid_to/is_current columns let dim_customer keep history without ever rewriting fact rows.
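For illustration, a typical read against this schema joins the fact to its dimensions through the surrogate keys. The query below is a minimal sketch using only the tables defined above; date arithmetic varies by engine.
-- Revenue by customer region (as recorded at order time) for the last 30 days
SELECT
  c.region,
  SUM(f.total_amount_usd) AS revenue_usd
FROM fct_order_item f
JOIN dim_customer c ON c.customer_sk = f.customer_sk -- SK points at the SCD2 version in effect when the order was placed
WHERE f.order_ts >= CURRENT_DATE - INTERVAL '30' DAY
GROUP BY c.region;
-- For "current region" analysis instead, re-join through customer_id to rows where is_current = TRUE.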
2) Partitioning and Clustering
Goal: Speed up time-based queries and common filters.
-- Generic pattern; adapt syntax to your warehouse
CREATE TABLE fct_order_item_partitioned (
order_item_id BIGINT,
order_ts TIMESTAMP,
customer_sk BIGINT,
region VARCHAR,
product_sk BIGINT,
quantity INT,
total_amount_usd DECIMAL(12,2)
)
PARTITION BY DATE(order_ts) -- enables pruning by date
CLUSTER BY region, customer_sk; -- improves locality for filters/grouping
Notes: syntax differs by engine. BigQuery supports PARTITION BY and CLUSTER BY as written; Snowflake relies on automatic micro-partitions plus CLUSTER BY; Redshift uses DISTKEY/SORTKEY. Partition on the column most queries filter by, usually the event date, and cluster on the next most common, selective filters.
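To see why the partition column should appear directly in the WHERE clause, compare the two sketches below against the table just defined; on most engines the first prunes partitions while the second forces a wider scan (exact function names and pruning behavior vary by engine).
-- Prunes: the raw partition column is compared to a constant range
SELECT SUM(total_amount_usd)
FROM fct_order_item_partitioned
WHERE order_ts >= DATE '2024-01-01'
  AND order_ts <  DATE '2024-02-01';

-- Often does NOT prune: wrapping the partition column in a function hides it from the planner on many engines
SELECT SUM(total_amount_usd)
FROM fct_order_item_partitioned
WHERE TO_CHAR(order_ts, 'YYYY-MM') = '2024-01';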
3) Incremental upsert with watermark
Goal: Load only changed data from staging using MERGE.
-- Assume staging_order_items carries late-arriving updates and a last_modified column
MERGE INTO fct_order_item t
USING (
SELECT *
FROM staging_order_items
WHERE last_modified >= (SELECT COALESCE(MAX(processed_at), TIMESTAMP '1970-01-01') FROM etl_checkpoints)
) s
ON t.order_item_id = s.order_item_id
WHEN MATCHED THEN UPDATE SET
t.quantity = s.quantity,
t.unit_price_usd = s.unit_price_usd,
t.total_amount_usd = s.quantity * s.unit_price_usd,
t.order_ts = s.order_ts
WHEN NOT MATCHED THEN INSERT (
order_item_id, order_ts, customer_sk, product_sk, quantity, unit_price_usd, total_amount_usd
) VALUES (
s.order_item_id, s.order_ts, s.customer_sk, s.product_sk, s.quantity, s.unit_price_usd, s.quantity * s.unit_price_usd
);
-- Update checkpoint
INSERT INTO etl_checkpoints(processed_at) VALUES (CURRENT_TIMESTAMP);
Tip: overlap the watermark (for example, re-read anything modified in the last 24 hours) and deduplicate by business key before the MERGE so reruns and late-arriving rows stay idempotent.
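A minimal sketch of the supporting pieces, assuming the same table and column names as above: the etl_checkpoints table the MERGE reads, and a deduplicated source query that applies the overlap-and-dedupe tip.
-- Checkpoint table referenced by the MERGE above; one row is appended per successful run
CREATE TABLE etl_checkpoints (
  processed_at TIMESTAMP
);

-- Deduplicated, watermark-filtered source for the MERGE: keep only the newest row per business key.
-- The 1-day overlap re-reads late arrivals; because MERGE matches on order_item_id, reruns stay idempotent.
SELECT d.*
FROM (
  SELECT
    s.*,
    ROW_NUMBER() OVER (PARTITION BY s.order_item_id ORDER BY s.last_modified DESC) AS rn
  FROM staging_order_items s
  WHERE s.last_modified >= (
    SELECT COALESCE(MAX(processed_at), TIMESTAMP '1970-01-01') - INTERVAL '1' DAY
    FROM etl_checkpoints
  )
) d
WHERE d.rn = 1;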
4) Materialized pre-aggregation
Goal: Speed up dashboards with a daily sales summary.
-- Create a materialized view (adapt syntax per engine)
CREATE MATERIALIZED VIEW mv_daily_sales AS
SELECT
DATE(order_ts) AS order_date,
region,
SUM(total_amount_usd) AS sales_usd,
COUNT(*) AS line_items
FROM fct_order_item_partitioned
GROUP BY 1,2;
-- Dashboard queries read from mv_daily_sales instead of the raw fact
SELECT order_date, region, sales_usd
FROM mv_daily_sales
WHERE order_date >= CURRENT_DATE - INTERVAL '30' DAY;
When to use: heavy aggregations that many dashboards repeat. For rarely-run or ad-hoc queries, the refresh and storage cost of a materialized view usually outweighs the benefit.
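Refresh behavior is engine-specific, so treat the statement below as a placeholder: some engines maintain materialized views automatically, while others expect an explicit refresh scheduled after each load.
-- Explicit refresh (PostgreSQL/Redshift-style syntax); schedule it after the daily fact load
REFRESH MATERIALIZED VIEW mv_daily_sales;
-- Engines that auto-maintain MVs need no refresh statement; a scheduled CREATE OR REPLACE of a summary table is a portable fallback.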
5) Query plan and pruning check
Goal: Confirm the engine uses partition pruning and clustering.
-- Inspect plan (keyword varies: EXPLAIN, EXPLAIN ANALYZE, PROFILE)
EXPLAIN
SELECT SUM(total_amount_usd)
FROM fct_order_item_partitioned
WHERE order_ts >= CURRENT_DATE - INTERVAL '7' DAY
AND region = 'EMEA';
What to look for: the plan should show that only the last seven days of partitions are read (partitions/bytes scanned well below the table total) and that the date and region filters are applied at scan time rather than after a full read.
Drills and exercises
- Define the grain for three new facts: shipments, returns, and invoice payments.
- Choose partition and cluster columns for a 3-year web events table with country and user_id filters.
- Write a MERGE for a dimension with SCD2, including valid_from, valid_to, is_current.
- Create a daily revenue materialized view and list its refresh strategy.
- Rewrite one query to scan fewer columns and use a partition-friendly date filter.
- Propose a workload isolation plan with at least two compute pools and quotas.
Common mistakes and debugging tips
- Vague fact grain — Leads to duplicates and wrong totals. Tip: Write a one-sentence grain; enforce unique keys in tests (a sample check follows this list).
- Wrong partition key — Too high cardinality or rarely filtered. Tip: Prefer event date; validate with query logs.
- Missing late-arriving handling — Data drifts. Tip: Overlap watermark, MERGE by business key, and dedupe.
- Over-normalized analytics schema — Excessive joins slow queries. Tip: Use star schemas; snowflake only when necessary.
- Materialized view overload — Too many MV refreshes. Tip: Reserve MVs for heavy, repeated aggregations.
- Assuming LIMIT reduces cost — Often scans full partitions anyway. Tip: Use selective WHERE filters and partition pruning.
- No workload isolation — BI users blocked by ETL. Tip: Separate compute, set timeouts, and schedule heavy jobs off-peak.
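As a minimal grain test for the first pitfall, the check below runs against the fact table from the worked examples and should return zero rows when the declared grain key is truly unique.
-- Any rows returned mean the grain is violated (duplicate rows per order_item_id)
SELECT order_item_id, COUNT(*) AS cnt
FROM fct_order_item
GROUP BY order_item_id
HAVING COUNT(*) > 1;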
Mini project: Sales Analytics Warehouse
Build a small analytics warehouse for orders, customers, and products, supporting a revenue dashboard.
- 3 dimensions: customer (SCD2), product, date
- 1 fact: order_item (daily partition)
- Incremental pipeline with MERGE and watermark
- Materialized daily sales view
- Model schemas: define grains and keys; sketch ER diagram.
- Create tables with partitioning and clustering.
- Load initial full data (staging -> dimensions/fact).
- Implement incremental MERGE from staging using last_modified.
- Create materialized view mv_daily_sales and test refresh.
- Validate with 5 queries: revenue by day, by region, top products, returning customers, average order value.
- Document cost controls and workload isolation plan (a compute-isolation sketch follows this list).
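For the isolation step, a hedged sketch using Snowflake-style virtual warehouses is shown below; the warehouse names, sizes, and timeout are illustrative, and other engines offer similar controls via resource groups, WLM queues, or slot reservations.
-- Separate compute for ETL and BI so heavy loads never block dashboards (Snowflake-style; adapt to your engine)
CREATE WAREHOUSE etl_wh     WITH WAREHOUSE_SIZE = 'MEDIUM' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;
CREATE WAREHOUSE bi_prod_wh WITH WAREHOUSE_SIZE = 'SMALL'  AUTO_SUSPEND = 60 AUTO_RESUME = TRUE
  STATEMENT_TIMEOUT_IN_SECONDS = 300; -- cap runaway dashboard queries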
Acceptance checklist
- Fact table enforces unique grain key.
- Partition pruning observed for date filters.
- Incremental loads are idempotent and handle late data.
- Dashboards read from pre-aggregations where applicable.
- Separate compute (or equivalent) defined for ETL vs BI.
Practical roadmap milestones
- M1: Schema clarity — All facts have explicit grain; SCD strategy chosen.
- M2: Performance-ready tables — Partitioned and clustered; stats collected where applicable.
- M3: Incremental reliability — Watermarks, MERGE, dedupe logic proven with backfill.
- M4: Cost guardrails — Pre-aggregations defined; bytes scanned reduced; compute right-sized.
- M5: Production readiness — Workload isolation, query limits, monitoring checks in place.
Subskills
- Warehouse Concepts And Architectures — Core OLAP ideas, star vs snowflake, ELT patterns, layers (staging, core, marts).
- Table Design For Performance — Data types, keys, SCD strategies, column order, wide vs narrow tables.
- Clustering And Partitioning — Choosing keys to maximize pruning and locality.
- Incremental Models And Materializations — Watermarks, MERGE/UPSERT, materialized views.
- Cost Management Basics — Pruning, pre-aggregation, right-sizing compute, scheduling.
- Concurrency And Workload Isolation — Separate compute, queues, timeouts, SLAs.
- Data Sharing And Access Patterns — Secure views, row/column-level access, cross-team sharing without duplication (a minimal sharing sketch follows this list).
- Query Optimization Basics — Reading plans, sargable filters, reducing scanned columns, pre-joins.
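To make the data-sharing subskill concrete, the sketch below exposes a filtered slice through a view plus a grant; the view name and grantee are invented for illustration, and some engines add a SECURE keyword or row access policies for stronger guarantees.
-- Expose only EMEA rows and a reduced column set to a partner team
CREATE VIEW v_sales_emea AS
SELECT order_date, region, sales_usd
FROM mv_daily_sales
WHERE region = 'EMEA';

GRANT SELECT ON v_sales_emea TO analyst_emea; -- grantee syntax (user vs role) varies by engine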
Practical projects
- Marketing attribution mart: session, touchpoint, and conversion facts with daily partitions and a 7-day lookback MV.
- Inventory snapshot fact: daily snapshot with SCD2 product hierarchy and backfill process.
- Streaming ClickHouse-style events to the warehouse with micro-batch upserts and clustering by user_id.
Next steps
- Finish the mini project and run performance checks (pruning, plan review).
- Compare costs before/after pre-aggregations and clustering.
- When ready, take the skill exam on this page. Everyone can try it; sign in to save your progress.