Who this is for
Data Engineers and Analytics Engineers who run or maintain warehouse queries (ELT pipelines, dashboards, ad-hoc analyses) and want faster, cheaper, more reliable results.
Prerequisites
- Comfortable writing basic SQL (SELECT, WHERE, JOIN, GROUP BY).
- Know your warehouse basics: tables, partitions/clustering/sort keys (conceptually), and how to run EXPLAIN/QUERY PLAN.
Why this matters
In a data warehouse, inefficient queries waste time and money. As a Data Engineer you will:
- Keep BI dashboards responsive during peak hours.
- Cut compute/storage costs by scanning less data.
- Meet SLAs for batch jobs and backfills.
- Make incident triage faster with explain plans and quick fixes.
Reality check: where performance is lost
- Scanning entire fact tables when only a recent window is needed.
- Joining before filtering, which explodes row counts and shuffle volume.
- SELECT * pulling unneeded columns through every stage.
- Non-sargable predicates (wrapping columns in functions) that break partition pruning and indexes.
Concept explained simply
Query optimization means returning the same correct results while reading less data, moving less data, and doing less work. In warehouses, that mostly means: prune early, project fewer columns, filter before join, and aggregate as soon as possible.
Mental model: shortest path with the smallest load
Imagine moving boxes across a warehouse:
- Pick only the boxes you need (project needed columns).
- Pick from the right aisle first (filter on partition/sort keys).
- Combine boxes near their source (pre-aggregate before big joins).
- Avoid re-walking the warehouse (reuse results/materialize when stable).
Core techniques you will reuse
- Project only needed columns. Replace SELECT * with explicit columns. Smaller rows = less I/O and shuffle.
- Filter early using sargable predicates. Write predicates that let the engine prune partitions and skip blocks (e.g., order_date >= '2026-01-01', not DATE(order_date) >= ...).
- Partition pruning. Ensure queries include filters on the partition key (often date) with simple comparisons, not wrapped in functions.
- Join after filtering. Reduce each table as much as possible before joining. Join on keys, not derived expressions.
- Pre-aggregate facts. Aggregate big fact tables to the needed grain before joining to dimensions.
- Right-sized windows. For window functions, partition and order on selective keys and shrink the input first; unnecessary PARTITION BY clauses multiply work (see the first sketch after this list).
- Avoid unnecessary DISTINCT. Prefer GROUP BY at the intended grain; DISTINCT over large joins can be very expensive (see the second sketch after this list).
- Sample or approximate when requirements allow. For exploratory analysis, use LIMIT, table sampling, or approximate aggregates (e.g., approximate distinct counts) instead of exact full scans.
- Keep statistics fresh. Up-to-date stats/cardinality help the optimizer pick join orders and algorithms.
- Persist hot aggregates. Materialize stable, frequently used aggregates to avoid repeating heavy computation (e.g., daily rollups).
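To illustrate the window-function point, a minimal sketch (column names borrowed from the fact_sales examples below): shrink the input with a sargable date filter before the window runs, so far fewer rows get partitioned and sorted.

WITH recent AS (
  SELECT customer_id, order_date, revenue
  FROM fact_sales
  WHERE order_date >= CURRENT_DATE - INTERVAL '90 day' -- prune before the window runs
)
SELECT customer_id,
       order_date,
       SUM(revenue) OVER (
         PARTITION BY customer_id -- selective key: one partition per customer
         ORDER BY order_date
       ) AS running_revenue
FROM recent;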
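And for the DISTINCT point, a sketch against the exercise tables further down (fact_orders, dim_customers). Both queries return one row per order_date/customer_id pair, assuming every order references a valid customer, but the GROUP BY version states the grain and skips the join entirely.

-- Expensive: DISTINCT deduplicates rows after the join has already widened them
SELECT DISTINCT f.order_date, f.customer_id
FROM fact_orders f
JOIN dim_customers c ON f.customer_id = c.customer_id;

-- Cheaper: aggregate to the intended grain; no join needed here
SELECT order_date, customer_id
FROM fact_orders
GROUP BY order_date, customer_id;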
Worked examples
Example 1 — Partition pruning and sargable filters
Goal: last 30 days revenue by region.
Unoptimized:
SELECT r.region_name, SUM(f.revenue) AS revenue
FROM fact_sales f
JOIN dim_region r ON f.region_id = r.region_id
WHERE DATE(f.order_ts) >= CURRENT_DATE - 30 -- non-sargable
GROUP BY 1;
Issues: the DATE() wrapper on order_ts blocks partition pruning, and the join runs before any row reduction.
Optimized:
WITH recent AS (
SELECT region_id, revenue
FROM fact_sales
WHERE order_ts >= CURRENT_DATE - INTERVAL '30 day' -- sargable
)
SELECT r.region_name, SUM(revenue) AS revenue
FROM recent
JOIN dim_region r USING (region_id)
GROUP BY r.region_name;
Effect: partition pruning on order_ts; fewer rows joined and shuffled.
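To confirm the pruning actually happens, compare plans before and after. EXPLAIN keywords and layout vary by warehouse (partitions scanned, bytes read, pruned files), but the check looks roughly the same; a sketch:

EXPLAIN
SELECT region_id, revenue
FROM fact_sales
WHERE order_ts >= CURRENT_DATE - INTERVAL '30 day';
-- Expect partitions/bytes scanned well below the table total,
-- unlike the DATE(order_ts) version, which scans everything.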
Example 2 — Project fewer columns and join after filtering
Unoptimized:
SELECT *
FROM fact_orders f
JOIN dim_customers c ON f.customer_id = c.customer_id
WHERE f.status = 'DELIVERED';
Optimized (only needed cols, filter first):
WITH delivered AS (
SELECT order_id, customer_id
FROM fact_orders
WHERE status = 'DELIVERED'
)
SELECT d.order_id, c.customer_tier
FROM delivered d
JOIN dim_customers c USING (customer_id);
Effect: smaller scan and join width, lower shuffle volume.
Example 3 — Pre-aggregate before joining dimensions
Unoptimized (join then aggregate):
SELECT c.segment, SUM(f.revenue) AS revenue
FROM fact_sales f
JOIN dim_customers c ON f.customer_id = c.customer_id
WHERE f.order_date >= '2026-01-01'
GROUP BY c.segment;
Optimized (aggregate fact first):
WITH sales_agg AS (
SELECT customer_id, SUM(revenue) AS revenue
FROM fact_sales
WHERE order_date >= '2026-01-01'
GROUP BY customer_id
)
SELECT c.segment, SUM(s.revenue) AS revenue
FROM sales_agg s
JOIN dim_customers c USING (customer_id)
GROUP BY c.segment;
Effect: much smaller intermediate result feeding the join.
Practice: Try it now
Exercise: Optimize a daily revenue query
Tables:
- fact_orders(order_id, order_date, customer_id, product_id, qty, revenue) — partitioned by order_date
- dim_customers(customer_id, region_id, customer_tier)
- dim_regions(region_id, region_name)
Goal: last 30 days revenue by region_name, descending.
Starting query (slow):
SELECT r.region_name, SUM(f.revenue) AS revenue
FROM fact_orders f
JOIN dim_customers c ON f.customer_id = c.customer_id
JOIN dim_regions r ON c.region_id = r.region_id
WHERE DATE(f.order_date) >= CURRENT_DATE - 30
GROUP BY r.region_name
ORDER BY revenue DESC;
Tasks:
- Make the date filter sargable for partition pruning.
- Project only necessary columns.
- Pre-aggregate fact_orders if helpful.
Solution
WITH recent AS (
SELECT customer_id, SUM(revenue) AS revenue
FROM fact_orders
WHERE order_date >= CURRENT_DATE - INTERVAL '30 day'
GROUP BY customer_id
)
SELECT r.region_name, SUM(x.revenue) AS revenue
FROM recent x
JOIN dim_customers c USING (customer_id)
JOIN dim_regions r ON c.region_id = r.region_id
GROUP BY r.region_name
ORDER BY revenue DESC;
Why it works: partition pruning on order_date; less data moved into joins; only necessary columns carried.
Self-check checklist
- Is the date predicate directly on order_date without wrapping functions?
- Did you avoid SELECT * and only keep needed columns?
- Did you reduce the fact table before joining?
- Do the result ordering and totals match the original? (One way to check follows this list.)
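One way to run that last check, sketched with both versions as CTEs and diffed via EXCEPT (dialects differ slightly, e.g. BigQuery spells it EXCEPT DISTINCT; exact SUM comparisons assume non-floating-point revenue). Both CTEs use the sargable filter so the diff isolates the restructuring; an empty result means the rows match.

WITH original AS (
  SELECT r.region_name, SUM(f.revenue) AS revenue
  FROM fact_orders f
  JOIN dim_customers c ON f.customer_id = c.customer_id
  JOIN dim_regions r ON c.region_id = r.region_id
  WHERE f.order_date >= CURRENT_DATE - INTERVAL '30 day'
  GROUP BY r.region_name
),
optimized AS (
  SELECT r.region_name, SUM(x.revenue) AS revenue
  FROM (
    SELECT customer_id, SUM(revenue) AS revenue
    FROM fact_orders
    WHERE order_date >= CURRENT_DATE - INTERVAL '30 day'
    GROUP BY customer_id
  ) x
  JOIN dim_customers c USING (customer_id)
  JOIN dim_regions r ON c.region_id = r.region_id
  GROUP BY r.region_name
)
(SELECT * FROM original EXCEPT SELECT * FROM optimized)
UNION ALL
(SELECT * FROM optimized EXCEPT SELECT * FROM original);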
Common mistakes and how to self-check
- Non-sargable filters. If EXPLAIN shows full scan despite a date filter, check for functions on the column (e.g., DATE(col)).
- Joining before filtering. If row counts explode between scan and join, push filters/aggregations earlier.
- Overusing DISTINCT. If DISTINCT is used to “fix” duplicates, revisit join keys and intended grain.
- SELECT * everywhere. If network/shuffle is high, explicitly select needed columns.
- Stale statistics. If the optimizer picks poor join orders, refresh stats/catalog where supported (see the sketch after this list).
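For the statistics item, the refresh command is engine-specific; a sketch for Postgres/Redshift-style engines (Snowflake and BigQuery maintain table statistics automatically):

-- Recompute planner statistics so cardinality estimates reflect current data
ANALYZE fact_orders;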
Practical projects
- Dashboard rescue: Take a slow dashboard query, record baseline metrics (scan bytes, runtime), apply pruning, projection, and pre-aggregation, and re-measure.
- Daily rollup table: Build a materialized or scheduled rollup (daily sales by region/product) and point a report to it (a starter sketch follows this list).
- Explain plan library: Collect before/after EXPLAIN plans for 3 queries showing reduced scans, fewer joins, or smaller shuffles.
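A starter sketch for the rollup project, reusing the exercise schema above. The table name daily_region_sales is hypothetical, and your warehouse may offer materialized views or scheduled tasks instead of plain CREATE TABLE AS:

-- One-time build; in production, refresh on a schedule or use a materialized view
CREATE TABLE daily_region_sales AS
SELECT f.order_date, r.region_name, SUM(f.revenue) AS revenue
FROM fact_orders f
JOIN dim_customers c ON f.customer_id = c.customer_id
JOIN dim_regions r ON c.region_id = r.region_id
GROUP BY f.order_date, r.region_name;

-- Reports then read the small rollup instead of re-scanning the fact table
SELECT region_name, SUM(revenue) AS revenue
FROM daily_region_sales
WHERE order_date >= CURRENT_DATE - INTERVAL '30 day'
GROUP BY region_name
ORDER BY revenue DESC;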
Mini challenge
You have a table fact_events(event_ts, user_id, event_type, payload) partitioned by event_date. A PM wants weekly active users for the past 8 weeks by country. Draft an optimized query plan in bullets (not code) that minimizes scans and joins.
A strong plan
- Filter on event_date for last 8 weeks (sargable) to prune partitions.
- Project user_id and event_date only; drop payload early.
- Pre-aggregate to user_id x week (distinct user_id per week).
- Join the small result to dim_users(country) or dim_geo, not the raw events.
- Group by country, week and count distinct user_id.
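For reference, that plan as SQL, sketched under assumptions: event_date is queryable as a column, a dim_users(user_id, country) dimension exists (the challenge bullets offer it as one option), and your engine supports DATE_TRUNC (BigQuery, for example, writes DATE_TRUNC(event_date, WEEK)).

WITH weekly_users AS (
  SELECT DATE_TRUNC('week', event_date) AS week, user_id
  FROM fact_events
  WHERE event_date >= CURRENT_DATE - INTERVAL '56 day' -- sargable: prunes to the last 8 weeks
  GROUP BY 1, 2 -- one row per user per week; payload dropped early
)
SELECT u.country, w.week, COUNT(DISTINCT w.user_id) AS weekly_active_users
FROM weekly_users w
JOIN dim_users u USING (user_id)
GROUP BY u.country, w.week;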
Learning path
- Learn how your warehouse partitions/clusters data and how to read EXPLAIN/QUERY PLAN.
- Practice pruning and projection on fact tables with date filters.
- Optimize joins: filter first, pre-aggregate facts, verify row counts between steps.
- Apply window functions carefully: set correct PARTITION BY/ORDER BY and reduce inputs.
- Materialize stable aggregates for repeated use (daily/weekly rollups).
- Build a habit: measure baseline, change one thing, re-measure, document.
Next steps
- Revisit 1–2 core production queries and apply at least two optimizations each.
- Create a shared checklist for your team (pruning, projection, join order, stats).
- When ready, take the Quick Test below to confirm understanding.