Who this is for
- BI Analysts who build dashboards and scheduled reports.
- Analysts who maintain ELT/ETL SQL for data marts or semantic layers.
- Anyone facing slow nightly loads, timeouts, or rising warehouse costs.
Prerequisites
- Comfortable with SQL SELECT, JOIN, WHERE, GROUP BY, and window functions.
- Basic understanding of your warehouse (e.g., columnar vs row-store) and table layouts.
- Ability to run queries and see execution stats (duration, rows scanned).
Why this matters
- Faster dashboards: Optimized BI loads produce smaller, query-ready tables that power snappy visuals.
- Reliable schedules: Nightly/hourly jobs finish on time, reducing failures and firefighting.
- Lower costs: Less data scanned and fewer compute slots used.
- Cleaner models: Purpose-built aggregates simplify downstream queries and logic.
Concept explained simply
Query optimization for BI loads means shaping data early so downstream steps read fewer rows and join less. You push filters down, choose only needed columns, aggregate before wide joins, and leverage partitions and indexes so the database touches the smallest possible slice of data.
Mental model
Imagine water flowing through pipes. Wide, unfiltered tables are floods. Your job is to add gates and sieves upstream: filter by date and business rules first, aggregate to the grain you actually need, and only then join to enrich. Smaller, tidier flows keep everything fast.
Techniques that matter for BI loads
1) Read only what you need
- Select explicit columns, not SELECT *.
- Push filters to the earliest step (on the base table).
- Use range filters on time columns for partition pruning (e.g., >= start AND < end).
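For instance, a minimal sketch of this pattern against the orders table used in the worked examples below (the January window is illustrative):
-- Explicit columns plus a half-open range on the time column
SELECT order_id, customer_id, order_date
FROM orders
WHERE order_date >= DATE '2025-01-01'
  AND order_date < DATE '2025-02-01';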
2) Join efficiently
- Pre-aggregate before joining many-to-many to avoid row explosion.
- Join on well-typed, indexed keys; avoid functions on join keys.
- Choose join order so the largest reduction happens earliest.
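A small sketch of the join-key point; the customers table and its segment column are hypothetical and stand in for any dimension keyed the same way:
-- Anti-pattern (commented out): casting or wrapping the key hides it from indexes and pruning
-- ... JOIN customers c ON CAST(o.customer_id AS VARCHAR) = CAST(c.customer_id AS VARCHAR)

-- Better: store both keys with the same type, then join on the raw columns
SELECT o.order_id, c.segment
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
WHERE o.order_date >= DATE '2025-01-01'
  AND o.order_date < DATE '2025-02-01';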
3) Partitioning, pruning, and statistics
- Partition large fact tables by date/time; always filter on the partition column without wrapping it in functions.
- Keep statistics up to date so the optimizer makes good plans.
- In columnar warehouses, use distribution/sort keys thoughtfully for big joins.
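As one illustration, a PostgreSQL-style sketch of date partitioning plus a statistics refresh; partitioning/clustering and stats commands differ across warehouses, and orders_part is an illustrative name:
-- Declare the fact table partitioned by the column your loads always filter on
CREATE TABLE orders_part (
  order_id    BIGINT,
  customer_id BIGINT,
  order_date  DATE,
  status      TEXT,
  updated_at  TIMESTAMP
) PARTITION BY RANGE (order_date);

CREATE TABLE orders_part_2025_01 PARTITION OF orders_part
  FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');

-- Refresh statistics after large loads so the optimizer estimates rows well
ANALYZE orders_part;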
4) CTEs, temp tables, and reuse
- Some engines inline CTEs; if a heavy subquery is reused, materialize it into a temporary/intermediate table.
- Index or sort that intermediate table if reused many times.
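A sketch of that pattern in a PostgreSQL-style engine; the temp table name and the seven-day window are illustrative:
-- Materialize the heavy, reused slice once instead of recomputing it in every CTE
CREATE TEMP TABLE tmp_recent_orders AS
SELECT order_id, customer_id, order_date
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '7 day'
  AND order_date < CURRENT_DATE;

-- Index (or sort) it when several later steps join on the same key
CREATE INDEX ON tmp_recent_orders (order_id);

-- Downstream steps now read the small temp table instead of rescanning orders
SELECT r.order_date, SUM(oi.quantity) AS qty
FROM tmp_recent_orders r
JOIN order_items oi ON oi.order_id = r.order_id
GROUP BY r.order_date;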
5) Aggregate at the BI grain
- Build data marts at the grain you report on (e.g., daily_product_category_sales), not at raw line level unless you need it.
- Window functions are great, but GROUP BY is usually cheaper when you do not need row context.
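To make the trade-off concrete (both queries are valid; the first collapses rows early, the second keeps every row because it needs row context):
-- Aggregate-only need: GROUP BY reduces the data to the output grain
SELECT sku_id, SUM(quantity) AS qty
FROM order_items
GROUP BY sku_id;

-- Row context needed: a window function keeps each line and attaches the SKU total
SELECT order_id, sku_id, quantity,
       SUM(quantity) OVER (PARTITION BY sku_id) AS sku_total_qty
FROM order_items;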
6) Incremental loads
- Load only new/changed data using watermarks (e.g., last_update > max_processed_at).
- Use MERGE/UPSERT for late-arriving updates.
7) Avoid anti-patterns
- Do not wrap partition keys in functions in WHERE (e.g., DATE(order_date)); use ranges.
- Avoid DISTINCT as a band-aid for bad joins; fix the join strategy.
- Do not ORDER BY in intermediate steps unless required.
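For the DISTINCT point in particular, a semi-join via EXISTS keeps fact rows at their original grain; promotion_map is the mapping table reused in Exercise 2:
-- Anti-pattern: JOIN order_items to promotion_map, watch rows multiply, then DISTINCT the result
-- Better: EXISTS answers "is this SKU promoted?" without changing the row count
SELECT oi.order_id, oi.sku_id, oi.quantity,
       EXISTS (
         SELECT 1 FROM promotion_map pm WHERE pm.sku_id = oi.sku_id
       ) AS is_promoted
FROM order_items oi;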
Worked examples
Assume tables: orders(order_id, customer_id, order_date, status, updated_at), order_items(order_id, sku_id, quantity, unit_price), sku(sku_id, category).
Example 1 — Filter early and select less
Slow:
SELECT o.*, oi.*, s.*
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
LEFT JOIN sku s ON s.sku_id = oi.sku_id
WHERE DATE(o.order_date) = CURRENT_DATE - INTERVAL '1 day';
Optimized:
WITH day_window AS (
SELECT
date_trunc('day', current_date - interval '1 day') AS d_start,
date_trunc('day', current_date) AS d_end
), orders_day AS (
SELECT o.order_id, o.order_date
FROM orders o, day_window dw
WHERE o.order_date >= dw.d_start AND o.order_date < dw.d_end
)
SELECT od.order_date, oi.sku_id, s.category,
SUM(oi.quantity) AS qty,
SUM(oi.quantity * oi.unit_price) AS revenue
FROM orders_day od
JOIN order_items oi ON oi.order_id = od.order_id
LEFT JOIN sku s ON s.sku_id = oi.sku_id
GROUP BY od.order_date, oi.sku_id, s.category;
- No SELECT *.
- No function on partition key in WHERE.
- Group only the needed output grain.
Example 2 — Pre-aggregate before joining a wide dimension
Goal: revenue by day and category. Pre-aggregate items, then join SKU once.
WITH d AS (
SELECT date_trunc('day', current_date - interval '1 day') AS start_d,
date_trunc('day', current_date) AS end_d
), od AS (
SELECT order_id, order_date
FROM orders o, d
WHERE o.order_date >= d.start_d AND o.order_date < d.end_d
), item_agg AS (
SELECT oi.order_id, oi.sku_id,
SUM(oi.quantity * oi.unit_price) AS revenue
FROM order_items oi
JOIN od ON od.order_id = oi.order_id
GROUP BY oi.order_id, oi.sku_id
)
SELECT od.order_date,
s.category,
SUM(ia.revenue) AS revenue
FROM item_agg ia
JOIN od ON od.order_id = ia.order_id
JOIN sku s ON s.sku_id = ia.sku_id
GROUP BY od.order_date, s.category;
We avoid joining SKU at line level before reducing rows.
Example 3 — Incremental load with watermark
-- Suppose the target mart has a column processed_at tracking the last processed update.
-- 1) Get watermark
WITH wm AS (
SELECT COALESCE(MAX(processed_at), TIMESTAMP '2000-01-01 00:00:00') AS wm
FROM daily_category_sales_mart
), changed_orders AS (
SELECT o.order_id, o.order_date, o.updated_at
FROM orders o, wm
WHERE o.updated_at > wm.wm
), agg AS (
SELECT o.order_date AS day,
s.category,
SUM(oi.quantity * oi.unit_price) AS revenue
FROM changed_orders o
JOIN order_items oi ON oi.order_id = o.order_id
JOIN sku s ON s.sku_id = oi.sku_id
GROUP BY o.order_date, s.category
)
-- 2) Upsert into target
-- Syntax varies by engine; conceptual MERGE shown
MERGE INTO daily_category_sales_mart t
USING agg a
ON (t.day = a.day AND t.category = a.category)
WHEN MATCHED THEN UPDATE SET revenue = a.revenue, processed_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN INSERT (day, category, revenue, processed_at)
VALUES (a.day, a.category, a.revenue, CURRENT_TIMESTAMP);
This loads only changed data. Because agg sees only the changed orders, re-aggregate every affected day in full (or apply deltas) so the day/category totals in the mart stay complete.
Example 4 — Partition pruning
-- Good: range filter on partition column order_date
SELECT COUNT(*)
FROM orders
WHERE order_date >= DATE '2025-01-01'
AND order_date < DATE '2025-02-01';
-- Bad: wrapping the column prevents pruning
SELECT COUNT(*)
FROM orders
WHERE DATE(order_date) BETWEEN DATE '2025-01-01' AND DATE '2025-01-31';
Step-by-step playbook
- Define the BI grain. Example: daily x category.
- Get the smallest input slice. Use partition range on date and updated_at if incremental.
- Pre-aggregate early. Reduce line-level data to the target grain (or close to it).
- Join dimensions late. Join small dimensions after reducing facts.
- Validate. Check row counts and totals against a trusted sample.
- Measure. Compare runtime, rows scanned, and costs before/after.
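A small sketch of the last two steps, assuming a PostgreSQL-style engine (plan commands and scan metrics vary by warehouse); daily_category_sales_mart is the mart from Example 3 and the sample day is illustrative:
-- Validate: mart totals for a sample day should match a recomputation from the sources
SELECT SUM(revenue) AS mart_revenue
FROM daily_category_sales_mart
WHERE day = DATE '2025-01-15';

SELECT SUM(oi.quantity * oi.unit_price) AS source_revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
WHERE o.order_date >= DATE '2025-01-15'
  AND o.order_date < DATE '2025-01-16';

-- Measure: capture the plan plus actual time and rows before and after the rewrite
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*)
FROM orders
WHERE order_date >= DATE '2025-01-01'
  AND order_date < DATE '2025-02-01';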
Exercises
Exercise 1 (ex1) — Rewrite a slow daily KPI query
Given a slow query that uses SELECT * and DATE(order_date) in WHERE, rewrite it to:
- Filter by a date range on order_date.
- Select only needed columns.
- Pre-aggregate before joining SKU.
Deliverable: an optimized SQL that computes daily revenue by SKU category for the previous day.
Exercise 2 (ex2) — Prevent row explosion
You need daily revenue and a boolean flag is_promoted based on promotion_map(sku_id, promo_id). Avoid double counting from many-to-many joins.
- Use EXISTS or pre-aggregate item revenue before attaching the boolean flag.
Deliverable: a query that returns day, category, revenue, is_promoted without inflated revenue.
Self-check checklist:
- Did you remove SELECT *?
- Are date filters ranges (>= start AND < end)?
- Did you pre-aggregate before wide joins?
- No functions on join/partition keys?
- No unnecessary ORDER BY in intermediate steps?
Common mistakes and how to self-check
- Using DATE(column) in WHERE. Fix: use range on the raw column.
- Joining before aggregating. Fix: aggregate facts first, then enrich.
- Slapping DISTINCT to hide duplicates. Fix the join logic and grain.
- SELECT *. Fix: select only columns you ship downstream.
- Forgetting statistics. Fix: refresh stats on large tables after big changes.
- Overusing window functions. If you do not need row context, GROUP BY is cheaper.
Quick self-audit mini-task
- Locate the largest scan in your job. Can you reduce the scanned range?
- Find any function-wrapped filters or join keys. Can you remove the function?
- Identify one opportunity to materialize a reused heavy subquery.
Practical projects
- Project A: Build a daily_category_sales mart. Inputs: orders, order_items, sku. Output columns: day, category, revenue, qty. Use incremental loads with a watermark.
- Project B: Optimize a dashboard’s slowest tile. Trace its upstream SQL, push filters earlier, and pre-aggregate. Document before/after runtime and rows scanned.
- Project C: Row explosion clinic. Reproduce a many-to-many join issue and fix it via pre-aggregation or semi-joins. Write a 5-point checklist for your team.
Learning path
- Master partition-range filtering and column selection.
- Practice pre-aggregating before joins for common BI grains.
- Adopt incremental load patterns (watermarks, MERGE).
- Implement a repeatable optimization checklist for all jobs.
- Validate and monitor: set expected runtime and row-scan budgets.
Next steps
- Apply the playbook to one production BI load and measure improvements.
- Set a standard template for new marts (grain, filters, join order, validations).
- Proceed to the Quick Test to check your understanding.
Mini challenge
Your mart refresh window shrank from 60 to 30 minutes. You must refresh the last 3 days of sales with late-arriving updates. Sketch the SQL flow:
- Pick the right time window and watermark.
- Show where you pre-aggregate.
- Explain how you upsert into the mart.