What this subskill covers
Performance-aware querying means writing SQL that scans less data, joins efficiently, and returns correct results faster at lower cost. You will learn practical patterns that apply to columnar cloud warehouses.
Who this is for
- Analytics Engineers and BI Developers building models, marts, and dashboards.
- Data Analysts who want queries that finish quickly and reliably.
- Anyone paying per terabyte scanned or fighting slow dashboards.
Prerequisites
- Comfort with SELECT, WHERE, GROUP BY, JOIN, and basic window functions.
- Know your warehouse has columnar storage and supports partitioning/clustering (names vary by vendor).
Why this matters
- Dashboards need predictable performance to refresh on schedule.
- Data modeling jobs must finish within SLAs to unblock downstream teams.
- Cloud warehouses often bill by data scanned or compute time—waste less, pay less.
Real tasks in the profession
- Designing a daily sales mart that must finish under 10 minutes.
- Optimizing a JOIN between a 2B-row fact and small dimensions.
- Reducing the cost of a COUNT DISTINCT metric in a product analytics dashboard.
Concept explained simply
A warehouse stores columns separately. If you read fewer columns and fewer rows, queries run faster. Your job is to help the engine skip as much data as possible and do the minimum work needed for correct results.
Mental model
- Read less: select only needed columns; filter early with sargable predicates.
- Touch less: prune partitions and use clustering/sort/distribution keys well.
- Move less: prefer broadcasting tiny tables over large shuffles; pre-aggregate before big joins (see the sketch after this list).
- Compute less: avoid unnecessary DISTINCT, over-wide windows, full sorts; use approximate functions when they are acceptable.
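A minimal sketch of "move less": pre-aggregate the fact table to the join grain so far fewer rows are shuffled to the dimension. The table and column names (sales_fact, dim_products) match the worked examples later in this lesson and are illustrative.
-- Pre-aggregate the fact to one row per product, then join the
-- much smaller result to the dimension (names are illustrative)
WITH daily_product AS (
  SELECT product_id, SUM(amount) AS revenue
  FROM sales_fact
  WHERE order_date >= '2024-01-01' AND order_date < '2024-01-02'
  GROUP BY product_id
)
SELECT d.category, SUM(p.revenue) AS revenue
FROM daily_product p
JOIN dim_products d USING (product_id)
GROUP BY d.category;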
Core principles for warehouses
- Avoid SELECT * in production. Pick only columns you need; columnar engines reward this.
- Write sargable filters that enable pruning. Example for a date column:
-- Good (range, inclusive-exclusive)
WHERE order_date >= '2024-01-01' AND order_date < '2024-01-02'
-- Risky: applying functions to the column often blocks pruning
WHERE DATE(order_date) = '2024-01-01' -- often slower
- Partition pruning and clustering: Filter by partition keys first; choose clustering/sort keys aligned with frequent filters and joins.
- Join strategy: Join big fact to small dimensions. Broadcast/replicate the small side when possible. Filter both sides before joining.
- Pre-aggregate early: Summarize the fact table at useful grain before joining to many dimensions.
- COUNT DISTINCT carefully: It’s expensive. Consider de-duplicating first or using approximate algorithms when exact precision isn’t required.
- Window functions: Powerful, but can explode work. Filter early, limit PARTITION BY columns to what’s necessary, and consider pre-aggregations.
- CTEs: Some engines inline them, others can materialize them. If you reuse the same heavy subquery multiple times, consider a temp table or a single evaluation strategy (see the sketch after this list).
- ORDER BY + LIMIT: A global sort can be expensive. Pre-filter aggressively or use partial/top-N strategies when supported (see the sketch after this list).
- UDFs: Treat as last resort; they often disable optimizations and can be slower than native SQL.
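For the CTE point, a hedged sketch of materializing a heavy subquery once instead of re-evaluating it. CREATE TEMPORARY TABLE syntax varies by vendor, so treat this as a pattern, not exact DDL; table and column names are illustrative.
-- Evaluate the heavy filter/aggregation once (vendor syntax varies)
CREATE TEMPORARY TABLE web_daily AS
SELECT region, order_date, SUM(amount) AS revenue
FROM sales_fact
WHERE channel = 'web'
GROUP BY region, order_date;
-- Reuse it in several queries without recomputing the heavy step
SELECT region, SUM(revenue) AS revenue FROM web_daily GROUP BY region;
SELECT order_date, SUM(revenue) AS revenue FROM web_daily GROUP BY order_date;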
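For the ORDER BY + LIMIT point, a sketch: restrict the sort input with a pruning filter so the engine sorts one day's partition instead of the full history. Names are illustrative.
-- Sort only one day's partition, not the whole table
SELECT order_id, amount
FROM sales_fact
WHERE order_date >= '2024-01-01' AND order_date < '2024-01-02'
ORDER BY amount DESC
LIMIT 10;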
Worked examples
Example 1 — Prune partitions and select fewer columns
Goal: Daily revenue by region for web channel.
-- Slow pattern
SELECT *
FROM sales_fact
WHERE DATE(order_date) = '2024-01-01' AND channel = 'web';
Better:
SELECT region, SUM(amount) AS revenue
FROM sales_fact
WHERE order_date >= '2024-01-01' AND order_date < '2024-01-02'
AND channel = 'web'
GROUP BY region;
Why it’s faster: range filter enables partition pruning; fewer columns read; aggregation done once.
Example 2 — Broadcast small dimension and pre-filter
Goal: Revenue by product category last 7 days.
-- Better pattern
WITH filtered_fact AS (
SELECT product_id, amount
FROM sales_fact
WHERE order_date >= CURRENT_DATE - INTERVAL '7' DAY
), small_dim AS (
SELECT product_id, category FROM dim_products -- small table
)
SELECT d.category, SUM(f.amount) AS revenue
FROM filtered_fact f
JOIN small_dim d USING (product_id)
GROUP BY d.category;
Why it’s faster: Fact filtered before join; small dimension is cheap to broadcast; less data shuffled.
Example 3 — Cheaper distinct via pre-dedup or approximate
Goal: Active customers (distinct) yesterday by segment.
-- Exact but expensive if done on raw facts
-- SELECT segment, COUNT(DISTINCT customer_id) ...
-- Better exact approach: pre-dedup first, then join
WITH active AS (
SELECT DISTINCT customer_id
FROM sales_fact
WHERE order_date >= '2024-01-01' AND order_date < '2024-01-02'
AND channel = 'web'
)
SELECT c.segment, COUNT(*) AS customers
FROM active a
JOIN dim_customers c USING (customer_id)
GROUP BY c.segment;
-- If approximate is acceptable (faster, less memory):
-- SELECT c.segment, APPROX_COUNT_DISTINCT(a.customer_id) AS customers ...
-- Note: Result is approximate; validate with stakeholders.
Exercises (practice)
Use the inline dataset in each exercise so you can reason about the exact output.
Exercise 1 — Filter early, prune partitions, and select only needed columns
Task: Compute revenue by region for 2024-01-01 and channel = 'web'. Avoid SELECT * and use a sargable date filter.
Dataset (inline CTE)
WITH sales_fact AS (
SELECT * FROM (
VALUES
(1, '2024-01-01', 'NA', 'c1', 'p1', 100, 'web'),
(2, '2024-01-01', 'NA', 'c2', 'p2', 200, 'web'),
(3, '2024-01-01', 'EU', 'c1', 'p3', 150, 'web'),
(4, '2024-01-01', 'NA', 'c3', 'p4', 50, 'store'),
(5, '2024-01-02', 'EU', 'c2', 'p1', 120, 'web'),
(6, '2024-01-02', 'NA', 'c1', 'p2', 80, 'web')
) AS t(order_id, order_date, region, customer_id, product_id, amount, channel)
)
-- Your query goes below
Expected output:
region | revenue
------ | -------
EU     | 150
NA     | 300
Hints
- Use order_date >= '2024-01-01' AND order_date < '2024-01-02'.
- Filter channel first; select only region and amount.
- GROUP BY region.
Show solution
WITH sales_fact AS (
SELECT * FROM (
VALUES
(1, '2024-01-01', 'NA', 'c1', 'p1', 100, 'web'),
(2, '2024-01-01', 'NA', 'c2', 'p2', 200, 'web'),
(3, '2024-01-01', 'EU', 'c1', 'p3', 150, 'web'),
(4, '2024-01-01', 'NA', 'c3', 'p4', 50, 'store'),
(5, '2024-01-02', 'EU', 'c2', 'p1', 120, 'web'),
(6, '2024-01-02', 'NA', 'c1', 'p2', 80, 'web')
) AS t(order_id, order_date, region, customer_id, product_id, amount, channel)
)
SELECT region, SUM(amount) AS revenue
FROM sales_fact
WHERE order_date >= '2024-01-01' AND order_date < '2024-01-02'
AND channel = 'web'
GROUP BY region
ORDER BY region;
Exercise 2 — Pre-dedup and join small dimension
Task: Count distinct customers by segment for 2024-01-01 and channel = 'web'. First deduplicate customers, then join.
Datasets (inline CTEs)
WITH sales_fact AS (
SELECT * FROM (
VALUES
(1, '2024-01-01', 'NA', 'c1', 'p1', 100, 'web'),
(2, '2024-01-01', 'NA', 'c2', 'p2', 200, 'web'),
(3, '2024-01-01', 'EU', 'c1', 'p3', 150, 'web'),
(4, '2024-01-01', 'NA', 'c3', 'p4', 50, 'store'),
(5, '2024-01-02', 'EU', 'c2', 'p1', 120, 'web'),
(6, '2024-01-02', 'NA', 'c1', 'p2', 80, 'web')
) AS t(order_id, order_date, region, customer_id, product_id, amount, channel)
), dim_customers AS (
SELECT * FROM (
VALUES
('c1', 'vip'),
('c2', 'regular'),
('c3', 'regular')
) AS d(customer_id, segment)
)
-- Your query goes below
Expected output:
segment | customers
------- | ---------
regular | 1
vip     | 1
Hints
- Create a CTE active AS (SELECT DISTINCT customer_id FROM sales_fact WHERE ...).
- Join active to dim_customers on customer_id.
- COUNT(*) after the join (distinct already handled).
Show solution
WITH sales_fact AS (
SELECT * FROM (
VALUES
(1, '2024-01-01', 'NA', 'c1', 'p1', 100, 'web'),
(2, '2024-01-01', 'NA', 'c2', 'p2', 200, 'web'),
(3, '2024-01-01', 'EU', 'c1', 'p3', 150, 'web'),
(4, '2024-01-01', 'NA', 'c3', 'p4', 50, 'store'),
(5, '2024-01-02', 'EU', 'c2', 'p1', 120, 'web'),
(6, '2024-01-02', 'NA', 'c1', 'p2', 80, 'web')
) AS t(order_id, order_date, region, customer_id, product_id, amount, channel)
), dim_customers AS (
SELECT * FROM (
VALUES
('c1', 'vip'),
('c2', 'regular'),
('c3', 'regular')
) AS d(customer_id, segment)
), active AS (
SELECT DISTINCT customer_id
FROM sales_fact
WHERE order_date >= '2024-01-01' AND order_date < '2024-01-02'
AND channel = 'web'
)
SELECT c.segment, COUNT(*) AS customers
FROM active a
JOIN dim_customers c USING (customer_id)
GROUP BY c.segment
ORDER BY c.segment;
Practice checklist (apply to any query)
- Am I selecting only necessary columns?
- Are my filters sargable (no functions on filtered columns)?
- Do filters align with partition/clustering keys?
- Can I pre-aggregate or pre-deduplicate before joining?
- Is the small table broadcastable? Did I filter it too?
- Do I really need COUNT(DISTINCT) or a window? Any cheaper equivalent?
- Did I avoid a global ORDER BY when not needed?
Common mistakes and self-check
- Using DATE(col) in WHERE: Often disables pruning. Self-check: Replace with a range filter; compare performance.
- Joining raw facts to multiple dimensions before filtering: Leads to huge shuffles. Self-check: Filter and pre-aggregate first.
- SELECT * in production models: Reads unnecessary columns. Self-check: Count columns referenced downstream; remove the rest.
- COUNT(DISTINCT) on giant tables: Memory-heavy. Self-check: Try pre-dedup or approximate and compare error/latency tradeoff.
- Unbounded window partitions: Expensive. Self-check: Add WHERE filters, minimize PARTITION BY keys, or move logic to GROUP BY.
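A sketch of the last point, using the illustrative sales_fact table: when only a per-group total is needed, GROUP BY computes it once instead of carrying it on every row.
-- Heavier: the window repeats the total on every detail row
-- SELECT customer_id, amount,
--        SUM(amount) OVER (PARTITION BY customer_id) AS customer_total
-- FROM sales_fact;
-- Lighter when detail rows are not needed: aggregate once
SELECT customer_id, SUM(amount) AS customer_total
FROM sales_fact
GROUP BY customer_id;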
Practical projects
- Project 1: Redesign a slow daily sales mart. Document baseline runtime and cost, then apply pruning, pre-aggregation, and join fixes. Record the improvement.
- Project 2: Build a customer-activity dashboard metric. Implement both exact and approximate distinct methods; compare accuracy and runtime.
- Project 3: Create a performance checklist for your team. Add examples from your own models and set review gates.
Learning path
- Start: Sargable filters and partition-aware date logic.
- Next: Join strategies (broadcast vs. shuffle) and pre-aggregation.
- Then: Window functions performance patterns.
- Finally: Cost-aware design of marts and dashboards.
Next steps
- Revisit your slowest query today; apply the practice checklist.
- Run the Quick Test to confirm you’ve internalized the concepts.
- Plan a small refactor of one model to use pre-aggregation.
Mini challenge
Given a large fact table and three dimensions, write two query versions to compute weekly active users by region: one naive (raw join + COUNT DISTINCT) and one optimized (pre-dedup + filtered joins). Compare the logic and explain why the optimized version should scan less.
Progress saving note
The quick test is available to everyone. Only logged-in users will have their test results and progress saved.