Query Performance Principles

A practical introduction to query performance principles for Data Architects, with explanations, worked examples, and practice exercises.

Published: January 18, 2026 | Updated: January 18, 2026

Why this matters

As a Data Architect, you design systems that must scale without surprises. Query performance principles help you reduce compute cost, prevent timeouts, and keep dashboards and pipelines fast. You will decide on the storage formats, partitioning, indexing, and modeling that make common queries efficient.

  • Design star schemas and partitioning so BI queries scan less data.
  • Guide teams to create the right indexes and join strategies.
  • Prevent data lake slowdowns from small files and non-prunable filters.
  • Control concurrency and resource usage to keep SLAs stable.

Who this is for

  • Data Architects and Senior Data Engineers shaping data models and platform choices.
  • Analytics Engineers optimizing warehouse queries.
  • Engineers moving batch workloads to low-latency, cost-aware designs.

Prerequisites

  • Comfortable with SQL joins, aggregates, and filtering.
  • Basic understanding of OLTP vs. OLAP, columnar formats (Parquet/ORC), and indexes.
  • Familiarity with explain plans in at least one engine (e.g., PostgreSQL, Snowflake, BigQuery, Spark SQL).

Concept explained simply

Fast queries do two things: read less data and move less data. Everything else is a way to achieve those two goals.

Mental model

Imagine your query as a flow of water:

  • Filters and projections are the valve at the source (read less).
  • Partitioning, indexing, and pruning are the pipes selecting the shortest path.
  • Joins and aggregations are mixers; the smaller the inputs, the smoother it runs.
  • Sorting and window functions are fine filters; use only when needed.

Quick glossary

  • Pruning: Skipping partitions or files that cannot match your filter.
  • Projection: Reading only the columns you need.
  • Broadcast join: Sending a small table to all workers to avoid shuffles.
  • Shuffle: Network-heavy redistribution of data, often the biggest cost.
  • Spill: When memory runs out and intermediate data goes to disk (slow).

Core principles you’ll use daily

  • Push filters early: Write filters so engines can prune partitions and use indexes; avoid wrapping filtered columns in functions in WHERE clauses (see the sargable rewrite after this list).
  • Project only needed columns: Select the minimal set; columnar engines benefit the most.
  • Use the right join strategy: Join small to big; broadcast small dimensions; ensure join keys are distributed/indexed.
  • Order of operations: Filter and pre-aggregate before heavy joins and sorts when possible.
  • Choose storage and layout wisely: Columnar formats, compressed, partitioned by common filters; avoid the small files problem.
  • Indexing strategy: Composite indexes matching filter order and sort needs; covering indexes for hot OLTP queries.
  • Accurate statistics: Keep table and column stats fresh so the optimizer picks the right plan; avoid stale plans.
  • Avoid row-by-row work: Prefer set-based operations and window functions over loops/UDFs in tight filters.
  • Manage concurrency and memory: Tune parallelism, limit spills, and isolate workloads with separate warehouses or resource pools where needed.
  • Materialize for speed: Use materialized views or summary tables for frequent heavy aggregations; refresh on a schedule.
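
A minimal illustration of the first principle, using the orders table from the examples below: both queries return the same count, but only the second can use an index (or partition range) on created_at.

-- Non-sargable: the function on created_at blocks index use and pruning
SELECT COUNT(*)
FROM orders
WHERE DATE(created_at) = '2025-07-15';

-- Sargable rewrite: a half-open range on the raw column can use an index
SELECT COUNT(*)
FROM orders
WHERE created_at >= '2025-07-15'
  AND created_at < '2025-07-16';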

Worked examples

Example 1 — OLTP: speeding up a filtered look-up

Original query:

SELECT *
FROM orders
WHERE customer_id = 123
  AND created_at >= '2025-01-01'
ORDER BY created_at DESC
LIMIT 50;

Issues:

  • SELECT * fetches all columns; wider rows mean more I/O.
  • No composite index to support both filter and sort.

Better approach:

-- Create a composite index matching the filter and the ORDER BY
-- (PostgreSQL syntax; INCLUDE requires v11+, and other engines offer
-- similar covering-index features)
CREATE INDEX idx_orders_cust_created
ON orders (customer_id, created_at DESC)
INCLUDE (order_id, total_amount, status);

SELECT order_id, total_amount, status, created_at
FROM orders
WHERE customer_id = 123
  AND created_at >= '2025-01-01'
ORDER BY created_at DESC
LIMIT 50;

Why it works: The index supports both the filter and the sort (no extra sort step). INCLUDE makes it a covering index for the selected columns, minimizing heap lookups.
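
To validate the change, compare plans before and after. A minimal check in PostgreSQL (other engines have equivalent EXPLAIN variants) might look like this; expect an Index Only Scan on idx_orders_cust_created and no separate Sort node:

EXPLAIN (ANALYZE, BUFFERS)
SELECT order_id, total_amount, status, created_at
FROM orders
WHERE customer_id = 123
  AND created_at >= '2025-01-01'
ORDER BY created_at DESC
LIMIT 50;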

Example 2 — Warehouse: pruning and broadcasting

Original query:

SELECT d.region, SUM(f.sales_amount) AS revenue
FROM fact_sales f
JOIN dim_date d ON f.date_key = d.date_key
WHERE d.calendar_date BETWEEN '2025-07-01' AND '2025-07-31'
GROUP BY d.region;

Improvements:

  • Partition fact_sales by date_key to enable pruning.
  • Project only needed columns: f.sales_amount and f.date_key.
  • Broadcast dim_date if it’s small to avoid shuffles.

Optimized query:

-- Optimized (hint placement and syntax vary by engine; Spark SQL style shown)
SELECT /*+ BROADCAST(d) */ d.region, SUM(f.sales_amount) AS revenue
FROM fact_sales f
JOIN dim_date d ON f.date_key = d.date_key
WHERE d.calendar_date >= '2025-07-01' AND d.calendar_date <= '2025-07-31'
GROUP BY d.region;

In some engines, filtering dim_date first into a small CTE and broadcasting it reduces shuffles.
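
A sketch of that variant, reusing the column names above (Spark-style hint; adjust per engine):

WITH july AS (
  SELECT date_key, region
  FROM dim_date
  WHERE calendar_date BETWEEN '2025-07-01' AND '2025-07-31'
)
SELECT /*+ BROADCAST(july) */ j.region, SUM(f.sales_amount) AS revenue
FROM fact_sales f
JOIN july j ON f.date_key = j.date_key
GROUP BY j.region;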

Example 3 — Data lake: make partition pruning work

Original query:

SELECT COUNT(*)
FROM events
WHERE DATE(event_ts) = '2025-07-15';

Problem: events is partitioned by event_date (string yyyy-mm-dd). The function on event_ts prevents pruning.

Optimized:

SELECT COUNT(*)
FROM events
WHERE event_date = '2025-07-15';

Further tips:

  • Ensure files are 128–512 MB each to avoid small file overhead.
  • Use only needed columns; Parquet with column pruning is very effective.
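
For reference, the layout assumed above could be declared roughly as follows (Spark SQL / Hive-style DDL; the non-partition columns are illustrative):

CREATE TABLE events (
  user_id    BIGINT,
  event_ts   TIMESTAMP,
  event_name STRING
)
USING PARQUET
PARTITIONED BY (event_date STRING);  -- yyyy-mm-dd; filters should target this column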

Example 4 — Aggregation strategy with pre-aggregation

Original:

SELECT product_id, DATE_TRUNC('month', sold_at) AS month, SUM(amount) AS sum_amount
FROM sales
GROUP BY product_id, DATE_TRUNC('month', sold_at);

Optimized approach:

  • Maintain a monthly_sales summary table, updated daily (see the refresh sketch below).
  • Query the summary for dashboards, keeping the heavy GROUP BY out of peak hours.

SELECT product_id, month, sum_amount
FROM monthly_sales
WHERE month BETWEEN '2025-01-01' AND '2025-12-01';
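
One way to keep monthly_sales fresh is an incremental upsert. A minimal sketch, assuming an engine that supports MERGE and recomputing only the current month:

MERGE INTO monthly_sales t
USING (
  SELECT product_id,
         DATE_TRUNC('month', sold_at) AS month,
         SUM(amount) AS sum_amount
  FROM sales
  WHERE sold_at >= DATE_TRUNC('month', CURRENT_DATE)  -- only the open month
  GROUP BY product_id, DATE_TRUNC('month', sold_at)
) s
ON t.product_id = s.product_id AND t.month = s.month
WHEN MATCHED THEN UPDATE SET sum_amount = s.sum_amount
WHEN NOT MATCHED THEN INSERT (product_id, month, sum_amount)
  VALUES (s.product_id, s.month, s.sum_amount);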

Hands-on exercises

These mirror the exercises below. You can complete them in any SQL engine you know. Focus on the concepts: pruning, projection, correct join strategy, and indexing.

  1. Exercise ex1: Optimize a slow OLTP-style query by designing a composite index and rewriting the SELECT to be covering. Validate with an EXPLAIN plan.
  2. Exercise ex2: Optimize a data lake query for partition pruning and reduced shuffles; rewrite filters and choose a broadcast join where appropriate.

Query review checklist

  • Are filters written to enable partition/index pruning? (No functions on filtered columns.)
  • Are only necessary columns selected?
  • Is the join order small-to-large and join keys selective?
  • Can a small table be broadcast to avoid a shuffle?
  • Are statistics fresh or cardinalities accurate?
  • Is there an unnecessary sort or DISTINCT? (See the semi-join sketch after this checklist.)
  • For data lakes: are files right-sized and formats columnar with compression?
  • Is there a materialized view or summary that fits the query?
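
On the DISTINCT point above: DISTINCT often masks join fan-out. A hypothetical customers/orders pair (names are illustrative) shows a common rewrite:

-- Fan-out then collapse: each matching order duplicates the customer row
SELECT DISTINCT c.customer_id, c.name
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
WHERE o.status = 'open';

-- Semi-join: no duplicates are produced, so nothing needs collapsing
SELECT c.customer_id, c.name
FROM customers c
WHERE EXISTS (
  SELECT 1
  FROM orders o
  WHERE o.customer_id = c.customer_id
    AND o.status = 'open'
);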

Common mistakes and how to self-check

  • Using functions on partition columns (prevents pruning). Self-check: Remove the function and compare scanned data.
  • SELECT * in analytical queries. Self-check: Count columns used; rewrite to explicit list and compare I/O.
  • Joining before filtering. Self-check: Filter and pre-aggregate inputs first; compare plan complexity.
  • Relying on stale stats. Self-check: Update or analyze stats, then EXPLAIN again; did join type change?
  • Ignoring data distribution. Self-check: Inspect skew on join keys (see the sketch after this list); consider salting, repartitioning, or broadcasting.
  • Too many small files. Self-check: List file counts and sizes; compact to target size and retest.
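
Two of these self-checks as concrete sketches (PostgreSQL ANALYZE shown; other engines use ANALYZE TABLE ... COMPUTE STATISTICS or similar; events is the table from Example 3):

-- Refresh optimizer statistics, then EXPLAIN again and compare join types
ANALYZE orders;

-- Inspect join-key skew: a few very large groups suggest a skewed key
SELECT user_id, COUNT(*) AS cnt
FROM events
GROUP BY user_id
ORDER BY cnt DESC
LIMIT 20;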

Practical projects

  • Design a sales star schema with partitioned fact table and 2–3 dimensions. Demonstrate 3 queries with pruning and a broadcast join.
  • Convert a CSV data lake to Parquet, set partitioning, and compact files. Show before/after scan metrics on the same query.
  • Create a materialized view for a weekly dashboard and measure latency reduction vs. raw query.

Learning path

  1. Start with pruning and projection: rewrite filters and SELECT lists.
  2. Master joins: order, type, broadcast vs. shuffle, and indexes for OLTP.
  3. Tackle storage layout: partitioning, clustering/sorting keys, file size, and formats.
  4. Add aggregation strategy: pre-aggregation, materialized views, and incremental refresh.
  5. Operationalize: stats maintenance, concurrency management, and workload isolation.

Mini challenge

You have a Parquet dataset events partitioned by event_date (yyyy-mm-dd). A frequent query counts events for the last 7 days and joins a small dim_users table to filter on user_type. Draft an optimized query that:

  • Prunes partitions using event_date without functions.
  • Projects only necessary columns.
  • Broadcasts dim_users.

One possible approach (hint syntax varies by engine; Spark SQL style shown):

WITH d AS (
  SELECT user_id
  FROM dim_users
  WHERE user_type = 'pro'
)
SELECT /*+ BROADCAST(d) */ COUNT(*)
FROM events e
JOIN d ON e.user_id = d.user_id
WHERE e.event_date BETWEEN '2025-07-09' AND '2025-07-15';
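
In production you would usually compute the 7-day window in the orchestration layer or with engine-specific date functions, keeping the comparison on the raw event_date partition column so pruning still applies.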

Next steps

  • Apply the checklist to 3 critical queries in your environment; record before/after metrics.
  • Propose a materialized view or summary table for one heavy dashboard.
  • Compact small files in one partitioned dataset and document improvements.


Practice Exercises

2 exercises to complete

Instructions

You are given a table orders(order_id PK, customer_id, status, total_amount, created_at, note TEXT). A dashboard calls this query:

EXPLAIN SELECT *
FROM orders
WHERE customer_id = 987 AND created_at >= '2025-01-01'
ORDER BY created_at DESC
LIMIT 100;

  1. Propose a composite index that supports both the filter and the ORDER BY.
  2. Rewrite the SELECT to be covering (no need to read the base table for common fields).
  3. State why your design reduces I/O and avoids a sort.

Expected Output

A composite index using (customer_id, created_at DESC) and a SELECT listing only needed columns, with a short explanation of why the plan becomes an index-only scan without extra sort.
