
Reducing Scan And Shuffle Costs

Learn how to reduce scan and shuffle costs for free, with explanations, exercises, and a quick test for Analytics Engineers.

Published: December 23, 2025 | Updated: December 23, 2025

Why this matters

As an Analytics Engineer, you need dashboards and models to be fast and affordable. Two big drivers of cost and latency are scans (how much data you read) and shuffles (how much data you move across nodes during joins, aggregations, or sorts). Reducing them helps you:

  • Hit SLA/refresh windows for dashboards and reverse ETL jobs.
  • Cut cloud bills by reading fewer bytes and moving less data.
  • Make models more reliable and predictable under peak load.

Concept explained simply

Scan cost: how much data the warehouse reads from storage. Reading 2 columns for 10 days of data is far cheaper than SELECT * over a full year.

Shuffle cost: data is redistributed across nodes so rows with the same key meet (for joins/aggregations). The more rows and the less balanced the keys, the more expensive it gets.
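
A minimal illustration of both costs, using a hypothetical sales table (the names here are illustrative, not a real schema):

-- Scan: bytes read depend on the columns and partitions touched
SELECT user_id, amount                -- 2 columns, not SELECT *
FROM sales
WHERE sale_date >= '2025-01-01'
  AND sale_date < '2025-02-01';       -- prunes to one month of partitions

-- Shuffle: GROUP BY redistributes rows so equal keys land on the same node
SELECT user_id, SUM(amount) AS total
FROM sales
GROUP BY user_id;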

Mental model

Imagine your data as many boxes. Scan is how many boxes you open and how deep you dig inside each. Shuffle is pouring items from boxes into new piles by label. Open fewer boxes (partition pruning, filters) and create fewer/smaller piles (pre-aggregate, co-locate keys, broadcast small tables).

Core techniques to reduce scan and shuffle

Reduce scans (read less)

  • Filter early and on the storage layout: use WHERE on partition/date columns; avoid wrapping them in functions (bad: WHERE DATE(created_at) = '2025-01-01'; good: WHERE created_at >= '2025-01-01' AND created_at < '2025-01-02').
  • Select only needed columns; avoid SELECT * in production models.
  • Use partitioned/clustered/sorted tables; pick stable, selective keys (e.g., event_date, customer_id).
  • Maintain healthy file sizes (avoid millions of tiny files; target large, evenly sized files when possible).
  • Pre-aggregate at the grain you serve often (e.g., daily_device_metrics) instead of scanning raw events for every dashboard (see the sketch after this list).
  • Leverage materialized views or summary tables for hot queries.
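
To make the pre-aggregation bullet concrete, here is a hedged sketch of a daily summary table. Table and column names follow the examples later on this page; the PARTITION BY syntax is BigQuery-style and varies by engine.

-- Build the summary once; dashboards stop scanning raw events
CREATE TABLE analytics.daily_event_metrics
PARTITION BY event_date AS
SELECT DATE(event_time) AS event_date,
       event_type,
       COUNT(*) AS events
FROM analytics.events
GROUP BY DATE(event_time), event_type;

-- Hot queries now read a few small partitions instead of raw events
SELECT event_type, SUM(events) AS events
FROM analytics.daily_event_metrics
WHERE event_date >= '2025-01-25' AND event_date < '2025-02-01'
GROUP BY event_type;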

Reduce shuffles (move less)

  • Broadcast small dimension tables to avoid repartitioning big facts.
  • Join on well-distributed keys with matching data types; fix skewed keys (e.g., salt high-frequency keys).
  • Pre-aggregate before joining when downstream only needs grouped results.
  • Use bucketing/clustering/sort/distribution keys so rows that should meet already live together.
  • Avoid unnecessary ORDER BY and DISTINCT on massive datasets; push them after row-reduction steps.
  • Window functions: partition by selective keys and reduce input rows first (sketched below).
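
A hedged sketch of the last bullet, reducing rows before a window function (dates are illustrative; the table and columns follow the examples below):

-- Filter first, then window: the partition/sort step sees far fewer rows
WITH recent AS (
  SELECT user_id, event_time, event_type
  FROM analytics.events
  WHERE event_time >= '2025-01-01' AND event_time < '2025-02-01'
)
SELECT user_id, event_time, event_type,
       ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time) AS seq
FROM recent;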

5-step practical plan

  1. Baseline: capture read bytes, partitions scanned, shuffle read/write, and runtime. Save an EXPLAIN/QUERY PLAN (see the sketch after this list).
  2. Prune: add WHERE filters; remove unused columns; push filters inside subqueries/CTEs.
  3. Layout: set/adjust partitions and clustering; compact tiny files when needed.
  4. Joins: broadcast small tables; pre-aggregate; ensure keys are typed the same; handle skew.
  5. Validate: re-run with the same inputs; compare metrics; add regression checks.
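
For step 1, a hedged sketch of capturing a baseline plan. EXPLAIN keywords and output differ by engine (some offer EXPLAIN ANALYZE or a query-profile UI), so treat this as a pattern rather than exact syntax:

EXPLAIN
SELECT event_type, COUNT(*) AS events
FROM analytics.events
WHERE event_time >= '2025-01-01' AND event_time < '2025-02-01'
GROUP BY event_type;
-- In the plan, note: partitions/bytes scanned, the join strategy
-- (broadcast vs. shuffle/repartition), and any exchange operators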

Worked examples

Example 1 — Partition pruning + column pruning

Slow query:

-- Function on the partition column defeats pruning: the whole table is scanned, all columns
SELECT *
FROM analytics.events
WHERE DATE(event_time) BETWEEN '2025-01-01' AND '2025-01-31';

Optimized:

-- Reads only needed partitions and columns
SELECT event_time, user_id, event_type
FROM analytics.events
WHERE event_time >= '2025-01-01' AND event_time < '2025-02-01';

Why it helps: the filter no longer wraps the partition column in a function, so pruning works, and naming three columns instead of SELECT * keeps the read narrow. Fewer partitions and fewer columns get scanned.

Example 2 — Broadcast small dimension to cut shuffle

Original join causes repartition of large fact:

SELECT f.order_id, d.segment
FROM fact_orders f
JOIN dim_customers d ON f.customer_id = d.customer_id;

Optimized when dim is small:

/* Hint syntax varies by engine */
SELECT f.order_id, d.segment
FROM fact_orders f
JOIN /*+ BROADCAST(d) */ dim_customers d
  ON f.customer_id = d.customer_id;

Why it helps: sending the small table to all workers avoids shuffling the big fact by join key.

Example 3 — Pre-aggregate before join

Original:

SELECT c.country, SUM(f.amount) AS revenue
FROM fact_orders f
JOIN dim_customers c ON f.customer_id = c.customer_id
WHERE f.order_date >= '2025-01-01' AND f.order_date < '2025-02-01'
GROUP BY c.country;

Optimized:

WITH agg AS (
  SELECT customer_id, SUM(amount) AS revenue
  FROM fact_orders
  WHERE order_date >= '2025-01-01' AND order_date < '2025-02-01'
  GROUP BY customer_id
)
SELECT c.country, SUM(a.revenue) AS revenue
FROM agg a
JOIN dim_customers c ON a.customer_id = c.customer_id
GROUP BY c.country;

Why it helps: aggregate the large fact first to reduce rows before the join, cutting shuffle volume.

Example 4 — Handling skew with salting

If one customer_id appears in a huge fraction of rows, a single partition becomes a hotspot.

-- Add a salt for the hot key, aggregate per (key, salt), then strip the salt
WITH f AS (
  SELECT customer_id,
         CASE WHEN customer_id = 42      -- 42 = the known hot key
              THEN MOD(order_id, 16)     -- spread it across 16 buckets
              ELSE 0 END AS salt,
         amount
  FROM fact_orders
),
partial AS (                             -- stage 1: well-distributed groups
  SELECT customer_id, salt, SUM(amount) AS amt
  FROM f
  GROUP BY customer_id, salt
)
SELECT customer_id, SUM(amt) AS total_amount  -- stage 2: cheap final merge
FROM partial
GROUP BY customer_id;

Why it helps: grouping by (customer_id, salt) spreads the heavy key across 16 partitions, so no single node does most of the work; the small second aggregation strips the salt and restores the per-customer grain.

Exercises

Exercise 1 — Rewrite to prune partitions and columns

Current query:

SELECT *
FROM web.events
WHERE DATE(event_ts) BETWEEN '2024-07-01' AND '2024-07-31'
  AND country IN ('US', 'CA');

Goal: Reduce scan and keep results identical.

  • Assume events is partitioned by event_ts and clustered by user_id.
  • Return only event_ts, user_id, country, page.
Hints
  • Remove functions on partition columns.
  • Avoid SELECT *.
  • If possible, narrow the date range using >= and < operators.
Solution
SELECT event_ts, user_id, country, page
FROM web.events
WHERE event_ts >= '2024-07-01' AND event_ts < '2024-08-01'
  AND country IN ('US', 'CA');

Expected effect: scan only July partitions and 4 columns.

Exercise 2 — Join rewrite: broadcast + pre-aggregate

Original:

SELECT o.order_id, c.segment, SUM(o.amount) AS amt
FROM fact_orders o
JOIN dim_customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= '2025-01-01'
GROUP BY o.order_id, c.segment;

Goal: Reduce shuffle. Assume dim_customers is small.

Hints
  • Aggregate the large table before joining.
  • Broadcast the small dimension table.
  • Group at the final grain needed.
Solution
WITH agg AS (
  SELECT order_id, customer_id, SUM(amount) AS amt
  FROM fact_orders
  WHERE order_date >= '2025-01-01'
  GROUP BY order_id, customer_id
)
SELECT a.order_id, c.segment, a.amt
FROM agg a
JOIN /*+ BROADCAST(c) */ dim_customers c
  ON a.customer_id = c.customer_id;

Expected effect: fewer rows shuffled; small table broadcast.

Preflight checklist for any query

  • Does the WHERE clause match partition columns without wrapping functions?
  • Are only needed columns selected?
  • Are joins done on matching data types and selective keys? (A casting sketch follows this list.)
  • Can a small table be broadcast?
  • Can you pre-aggregate before joining or windowing?
  • Are ORDER BY / DISTINCT used only after row reduction?
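
For the data-type item, a hedged sketch of aligning join key types explicitly (the VARCHAR-to-BIGINT mismatch here is hypothetical):

-- Cast one side so both keys share a type; better yet, fix the
-- column type in the model so no runtime cast is needed at all
SELECT f.order_id, d.segment
FROM fact_orders f
JOIN dim_customers d
  ON f.customer_id = CAST(d.customer_id AS BIGINT);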

Common mistakes and how to self-check

  • Using SELECT * in production — compare projected columns vs all; measure read bytes.
  • Wrapping partition columns in functions — check query plan for partition pruning.
  • Joining before aggregating — inspect shuffle metrics; try aggregating first.
  • Mismatched join key types (e.g., string vs int) — leads to full reshuffle; align types.
  • Ignoring skew — examine top key frequencies (a frequency query is sketched below); consider salting or alternative keys.
  • Tiny files problem — compaction may be needed to avoid overhead.
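
To spot skew before it bites, a quick frequency check on the join/group key (a hedged sketch using tables from the examples above):

-- If the top key holds a large share of all rows, plan for salting
SELECT customer_id, COUNT(*) AS rows_per_key
FROM fact_orders
GROUP BY customer_id
ORDER BY rows_per_key DESC
LIMIT 20;
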
Self-check prompts
  • What fraction of total table bytes did your query read? Is that proportionate to the date range?
  • Did shuffle read/write drop after your rewrite?
  • Is the output row count identical? If not, which transformation changed the grain? (An equality check is sketched below.)
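
For the grain prompt, a hedged sketch of a full-output comparison; v_old and v_new are hypothetical views over the two query versions, and some engines spell the operator EXCEPT DISTINCT or MINUS:

-- Symmetric difference: returns zero rows when both outputs match exactly
SELECT * FROM (SELECT * FROM v_old EXCEPT SELECT * FROM v_new) AS missing_in_new
UNION ALL
SELECT * FROM (SELECT * FROM v_new EXCEPT SELECT * FROM v_old) AS extra_in_new;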

Practical projects

  • Refactor a sluggish dashboard model: apply partition pruning, projection, and a broadcast join; document before/after metrics.
  • Design a daily summary table for a top 5 product metric and replace a query that scans raw events.
  • Create a test harness that logs read bytes and shuffle metrics for your 3 slowest jobs nightly.

Who this is for

  • Analytics Engineers and BI Developers maintaining warehouse models.
  • Data Engineers optimizing ELT pipelines.

Prerequisites

  • Comfortable with SQL (joins, GROUP BY, window functions).
  • Basic understanding of partitions/clustering and query plans.

Learning path

  1. Foundations: storage layouts (partitions, clustering, file sizing).
  2. Query rewrites: projection, predicate pushdown, join order.
  3. Advanced shuffle control: broadcast, bucketing, skew handling.
  4. Operationalize: metrics baselining, regression checks, compaction cadence.

Next steps

  • Pick one production model; collect baseline metrics.
  • Apply 2–3 techniques; re-measure; commit changes with notes.
  • Add a CI check that flags SELECT * and function-wrapped partition filters.

Mini challenge

You have a 2 TB fact table partitioned by event_date and a 50 MB dim table. A dashboard needs the last 7 days' revenue by device_type. Write an outline that ensures only 7 partitions are read, the small table is broadcast, and aggregation happens before the join. Then list two metrics you will compare before vs. after.

Reducing Scan And Shuffle Costs — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
