Why this matters
Most cloud warehouses charge for what your queries scan. Reducing scanned bytes keeps dashboards fast and bills low. As a BI Analyst, you will:
- Make daily dashboards run in minutes instead of hours by scanning only relevant partitions.
- Design queries that return the same results but read a fraction of the data.
- Recommend table structures (partitioning/clustering) that prevent runaway costs.
- Guard ad-hoc exploration with cost-aware patterns so teams can explore safely.
Concept explained simply
Query cost in many warehouses is largely driven by how many bytes your query reads from storage. Scan less, pay less, and finish earlier.
Simple mental model: Pay for what you read
Think of the warehouse like a library that charges per page you flip. If you know the chapter and subsection, you can jump there and flip fewer pages. In data terms, that means:
- Filter early (WHERE date = '2025-01-01').
- Read only needed columns (avoid SELECT *).
- Organize data to enable skipping chunks (partitions, clustering/sorting).
Core tactics to reduce scan size
- Select only columns you need. Avoid SELECT *. Columnar storage can skip untouched columns entirely.
- Filter by partition keys (e.g., event_date, load_date, region). Partition pruning prevents reading unrelated files.
- Add selective predicates early. Push filters down before joins and heavy aggregations.
- Order joins to reduce data early. Join filtered/aggregated subsets, not entire raw tables.
- Use clustering/sorting on frequently filtered columns to help skip data within partitions.
- Pre-aggregate into summary tables or materialized views for common rollups (daily, weekly, by segment).
- Use approximate functions or sampling for exploratory analysis when exactness isn't required (see the sketch after this list).
- Bound windows by time. Avoid unbounded scans when you only need the last N days.
- Avoid cross joins and unnecessary distincts; both can balloon data.
- Store analytic data in columnar formats with compression (often standard in warehouses). Use appropriate data types.
- LIMIT alone usually does not reduce scan size; it only limits output rows after reading.
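A quick sketch of the approximation tactic and the LIMIT caveat. APPROX_COUNT_DISTINCT is the spelling in BigQuery and Snowflake (Trino calls it approx_distinct); the user_events table here is illustrative.
-- Exact distinct count: precise but heavier to compute
SELECT COUNT(DISTINCT user_id) AS dau
FROM user_events
WHERE event_date = CURRENT_DATE;
-- Approximate: typically much cheaper, with a small error margin
SELECT APPROX_COUNT_DISTINCT(user_id) AS dau_approx
FROM user_events
WHERE event_date = CURRENT_DATE;
-- LIMIT trims output rows only; the engine may still scan every byte
SELECT user_id
FROM user_events
LIMIT 10;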
Worked examples
Example 1: Partition pruning with time filter
Goal: Daily sales by country for the last 7 days.
-- Inefficient: no partition filter used, scans entire table
SELECT country, DATE(order_ts) AS d, SUM(amount) AS revenue
FROM fact_orders
GROUP BY country, DATE(order_ts);
-- Better: filter by partition (event_date) and only needed columns
SELECT country, event_date AS d, SUM(amount) AS revenue
FROM fact_orders
WHERE event_date BETWEEN DATE '2025-01-01' AND DATE '2025-01-07'
GROUP BY country, event_date;
Effect: Instead of scanning a full year, the query scans only 7 partitions. Approximate impact: from ~2 TB to ~40 GB. Actual savings depend on table size and partition layout; treat these numbers as rough illustrations.
Example 2: Avoid SELECT *
-- Inefficient: reads all 80 columns
SELECT *
FROM user_events
WHERE event_date = CURRENT_DATE;
-- Better: read only 6 columns you actually use
SELECT user_id, session_id, event_date, event_type, item_id, price
FROM user_events
WHERE event_date = CURRENT_DATE;
Effect: Columnar engines skip unused columns, dropping scanned bytes roughly in proportion to the columns read (e.g., from 400 GB to 40 GB). Actual savings depend on column count and widths; treat these numbers as rough illustrations.
Example 3: Join after filtering and pre-aggregation
-- Inefficient: no time bound; joins the entire fact table, then aggregates
SELECT d.country, SUM(f.amount) AS revenue
FROM fact_orders f
JOIN dim_customers d ON f.customer_id = d.customer_id
GROUP BY d.country;
-- Better: time-bound and pre-aggregate the fact first
WITH f AS (
  SELECT customer_id, SUM(amount) AS amt
  FROM fact_orders
  WHERE event_date >= CURRENT_DATE - INTERVAL '30' DAY
  GROUP BY customer_id
)
SELECT d.country, SUM(f.amt) AS revenue
FROM f
JOIN dim_customers d ON f.customer_id = d.customer_id
GROUP BY d.country;
Effect: The join touches fewer rows and columns, reducing both scan and shuffle.
Step-by-step: apply this to your warehouse
- Identify your top 5 most expensive queries by scanned bytes or execution cost.
- For each, list the columns referenced and ensure the query selects only those columns.
- Add strict WHERE predicates, especially on partition keys (dates/regions). Time-box analyses.
- Rework join order: filter and pre-aggregate large facts before joining to dimensions.
- Check table design: confirm partitions on common filters and clustering on high-cardinality filters if supported.
- Create summary tables/materialized views for recurring dashboards (daily/weekly rollups); a sketch follows this list.
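A minimal sketch of the summary-table step, reusing the fact_orders example; materialized-view and refresh-scheduling syntax varies by warehouse:
-- One-time build of a daily rollup (refresh on a schedule)
CREATE TABLE sales_daily AS
SELECT event_date, country, SUM(amount) AS revenue, COUNT(*) AS orders
FROM fact_orders
GROUP BY event_date, country;
-- Dashboards read the small rollup instead of the raw fact table
SELECT event_date, SUM(revenue) AS revenue
FROM sales_daily
WHERE event_date >= CURRENT_DATE - INTERVAL '7' DAY
GROUP BY event_date;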
Tip: How to tell if a change helped
- Compare bytes scanned before vs after (see the example after this list).
- Note runtime and slot/warehouse usage changes.
- Confirm identical results with small sample checks.
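In engines that support EXPLAIN (e.g., Snowflake, Trino), the plan shows whether partition pruning applied and how large the estimated input is; BigQuery instead reports estimated bytes via its dry-run option. A minimal example:
-- Look for a partition filter and a smaller estimated input size
EXPLAIN
SELECT country, SUM(amount) AS revenue
FROM fact_orders
WHERE event_date = DATE '2025-01-07'
GROUP BY country;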
Exercises
Practice the changes safely on a non-production schema or with read-only access.
Exercise 1: Cut scan size with columns and partitions
Rewrite the query to scan only needed columns and relevant dates.
-- Starting point
SELECT *
FROM page_views
WHERE DATE(view_ts) BETWEEN DATE '2025-01-01' AND DATE '2025-01-31';
-- Target: Only the columns needed for daily unique users (UV) by country over the last 7 days
-- Columns you need: user_id, country, event_date
What to deliver
- A rewritten query using event_date partition filters.
- Explicit column selection (no SELECT *).
- Screenshot or note of reduced scanned bytes after running EXPLAIN/DRY RUN (if available).
Exercise 2: Join after pre-aggregation
Turn a heavy join into a cheaper one by pre-aggregating the fact table and filtering by date.
-- Starting point
SELECT c.segment, SUM(o.amount) AS revenue
FROM fact_orders o
JOIN dim_customers c ON o.customer_id = c.customer_id
GROUP BY c.segment;
-- Target:
-- 1) Filter last 14 days using event_date
-- 2) Pre-aggregate fact_orders by customer first
-- 3) Then join to dim_customers
What to deliver
- A rewritten query with a CTE that aggregates the fact first.
- Evidence of reduced scanned bytes vs the starting point.
Self-check checklist
- I removed SELECT * and listed only required columns.
- I applied partition filters using native partition columns (e.g., event_date).
- I pre-aggregated large facts before joining to dimensions.
- I verified identical results on a sample and noted lower scanned bytes.
Common mistakes and how to self-check
- Relying on LIMIT to save cost. Self-check: Compare scanned bytes; LIMIT rarely lowers it.
- Filtering on an expression of the partition column (e.g., DATE(ts)). Self-check: Filter directly on the partition column (event_date) to enable pruning; see the before/after sketch following this list.
- Joining full raw tables. Self-check: Ensure you filter and aggregate before the join.
- SELECT * in production dashboards. Self-check: Count selected columns; it should match what the viz actually uses.
- Unbounded time windows. Self-check: Confirm a specific date range is present.
- Not leveraging clustering/sorting. Self-check: For large partitions, check if clustering on common filters would help.
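A before/after sketch of the partition-filter mistake, assuming page_views is partitioned by event_date (as in Exercise 1); in many engines, wrapping or bypassing the partition column blocks pruning:
-- Pruning fails: the filter never touches the partition column
SELECT user_id
FROM page_views
WHERE DATE(view_ts) = DATE '2025-01-01';
-- Pruning works: filter the partition column directly
SELECT user_id
FROM page_views
WHERE event_date = DATE '2025-01-01';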
Practical projects
- Build a cost-and-scan dashboard: For top N queries, show bytes scanned, runtime, and owner. Add weekly trend.
- Partition refactor: Migrate a large table to be partitioned by event_date and clustered by user_id or country; measure impact (DDL sketch after this list).
- Summary layer: Create daily/weekly revenue and retention summary tables powering key dashboards; document savings.
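A sketch of the partition refactor in BigQuery-style DDL (the analytics dataset name is a placeholder; other warehouses express this differently, e.g., Snowflake via clustering keys):
-- One-time full copy, so SELECT * is acceptable here
CREATE TABLE analytics.fact_orders_v2
PARTITION BY event_date
CLUSTER BY country
AS SELECT * FROM analytics.fact_orders;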
Mini challenge
Given a marketing_events table (partitioned by event_date) and a clicks table (also partitioned), write a query that computes CTR by channel for the last 30 days. Ensure it scans only 30 days of each table, selects only needed columns, and pre-aggregates clicks and impressions before joining.
Hint
Use two CTEs that aggregate by channel and event_date, then join on both keys.
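One possible skeleton following the hint's structure; the channel column and COUNT(*)-based impression/click tallies are assumptions about the schemas:
WITH imp AS (
  SELECT channel, event_date, COUNT(*) AS impressions
  FROM marketing_events
  WHERE event_date >= CURRENT_DATE - INTERVAL '30' DAY
  GROUP BY channel, event_date
),
clk AS (
  SELECT channel, event_date, COUNT(*) AS clicks
  FROM clicks
  WHERE event_date >= CURRENT_DATE - INTERVAL '30' DAY
  GROUP BY channel, event_date
)
-- 1.0 * avoids integer division in engines that truncate
SELECT i.channel, 1.0 * SUM(c.clicks) / SUM(i.impressions) AS ctr
FROM imp i
JOIN clk c ON i.channel = c.channel AND i.event_date = c.event_date
GROUP BY i.channel;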
Learning path
- Next: Deep dive into partitioning and clustering strategies.
- Then: Materialized views and summary tables for recurring analytics.
- Finally: Query profiling and cost observability practices.
Who this is for
- BI Analysts who write SQL for dashboards and ad-hoc insights.
- Analytics Engineers maintaining semantic layers and models.
- Data-savvy PMs and marketers who run heavy exploratory queries.
Prerequisites
- Comfortable with SQL (SELECT, WHERE, GROUP BY, JOIN).
- Basic understanding of partitions and columnar storage concepts.
- Access to a warehouse where you can view scanned bytes or query plans.
Next steps
- Apply the exercises to a real dashboard query and record before/after metrics.
- Propose 1–2 new summary tables to your team and estimate cost savings.
- Take the quick test to confirm mastery.
Quick Test