Why this matters
Most cloud warehouses charge for what your queries scan. Reducing scanned bytes keeps dashboards fast and bills low. As a BI Analyst, you will:
- Make daily dashboards run in minutes instead of hours by scanning only relevant partitions.
- Design queries that return the same results but read a fraction of the data.
- Recommend table structures (partitioning/clustering) that prevent runaway costs.
- Guard ad-hoc exploration with cost-aware patterns so teams can explore safely.
Concept explained simply
Query cost in many warehouses is largely driven by how many bytes your query reads from storage. Scan less, pay less, and finish earlier.
Simple mental model: Pay for what you read
Think of the warehouse like a library that charges per page you flip. If you know the chapter and subsection, you can jump there and flip fewer pages. In data terms, that means:
- Filter early (WHERE date = '2025-01-01').
- Read only needed columns (avoid SELECT *).
- Organize data to enable skipping chunks (partitions, clustering/sorting).
Core tactics to reduce scan size
- Select only columns you need. Avoid SELECT *. Columnar storage can skip untouched columns entirely.
- Filter by partition keys (e.g., event_date, load_date, region). Partition pruning prevents reading unrelated files.
- Add selective predicates early. Push filters down before joins and heavy aggregations.
- Order joins to reduce data early. Join filtered/aggregated subsets, not entire raw tables.
- Use clustering/sorting on frequently filtered columns to help skip data within partitions.
- Pre-aggregate into summary tables or materialized views for common rollups (daily, weekly, by segment).
- Use approximate functions or sampling for exploratory analysis when exactness isn't required (see the sketch after this list).
- Bound windows by time. Avoid unbounded scans when you only need the last N days.
- Avoid cross joins and unnecessary distincts; both can balloon data.
- Store analytic data in columnar formats with compression (often standard in warehouses). Use appropriate data types.
- LIMIT alone usually does not reduce scan size; it only limits output rows after reading.
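A quick sketch of the approximation tactic and the LIMIT caveat. APPROX_COUNT_DISTINCT is the spelling in BigQuery and Snowflake (Trino calls it approx_distinct); the user_events table here is illustrative.
-- Exact distinct count: precise but heavier to compute
SELECT COUNT(DISTINCT user_id) AS dau
FROM user_events
WHERE event_date = CURRENT_DATE;
-- Approximate: typically much cheaper, with a small error margin
SELECT APPROX_COUNT_DISTINCT(user_id) AS dau_approx
FROM user_events
WHERE event_date = CURRENT_DATE;
-- LIMIT trims output rows only; the engine may still scan every byte
SELECT user_id
FROM user_events
LIMIT 10;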
Worked examples
Example 1: Partition pruning with time filter
Goal: Daily sales by country for the last 7 days.
-- Inefficient: no partition filter used, scans entire table
SELECT country, DATE(order_ts) AS d, SUM(amount) AS revenue
FROM fact_orders
GROUP BY country, DATE(order_ts);
-- Better: filter by partition (event_date) and only needed columns
SELECT country, event_date AS d, SUM(amount) AS revenue
FROM fact_orders
WHERE event_date BETWEEN DATE '2025-01-01' AND DATE '2025-01-07'
GROUP BY country, event_date;
Effect: Instead of scanning a full year, the query scans only 7 partitions. Approximate impact: from ~2 TB to ~40 GB. Actual savings depend on table size and partition layout; treat these numbers as rough illustrations.
Example 2: Avoid SELECT *
-- Inefficient: reads all 80 columns
SELECT *
FROM user_events
WHERE event_date = CURRENT_DATE;
-- Better: read only 6 columns you actually use
SELECT user_id, session_id, event_date, event_type, item_id, price
FROM user_events
WHERE event_date = CURRENT_DATE;
Effect: Columnar engines skip unused columns, dropping scanned bytes roughly in proportion to the columns read (e.g., from 400 GB to 40 GB). Actual savings depend on column count and widths; treat these numbers as rough illustrations.
Example 3: Join after filtering and pre-aggregation
-- Inefficient: no time bound; joins the entire fact table, then aggregates
SELECT d.country, SUM(f.amount) AS revenue
FROM fact_orders f
JOIN dim_customers d ON f.customer_id = d.customer_id
GROUP BY d.country;
-- Better: time-bound and pre-aggregate the fact first
WITH f AS (
  SELECT customer_id, SUM(amount) AS amt
  FROM fact_orders
  WHERE event_date >= CURRENT_DATE - INTERVAL '30' DAY
  GROUP BY customer_id
)
SELECT d.country, SUM(f.amt) AS revenue
FROM f
JOIN dim_customers d ON f.customer_id = d.customer_id
GROUP BY d.country;
Effect: The join touches fewer rows and columns, reducing both scan and shuffle.
Step-by-step: apply this to your warehouse
- Identify your top 5 most expensive queries by scanned bytes or execution cost.
- For each, list the columns referenced and ensure the query selects only those columns.
- Add strict WHERE predicates, especially on partition keys (dates/regions). Time-box analyses.
- Rework join order: filter and pre-aggregate large facts before joining to dimensions.
- Check table design: confirm partitions on common filters and clustering on high-cardinality filters if supported.
- Create summary tables/materialized views for recurring dashboards (daily/weekly rollups); a sketch follows this list.
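A minimal sketch of the summary-table step, reusing the fact_orders example; materialized-view and refresh-scheduling syntax varies by warehouse:
-- One-time build of a daily rollup (refresh on a schedule)
CREATE TABLE sales_daily AS
SELECT event_date, country, SUM(amount) AS revenue, COUNT(*) AS orders
FROM fact_orders
GROUP BY event_date, country;
-- Dashboards read the small rollup instead of the raw fact table
SELECT event_date, SUM(revenue) AS revenue
FROM sales_daily
WHERE event_date >= CURRENT_DATE - INTERVAL '7' DAY
GROUP BY event_date;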
Tip: How to tell if a change helped
- Compare bytes scanned before vs after (see the example after this list).
- Note runtime and slot/warehouse usage changes.
- Confirm identical results with small sample checks.
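In engines that support EXPLAIN (e.g., Snowflake, Trino), the plan shows whether partition pruning applied and how large the estimated input is; BigQuery instead reports estimated bytes via its dry-run option. A minimal example:
-- Look for a partition filter and a smaller estimated input size
EXPLAIN
SELECT country, SUM(amount) AS revenue
FROM fact_orders
WHERE event_date = DATE '2025-01-07'
GROUP BY country;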
Exercises
Practice the changes safely on a non-production schema or with read-only access.
Exercise 1: Cut scan size with columns and partitions
Rewrite the query to scan only needed columns and relevant dates.
-- Starting point
SELECT *
FROM page_views
WHERE DATE(view_ts) BETWEEN DATE '2025-01-01' AND DATE '2025-01-31';
-- Target: Only the columns needed for daily unique users (UV) by country over the last 7 days
-- Columns you need: user_id, country, event_date
What to deliver
- A rewritten query using event_date partition filters.
- Explicit column selection (no SELECT *).
- Screenshot or note of reduced scanned bytes after running EXPLAIN/DRY RUN (if available).
Exercise 2: Join after pre-aggregation
Turn a heavy join into a cheaper one by pre-aggregating the fact table and filtering by date.
-- Starting point
SELECT c.segment, SUM(o.amount) AS revenue
FROM fact_orders o
JOIN dim_customers c ON o.customer_id = c.customer_id
GROUP BY c.segment;
-- Target:
-- 1) Filter last 14 days using event_date
-- 2) Pre-aggregate fact_orders by customer first
-- 3) Then join to dim_customers
What to deliver
- A rewritten query with a CTE that aggregates the fact first.
- Evidence of reduced scanned bytes vs the starting point.
Self-check checklist
- I removed SELECT * and listed only required columns.
- I applied partition filters using native partition columns (e.g., event_date).
- I pre-aggregated large facts before joining to dimensions.
- I verified identical results on a sample and noted lower scanned bytes.
Common mistakes and how to self-check
- Relying on LIMIT to save cost. Self-check: Compare scanned bytes; LIMIT rarely lowers it.
- Filtering on an expression of the partition column (e.g., DATE(ts)). Self-check: Filter directly on the partition column (event_date) to enable pruning; see the before/after sketch following this list.
- Joining full raw tables. Self-check: Ensure you filter and aggregate before the join.
- SELECT * in production dashboards. Self-check: Count selected columns; it should match what the viz actually uses.
- Unbounded time windows. Self-check: Confirm a specific date range is present.
- Not leveraging clustering/sorting. Self-check: For large partitions, check if clustering on common filters would help.
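A before/after sketch of the partition-filter mistake, assuming page_views is partitioned by event_date (as in Exercise 1); in many engines, wrapping or bypassing the partition column blocks pruning:
-- Pruning fails: the filter never touches the partition column
SELECT user_id
FROM page_views
WHERE DATE(view_ts) = DATE '2025-01-01';
-- Pruning works: filter the partition column directly
SELECT user_id
FROM page_views
WHERE event_date = DATE '2025-01-01';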
Practical projects
- Build a cost-and-scan dashboard: For top N queries, show bytes scanned, runtime, and owner. Add weekly trend.
- Partition refactor: Migrate a large table to be partitioned by event_date and clustered by user_id or country; measure impact (DDL sketch after this list).
- Summary layer: Create daily/weekly revenue and retention summary tables powering key dashboards; document savings.
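A sketch of the partition refactor in BigQuery-style DDL (the analytics dataset name is a placeholder; other warehouses express this differently, e.g., Snowflake via clustering keys):
-- One-time full copy, so SELECT * is acceptable here
CREATE TABLE analytics.fact_orders_v2
PARTITION BY event_date
CLUSTER BY country
AS SELECT * FROM analytics.fact_orders;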
Mini challenge
Given a marketing_events table (partitioned by event_date) and a clicks table (also partitioned), write a query that computes CTR by channel for the last 30 days. Ensure it scans only 30 days of each table, selects only needed columns, and pre-aggregates clicks and impressions before joining.
Hint
Use two CTEs that aggregate by channel and event_date, then join on both keys.
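One possible skeleton following the hint's structure; the channel column and COUNT(*)-based impression/click tallies are assumptions about the schemas:
WITH imp AS (
  SELECT channel, event_date, COUNT(*) AS impressions
  FROM marketing_events
  WHERE event_date >= CURRENT_DATE - INTERVAL '30' DAY
  GROUP BY channel, event_date
),
clk AS (
  SELECT channel, event_date, COUNT(*) AS clicks
  FROM clicks
  WHERE event_date >= CURRENT_DATE - INTERVAL '30' DAY
  GROUP BY channel, event_date
)
-- 1.0 * avoids integer division in engines that truncate
SELECT i.channel, 1.0 * SUM(c.clicks) / SUM(i.impressions) AS ctr
FROM imp i
JOIN clk c ON i.channel = c.channel AND i.event_date = c.event_date
GROUP BY i.channel;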
Learning path
- Next: Deep dive into partitioning and clustering strategies.
- Then: Materialized views and summary tables for recurring analytics.
- Finally: Query profiling and cost observability practices.
Who this is for
- BI Analysts who write SQL for dashboards and ad-hoc insights.
- Analytics Engineers maintaining semantic layers and models.
- Data-savvy PMs and marketers who run heavy exploratory queries.
Prerequisites
- Comfortable with SQL (SELECT, WHERE, GROUP BY, JOIN).
- Basic understanding of partitions and columnar storage concepts.
- Access to a warehouse where you can view scanned bytes or query plans.
Next steps
- Apply the exercises to a real dashboard query and record before/after metrics.
- Propose 1–2 new summary tables to your team and estimate cost savings.
- Take the quick test to confirm mastery.
Quick Test