Why this matters
For a Data Platform Engineer, slow queries delay pipelines, break SLAs, and inflate compute costs. Real tasks include:
- Speeding up BI dashboards that time out during peak hours.
- Reducing ETL model runtimes from 40 minutes to under 10.
- Cutting warehouse spend by fixing queries that scan terabytes unnecessarily.
Mastering a repeatable approach lets you respond fast and prevent regressions.
Concept explained simply
Queries are slow for two big reasons:
- Too much work: reading/scanning many bytes, shuffling/sorting large data, joining huge intermediates.
- Wrong work: poor plans caused by functions on columns, bad join conditions, stale stats, or non-selective filters.
Mental model
Picture your query as a funnel:
- Project less: avoid SELECT *. Keep only required columns.
- Filter early: make filters sargable (index/partition friendly) so the engine skips data.
- Join smart: join small-to-large, use correct keys, and prefer broadcasting tiny tables.
- Aggregate/window late but efficiently: pre-aggregate where possible to shrink inputs.
Always measure with EXPLAIN/PROFILE to see where time and bytes are spent.
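For example, most engines let you compare estimated and actual work directly. Below is a minimal sketch in PostgreSQL-style syntax against the fact_events table used later in Example 1; the exact keyword (EXPLAIN ANALYZE, a profile tab, or query history) varies by warehouse.
-- Sketch: inspect the plan for a candidate rewrite (PostgreSQL-style syntax;
-- other engines expose the same information via EXPLAIN or a query profile UI)
EXPLAIN ANALYZE
SELECT user_id, created_at, revenue
FROM fact_events
WHERE created_at >= TIMESTAMP '2025-05-10 00:00:00'
  AND created_at < TIMESTAMP '2025-05-11 00:00:00';
-- Look for: operators with the highest time, rows/bytes read, and large gaps
-- between estimated and actual row counts.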
A practical checklist
- [ ] Confirm it is slow: compare runtime to historical baseline and note input sizes.
- [ ] Get a plan/profile: run EXPLAIN/EXPLAIN ANALYZE or query profile; record operators with highest time/bytes.
- [ ] Validate filters: avoid functions on columns (e.g., DATE(ts)); use ranges and correct data types.
- [ ] Reduce columns: replace SELECT * with only needed fields.
- [ ] Enable pruning: ensure filters hit partition/sort/cluster keys. Refresh/analyze statistics if estimates are off.
- [ ] Fix joins: correct keys, remove accidental cross joins, broadcast small tables, pre-aggregate large ones.
- [ ] Check skew: identify hot keys; mitigate with salting or adaptive/skew joins if supported.
- [ ] Tame heavy ops: rewrite window functions, decompose complex CTEs, avoid UDFs in tight loops.
- [ ] Soften sorts: avoid global sorts; use TOP-K patterns or partial sorts when possible (see the top-K sketch after this list).
- [ ] Check concurrency: if slow only at peak, consider scaling resources or dedicated queues/warehouses.
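For the top-K item, a minimal sketch reusing the fact_payments table from Example 3 below (the date filter is illustrative): express the top-N directly with ORDER BY ... LIMIT so most engines can keep a small bounded heap instead of performing a full global sort.
-- Sketch: top 20 payments by amount without sorting everything
SELECT user_id, event_time, amount
FROM fact_payments
WHERE event_time >= DATE '2025-06-01'
ORDER BY amount DESC
LIMIT 20;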
Worked examples
Example 1 — Non-sargable date filter causes full scan
-- Before (slow): scans partitions fully
SELECT user_id, created_at, revenue
FROM fact_events
WHERE DATE(created_at) = '2025-05-10';
What the profile shows
Large scan with poor pruning; filter applied after read. High bytes scanned vs small output.
Rewrite
-- After (fast): range filter enables pruning
SELECT user_id, created_at, revenue
FROM fact_events
WHERE created_at >= TIMESTAMP '2025-05-10 00:00:00'
AND created_at < TIMESTAMP '2025-05-11 00:00:00';
-- Also: remove unused columns to reduce I/O
Example 2 — Join explosion from wrong predicate
-- Before (slow): join fan-out and skew from problem join keys
SELECT f.order_id, d.country_name
FROM fact_orders f
JOIN dim_country d
ON f.country_code = d.country_code
WHERE f.order_date >= DATE '2025-06-01';
-- Many fact rows have f.country_code NULL: they never match, yet they are still shuffled,
-- and duplicate/default rows in dim_country turn matching codes into many-to-many fan-out
What the profile shows
Huge intermediate after join, massive shuffle, skew on a few keys.
Rewrite
-- Enforce valid keys and broadcast tiny dim
SELECT /*+ BROADCAST(d) */ f.order_id, d.country_name
FROM fact_orders f
JOIN dim_country d
ON f.country_code = d.country_code
WHERE f.order_date >= DATE '2025-06-01'
AND f.country_code IS NOT NULL;
-- If hints unsupported: pre-filter dim and ensure NOT NULL constraints upstream
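One way to apply that comment, as a minimal sketch assuming the fan-out comes from duplicate/default rows in dim_country: slim the dimension to one row per code before joining and keep the NULL filter. This is not the only option; fixing the data upstream is usually better.
-- Hint-free version: force one (country_code, country_name) row per code so the join cannot fan out
SELECT f.order_id, d.country_name
FROM fact_orders f
JOIN (
  SELECT country_code, MAX(country_name) AS country_name
  FROM dim_country
  WHERE country_code IS NOT NULL
  GROUP BY country_code
) d
  ON f.country_code = d.country_code
WHERE f.order_date >= DATE '2025-06-01'
  AND f.country_code IS NOT NULL;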
Example 3 — Heavy window function
-- Before: window over entire dataset
SELECT user_id, event_time,
SUM(amount) OVER (PARTITION BY user_id ORDER BY event_time
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_amount
FROM fact_payments
WHERE event_time >= DATE '2025-01-01';
What the profile shows
Large sort and spill due to wide partition and unbounded window.
Rewrite
-- Pre-aggregate by day, then do a smaller window
WITH daily AS (
SELECT user_id, DATE(event_time) AS d, SUM(amount) AS amt
FROM fact_payments
WHERE event_time >= DATE '2025-01-01'
GROUP BY user_id, DATE(event_time)
)
SELECT user_id, d,
SUM(amt) OVER (PARTITION BY user_id ORDER BY d
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_amount
FROM daily;
Exercises
Work through these here, then compare your answers with the sample solutions that follow.
Exercise 1 — Make the filter sargable
You have:
SELECT *
FROM events
WHERE DATE(created_at) = '2025-05-10'
AND LOWER(device_type) = 'ios';
- Rewrite to enable partition/index pruning on created_at.
- Avoid scanning unnecessary columns.
- Assume device_type is stored consistently lowercase (no need for LOWER()).
Hint
Use a half-open time range and project only the columns you need.
One possible answer
SELECT user_id, created_at, device_type
FROM events
WHERE created_at >= TIMESTAMP '2025-05-10 00:00:00'
AND created_at < TIMESTAMP '2025-05-11 00:00:00'
AND device_type = 'ios';
Exercise 2 — Choose an efficient join strategy
Tables: fact_sales (~2B rows), dim_store (~5k rows). You need revenue by region for the last 7 days.
-- Current query (slow)
SELECT s.region, SUM(f.revenue) AS total_rev
FROM fact_sales f
JOIN dim_store s ON f.store_id = s.store_id
WHERE f.sale_ts >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY s.region;
- Rewrite to reduce shuffled data. Consider broadcasting dim_store or pre-aggregating fact_sales.
Hint
Aggregate the fact table before the join or broadcast the small dimension.
One possible answer
WITH fact_7d AS (
SELECT store_id, SUM(revenue) AS rev
FROM fact_sales
WHERE sale_ts >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY store_id
)
SELECT /*+ BROADCAST(s) */ s.region, SUM(f.rev) AS total_rev
FROM fact_7d f
JOIN dim_store s ON f.store_id = s.store_id
GROUP BY s.region;
Self-check checklist
- [ ] My rewrites reduced scanned bytes and/or removed unnecessary columns.
- [ ] My joins either broadcasted small tables or shrank large tables before joining.
- [ ] My filters are range-based and do not apply functions to columns.
Common mistakes and how to self-check
- Mistake: Tuning warehouse size first. Fix: Profile and reduce work before scaling compute.
- Mistake: SELECT *. Fix: Project only needed columns and re-measure.
- Mistake: Functions on columns (e.g., DATE(col)). Fix: Use range predicates.
- Mistake: Ignoring skew. Fix: Inspect key distributions; salt or split hot keys (see the salting sketch after this list).
- Mistake: Blindly trusting estimates. Fix: Compare estimated vs actual rows; refresh stats if needed.
- Mistake: Complex CTE chains that block predicate pushdown. Fix: Inline or materialize strategically; verify filter pushdown in profile.
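The salting idea, as a minimal sketch reusing the tables from Example 2: spread the large side of a skewed join across N buckets and replicate the small side into every bucket. The hash/modulo function names vary by engine, and in practice you would usually salt only the identified hot keys rather than every row.
-- Sketch: salt a skewed join key across 4 buckets (engine-specific function names may differ)
WITH fact_salted AS (
  SELECT f.*, MOD(ABS(HASH(f.order_id)), 4) AS salt   -- assign each fact row to a bucket
  FROM fact_orders f
),
dim_salted AS (
  SELECT d.*, s.salt                                   -- replicate each dim row into every bucket
  FROM dim_country d
  CROSS JOIN (SELECT 0 AS salt UNION ALL SELECT 1
              UNION ALL SELECT 2 UNION ALL SELECT 3) s
)
SELECT f.order_id, d.country_name
FROM fact_salted f
JOIN dim_salted d
  ON f.country_code = d.country_code
 AND f.salt = d.salt;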
Practical projects
- Build a query tuning runbook: include screenshots of plans before/after, byte scans, and the exact code changes.
- Create a top-k pattern library: implement fast top-N per group without global sorts; document when to use (a starter sketch follows this list).
- Skew lab: fabricate skewed data and demonstrate two mitigation techniques (salting and pre-splitting hot keys).
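As a starting point for the top-k library, a minimal sketch using fact_payments from Example 3 (the N of 3 is arbitrary): rank rows inside each group so the engine sorts per partition instead of globally. Engines that support QUALIFY can express this without the outer query.
-- Sketch: top 3 payments per user without a global sort
SELECT user_id, event_time, amount
FROM (
  SELECT user_id, event_time, amount,
         ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY amount DESC) AS rn
  FROM fact_payments
) ranked
WHERE rn <= 3;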
Who this is for, prerequisites, and learning path
Who this is for
Data Platform Engineers, Analytics Engineers, and Data Scientists who own or debug warehouse workloads.
Prerequisites
- Comfort with SQL (joins, aggregates, window functions).
- Basic understanding of your warehouse’s EXPLAIN/PROFILE tools.
- Familiarity with partitioning, clustering, and statistics concepts.
Learning path
- Start here: troubleshooting slow queries.
- Next: partitioning/clustering strategies and cost-aware SQL patterns.
- Then: workload management, concurrency controls, and autoscaling strategies.
Next steps
- Refactor one slow production query using the checklist; document impact.
- Schedule stats maintenance on critical tables (see the sketch after this list).
- Prepare a small internal guide with 5 common rewrites for your team.
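For the stats maintenance step, the exact command is engine-specific; a minimal sketch in PostgreSQL-style syntax with table names taken from the examples above (some cloud warehouses refresh statistics automatically):
-- Refresh planner statistics so row estimates stay close to reality
ANALYZE fact_orders;
ANALYZE fact_payments;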
Mini challenge
Your query selects 8 columns, filters on DATE(event_ts) for yesterday, joins a 1k-row dim table, and does ORDER BY amount DESC LIMIT 20. It scans 1.5 TB and sorts 400M rows. Pick three actions that likely give the biggest win:
- Rewrite the filter to a half-open time range for pruning.
- Project only the columns used by ORDER BY and final output.
- Broadcast the small dimension and pre-aggregate fact data before the sort.
Tip: After changes, re-profile to confirm reduced scanned bytes and sorted rows.