Why this matters
As a BI Analyst, your dashboards, scheduled reports, and ad‑hoc queries must be fast and reliable. Slow queries waste stakeholder time, inflate warehouse costs, and jeopardize SLAs with leadership. Monitoring helps you catch issues early, focus on the biggest offenders, and prove improvements with data.
- Real tasks you will face: identifying which queries slow a dashboard; explaining why a job exceeded budget; reducing p95 dashboard latency; comparing performance before vs. after an optimization.
- Impact: fewer timeouts, predictable costs, happier stakeholders, and data you can trust during peak usage.
Who this is for
- BI Analysts and Analytics Engineers who write SQL and own dashboards or scheduled jobs.
- Data-savvy product or operations analysts who troubleshoot slow reports.
Prerequisites
- Comfort reading SQL (SELECT, JOIN, GROUP BY).
- Basic understanding of schemas, indexes/partitioning, and execution plans.
- Access to a SQL environment (e.g., PostgreSQL/MySQL/SQL Server, Snowflake/BigQuery/Redshift).
Concept explained simply
A query is slow when its total time exceeds your target (e.g., p95 dashboard load time < 3s). Bottlenecks are the dominant reasons for slowness—like scanning too many rows, waiting on locks, or spilling to disk.
Mental model
Think of data flow like water in pipes: the narrowest pipe (bottleneck) limits speed. Your job is to measure, find the narrowest point, and widen it. Repeat until performance meets the target.
- Measure: execution time, rows scanned vs. returned, memory use, I/O, CPU, temp spills, wait/lock events, cache hit ratios (see the measurement sketch after this list).
- Find: identify the heaviest step (e.g., table scan, sort, join, window function).
- Fix: add selective filters, add indexes or enable partition pruning, pre-aggregate, adjust warehouse resources, or rewrite the query.
- Verify: compare before/after with the same inputs and real user paths.
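As a concrete starting point for the Measure step, here is a minimal PostgreSQL sketch; the orders table and its columns are hypothetical (they mirror the worked example further down), and other engines expose the same numbers through their query profiles.
-- EXPLAIN (ANALYZE, BUFFERS) runs the query and reports per-step timings,
-- actual row counts, and buffer I/O, i.e. the metrics listed above.
EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_id, SUM(total_amount) AS revenue
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY customer_id;
Re-running the same statement after each change gives you the before/after comparison the Verify step asks for.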
What to monitor (quick checklist)
- p95/p99 latency of key dashboards and reports.
- Top N slow queries by total time and frequency.
- Queries with highest bytes scanned or I/O.
- Queries with frequent temp spill or sort spill.
- Waits/locks time, queue time, concurrency slots.
- Cost outliers (warehouse/compute time).
Baseline and monitoring workflow
- Define targets: e.g., dashboards p95 < 3s; scheduled report < 10 min; ad‑hoc budget < $2 per run. These guide what “slow” means.
- Capture slow queries: use your platform’s built-ins to list queries by time and resource use.
- Pick the biggest offenders: choose by total time x frequency, or bytes scanned x frequency (see the sketch after this list).
- Deep dive: get an execution plan and runtime stats. Identify the single heaviest step.
- Fix and verify: make one change, re-run, compare metrics, and record the result.
- Automate reporting: schedule a weekly “top slow queries” report with trends and wins.
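On PostgreSQL, the "capture" and "pick" steps can be a single query against pg_stat_statements. A minimal sketch, assuming the extension is enabled and PostgreSQL 13+ column names (older versions use total_time and mean_time):
-- Rank statements by total execution time (time per run times number of runs).
SELECT queryid,
       left(query, 80)                    AS query_snippet,
       calls                              AS runs,
       round(total_exec_time::numeric, 1) AS total_ms,
       round(mean_exec_time::numeric, 1)  AS mean_ms,
       shared_blks_read                   AS blocks_read_from_disk,
       temp_blks_written                  AS temp_blocks_spilled
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
The same ranking works in any engine that exposes query history; the notes below list where to look.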
DB/warehouse notes (no setup steps, just where to look)
- PostgreSQL: pg_stat_statements; EXPLAIN (ANALYZE, BUFFERS).
- MySQL: slow query log; EXPLAIN; performance schema.
- SQL Server: Query Store; SET STATISTICS IO/TIME; execution plans.
- BigQuery: INFORMATION_SCHEMA.JOBS/JOBS_BY_*; execution details; bytes processed (see the sketch after this list).
- Snowflake: QUERY_HISTORY; the Query Profile; warehouse credits and partition pruning stats.
- Redshift: STL/STV system tables; SVL/SVV system views; EXPLAIN and VACUUM/ANALYZE health.
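For BigQuery, a hedged equivalent against INFORMATION_SCHEMA.JOBS_BY_PROJECT, ranking by bytes processed (the region qualifier is an example; use your own region, and note that grouping by raw query text only merges byte-identical statements):
-- Top query texts by bytes processed over the last 7 days.
SELECT query,
       COUNT(*)                                  AS runs,
       SUM(total_bytes_processed) / POW(1024, 3) AS gb_processed,
       SUM(total_slot_ms) / 1000                 AS slot_seconds
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
  AND state = 'DONE'
GROUP BY query
ORDER BY gb_processed DESC
LIMIT 20;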
Worked examples
Example 1 — OLTP/OLAP on PostgreSQL
Slow dashboard tile: last 30 days revenue by segment.
SELECT c.segment, SUM(o.total_amount) AS revenue
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
WHERE o.order_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY c.segment;
Symptoms: EXPLAIN shows Seq Scan on orders (120M rows), Hash Join heavy, high I/O.
Likely bottlenecks: no index on order_date; join key not indexed; scanning far more rows than needed.
Fixes:
- Add index: CREATE INDEX ON orders(order_date); add index on orders(customer_id) if missing.
- Consider a partial index if most queries hit recent dates. PostgreSQL requires an immutable index predicate, so CURRENT_DATE cannot be used there; pick a literal cutoff and rebuild it periodically, e.g. CREATE INDEX ON orders(order_date) WHERE order_date >= DATE '2025-01-01'. Range partitioning by order_date avoids the rebuild.
- Pre-aggregate into a daily revenue table to shrink per-request work.
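A minimal sketch of the pre-aggregation idea from the last bullet; the summary table name is illustrative, and in practice you would refresh it incrementally or define it as a materialized view:
-- One row per day and segment instead of one row per order.
CREATE TABLE daily_revenue_by_segment AS
SELECT o.order_date::date AS revenue_date,
       c.segment,
       SUM(o.total_amount) AS revenue
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
GROUP BY 1, 2;

-- The dashboard tile now reads a handful of rows per day per segment
-- instead of scanning 120M orders on every refresh.
SELECT segment, SUM(revenue) AS revenue
FROM daily_revenue_by_segment
WHERE revenue_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY segment;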
Example 2 — BigQuery scan reduction
Original query uses a function that breaks partition pruning:
SELECT COUNT(*)
FROM events
WHERE DATE(event_ts) BETWEEN '2024-01-01' AND '2024-12-31'
AND country = 'US';
Symptoms: bytes processed unexpectedly high (hundreds of GB).
Bottleneck: wrapping event_ts with DATE() prevents partition pruning.
Fixes:
- Use native partition column without functions: event_date BETWEEN '2024-01-01' AND '2024-12-31'
- Select only needed columns; avoid SELECT *.
- Consider clustering on (country) to reduce scanned blocks for selective filters (see the DDL sketch after the rewritten query below).
SELECT COUNT(*)
FROM events
WHERE event_date BETWEEN '2024-01-01' AND '2024-12-31'
AND country = 'US';
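If the clustering suggestion is adopted, the table can be recreated with a clustering key; a hedged DDL sketch, with the dataset and table names as placeholders:
-- Partition by the date column used in filters and cluster by the selective,
-- frequently filtered column so block pruning can skip non-matching data.
-- SELECT * is acceptable here because this is a one-time rebuild, not a dashboard query.
CREATE TABLE analytics.events_clustered
PARTITION BY event_date
CLUSTER BY country AS
SELECT * FROM analytics.events;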
Example 3 — Snowflake window and sort spill
Query computes a window metric across a very large table and sorts all rows.
SELECT user_id,
SUM(purchase_amount) OVER (PARTITION BY user_id ORDER BY ts ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_spend
FROM purchases
WHERE ts >= DATEADD(month, -6, CURRENT_DATE);
Symptoms: high temp space, spill to disk, long elapsed time.
Bottleneck: the window function sorts six months of raw rows within each user partition; the combined sort exceeds memory and spills to disk.
Fixes:
- Pre-aggregate per day and user, then compute running sums on the smaller table (sketched after this list).
- Ensure micro-partition pruning via WHERE ts... and clustering by (user_id, ts) if repeatedly queried that way.
- Cache results or materialize for dashboards that reuse the same time window.
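A minimal sketch of the pre-aggregation route from the first bullet, with illustrative table names; the window function then sorts one row per user per day instead of one row per purchase:
-- Collapse raw purchases to daily totals first.
CREATE OR REPLACE TABLE daily_spend AS
SELECT user_id,
       ts::date AS spend_date,
       SUM(purchase_amount) AS day_spend
FROM purchases
WHERE ts >= DATEADD(month, -6, CURRENT_DATE)
GROUP BY user_id, ts::date;

-- Running spend over the much smaller daily table.
SELECT user_id,
       spend_date,
       SUM(day_spend) OVER (PARTITION BY user_id ORDER BY spend_date) AS running_spend
FROM daily_spend;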
Monitoring checklist
- Have clear p95 targets for each critical dashboard/report.
- Track top slow queries by total time and by frequency.
- Log bytes scanned and temp spills for warehouses.
- Capture execution plans for the top offenders.
- Verify each optimization with before/after timings and row/byte counts.
- Record decisions and results where your team can see them.
Exercises
Do these in a scratch environment. Focus on diagnosis first, then fix.
Exercise 1 — Diagnose a slow orders report (PostgreSQL)
Schema: orders(order_id, order_date, customer_id, total_amount), customers(customer_id, segment).
Slow query and plan snippet:
SELECT c.segment, SUM(o.total_amount)
FROM orders o JOIN customers c ON c.customer_id = o.customer_id
WHERE o.order_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY c.segment;
-- EXPLAIN (ANALYZE, BUFFERS) highlights
Seq Scan on orders (rows=120,000,000)
  Filter: (order_date >= (CURRENT_DATE - '30 days'::interval))
  Buffers: shared hit=100000 read=900000
Hash Join (rows=3,000,000)
  Hash Cond: (o.customer_id = c.customer_id)
- Identify two main bottlenecks from the plan.
- Propose two changes to reduce time and I/O.
- Write the improved SQL (and any indexes) you’d try first.
Exercise 2 — Cut warehouse scan costs (BigQuery/Snowflake)
Table: events(event_ts TIMESTAMP, event_date DATE, country STRING, user_id STRING, event_name STRING, payload VARIANT/JSON...). Partitioned by event_date.
Current query:
SELECT user_id, COUNT(*) AS c
FROM events
WHERE DATE(event_ts) BETWEEN '2024-01-01' AND '2024-12-31'
AND country = 'US'
GROUP BY user_id
ORDER BY c DESC
LIMIT 100;
- Goal: keep the same result semantics but reduce bytes scanned by at least 80%.
- Rewrite the query and list 2 structural changes that reduce scans.
Need a nudge? (Hints)
- Exercise 1: Where do you see a full scan or expensive join? Would a date or join index help?
- Exercise 2: Are you using the partition column directly? Are you selecting only the columns you need?
Common mistakes and self-check
- Chasing micro-optimizations before fixing the biggest scan or spill. Self-check: did you sort top queries by total time x frequency?
- Ignoring p95 and only looking at average. Self-check: compare p50 vs p95; dashboards feel slow at p95.
- Wrapping partition columns in functions, breaking pruning. Self-check: no functions around partition keys in WHERE.
- Adding many indexes without usage. Self-check: confirm index selectivity and usage via plans.
- Not validating with real inputs. Self-check: re-run the same path and record before/after metrics.
- Forgetting concurrency/locks. Self-check: look for wait/lock events and queue time.
Practical projects
- Build a “Top Slow Queries” weekly report with: query text hash, p95 time, total time, runs, bytes scanned, and trend delta (see the sketch after this list).
- Set and publish SLOs for your top 5 dashboards; track p95 weekly.
- Create a playbook: “If high bytes scanned, then check partition/filter/column selection; if high locks, check long transactions.”
- Precompute one expensive dashboard metric and measure the latency drop.
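For the weekly report, a hedged Snowflake sketch against SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY (the view lags real time by up to roughly 45 minutes, and hashing raw query text only groups byte-identical statements, so consider stripping literals first):
-- Top query patterns over the last 7 days: runs, p95 latency, total time, GB scanned.
SELECT HASH(query_text)                                   AS query_hash,
       COUNT(*)                                           AS runs,
       APPROX_PERCENTILE(total_elapsed_time / 1000, 0.95) AS p95_seconds,
       SUM(total_elapsed_time) / 1000                     AS total_seconds,
       ROUND(SUM(bytes_scanned) / POWER(1024, 3), 1)      AS gb_scanned
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP)
  AND execution_status = 'SUCCESS'
GROUP BY 1
ORDER BY total_seconds DESC
LIMIT 50;
Schedule it weekly and keep last week's numbers so the trend delta is a simple comparison.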
Learning path
- Start here: monitoring and detection (this page).
- Next: indexing, partitioning, and clustering strategies.
- Then: query rewrites and pre-aggregation patterns.
- Finally: capacity planning, caching, and scheduling for peak loads.
Next steps
- Apply this to one real dashboard today: capture baseline p95 and top slow queries.
- Pick one offender, optimize it, and document the before/after.
- Iterate weekly and share wins with stakeholders.
Mini challenge
Your finance dashboard is slow at quarter-end. Most time is in a table scan on 180 days of data. You must cut p95 in half within a day. What’s your fastest path?
- Shortlist: add a date index/ensure partition pruning; limit selected columns; pre-aggregate last 6 months into a summary table; bump warehouse size temporarily; re-measure.