Why this matters
As a BI Analyst, your dashboards, scheduled reports, and ad‑hoc queries must be fast and reliable. Slow queries waste stakeholder time, inflate warehouse costs, and jeopardize SLAs with leadership. Monitoring helps you catch issues early, focus on the biggest offenders, and prove improvements with data.
- Real tasks you will face: identifying which queries slow a dashboard; explaining why a job exceeded budget; reducing p95 dashboard latency; comparing performance before vs. after an optimization.
- Impact: fewer timeouts, predictable costs, happier stakeholders, and data you can trust during peak usage.
Who this is for
- BI Analysts and Analytics Engineers who write SQL and own dashboards or scheduled jobs.
- Data-savvy product or operations analysts who troubleshoot slow reports.
Prerequisites
- Comfort reading SQL (SELECT, JOIN, GROUP BY).
- Basic understanding of schemas, indexes/partitioning, and execution plans.
- Access to a SQL environment (e.g., PostgreSQL/MySQL/SQL Server, Snowflake/BigQuery/Redshift).
Concept explained simply
A query is slow when its total time exceeds your target (e.g., p95 dashboard load time < 3s). Bottlenecks are the dominant reasons for slowness—like scanning too many rows, waiting on locks, or spilling to disk.
Mental model
Think of data flow like water in pipes: the narrowest pipe (bottleneck) limits speed. Your job is to measure, find the narrowest point, and widen it. Repeat until performance meets the target.
- Measure: execution time, rows scanned vs. returned, memory use, I/O, CPU, temp spills, wait/lock events, cache hit ratios (see the measurement sketch after this list).
- Find: identify the heaviest step (e.g., table scan, sort, join, window function).
- Fix: add selective filters, add indexes or enable partition pruning, pre-aggregate, adjust warehouse resources, or rewrite the query.
- Verify: compare before/after with the same inputs and real user paths.
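As a concrete starting point for the Measure step, here is a minimal PostgreSQL sketch; the orders table and its columns are hypothetical (they mirror the worked example further down), and other engines expose the same numbers through their query profiles.
-- EXPLAIN (ANALYZE, BUFFERS) runs the query and reports per-step timings,
-- actual row counts, and buffer I/O, i.e. the metrics listed above.
EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_id, SUM(total_amount) AS revenue
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY customer_id;
Re-running the same statement after each change gives you the before/after comparison the Verify step asks for.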
What to monitor (quick checklist)
- p95/p99 latency of key dashboards and reports.
- Top N slow queries by total time and frequency.
- Queries with highest bytes scanned or I/O.
- Queries with frequent temp spill or sort spill.
- Waits/locks time, queue time, concurrency slots.
- Cost outliers (warehouse/compute time).
Baseline and monitoring workflow
- Define targets: e.g., dashboards p95 < 3s; scheduled report < 10 min; ad‑hoc budget < $2 per run. These guide what “slow” means.
- Capture slow queries: use your platform’s built-ins to list queries by time and resource use.
- Pick the biggest offenders: choose by total time x frequency, or bytes scanned x frequency (see the sketch after this list).
- Deep dive: get an execution plan and runtime stats. Identify the single heaviest step.
- Fix and verify: make one change, re-run, compare metrics, and record the result.
- Automate reporting: schedule a weekly “top slow queries” report with trends and wins.
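On PostgreSQL, the "capture" and "pick" steps can be a single query against pg_stat_statements. A minimal sketch, assuming the extension is enabled and PostgreSQL 13+ column names (older versions use total_time and mean_time):
-- Rank statements by total execution time (time per run times number of runs).
SELECT queryid,
       left(query, 80)                    AS query_snippet,
       calls                              AS runs,
       round(total_exec_time::numeric, 1) AS total_ms,
       round(mean_exec_time::numeric, 1)  AS mean_ms,
       shared_blks_read                   AS blocks_read_from_disk,
       temp_blks_written                  AS temp_blocks_spilled
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
The same ranking works in any engine that exposes query history; the notes below list where to look.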
DB/warehouse notes (no setup steps, just where to look)
- PostgreSQL: pg_stat_statements; EXPLAIN (ANALYZE, BUFFERS).
- MySQL: slow query log; EXPLAIN; performance schema.
- SQL Server: Query Store; SET STATISTICS IO/TIME; execution plans.
- BigQuery: INFORMATION_SCHEMA.JOBS/JOBS_BY_*; execution details; bytes processed (see the sketch after this list).
- Snowflake: QUERY_HISTORY; the Query Profile; warehouse credits and partition pruning stats.
- Redshift: STL/STV system tables; SVL/SVV system views; EXPLAIN and VACUUM/ANALYZE health.
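For BigQuery, a hedged equivalent against INFORMATION_SCHEMA.JOBS_BY_PROJECT, ranking by bytes processed (the region qualifier is an example; use your own region, and note that grouping by raw query text only merges byte-identical statements):
-- Top query texts by bytes processed over the last 7 days.
SELECT query,
       COUNT(*)                                  AS runs,
       SUM(total_bytes_processed) / POW(1024, 3) AS gb_processed,
       SUM(total_slot_ms) / 1000                 AS slot_seconds
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
  AND state = 'DONE'
GROUP BY query
ORDER BY gb_processed DESC
LIMIT 20;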
Worked examples
Example 1 — OLTP/OLAP on PostgreSQL
Slow dashboard tile: last 30 days revenue by segment.
SELECT c.segment, SUM(o.total_amount) AS revenue
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
WHERE o.order_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY c.segment;
Symptoms: EXPLAIN shows Seq Scan on orders (120M rows), Hash Join heavy, high I/O.
Likely bottlenecks: no index on order_date; join key not indexed; scanning far more rows than needed.
Fixes:
- Add index: CREATE INDEX ON orders(order_date); add index on orders(customer_id) if missing.
- Consider a partial index if most queries hit recent dates. PostgreSQL requires an immutable index predicate, so CURRENT_DATE cannot be used there; pick a literal cutoff and rebuild it periodically, e.g. CREATE INDEX ON orders(order_date) WHERE order_date >= DATE '2025-01-01'. Range partitioning by order_date avoids the rebuild.
- Pre-aggregate into a daily revenue table to shrink per-request work.
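A minimal sketch of the pre-aggregation idea from the last bullet; the summary table name is illustrative, and in practice you would refresh it incrementally or define it as a materialized view:
-- One row per day and segment instead of one row per order.
CREATE TABLE daily_revenue_by_segment AS
SELECT o.order_date::date AS revenue_date,
       c.segment,
       SUM(o.total_amount) AS revenue
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
GROUP BY 1, 2;

-- The dashboard tile now reads a handful of rows per day per segment
-- instead of scanning 120M orders on every refresh.
SELECT segment, SUM(revenue) AS revenue
FROM daily_revenue_by_segment
WHERE revenue_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY segment;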
Example 2 — BigQuery scan reduction
Original query uses a function that breaks partition pruning:
SELECT COUNT(*)
FROM events
WHERE DATE(event_ts) BETWEEN '2024-01-01' AND '2024-12-31'
AND country = 'US';
Symptoms: bytes processed unexpectedly high (hundreds of GB).
Bottleneck: wrapping event_ts with DATE() prevents partition pruning.
Fixes:
- Use native partition column without functions: event_date BETWEEN '2024-01-01' AND '2024-12-31'
- Select only needed columns; avoid SELECT *.
- Consider clustering on (country) to reduce scanned blocks for selective filters (see the DDL sketch after the rewritten query below).
SELECT COUNT(*)
FROM events
WHERE event_date BETWEEN '2024-01-01' AND '2024-12-31'
AND country = 'US';
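If the clustering suggestion is adopted, the table can be recreated with a clustering key; a hedged DDL sketch, with the dataset and table names as placeholders:
-- Partition by the date column used in filters and cluster by the selective,
-- frequently filtered column so block pruning can skip non-matching data.
-- SELECT * is acceptable here because this is a one-time rebuild, not a dashboard query.
CREATE TABLE analytics.events_clustered
PARTITION BY event_date
CLUSTER BY country AS
SELECT * FROM analytics.events;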
Example 3 — Snowflake window and sort spill
Query computes a window metric across a very large table and sorts all rows.
SELECT user_id,
SUM(purchase_amount) OVER (PARTITION BY user_id ORDER BY ts ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_spend
FROM purchases
WHERE ts >= DATEADD(month, -6, CURRENT_DATE);
Symptoms: high temp space, spill to disk, long elapsed time.
Bottleneck: the window function sorts six months of raw rows within each user partition; the combined sort exceeds memory and spills to disk.
Fixes:
- Pre-aggregate per day and user, then compute running sums on the smaller table (sketched after this list).
- Ensure micro-partition pruning via WHERE ts... and clustering by (user_id, ts) if repeatedly queried that way.
- Cache results or materialize for dashboards that reuse the same time window.
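A minimal sketch of the pre-aggregation route from the first bullet, with illustrative table names; the window function then sorts one row per user per day instead of one row per purchase:
-- Collapse raw purchases to daily totals first.
CREATE OR REPLACE TABLE daily_spend AS
SELECT user_id,
       ts::date AS spend_date,
       SUM(purchase_amount) AS day_spend
FROM purchases
WHERE ts >= DATEADD(month, -6, CURRENT_DATE)
GROUP BY user_id, ts::date;

-- Running spend over the much smaller daily table.
SELECT user_id,
       spend_date,
       SUM(day_spend) OVER (PARTITION BY user_id ORDER BY spend_date) AS running_spend
FROM daily_spend;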
Monitoring checklist
- Have clear p95 targets for each critical dashboard/report.
- Track top slow queries by total time and by frequency.
- Log bytes scanned and temp spills for warehouses.
- Capture execution plans for the top offenders.
- Verify each optimization with before/after timings and row/byte counts.
- Record decisions and results where your team can see them.
Exercises
Do these in a scratch environment. Focus on diagnosis first, then fix.
Exercise 1 — Diagnose a slow orders report (PostgreSQL)
Schema: orders(order_id, order_date, customer_id, total_amount), customers(customer_id, segment).
Slow query and plan snippet:
SELECT c.segment, SUM(o.total_amount)
FROM orders o JOIN customers c ON c.customer_id = o.customer_id
WHERE o.order_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY c.segment;
-- EXPLAIN (ANALYZE, BUFFERS) highlights
Seq Scan on orders (rows=120,000,000)
  Filter: (order_date >= (CURRENT_DATE - '30 days'::interval))
  Buffers: shared hit=100000 read=900000
Hash Join (rows=3,000,000)
  Hash Cond: (o.customer_id = c.customer_id)
- Identify two main bottlenecks from the plan.
- Propose two changes to reduce time and I/O.
- Write the improved SQL (and any indexes) you’d try first.
Exercise 2 — Cut warehouse scan costs (BigQuery/Snowflake)
Table: events(event_ts TIMESTAMP, event_date DATE, country STRING, user_id STRING, event_name STRING, payload VARIANT/JSON...). Partitioned by event_date.
Current query:
SELECT user_id, COUNT(*) AS c
FROM events
WHERE DATE(event_ts) BETWEEN '2024-01-01' AND '2024-12-31'
AND country = 'US'
GROUP BY user_id
ORDER BY c DESC
LIMIT 100;
- Goal: keep the same result semantics but reduce bytes scanned by at least 80%.
- Rewrite the query and list 2 structural changes that reduce scans.
Need a nudge? (Hints)
- Exercise 1: Where do you see a full scan or expensive join? Would a date or join index help?
- Exercise 2: Are you using the partition column directly? Are you selecting only the columns you need?
Common mistakes and self-check
- Chasing micro-optimizations before fixing the biggest scan or spill. Self-check: did you sort top queries by total time x frequency?
- Ignoring p95 and only looking at average. Self-check: compare p50 vs p95; dashboards feel slow at p95.
- Wrapping partition columns in functions, breaking pruning. Self-check: no functions around partition keys in WHERE.
- Adding many indexes without usage. Self-check: confirm index selectivity and usage via plans.
- Not validating with real inputs. Self-check: re-run the same path and record before/after metrics.
- Forgetting concurrency/locks. Self-check: look for wait/lock events and queue time.
Practical projects
- Build a “Top Slow Queries” weekly report with: query text hash, p95 time, total time, runs, bytes scanned, and trend delta (see the sketch after this list).
- Set and publish SLOs for your top 5 dashboards; track p95 weekly.
- Create a playbook: “If high bytes scanned, then check partition/filter/column selection; if high locks, check long transactions.”
- Precompute one expensive dashboard metric and measure the latency drop.
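For the weekly report, a hedged Snowflake sketch against SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY (the view lags real time by up to roughly 45 minutes, and hashing raw query text only groups byte-identical statements, so consider stripping literals first):
-- Top query patterns over the last 7 days: runs, p95 latency, total time, GB scanned.
SELECT HASH(query_text)                                   AS query_hash,
       COUNT(*)                                           AS runs,
       APPROX_PERCENTILE(total_elapsed_time / 1000, 0.95) AS p95_seconds,
       SUM(total_elapsed_time) / 1000                     AS total_seconds,
       ROUND(SUM(bytes_scanned) / POWER(1024, 3), 1)      AS gb_scanned
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP)
  AND execution_status = 'SUCCESS'
GROUP BY 1
ORDER BY total_seconds DESC
LIMIT 50;
Schedule it weekly and keep last week's numbers so the trend delta is a simple comparison.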
Learning path
- Start here: monitoring and detection (this page).
- Next: indexing, partitioning, and clustering strategies.
- Then: query rewrites and pre-aggregation patterns.
- Finally: capacity planning, caching, and scheduling for peak loads.
Next steps
- Apply this to one real dashboard today: capture baseline p95 and top slow queries.
- Pick one offender, optimize it, and document the before/after.
- Iterate weekly and share wins with stakeholders.
Mini challenge
Your finance dashboard is slow at quarter-end. Most time is in a table scan on 180 days of data. You must cut p95 in half within a day. What’s your fastest path?
- Shortlist: add a date index/ensure partition pruning; limit selected columns; pre-aggregate last 6 months into a summary table; bump warehouse size temporarily; re-measure.