
Monitoring Query Costs And Runtime

Learn Monitoring Query Costs And Runtime for free, with explanations, exercises, and a quick test (aimed at Analytics Engineers).

Published: December 23, 2025 | Updated: December 23, 2025

Why this matters

As an Analytics Engineer, you own models and queries that power dashboards and downstream data products. Unmonitored queries can quietly become slow and expensive, driving up warehouse costs and frustrating stakeholders. Monitoring query costs and runtime lets you:

  • Catch regressions before business users notice.
  • Control spend by finding heavy scans and inefficient patterns.
  • Prioritize optimizations using data, not guesswork.
  • Prove impact of changes with measurable improvements.

Concept explained simply

Every query has two things you care about: how long it takes (runtime) and how much it costs (compute and data scanned). Runtime often breaks into queue time + compilation + execution. Cost is usually tied to compute consumed (e.g., warehouse-seconds/credits/slots) and data scanned. Fewer scanned bytes, better pruning/partitioning, and efficient joins usually mean lower cost and faster runtime.
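As a concrete starting point, here is a minimal sketch that splits runtime into its components, assuming your query history exposes queue, compilation, and execution columns (column and table names vary by platform; adjust to yours):

-- Hypothetical column names; adjust to your platform's query history view
SELECT
  query_id,
  queue_ms,
  compilation_ms,
  execution_ms,
  queue_ms + compilation_ms + execution_ms AS total_runtime_ms,
  bytes_scanned
FROM your_wh.INFORMATION_SCHEMA.QUERY_HISTORY
WHERE start_time > CURRENT_TIMESTAMP - INTERVAL '1 hour'
ORDER BY total_runtime_ms DESC
LIMIT 20;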

Mental model

Think of each query as a ride share trip:

  • Pickup delay = queue time.
  • Route planning = compilation/optimization.
  • Driving = execution time.
  • Fuel used = cost (compute + bytes scanned).

Shorter, well-planned trips use less fuel and arrive faster.

Core signals to watch

  • Runtime breakdown: queue_ms, compilation_ms, execution_ms.
  • Bytes scanned and rows scanned vs. rows output.
  • Compute consumed: credits/warehouse-seconds/slots (name varies by platform).
  • Cache hit vs. cold run (first run after cache clear is often slower).
  • Query plan red flags: large scans, cross joins, unpruned partitions, sort/aggregate spills.
  • Concurrency and queueing: many concurrent heavy jobs or undersized warehouse.
  • Error/abort rate and retries.

Tip: Good baselines
  • Use p50/p95 runtime by job/model and compare to the last 7–28 days.
  • Track cost per job/model per day; look for step changes (see the sketch after this list).
  • Compare cold runs vs. warmed runs if your warehouse benefits from caching.
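A minimal step-change sketch, assuming a daily rollup table with one row per job per day (daily_job_cost, job_label, dt, and bytes_scanned_sum are placeholder names):

-- Flag days where a job's scan volume jumps above 2x its trailing average
SELECT
  job_label,
  dt,
  bytes_scanned_sum,
  AVG(bytes_scanned_sum) OVER (
    PARTITION BY job_label
    ORDER BY dt
    ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
  ) AS trailing_7d_avg
FROM daily_job_cost
QUALIFY bytes_scanned_sum > 2 * trailing_7d_avg;

QUALIFY is available on Snowflake, BigQuery, and Databricks; elsewhere, wrap the query in a subquery and filter in an outer WHERE.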

How to monitor in practice

Most cloud warehouses expose query/job history in system views. Names vary, but you can expect a table similar to INFORMATION_SCHEMA.QUERY_HISTORY with columns like query_id, user_name, start_time, end_time, bytes_scanned, rows_scanned, rows_returned, was_cached, error, and sometimes compute_used or warehouse_seconds.

  1. Collect: Load query history into a monitoring schema (daily incremental load; see the sketch after this list).
  2. Label: Map queries to owners/jobs (e.g., by comment, tag, model name in query text).
  3. Aggregate: Compute p50/p95 runtime and total cost by job, owner, and hour/day.
  4. Alert: Thresholds on regressions (e.g., p95 runtime +100% vs baseline).
  5. Review: Weekly review of top N expensive and slow queries.
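A minimal sketch of the collect step, assuming a monitoring schema and raw table you create yourself (monitoring.query_history_raw is a placeholder name):

-- Daily incremental append: only pull rows newer than the current watermark
INSERT INTO monitoring.query_history_raw
SELECT q.*
FROM your_wh.INFORMATION_SCHEMA.QUERY_HISTORY q
WHERE q.start_time > (
  SELECT COALESCE(MAX(start_time), TIMESTAMP '1970-01-01 00:00:00')
  FROM monitoring.query_history_raw
);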

Template: recent slow or expensive queries
-- Adjust table/column names to your warehouse
SELECT
  q.query_id,
  q.user_name,
  q.start_time,
  TIMESTAMPDIFF('millisecond', q.start_time, q.end_time) AS runtime_ms,
  q.bytes_scanned,
  q.rows_scanned,
  q.rows_returned,
  q.was_cached,
  q.query_text
FROM your_wh.INFORMATION_SCHEMA.QUERY_HISTORY q
WHERE q.start_time > CURRENT_TIMESTAMP - INTERVAL '24 hours'
ORDER BY q.bytes_scanned DESC, runtime_ms DESC
LIMIT 50;

Template: daily p50/p95 runtime by job label
-- Suppose you embed a label in SQL comments like /* job:daily_sales */
WITH labeled AS (
  SELECT
    q.*,
    REGEXP_SUBSTR(q.query_text, '/\*\s*job:([a-zA-Z0-9_\-]+)\s*\*/', 1, 1, 'e', 1) AS job_label,
    TIMESTAMPDIFF('millisecond', q.start_time, q.end_time) AS runtime_ms
  FROM your_wh.INFORMATION_SCHEMA.QUERY_HISTORY q
  WHERE q.start_time >= CURRENT_DATE - INTERVAL '14 days'
)
SELECT
  CAST(start_time AS DATE) AS dt,
  job_label,
  APPROX_PERCENTILE(runtime_ms, 0.50) AS p50_ms,
  APPROX_PERCENTILE(runtime_ms, 0.95) AS p95_ms,
  SUM(bytes_scanned) AS bytes_scanned_sum
FROM labeled
WHERE job_label IS NOT NULL
GROUP BY 1,2
ORDER BY 1,2;

Template: cost proxy by owner
-- If your warehouse exposes compute or credits, use that.
-- Otherwise, bytes_scanned is a reasonable cost proxy.
SELECT
  COALESCE(owner, user_name) AS owner_or_user,
  CAST(start_time AS DATE) AS dt,
  SUM(COALESCE(credits_used, 0)) AS credits_used,
  SUM(bytes_scanned) AS bytes_scanned
FROM your_wh.INFORMATION_SCHEMA.QUERY_HISTORY
WHERE start_time >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY 1,2
ORDER BY 2 DESC, 3 DESC;

Worked examples

Example 1: Dashboard loads slowed by 2x overnight

Symptom: Dashboard queries for daily_sales doubled in runtime.

Investigation:

-- Compare today vs 7-day median
WITH runs AS (
  SELECT
    REGEXP_SUBSTR(query_text, '/\*\s*job:([a-zA-Z0-9_\-]+)\s*\*/', 1, 1, 'e', 1) AS job,
    TIMESTAMPDIFF('millisecond', start_time, end_time) AS runtime_ms,
    CAST(start_time AS DATE) AS dt
  FROM your_wh.INFORMATION_SCHEMA.QUERY_HISTORY
  WHERE start_time >= CURRENT_DATE - INTERVAL '8 days'
)
SELECT
  job,
  APPROX_PERCENTILE(CASE WHEN dt = CURRENT_DATE THEN runtime_ms END, 0.50) AS today_p50,
  APPROX_PERCENTILE(CASE WHEN dt < CURRENT_DATE THEN runtime_ms END, 0.50) AS prev7_p50
FROM runs
WHERE job = 'daily_sales'
GROUP BY 1;

Interpretation: p50 doubled; bytes_scanned also increased. Likely partition pruning broke.

Fix: Ensure the date filter is pushed down, partition or cluster on the date column, and avoid SELECT *. A before/after sketch follows.
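Illustrative rewrite (the sales table and its columns are made up for the example):

-- Before: SELECT * FROM sales;  -- full scan, no pruning
-- After: project only the needed columns and filter on the partition column
SELECT order_date, store_id, SUM(amount) AS revenue
FROM sales
WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'  -- enables partition pruning
GROUP BY 1, 2;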

Example 2: Cost spike from ad-hoc exploration

Symptom: Credits used jumped during business hours.

Investigation:

SELECT user_name, COUNT(*) AS queries,
       SUM(bytes_scanned) AS bytes_scanned
FROM your_wh.INFORMATION_SCHEMA.QUERY_HISTORY
WHERE start_time > CURRENT_TIMESTAMP - INTERVAL '24 hours'
GROUP BY 1
ORDER BY bytes_scanned DESC
LIMIT 10;

Interpretation: One user scanned terabytes with wide SELECTs and no filters.

Fix: Educate on preview tables, add row limits in IDEs, and set resource monitors and max runtime policies (example below).
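The exact guardrail mechanism varies by platform. On Snowflake, for example, a credit quota and a statement timeout look roughly like this (the warehouse and monitor names are placeholders):

-- Snowflake-style guardrails; adapt to your platform
CREATE RESOURCE MONITOR adhoc_monitor WITH CREDIT_QUOTA = 100
  TRIGGERS ON 90 PERCENT DO NOTIFY
           ON 100 PERCENT DO SUSPEND;
ALTER WAREHOUSE adhoc_wh SET RESOURCE_MONITOR = adhoc_monitor;
ALTER WAREHOUSE adhoc_wh SET STATEMENT_TIMEOUT_IN_SECONDS = 600;  -- cap runaway queries at 10 minutes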

Example 3: Long queue time at peak

Symptom: Runtimes are long but execution_ms is normal; queue_ms is high.

Investigation: Plot queue_ms over time against the number of concurrent queries; check warehouse size and scheduling overlap. A starter query follows.
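A minimal sketch, assuming queue_ms is exposed in your query history (bucket by hour so queueing lines up with load):

-- Hourly queue time vs. a rough concurrency proxy
SELECT
  DATE_TRUNC('hour', start_time) AS hr,
  AVG(queue_ms) AS avg_queue_ms,
  COUNT(*) AS queries_started
FROM your_wh.INFORMATION_SCHEMA.QUERY_HISTORY
WHERE start_time >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY 1
ORDER BY 1;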

Fix: Right-size or auto-scale the warehouse; stagger ELT schedules; reserve capacity for BI.

Hands-on exercises

These mirror the exercises below. You can complete them in any SQL-capable environment with access to your warehouse logs.

Exercise 1: Detect runtime regressions by model/job

  1. Parse a job label from query comments like /* job:your_model */.
  2. Compute p50 runtime for today and compare to the last 7 days.
  3. List jobs where today_p50 >= 2x prev7_p50 and at least 10 runs today.

Reference query
WITH labeled AS (
  SELECT
    REGEXP_SUBSTR(q.query_text, '/\*\s*job:([a-zA-Z0-9_\-]+)\s*\*/', 1, 1, 'e', 1) AS job,
    TIMESTAMPDIFF('millisecond', q.start_time, q.end_time) AS runtime_ms,
    CAST(q.start_time AS DATE) AS dt
  FROM your_wh.INFORMATION_SCHEMA.QUERY_HISTORY q
  WHERE q.start_time >= CURRENT_DATE - INTERVAL '8 days'
), summary AS (
  SELECT
    job,
    APPROX_PERCENTILE(CASE WHEN dt = CURRENT_DATE THEN runtime_ms END, 0.50) AS today_p50,
    APPROX_PERCENTILE(CASE WHEN dt < CURRENT_DATE THEN runtime_ms END, 0.50) AS prev7_p50,
    COUNT_IF(dt = CURRENT_DATE) AS runs_today
  FROM labeled
  WHERE job IS NOT NULL
  GROUP BY 1
)
SELECT *
FROM summary
WHERE runs_today >= 10 AND today_p50 >= 2 * prev7_p50
ORDER BY today_p50 DESC;

Exercise 2: Find top cost drivers and check pruning

  1. From the last 24 hours, find top 10 queries by bytes_scanned.
  2. Add simple heuristics: flag if query_text contains SELECT * and lacks WHERE.
  3. Output query_id, owner, bytes_scanned_gb, runtime_s, has_filter_flag.

Reference query
WITH base AS (
  SELECT
    q.query_id,
    COALESCE(q.owner, q.user_name) AS owner,
    q.query_text,
    q.bytes_scanned,
    TIMESTAMPDIFF('second', q.start_time, q.end_time) AS runtime_s
  FROM your_wh.INFORMATION_SCHEMA.QUERY_HISTORY q
  WHERE q.start_time > CURRENT_TIMESTAMP - INTERVAL '24 hours'
), heuristics AS (
  SELECT
    *,
    -- Note: some engines (e.g., Snowflake) anchor REGEXP_LIKE to the whole string;
    -- there, use patterns like '\s*SELECT\s+\*.*' and '.*\bWHERE\b.*' instead
    (REGEXP_LIKE(UPPER(query_text), '^\s*SELECT\s+\*') AND NOT REGEXP_LIKE(UPPER(query_text), '\bWHERE\b')) AS no_pruning
  FROM base
)
SELECT
  query_id,
  owner,
  ROUND(bytes_scanned / 1024 / 1024 / 1024, 2) AS bytes_scanned_gb,
  runtime_s,
  no_pruning AS likely_unpruned
FROM heuristics
ORDER BY bytes_scanned DESC
LIMIT 10;

Checklist
  • [ ] I verified my warehouse log table names and time functions.
  • [ ] I used medians/p95 instead of only averages.
  • [ ] I included a basic owner/job label to group queries.
  • [ ] I validated at least one suspected regression manually.

Common mistakes and how to self-check

  • Only tracking averages: Use p50/p95 to avoid outlier distortion.
  • Ignoring queue time: Split runtime to locate contention vs query inefficiency.
  • Forgetting cold vs warm cache: Compare apples to apples; note was_cached.
  • Conflating bytes scanned with result size: Big scans with small outputs still cost a lot.
  • No ownership labels: Add job/user tags in SQL comments to group queries.
  • Chasing micro-optimizations: Prioritize top N by cost or p95 runtime.

Self-check: Is my alert signal noisy?
  • Does it fire daily during expected batch windows? If yes, widen the window or add schedule-aware baselines.
  • Does it track p95 and enforce minimum volume thresholds (e.g., 10+ runs)?

Practical projects

  • Build a "Top Offenders" dashboard: top queries by bytes_scanned and by p95 runtime; drill into text and owner.
  • Create a weekly report: changes in cost per job vs previous week, with annotations of deployments.
  • Guardrail alert: notify when a single query scans more than 100 GB or runs longer than 10 minutes (detection sketch below).
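A minimal detection query for the guardrail alert (thresholds come from the bullet above; wire the results into your alerting tool):

-- Queries from the last hour that breach either guardrail
SELECT
  query_id,
  user_name,
  bytes_scanned,
  TIMESTAMPDIFF('second', start_time, end_time) AS runtime_s
FROM your_wh.INFORMATION_SCHEMA.QUERY_HISTORY
WHERE start_time > CURRENT_TIMESTAMP - INTERVAL '1 hour'
  AND (bytes_scanned > 100 * 1024 * 1024 * 1024   -- > 100 GB scanned
       OR TIMESTAMPDIFF('second', start_time, end_time) > 600);  -- > 10 minutes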

Who this is for

Analytics Engineers and BI Developers who maintain data models, scheduled transformations, and dashboard queries.

Prerequisites

  • Comfortable with SQL (window functions, regex helpful).
  • Access to your warehouse query/job history views.
  • Basic understanding of partitions/clustering and join strategies.

Learning path

  1. Collect and label query history by job/owner.
  2. Implement p50/p95 runtime and cost proxies per job/day.
  3. Add dashboards and investigate top offenders.
  4. Set threshold-based alerts and tune schedules/warehouse size.
  5. Iterate: optimize models (filters, indexes/keys, clustering), then measure again.

Next steps

  • Add standardized SQL comments in every model and job for reliable mapping.
  • Adopt small pull request templates: note expected runtime/cost impact and verify after deploy.
  • Schedule a weekly 15-minute performance review with clear owners and actions.

Mini challenge

Pick one dashboard. Measure its top 3 queries’ p95 runtime and bytes scanned this week vs last week. Write one concrete optimization for each (e.g., add date filter, pre-aggregate, remove SELECT *). Re-measure after changes.


Practice Exercises

2 exercises to complete

Instructions

Parse a job label from SQL comments like /* job:your_model */. Compute p50 runtime for today and the previous 7 days. Return jobs where today_p50 >= 2x prev7_p50 and runs_today >= 10.

  1. Extract job label from query_text.
  2. Calculate runtime_ms for each run.
  3. Aggregate to p50 per job for today vs previous 7 days.
  4. Filter on the 2x threshold and minimum runs.

Expected Output
A result set listing job, today_p50_ms, prev7_p50_ms, runs_today for jobs that regressed by at least 2x.

Monitoring Query Costs And Runtime — Quick Test

Test your knowledge with 7 questions. Pass with 70% or higher.

