Why this matters
As a Data Platform Engineer, you keep the warehouse fast, reliable, and cost-effective. Performance monitoring dashboards give you early warning signals (before users complain), reveal hotspots, and show the impact of tuning. Real tasks you will do:
- Spot and fix spikes in query latency, queue time, and warehouse saturation.
- Identify top cost drivers and reduce waste (e.g., unbounded scans, bad filters).
- Track SLAs: data freshness, job success, and end-to-end query performance.
- Provide self-serve observability for analysts, engineers, and stakeholders.
Concept explained simply
A performance dashboard is a set of live charts and tables that summarizes how healthy your data warehouse and queries are. It turns raw logs into a small set of signals you can read at a glance.
Mental model
Think of a warehouse like a highway:
- Traffic volume (concurrency) and speed (latency) vary by time of day.
- Jams happen when lanes are full (resource saturation) or when trucks are too heavy (huge scans, memory spills).
- Good signage (labels, owners, job names) helps you direct the right drivers to the right lanes (workload management).
Your dashboard should answer, in under 30 seconds:
- Is it slow right now? (p95 latency, queue time)
- Why? (saturation, skew, large scans, errors)
- Who is impacted? (team/label/warehouse)
- What changed recently? (deployments, data size, schedule)
Core metrics and signals
- Latency: p50/p95 query duration; queue time vs execution time.
- Concurrency and utilization: active queries, slots/credits/warehouse utilization %.
- Scan volume: bytes scanned, rows read vs returned, partitions/pruning efficiency.
- Memory and spills: temp storage use, disk spill counts.
- Errors/retries: error rate, top error codes.
- Cost: cost per query/workload, cost per TB scanned (rough proxies are fine).
- Cache: cache hit ratio; delta vs baseline.
- Freshness: data latency for key tables and pipelines.
Useful breakdowns
- By team/label, warehouse/size, query type (ad-hoc vs scheduled), and table.
- By time of day and release window (pre/post deploy).
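For example, the team and time-of-day breakdowns can be computed as a sketch against the generic query_history table used in the implementation patterns later in this lesson (label and exec_ms are assumed column names, not a specific vendor schema):
SELECT label,
       extract(hour FROM end_time) AS hour_of_day,
       percentile_cont(0.95) WITHIN GROUP (ORDER BY exec_ms) AS p95_exec_ms
FROM query_history                                  -- generic query history, as in the examples below
WHERE end_time > now() - interval '7 days'
GROUP BY 1, 2
ORDER BY 1, 2;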
Worked examples
Example 1: Sudden p95 latency spike
- Check p95 latency and queue time. If queue time rose sharply, you likely hit concurrency limits.
- Check utilization and autoscaling events. If maxed out, consider scaling or workload isolation.
- If execution time increased but queue stayed flat, investigate heavy scans or skew (top tables by bytes scanned).
What you’d show on the dashboard
- Time series: p95 latency and queue time.
- Bar chart: top tables by bytes scanned during the spike.
- Table: top queries by execution time with owners/labels.
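A sketch of that drilldown table, assuming the generic query_history columns used later in this lesson (query_id, label, user_email, exec_ms, queued_ms); the spike window timestamps are placeholders:
SELECT query_id,
       label,
       MAX(user_email) AS owner,
       SUM(exec_ms)    AS total_exec_ms,
       SUM(queued_ms)  AS total_queued_ms
FROM query_history
WHERE end_time BETWEEN timestamp '2024-05-01 09:00'
                   AND timestamp '2024-05-01 10:00'  -- placeholder spike window
GROUP BY query_id, label
ORDER BY total_exec_ms DESC
LIMIT 20;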
Example 2: Cost surge overnight
- Plot cost by label/team. Identify the label with the largest delta.
- Within that label, show top queries by bytes scanned and cache miss rate.
- Action: enforce partition pruning, discourage SELECT * in favor of explicit columns and filters, or materialize summary tables.
What you’d show on the dashboard
- Stacked area: cost by label over the last 7 days.
- Table: top 10 expensive queries with owner and last run time.
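A sketch of the query behind the stacked area chart, assuming per-query cost_usd and label columns exist in the generic query_history table (as in the implementation patterns below):
SELECT date_trunc('day', end_time) AS day,
       label,
       SUM(cost_usd) AS cost_usd
FROM query_history
WHERE end_time > now() - interval '7 days'
GROUP BY 1, 2
ORDER BY 1, 2;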
Example 3: Missed data freshness SLA
- Check freshness for critical tables (e.g., events, orders).
- Trace upstream job durations and error rate around the delay window.
- Action: increase cluster size during the ingestion window or reschedule heavy transforms.
What you’d show on the dashboard
- Time series: freshness minutes for key datasets.
- Table: failed/slow jobs with start/end times and owners.
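A sketch of the freshness time series, assuming a hypothetical table_freshness metadata table (table_name, last_loaded_at) that you populate from pipeline logs or your warehouse's load metadata:
SELECT table_name,
       extract(epoch FROM (now() - last_loaded_at)) / 60 AS freshness_minutes  -- minutes since last successful load
FROM table_freshness                                   -- hypothetical metadata table fed by pipeline logs
WHERE table_name IN ('events', 'orders')
ORDER BY freshness_minutes DESC;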
Designing your dashboard
- Top row (Now): p95 latency, queue time, utilization %, error rate, freshness for 2–3 key datasets.
- Behavior (Last 24h/7d): time series of latency, cost by label, concurrency, cache hit ratio.
- Drilldowns: top N expensive queries, top tables by scan bytes, error codes.
- Attribution: breakdowns by label/team, warehouse, and query type.
- Alerts view: recent alerts with status and owner action items.
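As one concrete tile from the top row, the error rate could be computed like this (a sketch assuming query_history has a status column with values such as 'SUCCESS' and 'ERROR'):
SELECT AVG(CASE WHEN status <> 'SUCCESS' THEN 1.0 ELSE 0.0 END) AS error_rate  -- fraction of failed queries
FROM query_history
WHERE end_time > now() - interval '1 hour';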
Recommended KPIs per warehouse type
- Elastic cloud DW: utilization %, autoscale events, cost per workload label.
- Slot-based engines: slot utilization, queued vs running queries.
- Presto/Trino-like: memory spill count, skew metrics, failed tasks.
Implementation patterns (portable)
You can compute these metrics from a generic query_history table, supplemented by a cost or billing export where per-query cost is not recorded directly. Example pseudo-SQL:
Time-bucketed latency percentiles (last 24h)
SELECT date_trunc('minute', end_time) AS ts,
percentile_cont(0.50) WITHIN GROUP (ORDER BY exec_ms) AS p50_ms,
percentile_cont(0.95) WITHIN GROUP (ORDER BY exec_ms) AS p95_ms,
percentile_cont(0.95) WITHIN GROUP (ORDER BY queued_ms) AS q_p95_ms
FROM query_history
WHERE end_time > now() - interval '24 hours'
AND status = 'SUCCESS'
GROUP BY 1
ORDER BY 1;
Top 10 expensive queries by label (7d)
SELECT label,
query_id,
SUM(cost_usd) AS cost_usd,
MAX(user_email) AS owner,
MAX(end_time) AS last_run
FROM query_history
WHERE end_time > now() - interval '7 days'
GROUP BY label, query_id
ORDER BY cost_usd DESC
LIMIT 10;
Table scan hotspots
SELECT table_name,
SUM(bytes_scanned) AS scanned_bytes,
SUM(rows_returned) AS rows_returned
FROM query_table_scan_stats -- derived from query plans or access logs
WHERE end_time > now() - interval '7 days'
GROUP BY table_name
ORDER BY scanned_bytes DESC
LIMIT 20;
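Cache hit ratio over time (a sketch; cache_hit is an assumed boolean column, so adapt it to whatever your engine exposes in its query history)
SELECT date_trunc('hour', end_time) AS ts,
       AVG(CASE WHEN cache_hit THEN 1.0 ELSE 0.0 END) AS cache_hit_ratio
FROM query_history                            -- cache_hit is an assumed flag; adapt to your engine
WHERE end_time > now() - interval '7 days'
AND status = 'SUCCESS'
GROUP BY 1
ORDER BY 1;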
Safety tip: use read-only credentials for monitoring queries and avoid sampling sensitive query text unless required.
Alerting and thresholds
- Start with baselines. Measure for 14 days, then set WARN at +30% above baseline and CRIT at +60% (tune per signal).
- Use time windows (e.g., 15–30 min) to avoid flapping.
- Route alerts by label/team and include a one-line suggested action.
- Alert on leading indicators: queue time p95, utilization, sudden cache-hit drops, error rate spikes.
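A sketch of how the baseline comparison might be computed for one signal, assuming the same generic query_history schema; the 1.30/1.60 multipliers mirror the starting thresholds above:
WITH baseline AS (
  SELECT percentile_cont(0.95) WITHIN GROUP (ORDER BY queued_ms) AS q_p95_ms
  FROM query_history
  WHERE end_time > now() - interval '14 days'
),
current_window AS (
  SELECT percentile_cont(0.95) WITHIN GROUP (ORDER BY queued_ms) AS q_p95_ms
  FROM query_history
  WHERE end_time > now() - interval '15 minutes'
)
SELECT c.q_p95_ms AS current_ms,
       b.q_p95_ms AS baseline_ms,
       CASE
         WHEN c.q_p95_ms > b.q_p95_ms * 1.60 THEN 'CRIT'  -- more than +60% over baseline
         WHEN c.q_p95_ms > b.q_p95_ms * 1.30 THEN 'WARN'  -- more than +30% over baseline
         ELSE 'OK'
       END AS status
FROM current_window c
CROSS JOIN baseline b;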
Example alert payload
{
"signal": "queue_time_p95",
"current": 1800,
"baseline": 800,
"status": "CRIT",
"window": "15m",
"label": "analytics_batch",
"suggestion": "Scale warehouse or move heavy backfills outside peak. Review large scans in last 30m."
}
Exercises
Do these to apply what you learned; they mirror the exercises below.
- Exercise 1 (ex1): Write SQL to compute core tiles from a query history schema and list the dashboard panels you’d create.
- Exercise 2 (ex2): From a given baseline, propose alert thresholds and actions.
Pre-flight checklist
- Define your time windows (last 24h, 7d).
- Decide on labels/owners for attribution.
- List the top 5 KPIs for your top row.
- Pick a safe starting alert policy (WARN/CRIT with windows).
Common mistakes and self-check
- Mistake: Only monitoring averages. Fix: Track p95/p99 and separate queue time from execution time.
- Mistake: No attribution. Fix: Always include label/team and owner columns.
- Mistake: Alert storms. Fix: Add windows, dedupe, and route by label.
- Mistake: Focusing only on cost or only on speed. Fix: Pair latency with cost and bytes scanned.
- Mistake: Missing data freshness. Fix: Add freshness tiles for business-critical tables.
Self-check
- Can you explain a spike using three charts or fewer?
- Can a team see their own cost and top queries without asking you?
- Would your alerts lead to a clear first action?
Practical projects
- Build a “Now” dashboard: top row KPIs, last 24h time series, and top-N tables.
- Add a cost explorer: stacked cost by team with drilldown to queries.
- Freshness monitor: track 5 critical datasets and alert on thresholds.
- Adopt labels: define a simple label schema (owner, environment, workload) and roll it out.
Who this is for
- Data Platform Engineers who operate analytical warehouses.
- Analytics Engineers and Data Analysts needing self-serve visibility.
- SREs supporting data infrastructure performance.
Prerequisites
- Comfort with SQL and reading query plans.
- Basic understanding of your warehouse’s resource model.
- Access to query history and cost/billing exports (read-only).
Learning path
- Foundations: query lifecycle and resource models.
- Metrics: define and compute p95 latency, queue vs exec, utilization, cost.
- Dashboards: design top-row KPIs and drilldowns.
- Alerts: baseline, thresholds, routing.
- Optimization: use dashboards to guide tuning and workload isolation.
Next steps
- Implement the “Now” dashboard and share with two teams.
- Introduce labels on scheduled jobs and standardize owners.
- Run a weekly 15-minute review of top cost drivers and slow queries.
Mini challenge
Pick one production hour that had the highest p95 latency last week. In three screenshots or fewer (or equivalent summaries), tell the story: what changed, who was impacted, and your one recommended fix.
Quick Test
Take the quick test below to check your understanding. Progress is saved for logged-in users only.