Why this matters
As a Data Platform Engineer, you balance many competing workloads: BI dashboards hitting the warehouse every few seconds, analysts running ad‑hoc queries, and heavy ELT jobs transforming terabytes. Without intentional concurrency and workload management, one noisy job can slow everything down, blow through SLAs, and spike costs. Your job is to shape traffic, isolate workloads, and keep performance predictable under unpredictable demand.
- Keep dashboards snappy during peak traffic (executive meetings, product launches).
- Ensure ELT finishes on time without starving interactive users.
- Control costs by scaling only when queuing shows real pressure.
- Provide fair access for many teams while honoring SLOs.
Who this is for
- Data Platform and Data Warehouse engineers responsible for shared compute.
- Analytics engineers building ELT that competes with BI traffic.
- Ops/on-call engineers who triage performance incidents.
Prerequisites
- Basic SQL and query execution concepts (scan, join, aggregate).
- Familiarity with your warehouse compute model (clusters/virtual warehouses/slots).
- Comfort reading performance metrics (CPU, memory, queue wait times).
Learning path
- Understand core concepts: concurrency, isolation, admission control, autoscaling.
- Design routing rules: classify workloads and define SLOs.
- Set limits and guardrails: queues, timeouts, quotas.
- Monitor and iterate using signals: queue wait, utilization, spill, p95 latency.
- Practice with worked examples and exercises; take the quick test.
Concept explained simply
Concurrency and workload management is traffic control for your data warehouse. You decide which queries get which lane, how many can run at once, and when to open new lanes. The goals are predictable performance and efficient cost.
Core ideas:
- Workload isolation: run different job types on separate pools so heavy jobs don’t slow light queries.
- Admission control: if resources are busy, queue or reject new work based on priority.
- Autoscaling: add capacity when queues grow, remove it when they drain.
- Fairness and quotas: one team cannot hog everything.
Mental model
Think of your warehouse as lanes on a highway:
- Lanes (compute pools) are sized differently: small for lots of tiny cars (BI), big for trucks (ELT).
- On-ramps (admission control) meter cars into lanes to avoid pileups.
- Toll gates (quotas) stop one fleet from blocking everyone.
- Traffic cams (metrics) show when to open/close lanes (autoscaling).
Deep dive: Admission control
Admission control decides what happens when a query arrives and there are no free execution slots. Options:
- Queue with a max wait time (protects UX for interactive queries).
- Reject fast with a clear message (useful for low-priority batch retries).
- Route to a different pool (overflow) if policy allows.
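To make these options concrete, here is a minimal, warehouse-agnostic admission sketch in Python; the pool names, limits, and overflow condition are illustrative assumptions rather than any vendor's API.
```python
# Minimal admission-control sketch: run, queue, overflow, or reject.
# Pool names and limits are assumptions for illustration only.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Pool:
    name: str
    concurrency_target: int   # queries allowed to run at once
    queue_depth: int          # queries allowed to wait
    queue_timeout_s: float    # max wait before rejection
    running: int = 0
    queued: int = 0

def admit(query_priority: str, pool: Pool, overflow: Optional[Pool] = None) -> str:
    """Decide the fate of an arriving query: run, queue, overflow, or reject."""
    if pool.running < pool.concurrency_target:
        pool.running += 1
        return f"RUN on {pool.name}"
    if pool.queued < pool.queue_depth:
        pool.queued += 1
        return f"QUEUE on {pool.name} (max wait {pool.queue_timeout_s}s)"
    if overflow is not None and query_priority == "interactive":
        return f"OVERFLOW to {overflow.name}"
    return "REJECT: pool saturated, retry later"

bi = Pool("pool_bi", concurrency_target=16, queue_depth=20, queue_timeout_s=2.0)
adhoc = Pool("pool_adhoc", concurrency_target=8, queue_depth=20, queue_timeout_s=30.0)
print(admit("interactive", bi, overflow=adhoc))   # -> "RUN on pool_bi"
```
In practice the warehouse enforces the queue itself; the point is that every arriving query gets exactly one of these four outcomes, and the policy decides which.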
Key levers and terms
- Concurrency target: how many queries can run at once per pool.
- Queue depth/timeout: how many queries can wait, and how long, before rejection.
- Resource class/slot size: per-query memory/CPU share; smaller sizes improve concurrency for tiny queries.
- Autoscaling policy: when to add/remove compute based on queue and utilization.
- Routing rules: map queries to pools using tags (user, role, warehouse, label, path).
- Quotas: per-team caps on concurrent queries or compute-hours.
- SLOs: e.g., dashboards p95 < 3s, ad‑hoc p95 < 30s, ELT finish by 7:00.
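The levers above can be captured in one declarative policy. The sketch below is a hypothetical Python structure, not any vendor's configuration format; every name and number is an assumption to replace with your own.
```python
# Hypothetical policy-as-code: pools, routing, quotas, autoscaling, and SLOs in
# one place. All names and numbers are illustrative assumptions.
WORKLOAD_POLICY = {
    "pools": {
        "pool_bi":    {"slot_size": "S", "concurrency_target": 16, "queue_depth": 20, "queue_timeout_s": 2},
        "pool_adhoc": {"slot_size": "M", "concurrency_target": 8,  "queue_depth": 20, "queue_timeout_s": 30},
        "pool_elt":   {"slot_size": "L", "concurrency_target": 4,  "queue_depth": 40, "queue_timeout_s": 300},
    },
    "routing": [  # first matching rule wins; the last rule is the default
        {"match": {"label": "dashboard"},  "pool": "pool_bi"},
        {"match": {"role": "etl_service"}, "pool": "pool_elt"},
        {"match": {},                      "pool": "pool_adhoc"},
    ],
    "quotas": {"per_team_concurrent_queries": 10, "per_query_max_runtime_s": 3600},
    "autoscaling": {
        "pool_bi":  {"scale_out_when": "p95_queue_wait_s >= 1 for 120s", "max_clusters": 4},
        "pool_elt": {"scale_out_when": "backlog_minutes > minutes_to_deadline", "max_clusters": 2},
    },
    "slos": {"pool_bi": "p95 < 3s", "pool_adhoc": "p95 < 30s", "pool_elt": "finish by 07:00"},
}
```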
Worked examples
Example 1: BI vs. ELT peak
Situation: 150 concurrent dashboard queries and a heavy ELT job start at 8:55. Users report slow dashboards.
Action:
- Isolate: BI pool (small slot size, high concurrency), ELT pool (large slot size, low concurrency).
- Admission: BI queue max wait 2s; ELT queue 5m; ELT low priority.
- Autoscale: BI pool scales out when p95 wait >= 1s for 2 minutes.
Result: BI stays responsive; ELT waits briefly or runs slower without affecting BI.
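The autoscale trigger above ("p95 wait >= 1s for 2 minutes") can be stated precisely. A minimal sketch, assuming you sample a p95 queue-wait metric every few seconds and call a scale-out hook elsewhere:
```python
# Sketch of the BI autoscaling trigger in Example 1: scale out only when the
# p95 queue wait has stayed at or above 1s for a sustained two-minute window.
# The metric feed and the scale-out hook are placeholders.
from collections import deque

WINDOW_S = 120
P95_WAIT_THRESHOLD_S = 1.0
samples: deque = deque()   # (timestamp, p95_queue_wait_s), newest last

def record(ts: float, p95_wait_s: float) -> None:
    samples.append((ts, p95_wait_s))
    while samples and ts - samples[0][0] > WINDOW_S:
        samples.popleft()   # keep only the last two minutes of samples

def should_scale_out(now: float) -> bool:
    # Require (nearly) a full window of data, and every sample over the threshold.
    if not samples or now - samples[0][0] < WINDOW_S * 0.9:
        return False
    return all(wait >= P95_WAIT_THRESHOLD_S for _, wait in samples)
```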
Example 2: High-concurrency tiny queries
Situation: Hundreds of 100ms metadata queries pile up and cause thrashing.
Action:
- Use a smaller per-query resource class to increase the number of parallel sessions.
- Set a 1s timeout and fast-fail retries with jitter (sketched below).
- Batch/caching: consider the result cache to avoid recomputation.
Result: Throughput improves; cost remains controlled.
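The timeout-and-retry step is the part teams most often get wrong, so here is a minimal sketch; it assumes a client call (the placeholder run_metadata_query) that raises TimeoutError when its 1s budget expires.
```python
# Fast-fail retries with jitter for the tiny metadata queries in Example 2.
# run_metadata_query is a placeholder for your actual client call.
import random
import time

def run_with_retries(run_metadata_query, max_attempts: int = 3, timeout_s: float = 1.0):
    """Give each attempt a short budget; back off with full jitter between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_metadata_query(timeout_s=timeout_s)   # fast-fail after 1s
        except TimeoutError:
            if attempt == max_attempts:
                raise
            # Full jitter avoids synchronized retry storms from hundreds of clients.
            time.sleep(random.uniform(0, 0.2 * (2 ** attempt)))
```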
Example 3: Spiky orchestration window
Situation: Airflow triggers 40 ELT tasks at the top of the hour.
Action:
- Set per-DAG concurrency to 5; queue the rest (see the sketch below).
- Stagger start times by 1–2 minutes; enable autoscaling if deadlines require it.
- Cap table-level write concurrency to avoid lock contention.
Result: Smooth resource usage; same completion time with fewer spikes.
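Since Example 3 uses Airflow, here is a hedged sketch of the concurrency cap (assuming Airflow 2.4+). The DAG id, table names, and elt_task body are placeholders; the 1–2 minute stagger would be layered on with sensors or priority weights, which are omitted to keep the sketch short.
```python
# Hedged Airflow sketch of Example 3: max_active_tasks caps the DAG at 5
# concurrently running ELT tasks; the remaining tasks queue. An Airflow pool
# would give a cross-DAG cap instead.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def elt_task(table: str) -> None:
    ...  # placeholder: run the transform / CTAS for one table

with DAG(
    dag_id="hourly_elt",              # assumed name
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
    max_active_tasks=5,               # at most 5 tasks run at once; the other 35 wait
) as dag:
    for n in range(40):
        PythonOperator(
            task_id=f"load_table_{n}",
            python_callable=elt_task,
            op_kwargs={"table": f"table_{n}"},
            execution_timeout=timedelta(minutes=30),   # guardrail for runaway tasks
        )
```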
Step-by-step: Design a basic workload policy
- Step 1 — Classify workloads: Identify BI (interactive), Ad‑hoc (exploratory), ELT (batch), ML feature builds (compute-heavy).
- Step 2 — Define SLOs: BI p95 < 3s; Ad‑hoc p95 < 30s; ELT finish before 7:00.
- Step 3 — Create pools: Pool-BI (small slot, high concurrency), Pool-ADHOC (medium), Pool-ELT (large slot, low concurrency).
- Step 4 — Routing rules: Tag queries by role/schema/label and send to pools. Default to ADHOC with conservative limits.
- Step 5 — Admission control: Set queue depth and timeouts. BI queue short, ELT queue long.
- Step 6 — Autoscaling: Scale BI based on queue wait; scale ELT based on backlog and deadline.
- Step 7 — Quotas and guardrails: Per-team concurrent query caps; per-query max runtime; cancellation for abandoned sessions.
- Step 8 — Monitor and iterate: Review weekly: p95 latency by pool, queue wait, spill, and cost.
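Step 4 is the piece most often left implicit, so here is a hedged routing sketch; the tag keys, values, and pool names are assumptions to replace with your own conventions.
```python
# Step 4 as code: classify a tagged query and return its target pool.
# Tag keys, values, and pool names are illustrative assumptions.
def route_query(tags: dict) -> str:
    """Map a query's tags to a compute pool; unknown traffic defaults to ad-hoc."""
    if tags.get("label") == "dashboard" or tags.get("tool") == "bi":
        return "pool_bi"
    if tags.get("role") == "etl_service" or tags.get("schema", "").startswith("staging_"):
        return "pool_elt"
    if tags.get("workload") == "ml_feature_build":
        return "pool_ml"
    return "pool_adhoc"   # conservative default, per Step 4

assert route_query({"label": "dashboard"}) == "pool_bi"
assert route_query({"role": "analyst"}) == "pool_adhoc"
```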
Signals to trigger a change
- p95 wait > 1s for BI during peak: increase concurrency target or add node.
- CPU < 50% but wait > 1s: slot size too big; shrink per-query resources.
- Frequent spill to disk: increase memory per query or reduce parallelism.
- Backlog growing near ELT deadline: temporarily scale out ELT pool.
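These rules are mechanical enough to encode. A minimal sketch, assuming your monitoring exposes metrics under the (hypothetical) names below:
```python
# The trigger rules above as one evaluation function. Thresholds mirror the list;
# the metric names in the dict are placeholders for whatever your monitoring exposes.
def recommend_action(m: dict) -> str:
    if m["pool"] == "pool_bi" and m["is_peak"] and m["p95_queue_wait_s"] > 1.0:
        return "raise the BI concurrency target or add a node"
    if m["cpu_util"] < 0.50 and m["p95_queue_wait_s"] > 1.0:
        return "shrink the per-query slot size (resources are oversized)"
    if m["spill_gb_per_hour"] > m.get("spill_baseline_gb_per_hour", 0):
        return "increase memory per query or reduce parallelism"
    if m["pool"] == "pool_elt" and m["backlog_minutes"] > m["minutes_to_deadline"]:
        return "temporarily scale out the ELT pool"
    return "no change"
```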
Monitoring and signals
- Queue wait time (admission latency) — leading indicator of contention.
- CPU/memory utilization — ensure added capacity is actually used.
- Spill/shuffle metrics — high spill means memory pressure or skew.
- Throughput (queries/sec) and p50/p95 latency — user-facing impact.
- Cancel/retry rates — admission policy too strict or flaky clients.
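A hedged sketch of turning raw query-history records into these signals; the field names (queue_wait_s, total_s, status) are assumptions about what your warehouse's query history exposes.
```python
# Compute pool-level monitoring signals from a list of query-history records.
# Record fields are assumed, not a specific warehouse's schema.
def percentile(values: list, p: float) -> float:
    """Nearest-rank percentile; adequate for operational dashboards."""
    ranked = sorted(values)
    idx = max(0, int(round(p / 100 * len(ranked))) - 1)
    return ranked[idx]

def pool_signals(records: list) -> dict:
    if not records:
        return {}
    waits = [r["queue_wait_s"] for r in records]
    totals = [r["total_s"] for r in records]
    return {
        "p95_queue_wait_s": percentile(waits, 95),   # leading contention indicator
        "p50_latency_s": percentile(totals, 50),
        "p95_latency_s": percentile(totals, 95),
        "cancel_rate": sum(r["status"] == "cancelled" for r in records) / len(records),
    }
```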
Common mistakes and self-check
- Mistake: One giant pool for everything. Fix: Isolate by workload class.
- Mistake: Scaling without checking queue metrics. Fix: Scale only when queue wait is sustained.
- Mistake: Oversized per-query resources for tiny queries. Fix: Use smaller slot size to raise concurrency.
- Mistake: No timeouts or quotas. Fix: Set max runtime, queue timeout, and per-user caps.
- Mistake: Ignoring data skew. Fix: Address skewed joins/partitions; skew looks like a concurrency problem but is really a handful of slow tasks.
Self-check
- Can you point to the metric proving contention (queue wait) before scaling?
- Do you know exactly which workloads share compute and which are isolated?
- Are SLOs defined and measured per pool?
Exercises
Do these mini tasks, then compare your answers with the solutions.
Exercise 1 (ex1): Build a workload routing plan
Scenario: Your warehouse supports three workloads.
- BI dashboards: p95 <= 3s, up to 200 small queries/sec at peak.
- Hourly ELT: 4 heavy CTAS jobs running 5–10 minutes each.
- Ad‑hoc: Analysts run medium queries (10–60s), bursty.
Compute options: Pool-S (small slot, concurrency target 16), Pool-M (target 8), Pool-L (target 4). Each pool has a queue depth of 20. Define routing, per-pool concurrency, queue timeouts, and autoscaling triggers.
Hint
- Isolate ELT from BI.
- Short queue for BI, longer for ELT.
- Overflow rules prevent rejection during rare spikes.
Exercise 2 (ex2): Triage a 9:00 incident
At 9:00, BI p95 jumps from 1.4s to 7s. Metrics: BI pool CPU at 85%, queue wait at 2.2s, the ELT pool just started 3 heavy jobs, cache-refresh queries are up 3x, and ad‑hoc traffic is steady. Propose 1) the primary cause, 2) an immediate mitigation, 3) a short-term fix, and 4) long-term prevention.
Hint
- Look for who arrived and what changed in BI queue and cache refresh load.
- Use caps to protect BI from surges.
Checklist: before you roll to prod
- Workloads are tagged and routed to the right pools.
- Queue timeouts and per-user caps are set.
- Autoscaling has clear add/remove conditions.
- Dashboards have p95 alerts; ELT has deadline alerts.
- Runbooks exist for BI slowdown and ELT backlog.
Practical projects
- Create a two-pool policy (BI, ELT) with routing by role and measure p95 for a week. Adjust slot size and queues.
- Implement a quota system: cap per-team concurrent queries and track rejections; share a weekly fairness report.
- Build an autoscaling rule that uses queue wait and backlog minutes; verify cost vs. performance before/after.
Mini challenge
Design an overflow rule: when BI queue wait > 1.5s for 120s and Pool-S is at max size, route up to 10% of BI queries to Pool-M with an extra 1s of queue timeout. What safeguards do you add to avoid runaway costs and ensure fairness?
Considerations
- Cap overflow volume and duration.
- Prefer read-only queries for overflow (avoid write hot spots).
- Log and review overflow causes weekly.
Quick test
When ready, open the Quick Test section below.
Next steps
- Automate query tagging in your orchestration and BI tools.
- Add alerts on queue wait and p95 per pool.
- Iterate monthly: re-balance pools based on real usage patterns.