Why this matters
In real data warehouses, multiple workloads run at once: ELT backfills, scheduled reports, ad-hoc exploration, and machine learning feature generation. Without concurrency control and workload isolation, a single heavy job can slow or break everything else. As a Data Engineer, you must design your warehouse so each workload gets predictable performance and won’t be starved or blocked.
- Keep dashboards fast during peak hours while nightly loads still complete on time.
- Run backfills safely without corrupting data or locking out analysts.
- Limit “noisy neighbor” effects from ad-hoc queries.
Concept explained simply
Concurrency is how many queries can run at the same time. Workload isolation is making sure those queries don’t hurt each other. Think of your warehouse like a kitchen: multiple cooks (workloads) share tools (compute, IO, memory). Isolation is giving each cook their own station and schedule so everyone can deliver dishes on time.
Mental model
- Separate lanes: Put workloads into lanes (compute clusters/queues) with limits and priorities.
- Fair rules: Set quotas, concurrency caps, and timeouts so one lane can’t take everything.
- Safe snapshots: Readers see a stable snapshot while writers make changes (MVCC/snapshot isolation).
- Predictable scale: Scale lanes up/down independently; add burst capacity if supported.
Core building blocks
- Isolation levels: read committed, snapshot isolation (MVCC), and serializable. Under snapshot isolation, readers and writers do not block each other.
- Resource isolation: Separate compute pools/virtual warehouses/reservations/queues per workload.
- Priorities and quotas: Weights, max concurrency per group, per-user caps, and query timeouts.
- Scheduling windows: Heavy ETL at off-peak; BI peak protected by caps.
- Backpressure and retries: When limits are hit, queue or throttle rather than fail the system.
- Idempotency: Make jobs safe to retry so you can use queues and backoff without data corruption (see the sketch after this list).
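For example, a delete-then-insert pair keyed on a partition makes a load idempotent: rerunning the whole transaction after a failure leaves the table in exactly the same state. A minimal sketch, using hypothetical fact_sales and staging_sales tables and generic SQL (exact transaction syntax varies by engine):
BEGIN TRANSACTION;
-- Remove anything a previous, partially failed attempt wrote for this partition
DELETE FROM fact_sales
WHERE dt = '2025-11-01';
-- Rebuild the partition from staging; a retry produces exactly the same rows
INSERT INTO fact_sales (dt, store_id, revenue)
SELECT dt, store_id, SUM(amount)
FROM staging_sales
WHERE dt = '2025-11-01'
GROUP BY dt, store_id;
COMMIT;
When deleting is impractical, an upsert (MERGE) keyed on the natural key gives the same retry safety.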
Worked examples
Example 1: Split ELT and BI with separate compute lanes
Goal: Keep dashboards snappy 8:00–12:00 while ELT runs hourly.
Workloads:
- BI (dashboards/SQL for analysts)
- ELT (hourly loads + dimensions)
Plan:
- Create two compute pools/queues: BI_POOL and ELT_POOL
- BI_POOL: priority high, max concurrency 40, per-user cap 4, autoscale 2–6
- ELT_POOL: priority medium, max concurrency 15, autoscale 1–4
- During 08:00–12:00, enforce BI_POOL min 3 nodes and cap ELT_POOL at 2 nodes
- Query timeout: 5 min in BI_POOL; 30 min in ELT_POOL (see the sketch below)
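To make the split concrete, here is one way it could look in Snowflake-style DDL; the names, sizes, and cluster counts are placeholders, other engines expose similar knobs differently (Redshift WLM queues, BigQuery reservations), multi-cluster autoscaling requires an edition that supports it, and priority plus per-user concurrency caps usually need separate, engine-specific settings:
-- BI lane: protect dashboard latency during peak hours
CREATE WAREHOUSE IF NOT EXISTS BI_POOL
  WAREHOUSE_SIZE = 'MEDIUM'
  MIN_CLUSTER_COUNT = 2
  MAX_CLUSTER_COUNT = 6                  -- autoscale 2-6 clusters
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE
  STATEMENT_TIMEOUT_IN_SECONDS = 300;    -- 5-minute cap per query
-- ELT lane: capped so load spikes cannot crowd out BI
CREATE WAREHOUSE IF NOT EXISTS ELT_POOL
  WAREHOUSE_SIZE = 'LARGE'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4                  -- autoscale 1-4 clusters
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE
  STATEMENT_TIMEOUT_IN_SECONDS = 1800;   -- 30-minute cap for loads
-- During 08:00-12:00, a scheduled task could ALTER WAREHOUSE to raise BI_POOL's
-- MIN_CLUSTER_COUNT to 3 and lower ELT_POOL's MAX_CLUSTER_COUNT to 2.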
Result: BI stays responsive even when ELT spikes.
Example 2: Safe backfill with snapshot isolation
Goal: Recompute last 90 days of facts without blocking analysts.
-- Pseudo-steps
-- 1) Run the backfill on a dedicated low-priority pool, BACKFILL_POOL.
-- 2) Use snapshot isolation so reads stay consistent while writes batch up.
-- 3) Write in small chunks with retries.
BEGIN TRANSACTION ISOLATION LEVEL SNAPSHOT;
-- chunked backfill
DELETE FROM facts WHERE dt BETWEEN '2025-10-01' AND '2025-12-31';
INSERT INTO facts
SELECT ... FROM staging
WHERE dt BETWEEN '2025-10-01' AND '2025-12-31';
COMMIT;
-- Readers saw a stable snapshot throughout
Result: Analysts keep querying; the backfill commits atomically.
Example 3: Taming ad-hoc exploration
Analysts run unpredictable queries that can scan far more data than intended.
Plan:
- Route ad-hoc queries to EXPLORE_POOL with max concurrency 10 and a per-user cap of 2.
- Set query timeouts to 3–5 min and limit scan bytes (or slots) per query.
- Enable the result cache for repeated queries; require filters on large tables via query rules (see the sketch below).
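One illustrative configuration, again in Snowflake-style DDL with placeholder names; scan-byte limits are engine-specific (for example, BigQuery can cap bytes billed per query), and a per-user concurrency cap of 2 typically comes from the query tool or governance layer rather than the warehouse itself:
-- Ad-hoc exploration lane: small, capped, fails fast instead of queueing forever
CREATE WAREHOUSE IF NOT EXISTS EXPLORE_POOL
  WAREHOUSE_SIZE = 'SMALL'
  MAX_CONCURRENCY_LEVEL = 10                  -- limit statements running at once; extras queue
  STATEMENT_TIMEOUT_IN_SECONDS = 300          -- 5-minute timeout
  STATEMENT_QUEUED_TIMEOUT_IN_SECONDS = 120;  -- give up after 2 minutes in the queue
-- Tighter timeout for one particularly heavy (hypothetical) user
ALTER USER heavy_analyst SET STATEMENT_TIMEOUT_IN_SECONDS = 180;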
Result: Exploration remains available but cannot starve core workloads.
How to choose an isolation strategy
Step 1 — Identify workloads
List BI, ELT, ad-hoc, data science, backfills. Note peak times and SLAs.
Step 2 — Assign lanes
Give each workload a dedicated pool/queue with autoscaling and caps.
Step 3 — Set priorities and limits
Define concurrency caps, per-user limits, and timeouts. Protect BI peaks.
Step 4 — Choose isolation level
Default to snapshot isolation. Use serializable for critical consistency steps.
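The syntax for requesting a level is engine-specific. As one example, PostgreSQL implements snapshot-style behavior under the name REPEATABLE READ (SQL Server calls it SNAPSHOT), and serializable is requested explicitly; table names here are hypothetical:
-- PostgreSQL-style syntax; keywords differ by engine
BEGIN ISOLATION LEVEL REPEATABLE READ;   -- snapshot-style: one stable view for the whole transaction
SELECT SUM(revenue) FROM fact_sales WHERE dt >= '2025-11-01';
COMMIT;
BEGIN ISOLATION LEVEL SERIALIZABLE;      -- strongest guarantees for a critical read-modify-write
UPDATE inventory SET on_hand = on_hand - 5 WHERE sku = 'ABC-123';
COMMIT;
Engines that enforce serializability optimistically may abort such transactions with a serialization error under contention, so pair them with retries.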
Step 5 — Prove it with SLOs
Track p95 latency for BI and duration for ELT. Alert when breached.
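Most warehouses expose query history you can turn into these SLOs. A sketch against Snowflake's ACCOUNT_USAGE.QUERY_HISTORY view (other engines have analogous system tables); the warehouse name is a placeholder:
-- Hourly p95 latency and average queue wait for the BI lane over the last day
SELECT
  DATE_TRUNC('hour', start_time)                       AS hour,
  APPROX_PERCENTILE(total_elapsed_time / 1000, 0.95)   AS p95_latency_s,
  AVG(queued_overload_time / 1000)                     AS avg_queue_wait_s
FROM snowflake.account_usage.query_history
WHERE warehouse_name = 'BI_POOL'
  AND start_time >= DATEADD('day', -1, CURRENT_TIMESTAMP())
GROUP BY 1
ORDER BY 1;
Alert when p95 stays above the SLO for a few consecutive windows rather than on a single spike.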
Practical projects
- Project A: Create two pools (BI and ELT), set autoscaling ranges, priorities, and per-user caps. Run a synthetic load test; record BI p95 latency before/after.
- Project B: Implement a backfill framework with chunking, retries, and snapshot isolation. Verify zero BI slowdown.
- Project C: Build a workload classifier (tags/labels) that routes queries to pools based on user group or SQL patterns.
Exercises
Try these on your own before checking any answers.
Exercise 1 — Design a 3-lane WLM plan
Design a plan for BI dashboards (peak 08:00–12:00), ELT (hourly), and Data Science ad-hoc (spiky). Set pools, priorities, concurrency limits, autoscaling, per-user caps, and timeouts.
Exercise 2 — Backfill with isolation and retries
Write a pseudo-transaction script to backfill 60 days safely using snapshot isolation. Include chunking and retry/backoff, and ensure BI runs unaffected.
Checklist
- [ ] Each workload has its own pool/queue or reservation.
- [ ] Peak windows and SLAs are defined and protected.
- [ ] Concurrency caps and timeouts prevent noisy neighbors.
- [ ] Backfills are chunked, idempotent, and retriable.
- [ ] Monitoring is in place: p95/p99 latency for BI, ELT duration, queue wait time.
Common mistakes and self-checks
- Mistake: One giant shared pool for everything. Fix: Separate pools with caps and priorities.
- Mistake: Backfills during business hours on BI pool. Fix: Dedicated low-priority pool.
- Mistake: Ignoring isolation levels. Fix: Use snapshot isolation for mixed read/write.
- Mistake: No per-user caps for ad-hoc. Fix: Set per-user concurrency and timeouts.
- Mistake: No observability. Fix: Track queue wait, spill, and latency SLOs.
Self-check: Is BI protected?
During BI peak, verify BI pool minimum compute is enforced and ELT pool cannot scale above its cap. p95 dashboard latency should not degrade >20% under load.
Mini challenge
Your ELT job occasionally spikes CPU and IO, slowing dashboards at 09:30. In one paragraph, propose changes using only pool splits, priorities, and caps. Bonus: add a simple SLO and alert condition.
Who this is for
- Data Engineers and Analytics Engineers managing shared warehouses.
- Platform/Infra folks owning data platform SLOs.
- Senior Analysts running heavy ad-hoc queries responsibly.
Prerequisites
- Comfort with SQL and transactions (BEGIN/COMMIT, isolation levels).
- Basic understanding of your warehouse scaling model (compute pools/queues/reservations).
- Familiarity with scheduled jobs and monitoring metrics.
Learning path
- Start: Basics of warehouse compute and storage.
- Then: Isolation levels (read committed vs snapshot vs serializable).
- Next: Workload management—pools, quotas, autoscaling, priorities.
- Finally: SLOs and monitoring for workloads; incident runbooks.
Next steps
- Implement a two-pool split in a sandbox and validate SLOs.
- Refactor backfills to be chunked and idempotent.
- Add per-user caps and timeouts for ad-hoc users.