What is Workload Management Strategy?
Workload Management (WLM) is how you classify, prioritize, and control compute, memory, and I/O so the right jobs meet their SLOs while everyone else still gets fair, cost-efficient access. It blends policy (who gets what), controls (admission, concurrency, queues), and feedback (observability, autoscaling).
Why this matters
- Protect business-critical dashboards from being slowed by heavy ad‑hoc queries.
- Guarantee batch deadlines (daily ETL by 6:30 AM) without overpaying for peak capacity all day.
- Keep streaming pipelines stable under spikes with backpressure and isolation.
- Enable multi‑tenant access (teams, partners) with quotas and fairness.
- Control spend with budgets, caps, and right‑sized autoscaling.
Who this is for & Prerequisites
Who this is for
- Data Architects defining platform standards and policies.
- Platform/Infra engineers operating data warehouses, lakes, and stream processors.
- Analytics leaders who need predictable performance and cost control.
Prerequisites
- Basic query engine concepts (concurrency, memory, I/O).
- Familiarity with your platform's resource pools/queues or reservations.
- Understanding of your org's SLAs/SLOs (latency, deadlines, budgets).
Concept explained simply + Mental model
Simple idea: you have limited lanes (compute). Jobs are different vehicle types: ambulances (critical dashboards), buses (batch ETL), scooters (ad‑hoc exploration). A good WLM strategy paints lanes, sets speed limits, and uses ramp meters so emergencies flow, buses arrive on time, and scooters still ride safely.
Mental model: classify → allocate → control → observe → adapt. Start small, protect the most critical path, and iterate with real telemetry.
Core concepts
- Classification: tag workloads (e.g., streaming, batch ETL, ad‑hoc) and tenants (team, product).
- Prioritization: business criticality, deadlines, latency SLOs, data freshness.
- Isolation: resource pools/queues and limits to prevent noisy neighbors.
- Admission control: check before running (quotas, budgets, concurrency caps).
- Scheduling: FIFO, fair share, deadline-aware; preemption for emergencies.
- Backpressure: slow producers instead of letting the system crash.
- Autoscaling: scale up for bursts, scale down to save cost; set floors to protect steady flows.
- Observability: queue time, run time, P95 latency, deadline hit ratio, cost per class.
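To make classification and admission control concrete, here is a minimal Python sketch; the class names, caps, and queue limits are illustrative assumptions, not any particular engine's API:
from dataclasses import dataclass, field
from collections import deque

# Illustrative per-class limits (assumed values; tune for your platform).
LIMITS = {
    "bi":    {"max_concurrency": 20, "queue_max": 100},
    "etl":   {"max_concurrency": 50, "queue_max": 200},
    "adhoc": {"max_concurrency": 10, "queue_max": 50},
}

@dataclass
class ClassState:
    running: int = 0
    queue: deque = field(default_factory=deque)

def classify(job_tags: dict) -> str:
    # Simple tag-based classification; real systems often key off user, role, or query hints.
    if job_tags.get("source") == "dashboard":
        return "bi"
    if job_tags.get("scheduled"):
        return "etl"
    return "adhoc"

def admit(job_id: str, job_tags: dict, state: dict) -> str:
    wl_class = classify(job_tags)
    limits, cls = LIMITS[wl_class], state[wl_class]
    if cls.running < limits["max_concurrency"]:
        cls.running += 1
        return "run"                 # capacity available: start immediately
    if len(cls.queue) < limits["queue_max"]:
        cls.queue.append(job_id)
        return "queue"               # over the cap: wait in the class queue
    return "reject"                  # queue full: shed load instead of overloading

state = {name: ClassState() for name in LIMITS}
print(admit("q1", {"source": "dashboard"}, state))  # -> run
print(admit("q2", {}, state))                       # -> run (adhoc)
In practice the same decision is usually expressed as declarative rules in your warehouse or scheduler rather than application code; the sketch only shows the shape of the check.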
Design framework — step by step
- Catalog workloads
  - Streaming (ingestion, feature pipelines)
  - Batch ETL/ELT (with deadlines)
  - Ad‑hoc analytics (spiky, unpredictable)
  - Dashboards/APIs (latency sensitive)
  - ML training/scoring (bursty, heavy)
- Define SLOs and constraints
  - Latency targets (e.g., dashboards P95 < 3–5s)
  - Deadlines (daily ETL by 06:30)
  - Freshness (streaming lag < 2 min)
  - Budget ceilings (daily credits/hours)
- Map to resource pools (a sizing sketch follows this list)
  - Reserve minimums for steady/critical (streaming, dashboards)
  - Allow shared burst capacity for ad‑hoc
  - Set memory/CPU ratios aligned to job profiles
- Admission & concurrency
  - Per‑class concurrency caps and queue lengths
  - Time‑of‑day rules (e.g., higher batch priority overnight)
  - Size‑based routing (heavy ad‑hoc to a separate pool)
- Autoscaling & protection
  - Floor/ceiling per pool; cool‑downs to avoid flapping
  - Budget guards: slow, degrade, or preempt non‑critical when near budget
  - Circuit breakers: kill runaway queries after thresholds
- Observe & adapt
  - Dashboards: queue wait, utilization, SLO success, cost/class
  - Weekly tuning: change one variable at a time, document impact
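To check the "map to resource pools" step, the sketch below (Python, with assumed pool shares and an assumed 60 vCPU cluster) turns fractional floors/ceilings into concrete vCPU numbers and verifies that reserved minimums do not oversubscribe capacity:
TOTAL_VCPU = 60  # assumed cluster size

# Pool shares expressed as fractions of total capacity (illustrative values).
POOLS = {
    "stream_pool": {"min": 0.20, "max": 0.40},
    "bi_pool":     {"min": 0.30, "max": 0.60},
    "etl_pool":    {"min": 0.20, "max": 0.80},
    "adhoc_pool":  {"min": 0.10, "max": 0.40},
}

def sized_pools(pools: dict, total: int) -> dict:
    # Convert fractional shares into vCPU floors/ceilings.
    return {
        name: {"min_vcpu": round(cfg["min"] * total), "max_vcpu": round(cfg["max"] * total)}
        for name, cfg in pools.items()
    }

def validate(pools: dict) -> None:
    floor_sum = sum(cfg["min"] for cfg in pools.values())
    if floor_sum > 1.0:
        raise ValueError(f"Reserved floors add up to {floor_sum:.0%}; they must not exceed 100%.")
    for name, cfg in pools.items():
        if cfg["min"] > cfg["max"]:
            raise ValueError(f"{name}: floor exceeds ceiling.")

validate(POOLS)
for name, size in sized_pools(POOLS, TOTAL_VCPU).items():
    print(name, size)   # e.g. bi_pool {'min_vcpu': 18, 'max_vcpu': 36}
Ceilings may legitimately sum to more than 100% because pools borrow shared burst capacity; only the floors must fit within the cluster.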
Worked examples
Example 1: Protect BI dashboards from ad‑hoc spikes
- Classes: BI (critical), Ad‑hoc (low), ETL (medium)
- Pools: BI reserved 30% min, burst to 60%; Ad‑hoc 10–40%; ETL 20–60% but low during business hours
- Admission: BI concurrency 20, Ad‑hoc 10 with queue length 50 and max 10 min wait
- Policy: Preempt Ad‑hoc if BI P95 > 5s for 2 min
- Result: BI latency stabilized; ad‑hoc waits a bit at peak
Example 2: Batch deadline at 06:30 without overprovisioning
- Window: 03:00–06:00 batch high priority; BI low
- Autoscaling: batch pool scales up to 80% ceiling only in window
- Admission: heavy ad‑hoc held or down‑routed during window
- Result: Deadline met and daytime costs reduced
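A minimal sketch of Example 2's time-window rule, assuming a simple local-clock check; most schedulers would express this as a calendar or cron-style rule instead:
from datetime import datetime, time

BATCH_WINDOW = (time(3, 0), time(6, 0))  # assumed 03:00-06:00 batch window

def in_batch_window(now: datetime) -> bool:
    # Works for windows that do not cross midnight.
    start, end = BATCH_WINDOW
    return start <= now.time() < end

def pool_ceiling(pool: str, now: datetime) -> float:
    # Raise the ETL ceiling only inside the window; keep it modest during the day.
    if pool == "etl_pool":
        return 0.8 if in_batch_window(now) else 0.2
    return 0.6 if pool == "bi_pool" else 0.4

print(pool_ceiling("etl_pool", datetime(2024, 1, 15, 4, 30)))   # 0.8 inside the window
print(pool_ceiling("etl_pool", datetime(2024, 1, 15, 14, 0)))   # 0.2 during business hours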
Example 3: Streaming stability under producer surge
- Streaming pool: fixed floor (20%) with burst to 40%
- Backpressure: throttle ingestion if lag > 2 min; drop non‑critical enrichment
- Result: No crash; controlled lag that recovers post‑surge
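A rough sketch of Example 3's backpressure rule: throttle ingestion in proportion to how far lag exceeds the freshness SLO, and drop optional enrichment first. The base rate and the cap on the reduction factor are assumptions:
MAX_LAG_SEC = 120            # freshness SLO: streaming lag under 2 minutes
BASE_RATE_EVENTS_SEC = 10_000

def ingestion_plan(current_lag_sec: float) -> dict:
    # Return a throttle decision based on observed consumer lag.
    if current_lag_sec <= MAX_LAG_SEC:
        return {"rate": BASE_RATE_EVENTS_SEC, "enrichment": True}
    # Over the SLO: shed optional work first, then slow producers.
    overload = min(current_lag_sec / MAX_LAG_SEC, 4.0)   # cap the reduction factor
    throttled = int(BASE_RATE_EVENTS_SEC / overload)
    return {"rate": throttled, "enrichment": False}

print(ingestion_plan(60))    # within SLO: full rate, enrichment on
print(ingestion_plan(360))   # 3x over SLO: roughly a third of the rate, enrichment dropped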
Configuration patterns (pseudo)
{
  "classes": [
    {"name": "bi", "priority": "high", "slo": {"p95_ms": 5000}},
    {"name": "etl", "priority": "medium", "slo": {"deadline": "06:30"}},
    {"name": "adhoc", "priority": "low"},
    {"name": "stream", "priority": "high", "slo": {"max_lag_sec": 120}}
  ],
  "pools": [
    {"name": "bi_pool", "min": 0.3, "max": 0.6, "concurrency": 20},
    {"name": "etl_pool", "min": 0.2, "max": 0.8, "concurrency": 50, "time_window": "03:00-06:00"},
    {"name": "adhoc_pool", "min": 0.1, "max": 0.4, "concurrency": 10, "queue_max": 50},
    {"name": "stream_pool", "min": 0.2, "max": 0.4}
  ],
  "policies": {
    "preempt": [{"if": "bi_p95_ms > 5000 for 120s", "then": "pause adhoc oldest 10"}],
    "budget_guards": [{"class": "adhoc", "daily_budget_units": 100, "action": "degrade_then_hold"}],
    "circuit_breakers": [{"match": "query_runtime > 30m", "action": "kill"}]
  }
}
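To illustrate how the preempt policy above could be evaluated, here is a hedged Python sketch that watches a rolling window of BI P95 samples and names the oldest ad‑hoc queries to pause once the threshold has been breached for a sustained 120 seconds. The sampling interval, metric source, and pause action are placeholders:
from collections import deque

THRESHOLD_MS = 5000
SUSTAIN_SEC = 120
SAMPLE_INTERVAL_SEC = 10      # assumed metrics scrape interval
PAUSE_BATCH = 10

class PreemptPolicy:
    def __init__(self):
        # Keep just enough samples to cover the sustain window.
        self.samples = deque(maxlen=SUSTAIN_SEC // SAMPLE_INTERVAL_SEC)

    def observe(self, bi_p95_ms: float, running_adhoc: list) -> list:
        # Record a sample; return ad-hoc query IDs to pause (oldest first).
        self.samples.append(bi_p95_ms)
        window_full = len(self.samples) == self.samples.maxlen
        if window_full and all(s > THRESHOLD_MS for s in self.samples):
            self.samples.clear()                 # reset so the rule does not re-fire every tick
            return running_adhoc[:PAUSE_BATCH]   # oldest 10, assuming the list is ordered by start time
        return []

policy = PreemptPolicy()
adhoc = [f"q{i}" for i in range(25)]
for p95 in [6200] * 12:                          # 12 samples x 10s = 120s over threshold
    to_pause = policy.observe(p95, adhoc)
print(to_pause)                                  # ['q0', ..., 'q9']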
Self‑check checklist
- Have you explicitly listed all workload classes and their SLOs?
- Do critical workloads have reserved capacity floors?
- Are there concurrency caps and queue limits for spiky classes?
- Is there a time‑based policy for batch windows?
- Is there backpressure or shedding for streaming under surge?
- Do you track queue wait, P95 latency, deadline hit ratio, and cost per class?
Exercises
Work through these on your own first, then check your plans against the criteria below.
Exercise 1: Classify and allocate
Design a WLM plan for three workloads: real‑time ingestion, nightly batch ETL (03:00–06:00, deadline 06:30), and ad‑hoc queries. Assume 60 vCPU total; you may create three pools with min/max shares and concurrency caps. Include admission rules and any autoscaling or preemption.
Exercise 2: Handle a spike without breaking BI SLOs
Quarter‑end triples ad‑hoc volume. Keep BI P95 <= 5s during 09:00–18:00. Propose caps, queueing, and preemption/cost guard policies.
A good plan meets these criteria:
- Your plan protects critical workloads with floors.
- Batch windows use time‑based priority or scaling.
- Ad‑hoc has a concurrency cap and queue limit.
- There is a clear preemption or degradation rule.
- Metrics to verify success are defined.
Common mistakes and how to self‑check
- No explicit SLOs: If you can’t name P95 targets or deadlines, you can’t tune. Write them down first.
- Only scaling up: Without caps and admission control, costs balloon and latency still spikes. Add ceilings and queues.
- Shared everything: No isolation means noisy neighbors. Create pools and quotas.
- Ignoring time‑of‑day: Batch may steal capacity during business hours. Use schedules.
- One‑time tuning: Metrics drift; schedule weekly reviews. Change one parameter at a time.
Practical projects
- Build a WLM policy file for your platform with three classes (BI, ETL, Ad‑hoc), including floors, ceilings, and caps.
- Create a dashboard showing queue wait, P95 latency, deadline hit ratio, and cost per class over time.
- Simulate a spike (e.g., submit 5x ad‑hoc load) and measure BI latency before/after WLM changes.
- Implement a budget guard that degrades Ad‑hoc after a daily threshold.
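For the last project, a minimal budget-guard sketch; the budget units, thresholds, and actions are assumptions you would map to your platform's own metering and admission hooks:
DAILY_BUDGET_UNITS = 100     # assumed daily allowance for the Ad-hoc class
DEGRADE_AT = 0.8             # start degrading at 80% of budget

class BudgetGuard:
    def __init__(self, budget: float):
        self.budget = budget
        self.spent = 0.0

    def record(self, units: float) -> str:
        # Add spend for a finished job and return the action for new submissions.
        self.spent += units
        if self.spent >= self.budget:
            return "hold"        # stop admitting new Ad-hoc work until the daily reset
        if self.spent >= DEGRADE_AT * self.budget:
            return "degrade"     # e.g., route to a smaller pool or lower priority
        return "allow"

guard = BudgetGuard(DAILY_BUDGET_UNITS)
for cost in [30, 30, 25, 20]:
    print(guard.record(cost))    # allow, allow, degrade, hold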
Learning path
- Define workload classes and SLOs.
- Introduce resource pools with floors/ceilings.
- Add admission control and queue limits.
- Implement autoscaling and time‑based policies.
- Introduce preemption and budget guards.
- Instrument observability and run weekly tuning.
Next steps
- Tune concurrency and memory for top 5 heavy queries/pipelines.
- Add backpressure paths for streaming sources.
- Document your WLM playbook so teams know expectations and request paths.
Mini challenge
You have: dashboards (P95 3s), ad‑hoc, and daily ETL (06:30 deadline). Today, BI P95 spikes to 9s at noon when 20 ad‑hoc queries run. In 3–4 bullet points, propose WLM changes to fix it without adding all‑day capacity.
Sample answer:
- Reserve BI floor (30%), cap Ad‑hoc concurrency at 8, queue length 40.
- Define heavy‑query routing (rows > 1e8) to a separate low‑priority pool.
- Preempt oldest Ad‑hoc if BI P95 > 5s for 2 minutes.
- Autoscale BI pool up to 60% only during 11:30–13:30.