
Governance For Shared Compute

Learn Governance For Shared Compute for free with explanations, exercises, and a quick test (for Data Platform Engineers).

Published: January 11, 2026 | Updated: January 11, 2026

Why this matters

Shared compute is the backbone of most data platforms. Multiple teams run ELT jobs, BI dashboards, ad‑hoc analysis, and machine learning on the same finite capacity. Without clear governance, one heavy job can slow down your dashboards, costs can spike unexpectedly, and SLOs get missed. Good governance protects performance, cost, and fairness.

  • Real task: Keep BI dashboards snappy during month-end loads.
  • Real task: Cap spend for ad‑hoc queries without blocking critical pipelines.
  • Real task: Route long-running queries away from latency-sensitive pools.

Concept explained simply

Governance for shared compute is a set of policies and guardrails that decide who can run what, where, and how much. Think of lanes on a highway: a fast lane for dashboards, a truck lane for big batch jobs, and rules to keep traffic flowing.

Mental model

Use a 3-layer mental model:

  1. Identify workloads: BI, ELT, Ad‑hoc, ML.
  2. Assign lanes (compute pools/warehouses): small/fast, medium, large/batch.
  3. Apply rules: priorities, quotas, timeouts, concurrency, scaling, and cost caps.
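
The three layers can be sketched as a small routing table. This is a minimal illustration in Python, assuming made-up pool names and limits; your platform's own workload-management or warehouse settings would replace the plain dictionaries here.

  # Layer 1: workload classes; Layer 2: lanes (pools); Layer 3: rules per lane.
  # All names and numbers below are illustrative, not tied to any specific platform.
  LANES = {
      "bi":    {"pool": "BI-Fast",    "priority": "high",   "timeout_s": 120,  "max_concurrency": 20},
      "elt":   {"pool": "ELT-Batch",  "priority": "medium", "timeout_s": 3600, "max_concurrency": 8},
      "adhoc": {"pool": "Adhoc-Lite", "priority": "low",    "timeout_s": 300,  "max_concurrency": 5},
      "ml":    {"pool": "ML-Train",   "priority": "low",    "timeout_s": 7200, "max_concurrency": 2},
  }

  def route(job_tags):
      # Pick a lane from the workload tag; unknown workloads fall back to ad-hoc.
      workload = job_tags.get("workload", "adhoc")
      return LANES.get(workload, LANES["adhoc"])

  # A tagged BI query lands in the fast lane with a strict timeout.
  print(route({"workload": "bi", "owner": "analytics"}))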

Core building blocks of shared compute governance

  • Workload classes: Labels that describe intent (BI realtime, ELT batch, Ad‑hoc exploration, ML training).
  • Resource pools/warehouses: Isolated compute groups with defined size, scaling rules, and concurrency limits.
  • Admission control: Who can start a query, how many can run in parallel, and what gets queued or rejected.
  • Priorities and routing: Map workloads to the right pool with appropriate priority.
  • Limits and guardrails: Timeouts, memory/slot caps, row/byte scan caps, result size caps.
  • Scheduling windows: When heavy jobs are allowed (e.g., nightly), and blackout periods.
  • Cost governance: Quotas per team/project, daily caps, and alerts.
  • Tagging and lineage: Tags for cost center, environment, data sensitivity, workload type.
  • RBAC and least privilege: Who can use which pool and change its settings.
  • Monitoring: Dashboards for queue times, run times, cancellations, errors, and spend; simple SLOs (e.g., p95 BI latency < 3s).
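
Several of these building blocks can be captured in a single pool definition plus an admission check. A sketch with invented field names and numbers; real platforms expose equivalents as warehouse, resource-class, or queue settings.

  from dataclasses import dataclass, field

  @dataclass
  class Pool:
      """One lane of shared compute and its guardrails (illustrative fields only)."""
      name: str
      min_size: int
      max_size: int                # scaling ceiling
      max_concurrency: int         # admission control: queries allowed to run at once
      queue_limit: int             # queued queries before new ones are rejected
      timeout_s: int               # per-query timeout
      scan_cap_gb: float           # per-query bytes-scanned cap
      daily_cost_cap: float        # cost governance, in compute units
      allowed_roles: set = field(default_factory=set)  # RBAC: who may use the pool

  def admit(pool, running, queued, role):
      """Admission control: decide whether a new query runs, queues, or is rejected."""
      if pool.allowed_roles and role not in pool.allowed_roles:
          return "reject: role not allowed on this pool"
      if running < pool.max_concurrency:
          return "run"
      if queued < pool.queue_limit:
          return "queue"
      return "reject: pool saturated"

  bi_fast = Pool("BI-Fast", 1, 4, 20, 40, 120, 50.0, 300.0, {"bi_reader"})
  print(admit(bi_fast, running=20, queued=10, role="bi_reader"))  # -> "queue"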

Worked examples

Example 1 — Keep dashboards fast during ELT

Goal: Dashboards must stay responsive even when nightly ELT runs.

  • Create pools:
    • BI-Fast: Small/auto-scale, high priority, short timeout (e.g., 120s), strict scan cap per query.
    • ELT-Batch: Large, medium priority, long timeout (e.g., 60m), high concurrency but scheduled mainly at night.
  • Routing policy:
    • BI queries with tag workload=bi route to BI-Fast.
    • ELT jobs with tag workload=elt route to ELT-Batch.
  • Guardrails:
    • Any query on BI-Fast exceeding 120s is cancelled, with a suggestion to rerun it on ELT-Batch.
    • Ad‑hoc users default to a separate pool (Adhoc-Lite) with a daily cost cap.
  • Outcome: BI stays snappy; ELT continues without blocking.
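
The 120-second guardrail from this example could look roughly like the sketch below. The cancel_query hook is a stand-in; most warehouses enforce the same rule natively through a statement-timeout setting.

  BI_FAST_TIMEOUT_S = 120

  def enforce_bi_timeout(query_id, elapsed_s, cancel_query):
      """Cancel an over-limit BI-Fast query and point the user at the batch lane."""
      if elapsed_s <= BI_FAST_TIMEOUT_S:
          return None  # within the guardrail, nothing to do
      cancel_query(query_id)
      return (f"Query {query_id} exceeded {BI_FAST_TIMEOUT_S}s on BI-Fast and was cancelled. "
              "Please rerun it on ELT-Batch (tag workload=elt).")

  # Example with a no-op cancel hook standing in for the platform API:
  print(enforce_bi_timeout("q-123", elapsed_s=410, cancel_query=lambda qid: None))
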
Example 2 — Cost cap with graceful degradation

Goal: Keep ad‑hoc exploration within a daily budget without stopping work mid‑day.

  • Set an Adhoc-Lite daily quota (e.g., 300 compute units). When 80% is reached, notify owners; at 100%, switch to Adhoc-Slow (smaller, queue-heavy) instead of a hard stop.
  • Long-running ad‑hoc queries auto-reroute to the Batch-Overflow pool after 5 minutes of run time.
  • Outcome: Budget protection with a slower but available path after cap.
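
A minimal sketch of this graceful-degradation rule, using the thresholds from the example (notify at 80%, reroute at 100%); the pool names are the ones defined above.

  DAILY_QUOTA_UNITS = 300  # Adhoc-Lite daily quota from the example

  def adhoc_pool_for_today(spent_units):
      """Route new ad-hoc queries based on how much of today's quota is already used."""
      used = spent_units / DAILY_QUOTA_UNITS
      if used >= 1.0:
          return {"pool": "Adhoc-Slow", "notify": True,
                  "message": "Daily ad-hoc quota reached; new queries run on the slower queued pool."}
      if used >= 0.8:
          return {"pool": "Adhoc-Lite", "notify": True,
                  "message": f"{used:.0%} of today's ad-hoc quota used."}
      return {"pool": "Adhoc-Lite", "notify": False, "message": ""}

  print(adhoc_pool_for_today(250))  # ~83% used: still Adhoc-Lite, owners notified
  print(adhoc_pool_for_today(310))  # over quota: degrade to Adhoc-Slow instead of a hard stop
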
Example 3 — Tagging and chargeback

Goal: Attribute cost and performance to teams to drive accountability.

  • Mandatory tags: cost_center, owner, workload, environment.
  • Dashboards show spend by tag plus queue times and cancellations.
  • Monthly showback to teams; if BI p95 latency degrades, review their query patterns and pool usage.
  • Outcome: Visibility enables teams to optimize proactively.
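
Both the tag requirement and the showback roll-up fit in a few lines. The tag names are the ones listed above; the job-record structure is illustrative.

  from collections import defaultdict

  REQUIRED_TAGS = {"cost_center", "owner", "workload", "environment"}

  def missing_tags(job_tags):
      """Return the mandatory tags a job is missing; an empty list means it may run."""
      return sorted(REQUIRED_TAGS - job_tags.keys())

  def spend_by_cost_center(job_records):
      """Roll up spend per cost_center for the monthly showback report."""
      totals = defaultdict(float)
      for rec in job_records:
          totals[rec["tags"].get("cost_center", "untagged")] += rec["cost_units"]
      return dict(totals)

  print(missing_tags({"owner": "data-science", "workload": "adhoc"}))  # -> ['cost_center', 'environment']
  print(spend_by_cost_center([
      {"tags": {"cost_center": "marketing"}, "cost_units": 12.5},
      {"tags": {}, "cost_units": 3.0},
  ]))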

How to design your policy (step-by-step)

  1. Inventory workloads: List top pipelines, BI dashboards, ad‑hoc users, and ML jobs. Capture SLOs and timing.
  2. Define tiers (lanes): For example, BI-Fast, Adhoc-Lite, ELT-Batch, Batch-Overflow.
  3. Set admission rules: Concurrency limits, queue behavior, and who can access each pool.
  4. Add guardrails: Timeouts, scan/slot caps, and memory limits per workload class.
  5. Plan scaling: Auto-suspend/auto-resume, min/max size, and scale-out triggers.
  6. Establish cost controls: Daily quotas per pool/team and overage behavior (throttle/reroute).
  7. Standardize tags: Enforce required tags on jobs and queries.
  8. Create SLOs and alerts: e.g., BI p95 < 3s, ELT completes by 06:00. Alert on breaches and 80% budget use.
  9. Document runbook: How to unblock queues, raise temp limits, and communicate incidents.
  10. Review monthly: Adjust tiers, limits, and quotas based on usage and incidents.
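
As a sketch of step 8, the SLOs and alerts can be evaluated straight from platform metrics. The thresholds are the examples above; the percentile is a simple nearest-rank calculation.

  import math

  def p95(latencies_s):
      """Nearest-rank 95th percentile of query latencies, in seconds."""
      ranked = sorted(latencies_s)
      return ranked[max(0, math.ceil(0.95 * len(ranked)) - 1)]

  def slo_alerts(bi_latencies_s, elt_done_by_0600, budget_used_fraction):
      """Return alert messages for the SLOs and budget threshold from step 8."""
      alerts = []
      if p95(bi_latencies_s) >= 3.0:
          alerts.append("BI p95 latency SLO breached (>= 3s)")
      if not elt_done_by_0600:
          alerts.append("ELT missed the 06:00 completion deadline")
      if budget_used_fraction >= 0.8:
          alerts.append(f"Budget at {budget_used_fraction:.0%} of the daily quota")
      return alerts

  print(slo_alerts([1.2, 2.8, 3.4, 0.9, 2.1], elt_done_by_0600=True, budget_used_fraction=0.85))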

Operational runbook (common tasks)

Unblock a BI incident
  • Check BI-Fast queue time and cap breaches.
  • Temporarily increase BI-Fast concurrency or scale out within bounds.
  • Identify top offenders; move heavy queries to ELT-Batch or Adhoc-Lite with a friendly message.
  • Log the change and set a 24h reminder to revert.
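
For the last step, a tiny sketch of recording a temporary override together with its revert time; the print call is a stand-in for whatever change log or ticketing system your team already uses.

  from datetime import datetime, timedelta, timezone

  def temporary_override(pool, setting, old, new, revert_after_h=24):
      """Record a temporary limit change so it can be audited and reverted on schedule."""
      now = datetime.now(timezone.utc)
      change = {
          "pool": pool, "setting": setting, "old": old, "new": new,
          "applied_at": now.isoformat(),
          "revert_at": (now + timedelta(hours=revert_after_h)).isoformat(),
      }
      print(f"CHANGE LOG: {change}")  # stand-in for your change log / ticket system
      return change

  temporary_override("BI-Fast", "max_concurrency", old=20, new=30)
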
Handle budget overage
  • At 80% spend, notify owners; suggest Adhoc-Slow for non-urgent work.
  • At 100%, enforce the pre-defined reroute or throttle (no surprises).
  • Find untagged jobs and fix their tags.
Onboard a new team
  • Assign default pool (Adhoc-Lite), set initial daily quota.
  • Provide tagging guide and examples.
  • Schedule a 2-week check-in to right-size limits.

Exercises

Do these to practice. A quick test is available to everyone; only logged-in users get saved progress.

Exercise 1 — Classify and allocate

Scenario: You have three workloads: (A) Daily ELT from 01:00–05:00 needs to finish by 06:00, (B) BI dashboards 07:00–20:00 with p95 < 3s, (C) Ad‑hoc analysts weekdays 09:00–18:00 with a small budget.

  • Create 3 pools with names, size/scale, concurrency, and timeouts.
  • Define routing rules and daily quotas.
  • Write 2 guardrails to protect BI.
Checklist
  • Each workload has a pool.
  • BI has strict timeout.
  • Ad‑hoc has a quota and overage behavior.
  • ELT has nighttime window.

Exercise 2 — Preemption and rerouting

Scenario: BI p95 latency degraded yesterday. Investigation shows several 10+ minute queries ran in the BI pool.

  • Define a rule to auto-cancel or reroute long BI queries.
  • Define a communication message to users.
  • Propose a metric and alert to catch this early.
Checklist
  • Rule is specific (threshold + action).
  • Fallback pool named.
  • Alert based on BI p95 or queue time.

Common mistakes and self-checks

  • Single giant pool for everything: Self-check—Do BI jobs share resources with ELT? If yes, separate lanes.
  • No timeouts on BI: Self-check—Do any BI queries exceed 2–3 minutes? Add timeouts and scan caps.
  • Hard budget stops: Self-check—Is work blocked mid-day? Use graceful degradation (slower pool) instead.
  • Missing tags: Self-check—Can you attribute all spend to teams? If not, enforce required tags.
  • Over-tuning concurrency: Self-check—Did queue times drop but errors rose? Rebalance concurrency vs stability.

Practical projects

  • Project 1: Build a governance doc for your company with workload classes, pools, limits, quotas, and SLOs.
  • Project 2: Create a daily dashboard showing queue time, p95 latency per pool, cancellations, and spend by tag.
  • Project 3: Run a tabletop incident drill: BI slowdown during ELT. Apply your runbook and record outcomes.

Mini challenge

Design a two-tier ad‑hoc policy that encourages small, quick queries to finish instantly while pushing heavy exploration to a slower, queued pool—without blocking users. Write the rules in 5 bullet points.

Who this is for

  • Data platform engineers managing multi-team warehouses.
  • Analytics engineers who schedule pipelines and optimize dashboards.
  • Team leads who need predictable performance and spend.

Prerequisites

  • Basic understanding of data warehouses and query execution.
  • Familiarity with RBAC and tagging/metadata.
  • Ability to read platform metrics (latency, concurrency, queue time, spend).

Learning path

  • Before: Warehouse sizing and auto-scaling fundamentals.
  • This lesson: Governance for shared compute (workload classes, lanes, guardrails).
  • Next: Advanced workload management and SLO monitoring.

Next steps

  • Complete the exercises above.
  • Take the quick test to check understanding. Everyone can take it; only logged-in users will see saved progress.
  • Apply one guardrail (timeout or quota) in your environment this week and observe the impact.

Practice Exercises

2 exercises to complete

Instructions

Given workloads: (A) Daily ELT 01:00–05:00, deadline 06:00; (B) BI dashboards 07:00–20:00, p95 < 3s; (C) Ad‑hoc analysts 09:00–18:00, small budget.

  • Create 3 pools with: name, size/scale, concurrency, timeout.
  • Set routing rules and daily quotas.
  • Add 2 BI guardrails.
Expected Output
Three pools with clear parameters, routing rules per workload, BI guardrails preventing long/expensive queries, and a daily ad‑hoc quota with overage behavior.

Governance For Shared Compute — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
