
Cost and Capacity Planning

Learn Cost and Capacity Planning for free with explanations, exercises, and a quick test (for Data Platform Engineers).

Published: January 11, 2026 | Updated: January 11, 2026

Why this matters

Cost and capacity planning ensures your data platform meets SLAs without overspending. As a Data Platform Engineer, you will:

  • Forecast storage, compute, and streaming capacity for new workloads.
  • Plan headroom for peak traffic and seasonal spikes.
  • Set budgets, alerts, and unit costs (cost per TB processed, per query, per dashboard).
  • Choose right-sizing vs. autoscaling vs. reservations for predictable costs.
  • Apply data lifecycle policies to keep hot data fast and cold data cheap.

Concept explained simply

Think of your platform like a highway and parking:

  • Highway lanes = compute/throughput. Too few lanes cause traffic jams (missed SLAs); too many lanes waste money.
  • Parking = storage. Hot spots near the entrance cost more but are quick to access; cheaper lots farther away (cold) are slower but fine for archives.
  • Traffic peaks require a buffer (headroom). You do not build for the worst-ever day, but for a sensible peak plus a buffer.

Mental model

  • Forecast demand (data size, queries, events/s).
  • Convert demand to capacity (MB/s, cores, partitions, TB).
  • Add operational factors (utilization target, overhead, headroom).
  • Choose pricing model (on-demand, reserved, spot/preemptible).
  • Continuously measure and tune with unit economics.

Core concepts and quick formulas

  • Throughput sizing: required_agg_MBps = data_MB / SLA_seconds
  • Per-worker effective rate: worker_effective_MBps = base_MBps × (1 - overhead) × target_utilization
  • Workers needed: n = ceil(required_agg_MBps / worker_effective_MBps)
  • Partition sizing (streaming): partitions = ceil(required_MBps / per_partition_MBps)
  • Capacity headroom: plan for 20–30% above forecast peak
  • Storage forecast (simple): next_month_TB = current_TB + growth_TB - deletions_TB - lifecycle_moves_TB
  • Unit economics examples: cost per TB stored, cost per TB processed, cost per 1k events, cost per dashboard refresh
  • Cost controls: budgets and alerts, tags for cost allocation, lifecycle policies, right-sizing, autoscaling, reservations, spot/preemptible for fault-tolerant jobs
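
These formulas translate directly into a few helper functions. The sketch below is illustrative Python, not any platform's API; the function names and the sample values at the end are assumptions chosen for the demonstration. Decimal units are used throughout (1 TB = 1,000,000 MB).

import math

def workers_needed(data_mb, sla_seconds, base_mbps, overhead, utilization):
    """Workers required to process data_mb within sla_seconds."""
    required_agg_mbps = data_mb / sla_seconds
    worker_effective_mbps = base_mbps * (1 - overhead) * utilization
    return math.ceil(required_agg_mbps / worker_effective_mbps)

def partitions_needed(required_mbps, per_partition_mbps, headroom=0.3):
    """Streaming partitions sized for peak throughput plus headroom."""
    return math.ceil(required_mbps * (1 + headroom) / per_partition_mbps)

def storage_forecast_tb(current_tb, growth_tb, deletions_tb, moves_tb):
    """Next month's hot-tier TB after growth, deletions, and tier moves."""
    return current_tb + growth_tb - deletions_tb - moves_tb

# Hypothetical workload: 2 TB in 1 hour, 100 MB/s workers,
# 10% overhead, 75% target utilization.
print(workers_needed(2_000_000, 3_600, 100, 0.10, 0.75))  # -> 9

The same calculations reappear informally in the worked examples that follow.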

Worked examples

Example 1: Storage forecast with hot-to-cold lifecycle

Context: Data lake starts with 6 TB hot. Each month adds 1 TB (post-compression). Policy: keep hot for 2 months, then move to cold. Costs: hot $22/TB-month, cold $10/TB-month. Question: End of month 4 — how many TB are hot vs. cold, and what is the month-4 storage cost?

  1. Month 1 end: hot = 6 + 1 = 7 TB; cold = 0 TB.
  2. Month 2 end: hot = 7 + 1 = 8 TB; cold = 0 TB.
  3. Month 3 end: month 1’s batch (the starting 6 TB plus that month’s 1 TB) has now spent 2 months hot, so move its 7 TB to cold. Hot = 1 (m2) + 1 (m3) = 2 TB; cold = 7 TB.
  4. Month 4 end: add new 1 TB to hot => 3 TB, then move month 2’s 1 TB to cold. Hot = 2 TB; cold = 8 TB.

Cost month 4: hot 2×$22 = $44; cold 8×$10 = $80; total = $124.
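
A short simulation makes this bookkeeping easy to audit. This is a minimal Python sketch of the scenario above, assuming (as the worked steps do) that the starting 6 TB ages together with month 1's batch; the variable names are illustrative.

HOT_RATE, COLD_RATE = 22, 10      # $/TB-month, from the scenario
HOT_MONTHS = 2                    # months a batch stays in hot storage

batches = []                      # (month_landed, size_tb)
for month in range(1, 5):
    new_tb = 7.0 if month == 1 else 1.0   # month 1: starting 6 TB + 1 TB new
    batches.append((month, new_tb))
    hot = sum(tb for m, tb in batches if month - m < HOT_MONTHS)
    cold = sum(tb for m, tb in batches if month - m >= HOT_MONTHS)
    cost = hot * HOT_RATE + cold * COLD_RATE
    print(f"month {month}: hot={hot:.0f} TB, cold={cold:.0f} TB, cost=${cost:.0f}")

# The final line prints: month 4: hot=2 TB, cold=8 TB, cost=$124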

Example 2: Batch compute sizing for SLA

Process 9 TB in 2 hours. One worker sustains 60 MB/s. Overhead 15%. Target utilization 70%.

  1. Required aggregate throughput: 9,000,000 MB / 7,200 s ≈ 1,250 MB/s.
  2. Per-worker effective: 60 × 0.85 × 0.70 = 35.7 MB/s.
  3. Workers needed: ceil(1,250 / 35.7) = ceil(35.01) = 36.

If a worker costs $0.20/hour, job cost ≈ 36 × 0.20 × 2 = $14.40.
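
The same arithmetic as a quick sanity check, with the scenario's numbers hard-coded (in practice they would come from your job metrics):

import math

required_mbps = 9_000_000 / 7_200            # 1,250 MB/s aggregate
effective_mbps = 60 * (1 - 0.15) * 0.70      # 35.7 MB/s per worker
workers = math.ceil(required_mbps / effective_mbps)  # 36
job_cost = workers * 0.20 * 2                # 2-hour run at $0.20/worker-hour
print(workers, round(job_cost, 2))           # -> 36 14.4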

Example 3: Streaming partitioning

Peak (p95) ingress = 60,000 events/s at 1.2 KB average. Broker partition handles up to 5 MB/s. Plan 30% headroom.

  1. Throughput = 60,000 × 1.2 KB ≈ 72,000 KB/s ≈ 72 MB/s.
  2. With headroom: 72 × 1.3 = 93.6 MB/s.
  3. Partitions = ceil(93.6 / 5) = 19. Consider choosing 20 for even distribution.
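
In code, with Example 3's numbers hard-coded (decimal units, 1 MB = 1,000 KB):

import math

ingress_mbps = 60_000 * 1.2 / 1_000        # 72 MB/s at the p95 peak
with_headroom = ingress_mbps * 1.3         # 93.6 MB/s with 30% headroom
partitions = math.ceil(with_headroom / 5)  # 5 MB/s per partition
print(partitions)                          # -> 19 (or 20 for even distribution)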

How to approach planning (step-by-step)

  1. Clarify objectives: SLA/SLOs, workloads (batch, streaming, interactive), data domains, growth horizon (3–12 months).
  2. Measure baselines: current storage TB, query scans, job runtimes, events/s, peak/seasonal patterns, failure/retry rates.
  3. Model demand: convert to data rates (MB/s), sizes (TB), concurrency, and schedules.
  4. Choose capacity strategy: right-size baseline; add autoscaling for spikes; consider reservations for steady load; spot/preemptible for fault-tolerant tasks.
  5. Apply lifecycle: partitioning, clustering, compression, hot/cold/archive moves, retention limits.
  6. Set unit economics: define 1–2 key unit costs by workload (e.g., $/TB processed for ETL, $/1k events for streaming).
  7. Add headroom: usually 20–30% above predicted peak; validate against p95/p99 usage (see the sketch after this list).
  8. Implement budgets/alerts and cost allocation tags: track by team, dataset, or pipeline.
  9. Review monthly: compare forecast vs. actual; update assumptions; remove idle resources.
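
For step 7, a nearest-rank percentile over recent throughput samples is usually enough to validate headroom. A minimal sketch; the samples list is hypothetical and would come from your monitoring system:

import math

# Hypothetical hourly peak throughput samples in MB/s.
samples = sorted([48, 52, 55, 60, 61, 63, 70, 72, 75, 90])

rank = math.ceil(0.95 * len(samples))   # nearest-rank p95
p95 = samples[rank - 1]                 # -> 90 for these samples
planned_capacity = p95 * 1.25           # 25% headroom over the observed p95
print(p95, planned_capacity)            # -> 90 112.5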

Exercises you can do now

These mirror the exercises below. Try first, then compare with the solutions.

  1. Exercise 1: Forecast hot vs. cold storage and month-4 cost for the given lifecycle policy.
  2. Exercise 2: Size the worker cluster to meet a batch SLA; estimate the run cost.
  • Checklist for both exercises:
    • List assumptions and unit conversions (MB, GB, TB).
    • Include overhead and utilization in compute sizing.
    • Add headroom only where required by the scenario.
    • Show intermediate numbers and rounding.

Common mistakes and how to self-check

  • Ignoring overhead and utilization: Self-check by multiplying base throughput by (1 - overhead) × utilization and re-compute workers.
  • No headroom for spikes: Validate with p95/p99 metrics; add 20–30% if spikes exist.
  • Over-retaining in hot storage: Verify lifecycle rules and the actual age distribution of data.
  • Missing unit economics: Choose a unit aligned to value (e.g., $/TB processed for ETL) and track it monthly.
  • Unlabeled costs: Ensure cost allocation tags/labels exist for every major workload.

Practical projects

  • Create a 3-month capacity plan for a sample data platform: batch ETL, streaming ingestion, and BI queries. Include headroom and a monthly cost forecast.
  • Implement a lifecycle policy for a data lake: partitioned tables, compression, hot-to-cold move at 30 days, deletion at 365 days. Measure savings.
  • Build a unit economics dashboard: show $/TB stored, $/TB processed, and top 5 costly jobs with trend lines.

Who this is for

  • Aspiring and current Data Platform Engineers.
  • Data Engineers responsible for pipelines, storage, or query platforms.
  • Tech leads needing predictable platform spend and performance.

Prerequisites

  • Basic understanding of data lakes/warehouses and ETL/ELT.
  • Familiarity with metrics: throughput (MB/s), latency, concurrency, partitions.
  • Comfort with arithmetic and unit conversions between MB/GB/TB.

Learning path

  • Start: Data platform components and workload types (batch, streaming, interactive).
  • Then: Storage design (partitioning, compression, lifecycle).
  • Next: Compute orchestration and autoscaling strategies.
  • Finally: Cost governance (budgets, tags, unit economics, reviews).

Mini challenge

Your analytics warehouse has a daily peak query window from 09:00 to 11:00 during which traffic doubles. Propose a plan that keeps 25% headroom during that window and reduces spend outside it. Include: baseline capacity, autoscaling or schedule-based scaling, and one optimization (e.g., partition pruning). Summarize in 4–5 bullet points.

Next steps

  • Apply these steps to one real workload you own.
  • Set 2 unit cost metrics and a monthly review cadence.
  • Pilot a lifecycle policy on a high-volume dataset and measure results.

Ready to test yourself?

The quick test below is available to everyone.

Practice Exercises


Instructions

You manage a data lake with this profile:

  • Starting hot storage: 6 TB
  • Monthly new data (post-compression): +1 TB at month end
  • Lifecycle: keep hot for 2 months, then move to cold
  • Costs: hot $22/TB-month, cold $10/TB-month

Question: At the end of month 4, how many TB are hot vs. cold, and what is the month-4 storage cost?

Expected Output
Hot = 2 TB; Cold = 8 TB; Month-4 cost = $124

Cost and Capacity Planning — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

