Why this matters
Cost and capacity planning ensures your data platform meets SLAs without overspending. As a Data Platform Engineer, you will:
- Forecast storage, compute, and streaming capacity for new workloads.
- Plan headroom for peak traffic and seasonal spikes.
- Set budgets, alerts, and unit costs (cost per TB processed, per query, per dashboard).
- Choose right-sizing vs. autoscaling vs. reservations for predictable costs.
- Apply data lifecycle policies to keep hot data fast and cold data cheap.
Concept explained simply
Think of your platform like a highway and parking:
- Highway lanes = compute/throughput. Too few lanes cause traffic jams (missed SLAs); too many lanes waste money.
- Parking = storage. Hot spots near the entrance cost more but are quick to access; cheaper lots farther away (cold) are slower but fine for archives.
- Traffic peaks require a buffer (headroom). You do not build for worst-ever day, but for a sensible peak plus buffer.
Mental model
- Forecast demand (data size, queries, events/s).
- Convert demand to capacity (MB/s, cores, partitions, TB).
- Add operational factors (utilization target, overhead, headroom).
- Choose pricing model (on-demand, reserved, spot/preemptible).
- Continuously measure and tune with unit economics.
Core concepts and quick formulas
- Throughput sizing: required_agg_MBps = data_MB / SLA_seconds
- Per-worker effective rate: worker_effective_MBps = base_MBps × (1 - overhead) × target_utilization
- Workers needed: n = ceil(required_agg_MBps / worker_effective_MBps)
- Partition sizing (streaming): partitions = ceil(required_MBps / per_partition_MBps)
- Capacity headroom: plan for 20–30% above forecast peak
- Storage forecast (simple): next_month_TB = current_TB + growth_TB - deletions_TB - lifecycle_moves_TB
- Unit economics examples: cost per TB stored, cost per TB processed, cost per 1k events, cost per dashboard refresh
- Cost controls: budgets and alerts, tags for cost allocation, lifecycle policies, right-sizing, autoscaling, reservations, spot/preemptible for fault-tolerant jobs
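These formulas translate directly into a few lines of code. Below is a minimal Python sketch of the sizing formulas above; the function names and the example values at the bottom are illustrative, not tied to any specific platform.

```python
# Minimal sketch of the quick formulas above. Names and sample numbers are illustrative.
import math

def workers_needed(data_mb: float, sla_seconds: float,
                   base_mbps: float, overhead: float, utilization: float) -> int:
    """Batch sizing: required aggregate MB/s divided by effective per-worker MB/s."""
    required_mbps = data_mb / sla_seconds
    effective_mbps = base_mbps * (1 - overhead) * utilization
    return math.ceil(required_mbps / effective_mbps)

def partitions_needed(required_mbps: float, per_partition_mbps: float,
                      headroom: float = 0.3) -> int:
    """Streaming sizing: add headroom, then divide by per-partition throughput."""
    return math.ceil(required_mbps * (1 + headroom) / per_partition_mbps)

def storage_forecast_tb(current_tb: float, growth_tb: float,
                        deletions_tb: float, lifecycle_moves_tb: float) -> float:
    """Simple next-month hot storage forecast."""
    return current_tb + growth_tb - deletions_tb - lifecycle_moves_tb

if __name__ == "__main__":
    print(workers_needed(9_000_000, 7_200, 60, 0.15, 0.70))  # batch example below -> 36
    print(partitions_needed(72, 5, headroom=0.3))            # streaming example below -> 19
    print(storage_forecast_tb(8, 1, 0, 7))                   # hot TB after month 3 of Example 1 -> 2
```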
Worked examples
Example 1: Storage forecast with hot-to-cold lifecycle
Context: Data lake starts with 6 TB hot. Each month adds 1 TB (post-compression). Policy: keep hot for 2 months, then move to cold. Costs: hot $22/TB-month, cold $10/TB-month. Question: End of month 4 — how many TB are hot vs. cold, and what is the month-4 storage cost?
- Month 1 end: hot = 6 + 1 = 7 TB; cold = 0 TB.
- Month 2 end: hot = 7 + 1 = 8 TB; cold = 0 TB.
- Month 3 end: add 1 TB to hot (=> 9 TB), then move the month-1 data (the initial 6 TB plus month 1's 1 TB = 7 TB) to cold. Hot = 1 (m2) + 1 (m3) = 2 TB; cold = 7 TB.
- Month 4 end: add 1 TB to hot (=> 3 TB), then move month 2's 1 TB to cold. Hot = 1 (m3) + 1 (m4) = 2 TB; cold = 8 TB.
Cost month 4: hot 2×$22 = $44; cold 8×$10 = $80; total = $124.
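To check the walkthrough (or extend it past month 4), here is a small Python sketch of the same lifecycle. It treats the initial 6 TB as month-1 data and bills the end-of-month footprint, matching the assumptions in the example.

```python
# Sketch of the month-by-month hot/cold walkthrough above. Adjust the inputs
# if your lifecycle clock or billing model differs.
HOT_MONTHS = 2                 # data stays hot for 2 full months after it lands
HOT_COST, COLD_COST = 22, 10   # $/TB-month

hot_by_month = {}              # month the data landed -> TB still hot
cold_tb = 0.0

for month in range(1, 5):
    # New data lands hot: the initial 6 TB plus month 1's 1 TB, then 1 TB/month.
    hot_by_month[month] = 7.0 if month == 1 else 1.0
    # Move anything older than HOT_MONTHS to cold.
    for landed in list(hot_by_month):
        if month - landed >= HOT_MONTHS:
            cold_tb += hot_by_month.pop(landed)
    hot_tb = sum(hot_by_month.values())
    cost = hot_tb * HOT_COST + cold_tb * COLD_COST
    print(f"Month {month}: hot={hot_tb:.0f} TB, cold={cold_tb:.0f} TB, cost=${cost:.0f}")
    # Month 4 prints: hot=2 TB, cold=8 TB, cost=$124
```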
Example 2: Batch compute sizing for SLA
Process 9 TB in 2 hours. One worker sustains 60 MB/s. Overhead 15%. Target utilization 70%.
- Required aggregate throughput: 9,000,000 MB / 7,200 s ≈ 1,250 MB/s.
- Per-worker effective: 60 × 0.85 × 0.70 = 35.7 MB/s.
- Workers needed: ceil(1,250 / 35.7) = 36.
If a worker costs $0.20/hour, job cost ≈ 36 × 0.20 × 2 = $14.40.
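A quick sanity check of this sizing in Python. All numbers come from the example; the $0.20/hour worker price is the example's assumption, not a real list price.

```python
# Back-of-the-envelope check of Example 2 (decimal units: 1 TB = 1,000,000 MB).
import math

data_mb = 9_000_000                 # 9 TB to process
sla_s = 2 * 3600                    # 2-hour SLA
base_mbps, overhead, util = 60, 0.15, 0.70
price_per_worker_hour = 0.20        # example's assumed on-demand price

required_mbps = data_mb / sla_s                        # 1,250 MB/s
effective_mbps = base_mbps * (1 - overhead) * util     # 35.7 MB/s
workers = math.ceil(required_mbps / effective_mbps)    # 36
job_cost = workers * price_per_worker_hour * (sla_s / 3600)

print(workers, round(job_cost, 2))                     # 36  14.4
```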
Example 3: Streaming partitioning
Peak (p95) ingress = 60,000 events/s at 1.2 KB average. Broker partition handles up to 5 MB/s. Plan 30% headroom.
- Throughput = 60,000 × 1.2 KB ≈ 72,000 KB/s ≈ 72 MB/s.
- With headroom: 72 × 1.3 = 93.6 MB/s.
- Partitions = ceil(93.6 / 5) = 19. Consider choosing 20 for even distribution.
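The same arithmetic in Python, using decimal units (1 MB = 1,000 KB) as the example does. The 5 MB/s per-partition limit is the example's assumption and varies by broker hardware and configuration.

```python
# Quick check of Example 3: events/s -> MB/s -> partitions with headroom.
import math

events_per_s = 60_000
avg_event_kb = 1.2
per_partition_mbps = 5
headroom = 0.30

ingress_mbps = events_per_s * avg_event_kb / 1_000     # ~72 MB/s at peak
with_headroom = ingress_mbps * (1 + headroom)          # 93.6 MB/s planned
partitions = math.ceil(with_headroom / per_partition_mbps)

print(round(with_headroom, 1), partitions)             # 93.6  19
```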
How to approach planning (step-by-step)
- Clarify objectives: SLA/SLOs, workloads (batch, streaming, interactive), data domains, growth horizon (3–12 months).
- Measure baselines: current storage TB, query scans, job runtimes, events/s, peak/seasonal patterns, failure/retry rates.
- Model demand: convert to data rates (MB/s), sizes (TB), concurrency, and schedules.
- Choose capacity strategy: right-size baseline; add autoscaling for spikes; consider reservations for steady load; spot/preemptible for fault-tolerant tasks.
- Apply lifecycle: partitioning, clustering, compression, hot/cold/archive moves, retention limits.
- Set unit economics: define 1–2 key unit costs by workload (e.g., $/TB processed for ETL, $/1k events for streaming).
- Add headroom: usually 20–30% above predicted peak; validate against p95/p99 usage.
- Implement budgets/alerts and cost allocation tags: track by team, dataset, or pipeline.
- Review monthly: compare forecast vs. actual; update assumptions; remove idle resources.
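As a starting point, the "model demand", "add headroom", and "set unit economics" steps can be captured in a small planning object like the Python sketch below. The workload name, prices, and throughput figures are placeholders, and it sizes for peak plus headroom around the clock, which is the conservative pre-autoscaling baseline.

```python
# Placeholder planning sketch: demand -> capacity with headroom -> unit cost.
from dataclasses import dataclass
import math

@dataclass
class WorkloadPlan:
    name: str
    forecast_peak_mbps: float     # modeled demand at peak
    headroom: float               # e.g. 0.25 for 25%
    per_unit_mbps: float          # effective throughput of one worker/partition/slot
    unit_cost_per_hour: float     # on-demand price of one unit
    monthly_tb_processed: float   # denominator for unit economics

    def units_needed(self) -> int:
        return math.ceil(self.forecast_peak_mbps * (1 + self.headroom) / self.per_unit_mbps)

    def monthly_cost(self, hours: float = 730) -> float:
        # Conservative: assumes peak-sized capacity runs all month (no autoscaling).
        return self.units_needed() * self.unit_cost_per_hour * hours

    def cost_per_tb(self) -> float:
        return self.monthly_cost() / self.monthly_tb_processed

etl = WorkloadPlan("nightly_etl", forecast_peak_mbps=1_250, headroom=0.25,
                   per_unit_mbps=35.7, unit_cost_per_hour=0.20, monthly_tb_processed=270)
print(etl.units_needed(), round(etl.monthly_cost()), round(etl.cost_per_tb(), 2))
# -> 44 units, ~$6,424/month, ~$23.79 per TB processed (placeholder numbers)
```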
Exercises you can do now
These mirror the exercises below. Try first, then compare with the solutions.
- Exercise 1: Forecast hot vs. cold storage and month-4 cost for the given lifecycle policy.
- Exercise 2: Size the worker cluster to meet a batch SLA; estimate the run cost.
- Checklist for both exercises:
  - List assumptions and unit conversions (MB, GB, TB).
  - Include overhead and utilization in compute sizing.
  - Add headroom only where required by the scenario.
  - Show intermediate numbers and rounding.
Common mistakes and how to self-check
- Ignoring overhead and utilization: self-check by multiplying base throughput by (1 - overhead) × utilization and recomputing the worker count (see the sketch after this list).
- No headroom for spikes: Validate with p95/p99 metrics; add 20–30% if spikes exist.
- Over-retaining in hot storage: Verify lifecycle rules and the actual age distribution of data.
- Missing unit economics: Choose a unit aligned to value (e.g., $/TB processed for ETL) and track it monthly.
- Unlabeled costs: Ensure cost allocation tags/labels exist for every major workload.
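Here is that self-check as a short Python snippet; the throughput, overhead, and utilization numbers are illustrative.

```python
# Self-check for the overhead/utilization mistake: compare naive vs. realistic sizing.
import math

required_mbps = 1_250
base_mbps, overhead, utilization = 60, 0.15, 0.70

naive = math.ceil(required_mbps / base_mbps)                                       # 21 workers
realistic = math.ceil(required_mbps / (base_mbps * (1 - overhead) * utilization))  # 36 workers

print(f"naive={naive} workers, realistic={realistic} workers")
# A large gap means the naive plan would miss the SLA once real-world losses hit.
```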
Practical projects
- Create a 3-month capacity plan for a sample data platform: batch ETL, streaming ingestion, and BI queries. Include headroom and a monthly cost forecast.
- Implement a lifecycle policy for a data lake: partitioned tables, compression, hot-to-cold move at 30 days, deletion at 365 days. Measure savings.
- Build a unit economics dashboard: show $/TB stored, $/TB processed, and top 5 costly jobs with trend lines.
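For the lifecycle project, one low-risk way to start is to describe the policy as data and estimate the savings before touching storage settings, as in this Python sketch; the prices and volumes are placeholders.

```python
# Placeholder estimate of monthly savings from a hot-to-cold + deletion policy.
policy = {
    "hot_days": 30,      # move to cold after 30 days
    "delete_days": 365,  # delete after 365 days
}
hot_price, cold_price = 22, 10   # $/TB-month (placeholder prices)

moved_tb = 40    # TB expected to shift from hot to cold under the policy
deleted_tb = 12  # TB expected to be deleted (would otherwise sit in cold)

savings = moved_tb * (hot_price - cold_price) + deleted_tb * cold_price
print(f"Estimated monthly savings: ${savings}")  # $600 with these placeholder volumes
```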
Who this is for
- Aspiring and current Data Platform Engineers.
- Data Engineers responsible for pipelines, storage, or query platforms.
- Tech leads needing predictable platform spend and performance.
Prerequisites
- Basic understanding of data lakes/warehouses and ETL/ELT.
- Familiarity with metrics: throughput (MB/s), latency, concurrency, partitions.
- Comfort with arithmetic and unit conversions between MB/GB/TB.
Learning path
- Start: Data platform components and workload types (batch, streaming, interactive).
- Then: Storage design (partitioning, compression, lifecycle).
- Next: Compute orchestration and autoscaling strategies.
- Finally: Cost governance (budgets, tags, unit economics, reviews).
Mini challenge
Your analytics warehouse has a daily peak query window 09:00–11:00 that doubles normal traffic. Propose a plan that keeps 25% headroom during that window and reduces spend outside it. Include: baseline capacity, autoscaling or schedule-based scaling, and one optimization (e.g., partition pruning). Summarize in 4–5 bullet points.
Next steps
- Apply these steps to one real workload you own.
- Set 2 unit cost metrics and a monthly review cadence.
- Pilot a lifecycle policy on a high-volume dataset and measure results.
Ready to test yourself?
The quick test below is available to everyone.