Why this matters
Capacity planning keeps systems reliable and cost-effective. As a Platform Engineer, you will forecast demand, right-size compute and storage, set safe utilization targets, and plan scaling before outages happen. Real tasks include deciding how many API instances to run for a product launch, how much storage to allocate for logs or Kafka, and when to add database replicas to protect SLOs.
- Prevent outages by staying below saturation.
- Control costs by avoiding overprovisioning.
- Hit SLOs by aligning capacity with demand and error budgets.
Who this is for
- Platform and SRE engineers responsible for reliability and scaling.
- Backend engineers who own services and on-call rotations.
- Tech leads who approve capacity/cost plans.
Prerequisites
- Basic understanding of CPU, memory, I/O, and network limits.
- Familiarity with service metrics (RPS/QPS, latency percentiles, error rates).
- Comfort with simple math: percentages, averages, and rounding.
Concept explained simply
Capacity planning predicts how much load will arrive and ensures you have enough resources, with a safety margin, while keeping utilization healthy. It’s a balance: too little causes incidents; too much wastes money.
Mental model: Highways and headroom
Imagine traffic lanes. If a lane is 95% full, one small surge causes a traffic jam. Keep traffic at around 50–70% of capacity so you have room for spikes and delays. In systems, that "room" is headroom: the extra capacity above expected peak.
- Forecast demand (peak and patterns).
- Choose safe utilization targets (e.g., CPU 50–70%).
- Add headroom (e.g., +20–40% above peak).
- Plan scaling triggers (autoscale thresholds, batch windows).
Core formulas and targets
- Effective capacity per instance = throughput_at_reference_util × (target_util / reference_util)
- Required capacity with headroom = peak_demand × (1 + headroom_fraction)
- Instances needed = ceil(required_capacity / effective_capacity_per_instance)
- Little’s Law (queues): L = λ × W (in-flight = arrival_rate × wait_time)
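These formulas translate directly into a few helper functions. A minimal Python sketch (function and variable names are illustrative, not from any particular library):

```python
import math

def effective_capacity(throughput_at_ref: float, ref_util: float, target_util: float) -> float:
    """Scale throughput measured at a reference utilization to the target utilization."""
    return throughput_at_ref * (target_util / ref_util)

def required_capacity(peak_demand: float, headroom_fraction: float) -> float:
    """Add headroom on top of the forecast peak."""
    return peak_demand * (1 + headroom_fraction)

def instances_needed(required: float, per_instance: float) -> int:
    """Round up; partial instances don't exist."""
    return math.ceil(required / per_instance)

def in_flight(arrival_rate: float, wait_time: float) -> float:
    """Little's Law: L = lambda * W (e.g., requests in flight = RPS x average latency in seconds)."""
    return arrival_rate * wait_time
```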
Common targets:
- CPU target utilization: 50–70% under normal peak.
- Memory target: leave 20–30% free to avoid GC/OOM risks.
- Disk utilization: aim for 60–70%, with a 20% free-space policy.
- Network: keep below 70–80% of line rate to limit drops.
Worked examples
Example 1: Web API instances for launch day
Given: forecast peak = 7,000 RPS; each instance sustained 350 RPS at 70% CPU during load testing. Target utilization = 60%. Headroom = 30%.
- Effective per-instance capacity = 350 × (0.60 / 0.70) = 300 RPS
- Required capacity with headroom = 7,000 × 1.30 = 9,100 RPS
- Instances needed = ceil(9,100 / 300) = 31
Answer: Run 31 instances. Add autoscaling with warm-up to avoid cold starts.
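The same arithmetic as a quick sanity check in Python (numbers taken from the example):

```python
import math

per_instance = 350 * (0.60 / 0.70)         # ≈ 300 RPS at the 60% CPU target
required = 7_000 * 1.30                     # 9,100 RPS with 30% headroom
print(math.ceil(required / per_instance))   # 31
```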
Example 2: Storage for a 5-day Kafka retention
Given: avg ingress = 120 MB/s; daily peak 3 hours at 2× (240 MB/s); retention = 5 days; compression = 0.5; replication = 3; overhead = 10%; keep 20% free.
- Daily volume = (21 h × 120 MB/s + 3 h × 240 MB/s) × 3,600 s/h = 11,664,000 MB = 11.664 TB
- 5 days raw = 58.32 TB; compressed = 29.16 TB
- Replicated (×3) = 87.48 TB; +10% overhead = 96.23 TB
- Provision with 20% free: 96.23 / 0.8 ≈ 120.29 TB
Answer: Provision about 120–121 TB of disk capacity so the retained data fits with 20% free.
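The same calculation as a short script (inputs from the example; TB here is decimal, 1 TB = 1,000,000 MB):

```python
avg_mb_s, peak_mb_s = 120, 240      # ingress rates
offpeak_hours, peak_hours = 21, 3   # hours per day at each rate
retention_days = 5
compression = 0.5                   # compressed size = 0.5 x raw
replication = 3
overhead = 0.10                     # indexes, metadata, etc.
free_fraction = 0.20                # keep 20% of provisioned space free

daily_mb = (offpeak_hours * avg_mb_s + peak_hours * peak_mb_s) * 3_600
raw_tb = daily_mb * retention_days / 1e6                          # ≈ 58.32 TB
stored_tb = raw_tb * compression * replication * (1 + overhead)   # ≈ 96.23 TB
provisioned_tb = stored_tb / (1 - free_fraction)                  # ≈ 120.3 TB
print(f"provision ≈ {provisioned_tb:.1f} TB")
```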
Example 3: Batch window impact on API headroom
Scenario: Nightly batch job adds 1,500 RPS for 45 minutes at 01:00 UTC; normal peak = 4,000 RPS; headroom policy 25%; current capacity = 5,500 RPS. Is it safe?
- Batch peak demand = 4,000 + 1,500 = 5,500 RPS
- Required with headroom = 5,500 × 1.25 = 6,875 RPS
- Current capacity = 5,500 RPS ⇒ shortfall = 1,375 RPS
Action: Add instances or shift the batch to reduce overlapping peak.
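A quick check of the overlap with the example's numbers:

```python
normal_peak, batch_extra = 4_000, 1_500   # RPS
headroom = 0.25
current_capacity = 5_500                   # RPS

required = (normal_peak + batch_extra) * (1 + headroom)   # 6,875 RPS
shortfall = max(0.0, required - current_capacity)          # 1,375 RPS
print(f"required={required:.0f} RPS, shortfall={shortfall:.0f} RPS")
```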
Step-by-step playbook
- Collect demand: Peak RPS/throughput, percentiles, seasonality, batch overlaps.
- Define targets: Utilization thresholds, SLOs, error budgets, warm-up times.
- Load test: Measure per-instance throughput at a known utilization (e.g., 70%).
- Compute: Convert to effective capacity at target utilization; add headroom; round up instances.
- Plan scaling: Autoscaling rules, min/max bounds, cooldowns, and manual runbooks.
- Validate cost: Estimate monthly spend and review it against budget; prices vary by provider, region, and discounts, so treat estimates as rough ranges (see the sketch after this list).
- Review: After releases or traffic changes, re-check assumptions.
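For the cost-validation step, a rough monthly estimate is usually enough. A minimal sketch; the hourly price below is a made-up placeholder, so substitute your provider's actual rate:

```python
instances = 31            # from the sizing calculation above
hourly_price = 0.20       # hypothetical on-demand price per instance-hour (placeholder)
hours_per_month = 730     # average hours in a month

monthly_cost = instances * hourly_price * hours_per_month
print(f"~${monthly_cost:,.0f} per month before discounts")   # ~$4,526
```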
Learning path
- Start: Understand utilization, headroom, and forecasting basics.
- Next: Practice with compute, storage, and network examples.
- Advance: Apply Little’s Law and error budgets to plan safe limits.
- Master: Build runbooks and autoscaling policies with realistic thresholds.
Practice exercises
Do these now. They mirror the graded exercises below.
Exercise 1: Right-size API instances
Forecast peak = 7,000 RPS. Per instance: 350 RPS at 70% CPU (from load test). Target util = 60%. Headroom = 30%.
- Find effective per-instance capacity at 60%.
- Add headroom to peak.
- Round up instances needed.
When done, compare with the solution in the Exercises section.
Exercise 2: Plan replicated storage
5-day retention; avg 120 MB/s; 3 hours/day at 240 MB/s; compression 0.5; replication 3; 10% overhead; 20% free space target.
- Compute daily and 5-day volumes.
- Apply compression, replication, overhead.
- Ensure 20% free space.
Checklist before you compare answers:
- I considered peak, not just average
- I applied utilization targets correctly
- I added headroom before rounding
- I accounted for replication/overhead/free space
Common mistakes and self-check
- Using average instead of peak or percentile demand.
- Confusing utilization target with headroom (they are different layers of safety).
- Ignoring warm-up and scale-out delay.
- Forgetting replication and free space policies in storage estimates.
- Not revisiting plans after a feature launch or traffic change.
Self-check prompts
- Did I model the worst overlapping loads within the same time window?
- Is my instance capacity based on measured data at a known utilization?
- Do autoscaling thresholds avoid oscillation (cooldowns, min/max)?
- If a node fails, does remaining capacity still meet SLOs?
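The last prompt, the N-1 failure check, is easy to script. A minimal sketch with illustrative names:

```python
def survives_node_loss(instances: int, per_instance_rps: float,
                       peak_rps: float, headroom: float = 0.0,
                       failures: int = 1) -> bool:
    """True if the fleet still carries peak (plus headroom) after losing `failures` instances."""
    remaining = (instances - failures) * per_instance_rps
    return remaining >= peak_rps * (1 + headroom)

print(survives_node_loss(instances=31, per_instance_rps=300, peak_rps=7_000))                 # True
print(survives_node_loss(instances=31, per_instance_rps=300, peak_rps=7_000, headroom=0.30))  # False
```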
Practical projects
- Create a capacity workbook: one sheet each for API, DB, and Kafka/logs with inputs, formulas, and outputs.
- Design an autoscaling policy for a stateless service, including min/max, cooldowns, and alarm thresholds.
- Run a synthetic load test and update your per-instance capacity numbers and utilization targets.
Mini challenge
Your service has 2,800 RPS peak, but marketing plans a campaign expected to add +60% traffic for 2 hours. You target 65% CPU, have load-test data of 200 RPS/instance at 70% CPU, and want 25% headroom. Is your current fleet of 20 instances enough?
Hint: Adjust per-instance capacity to 65% CPU, compute the campaign peak, add headroom, then divide and round up.
Next steps
- Finish the exercises and take the quick test below.
- Apply these steps to one real service you own this week.
- Schedule a 30-minute review with your team to validate assumptions.
Quick Test — how it works
The test is available to everyone for free. Logged-in users will have their progress saved automatically.