Why this matters
As a Backend Engineer, you must ensure systems meet demand without wasting money. Capacity planning helps you predict peak load, right-size infrastructure, and protect service-level objectives (SLOs). Real tasks you will do:
- Estimate how many service instances are needed for a promotion or product launch.
- Set autoscaling targets and safety buffers to keep p95–p99 latency under SLO.
- Forecast database storage, IOPS, and network throughput months ahead.
- Plan redundancy (N+1, multi-AZ) so you stay within SLO during failures.
Quick glossary
- Throughput (RPS/jobs/sec): how much work arrives.
- Latency: time to complete a request (p50/p95/p99).
- Utilization: percent of a resource used (CPU, memory, IOPS, bandwidth).
- Saturation: queues building up; waits increase.
- Headroom: spare capacity reserved for spikes/failover.
- N+1: have at least one extra unit beyond the minimum to handle failure.
Concept explained simply
Capacity planning answers: Will our system meet peak demand within SLO while staying cost-efficient?
- Measure current demand and performance.
- Forecast peak demand (growth, seasonality, events).
- Map demand to resources (CPU, memory, storage, network, IOPS).
- Add safety margins (headroom, N+1, multi-AZ).
- Continuously verify via load tests and observability.
Mental model
- Little's Law (simple version): Concurrent work ≈ Arrival rate × Time in system. At 200 requests/sec with an average time in system of 0.2 sec, about 40 requests are in flight (see the sketch after this list).
- Utilization: Keep typical utilization at 50–70% so you have headroom for spikes and to protect tail latency.
- Headroom: plan buffers (e.g., 20–40%) and N+1. Over-provision a bit to avoid SLO breaches; under-provisioning is costlier during incidents.
- Workload type: CPU-bound, memory-bound, I/O-bound, or network-bound. The tightest constraint dominates capacity.
- Scale strategy: Vertical (bigger machines) vs horizontal (more instances). Horizontal scaling plus autoscaling is common for stateless services.
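A minimal sketch of the Little's Law and headroom arithmetic above, in Python; the function names and numbers are illustrative, not from any library or the lesson itself:

```python
def inflight_requests(arrival_rate_rps: float, avg_time_s: float) -> float:
    """Little's Law (simple form): concurrent work ~ arrival rate x time in system."""
    return arrival_rate_rps * avg_time_s

def headroom_fraction(capacity_rps: float, current_rps: float) -> float:
    """Fraction of capacity still free for spikes and failover."""
    return 1.0 - current_rps / capacity_rps

print(inflight_requests(200, 0.2))                            # ~40 requests in flight
print(headroom_fraction(capacity_rps=1000, current_rps=650))  # 0.35 -> 35% headroom
```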
Step-by-step capacity planning (repeatable)
- Define SLO/SLI: e.g., p95 latency < 200 ms, error rate < 0.5%.
- Baseline: Gather RPS/jobs, p95/p99 latency, CPU%, memory, IOPS, network. Note peak vs average.
- Forecast: Estimate peak using recent peaks × growth, seasonality, and special events. Prefer p95 peaks over averages.
- Find constraints: Identify which resource saturates first (CPU, memory, IOPS, network).
- Per-instance capacity: Load test to find sustainable throughput at target utilization (e.g., 65–70%).
- Compute instances: Instances = ceil(peak_throughput / per_instance_capacity) × safety_factor (see the sketch after these steps).
- Resilience: Apply N+1 and distribute across AZs. Ensure you can lose 1 unit/AZ and still meet SLO.
- Autoscaling: Set min/desired/max and targets (e.g., CPU 60–65%).
- Alerts: Alert on SLO burn, saturation, and approaching limits (e.g., 80–90% of capacity sustained).
- Review: Revisit after launches/incidents; update forecasts monthly or after big changes.
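The instance-count formula from these steps can be written as a small helper. This is a sketch with an assumed default 30% safety factor (the same numbers used in Example 1 below), not a prescribed implementation:

```python
import math

def instances_needed(peak_rps: float, per_instance_rps: float,
                     safety_factor: float = 1.3) -> int:
    """ceil(peak / per-instance capacity), then apply the safety factor and
    round up again so only whole instances are deployed."""
    base = math.ceil(peak_rps / per_instance_rps)
    return math.ceil(base * safety_factor)

# 1250 RPS peak, 140 RPS per instance, 30% buffer -> 12 instances
print(instances_needed(1250, 140))
```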
Worked examples
Example 1 – API service sizing
Baseline: 500 RPS peak today. Load test shows each instance sustains 140 RPS at ~65% CPU with p95 latency < 180 ms. Next month's campaign: expect 2.5× peak.
- Forecast peak: 500 × 2.5 = 1250 RPS.
- Instances before buffer: ceil(1250 / 140) = ceil(8.93) = 9.
- Safety buffer 30%: 9 × 1.3 = 11.7 → 12 instances.
- Multi-AZ (3 AZs) with an N+1 mindset: 12 total → 4 per AZ. Losing 1 instance still leaves 11, which covers 1250 RPS at ~81% of per-instance capacity (acceptable if still within SLO).
- Autoscaler: min 3 (one per AZ), desired 12, max 18, target CPU 60–65%.
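The arithmetic in Example 1 can be checked with a few lines of Python; a sketch that assumes the 140 RPS per-instance figure from the load test:

```python
import math

peak_rps = 500 * 2.5                           # forecast peak: 1250 RPS
per_instance_rps = 140                         # sustained at ~65% CPU per the load test

base = math.ceil(peak_rps / per_instance_rps)  # 9 instances before buffer
total = math.ceil(base * 1.3)                  # 12 instances with 30% buffer
per_az = total // 3                            # 4 per AZ across 3 AZs
after_loss = peak_rps / (total - 1) / per_instance_rps  # ~0.81 of per-instance capacity
print(base, total, per_az, round(after_loss, 2))
```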
Example 2 – Batch queue deadline
Goal: process 1,000,000 jobs in 2 hours. Each job needs 0.2 CPU-seconds and fits within a worker's memory.
- Total CPU-seconds: 1,000,000 × 0.2 = 200,000 sec.
- Available wall-clock: 2 hours = 7200 sec.
- Workers needed: 200,000 / 7200 ≈ 27.78 → 28 workers.
- Add 20% headroom: 28 × 1.2 = 33.6 → 34 workers.
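The same calculation for Example 2, as a short sketch; variable names are mine:

```python
import math

jobs = 1_000_000
cpu_seconds_per_job = 0.2
deadline_s = 2 * 3600                                # 7200 s of wall clock

total_cpu_seconds = jobs * cpu_seconds_per_job       # 200,000 CPU-seconds
workers = math.ceil(total_cpu_seconds / deadline_s)  # 28 workers before headroom
workers_with_headroom = math.ceil(workers * 1.2)     # 34 workers with 20% headroom
print(workers, workers_with_headroom)
```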
Example 3 – Storage growth forecast
Current DB: 100 GB. Growth ~8%/month. Volume limit: 2 TB (2048 GB). Plan upgrade when at 75%.
- Target threshold: 0.75 × 2048 = 1536 GB.
- How many months to reach 1536 GB from 100 GB at 8% growth? 100 × (1.08)^m = 1536 → (1.08)^m = 15.36 → m ≈ ln(15.36)/ln(1.08) ≈ 35.5 months.
- Plan: schedule the capacity change at around 33 months, earlier if ingestion spikes.
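Example 3's months-to-threshold figure comes from solving the growth equation with a logarithm; a sketch of that calculation:

```python
import math

current_gb = 100
monthly_growth = 0.08
threshold_gb = 0.75 * 2048                 # 1536 GB, the 75% upgrade trigger

# Solve 100 * 1.08**m = 1536 for m
months = math.log(threshold_gb / current_gb) / math.log(1 + monthly_growth)
print(round(months, 1))                    # ~35.5 months
```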
Who this is for
- Backend and platform engineers responsible for service reliability and cost.
- On-call engineers preparing for traffic spikes or migrations.
Prerequisites
- Basic understanding of service metrics (RPS, latency, CPU, memory).
- Familiarity with autoscaling concepts and multi-AZ deployments.
- Ability to run or read results from load tests.
Learning path
- Before this: Observability fundamentals (SLIs/SLOs), basic scaling.
- This lesson: Core formulas, buffers, and a repeatable planning process.
- After this: Incident response playbooks, cost optimization, advanced forecasting.
Exercises & practice
Exercise 1 – Size a stateless API for a spike
Given: average 350 RPS; expected peak multiplier 3×; SLO p95 < 200 ms; one instance sustains 180 RPS at ~70% CPU (p95 160 ms), degrades after 230 RPS; plan for 30% headroom; 3 AZs with N+1 per AZ. Calculate total instances and per-AZ distribution. Propose autoscaler min/desired/max and CPU target.
Exercise 2 – Forecast storage and set alerts
Given: DB size 400 GB today; linear growth 2.5 GB/day; volume limit 1.5 TB; set alerts at 75% and 90% of limit. Compute when to trigger each alert and when the volume will be full if nothing changes. Suggest a review cadence.
Practice checklist
- [ ] I based plans on peak demand and SLOs, not averages.
- [ ] I identified the dominant bottleneck (CPU/memory/IOPS/network).
- [ ] I included both headroom and N+1/multi-AZ considerations.
- [ ] I proposed explicit autoscaling targets and bounds.
- [ ] I added actionable alerts before hard limits.
Common mistakes and self-check
- Planning from averages, not peaks. Fix: multiply by peak factors or use recent p95 peaks.
- No buffer for failover. Fix: add N+1 and headroom (20–40%).
- Ignoring the real bottleneck. Fix: validate with profiling/load tests.
- Setting alerts on raw CPU only. Fix: alert on SLO burn and saturation signals too.
- One-time plan. Fix: schedule monthly reviews and after major launches.
Self-check prompts
- If one AZ is lost, do I still meet SLO?
- What metric will saturate first, and how do I know?
- How quickly can autoscaling respond vs how fast traffic spikes?
- What's the rollback if my forecast is wrong by 50%?
Practical projects
- Create a capacity plan for one critical service: include SLOs, forecast, per-instance capacity, total instances, headroom, N+1, autoscaling, and alerts.
- Run a load test to find sustainable throughput at 60–70% CPU and update your plan.
- Build a dashboard showing demand, utilization, latency (p95/p99), and headroom; add alert rules at 80% and 90% of limits.
Next steps
- Automate weekly reports: peak demand, headroom, and upcoming storage deadlines.
- Introduce pre-warming or scheduled scaling for known events.
- Review cost: compare right-sized instances vs over-provisioning; optimize after SLO is safe.
Mini challenge
Your service currently handles 800 RPS at 65% CPU with 8 pods. Marketing expects a 2× traffic spike for 1 hour. You want 25% headroom and can only scale in whole pods. How many pods should you run during the spike?
Answer
Per-pod capacity at 65% ≈ 800 / 8 = 100 RPS. Peak = 800 × 2 = 1600 RPS. Pods before buffer: 1600 / 100 = 16. Add 25% headroom: 16 × 1.25 = 20 pods.
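A quick check of the mini-challenge arithmetic; a sketch, not an autoscaling recommendation:

```python
import math

per_pod_rps = 800 / 8            # ~100 RPS per pod at 65% CPU
peak_rps = 800 * 2               # 1600 RPS during the spike
pods = math.ceil(peak_rps / per_pod_rps * 1.25)   # 25% headroom -> 20 pods
print(pods)
```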