Why this matters
As a Backend Engineer, you must ensure systems meet demand without wasting money. Capacity planning helps you predict peak load, right-size infrastructure, and protect service-level objectives (SLOs). Real tasks you will do:
- Estimate how many service instances are needed for a promotion or product launch.
- Set autoscaling targets and safety buffers to keep p95–p99 latency under SLO.
- Forecast database storage, IOPS, and network throughput months ahead.
- Plan redundancy (N+1, multi-AZ) so you stay within SLO during failures.
Quick glossary
- Throughput (RPS/jobs/sec): how much work arrives.
- Latency: time to complete a request (p50/p95/p99).
- Utilization: percent of a resource used (CPU, memory, IOPS, bandwidth).
- Saturation: queues building up; waits increase.
- Headroom: spare capacity reserved for spikes/failover.
- N+1: have at least one extra unit beyond the minimum to handle failure.
Concept explained simply
Capacity planning answers: Will our system meet peak demand within SLO while staying cost-efficient?
- Measure current demand and performance.
- Forecast peak demand (growth, seasonality, events).
- Map demand to resources (CPU, memory, storage, network, IOPS).
- Add safety margins (headroom, N+1, multi-AZ).
- Continuously verify via load tests and observability.
Mental model
- Little's Law (simple version): Concurrent work ≈ Arrival rate × Time in system. At 200 requests/sec with an average time in system of 0.2 sec, about 40 requests are in flight (see the sketch after this list).
- Utilization: Keep typical utilization at 50–70% so you have headroom for spikes and to protect tail latency.
- Headroom: plan buffers (e.g., 20–40%) and N+1. Over-provision a bit to avoid SLO breaches; under-provisioning is costlier during incidents.
- Workload type: CPU-bound, memory-bound, I/O-bound, or network-bound. The tightest constraint dominates capacity.
- Scale strategy: Vertical (bigger machines) vs horizontal (more instances). Horizontal scaling plus autoscaling is common for stateless services.
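A minimal sketch of the Little's Law and headroom arithmetic above, in Python; the function names and numbers are illustrative, not from any library or the lesson itself:

```python
def inflight_requests(arrival_rate_rps: float, avg_time_s: float) -> float:
    """Little's Law (simple form): concurrent work ~ arrival rate x time in system."""
    return arrival_rate_rps * avg_time_s

def headroom_fraction(capacity_rps: float, current_rps: float) -> float:
    """Fraction of capacity still free for spikes and failover."""
    return 1.0 - current_rps / capacity_rps

print(inflight_requests(200, 0.2))                            # ~40 requests in flight
print(headroom_fraction(capacity_rps=1000, current_rps=650))  # 0.35 -> 35% headroom
```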
Step-by-step capacity planning (repeatable)
- Define SLO/SLI: e.g., p95 latency < 200 ms, error rate < 0.5%.
- Baseline: Gather RPS/jobs, p95/p99 latency, CPU%, memory, IOPS, network. Note peak vs average.
- Forecast: Estimate peak using recent peaks × growth, seasonality, and special events. Prefer p95 peaks over averages.
- Find constraints: Identify which resource saturates first (CPU, memory, IOPS, network).
- Per-instance capacity: Load test to find sustainable throughput at target utilization (e.g., 65–70%).
- Compute instances: Instances = ceil(peak_throughput / per_instance_capacity) × safety_factor (see the sketch after these steps).
- Resilience: Apply N+1 and distribute across AZs. Ensure you can lose 1 unit/AZ and still meet SLO.
- Autoscaling: Set min/desired/max and targets (e.g., CPU 60–65%).
- Alerts: Alert on SLO burn, saturation, and approaching limits (e.g., 80–90% of capacity sustained).
- Review: Revisit after launches/incidents; update forecasts monthly or after big changes.
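The instance-count formula from these steps can be written as a small helper. This is a sketch with an assumed default 30% safety factor (the same numbers used in Example 1 below), not a prescribed implementation:

```python
import math

def instances_needed(peak_rps: float, per_instance_rps: float,
                     safety_factor: float = 1.3) -> int:
    """ceil(peak / per-instance capacity), then apply the safety factor and
    round up again so only whole instances are deployed."""
    base = math.ceil(peak_rps / per_instance_rps)
    return math.ceil(base * safety_factor)

# 1250 RPS peak, 140 RPS per instance, 30% buffer -> 12 instances
print(instances_needed(1250, 140))
```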
Worked examples
Example 1 – API service sizing
Baseline: 500 RPS peak today. Load test shows each instance sustains 140 RPS at ~65% CPU with p95 latency < 180 ms. Next month's campaign: expect 2.5× peak.
- Forecast peak: 500 × 2.5 = 1250 RPS.
- Instances before buffer: ceil(1250 / 140) = ceil(8.93) = 9.
- Safety buffer 30%: 9 × 1.3 = 11.7 → 12 instances.
- Multi-AZ (3 AZs) with an N+1 mindset: 12 total → 4 per AZ. Losing 1 instance still leaves 11, which covers 1250 RPS at ~81% of per-instance capacity (acceptable if still within SLO).
- Autoscaler: min 3 (one per AZ), desired 12, max 18, target CPU 60–65%.
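The arithmetic in Example 1 can be checked with a few lines of Python; a sketch that assumes the 140 RPS per-instance figure from the load test:

```python
import math

peak_rps = 500 * 2.5                           # forecast peak: 1250 RPS
per_instance_rps = 140                         # sustained at ~65% CPU per the load test

base = math.ceil(peak_rps / per_instance_rps)  # 9 instances before buffer
total = math.ceil(base * 1.3)                  # 12 instances with 30% buffer
per_az = total // 3                            # 4 per AZ across 3 AZs
after_loss = peak_rps / (total - 1) / per_instance_rps  # ~0.81 of per-instance capacity
print(base, total, per_az, round(after_loss, 2))
```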
Example 2 – Batch queue deadline
Goal: process 1,000,000 jobs in 2 hours. Each job needs 0.2 CPU-seconds and fits within a worker's memory.
- Total CPU-seconds: 1,000,000 × 0.2 = 200,000 sec.
- Available wall-clock: 2 hours = 7200 sec.
- Workers needed: 200,000 / 7200 ≈ 27.78 → 28 workers.
- Add 20% headroom: 28 × 1.2 = 33.6 → 34 workers.
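The same calculation for Example 2, as a short sketch; variable names are mine:

```python
import math

jobs = 1_000_000
cpu_seconds_per_job = 0.2
deadline_s = 2 * 3600                                # 7200 s of wall clock

total_cpu_seconds = jobs * cpu_seconds_per_job       # 200,000 CPU-seconds
workers = math.ceil(total_cpu_seconds / deadline_s)  # 28 workers before headroom
workers_with_headroom = math.ceil(workers * 1.2)     # 34 workers with 20% headroom
print(workers, workers_with_headroom)
```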
Example 3 – Storage growth forecast
Current DB: 100 GB. Growth ~8%/month. Volume limit: 2 TB (2048 GB). Plan upgrade when at 75%.
- Target threshold: 0.75 × 2048 = 1536 GB.
- How many months to reach 1536 GB from 100 GB at 8% growth? 100 × (1.08)^m = 1536 → (1.08)^m = 15.36 → m ≈ ln(15.36)/ln(1.08) ≈ 35.5 months.
- Plan: schedule the capacity change at around 33 months, earlier if ingestion spikes.
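Example 3's months-to-threshold figure comes from solving the growth equation with a logarithm; a sketch of that calculation:

```python
import math

current_gb = 100
monthly_growth = 0.08
threshold_gb = 0.75 * 2048                 # 1536 GB, the 75% upgrade trigger

# Solve 100 * 1.08**m = 1536 for m
months = math.log(threshold_gb / current_gb) / math.log(1 + monthly_growth)
print(round(months, 1))                    # ~35.5 months
```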
Who this is for
- Backend and platform engineers responsible for service reliability and cost.
- On-call engineers preparing for traffic spikes or migrations.
Prerequisites
- Basic understanding of service metrics (RPS, latency, CPU, memory).
- Familiarity with autoscaling concepts and multi-AZ deployments.
- Ability to run or read results from load tests.
Learning path
- Before this: Observability fundamentals (SLIs/SLOs), basic scaling.
- This lesson: Core formulas, buffers, and a repeatable planning process.
- After this: Incident response playbooks, cost optimization, advanced forecasting.
Exercises & practice
Exercise 1 – Size a stateless API for a spike
Given: average 350 RPS; expected peak multiplier 3×; SLO p95 < 200 ms; one instance sustains 180 RPS at ~70% CPU (p95 160 ms), degrades after 230 RPS; plan for 30% headroom; 3 AZs with N+1 per AZ. Calculate total instances and per-AZ distribution. Propose autoscaler min/desired/max and CPU target.
Exercise 2 – Forecast storage and set alerts
Given: DB size 400 GB today; linear growth 2.5 GB/day; volume limit 1.5 TB; set alerts at 75% and 90% of limit. Compute when to trigger each alert and when the volume will be full if nothing changes. Suggest a review cadence.
Practice checklist
- [ ] I based plans on peak demand and SLOs, not averages.
- [ ] I identified the dominant bottleneck (CPU/memory/IOPS/network).
- [ ] I included both headroom and N+1/multi-AZ considerations.
- [ ] I proposed explicit autoscaling targets and bounds.
- [ ] I added actionable alerts before hard limits.
Common mistakes and self-check
- Planning from averages, not peaks. Fix: multiply by peak factors or use recent p95 peaks.
- No buffer for failover. Fix: add N+1 and headroom (20–40%).
- Ignoring the real bottleneck. Fix: validate with profiling/load tests.
- Setting alerts on raw CPU only. Fix: alert on SLO burn and saturation signals too.
- One-time plan. Fix: schedule monthly reviews and after major launches.
Self-check prompts
- If one AZ is lost, do I still meet SLO?
- What metric will saturate first, and how do I know?
- How quickly can autoscaling respond vs how fast traffic spikes?
- What's the rollback if my forecast is wrong by 50%?
Practical projects
- Create a capacity plan for one critical service: include SLOs, forecast, per-instance capacity, total instances, headroom, N+1, autoscaling, and alerts.
- Run a load test to find sustainable throughput at 60–70% CPU and update your plan.
- Build a dashboard showing demand, utilization, latency (p95/p99), and headroom; add alert rules at 80% and 90% of limits.
Next steps
- Automate weekly reports: peak demand, headroom, and upcoming storage deadlines.
- Introduce pre-warming or scheduled scaling for known events.
- Review cost: compare right-sized instances vs over-provisioning; optimize after SLO is safe.
Mini challenge
Your service currently handles 800 RPS at 65% CPU with 8 pods. Marketing expects a 2× traffic spike for 1 hour. You want 25% headroom and can only scale in whole pods. How many pods should you run during the spike?
Answer
Per-pod capacity at 65% ≈ 800 / 8 = 100 RPS. Peak = 800 × 2 = 1600 RPS. Pods before buffer: 1600 / 100 = 16. Add 25% headroom: 16 × 1.25 = 20 pods.
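A quick check of the mini-challenge arithmetic; a sketch, not an autoscaling recommendation:

```python
import math

per_pod_rps = 800 / 8            # ~100 RPS per pod at 65% CPU
peak_rps = 800 * 2               # 1600 RPS during the spike
pods = math.ceil(peak_rps / per_pod_rps * 1.25)   # 25% headroom -> 20 pods
print(pods)
```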