Why this matters
As a Platform Engineer, you are responsible for making apps fast, stable, and cost-efficient. Resource requests and limits control how much CPU and memory a Pod reserves and can use; autoscaling adds or removes capacity based on load. Get these wrong and you get throttling, OOMKills, noisy neighbors, or wasteful clusters. Get them right and you ship reliable, scalable platforms.
- Real tasks you will do:
- Right-size CPU/memory for Deployments to stop throttling and OOMKills.
- Configure HPA to scale replicas based on CPU/memory or custom metrics.
- Work with Cluster Autoscaler to add nodes when requests cannot be scheduled.
- Set sensible defaults/limits in namespaces so teams avoid outages and cost spikes.
Concept explained simply
Each container in Kubernetes can declare two numbers for each resource:
- request: how much you ask the scheduler to reserve. Determines placement and capacity planning.
- limit: the maximum the container may use. Exceeding CPU limit causes throttling; exceeding memory limit kills the container (OOMKilled).
Units:
- CPU: cores or millicores (500m = 0.5 of a core). CPU is compressible: hitting the limit throttles the container but doesn't kill it.
- Memory: bytes (e.g., 256Mi, 1Gi). Memory is not compressible: exceeding the limit OOMKills the container.
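Both resources accept several equivalent spellings; a quick illustration (values are arbitrary):

resources:
  requests:
    cpu: "500m"      # same as 0.5 (half a core)
    memory: "256Mi"  # mebibytes: 256 * 1024 * 1024 bytes
  limits:
    cpu: "1"         # one full core, same as 1000m
    memory: "1Gi"    # note 1G (10^9 bytes) is slightly smaller than 1Gi (2^30 bytes)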
Pod QoS classes (derived from requests/limits):
- Guaranteed: every container sets requests equal to limits for both CPU and memory. Most protected from eviction.
- Burstable: at least one container sets a request or limit, but the Pod doesn't meet the Guaranteed criteria; containers can burst above their requests up to their limits. Middle protection.
- BestEffort: no requests/limits. Least protected; first to be evicted under pressure.
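To make the classes concrete, here is a Pod-spec fragment (container name and image are placeholders) that lands in the Guaranteed class because requests equal limits for both resources; drop the limits and it becomes Burstable, drop requests and limits entirely and it becomes BestEffort:

containers:
- name: cache          # placeholder name
  image: redis:7       # placeholder image
  resources:
    requests:
      cpu: "500m"
      memory: "512Mi"
    limits:
      cpu: "500m"
      memory: "512Mi"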
Autoscaling layers:
- HPA (Horizontal Pod Autoscaler): changes replica count based on metrics (e.g., CPU utilization). More replicas = more parallelism.
- VPA (Vertical Pod Autoscaler): recommends or applies larger/smaller requests/limits per Pod (restarts Pods when applying).
- Cluster Autoscaler: adds/removes nodes when Pods can’t be scheduled due to insufficient requested resources.
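VPA is an add-on rather than part of the core API, so the exact schema depends on the installed version; a sketch assuming the autoscaling.k8s.io/v1 CRD in recommendation-only mode:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa              # placeholder name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"        # only publish recommendations; "Auto" applies them and restarts Pods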
Mental model: budget and speed limit
Think of request as your reserved budget (a desk on the floor) and limit as the speed limit (how fast you can go). The scheduler places you based on your reserved budget. When traffic grows, HPA adds more workers (more desks), and if there’s no room, Cluster Autoscaler rents more floor space (new nodes).
Worked examples
Example 1: Right-size a latency-sensitive API
Observed p95 CPU per Pod is ~150m during peak with spikes to 400m. Memory steady at 180Mi with occasional peaks to 300Mi.
- Pick headroom: request CPU 200m, limit 500m; request memory 256Mi, limit 512Mi.
- Why: request covers typical p95 so the Pod schedules reliably; limit allows burst without throttling too quickly; memory limit above peak to avoid OOMKills.
resources:
  requests:
    cpu: "200m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"
Example 2: Configure HPA for CPU
Target average CPU utilization per Pod at 60%, replicas 2–10.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
Note: HPA utilization is calculated against requests, not limits. If requests are too small, the same raw CPU load shows up as a high utilization percentage and triggers over-scaling.
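A quick calculation with Example 1's numbers: a Pod using 300m of CPU reports 150% utilization against a 200m request but only 60% against a 500m request. The controller roughly computes desiredReplicas = ceil(currentReplicas × currentUtilization / target), so 4 replicas at 150% utilization with a 60% target become ceil(4 × 150 / 60) = 10 replicas, while at 60% utilization nothing changes.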
Example 3: Memory-bound worker avoiding OOMKills
Worker uses 600–700Mi with spikes to 900Mi when processing large batches.
- Set memory request and limit both to 1Gi (keeping the memory request equal to the limit leaves headroom above the 900Mi peak and reduces eviction risk); CPU request 250m, limit 1000m to allow bursts.
- Use an HPA on memory utilization at 70%: steady usage of 600–700Mi sits at roughly 59–68% of the 1Gi request, so replicas are added only when large batches push usage toward the 900Mi peaks.
resources:
  requests:
    cpu: "250m"
    memory: "1Gi"
  limits:
    cpu: "1000m"
    memory: "1Gi"
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70
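If replicas flap as batches start and finish, autoscaling/v2 also supports an optional behavior stanza under spec to slow scale-down; a sketch with illustrative values that could be added to either HPA above:

  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 minutes of lower usage before removing replicas
      policies:
      - type: Pods
        value: 1                        # remove at most one Pod per period
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0     # add replicas immediately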
What if Pods don't scale even when HPA says they should?
- Check events on Pending Pods for messages like "Insufficient cpu" or "Insufficient memory". Cluster Autoscaler may need to add nodes.
- If Cluster Autoscaler is enabled but not scaling, the per-Pod request may be larger than any available node type can fit, so even a fresh node won't help. Reduce the per-Pod request or use larger nodes.
Exercises you can run
Do these after reading the examples. They mirror the graded exercises below.
- Exercise 1: Plan requests/limits and an HPA for a spiky web service. Target 60% CPU utilization; expected YAML includes Deployment resources and HPA 2–10 replicas.
- Exercise 2: Diagnose throttling and memory eviction from given logs; propose resource tweaks and HPA settings.
- Checklist before you move on:
- You can explain the difference between request and limit in one sentence.
- You can choose CPU/memory values from observed p95 and peak usage.
- You can configure an autoscaling/v2 HPA with CPU or memory targets.
- You know how QoS classes change with request/limit settings.
Common mistakes and how to self-check
- Too-low requests inflating HPA utilization and causing over-scaling. Self-check: compare raw CPU usage (cores) with the utilization percentage; if raw CPU is steady but the percentage is high, requests are probably too small.
- CPU limits too tight, causing throttling. Self-check: look for throttling metrics or "throttling" messages in logs; increase the CPU limit or remove it for latency-critical apps (see the sketch after this list).
- Memory limits below peak, causing OOMKilled containers. Self-check: check container restarts with reason OOMKilled; set the limit above the known peak and the request closer to steady state.
- Expecting HPA to fix bad per-Pod sizing. Self-check: if each Pod instantly OOMs or throttles, HPA won't help; fix per-Pod requests/limits first.
- Ignoring Cluster Autoscaler when Pods stay Pending. Self-check: Pending with "Insufficient" reasons means the cluster lacks resources; reduce the per-Pod request or make sure Cluster Autoscaler can add nodes.
Practical projects
- Autoscaling API: Deploy a sample API with proper requests/limits and HPA on CPU. Perform a basic load test and tune targets to hit a latency SLO.
- Memory-heavy batch worker: Configure memory-first sizing and HPA on memory utilization; verify zero OOMKills across a large batch.
- Cost-aware namespace defaults: Create a LimitRange and ResourceQuota that prevent BestEffort Pods and cap over-provisioning.
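For the namespace-defaults project, a starting point might look like the following (names and values are illustrative and should be tuned per namespace): the LimitRange injects default requests and limits so nothing lands as BestEffort, and the ResourceQuota caps the namespace's total requests and limits.

apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults     # placeholder name
  namespace: team-a            # placeholder namespace
spec:
  limits:
  - type: Container
    defaultRequest:            # applied when a container omits requests
      cpu: "100m"
      memory: "128Mi"
    default:                   # applied when a container omits limits
      cpu: "500m"
      memory: "512Mi"
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: namespace-caps
  namespace: team-a
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi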
Learning path
- Before: Containers basics, Pod/Deployment, metrics collection fundamentals.
- Now: Requests, limits, QoS; HPA v2 configuration; Cluster Autoscaler behavior.
- After: Custom/external metrics for HPA, VPA recommendations, pod topology spread, priority and preemption.
Who this is for
- Platform and DevOps engineers owning multi-tenant clusters.
- Backend engineers deploying services who need predictable performance.
- SREs responsible for latency/error budgets and cost controls.
Prerequisites
- Comfort with Kubernetes Deployments, Pods, and basic YAML editing.
- Basic understanding of CPU/memory metrics and Pod logs.
Next steps
- Apply these settings to one real service and watch metrics for 24–48 hours.
- Tune requests/limits and HPA targets to meet latency/error budgets.
- Add namespace defaults (LimitRange) to guide teams toward safe values.
Mini challenge
Pick an existing Deployment that occasionally OOMKills. Raise the memory limit to 20–30% above the observed peak, set the request near steady state, and add an HPA on memory at 70%. Verify there are no OOMKills over a full traffic cycle and note the replica behavior.
Quick Test: what to expect
10 short questions on requests, limits, QoS, and autoscaling. Everyone can take the test; only logged-in users get saved progress.