Resource Requests, Limits, and Autoscaling

Learn resource requests, limits, and autoscaling for free with explanations, exercises, and a quick test (for Platform Engineers).

Published: January 23, 2026 | Updated: January 23, 2026

Why this matters

As a Platform Engineer, you are responsible for making apps fast, stable, and cost-efficient. Resource requests and limits control how much CPU and memory a Pod reserves and can use; autoscaling adds or removes capacity based on load. Get these wrong and you get throttling, OOMKills, noisy neighbors, or wasteful clusters. Get them right and you ship reliable, scalable platforms.

  • Real tasks you will do:
    • Right-size CPU/memory for Deployments to stop throttling and OOMKills.
    • Configure HPA to scale replicas based on CPU/memory or custom metrics.
    • Work with Cluster Autoscaler to add nodes when requests cannot be scheduled.
    • Set sensible defaults/limits in namespaces so teams avoid outages and cost spikes.

Concept explained simply

Each container in Kubernetes can declare two numbers for each resource:

  • request: how much you ask the scheduler to reserve. It determines placement and capacity planning.
  • limit: the maximum the container may use. Exceeding the CPU limit causes throttling; exceeding the memory limit gets the container killed (OOMKilled).

Units:

  • CPU: measured in millicores (500m = 0.5 of a core). CPU is compressible: exceeding the limit throttles the container but doesn’t kill it.
  • Memory: measured in bytes (e.g., 256Mi, 1Gi). Memory is not compressible: exceeding the limit gets the container OOMKilled.

Pod QoS classes (derived from requests/limits):

  • Guaranteed: every container sets request == limit for both CPU and memory. Most protected from eviction.
  • Burstable: some requests set; can burst up to limits. Middle protection.
  • BestEffort: no requests/limits. Least protected; first to be evicted under pressure.
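
For instance, a Pod whose containers set request equal to limit for both resources is classified as Guaranteed. A minimal sketch (the name and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: qos-demo
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: "250m"
        memory: "256Mi"
      limits:            # equal to requests, so QoS class is Guaranteed
        cpu: "250m"
        memory: "256Mi"

You can verify the assigned class with: kubectl get pod qos-demo -o jsonpath='{.status.qosClass}'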

Autoscaling layers:

  • HPA (Horizontal Pod Autoscaler): changes replica count based on metrics (e.g., CPU utilization). More replicas = more parallelism.
  • VPA (Vertical Pod Autoscaler): recommends or applies larger/smaller requests/limits per Pod (restarts Pods when applying changes); see the sketch after this list.
  • Cluster Autoscaler: adds/removes nodes when Pods can’t be scheduled due to insufficient requested resources.
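
A recommendation-only VPA is a safe way to start; a minimal sketch, assuming the VPA components are installed in the cluster and the target Deployment is named api:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"  # recommend only; never restarts Pods

With updateMode "Off", VPA publishes sizing recommendations in the object's status without touching running Pods.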

Mental model: budget and speed limit

Think of request as your reserved budget (a desk on the floor) and limit as the speed limit (how fast you can go). The scheduler places you based on your reserved budget. When traffic grows, HPA adds more workers (more desks), and if there’s no room, Cluster Autoscaler rents more floor space (new nodes).

Worked examples

Example 1: Right-size a latency-sensitive API

Observed p95 CPU per Pod is ~150m during peak with spikes to 400m. Memory steady at 180Mi with occasional peaks to 300Mi.

  • Pick headroom: request CPU 200m, limit 500m; request memory 256Mi, limit 512Mi.
  • Why: request covers typical p95 so the Pod schedules reliably; limit allows burst without throttling too quickly; memory limit above peak to avoid OOMKills.

resources:
  requests:
    cpu: "200m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"
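
For context, this block lives on each container in the Deployment's Pod template; a minimal sketch (the name and image are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
      - name: api
        image: registry.example.com/api:1.0  # illustrative image
        resources:
          requests:
            cpu: "200m"
            memory: "256Mi"
          limits:
            cpu: "500m"
            memory: "512Mi"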

Example 2: Configure HPA for CPU

Target average CPU utilization per Pod at 60%, replicas 2–10.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
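
After applying the HPA, you can watch its decisions with kubectl get hpa api-hpa --watch and see current metrics and scaling events with kubectl describe hpa api-hpa.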

Note: HPA decisions depend on requests. If requests are too small, the same raw CPU load looks like a high utilization percentage and causes over-scaling.
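
Under the hood, the HPA computes desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization). For example, 3 replicas averaging 90% utilization against a 60% target gives ceil(3 × 90 / 60) = 5 replicas.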

Example 3: Memory-bound worker avoiding OOMKills

Worker uses 600–700Mi with spikes to 900Mi when processing large batches.

  • Set request 700Mi, limit 1Gi; CPU request 250m, limit 1000m (to allow bursts).
  • Use HPA on memory utilization at 70% to add replicas when usage grows. Note that steady usage of 600–700Mi against a 700Mi request already reads as 86–100% utilization, so a 70% target scales out early by design; memory-based scaling only helps if per-Pod memory actually falls as work spreads across replicas.

resources:
  requests:
    cpu: "250m"
    memory: "700Mi"
  limits:
    cpu: "1000m"
    memory: "1Gi"
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70

What if Pods don't scale even when HPA says they should?
  • Check events for Pending Pods with reasons like "Insufficient cpu" or "Insufficient memory" (commands after this list). Cluster Autoscaler may need to add nodes.
  • If CA is enabled but not scaling, requests may be larger than any single node’s free space. Reduce per-Pod request or use larger nodes.
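
A quick way to run these checks, assuming kubectl access to the cluster (the Pod name is illustrative):

# List Pods stuck in Pending
kubectl get pods --field-selector=status.phase=Pending

# Scheduling failures show up as FailedScheduling events, e.g.
# "0/4 nodes are available: 4 Insufficient cpu."
kubectl get events --field-selector reason=FailedScheduling

# Full detail for one Pod
kubectl describe pod api-7d9f8b6c4-abcde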

Exercises you can run

Do these after reading the examples. They mirror the graded exercises below.

  1. Exercise 1: Plan requests/limits and an HPA for a spiky web service. Target 60% CPU utilization; expected YAML includes Deployment resources and HPA 2–10 replicas.
  2. Exercise 2: Diagnose throttling and memory eviction from given logs; propose resource tweaks and HPA settings.
  • Checklist before you move on:
    • You can explain the difference between request and limit in one sentence.
    • You can choose CPU/memory values from observed p95 and peak usage.
    • You can configure an autoscaling/v2 HPA with CPU or memory targets.
    • You know how QoS classes change with request/limit settings.

Common mistakes and how to self-check

  • Too-low requests inflating HPA utilization and causing over-scaling.
    Self-check: compare raw CPU (cores) with utilization %. If raw CPU is steady but the percentage is high, requests are probably too small.

  • CPU limits too tight, causing throttling.
    Self-check: look for throttling metrics or "throttling" messages in logs. Increase the CPU limit, or remove it for latency-critical apps.

  • Memory limits below peak, causing OOMKilled.
    Self-check: check container restarts with reason OOMKilled (see the commands after this list). Set the limit above the known peak and the request closer to steady state.

  • Expecting HPA to fix bad per-Pod sizing.
    Self-check: if each Pod instantly OOMs or throttles, HPA won't help. Fix per-Pod requests/limits first.

  • Ignoring Cluster Autoscaler when Pods stay Pending.
    Self-check: Pending with "Insufficient" reasons means the cluster lacks resources. Reduce the per-Pod request or ensure Cluster Autoscaler can scale nodes.
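
To run the OOMKilled self-check concretely, assuming kubectl access (the Pod name is illustrative):

# The RESTARTS column hints at crash loops
kubectl get pods

# Look for "Last State: Terminated" with "Reason: OOMKilled"
kubectl describe pod worker-5f7d9c6b8-xk2lp | grep -B 2 OOMKilled

# Compare live usage with limits (requires metrics-server)
kubectl top pod worker-5f7d9c6b8-xk2lp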

Practical projects

  • Autoscaling API: Deploy a sample API with proper requests/limits and HPA on CPU. Perform a basic load test and tune targets to hit a latency SLO.
  • Memory-heavy batch worker: Configure memory-first sizing and HPA on memory utilization; verify zero OOMKills across a large batch.
  • Cost-aware namespace defaults: Create a LimitRange and ResourceQuota that prevent BestEffort Pods and cap over-provisioning (a starter sketch follows this list).
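
A starter sketch for the namespace-defaults project, assuming a namespace named team-a (all values are illustrative):

apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: team-a
spec:
  limits:
  - type: Container
    defaultRequest:      # injected when a container omits requests
      cpu: "100m"
      memory: "128Mi"
    default:             # injected when a container omits limits
      cpu: "500m"
      memory: "512Mi"
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: namespace-caps
  namespace: team-a
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi

Because the LimitRange injects default requests and limits, no Pod in the namespace can end up BestEffort, and the quota caps total over-provisioning.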

Learning path

  • Before: Containers basics, Pod/Deployment, metrics collection fundamentals.
  • Now: Requests, limits, QoS; HPA v2 configuration; Cluster Autoscaler behavior.
  • After: Custom/external metrics for HPA, VPA recommendations, pod topology spread, priority and preemption.

Who this is for

  • Platform and DevOps engineers owning multi-tenant clusters.
  • Backend engineers deploying services who need predictable performance.
  • SREs responsible for latency/error budgets and cost controls.

Prerequisites

  • Comfort with Kubernetes Deployments, Pods, and basic YAML editing.
  • Basic understanding of CPU/memory metrics and Pod logs.

Next steps

  • Apply these settings to one real service and watch metrics for 24–48 hours.
  • Tune requests/limits and HPA targets to meet latency/error budgets.
  • Add namespace defaults (LimitRange) to guide teams toward safe values.

Mini challenge

Pick an existing Deployment that occasionally OOMKills. Increase memory limit 20–30% above observed peak, set request near steady-state, and add an HPA on memory at 70%. Verify no OOMKills over a full traffic cycle and note the replica behavior.

Quick Test: what to expect

10 short questions on requests, limits, QoS, and autoscaling. Everyone can take the test; only logged-in users get saved progress.

Practice Exercises

2 exercises to complete

Instructions

You run a web API with these observations per Pod at peak: CPU p95 180m, CPU spikes 450m; Memory steady 220Mi, peaks 380Mi. Target average CPU utilization: 60%. You want minimum 2 replicas, up to 10 during spikes.

  • Choose CPU and memory requests/limits.
  • Create an HPA (autoscaling/v2) that meets the target.
  • Explain why your chosen values avoid throttling and OOMKills.

Expected Output

Deployment resources with CPU request around 200m, CPU limit around 500m; memory request around 256Mi, memory limit around 512Mi. HPA with minReplicas 2, maxReplicas 10, CPU averageUtilization 60.

Resource Requests, Limits, and Autoscaling — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.
