
Resource Requests, Limits, and Autoscaling

Learn Resource Requests, Limits, and Autoscaling for free with explanations, exercises, and a quick test (for MLOps Engineers).

Published: January 4, 2026 | Updated: January 4, 2026

Why this matters

As an MLOps Engineer, you must make ML services reliable and cost-effective. Setting Kubernetes resource requests and limits correctly ensures pods get scheduled on the right nodes, avoid OOM kills or CPU throttling, and scale up or down based on demand. Typical tasks include:

  • Right-sizing real-time inference services to keep latency low and costs under control.
  • Scheduling GPU training jobs efficiently so they start quickly and use GPUs fully.
  • Designing autoscaling for batch and streaming workloads that follow traffic or queue depth.
What can go wrong without proper sizing?
  • Pods stuck in Pending due to too-large requests or missing GPU resources.
  • OOMKilled containers from memory limits that are too low.
  • High p99 latency from CPU throttling due to strict CPU limits.
  • Unnecessary cloud bills from overly conservative requests.

Concept explained simply

In Kubernetes, each container can declare requests (what it needs to run) and limits (the maximum it can use). The scheduler places pods on nodes that can honor the requests. At runtime, the kernel enforces limits via cgroups. Autoscalers use these values and live metrics to scale pods and nodes.

  • Requests: Minimum guaranteed resources for scheduling and QoS decisions.
  • Limits: Hard caps (for memory, exceeding the limit kills the container; for CPU, exceeding causes throttling).
  • Pod QoS (a minimal example follows this list):
    • Guaranteed: requests equal limits for every container, for both CPU and memory.
    • Burstable: at least one container sets a request or limit, but the pod does not meet the Guaranteed criteria.
    • BestEffort: no container sets any requests or limits.
  • Autoscaling:
    • HPA (Horizontal Pod Autoscaler): adds/removes replicas based on metrics (e.g., CPU% of request, memory, or custom metrics).
    • VPA (Vertical Pod Autoscaler): adjusts requests/limits for a pod (commonly used in recommendation mode for services and applied automatically for jobs).
    • Cluster Autoscaler: adds/removes nodes to fit unschedulable pods (cloud-provider feature).
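
To make the QoS classes concrete, here is a minimal sketch of a single-container Pod that lands in the Guaranteed class because requests equal limits for both CPU and memory (the name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: qos-guaranteed-demo
spec:
  containers:
  - name: app
    image: myrepo/app:1.0          # placeholder image
    resources:
      requests:
        cpu: "500m"
        memory: "512Mi"
      limits:
        cpu: "500m"                # equal to the request, so the pod is Guaranteed
        memory: "512Mi"

Keep the requests but drop the limits and the pod becomes Burstable; drop both and it becomes BestEffort.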

Mental model

Think of a node as an office floor and a pod as a team. Requests reserve desks so your team has a guaranteed place. Limits are the maximum desks the team can occupy. HPA hires more teams (pods) when busy. VPA changes team size (per-pod resources). Cluster Autoscaler rents more floors (nodes) if there isn’t space.

Worked examples

Example 1: Real-time CPU-bound inference service

Goal: keep p95 latency under 100ms during peak traffic.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: fast-infer
spec:
  replicas: 3
  selector:
    matchLabels:
      app: fast-infer
  template:
    metadata:
      labels:
        app: fast-infer
    spec:
      containers:
      - name: server
        image: myrepo/fast-infer:1.2
        resources:
          requests:
            cpu: "300m"
            memory: "512Mi"
          limits:
            cpu: "800m"
            memory: "1Gi"
        ports:
        - containerPort: 8080
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: fast-infer-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fast-infer
  minReplicas: 3
  maxReplicas: 12
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60

Why: The HPA targets 60% average CPU utilization relative to the request (300m, so about 180m per pod). If actual usage is around 600m per pod, utilization is 200%, and the HPA scales toward ceil(3 × 200 / 60) = 10 replicas (over one or more sync periods), pushing per-pod utilization back toward 60% and keeping latency down.

Example 2: GPU training job

Goal: request exactly one GPU and enough CPU/memory for the data loader.

apiVersion: batch/v1
kind: Job
metadata:
  name: train-resnet
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: myrepo/train:2.0
        resources:
          requests:
            cpu: "2"
            memory: "8Gi"
            nvidia.com/gpu: "1"
          limits:
            cpu: "4"
            memory: "16Gi"
            nvidia.com/gpu: "1"
        args: ["--epochs", "20", "--batch-size", "64"]

Why: GPUs are scheduled via the extended resource nvidia.com/gpu (exposed by the NVIDIA device plugin). Extended resources must be whole numbers and the request must equal the limit, so you ask for exactly one GPU. The CPU and memory values leave enough headroom for the data-loading pipeline.

Example 3: Batch inference workers with HPA on memory

Goal: scale workers when memory pressure increases (proxy for queue backlog).

apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-worker
spec:
  replicas: 2
  selector:
    matchLabels:
      app: batch-worker
  template:
    metadata:
      labels:
        app: batch-worker
    spec:
      containers:
      - name: worker
        image: myrepo/worker:1.0
        resources:
          requests:
            cpu: "500m"
            memory: "1Gi"
          limits:
            cpu: "1500m"
            memory: "2Gi"
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: batch-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: batch-worker
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70

Why: For memory, the HPA measures utilization against the request (1Gi here), not the limit. As batches grow or the queue backs up, per-pod memory climbs, and the HPA adds workers to hold average utilization near 70% of the request, well below the 2Gi limit.

Tip: Jobs and autoscaling

Jobs themselves don’t scale via HPA, but you can scale the Deployment of workers that consume tasks from a queue. For scheduled runs, use CronJob to create Jobs at intervals.
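
For scheduled batch work, a minimal CronJob sketch (schedule, image, and sizes are illustrative) that launches a scoring Job every night:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-batch-score
spec:
  schedule: "0 2 * * *"              # 02:00 every day; adjust to your window
  concurrencyPolicy: Forbid          # skip a run if the previous one is still going
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: scorer
            image: myrepo/batch-score:1.0   # placeholder image
            resources:
              requests:
                cpu: "1"
                memory: "2Gi"
              limits:
                cpu: "2"
                memory: "4Gi"

Each run gets its own Job and pod, so requests and limits apply per run exactly as they do for any other pod.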

Reliable sizing workflow

  1. Measure a baseline: run a pod with generous limits and scrape CPU/memory over a real workload.
  2. Set initial requests to p50 usage and limits to p95–p99 usage (memory) or 2–3x requests (CPU), then load-test.
  3. Watch for OOM kills or throttling; adjust memory limits up if killed, or relax CPU limits if latency spikes.
  4. Enable HPA targeting 50–70% CPU for latency-sensitive services; for memory-driven workloads, target memory utilization.
  5. Consider VPA for recommendations (especially for batch jobs) and use Cluster Autoscaler so new replicas have room; a recommendation-mode sketch follows this section.
How to spot CPU throttling
  • Latency rises while CPU usage appears capped near the limit.
  • Container runtime metrics show throttled time increasing.
  • Fix: raise/remove CPU limits, keep requests sized for scheduling.
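
Step 5 above mentions VPA recommendations. A minimal sketch, assuming the VPA custom resource (autoscaling.k8s.io/v1) is installed in your cluster, pointed at the fast-infer Deployment from Example 1 and kept in recommendation-only mode:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: fast-infer-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fast-infer
  updatePolicy:
    updateMode: "Off"                # recommend only; never evict or resize pods

With updateMode "Off", the VPA status carries suggested requests that you can review and apply manually during your weekly sizing review.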

Common mistakes and self-check

  • Skipping requests and limits entirely: the pod becomes BestEffort QoS and is first in line for eviction under node pressure. (Setting only limits does not cause this; requests then default to the limit values.)
  • Setting limit < request: invalid; the API rejects it.
  • Memory limits too low: leads to OOMKilled. Increase memory limit or optimize memory usage.
  • Very tight CPU limits on latency-sensitive services: causes throttling and tail latency. Prefer higher limits or no CPU limit (see the sketch after this list).
  • Using HPA without Cluster Autoscaler in small clusters: replicas may remain Pending under peak.
  • Running HPA and VPA both in "Auto" on the same Deployment: can conflict. Use VPA in "Off" (recommendation) if HPA controls replicas.
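To sidestep CPU throttling on latency-sensitive pods, one pattern is to keep the CPU request (for scheduling) but leave the CPU limit unset while still capping memory. A container-level sketch with illustrative values:

        resources:
          requests:
            cpu: "500m"              # reserved for scheduling and used for HPA math
            memory: "1Gi"
          limits:
            memory: "2Gi"            # memory stays capped; no CPU limit means no throttling

The pod is still Burstable, and HPA percentage targets are unaffected because utilization is measured against requests.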
Self-check
  • Does each container have requests set for CPU and memory?
  • Are memory limits high enough to avoid OOM under peak batch size?
  • Is HPA target utilization realistic given your per-pod throughput?
  • Can the cluster add nodes when HPA scales up?

Exercises


Exercise 1 — Propose requests/limits

You measured a single inference pod under peak for 30 minutes:

  • CPU: p50 200m, p95 600m
  • Memory: p50 300Mi, p95 700Mi
  • Latency target: p95 under 120ms

Pick initial requests and limits to balance stability and cost, and explain your choice.

When you’re done, compare with the solution below.

Exercise 2 — Write an HPA

Create an HPA for Deployment api-infer with minReplicas: 2, maxReplicas: 10, targeting averageUtilization: 65 for CPU.

When you’re done, compare with the solution below.

  • Checklist before checking solutions:
    • Each container has both requests and limits (except when intentionally leaving CPU limit unset).
    • Memory limit is at least p95 memory usage.
    • HPA refers to the correct target and API versions.

Practical projects

  • Right-size an inference service: start with overprovisioned limits, measure, then iteratively lower requests/limits to hit SLO and cost goals.
  • Train on GPUs with a Job: request GPUs, add CPU/memory headroom, and validate throughput with a profiling run.
  • Batch worker autoscaling: deploy workers with HPA on memory; simulate backlog and verify scale-out and scale-in behavior.
  • Rightsizing pipeline: add VPA in recommendation mode and create a weekly review of suggested requests.

Who this is for

  • MLOps Engineers building reliable, cost-aware ML infrastructures.
  • Data/ML Engineers deploying training and inference on Kubernetes.
  • Platform Engineers supporting ML teams.

Prerequisites

  • Comfort with basic Kubernetes (Pods, Deployments, Services, Jobs).
  • YAML editing and kubectl basics.
  • Understanding of CPU units (millicores), memory units (Mi/Gi), and GPU resources.

Learning path

  • Start: Understand requests vs limits and Pod QoS.
  • Practice: Apply to one service and one job; measure and iterate.
  • Add autoscaling: HPA for services; consider VPA for recommendations.
  • Scale the cluster: enable Cluster Autoscaler.
  • Review weekly: refine requests/limits from production metrics.

Next steps

  • Complete the exercises, then take the Quick Test to confirm your understanding.
  • Apply this to one live workload and track latency, error rates, and cost for a week.

Mini challenge

Your service shows rare OOMKills during traffic spikes. What do you change first?

Increase the memory limit to cover p99 usage and leave some headroom, then monitor. If you cannot reproduce, add memory profiling to find peak drivers. Consider HPA on memory if spikes correlate with batch size.
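
Hypothetical numbers to make this concrete: if profiling showed spikes up to roughly 1.4Gi at p99, the adjusted resources block could look like this (values are illustrative, not taken from the examples above):

          resources:
            requests:
              memory: "1Gi"
            limits:
              memory: "2Gi"          # covers the ~1.4Gi p99 spike with headroom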

Exercise solutions

Exercise 1 — Expected output
A short proposal with CPU request ~200–300m, CPU limit ~800–1000m; memory request ~300–400Mi, memory limit >= 700Mi (e.g., 1Gi). Rationale referencing p50/p95 and latency.
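
One concrete set of values inside those ranges (not the only valid answer):

        resources:
          requests:
            cpu: "300m"              # around p50 CPU, keeps scheduling cheap
            memory: "400Mi"          # a bit above p50 memory
          limits:
            cpu: "1000m"             # well above p95 (600m) to avoid throttling at peak
            memory: "1Gi"            # above p95 (700Mi) with headroom against OOMKill

Exercise 2 — Expected output

A manifest along these lines satisfies the exercise; the Deployment name and thresholds come from the prompt, while the HPA object name is just a suggestion:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-infer-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-infer
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 65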

Resource Requests, Limits, and Autoscaling — Quick Test

Test your knowledge with 9 questions. Pass with 70% or higher.
