
Rolling Updates And Rollbacks

Learn Rolling Updates And Rollbacks for free with explanations, exercises, and a quick test (for MLOps Engineers).

Published: January 4, 2026 | Updated: January 4, 2026

Mental model

  • Max surge = how many extra new pods you can add temporarily.
  • Max unavailable = how many old pods you can take down at once.
  • Readiness probe = the gate. Only ready pods receive traffic.
  • Revision history = snapshots of previous ReplicaSets for undo.
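The surge and unavailability knobs above bound how many pods exist at any instant during a rollout. A back-of-the-envelope sketch (values are illustrative):

```shell
# Pod-count envelope during a RollingUpdate (illustrative values).
replicas=4
max_surge=1        # extra new pods allowed above the desired count
max_unavailable=0  # old pods that may be taken down at once
max_pods=$((replicas + max_surge))
min_available=$((replicas - max_unavailable))
echo "total pods peak at $max_pods; at least $min_available stay available"
```

With these values the Deployment never serves from fewer than 4 ready pods, at the cost of briefly scheduling a 5th.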
What actually happens under the hood
  1. Deployment creates a new ReplicaSet for the new pod template.
  2. New pods start and must pass readiness probes and minReadySeconds.
  3. As new pods become ready, old pods are scaled down respecting maxUnavailable.
  4. If readiness never succeeds or failures spike, you can pause or undo.

Key settings you’ll use

  • strategy: RollingUpdate with maxSurge and maxUnavailable
  • minReadySeconds to ensure stability before progressing
  • readinessProbe aligned with your /health or /v1/models/ready endpoint
  • preStop hook + terminationGracePeriodSeconds to finish in-flight requests
  • revisionHistoryLimit to keep enough rollback points
  • PodDisruptionBudget (PDB) to maintain minimum availability
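Of the settings above, the PDB is the only one not shown in the worked example below. A minimal sketch, assuming the app: model-serving label and 4 replicas used later on this page (the name is hypothetical):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: model-serving-pdb   # hypothetical name
spec:
  minAvailable: 3           # keep at least 3 of 4 replicas up during voluntary disruptions
  selector:
    matchLabels:
      app: model-serving    # must match the Deployment's pod labels
```

Note that a PDB guards voluntary disruptions (drains, evictions); the rollout itself is still governed by maxSurge/maxUnavailable.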

Worked examples

Example 1: Safe rolling update for a model server

Goal: Zero-downtime rollout for an inference service (e.g., TF Serving).

apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-serving
spec:
  replicas: 4
  revisionHistoryLimit: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  minReadySeconds: 20
  selector:
    matchLabels:
      app: model-serving
  template:
    metadata:
      labels:
        app: model-serving
    spec:
      terminationGracePeriodSeconds: 30
      containers:
      - name: server
        image: myrepo/tf-serving:1.0.3
        args: ["--port=8500", "--rest_api_port=8501", "--model_name=classifier", "--model_base_path=/models/classifier"]
        readinessProbe:
          httpGet:
            path: /v1/models/classifier
            port: 8501
          initialDelaySeconds: 5
          periodSeconds: 5
          failureThreshold: 3
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 10"]

What this achieves: at most one extra pod is added while none become unavailable, readiness gates traffic, and the preStop delay lets in-flight connections drain before termination.
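Readiness gating only takes effect if traffic reaches the pods through a Service, since only Ready pods are added to the Service's endpoints. A minimal companion Service sketch (name and port assumed to match the manifest above):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: model-serving       # hypothetical name
spec:
  selector:
    app: model-serving      # must match the Deployment's pod labels
  ports:
  - name: rest
    port: 8501              # TF Serving REST port from the args above
    targetPort: 8501
```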

Example 2: Pause, verify, resume a rollout

# Start an update by changing the image
kubectl set image deploy/model-serving server=myrepo/tf-serving:1.0.4

# Pause the rollout to hold it after the first new pods come up
kubectl rollout pause deployment model-serving

# Verify new pods come up healthy
kubectl get pods -l app=model-serving -w

# Check status and events
kubectl rollout status deployment model-serving
kubectl describe deployment model-serving

# Resume when confident
kubectl rollout resume deployment model-serving

Why pause? To validate readiness, latency, or error metrics before proceeding.
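The resume-or-undo decision can be scripted. A minimal sketch with hypothetical thresholds and hard-coded measurements (in practice these values would come from your metrics system):

```shell
# Hypothetical gate: decide whether to resume or roll back a paused rollout.
error_rate_pct=1     # assumed measurement from monitoring
p99_latency_ms=180   # assumed measurement from monitoring
decision=resume
[ "$error_rate_pct" -gt 2 ] && decision=rollback     # assumed threshold: 2%
[ "$p99_latency_ms" -gt 500 ] && decision=rollback   # assumed threshold: 500 ms
echo "$decision"
# then run: kubectl rollout resume deployment model-serving
#       or: kubectl rollout undo deployment model-serving
```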

Example 3: Roll back quickly

# See rollout history
kubectl rollout history deployment model-serving

# Roll back to previous revision
kubectl rollout undo deployment model-serving

# Or roll back to a specific revision (e.g., 4)
kubectl rollout undo deployment model-serving --to-revision=4

# Confirm state
kubectl rollout status deployment model-serving
kubectl get rs -l app=model-serving

Tip: Keep revisionHistoryLimit high enough to have safe restore points.

Troubleshooting deep dive
  • If rollout stalls: describe the Deployment and new ReplicaSet; check readinessProbe, container logs, and image pull errors.
  • If requests drop: confirm Service selector matches pods; verify readiness gates traffic; ensure PDB and strategy don’t reduce capacity too much.
  • For GPU pods: ensure node capacity and scheduling allow maxSurge; otherwise reduce surge (maxSurge=0 requires maxUnavailable ≥ 1) or scale the Deployment up before updating.
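The GPU headroom check in the last bullet is simple arithmetic; a sketch with assumed capacity numbers:

```shell
# Does one surge pod fit in the node pool? (illustrative numbers)
node_gpus=8        # assumed total GPUs in the pool
gpus_in_use=8      # assumed GPUs already allocated
gpus_per_pod=1
max_surge=1
needed=$((gpus_per_pod * max_surge))
free=$((node_gpus - gpus_in_use))
if [ "$free" -lt "$needed" ]; then
  echo "no headroom: set maxSurge=0 (with maxUnavailable >= 1) or add capacity first"
fi
```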

Common mistakes and how to self-check

  • Using the latest tag: Always pin image and model version. Self-check: Does kubectl describe show a specific immutable tag?
  • No readiness probe or wrong path: Add a reliable health endpoint. Self-check: Do new pods become Ready before old ones terminate?
  • Too aggressive rollout: maxUnavailable > 0 on a small replica count can cause downtime. Self-check: During rollout, does the available pod count ever drop?
  • Ignoring termination time: No preStop + short grace can cut in-flight requests. Self-check: Are connections drained before pod termination?
  • Not enough history: Low revisionHistoryLimit prevents undo. Self-check: kubectl rollout history shows at least a few revisions?
  • Forgetting PDB: Voluntary disruptions can drop capacity. Self-check: PDB ensures min pods available during node maintenance?
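The termination-time self-check above reduces to one inequality: the grace period must cover the preStop delay plus the worst-case drain time. A sketch using the values from Example 1 (the drain time is an assumption):

```shell
prestop_sleep=10   # from the preStop hook in Example 1
drain_time=15      # assumed worst-case time to finish in-flight requests
grace=30           # terminationGracePeriodSeconds from Example 1
if [ "$grace" -gt $((prestop_sleep + drain_time)) ]; then
  echo "grace period has headroom"
fi
```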

Exercises (hands-on)

Note: The quick test is available to everyone; only logged-in users get saved progress.

Exercise 1 — Configure a safe rolling update

Goal: Create a 3-replica Deployment with maxUnavailable=0, maxSurge=1, readinessProbe, minReadySeconds=20, and a 10s preStop. Trigger an image update and watch a zero-downtime rollout.

  • Verify pods never drop below 3 available during the update.
  • Confirm new pods become Ready before old ones terminate.
Need a hint?
  • kubectl apply -f deployment.yaml
  • kubectl set image deploy/... container=repo/image:new
  • kubectl rollout status deploy/...
  • kubectl get pods -w to watch changes

Exercise 2 — Simulate a bad release and roll back

Goal: Push an update that fails readiness (e.g., wrong port/path), observe a stalled rollout, then undo to restore service.

  • Confirm rollout status indicates failure or timeout.
  • Perform kubectl rollout undo and verify recovery.
Need a hint?
  • Break readinessProbe path temporarily, apply, then watch status.
  • kubectl describe deployment ... to see why rollout is stuck.
  • kubectl rollout undo deployment ...

Practical projects

  • Project 1: TorchServe upgrade playbook. Write a scripted workflow that updates image tags, pauses rollout, checks a health endpoint and test inference, then resumes or rolls back automatically.
  • Project 2: ConfigMap rotation with rollout. Mount a ConfigMap for model metadata. Change it, roll out, verify the app reload behavior, and ensure no downtime.
  • Project 3: Blue/green with controlled traffic. Run two Deployments (blue/green) behind a Service. Practice switching labels to move traffic and add a rollback procedure.

Learning path

  • Before: Deployments, Services, Probes, and Resource Requests
  • Now: Rolling updates and rollbacks
  • Next: Autoscaling (HPA/VPA), Canary strategies, Progressive delivery with metrics gates

Mini challenge

Your cluster has 4 replicas of a latency-sensitive model. During peak traffic, you must update the container image without losing capacity. Design the rollout settings (maxSurge, maxUnavailable, minReadySeconds, readinessProbe) and outline the commands to pause, validate, resume, and, if needed, roll back. Write your plan as a 5–7 step runbook.

Next steps

  • Template your Deployment with safe defaults (strategy, probes, preStop, PDB).
  • Automate pause-validate-resume in your CI/CD pipeline.
  • Track rollout success with visible metrics and logs; set clear rollback criteria.

Practice Exercises


Instructions

Create deployment.yaml for a 3-replica model-serving Deployment with:

  • strategy: RollingUpdate (maxSurge=1, maxUnavailable=0)
  • minReadySeconds=20
  • readinessProbe on /healthz (HTTP 200)
  • preStop sleep 10 and terminationGracePeriodSeconds=30

Apply it, then update the image tag to a new version and watch the rollout.

Expected Output
Pods scale to 4 total briefly (surge), then return to 3 with zero downtime. New pods become Ready before old pods terminate.

Rolling Updates And Rollbacks — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

