
Rolling Updates And Rollbacks

Learn Rolling Updates And Rollbacks for free with explanations, exercises, and a quick test (for MLOps Engineers).

Published: January 4, 2026 | Updated: January 4, 2026

Mental model

  • Max surge = how many extra new pods you can add temporarily.
  • Max unavailable = how many old pods you can take down at once.
  • Readiness probe = the gate. Only ready pods receive traffic.
  • Revision history = snapshots of previous ReplicaSets for undo.
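The surge and unavailability knobs above bound how many pods exist at any instant during a rollout. A back-of-the-envelope sketch (values are illustrative):

```shell
# Pod-count envelope during a RollingUpdate (illustrative values).
replicas=4
max_surge=1        # extra new pods allowed above the desired count
max_unavailable=0  # old pods that may be taken down at once
max_pods=$((replicas + max_surge))
min_available=$((replicas - max_unavailable))
echo "total pods peak at $max_pods; at least $min_available stay available"
```

With these values the Deployment never serves from fewer than 4 ready pods, at the cost of briefly scheduling a 5th.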
What actually happens under the hood
  1. Deployment creates a new ReplicaSet for the new pod template.
  2. New pods start and must pass readiness probes and minReadySeconds.
  3. As new pods become ready, old pods are scaled down respecting maxUnavailable.
  4. If readiness never succeeds or failures spike, you can pause or undo.

Key settings you’ll use

  • strategy: RollingUpdate with maxSurge and maxUnavailable
  • minReadySeconds to ensure stability before progressing
  • readinessProbe aligned with your /health or /v1/models/ready endpoint
  • preStop hook + terminationGracePeriodSeconds to finish in-flight requests
  • revisionHistoryLimit to keep enough rollback points
  • PodDisruptionBudget (PDB) to maintain minimum availability
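Of the settings above, the PDB is the only one not shown in the worked example below. A minimal sketch, assuming the app: model-serving label and 4 replicas used later on this page (the name is hypothetical):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: model-serving-pdb   # hypothetical name
spec:
  minAvailable: 3           # keep at least 3 of 4 replicas up during voluntary disruptions
  selector:
    matchLabels:
      app: model-serving    # must match the Deployment's pod labels
```

Note that a PDB guards voluntary disruptions (drains, evictions); the rollout itself is still governed by maxSurge/maxUnavailable.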

Worked examples

Example 1: Safe rolling update for a model server

Goal: Zero-downtime rollout for an inference service (e.g., TF Serving).

apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-serving
spec:
  replicas: 4
  revisionHistoryLimit: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  minReadySeconds: 20
  selector:
    matchLabels:
      app: model-serving
  template:
    metadata:
      labels:
        app: model-serving
    spec:
      terminationGracePeriodSeconds: 30
      containers:
      - name: server
        image: myrepo/tf-serving:1.0.3
        args: ["--port=8500", "--rest_api_port=8501", "--model_name=classifier", "--model_base_path=/models/classifier"]
        readinessProbe:
          httpGet:
            path: /v1/models/classifier
            port: 8501
          initialDelaySeconds: 5
          periodSeconds: 5
          failureThreshold: 3
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 10"]

What this achieves: at most one extra pod is added while none become unavailable, readiness gates traffic, and the preStop delay lets in-flight connections drain before termination.
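Readiness gating only takes effect if traffic reaches the pods through a Service, since only Ready pods are added to the Service's endpoints. A minimal companion Service sketch (name and port assumed to match the manifest above):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: model-serving       # hypothetical name
spec:
  selector:
    app: model-serving      # must match the Deployment's pod labels
  ports:
  - name: rest
    port: 8501              # TF Serving REST port from the args above
    targetPort: 8501
```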

Example 2: Pause, verify, resume a rollout

# Start an update by changing the image
kubectl set image deploy/model-serving server=myrepo/tf-serving:1.0.4

# Pause the rollout to hold it after the first new pods come up
kubectl rollout pause deployment model-serving

# Verify new pods come up healthy
kubectl get pods -l app=model-serving -w

# Check status and events
kubectl rollout status deployment model-serving
kubectl describe deployment model-serving

# Resume when confident
kubectl rollout resume deployment model-serving

Why pause? To validate readiness, latency, or error metrics before proceeding.
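The resume-or-undo decision can be scripted. A minimal sketch with hypothetical thresholds and hard-coded measurements (in practice these values would come from your metrics system):

```shell
# Hypothetical gate: decide whether to resume or roll back a paused rollout.
error_rate_pct=1     # assumed measurement from monitoring
p99_latency_ms=180   # assumed measurement from monitoring
decision=resume
[ "$error_rate_pct" -gt 2 ] && decision=rollback     # assumed threshold: 2%
[ "$p99_latency_ms" -gt 500 ] && decision=rollback   # assumed threshold: 500 ms
echo "$decision"
# then run: kubectl rollout resume deployment model-serving
#       or: kubectl rollout undo deployment model-serving
```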

Example 3: Roll back quickly

# See rollout history
kubectl rollout history deployment model-serving

# Roll back to previous revision
kubectl rollout undo deployment model-serving

# Or roll back to a specific revision (e.g., 4)
kubectl rollout undo deployment model-serving --to-revision=4

# Confirm state
kubectl rollout status deployment model-serving
kubectl get rs -l app=model-serving

Tip: Keep revisionHistoryLimit high enough to have safe restore points.

Troubleshooting deep dive
  • If rollout stalls: describe the Deployment and new ReplicaSet; check readinessProbe, container logs, and image pull errors.
  • If requests drop: confirm Service selector matches pods; verify readiness gates traffic; ensure PDB and strategy don’t reduce capacity too much.
  • For GPU pods: ensure node capacity and scheduling allow maxSurge; otherwise reduce surge (maxSurge=0 requires maxUnavailable ≥ 1) or scale the Deployment up before updating.
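The GPU headroom check in the last bullet is simple arithmetic; a sketch with assumed capacity numbers:

```shell
# Does one surge pod fit in the node pool? (illustrative numbers)
node_gpus=8        # assumed total GPUs in the pool
gpus_in_use=8      # assumed GPUs already allocated
gpus_per_pod=1
max_surge=1
needed=$((gpus_per_pod * max_surge))
free=$((node_gpus - gpus_in_use))
if [ "$free" -lt "$needed" ]; then
  echo "no headroom: set maxSurge=0 (with maxUnavailable >= 1) or add capacity first"
fi
```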

Common mistakes and how to self-check

  • Using the latest tag: Always pin image and model version. Self-check: Does kubectl describe show a specific immutable tag?
  • No readiness probe or wrong path: Add a reliable health endpoint. Self-check: Do new pods become Ready before old ones terminate?
  • Too aggressive rollout: maxUnavailable > 0 on a small replica count can cause downtime. Self-check: During rollout, does the available pod count ever drop?
  • Ignoring termination time: No preStop + short grace can cut in-flight requests. Self-check: Are connections drained before pod termination?
  • Not enough history: Low revisionHistoryLimit prevents undo. Self-check: kubectl rollout history shows at least a few revisions?
  • Forgetting PDB: Voluntary disruptions can drop capacity. Self-check: PDB ensures min pods available during node maintenance?
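The termination-time self-check above reduces to one inequality: the grace period must cover the preStop delay plus the worst-case drain time. A sketch using the values from Example 1 (the drain time is an assumption):

```shell
prestop_sleep=10   # from the preStop hook in Example 1
drain_time=15      # assumed worst-case time to finish in-flight requests
grace=30           # terminationGracePeriodSeconds from Example 1
if [ "$grace" -gt $((prestop_sleep + drain_time)) ]; then
  echo "grace period has headroom"
fi
```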

Exercises (hands-on)

Note: The quick test is available to everyone; only logged-in users get saved progress.

Exercise 1 — Configure a safe rolling update

Goal: Create a 3-replica Deployment with maxUnavailable=0, maxSurge=1, readinessProbe, minReadySeconds=20, and a 10s preStop. Trigger an image update and watch a zero-downtime rollout.

  • Verify pods never drop below 3 available during the update.
  • Confirm new pods become Ready before old ones terminate.
Need a hint?
  • kubectl apply -f deployment.yaml
  • kubectl set image deploy/... container=repo/image:new
  • kubectl rollout status deploy/...
  • kubectl get pods -w to watch changes

Exercise 2 — Simulate a bad release and roll back

Goal: Push an update that fails readiness (e.g., wrong port/path), observe a stalled rollout, then undo to restore service.

  • Confirm rollout status indicates failure or timeout.
  • Perform kubectl rollout undo and verify recovery.
Need a hint?
  • Break readinessProbe path temporarily, apply, then watch status.
  • kubectl describe deployment ... to see why rollout is stuck.
  • kubectl rollout undo deployment ...

Practical projects

  • Project 1: TorchServe upgrade playbook. Write a scripted workflow that updates image tags, pauses rollout, checks a health endpoint and test inference, then resumes or rolls back automatically.
  • Project 2: ConfigMap rotation with rollout. Mount a ConfigMap for model metadata. Change it, roll out, verify the app reload behavior, and ensure no downtime.
  • Project 3: Blue/green with controlled traffic. Run two Deployments (blue/green) behind a Service. Practice switching labels to move traffic and add a rollback procedure.

Learning path

  • Before: Deployments, Services, Probes, and Resource Requests
  • Now: Rolling updates and rollbacks
  • Next: Autoscaling (HPA/VPA), Canary strategies, Progressive delivery with metrics gates

Mini challenge

Your cluster has 4 replicas of a latency-sensitive model. During peak traffic, you must update the container image without losing capacity. Design the rollout settings (maxSurge, maxUnavailable, minReadySeconds, readinessProbe) and outline the commands to pause, validate, resume, and, if needed, roll back. Write your plan as a 5–7 step runbook.

Next steps

  • Template your Deployment with safe defaults (strategy, probes, preStop, PDB).
  • Automate pause-validate-resume in your CI/CD pipeline.
  • Track rollout success with visible metrics and logs; set clear rollback criteria.

Practice Exercises


Instructions

Create deployment.yaml for a 3-replica model-serving Deployment with:

  • strategy: RollingUpdate (maxSurge=1, maxUnavailable=0)
  • minReadySeconds=20
  • readinessProbe on /healthz (HTTP 200)
  • preStop sleep 10 and terminationGracePeriodSeconds=30

Apply it, then update the image tag to a new version and watch the rollout.

Expected Output
Pods scale to 4 total briefly (surge), then return to 3 with zero downtime. New pods become Ready before old pods terminate.

Rolling Updates And Rollbacks — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

