Cluster Upgrades And Maintenance

Learn cluster upgrades and maintenance with explanations, worked examples, exercises, and a quick test, written for platform engineers.

Published: January 23, 2026 | Updated: January 23, 2026

Who this is for

Platform engineers and operators who upgrade and maintain Kubernetes clusters and want to do so with minimal disruption.

Mental model: Control plane, then nodes, with safety rails
  • Safety rails: Backups, health checks, PDBs, surge capacity
  • Order: Control plane first, then nodes (drain, upgrade, uncordon)
  • Compatibility: Respect version skew rules (the kubelet must never be newer than the API server and is usually kept within one minor version of it)
  • Rollback: Prefer creating a fresh node pool with the previous version and migrating workloads back

Core ideas you must know

  • Version skew: kube-apiserver must be the newest component; kubelets may be older but never newer (recent releases allow more skew, though staying within one minor version is typical during upgrades)
  • Disruption control: PDBs and maxUnavailable prevent unsafe evictions (see the sketch after this list)
  • Drain mechanics: cordon (stop new pods), drain (evict respecting PDBs), uncordon (resume scheduling)
  • Addon compatibility: CNI/CSI/Ingress controllers must support the target version
  • Backups: etcd and critical configs must be backed up before upgrades
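
To make the disruption-control idea concrete, here is a minimal PodDisruptionBudget sketch. The namespace shop, the name checkout-pdb, and the app: checkout label are placeholders; maxUnavailable: 1 lets a drain evict one pod at a time while the rest keep serving.

# Minimal PDB sketch: allow at most one "checkout" pod to be disrupted at a time
kubectl apply -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-pdb
  namespace: shop              # placeholder namespace
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: checkout            # must match the Deployment's pod labels
EOF
# Inspect how many voluntary disruptions are currently allowed
kubectl get pdb checkout-pdb -n shop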

Safe upgrade strategy (step-by-step)

  1. Readiness checks (see the command sketch after this list)
    • List current versions (control plane, nodes, addons)
    • Confirm version skew allowances for target version
    • Verify PDBs, replicas, and surge capacity
    • Back up etcd and configuration (for self-managed clusters)
  2. Change review
    • Changelog and deprecations for the target version
    • Addon compatibility matrix
  3. Control plane first
    • Upgrade control plane components
    • Verify API health and core controllers
  4. Nodes in waves
    • Cordon, drain, upgrade, uncordon each node
    • Start with a small canary set of nodes
  5. Validate and roll forward
    • Run smoke tests
    • Monitor error rates, latency, and pod restarts
  6. Rollback if needed
    • Create or reuse a node pool on the previous version and migrate workloads
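
The readiness checks in step 1 can be gathered with plain kubectl; the sketch below is one way to do it, and the assumption that add-ons live in kube-system may not hold in your cluster.

# Control plane and client versions
kubectl version
# Node versions, OS, and container runtime
kubectl get nodes -o wide
# Add-on image versions (CNI/CSI/Ingress often run in kube-system or their own namespaces)
kubectl get pods -n kube-system -o custom-columns=NAME:.metadata.name,IMAGES:.spec.containers[*].image
# PDBs and the disruptions they currently allow
kubectl get pdb -A
# Deployments running with a single replica are risky to drain
kubectl get deploy -A -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,REPLICAS:.spec.replicas
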
Minimal command set (self-managed with kubeadm)
# Control plane (on a control plane node)
sudo kubeadm upgrade plan
sudo kubeadm upgrade apply v1.28.x
# Node-by-node
kubectl cordon NODE
kubectl drain NODE --ignore-daemonsets --delete-emptydir-data
# Upgrade kubeadm first, apply the node config, then upgrade kubelet/kubectl (apt example)
sudo apt-get update && sudo apt-get install -y kubeadm=1.28.x-00
sudo kubeadm upgrade node
sudo apt-get install -y kubelet=1.28.x-00 kubectl=1.28.x-00
sudo systemctl daemon-reload && sudo systemctl restart kubelet
kubectl uncordon NODE

Adapt package/version commands to your OS and repo. Validate each step.
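
For the backup step on self-managed clusters, a minimal etcd snapshot sketch follows. The certificate paths are the usual kubeadm defaults and the backup directory is arbitrary; confirm both on your control plane nodes, and note that etcdctl must be installed on the host (on kubeadm clusters you may need to exec into the etcd static pod instead).

# Snapshot etcd (run on a control plane node; paths assume kubeadm defaults)
sudo mkdir -p /var/backups
sudo ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-$(date +%F).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
# Also copy static pod manifests, certificates, and kubelet/kubeadm config
sudo cp -r /etc/kubernetes /var/backups/kubernetes-$(date +%F)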

Worked examples

Example 1: Managed cluster minor upgrade (control plane then nodes)

Scenario: Managed service offers control plane upgrade to 1.28. You choose a maintenance window.

  1. Pre-checks: Verify add-on compatibility; ensure Deployments have replicas >= 2; confirm PDBs allow at least one pod down.
  2. Upgrade control plane via provider UI/CLI.
  3. Create a new node pool at 1.28, cordon/drain old nodes gradually, or use rolling node upgrade feature.
  4. Monitor workloads; decommission old node pool after successful migration.
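
Provider CLIs differ, so step 3 is sketched below in a provider-agnostic way. It assumes the old nodes carry a label such as pool=old-1-27 (a placeholder) and that the new 1.28 pool is already Ready.

# Stop new pods from landing on the old pool
kubectl cordon -l pool=old-1-27
# Drain the old nodes one at a time, respecting PDBs, and pause between nodes
for node in $(kubectl get nodes -l pool=old-1-27 -o jsonpath='{.items[*].metadata.name}'); do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=10m
  sleep 120   # watch workload health before moving to the next node
done
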
Example 2: Self-managed kubeadm minor upgrade
  1. Back up etcd and manifests.
  2. On a control plane node: kubeadm upgrade plan, then apply to 1.28.x.
  3. Verify API health via kubectl get --raw /readyz and /livez (componentstatuses is deprecated); check kube-controller-manager and kube-scheduler logs.
  4. Upgrade worker nodes in small batches: cordon, drain, upgrade kubelet/kubeadm, uncordon.
  5. Smoke test: deploy a canary app; run a quick request loop; check errors/restarts.
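
One possible script for the smoke test in step 5; the upgrade-smoke namespace, the nginx image, and the request count are all arbitrary choices.

# Deploy a small canary and exercise it
kubectl create namespace upgrade-smoke
kubectl -n upgrade-smoke create deployment canary --image=nginx:1.25 --replicas=2
kubectl -n upgrade-smoke expose deployment canary --port=80
kubectl -n upgrade-smoke rollout status deployment/canary --timeout=120s
# Quick in-cluster request loop; every line should print 200
kubectl -n upgrade-smoke run curl --rm -it --image=curlimages/curl --restart=Never -- \
  sh -c 'for i in $(seq 1 20); do curl -s -o /dev/null -w "%{http_code}\n" http://canary; done'
# Surface recent warning events (crash loops, failed scheduling, probe failures)
kubectl get events -A --field-selector type=Warning --sort-by=.lastTimestamp | tail -n 20
kubectl delete namespace upgrade-smoke
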
Example 3: Handling a blocked drain due to PDB

Symptom: kubectl drain nodeA blocks with "cannot evict pod: PDB minAvailable violated".

Fix options:

  • Temporarily increase replicas, or replace a strict minAvailable PDB with maxUnavailable: 1 (a PDB cannot set both fields)
  • Add surge capacity (new node) so scheduler can place new pods
  • Stagger workloads (drain lower-risk namespaces first)

Validate: Pods reschedule successfully; SLOs remain within targets.
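
In commands, the fix options above might look like the sketch below. The payments names are placeholders, and because a PDB can set either minAvailable or maxUnavailable but not both, the strict PDB is replaced rather than patched.

# See which PDB is blocking and how many disruptions it allows
kubectl get pdb -A
kubectl describe pdb payments-pdb -n payments
# Option A: add replicas so minAvailable can still be met during eviction
kubectl scale deployment payments -n payments --replicas=3
# Option B: replace the strict minAvailable PDB with a maxUnavailable one
kubectl delete pdb payments-pdb -n payments
kubectl create poddisruptionbudget payments-pdb -n payments --selector=app=payments --max-unavailable=1
# Retry once ALLOWED DISRUPTIONS is at least 1
kubectl drain nodeA --ignore-daemonsets --delete-emptydir-data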

Exercises

These mirror the tasks below. Complete them, then compare with the provided solutions.

  1. Exercise 1: Plan a safe 1.27 to 1.28 upgrade for a production cluster.
  2. Exercise 2: Resolve a blocked drain caused by a strict PDB.
Self-check after exercises
  • Your plan lists control plane first, then nodes
  • You included backups, PDB review, and addon compatibility
  • You defined a rollback using a previous-version node pool
  • You can explain why a PDB blocked your drain and how you fixed it safely

Checklists

Pre-upgrade checklist
  • Current versions collected (control plane, nodes, CNI/CSI/Ingress)
  • Target version chosen; changelog reviewed
  • Version skew rules validated
  • Backups taken (etcd/config for self-managed)
  • PDBs verified; replicas increased where needed
  • Monitoring and alerting in place
  • Rollback path defined (previous-version node pool)
During-upgrade checklist
  • Cordon before drain
  • Drain respects PDBs
  • Upgrade in canary waves
  • Uncordon only after node is healthy
  • Watch pod restarts and readiness gates
Post-upgrade checklist
  • All nodes at target version
  • Critical add-ons healthy
  • Error rates and latency normal
  • Backups rotated and validated
  • Runbook updated with lessons learned
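
A few commands that cover the first three items of the post-upgrade checklist; adapt the namespace for add-ons that do not live in kube-system.

# Every node should report the target kubelet version
kubectl get nodes -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion
# System pods and core add-ons should be Running and Ready
kubectl get pods -n kube-system
# API server health (replaces the deprecated componentstatuses)
kubectl get --raw='/readyz?verbose' | tail -n 5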

Common mistakes and how to self-check

  • Upgrading workers before control plane: Always control plane first. Self-check: Is kube-apiserver the newest?
  • No etcd/config backup (self-managed): Ensure backups exist and can be restored.
  • Ignoring PDBs: Confirm every critical app has a realistic PDB (maxUnavailable usually easier than strict minAvailable).
  • Forgetting add-on compatibility: Verify CNI/CSI/Ingress versions support target Kubernetes.
  • Draining too many nodes at once: Use small waves; monitor SLOs.
  • No rollback plan: Define how to recreate a node pool with the previous version and migrate workloads back.

Practical projects

  • Write a cluster upgrade runbook template tailored to your environment.
  • In a test cluster, perform a 1-minor upgrade using the canary-node approach.
  • Break and fix: Create a strict PDB that blocks drains, then resolve it safely.

Next steps

  • Automate your runbook with CI/CD and chat notifications
  • Add preflight validation jobs that fail the release if upgrade checks fail
  • Expand maintenance to cover recurring tasks (certificate rotation, image GC, etc.)
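
As a starting point for the preflight bullet above, here is a hedged bash sketch: it fails when any node is not Ready (cordoned nodes also trip it) or when any PDB currently allows zero disruptions. The checks and thresholds are assumptions to extend for your environment.

#!/usr/bin/env bash
# upgrade-preflight.sh - minimal sketch; extend with version-skew and add-on checks
set -euo pipefail

# Fail if any node is not Ready (a cordoned node also fails this check)
if kubectl get nodes --no-headers | awk '{print $2}' | grep -qv '^Ready$'; then
  echo "Preflight failed: at least one node is not Ready" >&2
  exit 1
fi

# Fail if any PDB currently allows zero voluntary disruptions
blocked=$(kubectl get pdb -A -o jsonpath='{range .items[?(@.status.disruptionsAllowed==0)]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}')
if [ -n "$blocked" ]; then
  echo "Preflight failed: these PDBs allow no disruptions:" >&2
  echo "$blocked" >&2
  exit 1
fi

echo "Preflight checks passed"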

Mini challenge

Describe, in 6 steps max, how you would recover if an upgrade leaves one namespace degraded while the rest of the cluster is healthy. Include a rollback or isolation step and how you would verify recovery.

Practice Exercises

2 exercises to complete

Instructions

You operate a production cluster currently on v1.27. You want to upgrade to v1.28 with minimal risk. Draft a plan that includes pre-checks, the upgrade order, validation, and rollback. Assume you have a mix of Deployments (replicas 2–5), a StatefulSet for a database (with a PDB), and standard add-ons (CNI, CSI, Ingress).

  • List concrete steps (bulleted or numbered).
  • Call out how you will handle PDBs and surge capacity.
  • Define a rollback path.
Expected Output
A clear, ordered plan: backups, compatibility checks, control plane first, canary node wave, drain with PDB awareness, monitoring, and rollback via previous-version node pool.

Cluster Upgrades And Maintenance — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
