Who this is for
Mental model: Control plane, then nodes, with safety rails
- Safety rails: Backups, health checks, PDBs, surge capacity
- Order: Control plane first, then nodes (drain, upgrade, uncordon)
- Compatibility: Respect version skew rules (kubelet may lag the API server by a limited number of minor versions, but must never be newer than it)
- Rollback: Prefer creating a fresh node pool with the previous version and migrating workloads back
Core ideas you must know
- Version skew: kube-apiserver must be the newest component; kubelets may be up to three minor versions older on current releases (two on older ones), but never newer
- Disruption control: PDBs and maxUnavailable prevent unsafe evictions (see the PDB sketch after this list)
- Drain mechanics: cordon (stop new pods), drain (evict respecting PDBs), uncordon (resume scheduling)
- Addon compatibility: CNI/CSI/Ingress controllers must support the target version
- Backups: etcd and critical configs must be backed up before upgrades
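For instance, a minimal PodDisruptionBudget for a hypothetical Deployment labeled app: web (both names are placeholders) could be applied like this:
# Allow at most one pod of the "web" app to be down during voluntary
# disruptions such as node drains.
cat <<'EOF' | kubectl apply -f -
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: web
EOF
kubectl get pdb web-pdb   # check ALLOWED DISRUPTIONS before draining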
Safe upgrade strategy (step-by-step)
- Readiness checks (commands sketched after this list)
  - List current versions (control plane, nodes, addons)
  - Confirm version skew allowances for the target version
  - Verify PDBs, replicas, and surge capacity
  - Back up etcd and configuration (for self-managed clusters)
- Change review
  - Changelog and deprecations for the target version
  - Addon compatibility matrix
- Control plane first
  - Upgrade control plane components
  - Verify API health and core controllers
- Nodes in waves
  - Cordon, drain, upgrade, uncordon each node
  - Start with a small canary set of nodes
- Validate and roll forward
  - Run smoke tests
  - Monitor error rates, latency, and pod restarts
- Rollback if needed
  - Create or reuse a node pool on the previous version and migrate workloads
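The readiness checks are mostly read-only; here is a minimal sketch, assuming a kubeadm-managed cluster with etcd's default certificate paths under /etc/kubernetes/pki/etcd (adjust paths and the backup location to your environment):
# Versions and disruption posture
kubectl version                      # client and API server versions
kubectl get nodes -o wide            # kubelet version per node
kubectl get pdb -A                   # disruption budgets and currently allowed disruptions
kubectl get deploy -A -o wide        # replica counts for critical apps
# etcd snapshot (run on a control plane node; paths assume kubeadm defaults)
sudo mkdir -p /var/backups
sudo ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-$(date +%F).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key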
Minimal command set (self-managed with kubeadm)
# Control plane (on a control plane node): upgrade the kubeadm package first
sudo apt-get update && sudo apt-get install -y kubeadm='1.28.x-*'
sudo kubeadm upgrade plan
sudo kubeadm upgrade apply v1.28.x
# Then upgrade kubelet/kubectl on the control plane node and restart kubelet (same steps as below)
# Node-by-node (workers): upgrade kubeadm, update the node config, then drain and upgrade kubelet
# If packages are held, run apt-mark unhold/hold around the installs
sudo apt-get update && sudo apt-get install -y kubeadm='1.28.x-*'
sudo kubeadm upgrade node
kubectl cordon NODE
kubectl drain NODE --ignore-daemonsets --delete-emptydir-data
sudo apt-get install -y kubelet='1.28.x-*' kubectl='1.28.x-*'
sudo systemctl daemon-reload && sudo systemctl restart kubelet
kubectl uncordon NODE
Adapt package/version commands to your OS and repo. Validate each step.
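After kubeadm upgrade apply, and again after each node is uncordoned, a quick health pass might look like this (read-only):
kubectl get --raw='/readyz?verbose'        # API server readiness checks
kubectl get nodes                          # versions and Ready status
kubectl -n kube-system get pods -o wide    # etcd, controller-manager, scheduler, CNI/CSI pods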
Worked examples
Example 1: Managed cluster minor upgrade (control plane then nodes)
Scenario: Managed service offers control plane upgrade to 1.28. You choose a maintenance window.
- Pre-checks: Verify add-on compatibility; ensure Deployments have replicas >= 2; confirm PDBs allow at least one pod down.
- Upgrade control plane via provider UI/CLI.
- Create a new node pool at 1.28, then cordon/drain old nodes gradually (sketched after this example), or use the provider's rolling node-upgrade feature.
- Monitor workloads; decommission old node pool after successful migration.
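For the node-pool migration in Example 1, the gradual cordon/drain of the old pool might look like the sketch below; the pool=old label is a placeholder, as providers attach their own node-pool labels.
# Cordon the whole old pool first so evicted pods land on the new pool,
# then drain old nodes one at a time.
for node in $(kubectl get nodes -l pool=old -o jsonpath='{.items[*].metadata.name}'); do
  kubectl cordon "$node"
done
for node in $(kubectl get nodes -l pool=old -o jsonpath='{.items[*].metadata.name}'); do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=10m
  kubectl get pods -A --field-selector=status.phase=Pending   # pause here if anything is stuck Pending
done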
Example 2: Self-managed kubeadm minor upgrade
- Back up etcd and manifests.
- On a control plane node: kubeadm upgrade plan, then apply to 1.28.x.
- Verify API health (for example kubectl get --raw='/readyz?verbose'; componentstatuses is deprecated); check controller-manager and scheduler logs.
- Upgrade worker nodes in small batches: cordon, drain, upgrade kubelet/kubeadm, uncordon.
- Smoke test: deploy a canary app; run a quick request loop; check errors/restarts.
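The smoke test in the last step can be as small as the sketch below; the canary image, names, and request count are arbitrary placeholders.
# Deploy a throwaway canary, send a burst of requests, then clean up
kubectl create deployment canary --image=nginx --replicas=2
kubectl expose deployment canary --port=80
kubectl rollout status deployment/canary --timeout=120s
kubectl run curl-check --rm -i --restart=Never --image=curlimages/curl --command -- \
  sh -c 'for i in $(seq 1 20); do curl -s -o /dev/null -w "%{http_code}\n" http://canary; done'
kubectl get pods -A --field-selector=status.phase!=Running   # anything stuck after the upgrade?
kubectl delete deployment,service canary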
Example 3: Handling a blocked drain due to PDB
Symptom: kubectl drain nodeA blocks with "cannot evict pod as it would violate the pod's disruption budget".
Fix options:
- Temporarily increase replicas or set PDB to maxUnavailable: 1
- Add surge capacity (new node) so scheduler can place new pods
- Stagger workloads (drain lower-risk namespaces first)
Validate: Pods reschedule successfully; SLOs remain within targets.
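In kubectl terms, the fix options above might look like this; web, web-pdb, and nodeA are placeholder names.
kubectl describe pdb web-pdb              # which budget is blocking, and why
# Option: add headroom by scaling the workload up temporarily
kubectl scale deployment web --replicas=3
# Option: relax the budget (a PDB cannot set both minAvailable and maxUnavailable,
# so this merge patch clears minAvailable while setting maxUnavailable)
kubectl patch pdb web-pdb --type=merge -p '{"spec":{"minAvailable":null,"maxUnavailable":1}}'
kubectl get pdb web-pdb                   # wait until ALLOWED DISRUPTIONS is at least 1
kubectl drain nodeA --ignore-daemonsets --delete-emptydir-data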
Exercises
These mirror the tasks below. Complete them, then compare with the provided solutions.
- Exercise 1: Plan a safe 1.27 to 1.28 upgrade for a production cluster.
- Exercise 2: Resolve a blocked drain caused by a strict PDB.
Self-check after exercises
- Your plan lists control plane first, then nodes
- You included backups, PDB review, and addon compatibility
- You defined a rollback using a previous-version node pool
- You can explain why a PDB blocked your drain and how you fixed it safely
Checklists
Pre-upgrade checklist
- Current versions collected (control plane, nodes, CNI/CSI/Ingress)
- Target version chosen; changelog reviewed
- Version skew rules validated
- Backups taken (etcd/config for self-managed)
- PDBs verified; replicas increased where needed
- Monitoring and alerting in place
- Rollback path defined (previous-version node pool)
During-upgrade checklist
- Cordon before drain
- Drain respects PDBs
- Upgrade in canary waves
- Uncordon only after node is healthy
- Watch pod restarts and readiness gates
Post-upgrade checklist
- All nodes at target version
- Critical add-ons healthy
- Error rates and latency normal
- Backups rotated and validated
- Runbook updated with lessons learned
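The first two post-upgrade items can be verified directly from the API:
kubectl get nodes -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion
kubectl -n kube-system get pods                                    # core add-ons and control plane pods
kubectl get events -A --field-selector=type=Warning | tail -n 20   # recent warnings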
Common mistakes and how to self-check
- Upgrading workers before control plane: Always control plane first. Self-check: Is kube-apiserver the newest?
- No etcd/config backup (self-managed): Ensure backups exist and can be restored.
- Ignoring PDBs: Confirm every critical app has a realistic PDB (maxUnavailable usually easier than strict minAvailable).
- Forgetting add-on compatibility: Verify CNI/CSI/Ingress versions support target Kubernetes.
- Draining too many nodes at once: Use small waves; monitor SLOs.
- No rollback plan: Define how to recreate a node pool with the previous version and migrate workloads back.
Practical projects
- Write a cluster upgrade runbook template tailored to your environment.
- In a test cluster, perform a 1-minor upgrade using the canary-node approach.
- Break and fix: Create a strict PDB that blocks drains, then resolve it safely.
Next steps
- Automate your runbook with CI/CD and chat notifications
- Add preflight validation jobs that fail the release if upgrade checks fail (a minimal sketch follows this list)
- Expand maintenance to cover recurring tasks (certificate rotation, image GC, etc.)
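A preflight validation job can be as small as a script that exits non-zero when a check fails. The sketch below covers only two checks from the pre-upgrade checklist (blocked PDBs and NotReady nodes); extend it with your own.
#!/usr/bin/env bash
# Preflight sketch: fail the pipeline if any PDB currently allows zero
# disruptions or if any node is NotReady.
set -euo pipefail

blocked=$(kubectl get pdb -A -o jsonpath='{range .items[?(@.status.disruptionsAllowed==0)]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}')
if [ -n "$blocked" ]; then
  echo "PDBs that allow zero disruptions:"; echo "$blocked"; exit 1
fi

notready=$(kubectl get nodes --no-headers | awk '$2 ~ /NotReady/ {print $1}')
if [ -n "$notready" ]; then
  echo "NotReady nodes:"; echo "$notready"; exit 1
fi

echo "Preflight checks passed"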
Mini challenge
Describe, in 6 steps max, how you would recover if an upgrade leaves one namespace degraded while the rest of the cluster is healthy. Include a rollback or isolation step and how you would verify recovery.