Who this is for
Mental model: Control plane, then nodes, with safety rails
- Safety rails: Backups, health checks, PDBs, surge capacity
- Order: Control plane first, then nodes (drain, upgrade, uncordon)
- Compatibility: Respect version skew rules (kubelet may lag the API server by a limited number of minor versions, but must never be newer than it)
- Rollback: Prefer creating a fresh node pool with the previous version and migrating workloads back
Core ideas you must know
- Version skew: kube-apiserver must be the newest component; kubelets may be up to three minor versions older on current releases (two on older ones), but never newer
- Disruption control: PDBs and maxUnavailable prevent unsafe evictions (see the PDB sketch after this list)
- Drain mechanics: cordon (stop new pods), drain (evict respecting PDBs), uncordon (resume scheduling)
- Addon compatibility: CNI/CSI/Ingress controllers must support the target version
- Backups: etcd and critical configs must be backed up before upgrades
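For instance, a minimal PodDisruptionBudget for a hypothetical Deployment labeled app: web (both names are placeholders) could be applied like this:
# Allow at most one pod of the "web" app to be down during voluntary
# disruptions such as node drains.
cat <<'EOF' | kubectl apply -f -
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: web
EOF
kubectl get pdb web-pdb   # check ALLOWED DISRUPTIONS before draining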
Safe upgrade strategy (step-by-step)
- Readiness checks (commands sketched after this list)
  - List current versions (control plane, nodes, addons)
  - Confirm version skew allowances for the target version
  - Verify PDBs, replicas, and surge capacity
  - Back up etcd and configuration (for self-managed clusters)
- Change review
  - Changelog and deprecations for the target version
  - Addon compatibility matrix
- Control plane first
  - Upgrade control plane components
  - Verify API health and core controllers
- Nodes in waves
  - Cordon, drain, upgrade, uncordon each node
  - Start with a small canary set of nodes
- Validate and roll forward
  - Run smoke tests
  - Monitor error rates, latency, and pod restarts
- Rollback if needed
  - Create or reuse a node pool on the previous version and migrate workloads
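The readiness checks are mostly read-only; here is a minimal sketch, assuming a kubeadm-managed cluster with etcd's default certificate paths under /etc/kubernetes/pki/etcd (adjust paths and the backup location to your environment):
# Versions and disruption posture
kubectl version                      # client and API server versions
kubectl get nodes -o wide            # kubelet version per node
kubectl get pdb -A                   # disruption budgets and currently allowed disruptions
kubectl get deploy -A -o wide        # replica counts for critical apps
# etcd snapshot (run on a control plane node; paths assume kubeadm defaults)
sudo mkdir -p /var/backups
sudo ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-$(date +%F).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key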
Minimal command set (self-managed with kubeadm)
# Control plane (on a control plane node): upgrade the kubeadm package first
sudo apt-get update && sudo apt-get install -y kubeadm='1.28.x-*'
sudo kubeadm upgrade plan
sudo kubeadm upgrade apply v1.28.x
# Then upgrade kubelet/kubectl on the control plane node and restart kubelet (same steps as below)
# Node-by-node (workers): upgrade kubeadm, update the node config, then drain and upgrade kubelet
# If packages are held, run apt-mark unhold/hold around the installs
sudo apt-get update && sudo apt-get install -y kubeadm='1.28.x-*'
sudo kubeadm upgrade node
kubectl cordon NODE
kubectl drain NODE --ignore-daemonsets --delete-emptydir-data
sudo apt-get install -y kubelet='1.28.x-*' kubectl='1.28.x-*'
sudo systemctl daemon-reload && sudo systemctl restart kubelet
kubectl uncordon NODE
Adapt package/version commands to your OS and repo. Validate each step.
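After kubeadm upgrade apply, and again after each node is uncordoned, a quick health pass might look like this (read-only):
kubectl get --raw='/readyz?verbose'        # API server readiness checks
kubectl get nodes                          # versions and Ready status
kubectl -n kube-system get pods -o wide    # etcd, controller-manager, scheduler, CNI/CSI pods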
Worked examples
Example 1: Managed cluster minor upgrade (control plane then nodes)
Scenario: Managed service offers control plane upgrade to 1.28. You choose a maintenance window.
- Pre-checks: Verify add-on compatibility; ensure Deployments have replicas >= 2; confirm PDBs allow at least one pod down.
- Upgrade control plane via provider UI/CLI.
- Create a new node pool at 1.28, then cordon/drain old nodes gradually (sketched after this example), or use the provider's rolling node-upgrade feature.
- Monitor workloads; decommission old node pool after successful migration.
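For the node-pool migration in Example 1, the gradual cordon/drain of the old pool might look like the sketch below; the pool=old label is a placeholder, as providers attach their own node-pool labels.
# Cordon the whole old pool first so evicted pods land on the new pool,
# then drain old nodes one at a time.
for node in $(kubectl get nodes -l pool=old -o jsonpath='{.items[*].metadata.name}'); do
  kubectl cordon "$node"
done
for node in $(kubectl get nodes -l pool=old -o jsonpath='{.items[*].metadata.name}'); do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=10m
  kubectl get pods -A --field-selector=status.phase=Pending   # pause here if anything is stuck Pending
done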
Example 2: Self-managed kubeadm minor upgrade
- Back up etcd and manifests.
- On a control plane node: kubeadm upgrade plan, then apply to 1.28.x.
- Verify API health (for example kubectl get --raw='/readyz?verbose'; componentstatuses is deprecated); check controller-manager and scheduler logs.
- Upgrade worker nodes in small batches: cordon, drain, upgrade kubelet/kubeadm, uncordon.
- Smoke test: deploy a canary app; run a quick request loop; check errors/restarts.
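The smoke test in the last step can be as small as the sketch below; the canary image, names, and request count are arbitrary placeholders.
# Deploy a throwaway canary, send a burst of requests, then clean up
kubectl create deployment canary --image=nginx --replicas=2
kubectl expose deployment canary --port=80
kubectl rollout status deployment/canary --timeout=120s
kubectl run curl-check --rm -i --restart=Never --image=curlimages/curl --command -- \
  sh -c 'for i in $(seq 1 20); do curl -s -o /dev/null -w "%{http_code}\n" http://canary; done'
kubectl get pods -A --field-selector=status.phase!=Running   # anything stuck after the upgrade?
kubectl delete deployment,service canary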
Example 3: Handling a blocked drain due to PDB
Symptom: kubectl drain nodeA blocks with "cannot evict pod as it would violate the pod's disruption budget".
Fix options:
- Temporarily increase replicas or set PDB to maxUnavailable: 1
- Add surge capacity (new node) so scheduler can place new pods
- Stagger workloads (drain lower-risk namespaces first)
Validate: Pods reschedule successfully; SLOs remain within targets.
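In kubectl terms, the fix options above might look like this; web, web-pdb, and nodeA are placeholder names.
kubectl describe pdb web-pdb              # which budget is blocking, and why
# Option: add headroom by scaling the workload up temporarily
kubectl scale deployment web --replicas=3
# Option: relax the budget (a PDB cannot set both minAvailable and maxUnavailable,
# so this merge patch clears minAvailable while setting maxUnavailable)
kubectl patch pdb web-pdb --type=merge -p '{"spec":{"minAvailable":null,"maxUnavailable":1}}'
kubectl get pdb web-pdb                   # wait until ALLOWED DISRUPTIONS is at least 1
kubectl drain nodeA --ignore-daemonsets --delete-emptydir-data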
Exercises
These mirror the tasks below. Complete them, then compare with the provided solutions.
- Exercise 1: Plan a safe 1.27 to 1.28 upgrade for a production cluster.
- Exercise 2: Resolve a blocked drain caused by a strict PDB.
Self-check after exercises
- Your plan lists control plane first, then nodes
- You included backups, PDB review, and addon compatibility
- You defined a rollback using a previous-version node pool
- You can explain why a PDB blocked your drain and how you fixed it safely
Checklists
Pre-upgrade checklist
- Current versions collected (control plane, nodes, CNI/CSI/Ingress)
- Target version chosen; changelog reviewed
- Version skew rules validated
- Backups taken (etcd/config for self-managed)
- PDBs verified; replicas increased where needed
- Monitoring and alerting in place
- Rollback path defined (previous-version node pool)
During-upgrade checklist
- Cordon before drain
- Drain respects PDBs
- Upgrade in canary waves
- Uncordon only after node is healthy
- Watch pod restarts and readiness gates
Post-upgrade checklist
- All nodes at target version
- Critical add-ons healthy
- Error rates and latency normal
- Backups rotated and validated
- Runbook updated with lessons learned
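The first two post-upgrade items can be verified directly from the API:
kubectl get nodes -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion
kubectl -n kube-system get pods                                    # core add-ons and control plane pods
kubectl get events -A --field-selector=type=Warning | tail -n 20   # recent warnings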
Common mistakes and how to self-check
- Upgrading workers before control plane: Always control plane first. Self-check: Is kube-apiserver the newest?
- No etcd/config backup (self-managed): Ensure backups exist and can be restored.
- Ignoring PDBs: Confirm every critical app has a realistic PDB (maxUnavailable usually easier than strict minAvailable).
- Forgetting add-on compatibility: Verify CNI/CSI/Ingress versions support target Kubernetes.
- Draining too many nodes at once: Use small waves; monitor SLOs.
- No rollback plan: Define how to recreate a node pool with the previous version and migrate workloads back.
Practical projects
- Write a cluster upgrade runbook template tailored to your environment.
- In a test cluster, perform a 1-minor upgrade using the canary-node approach.
- Break and fix: Create a strict PDB that blocks drains, then resolve it safely.
Next steps
- Automate your runbook with CI/CD and chat notifications
- Add preflight validation jobs that fail the release if upgrade checks fail (a minimal sketch follows this list)
- Expand maintenance to cover recurring tasks (certificate rotation, image GC, etc.)
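A preflight validation job can be as small as a script that exits non-zero when a check fails. The sketch below covers only two checks from the pre-upgrade checklist (blocked PDBs and NotReady nodes); extend it with your own.
#!/usr/bin/env bash
# Preflight sketch: fail the pipeline if any PDB currently allows zero
# disruptions or if any node is NotReady.
set -euo pipefail

blocked=$(kubectl get pdb -A -o jsonpath='{range .items[?(@.status.disruptionsAllowed==0)]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}')
if [ -n "$blocked" ]; then
  echo "PDBs that allow zero disruptions:"; echo "$blocked"; exit 1
fi

notready=$(kubectl get nodes --no-headers | awk '$2 ~ /NotReady/ {print $1}')
if [ -n "$notready" ]; then
  echo "NotReady nodes:"; echo "$notready"; exit 1
fi

echo "Preflight checks passed"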
Mini challenge
Describe, in 6 steps max, how you would recover if an upgrade leaves one namespace degraded while the rest of the cluster is healthy. Include a rollback or isolation step and how you would verify recovery.