Cluster Upgrades: Orchestrating the Version Transition
Upgrading a production Kubernetes cluster is a high-availability exercise. It requires a precise sequence of operations to ensure the Control Plane remains consistent while the Data Plane (worker nodes) is cycled without violating application Service Level Objectives (SLOs).
1. THE ARCHITECTURE OF VERSIONING
Kubernetes adheres to strict Semantic Versioning (SemVer): v[Major].[Minor].[Patch].
1.1 Support Lifecycle (N-2)
Kubernetes officially supports the three most recent minor versions.
- Example: When `v1.33` is released, `v1.30` drops out of support and no longer receives security patches.
- Bible Rule: Never skip a minor version (e.g., `1.31 -> 1.33`). This bypasses critical API migrations and database schema updates in `etcd`.
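The skip rule is easy to automate as a pre-flight guard. A minimal sketch, assuming SemVer-style `vMAJOR.MINOR.PATCH` strings; the `minor` and `path_ok` helpers are illustrative, not kubeadm features:

```shell
# Reject upgrade paths that skip a minor version (illustrative helpers).
minor() { echo "$1" | cut -d. -f2; }

# path_ok OLD NEW -> succeeds only if NEW is at most one minor ahead of OLD
path_ok() { [ $(( $(minor "$2") - $(minor "$1") )) -le 1 ]; }

path_ok v1.31.0 v1.33.0 && echo "ok to upgrade" || echo "REFUSE: skips a minor version"
```

For the `v1.31.0 -> v1.33.0` pair above the guard refuses, because the path jumps two minors.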
1.2 Version Skew Policy (The Safety Matrix)
The Kube-API Server is the "Source of Truth." All other components must revolve around its version.
| Component | Max Skew | Logic |
|---|---|---|
| API Server | X (Anchor) | Must be upgraded first. No other component can be newer than this. |
| KCM / Scheduler | X - 1 | Must match API server or be 1 minor version older. |
| Kubelet | X - 3 | Can be up to 3 minor versions older (2 before v1.28). This allows workers to keep running while the CP is upgraded. |
| Kubectl | X +/- 1 | Supported within one minor version difference. |
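The matrix above can be checked mechanically before touching any node. A minimal sketch, assuming the current n-3 kubelet skew limit; the version strings and the `minor` helper are illustrative (in practice the values come from `kubectl version` and `kubectl get nodes`):

```shell
# Check worker skew against the API server anchor (sample values).
minor() { echo "$1" | cut -d. -f2; }

apiserver="v1.33.4"   # the anchor: no component may be newer
kubelet="v1.31.9"

skew=$(( $(minor "$apiserver") - $(minor "$kubelet") ))
if [ "$skew" -lt 0 ] || [ "$skew" -gt 3 ]; then
  echo "UNSUPPORTED: kubelet skew is $skew"
else
  echo "OK: kubelet is $skew minor version(s) behind the API server"
fi
```

A negative skew (kubelet newer than the API server) is never allowed, which is why the check rejects both directions.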
2. PRE-UPGRADE RECONNAISSANCE
Before initiating the binary swap, an architect must ensure the state is recoverable.
- Etcd Snapshot: `kubeadm` does not back up etcd for you.

```shell
ETCDCTL_API=3 etcdctl snapshot save /tmp/pre-upgrade.db \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```

- Manifest Backup: `cp -r /etc/kubernetes/ ~/k8s-backup/`.
- Addon Compatibility: Verify that your CNI (Calico/Cilium) and CSI drivers support the target version by checking their respective support matrices.
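These checks are easy to forget under pressure, so gate the upgrade on them. A sketch with a hypothetical `verify_backup` helper; on a real control-plane node you would pair it with `etcdctl snapshot status` to confirm the snapshot is readable, not just present:

```shell
# Refuse to proceed unless the snapshot file exists and is non-empty.
verify_backup() {
  [ -s "$1" ] || { echo "ABORT: backup missing or empty: $1"; return 1; }
  echo "backup ok: $1"
}

# Demo against a scratch file; point it at the real snapshot in production.
demo=$(mktemp) && echo "etcd-snapshot-bytes" > "$demo"
verify_backup "$demo"
```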
3. THE UPGRADE SEQUENCE [STEP-BY-STEP]
3.1 Repository Synchronization [ALL NODES]
Kubernetes moved to community-hosted package repositories (`pkgs.k8s.io`); the legacy Google-hosted repositories are frozen. You must manually update your package manager's source list to point at the target minor version.
```shell
# Example: Moving to v1.33
echo "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.33/deb/ /" | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt-get update
```
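Because the minor version is embedded in the repo URL, it helps to template the sources line. A sketch with a hypothetical `write_k8s_repo` helper, writing to a scratch path here; the real target is `/etc/apt/sources.list.d/kubernetes.list`:

```shell
# Write the pkgs.k8s.io sources line for a given minor version.
write_k8s_repo() {
  ver="$1"; dest="$2"
  echo "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/${ver}/deb/ /" > "$dest"
}

write_k8s_repo v1.33 /tmp/kubernetes.list
cat /tmp/kubernetes.list
```

Keeping the version in exactly one variable avoids the classic mistake of updating the repo to one minor and installing packages pinned to another.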
3.2 Upgrading the Primary Control Plane
1. Binary Swap:
```shell
sudo apt-mark unhold kubeadm
sudo apt-get install -y kubeadm=1.33.x-x
sudo apt-mark hold kubeadm
```
2. The Execution Plan:
kubeadm performs pre-flight checks to ensure the cluster is healthy (e.g., all nodes are Ready, certificates are valid).
```shell
sudo kubeadm upgrade plan
```
Sample Output:
```text
COMPONENT                 CURRENT   TARGET
kube-apiserver            v1.32.2   v1.33.4
kube-controller-manager   v1.32.2   v1.33.4
...
CERTIFICATE   EXPIRES                  PROXY DONE
admin.conf    Oct 27, 2024 10:00 UTC   no
```
3. Apply the Logic:
This command replaces the static pod manifests in `/etc/kubernetes/manifests/`. The kubelet observes the file change and restarts the Control Plane pods automatically.
```shell
sudo kubeadm upgrade apply v1.33.4 --yes
```
4. Finalizing CP Kubelet:
```shell
sudo apt-mark unhold kubelet kubectl
sudo apt-get install -y kubelet=1.33.x-x kubectl=1.33.x-x
sudo systemctl daemon-reload && sudo systemctl restart kubelet
sudo apt-mark hold kubelet kubectl
```
3.3 Upgrading Worker Nodes [SERIAL EXECUTION]
Worker nodes must be upgraded one by one to maintain application availability.
1. The Eviction (From CP Node):
We use drain to gracefully move workloads.
```shell
kubectl drain worker-01 --ignore-daemonsets --delete-emptydir-data
```
Sample Output:
```text
node/worker-01 cordoned
evicting pod default/api-v1-7d4b8c9f5-xyz12
evicting pod kube-system/coredns-85cb69466-abcde
pod/api-v1-7d4b8c9f5-xyz12 evicted
node/worker-01 drained
```
2. Node Upgrade (On the Worker):
```shell
sudo apt-mark unhold kubeadm kubelet kubectl
sudo apt-get install -y kubeadm=1.33.x-x
# 'upgrade node' pulls the latest cluster configuration from the API Server
sudo kubeadm upgrade node
sudo apt-get install -y kubelet=1.33.x-x kubectl=1.33.x-x
sudo systemctl daemon-reload && sudo systemctl restart kubelet
sudo apt-mark hold kubeadm kubelet kubectl
```
3. Restoration (From CP Node):
```shell
kubectl uncordon worker-01
```
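The drain / upgrade / uncordon cycle is naturally scripted as a serial loop. A dry-run sketch: the node names are placeholders, `run` merely echoes commands here, and running step 2 over `ssh` is an assumption about your access model:

```shell
# Serial worker rollout, dry-run: `run` echoes instead of executing.
DRY_RUN=1
run() { if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

for node in worker-01 worker-02; do
  run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
  run ssh "$node" "sudo kubeadm upgrade node && sudo systemctl restart kubelet"
  run kubectl uncordon "$node"   # only after the node reports Ready again
done
```

Keeping the loop strictly serial is the point: a parallel rollout multiplies the capacity you lose during each drain.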
4. PRODUCTION SAFEGUARD: POD DISRUPTION BUDGETS (PDB)
In production, you must prevent the drain command from killing too many replicas of a mission-critical app.
4.1 The PDB Logic
A PDB ensures that a minimum number of replicas are always "Ready" during voluntary disruptions (like upgrades).
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: billing-api-pdb
spec:
  minAvailable: 2   # At least 2 pods must stay alive
  selector:
    matchLabels:
      app: billing
```
4.2 The "Drain Hang" Pitfall
If you have a 3-replica app with a `minAvailable: 3` PDB, the `kubectl drain` command will hang indefinitely.
- Reason: The PDB tells the Eviction API "No pods can be removed."
- Fix: Ensure `minAvailable` is always less than the total replica count, or add temporary nodes to the cluster before starting the upgrade.
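The hang is predictable arithmetic, so check it before draining. A sketch with a hypothetical `pdb_allows_eviction` helper; in a live cluster the two numbers come from `kubectl get deploy` and `kubectl get pdb`:

```shell
# Eviction is possible only while ready replicas exceed minAvailable.
pdb_allows_eviction() {
  replicas="$1"; min_available="$2"
  [ "$min_available" -lt "$replicas" ]
}

if pdb_allows_eviction 3 3; then
  echo "drain can proceed"
else
  echo "drain will hang: minAvailable equals replica count"
fi
```

With 3 replicas and `minAvailable: 3` the check fails, which is exactly the "Drain Hang" scenario above.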
5. POST-UPGRADE CNI TROUBLESHOOTING
Upgrades often reveal "hidden" CNI misconfigurations, particularly with Calico VXLAN/BGP.
- Symptom: Nodes show `Ready`, but Pods cannot ping each other across nodes.
- Internal Cause: The Calico node binary may have updated and is trying to use BGP (TCP port 179), which is blocked by cloud firewalls (AWS/Azure).
- The Architect's Fix: Ensure Calico is forced into VXLAN mode (UDP 4789).
```shell
kubectl patch installation.operator.tigera.io default --type=merge -p '{"spec":{"calicoNetwork":{"bgp":"Disabled"}}}'
```
6. KEY FILE REFERENCE FOR ARCHITECTS
| File Path | Description | Internal Role |
|---|---|---|
| `/etc/kubernetes/manifests/` | Static Pods | `kubeadm upgrade` writes the new API Server/etcd manifests (YAML) here. |
| `/etc/kubernetes/admin.conf` | Root Kubeconfig | The primary credential for the cluster. Check its cert expiry after upgrades. |
| `/var/lib/kubelet/config.yaml` | Kubelet Config | Stores the cgroup driver and cluster DNS settings. |
| `/etc/apt/sources.list.d/kubernetes.list` | Binary Source | Determines which version `apt` will "see." |
7. NINJA COMMANDS: VERIFICATION
1. Audit actual binary versions running on nodes:
```shell
kubectl get nodes -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion,OS:.status.nodeInfo.osImage
```
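That output is also easy to post-process to spot stragglers. A sketch using a canned sample; in practice, pipe the `kubectl get nodes` output above into the loop:

```shell
# Flag nodes whose kubelet is not yet on the target minor version.
TARGET=v1.33

printf '%s\n' \
  'cp-01      v1.33.4' \
  'worker-01  v1.33.4' \
  'worker-02  v1.32.2' |
while read -r name ver; do
  case "$ver" in
    "$TARGET".*) : ;;                      # already on the target minor
    *) echo "PENDING: $name ($ver)" ;;
  esac
done
```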
2. Verify API Version and Component Health:
```shell
kubectl version                 # --short was removed in v1.28; concise output is now the default
kubectl get componentstatuses   # Note: legacy API, but still a quick etcd health check
```
3. Check Certificate Renewal:
`kubeadm upgrade apply` automatically renews certs. Verify the new expiration dates:
```shell
kubeadm certs check-expiration
```