Cluster Upgrades: Orchestrating the Version Transition

Upgrading a production Kubernetes cluster is a high-availability exercise. It requires a precise sequence of operations to ensure the Control Plane remains consistent while the Data Plane (worker nodes) is cycled without violating application Service Level Objectives (SLOs).


1. THE ARCHITECTURE OF VERSIONING

Kubernetes adheres to strict Semantic Versioning (SemVer): v[Major].[Minor].[Patch].

1.1 Support Lifecycle (N-2)

Kubernetes officially supports the three most recent minor versions.

  • Example: When v1.33 is released, v1.30 drops out of support and no longer receives security patches.
  • Bible Rule: Never skip a minor version (e.g., 1.31 -> 1.33). Skipping bypasses critical API migrations and the storage-version updates Kubernetes performs on the objects stored in etcd.

1.2 Version Skew Policy (The Safety Matrix)

The Kube-API Server is the "Source of Truth." All other components must revolve around its version.

  Component         Max Skew     Logic
  API Server        X (Anchor)   Must be upgraded first. No other component can be newer than this.
  KCM / Scheduler   X - 1        Must match the API server or be 1 minor version older.
  Kubelet           X - 3        Can be up to 3 minor versions older (2 before v1.28). This allows workers to run while the CP is upgraded.
  Kubectl           X +/- 1      Supported within one minor version difference.
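The skew matrix can be reduced to a quick shell check. A minimal sketch: the version strings are illustrative, and the three-minor-version kubelet window (two before v1.28) is hard-coded as an assumption.

```shell
# Minimal skew check: is a kubelet's minor version within the allowed
# window behind the API server? Version strings are illustrative.
minor() { echo "$1" | sed 's/^v//' | cut -d. -f2; }

kubelet_skew_ok() {
  api=$(minor "$1"); kubelet=$(minor "$2")
  diff=$((api - kubelet))
  # Kubelet may never be newer, and at most 3 minors older (2 before v1.28)
  [ "$diff" -ge 0 ] && [ "$diff" -le 3 ]
}

kubelet_skew_ok v1.33.4 v1.31.0 && echo "within skew"   # → within skew
kubelet_skew_ok v1.33.4 v1.29.0 || echo "OUT OF SKEW"   # → OUT OF SKEW
```

Run this against the output of the node-audit command in Section 7 to catch an out-of-skew worker before it bites you.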

2. PRE-UPGRADE RECONNAISSANCE

Before initiating the binary swap, an architect must ensure the state is recoverable.

  1. Etcd Snapshot: kubeadm does not back up etcd for you.
    ETCDCTL_API=3 etcdctl snapshot save /tmp/pre-upgrade.db \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key
  2. Manifest Backup: cp -r /etc/kubernetes/ ~/k8s-backup/.
  3. Addon Compatibility: Verify that your CNI (Calico/Cilium) and CSI drivers support the target version by checking their respective support matrices.
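The first two steps above can be bundled into a single function. This is a sketch: the `pre_upgrade_backup` name and the dated backup directory are assumptions, and the snapshot-status call is added as a readability sanity check.

```shell
# Sketch: one-shot pre-upgrade backup (run on a control-plane node).
# Wrapped in a function so it can be sourced and invoked deliberately.
pre_upgrade_backup() {
  backup_dir=~/k8s-backup-$(date +%Y%m%d)
  mkdir -p "$backup_dir" || return 1

  # 1. Snapshot etcd (kubeadm will not do this for you)
  ETCDCTL_API=3 etcdctl snapshot save "$backup_dir/pre-upgrade.db" \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key || return 1

  # 2. Sanity-check that the snapshot file is readable
  ETCDCTL_API=3 etcdctl snapshot status "$backup_dir/pre-upgrade.db" -w table || return 1

  # 3. Copy static manifests, PKI, and kubeconfigs
  cp -r /etc/kubernetes/ "$backup_dir/"
}
```

Call `pre_upgrade_backup` before `kubeadm upgrade apply`; restoring is a separate procedure (`etcdctl snapshot restore`) you should rehearse on a non-production cluster.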

3. THE UPGRADE SEQUENCE [STEP-BY-STEP]

3.1 Repository Synchronization [ALL NODES]

In 2023, Kubernetes packages moved from the legacy Google-hosted repositories to community-hosted ones (pkgs.k8s.io). Each minor version has its own repository, so you must manually update your package manager's source list to point to the target minor version.

# Example: Moving to v1.33
echo "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.33/deb/ /" | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt-get update

3.2 Upgrading the Primary Control Plane

1. Binary Swap:

sudo apt-mark unhold kubeadm
sudo apt-get install -y kubeadm='1.33.x-*'
sudo apt-mark hold kubeadm

2. The Execution Plan: kubeadm performs pre-flight checks to ensure the cluster is healthy (e.g., all nodes are Ready, certificates are valid).

sudo kubeadm upgrade plan

Sample Output:

COMPONENT                 CURRENT   TARGET
kube-apiserver            v1.32.2   v1.33.4
kube-controller-manager   v1.32.2   v1.33.4
...

CERTIFICATE   EXPIRES                  EXTERNALLY MANAGED
admin.conf    Oct 27, 2024 10:00 UTC   no

3. Apply the Logic: This command replaces the static pod manifests in /etc/kubernetes/manifests. The Kubelet will observe the file change and restart the Control Plane pods automatically.

sudo kubeadm upgrade apply v1.33.4 --yes

4. Finalizing CP Kubelet:

sudo apt-mark unhold kubelet kubectl
sudo apt-get install -y kubelet='1.33.x-*' kubectl='1.33.x-*'
sudo systemctl daemon-reload && sudo systemctl restart kubelet
sudo apt-mark hold kubelet kubectl
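Before touching the workers, confirm the Control Plane survived the swap. A hedged sketch: the `verify_control_plane` name is an assumption; the `tier=control-plane` label is what kubeadm applies to its static pods.

```shell
# Sketch: post-upgrade control-plane sanity check. Defined as a function;
# run it from a machine with admin.conf credentials.
verify_control_plane() {
  # All nodes should be Ready; the CP node should report the new version
  kubectl get nodes -o wide || return 1
  # The static pods rewritten by 'kubeadm upgrade apply' should be Running
  kubectl -n kube-system get pods -l tier=control-plane
}
```

If the API server pod is crash-looping at this point, the backup from Section 2 is your rollback path.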

3.3 Upgrading Worker Nodes [SERIAL EXECUTION]

Worker nodes must be upgraded one by one to maintain application availability.

1. The Eviction (From CP Node): We use drain to gracefully move workloads.

kubectl drain worker-01 --ignore-daemonsets --delete-emptydir-data

Sample Output:

node/worker-01 cordoned
evicting pod default/api-v1-7d4b8c9f5-xyz12
evicting pod kube-system/coredns-85cb69466-abcde
pod/api-v1-7d4b8c9f5-xyz12 evicted
node/worker-01 drained

2. Node Upgrade (On the Worker):

sudo apt-get install -y kubeadm='1.33.x-*'
# 'upgrade node' pulls the latest cluster configuration from the API Server
sudo kubeadm upgrade node
sudo apt-get install -y kubelet='1.33.x-*' kubectl='1.33.x-*'
sudo systemctl daemon-reload && sudo systemctl restart kubelet

3. Restoration (From CP Node):

kubectl uncordon worker-01
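The drain/upgrade/uncordon cycle above can be serialized across a node list. A sketch only: it assumes passwordless SSH to each worker and a hypothetical `upgrade-worker.sh` on each node containing the apt/kubeadm steps from step 2; neither is part of kubeadm itself.

```shell
# Sketch: serial worker rollout. One node at a time, waiting for Ready
# before moving on, to preserve application availability.
upgrade_workers() {
  for node in "$@"; do
    kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data || return 1
    ssh "$node" 'sudo ./upgrade-worker.sh' || return 1   # hypothetical script
    kubectl uncordon "$node" || return 1
    # Do not touch the next node until this one rejoins as Ready
    kubectl wait --for=condition=Ready "node/$node" --timeout=5m || return 1
  done
}
# Usage: upgrade_workers worker-01 worker-02 worker-03
```

The early `return 1` on any failure is deliberate: a half-upgraded fleet is recoverable, a fleet-wide broken kubelet is an outage.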

4. PRODUCTION SAFEGUARD: POD DISRUPTION BUDGETS (PDB)

In production, you must prevent the drain command from evicting so many replicas of a mission-critical app at once that its SLO is breached.

4.1 The PDB Logic

A PDB ensures that a minimum number of replicas are always "Ready" during voluntary disruptions (like upgrades).

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: billing-api-pdb
spec:
  minAvailable: 2   # At least 2 pods must stay alive
  selector:
    matchLabels:
      app: billing

4.2 The "Drain Hang" Pitfall

If you have a 3-replica app with a minAvailable: 3 PDB, the kubectl drain command will hang indefinitely.

  • Reason: The PDB tells the Eviction API "No pods can be removed."
  • Fix: Ensure minAvailable is always less than the total replica count, or add temporary nodes to the cluster before starting the upgrade.
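The hang condition is just arithmetic. A sketch with illustrative replica counts; the `allowed_disruptions` helper is an assumption, not a Kubernetes API:

```shell
# Allowed voluntary disruptions = ready replicas - minAvailable.
# When this reaches 0, the Eviction API refuses every eviction and
# kubectl drain hangs.
allowed_disruptions() {
  ready=$1; min_available=$2
  d=$((ready - min_available))
  [ "$d" -lt 0 ] && d=0
  echo "$d"
}

allowed_disruptions 3 2   # → 1 (drain may evict one pod at a time)
allowed_disruptions 3 3   # → 0 (drain hangs indefinitely)
```

In a live cluster the same number is visible in the ALLOWED DISRUPTIONS column of `kubectl get pdb`, which is worth auditing before every drain.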

5. POST-UPGRADE CNI TROUBLESHOOTING

Upgrades often reveal "hidden" CNI misconfigurations, particularly with Calico VXLAN/BGP.

  • Symptom: Nodes show Ready, but Pods cannot ping each other across nodes.
  • Internal Cause: The upgraded calico-node binary may be attempting BGP peering (TCP 179), which cloud security groups (AWS/Azure) frequently block.
  • The Architect's Fix: Ensure Calico is forced into VXLAN mode (UDP 4789).
    kubectl patch installation.operator.tigera.io default --type=merge -p '{"spec":{"calicoNetwork":{"bgp":"Disabled"}}}'
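After the patch, verify the overlay actually switched. A sketch under stated assumptions: the `default-ipv4-ippool` name and the `crd.projectcalico.org` group are the operator defaults and may differ in your cluster; the interface check must run on a node, not your laptop.

```shell
# Sketch: confirm Calico is actually running VXLAN after disabling BGP.
verify_vxlan() {
  # The default IPPool should report an active vxlanMode (Always/CrossSubnet)
  kubectl get ippools.crd.projectcalico.org default-ipv4-ippool \
    -o jsonpath='{.spec.vxlanMode}' || return 1
  echo
  # On a node: VXLAN mode creates a vxlan.calico interface (UDP 4789)
  ip -d link show vxlan.calico
}
```

If the interface is missing or vxlanMode is Never, the operator has not reconciled yet; check `kubectl get tigerastatus` before digging deeper.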

6. KEY FILE REFERENCE FOR ARCHITECTS

  File Path                                 Description       Internal Role
  /etc/kubernetes/manifests/                Static Pods       kubeadm upgrade writes the new API Server/etcd manifests (YAML) here.
  /etc/kubernetes/admin.conf                Root Kubeconfig   The primary credential for the cluster. Check its cert expiry after upgrades.
  /var/lib/kubelet/config.yaml              Kubelet Config    Stores the cgroup driver and cluster DNS settings.
  /etc/apt/sources.list.d/kubernetes.list   Binary Source     Determines which versions apt will "see."

7. NINJA COMMANDS: VERIFICATION

1. Audit actual binary versions running on nodes:

kubectl get nodes -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion,OS:.status.nodeInfo.osImage

2. Verify API Version and Component Health:

kubectl version # --short was removed in v1.28; output is concise by default
kubectl get componentstatuses # Note: Legacy, but still useful for etcd check

3. Check Certificate Renewal: kubeadm upgrade apply automatically renews certs. Verify the new expiration dates:

kubeadm certs check-expiration