
Project Lab 14: Cluster Lifecycle Management (Bootstrap & Upgrade)

Managing a self-managed cluster requires deep knowledge of the binaries that run the cluster. This lab demonstrates how to move away from simple CLI flags to Declarative Cluster Configuration and how to perform a minor version upgrade without crashing the Data Plane.

Reference Material:

  • docs/13-cluster-setup/1-cluster-setup.md
  • docs/13-cluster-setup/2-upgrade-cluster.md
  • Previous Knowledge: Taints/Drain (Chapter 03), Pod Disruption Budgets (Chapter 13 Upgrade notes).

1. OBJECTIVE: THE HARDENED CLUSTER LIFECYCLE

The goal is to manage a cluster through its most critical stages:

  1. Hardened Bootstrap: Use a YAML config file to ensure the API Server, Etcd, and Kubelet are configured for performance (Cgroup sync).
  2. Health Audit: Validate the static pod architecture.
  3. Day-Two Upgrade: Perform a rolling upgrade from v1.32.x to v1.33.x, moving one minor version at a time and respecting the Kubernetes version skew policy (the kubelet may lag the API Server, never lead it).

2. PHASE 1: THE DECLARATIVE BOOTSTRAP (v1.32)

Instead of long kubeadm init commands, we use a configuration file. A config file can be reviewed, version-controlled, and replayed, which makes production bootstraps far more reproducible than a chain of CLI flags.

2.1 The Kubeadm Init Config (init-config.yaml)

This file synchronizes the systemd cgroup driver across the OS, the Container Runtime, and the Kubelet.

apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: "10.0.0.10" # Master Private IP
  bindPort: 6443
nodeRegistration:
  criSocket: "unix:///run/containerd/containerd.sock"
  imagePullPolicy: IfNotPresent
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.32.0
clusterName: "production-cluster"
networking:
  podSubnet: "192.168.0.0/16" # Align with Calico
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# CRITICAL: Must match the containerd cgroup driver
cgroupDriver: systemd
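
The runtime-side half of that cgroup sync lives in containerd's own config. A fragment for reference, assuming containerd 1.x's default CRI plugin table layout (the table path changed in containerd 2.0, so verify against your installed version):

```toml
# /etc/containerd/config.toml (fragment)
# This is the counterpart to cgroupDriver: systemd in the KubeletConfiguration.
# If this is false (the default) while the kubelet uses systemd, nodes flap.
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true
```

After editing, restart containerd (sudo systemctl restart containerd) before running kubeadm init.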

2.2 Execution

# Pre-flight: Ensure containerd is tuned (Ref: Chapter 13.1)
sudo kubeadm init --config init-config.yaml --upload-certs
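
On success, kubeadm prints follow-up instructions; the essential one is installing the admin kubeconfig so kubectl works for your user. A minimal sketch of that step. ADMIN_CONF defaults to a stand-in file here so the flow is runnable anywhere; on the real control plane it is /etc/kubernetes/admin.conf and the copy requires sudo:

```shell
# Stand-in for /etc/kubernetes/admin.conf so this sketch runs outside a cluster.
# On a real control-plane node, delete these two lines and use the real path.
ADMIN_CONF="${ADMIN_CONF:-/tmp/demo-admin.conf}"
[ -f "$ADMIN_CONF" ] || printf 'apiVersion: v1\nkind: Config\n' > "$ADMIN_CONF"

# The standard post-init step printed by kubeadm itself:
mkdir -p "$HOME/.kube"
cp "$ADMIN_CONF" "$HOME/.kube/config"
chmod 600 "$HOME/.kube/config"
echo "kubeconfig installed at $HOME/.kube/config"
```

Remember that the cluster stays NotReady until a CNI (here, Calico, matching the podSubnet above) is installed.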

3. PHASE 2: AUDITING THE STATIC PLANE

Once initialized, the control plane components run as Static Pods, managed directly by the kubelet rather than by the API Server. We must verify that they are present and healthy.

3.1 Audit Static Manifests

ls /etc/kubernetes/manifests/
# Expected: etcd.yaml, kube-apiserver.yaml, kube-controller-manager.yaml, kube-scheduler.yaml
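
What makes these pods "static" is a single kubelet setting. The kubelet watches this directory and runs any manifest dropped into it, with no scheduler or API Server involvement; the API Server only sees read-only mirror pods:

```yaml
# /var/lib/kubelet/config.yaml (fragment, written by kubeadm)
staticPodPath: /etc/kubernetes/manifests
```

This is also why editing a file in /etc/kubernetes/manifests/ restarts that component immediately.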

3.2 Audit Certificate Health

Before upgrading, ensure your current certificates are valid.

kubeadm certs check-expiration
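
The same expiry information can be read straight off a certificate with openssl, which is useful on nodes where kubeadm is unavailable. A sketch using a throwaway self-signed certificate; on a real control plane, skip the generation step and point the second command at /etc/kubernetes/pki/apiserver.crt:

```shell
# Generate a throwaway 30-day cert purely for demonstration.
openssl req -x509 -newkey rsa:2048 -nodes -days 30 -subj "/CN=demo" \
  -keyout /tmp/demo.key -out /tmp/demo.crt 2>/dev/null

# Print the expiry date (the basis of check-expiration's RESIDUAL TIME column).
# Prints a line of the form: notAfter=<date>
openssl x509 -enddate -noout -in /tmp/demo.crt
```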

4. PHASE 3: THE MINOR VERSION UPGRADE (v1.32 -> v1.33)

We will now perform the upgrade. We follow the strict order: Master kubeadm binary -> Control Plane components (kubeadm upgrade apply) -> Master Kubelet -> Worker Nodes.

4.1 Master Node Upgrade

  1. Update APT Source: Point to the new v1.33 repository.
    echo "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.33/deb/ /" | sudo tee /etc/apt/sources.list.d/kubernetes.list
    sudo apt-get update
  2. Upgrade Kubeadm:
    sudo apt-get install -y kubeadm=1.33.x-x
  3. Plan and Apply:
    sudo kubeadm upgrade plan
    sudo kubeadm upgrade apply v1.33.x
  4. Upgrade Kubelet:
    sudo apt-get install -y kubelet=1.33.x-x kubectl=1.33.x-x
    sudo systemctl daemon-reload && sudo systemctl restart kubelet

4.2 Worker Node Upgrade (The Eviction Flow)

We must upgrade workers one by one. The Bible Rule: Never drain multiple nodes simultaneously unless you have verified your cluster capacity.

Step 1: Drain (From Master)

kubectl drain worker-01 --ignore-daemonsets --delete-emptydir-data

Step 2: Upgrade Binary and Config (On Worker)

sudo apt-get install -y kubeadm=1.33.x-x
sudo kubeadm upgrade node # Retrieves the new cluster config from API
sudo apt-get install -y kubelet=1.33.x-x kubectl=1.33.x-x
sudo systemctl restart kubelet

Step 3: Restore (From Master)

kubectl uncordon worker-01

5. VERIFICATION & POST-MORTEM AUDIT

5.1 Version Skew Verification

Verify that the cluster is in a supported skew state.

kubectl get nodes -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion

Architect's Checklist:

  • Is any Worker Node running a version newer than the API Server? (If YES, the cluster is in an unsupported/broken state).
  • Does the Kubelet on the Master node match the API Server version? (Best practice.)
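
The checklist above can be verified mechanically. A minimal sketch that compares minor versions; the function name is illustrative, and the hard-coded version strings stand in for real values from kubectl version and the node list:

```shell
# Succeeds if the kubelet's minor version does not exceed the API Server's.
# Expects versions in the "v<major>.<minor>.<patch>" form kubectl reports.
skew_ok() {
  api_minor=$(echo "$1" | cut -d. -f2)
  kubelet_minor=$(echo "$2" | cut -d. -f2)
  [ "$kubelet_minor" -le "$api_minor" ]
}

skew_ok v1.33.0 v1.32.4 && echo "supported skew"   # kubelet one minor behind: fine
skew_ok v1.32.0 v1.33.1 || echo "UNSUPPORTED: kubelet newer than API server"
```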

5.2 Pod Disruption Audit

If any application failed during the drain process, check for missing PodDisruptionBudgets (PDBs).

kubectl get pdb -A

Discovery: If a mission-critical app has minAvailable: 100% (or a minAvailable equal to its replica count), the eviction API refuses every eviction, so the drain command blocks indefinitely and the node can never be upgraded.
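
For contrast, a drain-friendly budget for a hypothetical 3-replica Deployment (the name and label are illustrative). Evictions may proceed as long as two replicas stay up, so a single-node drain never deadlocks:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb            # hypothetical name
spec:
  minAvailable: 2          # minAvailable: 100% here would block drains entirely
  selector:
    matchLabels:
      app: web             # hypothetical label
```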


6. TROUBLESHOOTING & NINJA COMMANDS

6.1 The "Kubelet Stopped Heartbeating" Fix

If a worker node stays in NotReady after an upgrade:

  1. Check Kubelet logs: journalctl -u kubelet -f
  2. Look for: failed to run Kubelet: misconfiguration: cgroup driver mismatch.
  3. Fix: Ensure /var/lib/kubelet/config.yaml has cgroupDriver: systemd and matches containerd.
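
The mismatch check above can be scripted. A sketch with parameterized paths and stand-in files so it runs anywhere; on a real node, point the variables at /var/lib/kubelet/config.yaml and /etc/containerd/config.toml and delete the stand-in lines:

```shell
KUBELET_CFG="${KUBELET_CFG:-/tmp/demo-kubelet-config.yaml}"
CONTAINERD_CFG="${CONTAINERD_CFG:-/tmp/demo-containerd-config.toml}"
echo "cgroupDriver: systemd" > "$KUBELET_CFG"      # stand-in only
echo "SystemdCgroup = true"  > "$CONTAINERD_CFG"   # stand-in only

# Both sides must agree on systemd, or the kubelet refuses to start.
if grep -q '^cgroupDriver: systemd' "$KUBELET_CFG" \
   && grep -q 'SystemdCgroup = true' "$CONTAINERD_CFG"; then
  echo "cgroup drivers in sync"
else
  echo "MISMATCH: kubelet and containerd disagree on the cgroup driver" >&2
fi
```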

6.2 Emergency Rollback Logic

kubeadm does not support an automated rollback of an apply.

  • The Architect's Safety Net: Always take an etcd snapshot before running kubeadm upgrade apply. If the upgrade corrupts the data, you must restore etcd to the pre-upgrade state. (Ref: Chapter 14 Troubleshooting Notes).
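
A sketch of that pre-upgrade snapshot, runnable only on a live control-plane node. The endpoint and certificate paths match kubeadm's default layout, but verify them against your cluster; the backup destination is an assumption:

```shell
# Take an etcd snapshot immediately before kubeadm upgrade apply.
# Requires etcdctl (etcd-client package) on the control-plane node.
ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-pre-upgrade.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```

Store the snapshot off the node; a backup that dies with the disk is not a safety net.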

7. ARCHITECT'S KEY TAKEAWAYS

  1. API Server is the Version Anchor: It must always be the first component upgraded and the highest version in the cluster.
  2. Configuration over Flags: Use InitConfiguration files to manage complex production settings like custom SANs or specialized Cgroup drivers.
  3. The Drain is a Contract: The drain command respects PDBs. If an upgrade is stuck, the bottleneck is usually an overly restrictive PDB.
  4. Cgroup Sync is Stability: Mismatched cgroup drivers between the OS and the CRI are the #1 cause of flaky nodes after a fresh install.