Control Plane Troubleshooting: Recovering the Brain
When the Control Plane fails, the cluster becomes "headless." While existing workloads may continue to run (depending on their dependency on the API), you lose the ability to scale, heal, or update the system. This document details the recovery of the "Big Three" components: kube-apiserver, etcd, and kube-controller-manager.
1. THE "KUBECTL IS DEAD" SCENARIO
If kubectl returns The connection to the server was refused, you cannot use Kubernetes to debug Kubernetes. You must move to the Host Level.
1.1 Immediate Host Triage
- Check the Kubelet: The Kubelet manages the Control Plane static pods. If it's down, nothing else will start.
systemctl status kubelet
journalctl -u kubelet -f
- Inspect the Container Runtime: Use crictl to see whether the Control Plane containers exist at all.
# Set the socket if not in crictl.yaml
export CONTAINER_RUNTIME_ENDPOINT=unix:///run/containerd/containerd.sock
crictl ps | grep kube-
- Check Static Pod Manifests: Ensure the files in /etc/kubernetes/manifests/ are valid YAML and haven't been corrupted.
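That manifest check can be scripted. A minimal sketch, assuming python3 with PyYAML is present on the host; check_manifests is a hypothetical helper name:

```shell
# Syntax-check every static pod manifest in a directory.
# Assumption: python3 with PyYAML is installed on the host.
check_manifests() {
  local dir="${1:-/etc/kubernetes/manifests}" rc=0
  for f in "$dir"/*.yaml; do
    [ -e "$f" ] || continue
    if python3 -c "import sys, yaml; yaml.safe_load(open(sys.argv[1]))" "$f" 2>/dev/null; then
      echo "OK   $f"
    else
      echo "BAD  $f"
      rc=1
    fi
  done
  return $rc
}
```

A non-zero exit flags at least one unparseable manifest, which is a common cause of a static pod silently failing to start.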
2. ETCD DISASTER RECOVERY (BIBLE GRADE)
Etcd is the only stateful component of the Control Plane. If it loses quorum (a majority of healthy members), the API Server can no longer read or write cluster state and will go into a CrashLoopBackOff.
2.1 The Restore Architecture
Rule: You cannot restore etcd while the API Server is running. The API Server will attempt to write to etcd during the restore, leading to data corruption.
2.2 The Step-by-Step Restoration (Single Node)
Scenario: You have a snapshot at /tmp/etcd-backup.db and your cluster is down.
Step 1: Stop the Control Plane
Move the manifests out of the Kubelet's watch directory to stop the pods gracefully.
mkdir -p /etc/kubernetes/manifests-backup
mv /etc/kubernetes/manifests/*.yaml /etc/kubernetes/manifests-backup/
# Wait 30s for 'crictl ps' to show no kube-apiserver or etcd containers.
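The fixed 30-second wait can be made deterministic by polling the runtime. A minimal sketch; wait_for_gone is a hypothetical helper, and the crictl usage assumes the socket is configured as in section 1.1:

```shell
# Poll a command until its output no longer matches a pattern, or time out.
wait_for_gone() {
  local cmd="$1" pattern="$2" timeout="${3:-60}" elapsed=0
  while eval "$cmd" | grep -q "$pattern"; do
    sleep 1
    elapsed=$((elapsed + 1))
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1  # timed out; containers are still running
    fi
  done
  return 0
}

# Usage on the host:
# wait_for_gone "crictl ps" "kube-apiserver\|etcd" 60 || echo "pods still running"
```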
Step 2: Execute the Restore
We must restore the data to a new directory to avoid conflicting with old, corrupted state.
export ETCDCTL_API=3
etcdctl snapshot restore /tmp/etcd-backup.db \
--data-dir=/var/lib/etcd-new \
--initial-cluster=master=https://127.0.0.1:2380 \
--initial-cluster-token=etcd-cluster-1 \
--initial-advertise-peer-urls=https://127.0.0.1:2380 \
--name=master
Step 3: Update the Etcd Manifest
Edit /etc/kubernetes/manifests-backup/etcd.yaml. You must update the hostPath volume for etcd-data to point to /var/lib/etcd-new.
# ... inside etcd.yaml
  volumes:
  - hostPath:
      path: /var/lib/etcd-new    # Update this from /var/lib/etcd
      type: DirectoryOrCreate
    name: etcd-data
Step 4: Restart the Control Plane
mv /etc/kubernetes/manifests-backup/*.yaml /etc/kubernetes/manifests/
# Kubelet will detect the files and start the pods.
3. COMPONENT-SPECIFIC LOGGING & ERRORS
3.1 Kube-API Server
- Log Location:
tail -f /var/log/pods/kube-system_kube-apiserver-*/kube-apiserver/*.log
- Common Error: ETCD server unreachable: Check whether etcd is listening on 127.0.0.1:2379.
- Common Error: Address already in use: Usually a zombie apiserver process. Kill it: pkill kube-apiserver.
3.2 Kube-Controller Manager & Scheduler
These use Leader Election. In an HA cluster (3 Masters), only one pod is active.
- Troubleshooting: If the cluster is "frozen" (pods aren't scheduling), check the logs to confirm that exactly one instance actually holds the lease:
leaderelection.go:330] successfully acquired lease kube-system/kube-scheduler
- The Clock Skew Hazard: If the clocks on your Master nodes differ by more than a few seconds, Leader Election will "flap," causing constant restarts. Always sync Master clocks via NTP.
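Clock skew can be audited from a workstation by comparing epoch timestamps. A sketch with a hypothetical clock_skew_ok helper; the master host names and SSH access are assumptions:

```shell
# Compare two epoch timestamps against a maximum allowed skew in seconds.
clock_skew_ok() {
  local d=$(( $1 - $2 ))
  if [ "$d" -lt 0 ]; then d=$(( -d )); fi
  [ "$d" -le "$3" ]
}

# Usage (hypothetical host names, assumes SSH access to the masters):
# t1=$(ssh master-1 date +%s); t2=$(ssh master-2 date +%s)
# clock_skew_ok "$t1" "$t2" 2 || echo "WARNING: clock skew exceeds 2s"
```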
4. PKI & CERTIFICATE TROUBLESHOOTING
Expired certificates are the #1 cause of cluster failure after 1 year of uptime.
4.1 Auditing Expiration
kubeadm certs check-expiration
4.2 Manual Renewal
If certs are expired, control plane components can no longer authenticate to each other and the API Server becomes unreachable. You must renew them on the host.
# Renew all certificates
kubeadm certs renew all
# Install the renewed admin.conf for your local kubectl
cp /etc/kubernetes/admin.conf ~/.kube/config
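For individual PEM files (including ones kubeadm doesn't manage), expiry can be checked directly with openssl. A sketch assuming GNU date is available; cert_days_left is a hypothetical helper:

```shell
# Print the number of whole days until a PEM certificate expires.
# Assumptions: openssl and GNU date (standard on most Linux distros).
cert_days_left() {
  local end end_s now_s
  end=$(openssl x509 -enddate -noout -in "$1" | cut -d= -f2)
  end_s=$(date -d "$end" +%s)
  now_s=$(date +%s)
  echo $(( (end_s - now_s) / 86400 ))
}

# Usage:
# cert_days_left /etc/kubernetes/pki/apiserver.crt
```

A result at or below zero means the certificate has already expired.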
5. CONTROL PLANE AUDITING
Use these commands to extract high-fidelity health data from the Control Plane.
1. Identify which Master Nodes are not "Ready":
kubectl get nodes -l node-role.kubernetes.io/control-plane -o jsonpath='{range .items[?(@.status.conditions[-1].status!="True")]}{.metadata.name}{"\t"}{.status.conditions[-1].message}{"\n"}{end}'
2. Audit API Server secure port and admission plugins:
kubectl get pod -n kube-system -l component=kube-apiserver -o jsonpath='{.items[0].spec.containers[0].command}' | jq
3. Check Etcd Member List (Internal API):
# This requires exec into the etcd pod
kubectl exec -n kube-system etcd-master -- etcdctl \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
endpoint health
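The health output above can be filtered so only problem members surface. A sketch that assumes the classic one-line-per-endpoint text format; unhealthy_endpoints is a hypothetical helper:

```shell
# Print endpoints that did NOT report healthy, given `etcdctl endpoint health`
# output on stdin. Assumes one line per endpoint, "<endpoint> is healthy: ...".
unhealthy_endpoints() {
  grep -v "is healthy" | awk '{print $1}'
}
```

Piping the etcdctl output through this in a monitoring script gives an empty result when all members are healthy.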
6. PRODUCTION WARNINGS & BEST PRACTICES
- Etcd Defragmentation: Over time, etcd's DB file grows due to MVCC history. Periodically run etcdctl defrag to reclaim disk space and prevent "database space exceeded" errors.
- Audit Logs: Enable API Server Audit Logs to /var/log/kubernetes/audit.log. This is the "Black Box" recorder for who changed what and when.
- Manifest Integrity: Use an automated tool (like Tripwire or a simple cron job) to alert if files in /etc/kubernetes/manifests change without a corresponding GitOps commit.
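The "simple cron" approach to manifest integrity can be as small as a checksum baseline. A sketch; the baseline path and helper names are hypothetical:

```shell
# Record a checksum baseline of the static pod manifests, then verify it.
BASELINE=/var/lib/manifest-baseline.sha256

# Run once after every intended (GitOps-backed) change.
manifest_baseline() {
  sha256sum "${1:-/etc/kubernetes/manifests}"/*.yaml > "$BASELINE"
}

# Run from cron; non-zero exit means a manifest changed out-of-band.
manifest_check() {
  sha256sum -c --quiet "$BASELINE"
}
```

Wire manifest_check's failure exit into whatever alerting the cron host already uses.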