Control Plane Troubleshooting: Recovering the Brain

When the Control Plane fails, the cluster becomes "headless." While existing workloads may continue to run (depending on their dependency on the API), you lose the ability to scale, heal, or update the system. This document details the recovery of the "Big Three" components: kube-apiserver, etcd, and kube-controller-manager.


1. THE "KUBECTL IS DEAD" SCENARIO

If kubectl returns "The connection to the server was refused", you cannot use Kubernetes to debug Kubernetes. You must move to the Host Level.

1.1 Immediate Host Triage

  1. Check the Kubelet: The Kubelet manages the Control Plane static pods. If it's down, nothing else will start.
    systemctl status kubelet
    journalctl -u kubelet -f
  2. Inspect Container Runtime: Use crictl to see if the Control Plane containers even exist.
    # Set the socket if not in crictl.yaml
    export CONTAINER_RUNTIME_ENDPOINT=unix:///run/containerd/containerd.sock
    crictl ps | grep kube-
  3. Check Static Pod Manifests: Ensure the files in /etc/kubernetes/manifests/ are valid YAML and haven't been corrupted.
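The manifest check in step 3 can be scripted. A minimal sketch (the `check_manifests` helper and its "SUSPECT" output format are ours, not standard tooling; it is a cheap smoke test, not full YAML validation):

```shell
# Flag static pod manifests that are missing basic required fields.
check_manifests() {
  local dir="${1:-/etc/kubernetes/manifests}"
  local f
  for f in "$dir"/*.yaml; do
    [ -e "$f" ] || continue  # empty directory: the glob did not expand
    if ! grep -q '^kind:' "$f" || ! grep -q 'name:' "$f"; then
      echo "SUSPECT: $f"
    fi
  done
}
```

Run it as `check_manifests /etc/kubernetes/manifests`; any file it flags should be diffed against your known-good GitOps copy.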

2. ETCD DISASTER RECOVERY (BIBLE GRADE)

Etcd is the only stateful component of the Control Plane. If it loses quorum (a majority of healthy members), the API Server can no longer read or write cluster state and will go into a CrashLoopBackOff.

2.1 The Restore Architecture

Rule: You cannot restore etcd while the API Server is running. The API Server will attempt to write to etcd during the restore, leading to data corruption.

2.2 The Step-by-Step Restoration (Single Node)

Scenario: You have a snapshot at /tmp/etcd-backup.db and your cluster is down.

Step 1: Stop the Control Plane. Move the manifests out of the Kubelet's watch directory to stop the pods gracefully.

mkdir -p /etc/kubernetes/manifests-backup
mv /etc/kubernetes/manifests/*.yaml /etc/kubernetes/manifests-backup/
# Wait 30s for 'crictl ps' to show no kube-apiserver or etcd containers.
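Rather than sleeping a fixed 30 seconds, you can poll the runtime until the containers are actually gone. A sketch (the `wait_for_stop` helper name is ours; it assumes crictl is configured as in section 1.1):

```shell
# Poll a container-listing command until no control-plane containers remain.
wait_for_stop() {
  # "$@" is the listing command; in production: wait_for_stop crictl ps
  while "$@" | grep -Eq 'kube-apiserver|etcd'; do
    echo "control-plane containers still running, waiting..."
    sleep 5
  done
}
# On the master: wait_for_stop crictl ps
```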

Step 2: Execute the Restore. We must restore the data to a new directory to avoid conflicting with the old, corrupted state. (On etcd v3.5+, the dedicated etcdutl snapshot restore subcommand is preferred; etcdctl snapshot restore still works but is deprecated.)

export ETCDCTL_API=3
etcdctl snapshot restore /tmp/etcd-backup.db \
  --data-dir=/var/lib/etcd-new \
  --initial-cluster=master=https://127.0.0.1:2380 \
  --initial-cluster-token=etcd-cluster-1 \
  --initial-advertise-peer-urls=https://127.0.0.1:2380 \
  --name=master

Step 3: Update the Etcd Manifest. Edit /etc/kubernetes/manifests-backup/etcd.yaml. You must update the hostPath volume for etcd-data to point to /var/lib/etcd-new.

# ... inside etcd.yaml
  volumes:
  - hostPath:
      path: /var/lib/etcd-new   # Update this from /var/lib/etcd
      type: DirectoryOrCreate
    name: etcd-data

Step 4: Restart the Control Plane

mv /etc/kubernetes/manifests-backup/*.yaml /etc/kubernetes/manifests/
# Kubelet will detect the files and start the pods.
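Once the manifests are back in place, you can poll the API Server's health endpoint instead of guessing when it is up. A sketch (the `wait_healthy` helper name is ours; the default kubeadm secure port 6443 is assumed):

```shell
# Poll a health URL until it returns success. -k is needed because the
# serving cert is signed by the cluster CA, which curl does not trust.
wait_healthy() {
  until curl -ksf "$1" >/dev/null 2>&1; do
    echo "API server not healthy yet, retrying..."
    sleep 5
  done
}
# On the master: wait_healthy https://127.0.0.1:6443/healthz
```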

3. COMPONENT-SPECIFIC LOGGING & ERRORS

3.1 Kube-API Server

  • Log Location: tail -f /var/log/pods/kube-system_kube-apiserver-*/kube-apiserver/*.log
  • Common Error: ETCD server unreachable: Check if etcd is running on 127.0.0.1:2379.
  • Common Error: Address already in use: Usually a zombie apiserver process. Kill it: pkill kube-apiserver.

3.2 Kube-Controller Manager & Scheduler

These use Leader Election. In an HA cluster (3 Masters), only one pod is active.

  • Troubleshooting: If the cluster is "frozen" (pods aren't scheduling), check the logs for: leaderelection.go:330] successfully acquired lease kube-system/kube-scheduler. If no instance has logged this line, nobody holds the lease and no scheduling decisions are being made; kubectl get lease -n kube-system kube-scheduler shows the current holder.
  • The Clock Skew Hazard: If the clocks on your Master nodes differ by more than a few seconds, Leader Election will "flap," causing constant restarts. Always sync Master clocks via NTP.
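The clock-skew hazard in the last bullet is easy to check for. A minimal sketch (the `skew_ok` helper and its 2-second default are our choices; the ssh example assumes key-based access between masters):

```shell
# Succeed only if two epoch timestamps differ by at most $3 seconds.
skew_ok() {
  local a=$1 b=$2 max=${3:-2}
  local d=$(( a - b ))
  [ "${d#-}" -le "$max" ]   # ${d#-} strips the sign: absolute value
}
# Across masters:
# skew_ok "$(date +%s)" "$(ssh master2 date +%s)" || echo "clock skew!"
```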

4. PKI & CERTIFICATE TROUBLESHOOTING

Expired certificates are the #1 cause of cluster failure after roughly one year of uptime (kubeadm issues certificates with a one-year lifetime by default).

4.1 Auditing Expiration

kubeadm certs check-expiration

4.2 Manual Renewal

If certs are expired, the API Server won't start. You must renew them on the host.

# Renew all certificates
kubeadm certs renew all

# The running control plane keeps using the old certs until restarted:
# move the manifests out of /etc/kubernetes/manifests and back (or
# restart the containers) so the renewed certs are loaded.

# Copy the renewed admin.conf for your local kubectl
cp /etc/kubernetes/admin.conf ~/.kube/config
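kubeadm certs check-expiration is the canonical audit, but when kubeadm itself is unavailable you can ask openssl directly. A sketch (the `cert_valid_for` helper name is ours):

```shell
# Exit 0 if the certificate is still valid for at least $2 more seconds.
cert_valid_for() {
  openssl x509 -checkend "$2" -noout -in "$1" >/dev/null 2>&1
}
# On a master node, warn a week ahead:
# cert_valid_for /etc/kubernetes/pki/apiserver.crt $((7*24*3600)) || echo "renew soon"
```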

5. CONTROL PLANE AUDITING

Use these commands to extract high-fidelity health data from the Control Plane.

1. List the Master Nodes together with their Ready condition (anything other than "True" needs attention):

kubectl get nodes -l node-role.kubernetes.io/control-plane \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'

2. Audit API Server secure port and admission plugins:

kubectl get pod -n kube-system -l component=kube-apiserver \
  -o jsonpath='{.items[0].spec.containers[0].command}' | jq .

3. Check Etcd Endpoint Health (Internal API):

# This requires exec into the etcd pod
kubectl exec -n kube-system etcd-master -- etcdctl \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health
# Swap 'endpoint health' for 'member list' to inspect cluster membership.

6. PRODUCTION WARNINGS & BEST PRACTICES

  1. Etcd Defragmentation: Over time, etcd's DB file grows due to MVCC history. Periodically compact the keyspace and run etcdctl defrag to reclaim disk space and prevent "database space exceeded" (NOSPACE) errors.
  2. Audit Logs: Enable API Server Audit Logs to /var/log/kubernetes/audit.log. This is the "Black Box" recorder for who changed what and when.
  3. Manifest Integrity: Use an automated tool (like Tripwire or simple cron) to alert if files in /etc/kubernetes/manifests change without a corresponding GitOps commit.
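Best practice #3 needs nothing heavier than sha256sum and cron. A minimal sketch (the `snapshot_checksums` helper and the /root/manifests.sha location are illustrative choices):

```shell
# Record and compare checksums of the static pod manifests.
snapshot_checksums() { ( cd "$1" && sha256sum *.yaml ); }

# At deploy time (after a reviewed change):
#   snapshot_checksums /etc/kubernetes/manifests > /root/manifests.sha
# From cron, alert on drift:
#   snapshot_checksums /etc/kubernetes/manifests | diff -q /root/manifests.sha - \
#     || logger -p auth.warn "control-plane manifest drift detected"
```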