
Data Plane Troubleshooting: Fixing the Muscle

The Data Plane is where the actual work happens. When this layer fails, Pods get stuck in Pending, nodes transition to NotReady, and services experience intermittent latency. Troubleshooting the Data Plane requires moving from the Kubernetes API down to the Linux Kernel.


1. NODE TROUBLESHOOTING: THE KUBELET FAILURE

When a node shows Status: NotReady, the Kubelet (the node-level agent) has stopped reporting heartbeats to the API Server.

1.1 The Kubelet Triage (Systemd)

The Kubelet is a binary process, not a Pod. You must SSH into the node to debug it.

# 1. Check if the process is running
systemctl status kubelet

# 2. Check the logs for the last 5 minutes
journalctl -u kubelet --since "5 min ago" -f
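Scanning the journal by eye is slow; grepping for the known fatal signatures (the same ones catalogued in the next section) narrows it down fast. A sketch, using two sample log lines in place of real journalctl output:

```shell
# On a real node you would pipe the journal instead:
#   journalctl -u kubelet --since "5 min ago" | grep -E 'cgroup driver|PLEG is not healthy|OOMKilled'
SAMPLE_LOG='I0101 10:00:01 kubelet.go:411 "Starting kubelet"
E0101 10:00:02 server.go:302 "failed to run Kubelet: misconfiguration: cgroup driver mismatch"'

# Keep only the lines that match a known fatal signature
MATCHES=$(printf '%s\n' "$SAMPLE_LOG" | grep -E 'cgroup driver|PLEG is not healthy|OOMKilled')
echo "$MATCHES"
```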

1.2 Common Node Root Causes

  • Symptom in logs: failed to run Kubelet: misconfiguration: cgroup driver mismatch
    Internal cause: The Kubelet is using cgroupfs while containerd is using systemd.
    Fix: Update /var/lib/kubelet/config.yaml to cgroupDriver: systemd.
  • Symptom in logs: failed to get cgroup stats... OOMKilled
    Internal cause: The Kubelet process itself ran out of memory.
    Fix: Increase the Kubelet's memory reservation via --kube-reserved.
  • Symptom in logs: PLEG is not healthy
    Internal cause: The Pod Lifecycle Event Generator is stuck because the CRI is responding too slowly.
    Fix: Check top for high CPU or iostat for disk I/O wait on the node.
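The first row is worth automating. A minimal sketch (the helper name check_cgroup_drivers is hypothetical; the real values live in /var/lib/kubelet/config.yaml under cgroupDriver: and in containerd's config.toml under SystemdCgroup =):

```shell
# Hypothetical helper: the two sides of the CRI boundary must agree on a cgroup driver.
check_cgroup_drivers() {
  local kubelet="$1" containerd_systemd="$2"
  if { [ "$kubelet" = systemd ] && [ "$containerd_systemd" = true ]; } ||
     { [ "$kubelet" = cgroupfs ] && [ "$containerd_systemd" = false ]; }; then
    echo "OK: drivers agree"
  else
    echo "MISMATCH: kubelet=$kubelet, containerd SystemdCgroup=$containerd_systemd"
  fi
}

# The classic failure mode from the table: kubelet on cgroupfs, containerd on systemd
check_cgroup_drivers cgroupfs true
```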

2. NETWORKING TROUBLESHOOTING (CNI)

If Pod A cannot ping Pod B, or gRPC connections time out, the issue is likely the Container Network Interface (CNI) or an MTU mismatch.

2.1 The "Silent Killer": MTU Mismatch

Scenario: Small packets (ping) work, but large packets (API responses) hang or drop.

  • The Logic: VXLAN/Overlay networking adds a 50-byte encapsulation header. If your physical network MTU is 1500, the Pod MTU must be 1450.
  • The Litmus Test:
# 1472 bytes of ICMP payload + 28 bytes of ICMP/IP headers = a full 1500-byte packet,
# sent with fragmentation forbidden (-M do).
# A correctly configured Pod (MTU 1450) rejects it immediately with "message too long";
# if it hangs or drops silently instead, the Pod MTU is misconfigured.
kubectl exec pod-a -- ping -s 1472 -M do <pod-b-ip>
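The arithmetic behind those numbers can be checked directly (a sketch assuming VXLAN's typical 50-byte overhead and IPv4):

```shell
# MTU arithmetic for a VXLAN overlay (values are typical assumptions, not universals)
PHYS_MTU=1500        # MTU of the node's physical NIC
VXLAN_OVERHEAD=50    # outer Ethernet/IP/UDP/VXLAN encapsulation
ICMP_IP_HEADERS=28   # 20-byte IPv4 header + 8-byte ICMP header

POD_MTU=$((PHYS_MTU - VXLAN_OVERHEAD))     # what the Pod interface should be set to
MAX_PING=$((POD_MTU - ICMP_IP_HEADERS))    # largest ping -s payload that must succeed

echo "Pod MTU: ${POD_MTU}, largest safe ping payload: ${MAX_PING}"
```

On a healthy overlay, ping -s with that largest safe payload goes through end-to-end; anything bigger should fail fast with an explicit error rather than hang.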

2.2 IP Address Exhaustion

If Pods stay in ContainerCreating with the error failed to assign an IP address:

  • Internal Cause: The CNI's IPAM (IP Address Management) has run out of free IPs in the node's local subnet (e.g., with the host-local IPAM plugin).
  • Audit: Check the /var/lib/cni/networks/ directory on the node to see assigned IP files.
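To judge how close a node is to exhaustion, compare the count of assigned-IP files against the subnet's capacity. A sketch with hard-coded sample values (on a real node, ASSIGNED would come from counting the files under /var/lib/cni/networks/, and the prefix from the node's podCIDR):

```shell
# Assumed: a /24 podCIDR per node; host-local IPAM keeps one file per assigned IP
CIDR_PREFIX=24
ASSIGNED=250   # sample value; on a node: ls /var/lib/cni/networks/<network-name> | wc -l

# Usable addresses in the subnet (all hosts minus network and broadcast)
CAPACITY=$(( (1 << (32 - CIDR_PREFIX)) - 2 ))
echo "assigned=${ASSIGNED} capacity=${CAPACITY}"

if [ "$ASSIGNED" -ge $(( CAPACITY - 4 )) ]; then
  echo "WARNING: node is nearly out of Pod IPs"
fi
```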

3. POD LIFECYCLE FORENSICS

3.1 The "logs --previous" Trick

If a Pod is in CrashLoopBackOff, the current logs might be empty because the container keeps restarting.

# View the logs of the instance that just crashed
kubectl logs <pod-name> --previous

# For multi-container Pods, name the container explicitly
kubectl logs <pod-name> -c <container-name> --previous

3.2 OOMKilled vs. Process Crash

How do you distinguish an application bug from a resource limit?

kubectl get pod <name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
  • OOMKilled: The Linux Kernel killed the process because it exceeded resources.limits.memory from the Pod spec.
  • Error: The application crashed on its own, e.g., exit code 1 for an unhandled exception or 139 for a segfault (128 + signal 11, SIGSEGV).
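Exit codes above 128 encode a fatal signal (128 + the signal number), which is how 139 decodes to a segfault and 137 to a SIGKILL such as an OOM kill. A small decoder sketch (decode_exit is a hypothetical helper, not a kubectl feature):

```shell
# Exit codes above 128 mean "terminated by signal (code - 128)"
decode_exit() {
  local code=$1
  if [ "$code" -gt 128 ]; then
    echo "exit ${code}: killed by signal $((code - 128))"
  else
    echo "exit ${code}: the application exited by itself"
  fi
}

decode_exit 139   # 139 - 128 = signal 11 (SIGSEGV, a segfault)
decode_exit 137   # 137 - 128 = signal 9  (SIGKILL, what the OOM killer sends)
decode_exit 1
```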

4. EMERGENCY: HOST-LEVEL DEBUGGING (The "Ninja" Path)

When kubectl exec fails (e.g., the Pod has no shell or the API is slow), use nsenter to break into the container's namespace from the host.

4.1 Step-by-Step nsenter

  1. Find the Container PID:
    # On the worker node
    crictl inspect <container-id> | grep pid
    # Let's say PID is 12345
  2. Enter the Network Namespace:
    # -t: Target PID
    # -n: Enter Network namespace
    # -u: Enter UTS (Hostname) namespace
    nsenter -t 12345 -n -u ip addr
  3. Why do this? You are now using the host's tools (tcpdump, ip, netstat) to look at the Pod's network stack. This is the ultimate way to debug CNI issues.

5. JSONPATH NINJA: DATA PLANE AUDITING

1. Find all Pods that have not yet been scheduled onto a node (empty nodeName, typically stuck in Pending):

kubectl get pods -A -o jsonpath='{range .items[?(@.spec.nodeName=="")]}{.metadata.namespace}{"\t"}{.metadata.name}{"\n"}{end}'

2. Audit Pod IPs and Node IPs in one view (Connectivity Mapping):

kubectl get pods -A -o jsonpath='{range .items[*]}{.status.podIP}{" -> "}{.spec.nodeName}{"\n"}{end}'

3. Identify Pods with high restart counts across the whole cluster:

kubectl get pods -A -o jsonpath='{range .items[?(@.status.containerStatuses[0].restartCount > 10)]}{.status.containerStatuses[0].restartCount}{"\t"}{.metadata.name}{"\n"}{end}'

6. PRODUCTION CHECKLIST: PREVENTATIVE OPS

  1. Node Problem Detector: Install the node-problem-detector to surface kernel-level issues (like "kernel oops" or "read-only filesystem") as Kubernetes Events.
  2. Reserved Resources: Always set --kube-reserved and --system-reserved in your Kubelet config to prevent the OS and Kubelet from being starved by rogue Pods.
  3. Graceful Node Shutdown: Ensure the Kubelet is configured for graceful node shutdown so it can terminate Pods cleanly, honoring their grace periods, before the OS halts.
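Items 2 and 3 map directly to KubeletConfiguration fields. A minimal sketch; the reservation sizes and grace periods below are example values to adapt, not recommendations:

```yaml
# /var/lib/kubelet/config.yaml (fragment)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:               # resources fenced off for the Kubelet and runtime
  cpu: "100m"
  memory: "512Mi"
systemReserved:             # resources fenced off for the OS itself
  cpu: "100m"
  memory: "512Mi"
shutdownGracePeriod: "30s"              # total time Pods get on node shutdown
shutdownGracePeriodCriticalPods: "10s"  # portion of that reserved for critical Pods
```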