
Health Probes: The Self-Healing Data Plane

In Kubernetes, "Running" does not mean "Healthy." A process can be alive (PID exists) but deadlocked, or alive but still loading cache and unable to serve traffic. Health probes allow the Kubelet to introspect the application state and take corrective action.


1. THE TRINITY OF PROBES

Kubernetes uses three distinct probes to manage the container lifecycle. Understanding the "Handover" between these probes is critical.

Probe Type | Internal Purpose | Kubelet Action on Failure | Traffic Impact
---------- | ---------------- | ------------------------- | --------------
Startup | For slow-starting legacy apps. Disables Liveness/Readiness until the app is "up." | Restart container. | None (Pod is not yet Ready).
Liveness | Detects deadlocks or "frozen" states where the app cannot recover itself. | Restart container. | None (existing connections severed).
Readiness | Detects temporary high load or initialization (cache loading, DB migrations). | None (wait for success). | Isolate: IP removed from Service Endpoints.

1.1 The Execution Flow

  1. Container starts.
  2. Startup Probe begins. (Liveness and Readiness are suspended).
  3. Startup Probe succeeds.
  4. Liveness and Readiness probes begin running in parallel for the rest of the container's life.

2. PROBE MECHANISMS (HOW KUBELET CHECKS)

2.1 HTTP GET

The Kubelet sends an HTTP request.

  • Success: Status code >= 200 and < 400.
  • Bible Note: Avoid pointing this to a heavy /metrics endpoint. Create a dedicated /healthz or /readyz endpoint that is lightweight.
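As a sketch, a dedicated lightweight endpoint wired into a readiness probe might look like this (the /readyz path, port, and header are illustrative, not prescribed):

```yaml
readinessProbe:
  httpGet:
    path: /readyz      # dedicated lightweight endpoint, not /metrics
    port: 8080
    httpHeaders:       # optional custom headers, e.g. for virtual hosting
    - name: X-Probe
      value: kubelet
  periodSeconds: 5
```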

2.2 TCP Socket

Kubelet attempts to open a TCP connection on a specific port.

  • Success: TCP Handshake completes.
  • Use Case: Ideal for databases (PostgreSQL, Redis) or non-HTTP services.
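A minimal sketch for a Redis-style container (the port is illustrative). Note that a completed handshake only proves the port is open, not that the service behind it is healthy:

```yaml
livenessProbe:
  tcpSocket:
    port: 6379       # Redis default port
  periodSeconds: 10
```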

2.3 Exec Command

Kubelet executes a command inside the container's process namespace.

  • Success: Exit Code 0.
  • Caveat: Resource intensive. Kubelet forks a process inside the container for every check.
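A common "file-touch" pattern, sketched below. The /tmp/healthy path is an assumption; the application itself must create and refresh this file:

```yaml
livenessProbe:
  exec:
    command:
    - cat
    - /tmp/healthy   # app writes this file while healthy
  periodSeconds: 30  # generous period: every check forks a process in the container
```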

2.4 gRPC (enabled by default since v1.24, stable since v1.27)

Kubelet uses the gRPC Health Checking Protocol.

  • Success: Response status is SERVING.
  • Benefit: Native support for modern microservices without needing HTTP sidecars.
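A sketch, assuming the app exposes the standard grpc.health.v1 Health service on port 9090 (both values illustrative); no HTTP wrapper or exec shim is required:

```yaml
livenessProbe:
  grpc:
    port: 9090
    # service: my-service  # optional: name sent in the HealthCheckRequest
  periodSeconds: 10
```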

3. PROBE PARAMETERS (TUNING THE ENGINE)

Parameter | Default | Bible Usage / Strategy
--------- | ------- | ----------------------
initialDelaySeconds | 0 | Use Startup Probes instead of high initial delays.
periodSeconds | 10 | Frequency of checks. 5-10s is standard for production.
timeoutSeconds | 1 | If the check takes longer than this, it counts as a failure.
successThreshold | 1 | Must be 1 for Liveness and Startup. Can be >1 for Readiness to prevent "flapping."
failureThreshold | 3 | "Grace count": consecutive failures tolerated before taking action.
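These parameters combine into a worst-case detection time of roughly periodSeconds x failureThreshold (plus up to timeoutSeconds for the final check). A sketch using the defaults, with an illustrative /healthz endpoint:

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 1
  failureThreshold: 3
# Worst case: a deadlocked container survives roughly 10s * 3 = 30s
# before the Kubelet restarts it.
```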

4. PRODUCTION-GRADE MANIFEST (BIBLE STANDARD)

This manifest demonstrates a "Slow Java App" pattern:

  1. Startup Probe allows 5 minutes for initialization.
  2. Readiness Probe ensures the app can handle traffic.
  3. Liveness Probe restarts the app if it deadlocks.
apiVersion: v1
kind: Pod
metadata:
  name: enterprise-java-app
spec:
  containers:
  - name: java-app
    image: openjdk:17-jdk-slim
    ports:
    - containerPort: 8080

    # 1. STARTUP: Give the JVM 5 mins to load classes/cache
    startupProbe:
      httpGet:
        path: /health/started
        port: 8080
      failureThreshold: 30
      periodSeconds: 10        # 30 * 10s = 300s (5 mins)

    # 2. LIVENESS: Check every 20s if the app is deadlocked
    livenessProbe:
      httpGet:
        path: /health/live
        port: 8080
      initialDelaySeconds: 0   # Not needed; Startup probe handles the wait
      periodSeconds: 20
      failureThreshold: 3

    # 3. READINESS: Is the DB connection pool active?
    readinessProbe:
      httpGet:
        path: /health/ready
        port: 8080
      periodSeconds: 5
      successThreshold: 2      # Require 2 successes to join Load Balancer
      failureThreshold: 2      # Remove from LB after 2 failures

5. ARCHITECTURAL INTERNALS: HOW PROBES IMPACT TRAFFIC

5.1 The Readiness -> Endpoint Lifecycle

When a Readiness probe fails:

  1. Kubelet updates the Pod status to Ready: False.
  2. The Endpoint Controller observes the change.
  3. The Pod's IP is removed from the Endpoints (or EndpointSlice) object.
  4. Kube-Proxy on every node updates iptables/IPVS rules.
  5. Traffic stops flowing to the Pod. Note: This process is asynchronous. Expect a 1-2 second propagation delay.
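You can watch step 3 happen live. Assuming a Service named my-service (illustrative), the Pod's IP drops out of the EndpointSlice as soon as readiness fails:

```shell
# Watch the EndpointSlice for the Service; failing Pods lose their entry
kubectl get endpointslices -l kubernetes.io/service-name=my-service -w
```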

5.2 The Liveness -> Restart Lifecycle

When a Liveness probe fails:

  1. Kubelet records the failure event.
  2. Kubelet kills the container (honoring terminationGracePeriodSeconds).
  3. The Container Runtime (CRI) recreates the container.
  4. The Pod's restartCount increments.
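To confirm step 1 across the cluster, you can filter Events by reason ("Unhealthy" is the reason the Kubelet records for failed probes):

```shell
# List recent probe failures, cluster-wide
kubectl get events --field-selector reason=Unhealthy
```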

6. PRODUCTION ANTI-PATTERNS (PITFALLS)

  1. Liveness == Readiness: Never point both probes to the same logic/endpoint.
    • Scenario: Your DB goes down. If Readiness fails, traffic stops (good). If Liveness fails, K8s restarts the app (bad): restarting won't fix the DB; it only adds boot-up load to your cluster.
  2. External Dependencies in Liveness: Never include external calls (e.g., checking if an S3 bucket or external API is up) in a Liveness probe. If the external service has a blip, your entire cluster will enter a "CrashLoopBackOff" as all pods restart simultaneously.
  3. Heavy Resource Usage: Don't run exec probes that perform complex find or grep operations. This consumes CPU cycles and can trigger throttling.
  4. Failure Threshold of 1: Avoid setting failureThreshold: 1. Network blips happen. Use at least 3 to ensure stability.

7. TROUBLESHOOTING CHEATSHEET

Identify Probe Failures

kubectl describe pod <pod-name>

Look for Events:

  • Warning Unhealthy Liveness probe failed: HTTP probe failed with statuscode: 500
  • Warning Unhealthy Readiness probe failed: Get "http://10.244.1.5:8080/ready": dial tcp 10.244.1.5:8080: connect: connection refused

Watch Traffic State

kubectl get pod <pod-name> -w
# Look for the READY column (e.g., 0/1 means readiness failed)

Check Container Restart History

kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].restartCount}'
# If this number is high, your Liveness probe is likely too aggressive or your app is leaking memory.