
Health Probes: The Self-Healing Data Plane

In Kubernetes, "Running" does not mean "Healthy." A process can be alive (PID exists) but deadlocked, or alive but still loading cache and unable to serve traffic. Health probes allow the Kubelet to introspect the application state and take corrective action.


1. THE TRINITY OF PROBES

Kubernetes uses three distinct probes to manage the container lifecycle. Understanding the "Handover" between these probes is critical.

Probe Type | Internal Purpose | Kubelet Action on Failure | Traffic Impact
---------- | ---------------- | ------------------------- | --------------
Startup | For slow-starting legacy apps. Disables Liveness/Readiness until the app is "up." | Restart container. | None (Pod is not yet Ready).
Liveness | Detects deadlocks or "frozen" states where the app cannot recover itself. | Restart container. | None (existing connections severed).
Readiness | Detects temporary high load or initialization (cache loading, DB migrations). | None (wait for success). | Isolate: IP removed from Service Endpoints.

1.1 The Execution Flow

  1. Container starts.
  2. Startup Probe begins. (Liveness and Readiness are suspended).
  3. Startup Probe succeeds.
  4. Liveness and Readiness probes begin running in parallel for the rest of the container's life.

2. PROBE MECHANISMS (HOW KUBELET CHECKS)

2.1 HTTP GET

The Kubelet sends an HTTP request.

  • Success: Status code >= 200 and < 400.
  • Bible Note: Avoid pointing this to a heavy /metrics endpoint. Create a dedicated /healthz or /readyz endpoint that is lightweight.
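As a sketch, a dedicated lightweight endpoint wired into a readiness probe might look like this (the /readyz path, port, and header are illustrative, not prescribed):

```yaml
readinessProbe:
  httpGet:
    path: /readyz      # dedicated lightweight endpoint, not /metrics
    port: 8080
    httpHeaders:       # optional custom headers, e.g. for virtual hosting
    - name: X-Probe
      value: kubelet
  periodSeconds: 5
```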

2.2 TCP Socket

Kubelet attempts to open a TCP connection on a specific port.

  • Success: TCP Handshake completes.
  • Use Case: Ideal for databases (PostgreSQL, Redis) or non-HTTP services.
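A minimal sketch for a Redis-style container (the port is illustrative). Note that a completed handshake only proves the port is open, not that the service behind it is healthy:

```yaml
livenessProbe:
  tcpSocket:
    port: 6379       # Redis default port
  periodSeconds: 10
```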

2.3 Exec Command

Kubelet executes a command inside the container's process namespace.

  • Success: Exit Code 0.
  • Caveat: Resource intensive. Kubelet forks a process inside the container for every check.
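A common "file-touch" pattern, sketched below. The /tmp/healthy path is an assumption; the application itself must create and refresh this file:

```yaml
livenessProbe:
  exec:
    command:
    - cat
    - /tmp/healthy   # app writes this file while healthy
  periodSeconds: 30  # generous period: every check forks a process in the container
```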

2.4 gRPC (enabled by default since v1.24, stable since v1.27)

Kubelet uses the gRPC Health Checking Protocol.

  • Success: Response status is SERVING.
  • Benefit: Native support for modern microservices without needing HTTP sidecars.
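A sketch, assuming the app exposes the standard grpc.health.v1 Health service on port 9090 (both values illustrative); no HTTP wrapper or exec shim is required:

```yaml
livenessProbe:
  grpc:
    port: 9090
    # service: my-service  # optional: name sent in the HealthCheckRequest
  periodSeconds: 10
```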

3. PROBE PARAMETERS (TUNING THE ENGINE)

Parameter | Default | Bible Usage / Strategy
--------- | ------- | ----------------------
initialDelaySeconds | 0 | Use Startup Probes instead of high initial delays.
periodSeconds | 10 | Frequency of checks. 5-10s is standard for production.
timeoutSeconds | 1 | If the check takes longer than this, it counts as a failure.
successThreshold | 1 | Must be 1 for Liveness and Startup. Can be >1 for Readiness to prevent "flapping."
failureThreshold | 3 | "Grace count": consecutive failures tolerated before taking action.
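These parameters combine into a worst-case detection time of roughly periodSeconds x failureThreshold (plus up to timeoutSeconds for the final check). A sketch using the defaults, with an illustrative /healthz endpoint:

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 1
  failureThreshold: 3
# Worst case: a deadlocked container survives roughly 10s * 3 = 30s
# before the Kubelet restarts it.
```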

4. PRODUCTION-GRADE MANIFEST (BIBLE STANDARD)

This manifest demonstrates a "Slow Java App" pattern:

  1. Startup Probe allows 5 minutes for initialization.
  2. Readiness Probe ensures the app can handle traffic.
  3. Liveness Probe restarts the app if it deadlocks.
apiVersion: v1
kind: Pod
metadata:
  name: enterprise-java-app
spec:
  containers:
  - name: java-app
    image: openjdk:17-jdk-slim
    ports:
    - containerPort: 8080

    # 1. STARTUP: Give the JVM 5 mins to load classes/cache
    startupProbe:
      httpGet:
        path: /health/started
        port: 8080
      failureThreshold: 30
      periodSeconds: 10        # 30 * 10s = 300s (5 mins)

    # 2. LIVENESS: Check every 20s if the app is deadlocked
    livenessProbe:
      httpGet:
        path: /health/live
        port: 8080
      initialDelaySeconds: 0   # Not needed; Startup probe handles the wait
      periodSeconds: 20
      failureThreshold: 3

    # 3. READINESS: Is the DB connection pool active?
    readinessProbe:
      httpGet:
        path: /health/ready
        port: 8080
      periodSeconds: 5
      successThreshold: 2      # Require 2 successes to join Load Balancer
      failureThreshold: 2      # Remove from LB after 2 failures

5. ARCHITECTURAL INTERNALS: HOW PROBES IMPACT TRAFFIC

5.1 The Readiness -> Endpoint Lifecycle

When a Readiness probe fails:

  1. Kubelet updates the Pod status to Ready: False.
  2. The Endpoint Controller observes the change.
  3. The Pod's IP is removed from the Endpoints (or EndpointSlice) object.
  4. Kube-Proxy on every node updates iptables/IPVS rules.
  5. Traffic stops flowing to the Pod. Note: This process is asynchronous. Expect a 1-2 second propagation delay.
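You can watch step 3 happen live. Assuming a Service named my-service (illustrative), the Pod's IP drops out of the EndpointSlice as soon as readiness fails:

```shell
# Watch the EndpointSlice for the Service; failing Pods lose their entry
kubectl get endpointslices -l kubernetes.io/service-name=my-service -w
```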

5.2 The Liveness -> Restart Lifecycle

When a Liveness probe fails:

  1. Kubelet records the failure event.
  2. Kubelet kills the container (honoring terminationGracePeriodSeconds).
  3. The Container Runtime (CRI) recreates the container.
  4. The Pod's restartCount increments.
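To confirm step 1 across the cluster, you can filter Events by reason ("Unhealthy" is the reason the Kubelet records for failed probes):

```shell
# List recent probe failures, cluster-wide
kubectl get events --field-selector reason=Unhealthy
```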

6. PRODUCTION ANTI-PATTERNS (PITFALLS)

  1. Liveness == Readiness: Never point both probes to the same logic/endpoint.
    • Scenario: Your DB goes down. If Readiness fails, traffic stops (good). If Liveness fails, K8s restarts the app (bad): restarting won't fix the DB; it only adds boot-up load to your cluster.
  2. External Dependencies in Liveness: Never include external calls (e.g., checking if an S3 bucket or external API is up) in a Liveness probe. If the external service has a blip, your entire cluster will enter a "CrashLoopBackOff" as all pods restart simultaneously.
  3. Heavy Resource Usage: Don't run exec probes that perform complex find or grep operations. This consumes CPU cycles and can trigger throttling.
  4. Failure Threshold of 1: Avoid setting failureThreshold: 1. Network blips happen. Use at least 3 to ensure stability.

7. TROUBLESHOOTING CHEATSHEET

Identify Probe Failures

kubectl describe pod <pod-name>

Look for Events:

  • Warning Unhealthy Liveness probe failed: HTTP probe failed with statuscode: 500
  • Warning Unhealthy Readiness probe failed: Get "http://10.244.1.5:8080/ready": dial tcp 10.244.1.5:8080: connect: connection refused

Watch Traffic State

kubectl get pod <pod-name> -w
# Look for the READY column (e.g., 0/1 means readiness failed)

Check Container Restart History

kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].restartCount}'
# If this number is high, your Liveness probe is likely too aggressive or your app is leaking memory.