Health Probes: The Self-Healing Data Plane
In Kubernetes, "Running" does not mean "Healthy." A process can be alive (PID exists) but deadlocked, or alive but still loading cache and unable to serve traffic. Health probes allow the Kubelet to introspect the application state and take corrective action.
1. THE TRINITY OF PROBES
Kubernetes uses three distinct probes to manage the container lifecycle. Understanding the "Handover" between these probes is critical.
| Probe Type | Internal Purpose | Kubelet Action on Failure | Traffic Impact |
|---|---|---|---|
| Startup | For slow-starting legacy apps. Disables Liveness/Readiness until the app is "up." | Restart Container. | None (Pod is not yet Ready). |
| Liveness | Detects deadlocks or "frozen" states where the app cannot recover itself. | Restart Container. | None (Existing connections severed). |
| Readiness | Detects temporary high load or initialization (cache loading, DB migrations). | None. (Wait for success). | Isolate. IP removed from Service Endpoints. |
1.1 The Execution Flow
- Container starts.
- Startup Probe begins. (Liveness and Readiness are suspended).
- Startup Probe succeeds.
- Liveness and Readiness probes begin running in parallel for the rest of the container's life.
2. PROBE MECHANISMS (HOW KUBELET CHECKS)
2.1 HTTP GET
The Kubelet sends an HTTP request.
- Success: Status code >= 200 and < 400.
- Bible Note: Avoid pointing this to a heavy /metrics endpoint. Create a dedicated, lightweight /healthz or /readyz endpoint instead.
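A minimal HTTP readiness probe might look like the following sketch (the /readyz path and the custom header are illustrative, not required by Kubernetes):

```yaml
readinessProbe:
  httpGet:
    path: /readyz          # dedicated, lightweight endpoint
    port: 8080
    httpHeaders:           # optional; e.g. to bypass auth middleware
      - name: X-Probe
        value: kubelet
  periodSeconds: 10
```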
2.2 TCP Socket
Kubelet attempts to open a TCP connection on a specific port.
- Success: TCP Handshake completes.
- Use Case: Ideal for databases (PostgreSQL, Redis) or non-HTTP services.
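For a non-HTTP service such as Redis, the TCP check is a one-liner; this sketch assumes the container listens on the default Redis port:

```yaml
livenessProbe:
  tcpSocket:
    port: 6379             # Redis default port
  periodSeconds: 10
```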
2.3 Exec Command
Kubelet executes a command inside the container's process namespace.
- Success: Exit code 0.
- Caveat: Resource intensive. Kubelet forks a process inside the container for every check.
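A typical exec probe runs a cheap client command; this sketch assumes redis-cli is available inside the image:

```yaml
livenessProbe:
  exec:
    command: ["redis-cli", "ping"]   # exits 0 when the server answers PONG
  periodSeconds: 15
```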
2.4 gRPC (Beta and on by default since v1.24; stable since v1.27)
Kubelet uses the gRPC Health Checking Protocol.
- Success: Response status is SERVING.
- Benefit: Native support for modern microservices without needing HTTP sidecars.
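A gRPC probe only needs the port; the optional service field is passed to the health-check RPC (the name below is illustrative):

```yaml
livenessProbe:
  grpc:
    port: 9090
    service: my.package.MyService   # optional; omit to check overall server health
  periodSeconds: 10
```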
3. PROBE PARAMETERS (TUNING THE ENGINE)
| Parameter | Default | Bible Usage / Strategy |
|---|---|---|
| initialDelaySeconds | 0 | Prefer Startup Probes over high initial delays. |
| periodSeconds | 10 | Frequency of checks. 5-10s is standard for production. |
| timeoutSeconds | 1 | If the check takes longer than this, it counts as a failure. |
| successThreshold | 1 | Must be 1 for Liveness and Startup. Can be >1 for Readiness to prevent "flapping." |
| failureThreshold | 3 | "Grace count": consecutive failures before taking action. |
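These parameters compose into a detection budget: a failing container is acted on after roughly periodSeconds × failureThreshold seconds (plus up to timeoutSeconds per check). A worked sketch:

```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  periodSeconds: 10      # check every 10s
  timeoutSeconds: 2      # each check may take up to 2s before counting as failed
  failureThreshold: 3    # restart after ~30s of sustained failure (3 * 10s)
```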
4. PRODUCTION-GRADE MANIFEST (BIBLE STANDARD)
This manifest demonstrates a "Slow Java App" pattern:
- Startup Probe allows 5 minutes for initialization.
- Readiness Probe ensures the app can handle traffic.
- Liveness Probe restarts the app if it deadlocks.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: enterprise-java-app
spec:
  containers:
    - name: java-app
      image: openjdk:17-jdk-slim
      ports:
        - containerPort: 8080
      # 1. STARTUP: Give the JVM 5 mins to load classes/cache
      startupProbe:
        httpGet:
          path: /health/started
          port: 8080
        failureThreshold: 30
        periodSeconds: 10          # 30 * 10s = 300s (5 mins)
      # 2. LIVENESS: Check every 20s if the app is deadlocked
      livenessProbe:
        httpGet:
          path: /health/live
          port: 8080
        initialDelaySeconds: 0     # Not needed; Startup probe handles the wait
        periodSeconds: 20
        failureThreshold: 3
      # 3. READINESS: Is the DB connection pool active?
      readinessProbe:
        httpGet:
          path: /health/ready
          port: 8080
        periodSeconds: 5
        successThreshold: 2        # Require 2 successes to join the load balancer
        failureThreshold: 2        # Remove from the load balancer after 2 failures
```
5. ARCHITECTURAL INTERNALS: HOW PROBES IMPACT TRAFFIC
5.1 The Readiness -> Endpoint Lifecycle
When a Readiness probe fails:
- Kubelet sets the Pod's Ready condition to False.
- The Endpoint Controller observes the change.
- The Pod's IP is removed from the Endpoints (or EndpointSlice) object.
- Kube-Proxy on every node updates its iptables/IPVS rules.
- Traffic stops flowing to the Pod.

Note: This process is asynchronous. Expect a 1-2 second propagation delay.
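You can watch this isolation happen from the outside; the commands below assume a Service named web-svc that selects the Pod:

```
# The Pod condition flips first (READY column drops to 0/1)
kubectl get pod <pod-name> -o wide

# Then the Pod's IP disappears from the matching slice
kubectl get endpointslices -l kubernetes.io/service-name=web-svc -o yaml
```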
5.2 The Liveness -> Restart Lifecycle
When a Liveness probe fails:
- Kubelet records the failure event.
- Kubelet kills the container (honoring terminationGracePeriodSeconds).
- The Container Runtime (CRI) recreates the container per the Pod's restartPolicy.
- The Pod's restartCount increments; repeated failures back off exponentially (CrashLoopBackOff).
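The grace period honored in step 2 is set at the Pod level; a minimal sketch (the name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: graceful-app
spec:
  terminationGracePeriodSeconds: 30   # default: SIGTERM first, SIGKILL after 30s
  containers:
    - name: app
      image: nginx:1.25
```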
6. PRODUCTION ANTI-PATTERNS (PITFALLS)
- Liveness == Readiness: Never point both probes to the same logic/endpoint.
- Scenario: Your DB goes down. If Readiness fails, traffic stops (Good). If Liveness fails, K8s restarts the app (Bad). Restarting won't fix the DB, it just adds "Boot-up" load to your cluster.
- External Dependencies in Liveness: Never include external calls (e.g., checking if an S3 bucket or external API is up) in a Liveness probe. If the external service has a blip, your entire cluster will enter a "CrashLoopBackOff" as all pods restart simultaneously.
- Heavy Resource Usage: Don't run exec probes that perform complex find or grep operations. This consumes CPU cycles and can trigger throttling.
- Failure Threshold of 1: Avoid setting failureThreshold: 1. Network blips happen. Use at least 3 to ensure stability.
7. TROUBLESHOOTING CHEATSHEET
Identify Probe Failures
```
kubectl describe pod <pod-name>
```
Look for Events such as:
```
Warning  Unhealthy  Liveness probe failed: HTTP probe failed with statuscode: 500
Warning  Unhealthy  Readiness probe failed: Get "http://10.244.1.5:8080/ready": dial tcp 10.244.1.5:8080: connect: connection refused
```
Watch Traffic State
```
kubectl get pod <pod-name> -w
# Look for the READY column (e.g., 0/1 means readiness failed)
```
Check Container Restart History
```
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].restartCount}'
# If this number is high, your Liveness probe is likely too aggressive or your app is leaking memory.
```