Pod Lifecycle: Termination, Restarts, and Pull Policies
1. POD TERMINATION: THE GRACEFUL RACE
When a Pod is marked for deletion, Kubernetes initiates two parallel workflows. A common production failure is not accounting for the race condition between these two.
1.1 The Internal Sequence
- State Change: The Pod's `metadata.deletionTimestamp` is set. Status becomes `Terminating`.
- Workflow A (Networking): The Endpoint and EndpointSlice controllers observe the deletion and remove the Pod's IP from all Endpoints/EndpointSlices.
- Workflow B (Node-Level):
  - preStop Hook: If defined, the Kubelet executes the `preStop` hook synchronously.
  - SIGTERM: The Kubelet sends `SIGTERM` (Signal 15) to PID 1 inside each container.
  - Grace Period: The Kubelet waits up to `terminationGracePeriodSeconds` (default 30s).
  - SIGKILL: If containers are still running, the Kubelet sends `SIGKILL` (Signal 9).
1.2 The Race Condition Pitfall
The Problem: Workflow A (removing IPs from Load Balancers) and Workflow B (killing the process) happen simultaneously. In highly distributed clusters, iptables/IPVS updates can take several seconds to propagate to all nodes.
The Result: A Load Balancer might send a request to a Pod that has already received SIGTERM and closed its listener, resulting in a 502 Bad Gateway.
The Bible-Grade Solution: Use a `preStop` hook to delay the `SIGTERM`, giving the network layer time to finish propagation.
```yaml
spec:
  terminationGracePeriodSeconds: 60
  containers:
  - name: app
    image: my-app:v1
    lifecycle:
      preStop:
        exec:
          # Wait 10 seconds for Endpoints to propagate before SIGTERM is sent
          command: ["/bin/sh", "-c", "sleep 10"]
```
2. RESTART POLICY & EXPONENTIAL BACKOFF
The restartPolicy (defined at spec.restartPolicy) determines how the Kubelet reacts to a container exit.
| Policy | Exit Code 0 (Success) | Exit Code >0 (Error) | Internal Logic |
|---|---|---|---|
| `Always` | Restart | Restart | Standard for Deployments/StatefulSets. |
| `OnFailure` | Do not restart | Restart | Standard for Jobs/CronJobs. |
| `Never` | Do not restart | Do not restart | One-off tasks/debug sessions. |
2.1 The Backoff Algorithm
To prevent a "Hot Loop" (draining CPU/Logs by restarting a crashing container thousands of times per second), Kubelet implements an Exponential Backoff.
- Initial delay: 10 seconds.
- Multiplier: 2x per subsequent failure.
- Maximum delay: 300 seconds (5 minutes).
- Reset: If a container runs successfully for 10 minutes, the Kubelet resets the backoff timer.
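The delay schedule above can be sketched as a one-liner (parameters mirror the values stated in the list; the function name is ours, not the Kubelet's):

```python
def backoff_delays(failures, base=10, cap=300):
    """Delay before the n-th restart: base * 2**(n-1) seconds, capped."""
    return [min(base * 2 ** (n - 1), cap) for n in range(1, failures + 1)]

print(backoff_delays(7))  # → [10, 20, 40, 80, 160, 300, 300]
```

Note the practical consequence: after about five consecutive crashes, every further restart attempt is 5 minutes apart, which is why a `CrashLoopBackOff` Pod can look "stuck".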
3. IMAGE PULL POLICY: MECHANICS & CACHING
Pulling images is often the slowest part of the Pod lifecycle.
| Policy | Technical Behavior |
|---|---|
| `Always` | Kubelet queries the container registry for the digest (SHA256). If the local digest doesn't match the remote, it pulls the image. |
| `IfNotPresent` | Kubelet checks the local node cache. If the tag exists locally, it skips the registry check entirely. |
| `Never` | Kubelet assumes the image is pre-loaded on the node (e.g., via AMI or specialized disk). Fails with `ErrImageNeverPull` if the image is missing. |
3.1 The :latest Trap
If you use the `:latest` tag (or no tag) and do not set `imagePullPolicy` explicitly, Kubernetes defaults the pull policy to `Always`. This introduces a dependency on the registry for every single Pod restart/scale event.
Production Requirement: Always use specific semantic versions (e.g., v1.2.3) to ensure IfNotPresent works as intended and to guarantee deterministic rollbacks.
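As a concrete sketch (image name and registry are hypothetical), pinning the tag and stating the policy explicitly makes the caching behavior visible in the manifest:

```yaml
containers:
- name: app
  image: registry.example.com/my-app:v1.2.3  # immutable tag, never :latest
  imagePullPolicy: IfNotPresent              # serve from the node cache when possible
```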
4. THE POD STATE MACHINE (Phases vs. Conditions)
A Pod's "Phase" is a high-level summary. For real debugging, you must look at Conditions.
4.1 Pod Phases
- Pending: The API Server accepted the Pod, but it hasn't been scheduled or the image is still pulling.
- Running: All containers are created; at least one is still running.
- Succeeded/Failed: Terminal states (Job completed or crashed).
- Unknown: Kubelet is unresponsive (Node loss).
4.2 Pod Conditions (The "Truth")
Conditions provide the "Why" behind the "Phase."
```shell
kubectl get pod <name> -o jsonpath='{.status.conditions[*]}'
```
| Condition | Meaning |
|---|---|
| `PodScheduled` | The Scheduler has assigned a Node. |
| `Initialized` | All Init Containers have finished successfully. |
| `ContainersReady` | All application containers have passed their Readiness Probes. |
| `Ready` | The Pod can serve traffic and is included in Service endpoints (i.e., behind the Load Balancer). |
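When debugging, the signal you want is usually "which conditions are not `True`, and why". A small sketch of that filter (the pod status JSON here is a hypothetical sample, shaped like the `conditions` array `kubectl get pod -o json` returns):

```python
import json

# Hypothetical sample of a Pod's .status.conditions array.
pod_status = json.loads("""
{
  "conditions": [
    {"type": "PodScheduled", "status": "True"},
    {"type": "Initialized", "status": "True"},
    {"type": "ContainersReady", "status": "False", "reason": "ContainersNotReady"},
    {"type": "Ready", "status": "False", "reason": "ContainersNotReady"}
  ]
}
""")

# Surface only the failing conditions -- the "why" behind the phase.
failing = [c for c in pod_status["conditions"] if c["status"] != "True"]
for c in failing:
    print(c["type"], "->", c.get("reason", "unknown"))
```

Here the output would point straight at a Readiness Probe problem (`ContainersReady` false), rather than leaving you staring at a generic `Running` phase.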
5. TROUBLESHOOTING & ARCHITECT COMMANDS
5.1 Inspecting Termination Reasons
If a Pod vanished or was killed, check the Last State.
```shell
kubectl get pod <pod-name> -o json | jq '.status.containerStatuses[0].lastState'
```
Sample Output (OOM):
```json
{
  "terminated": {
    "exitCode": 137,
    "reason": "OOMKilled",
    "startedAt": "2023-10-27T10:00:00Z",
    "finishedAt": "2023-10-27T10:05:00Z"
  }
}
```
5.2 Common Exit Codes for Senior Engineers
- 0: Graceful exit.
- 1: Application-level crash (check logs).
- 137: `SIGKILL` (likely OOMKilled or grace period expired).
- 139: Segmentation fault (`SIGSEGV`).
- 143: `SIGTERM` (normal shutdown).
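The pattern behind these numbers: an exit code above 128 encodes the fatal signal as `128 + signum`. A tiny decoder sketch (function name is ours):

```python
def decode_exit_code(code):
    """Container exit codes above 128 encode a fatal signal: 128 + signum."""
    names = {9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}
    if code > 128:
        sig = code - 128
        return names.get(sig, f"signal {sig}")
    return "application exit"

print(decode_exit_code(137))  # → SIGKILL  (128 + 9)
print(decode_exit_code(139))  # → SIGSEGV  (128 + 11)
print(decode_exit_code(143))  # → SIGTERM  (128 + 15)
```

So 137 vs. 143 tells you at a glance whether the container was killed forcibly (grace period expired, or the OOM killer) or shut down on request.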
6. PRODUCTION CHECKLIST
- Handle SIGTERM: Ensure your application's entrypoint (PID 1) is not a shell script that swallows signals. Use `exec my-binary` in your shell wrapper, or use an init tool like `tini`.
- Match Grace Period to App: If your Java app takes 45 seconds to flush a buffer, set `terminationGracePeriodSeconds: 60`.
- Use preStop for Propagation: In high-traffic environments, a `sleep 5` in the `preStop` hook is the simplest way to ensure zero downtime during rolling updates.
- Avoid Latest: Use immutable tags so `IfNotPresent` can serve from the node cache, preventing `ImagePullBackOff` during regional registry outages when the image is already cached.