
Pod Lifecycle: Termination, Restarts, and Pull Policies


1. POD TERMINATION: THE GRACEFUL RACE

When a Pod is marked for deletion, Kubernetes initiates two parallel workflows. A common production failure is not accounting for the race condition between these two.

1.1 The Internal Sequence

  1. State Change: Pod metadata.deletionTimestamp is set. Status becomes Terminating.
  2. Workflow A (Networking): The Service Controller and Endpoint Controller observe the deletion and remove the Pod's IP from all Endpoints/EndpointSlices.
  3. Workflow B (Node-Level):
    • preStop Hook: If defined, the Kubelet executes the preStop hook synchronously.
    • SIGTERM: Kubelet sends SIGTERM (Signal 15) to PID 1 inside each container.
    • Grace Period: Kubelet waits for terminationGracePeriodSeconds (default 30s).
    • SIGKILL: If containers are still running, Kubelet sends SIGKILL (Signal 9).
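The SIGTERM step only helps if PID 1 actually traps the signal. A minimal sketch of a shell entrypoint that does so (the drain logic here is a placeholder; a real app would close listeners and flush buffers):

```shell
#!/bin/sh
# Trap SIGTERM so the container shuts down during the grace period,
# instead of being killed by SIGKILL when the grace period expires.
cleanup() {
  echo "SIGTERM received, draining..."
  # placeholder: stop accepting connections, flush state, etc.
  exit 0
}
trap cleanup TERM

# Run sleep in the background and 'wait' on it so the trap fires
# immediately; a foreground 'sleep' would delay signal handling.
while true; do
  sleep 1 &
  wait $!
done
```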

1.2 The Race Condition Pitfall

The Problem: Workflow A (removing IPs from Load Balancers) and Workflow B (killing the process) happen simultaneously. In highly distributed clusters, iptables/IPVS updates can take several seconds to propagate to all nodes.

The Result: A Load Balancer might send a request to a Pod that has already received SIGTERM and closed its listener, resulting in a 502 Bad Gateway.

The Bible-Grade Solution: Use a preStop hook to delay the SIGTERM, giving the network layer time to finish propagation.

spec:
  terminationGracePeriodSeconds: 60
  containers:
  - name: app
    image: my-app:v1
    lifecycle:
      preStop:
        exec:
          # Wait 10 seconds for Endpoints to propagate before SIGTERM
          command: ["/bin/sh", "-c", "sleep 10"]

2. RESTART POLICY & EXPONENTIAL BACKOFF

The restartPolicy (defined at spec.restartPolicy) determines how the Kubelet reacts to a container exit.

Policy    | Exit Code 0 (Success) | Exit Code >0 (Error) | Internal Logic
----------|-----------------------|----------------------|---------------------------------------
Always    | Restart               | Restart              | Standard for Deployments/StatefulSets.
OnFailure | Do Not Restart        | Restart              | Standard for Jobs/CronJobs.
Never     | Do Not Restart        | Do Not Restart       | One-off tasks/debug sessions.
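As a sketch, a Job that retries on failure pairs restartPolicy: OnFailure with a backoffLimit (the name, image, and command below are placeholders):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration              # placeholder name
spec:
  backoffLimit: 4                 # give up after 4 failed retries
  template:
    spec:
      restartPolicy: OnFailure    # restart the container on exit code > 0
      containers:
      - name: migrate
        image: my-migrator:v1.0.0 # placeholder image
        command: ["/bin/sh", "-c", "run-migration"]
```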

2.1 The Backoff Algorithm

To prevent a "Hot Loop" (a crashing container burning CPU and flooding logs by restarting continuously), the Kubelet implements an Exponential Backoff, surfaced in Pod status as CrashLoopBackOff.

  • Initial delay: 10 seconds.
  • Multiplier: 2x per subsequent failure.
  • Maximum delay: 300 seconds (5 minutes).
  • Reset: If a container runs successfully for 10 minutes, the Kubelet resets the backoff timer.
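These rules amount to a capped exponential, delay(n) = min(10 × 2^n, 300) seconds. A quick shell sketch of the schedule (an illustration of the math, not Kubelet code):

```shell
# Delay before each restart: doubles from 10s, capped at 300s.
delay=10
for attempt in 1 2 3 4 5 6 7; do
  echo "restart ${attempt}: wait ${delay}s"
  delay=$(( delay * 2 ))
  if [ "$delay" -gt 300 ]; then delay=300; fi
done
# prints delays of 10, 20, 40, 80, 160, 300, 300 seconds
```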

3. IMAGE PULL POLICY: MECHANICS & CACHING

Pulling images is often the slowest part of the Pod lifecycle.

Policy       | Technical Behavior
-------------|-------------------------------------------------------------------
Always       | Kubelet queries the container registry for the Digest (SHA256). If the local digest doesn't match the remote, it pulls the image.
IfNotPresent | Kubelet checks the local node cache. If the Tag exists locally, it skips the registry check entirely.
Never        | Kubelet assumes the image is pre-loaded on the node (e.g., via AMI or a specialized disk). Fails with ErrImageNeverPull if the image is missing.
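A minimal container spec that pins a version and makes the pull behavior explicit (the image name is a placeholder):

```yaml
containers:
- name: app
  image: my-app:v1.2.3            # immutable tag, never :latest
  imagePullPolicy: IfNotPresent   # serve from node cache; pull only on first run
```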

3.1 The :latest Trap

If you use the :latest tag (or no tag), Kubernetes implicitly sets the pull policy to Always. This introduces a dependency on the registry for every single Pod restart/scale event. Production Requirement: Always use specific semantic versions (e.g., v1.2.3) to ensure IfNotPresent works as intended and to guarantee deterministic rollbacks.


4. THE POD STATE MACHINE (Phases vs. Conditions)

A Pod's "Phase" is a high-level summary. For real debugging, you must look at Conditions.

4.1 Pod Phases

  • Pending: The API Server accepted the Pod, but it hasn't been scheduled or the image is still pulling.
  • Running: All containers are created; at least one is still running.
  • Succeeded/Failed: Terminal states (Job completed or crashed).
  • Unknown: Kubelet is unresponsive (Node loss).

4.2 Pod Conditions (The "Truth")

Conditions provide the "Why" behind the "Phase."

kubectl get pod <name> -o jsonpath='{.status.conditions[*]}'
Condition       | Meaning
----------------|-----------------------------------------------------------------
PodScheduled    | The Scheduler has assigned the Pod to a Node.
Initialized     | All Init Containers have finished successfully.
ContainersReady | All application containers have passed their Readiness Probes.
Ready           | The Pod's IP is published to the Service's EndpointSlices, so it can receive traffic.

5. TROUBLESHOOTING & ARCHITECT COMMANDS

5.1 Inspecting Termination Reasons

If a Pod vanished or was killed, check the Last State.

kubectl get pod <pod-name> -o json | jq '.status.containerStatuses[0].lastState'

Sample Output (OOM):

{
  "terminated": {
    "exitCode": 137,
    "reason": "OOMKilled",
    "startedAt": "2023-10-27T10:00:00Z",
    "finishedAt": "2023-10-27T10:05:00Z"
  }
}

5.2 Common Exit Codes for Senior Engineers

  • 0: Graceful exit.
  • 1: Application-level crash (check the logs).
  • 137: SIGKILL, i.e. 128 + 9 (likely OOMKilled or the grace period expired).
  • 139: Segmentation fault, i.e. 128 + 11 (SIGSEGV).
  • 143: SIGTERM, i.e. 128 + 15 (normal shutdown).
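Codes above 128 follow the POSIX convention exit code = 128 + signal number, which is how you decode them. A small shell sketch:

```shell
# Decode a container exit code: values > 128 mean death by signal.
code=137
if [ "$code" -gt 128 ]; then
  echo "killed by signal $(( code - 128 ))"   # 137 -> 9 (SIGKILL), 143 -> 15 (SIGTERM)
else
  echo "application exit status $code"
fi
```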

6. PRODUCTION CHECKLIST

  1. Handle SIGTERM: Ensure your application's entrypoint (PID 1) is not a shell script that eats signals. Use exec my-binary in your shell wrapper or use tools like tini.
  2. Match Grace Period to App: If your Java app takes 45 seconds to flush a buffer, set terminationGracePeriodSeconds: 60.
  3. Use preStop for Propagations: In high-traffic environments, a sleep 5 in the preStop hook is the simplest way to ensure zero-downtime during rolling updates.
  4. Avoid Latest: Use immutable tags so IfNotPresent can serve Pods from the node cache; a cached, pinned image keeps Pods starting (no ImagePullBackOff) even during a regional registry outage.