
Pods, ReplicaSets, and Deployments

1. PODS: The Atomic Unit

1.1 Internal Architecture

A Pod is not a process; it is an environment for processes. In Linux kernel terms, a Pod is a set of shared kernel namespaces (network, IPC, and optionally PID), plus cgroups for resource limits, wrapped around a group of containers.

  • The "Pause" Container: Every Pod acts as a logical host. When a Pod starts, Kubernetes first spins up a hidden "infra container" (often called the pause container). This container holds the network namespace (IP address) and IPC namespace.
  • Shared Context: All application containers in the Pod join the namespaces created by the pause container. This is why:
    • localhost works between containers in the same Pod.
    • They share the same IP address.
    • They can share Pod-level volumes (every container that mounts a given volume sees the same data).
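A minimal sketch of this shared context, assuming a hypothetical two-container Pod (names and images are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-context-demo   # hypothetical name
spec:
  volumes:
  - name: logs
    emptyDir: {}              # Pod-level scratch volume, shared by both containers
  containers:
  - name: app
    image: nginx:1.25
    volumeMounts:
    - name: logs
      mountPath: /var/log/nginx
  - name: log-tailer
    image: busybox:1.36
    # Same network namespace: this container could also reach
    # the app at http://localhost:80 without any Service.
    command: ["/bin/sh", "-c", "tail -F /logs/access.log"]
    volumeMounts:
    - name: logs
      mountPath: /logs        # Same volume, different mount path
```

Both containers get the same Pod IP from the pause container's network namespace; the emptyDir simply demonstrates the shared-volume point above.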

1.2 The Lifecycle State Machine

Understanding the lifecycle is mandatory for debugging CrashLoopBackOff or stuck deployments.

Each phase maps to an internal state and a concrete debug action:

  • Pending (Scheduling): The API Server has persisted the object in etcd, but the Scheduler has not yet found a node.
    Debug: kubectl get events (look for "Insufficient cpu", "Taints", or "Unbound PVC").
  • ContainerCreating (Pulling/Mounting): Node assigned. The Kubelet is creating the sandbox, mounting CSI volumes, and pulling images. (Strictly a container state shown in the STATUS column; the Pod phase is still Pending.)
    Debug: kubectl describe pod (check for "ImagePullBackOff", "ErrImagePull", "FailedMount").
  • Running (Active): The process has started. Note: this does not mean the app is healthy, only that the PID exists.
  • Succeeded (Exit Code 0): The process terminated gracefully. Normal for batch Jobs.
  • Failed (Exit Code > 0): The process crashed or was OOMKilled.
    Debug: kubectl logs -p (previous container's logs) or check memory limits.
  • Unknown (Node Lost): The Controller Manager lost contact with the node's Kubelet (usually a network partition or node crash).

1.3 Container Lifecycle Hooks

Hooks allow execution of code at specific lifecycle points.

  • postStart: Async execution. There is no guarantee this runs before the container ENTRYPOINT. Do not use this for database migrations or critical dependencies.
  • preStop: Synchronous blocking. Critical for graceful shutdowns.
    • The Problem: When a Pod is deleted, Kubelet sends SIGTERM. Many apps (Nginx, Java) sever connections immediately.
    • The Solution: Use preStop to sleep (allowing Load Balancers to drain traffic) or issue a graceful shutdown command.
    • Constraint: Must complete within terminationGracePeriodSeconds (default 30s).
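A minimal sketch of both hooks side by side (container name, image, and sleep duration are illustrative; keep the preStop work well under terminationGracePeriodSeconds):

```yaml
spec:
  terminationGracePeriodSeconds: 45
  containers:
  - name: web
    image: nginx:1.25
    lifecycle:
      postStart:
        exec:
          # Async: may run before, during, or after ENTRYPOINT startup.
          # Not suitable for migrations or critical dependencies.
          command: ["/bin/sh", "-c", "echo started >> /tmp/lifecycle.log"]
      preStop:
        exec:
          # Blocking: let load balancers drain, then shut down gracefully.
          command: ["/bin/sh", "-c", "sleep 5; nginx -s quit"]
```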

1.4 Production-Grade Pod Manifest

Do not use kubectl run for production definitions. Use this comprehensive reference.

apiVersion: v1
kind: Pod
metadata:
  name: prod-payment-processor
  labels:
    app: payment-processor    # Used by Service selectors
    version: v1.2.0
    tier: backend
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
spec:
  # GRACEFUL SHUTDOWN: Time for preStop + SIGTERM handling
  terminationGracePeriodSeconds: 45

  # SCHEDULING: Soft preference for specific nodes
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd

  # SECURITY: Run as non-root user (best practice)
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    fsGroup: 2000

  containers:
  - name: app
    image: enterprise/payment-api:1.2.0
    imagePullPolicy: IfNotPresent

    # PORTS: Informational only, does not open a firewall
    ports:
    - containerPort: 8080
      name: http

    # RESOURCES: Mandatory for Scheduler and QoS classes
    resources:
      requests:
        memory: "512Mi"   # Guaranteed memory
        cpu: "250m"       # 1/4 core guaranteed
      limits:
        memory: "1Gi"     # OOMKilled if exceeded
        cpu: "500m"       # Throttled if exceeded

    # PROBES: Self-healing configuration
    livenessProbe:        # Restart if dead
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 20
    readinessProbe:       # Remove from Load Balancer if failing
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5

    # LIFECYCLE HOOKS
    lifecycle:
      preStop:
        exec:
          # Nginx example: quit gracefully, don't just kill.
          # Generic app: sleep 5 to allow iptables propagation.
          command: ["/bin/sh", "-c", "sleep 5; /usr/sbin/nginx -s quit"]

2. REPLICASET (RS)

2.1 The Controller Logic

The ReplicaSet is the pure "availability engine." Its reconciliation loop is simple:

  1. Check current number of Pods with matching labels.
  2. Compare with spec.replicas.
  3. If Current < Desired: Create Pod.
  4. If Current > Desired: Delete Pod (youngest first usually).

Note: In modern Kubernetes, you rarely manage RS directly. Deployments manage RS, and RS manages Pods.
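For reference, a minimal standalone ReplicaSet looks like this (name, label, and image are illustrative):

```yaml
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: payment-rs
spec:
  replicas: 3               # Desired count for the reconciliation loop
  selector:
    matchLabels:
      app: payment          # Pods counted by the controller
  template:
    metadata:
      labels:
        app: payment        # Must satisfy the selector above
    spec:
      containers:
      - name: app
        image: enterprise/payment-api:1.2.0
```

The controller continuously compares the number of live Pods matching `app: payment` against `replicas: 3` and creates or deletes Pods to close the gap.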

2.2 Selectors: Equality vs. Set-Based

ReplicaSets support complex selection logic, unlike the obsolete ReplicationController.

spec:
  selector:
    matchExpressions:
    - {key: tier, operator: In, values: [frontend, api]}
    - {key: env, operator: NotIn, values: [dev]}

2.3 Debugging Pattern: Pod Quarantine

A powerful technique for debugging intermittent failures without affecting production capacity.

Scenario: One Pod in a set of 10 is throwing errors. You want to debug it, but if you kubectl exec into it and kill the process, the logs are gone.

The Fix (Label Hijacking):

  1. The RS tracks the Pod via the label app=payment.
  2. Overwrite the label on the broken Pod:
    kubectl label pod payment-xyz-123 app=payment-debug --overwrite
  3. Result:
    • The RS sees 9/10 pods. It immediately spins up a fresh replacement Pod to restore capacity.
    • The "broken" Pod (payment-xyz-123) is no longer managed by the RS. It stays running, isolated from the load balancer (Service), ready for you to kubectl exec, install debug tools (strace, curl), and analyze logs at your leisure.

3. DEPLOYMENT

The Deployment object is a higher-level abstraction that manages ReplicaSets to provide Declarative Updates.

3.1 Internals: How Updates Work

When you update a Deployment (e.g., change image tag), it does not patch existing Pods.

  1. Deployment creates a New ReplicaSet.
  2. It ramps up the New RS (e.g., 0 -> 1 -> 2 replicas).
  3. It ramps down the Old RS (e.g., 10 -> 9 -> 8 replicas).
  4. This cross-scaling is controlled by the Strategy.

3.2 Deployment Strategies

A. Recreate

  • Behavior: replicas: 0 (Old) -> replicas: N (New).
  • Result: Downtime.
  • Use Case: Database schema changes where Version A and Version B cannot write to the DB simultaneously.
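In manifest form, Recreate is a one-line strategy change (only this stanza differs from a RollingUpdate Deployment):

```yaml
spec:
  strategy:
    type: Recreate   # Scale old RS to 0 before creating any new Pods
```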

B. RollingUpdate (Default)

Calculates the pace of the rollout to ensure availability.

  • maxSurge: How many extra pods can we create above the desired count? (Can be % or integer).
  • maxUnavailable: How many pods can be missing from the desired count?

The Zero-Downtime Configuration: To guarantee that you never drop below 100% capacity during an update:

spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2          # Allow up to 12 Pods total during the update
      maxUnavailable: 0    # NEVER kill a Pod until a new one is Ready

3.3 Rollbacks and History

Kubernetes maintains a history of ReplicaSets to facilitate rollbacks.

1. Check History:

kubectl rollout history deployment/web-app

Sample Output:

REVISION  CHANGE-CAUSE
1         kubectl create deployment web-app --image=nginx:1.19 --record
2         kubectl set image deployment/web-app nginx=nginx:1.20 --record
3         kubectl set image deployment/web-app nginx=nginx:1.21 --record

2. Rollback: The following command scales the Revision 2 ReplicaSet back up to the desired count and scales the Revision 3 ReplicaSet down to 0.

kubectl rollout undo deployment/web-app --to-revision=2

3.4 Production Deployment Manifest

apiVersion: apps/v1
kind: Deployment
metadata:
  name: mission-critical-api
  labels:
    app: api
spec:
  replicas: 3
  # REVISION HISTORY: Keep only the last 5 ReplicaSets to save etcd space
  revisionHistoryLimit: 5

  selector:
    matchLabels:
      app: api

  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%

  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
      - name: api-server
        image: my-registry/api:v4.5
        # ... (Insert Pod Spec from Section 1.4) ...

4. Advanced Commands & Troubleshooting

4.1 "Tough" Command Examples

1. Decode the internal reason for a Pod failure: When kubectl get pod just says "CrashLoopBackOff", you need the Exit Code and the Reason.

kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].state}'

Sample Output:

{"terminated":{"containerID":"containerd://...","exitCode":137,"reason":"OOMKilled","startedAt":"2023-10-27T10:00:00Z"}}

(Exit Code 137 indicates 128 + 9 (SIGKILL), confirming an Out Of Memory kill).

2. Watch a Rolling Update in real-time: Don't just wait; watch the ReplicaSets swap over.

kubectl get rs -l app=my-app --watch

Sample Output:

NAME               DESIRED   CURRENT   READY   AGE
my-app-6b474c4b7   10        10        10      5d    (Old RS)
my-app-8f92739a2   0         0         0       0s    (New RS created)
my-app-8f92739a2   3         3         0       2s    (New RS scaling up)
my-app-6b474c4b7   8         8         8       2s    (Old RS scaling down)

3. Force Replace (The Nuclear Option): Sometimes a Deployment gets stuck because of immutable field conflicts.

kubectl replace --force -f deployment.yaml

Warning: This deletes the deployment and recreates it. Causes downtime unless carefully managed.

4.2 Common Pitfalls

  1. Missing imagePullSecrets: Pod hangs in ImagePullBackOff.
  2. latest tag: Avoid using :latest. It breaks the immutability of rollbacks (rolling back to previous revision won't help if the image behind :latest has changed).
  3. Mismatched Selectors: If the Deployment's spec.selector does not match template.metadata.labels, the API server rejects the object with a validation error ("selector does not match template labels").
  4. Zombie Pods: If a preStop hook hangs forever, the Pod sticks in Terminating state. Force delete if necessary:
    kubectl delete pod <pod-name> --grace-period=0 --force
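For pitfall 1, the fix is to reference a registry credential Secret in the Pod spec (the secret name and registry here are illustrative):

```yaml
spec:
  imagePullSecrets:
  - name: regcred   # e.g. created via: kubectl create secret docker-registry regcred ...
  containers:
  - name: app
    image: private.registry.example/app:v1
```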