Pods, ReplicaSets, and Deployments
1. PODS: The Atomic Unit
1.1 Internal Architecture
A Pod is not a process; it is an environment for processes. In Linux kernel terms, a Pod is a collection of namespaces and cgroups shared by a group of containers.
- The "Pause" Container: Every Pod acts as a logical host. When a Pod starts, Kubernetes first spins up a hidden "infra container" (often called the
pausecontainer). This container holds the network namespace (IP address) and IPC namespace. - Shared Context: All application containers in the Pod join the namespaces created by the pause container. This is why:
localhostworks between containers in the same Pod.- They share the same IP address.
- They share the same volume mounts.
1.2 The Lifecycle State Machine
Understanding the lifecycle is mandatory for debugging CrashLoopBackOff or stuck deployments.
| Phase | Internal State | Description & Debug Action |
|---|---|---|
| Pending | Scheduling | API Server has the object in Etcd, but the Scheduler has not found a node. Debug: kubectl get events (Look for "Insufficient CPU", "Taints", or "Unbound PVC"). |
| ContainerCreating | Pulling/Mounting | Node assigned. Kubelet is creating the sandbox, mounting CSI volumes, and pulling images. Debug: kubectl describe pod (Check "ImagePullBackOff", "ErrImagePull", "FailedMount"). |
| Running | Active | The process has started. Note: This does not mean the app is healthy; only that the PID exists. |
| Succeeded | Exit Code 0 | Process terminated gracefully. Normal for Batch Jobs. |
| Failed | Exit Code > 0 | Process crashed or was OOMKilled. Debug: kubectl logs -p (previous logs) or check memory limits. |
| Unknown | Node Lost | Controller Manager lost contact with the Node Kubelet (usually network partition or node crash). |
1.3 Container Lifecycle Hooks
Hooks allow execution of code at specific lifecycle points.
- `postStart`: Asynchronous execution. There is no guarantee this runs before the container ENTRYPOINT. Do not use this for database migrations or critical dependencies.
- `preStop`: Synchronous, blocking. Critical for graceful shutdowns.
  - The Problem: When a Pod is deleted, the Kubelet sends `SIGTERM`. Many apps (Nginx, Java) sever connections immediately.
  - The Solution: Use `preStop` to sleep (allowing Load Balancers to drain traffic) or issue a graceful shutdown command.
  - Constraint: Must complete within `terminationGracePeriodSeconds` (default 30s).
1.4 Production-Grade Pod Manifest
Do not use `kubectl run` for production definitions. Use this comprehensive reference.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: prod-payment-processor
  labels:
    app: payment-processor   # Used by Service selectors
    version: v1.2.0
    tier: backend
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
spec:
  # GRACEFUL SHUTDOWN: Time for preStop + SIGTERM handling
  terminationGracePeriodSeconds: 45
  # SCHEDULING: Soft preference for specific nodes
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
  # SECURITY: Run as non-root user (Best Practice)
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    fsGroup: 2000
  containers:
  - name: app
    image: enterprise/payment-api:1.2.0
    imagePullPolicy: IfNotPresent
    # PORTS: Informational only, does not open firewall
    ports:
    - containerPort: 8080
      name: http
    # RESOURCES: Mandatory for Scheduler and QoS classes
    resources:
      requests:
        memory: "512Mi"   # Guaranteed memory
        cpu: "250m"       # 1/4 core guaranteed
      limits:
        memory: "1Gi"     # OOMKill if exceeded
        cpu: "500m"       # Throttled if exceeded
    # PROBES: Self-healing configuration
    livenessProbe:        # Restart if dead
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 20
    readinessProbe:       # Remove from Load Balancer if failing
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
    # LIFECYCLE HOOKS
    lifecycle:
      preStop:
        exec:
          # Nginx example: quit gracefully, don't just kill
          # Generic app: sleep 5 to allow iptables propagation
          command: ["/bin/sh", "-c", "sleep 5; /usr/sbin/nginx -s quit"]
```
2. REPLICASET (RS)
2.1 The Controller Logic
The ReplicaSet is the pure "availability engine." Its reconciliation loop is simple:
- Check the current number of Pods with matching labels.
- Compare with `spec.replicas`.
- If `Current < Desired`: create a Pod.
- If `Current > Desired`: delete a Pod (usually the youngest first).
Note: In modern Kubernetes, you rarely manage RS directly. Deployments manage RS, and RS manages Pods.
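The loop above is simple enough to model in a few lines of Python. A toy sketch (the callback names are illustrative, not client-go or API types):

```python
def reconcile(current_pods, desired, create_pod, delete_pod):
    """Toy model of the ReplicaSet control loop.

    current_pods: mutable list of pod names matching the selector
    desired:      spec.replicas
    create_pod / delete_pod: callbacks standing in for API calls
    """
    diff = desired - len(current_pods)
    if diff > 0:
        for _ in range(diff):               # Current < Desired: scale up
            create_pod()
    elif diff < 0:
        # Current > Desired: scale down, youngest (last-created) first
        for pod in list(current_pods[desired:]):
            delete_pod(pod)

# Usage: start with 2 pods, converge to 4, then back down to 1
pods = ["a", "b"]
reconcile(pods, 4, lambda: pods.append(f"p{len(pods)}"), pods.remove)
```

The real controller watches the API server rather than polling, but the convergence logic is exactly this comparison of observed versus desired state.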
2.2 Selectors: Equality vs. Set-Based
ReplicaSets support complex selection logic, unlike the obsolete ReplicationController.
```yaml
spec:
  selector:
    matchExpressions:
    - {key: tier, operator: In, values: [frontend, api]}
    - {key: env, operator: NotIn, values: [dev]}
```
2.3 Debugging Pattern: Pod Quarantine
A powerful technique for debugging intermittent failures without affecting production capacity.
Scenario: One Pod in a set of 10 is throwing errors. You want to debug it, but if you kill the process or delete the Pod, the container restarts (or is replaced) and the evidence is gone.
The Fix (Label Hijacking):
- The RS tracks the Pod via the label `app=payment`.
- Overwrite the label on the broken Pod:
  `kubectl label pod payment-xyz-123 app=payment-debug --overwrite`
- Result:
  - The RS sees 9/10 pods. It immediately spins up a fresh replacement Pod to restore capacity.
  - The "broken" Pod (`payment-xyz-123`) is no longer managed by the RS. It stays running, isolated from the load balancer (Service), ready for you to `kubectl exec`, install debug tools (`strace`, `curl`), and analyze logs at your leisure.
3. DEPLOYMENT
The Deployment object is a higher-level abstraction that manages ReplicaSets to provide Declarative Updates.
3.1 Internals: How Updates Work
When you update a Deployment (e.g., change image tag), it does not patch existing Pods.
- The Deployment creates a New ReplicaSet.
- It ramps up the New RS (e.g., 0 -> 1 -> 2 replicas).
- It ramps down the Old RS (e.g., 10 -> 9 -> 8 replicas).
- This cross-scaling is controlled by the `strategy` field.
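The cross-scaling can be simulated step by step. A toy model assuming `maxUnavailable: 0` pacing (each batch of new Ready pods retires the same number of old pods; not actual controller code):

```python
def rolling_update(desired, max_surge):
    """Simulate the RS swap: surge new pods above desired, then
    retire old pods once the new ones are Ready. Returns the
    sequence of (old_rs_replicas, new_rs_replicas) states."""
    old, new = desired, 0
    history = [(old, new)]
    while old > 0:
        step = min(max_surge, desired - new)
        new += step        # New RS ramps up (temporarily exceeding desired)
        old -= step        # Old RS ramps down after new pods pass readiness
        history.append((old, new))
    return history
```

For example, `rolling_update(3, 1)` walks through `(3,0) -> (2,1) -> (1,2) -> (0,3)`: at every recorded state the total capacity never drops below the desired count.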
3.2 Deployment Strategies
A. Recreate
- Behavior: Scale the Old RS to `replicas: 0`, then the New RS to `replicas: N`.
- Result: Downtime.
- Use Case: Database schema changes where Version A and Version B cannot write to the DB simultaneously.
B. RollingUpdate (Default & Recommended)
Calculates the pace of the rollout to ensure availability.
- `maxSurge`: How many extra pods can we create above the desired count? (Can be % or integer.)
- `maxUnavailable`: How many pods can be missing from the desired count?
The Zero-Downtime Configuration: To guarantee that you never drop below 100% capacity during an update:
```yaml
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2         # Allow 12 pods total during update
      maxUnavailable: 0   # NEVER kill a pod until a new one is Ready
```
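The arithmetic behind these fields can be checked directly. Kubernetes resolves percentage values against `replicas`, rounding `maxSurge` up and `maxUnavailable` down; a small Python sketch of those rules:

```python
import math

def resolve(value, replicas, round_up):
    """Resolve an int-or-percentage field against the replica count."""
    if isinstance(value, str) and value.endswith("%"):
        frac = int(value[:-1]) / 100 * replicas
        return math.ceil(frac) if round_up else math.floor(frac)
    return value

def rollout_bounds(replicas, max_surge, max_unavailable):
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    return {
        "max_total": replicas + surge,          # pods allowed during update
        "min_available": replicas - unavailable, # floor on serving capacity
    }
```

With `replicas: 10`, `maxSurge: 2`, `maxUnavailable: 0`, this yields a ceiling of 12 pods and a floor of 10, matching the manifest comments above.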
3.3 Rollbacks and History
Kubernetes maintains a history of ReplicaSets to facilitate rollbacks.
1. Check History:
`kubectl rollout history deployment/web-app`

Sample Output:

```
REVISION  CHANGE-CAUSE
1         kubectl create deployment web-app --image=nginx:1.19 --record
2         kubectl set image deployment/web-app nginx=nginx:1.20 --record
3         kubectl set image deployment/web-app nginx=nginx:1.21 --record
```
2. Rollback:
The following command scales the ReplicaSet behind Revision 2 back up to the desired count and scales the Revision 3 RS down to 0.

`kubectl rollout undo deployment/web-app --to-revision=2`
3.4 Production Deployment Manifest
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mission-critical-api
  labels:
    app: api
spec:
  replicas: 3
  # REVISION HISTORY: Keep only last 5 RS to save Etcd space
  revisionHistoryLimit: 5
  selector:
    matchLabels:
      app: api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
      - name: api-server
        image: my-registry/api:v4.5
        # ... (Insert Pod Spec from Section 1.4) ...
```
4. Advanced Commands & Troubleshooting
4.1 "Tough" Command Examples
1. Decode the internal reason for a Pod failure:
When kubectl get pod just says "CrashLoopBackOff", you need the Exit Code and the Reason.
`kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].state}'`

Sample Output:

```
{"terminated":{"containerID":"containerd://...","exitCode":137,"reason":"OOMKilled","startedAt":"2023-10-27T10:00:00Z"}}
```
(Exit Code 137 indicates 128 + 9 (SIGKILL), confirming an Out Of Memory kill).
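The 128 + signal convention can be decoded mechanically. A small helper (not part of kubectl, just the arithmetic):

```python
import signal

def decode_exit_code(code: int) -> str:
    """Map a container exit code to a human-readable cause.

    By POSIX shell convention, codes above 128 mean the process
    was killed by signal number (code - 128).
    """
    if code > 128:
        sig = signal.Signals(code - 128)
        return f"killed by {sig.name}"
    return f"exited with status {code}"
```

`decode_exit_code(137)` reports SIGKILL (the OOM killer's signal); 143 reports SIGTERM, the usual result of an unhandled graceful shutdown.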
2. Watch a Rolling Update in real-time: Don't just wait; watch the ReplicaSets swap over.
`kubectl get rs -l app=my-app --watch`

Sample Output:

```
NAME               DESIRED   CURRENT   READY   AGE
my-app-6b474c4b7   10        10        10      5d    (Old RS)
my-app-8f92739a2   0         0         0       0s    (New RS created)
my-app-8f92739a2   3         3         0       2s    (New RS scaling up)
my-app-6b474c4b7   8         8         8       2s    (Old RS scaling down)
```
3. Force Replace (The Nuclear Option): Sometimes a Deployment gets stuck because of immutable field conflicts.
`kubectl replace --force -f deployment.yaml`
Warning: This deletes the deployment and recreates it. Causes downtime unless carefully managed.
4.2 Common Pitfalls
- Missing `imagePullSecrets`: Pod hangs in `ImagePullBackOff`.
- The `latest` tag: Avoid using `:latest`. It breaks the immutability of rollbacks (rolling back to a previous revision won't help if the image behind `:latest` has changed).
- Mismatched Selectors: If the Deployment `selector` does not match `template.metadata.labels`, the API server rejects the Deployment at creation time.
- Zombie Pods: If a `preStop` hook hangs forever, the Pod sticks in `Terminating` state. Force delete if necessary: `kubectl delete pod <pod-name> --grace-period=0 --force`