Project Lab 03: Zero-Downtime Rolling Update and Service Handoff

Achieving true zero-downtime deployments is a core requirement for production workloads. This lab demonstrates how to configure a Deployment to perform an update in a non-disruptive, highly available manner, controlling the speed and safety of the rollout with maxUnavailable, maxSurge, and a Readiness Probe.

Reference Material: docs/02-workloads-services/1-pods-rc-rs-deployment.md


1. OBJECTIVE: ROLLING UPDATE WITH 100% AVAILABILITY

The goal is to update a running application from version v1.0.0 to v2.0.0 while maintaining the full desired replica count, ensuring no user ever receives an error.

  • Key Setting: We will use maxUnavailable: 0 to ensure no old pods are terminated until a new pod is fully Ready.

2. LAB SETUP: THE VERSIONED SERVICE

We use a simple NGINX deployment that exposes its application version via a custom header.

2.1 The Application Image (Conceptual)

We need two versions of an image (v1.0.0 and v2.0.0) that take time to start.

Image Tag       Application State               Readiness Probe Path
v1.0.0 (Old)    Returns {"version": "v1.0.0"}   /ready returns 200 OK
v2.0.0 (New)    Returns {"version": "v2.0.0"}   /ready returns 200 OK after 30 seconds
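One way to realize the 30-second readiness delay in the v2.0.0 image is an entrypoint script that creates the probe target only after a delay. This is a hypothetical sketch, not part of the lab images: the file path and the choice of serving /ready as a static NGINX file are assumptions.

#!/bin/sh
# Hypothetical entrypoint for custom-registry/router-app:v2.0.0.
# NGINX serves /ready as a static file; it is created only after 30s,
# so the readiness probe gets 404 (not Ready) until then.
rm -f /usr/share/nginx/html/ready
( sleep 30 && touch /usr/share/nginx/html/ready ) &
exec nginx -g 'daemon off;'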

2.2 Initial Deployment Manifest (v1-initial.yaml)

We start with 4 replicas and use preferred Pod Anti-Affinity to spread them across nodes (a best practice; because it is preferred rather than required, scheduling still succeeds on clusters with fewer nodes than replicas).

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-router
  labels:
    app: api-router
    version: v1
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # CRITICAL: No pods down at any time
      maxSurge: 1         # CRITICAL: Allow 1 extra pod during the rollout (4+1 = 5 max pods)
  selector:
    matchLabels:
      app: api-router
  template:
    metadata:
      labels:
        app: api-router
        version: v1
    spec:
      # Spread replicas across different nodes
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: api-router
              topologyKey: "kubernetes.io/hostname"
      containers:
      - name: router-app
        image: custom-registry/router-app:v1.0.0   # Old version
        ports:
        - containerPort: 8080
        resources:
          limits:
            cpu: "250m"
        readinessProbe:   # The key to a safe rollout
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
          failureThreshold: 3

2.3 Service Manifest (service.yaml)

The service only selects on the generic app: api-router label.

apiVersion: v1
kind: Service
metadata:
  name: api-svc
spec:
  type: ClusterIP
  selector:
    app: api-router   # Will target both v1 and v2 during the rollout
  ports:
  - port: 80
    targetPort: 8080

3. EXECUTION AND AUDIT

3.1 Initial Deployment

Apply the initial manifests.

kubectl apply -f v1-initial.yaml
kubectl apply -f service.yaml
kubectl get pods -l app=api-router
kubectl get endpoints api-svc

Expected State: 4 Pods (v1 only), Service Endpoints = 4 IPs.

3.2 The Upgrade Command

Update the Deployment's image to the new version (v2.0.0) and update the Pod template's version label so it stays accurate. Note that the image change alone is enough to trigger a new ReplicaSet; because the two commands below each modify the Pod template, they trigger two successive rollouts.

kubectl set image deployment/api-router router-app=custom-registry/router-app:v2.0.0
kubectl patch deployment api-router -p '{"spec":{"template":{"metadata":{"labels":{"version":"v2"}}}}}'
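If you prefer both template changes to roll out as a single new ReplicaSet rather than two back-to-back rollouts, one option (not required for this lab) is to pause the Deployment, make both edits, then resume:

kubectl rollout pause deployment/api-router
kubectl set image deployment/api-router router-app=custom-registry/router-app:v2.0.0
kubectl patch deployment api-router -p '{"spec":{"template":{"metadata":{"labels":{"version":"v2"}}}}}'
kubectl rollout resume deployment/api-router

While paused, template changes accumulate without triggering a rollout; the resume starts one combined rollout.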

3.3 The Rolling Update Audit (The Core Test)

Use kubectl rollout status and watch the ReplicaSets in parallel.

# Terminal 1: Watch the Rollout Status
kubectl rollout status deployment/api-router -w

# Terminal 2: Watch the Pods and their Readiness
kubectl get pods -l app=api-router -L version -w

# Terminal 3: The Load Test (The Zero-Downtime Proof)
# api-svc is a ClusterIP Service, so the loop must run inside the cluster,
# e.g. from a debug pod that has curl installed:
kubectl exec -it debug-pod -- sh -c 'while true; do curl -s http://api-svc/version; echo; sleep 0.1; done'

Expected Observation in Terminal 2 (The Rollout Sequence):

Phase      Old RS (v1)   New RS (v2)   Status Change                        Endpoints
Start      4             0             All Running/Ready                    4 IPs
Surge      4             1             New v2 Pod starts → Running          4 IPs (v2 Pod not Ready yet)
Ready      4             1             New v2 Pod → Ready (after 30s)       5 IPs
Drain      3             1             Old v1 Pod → Terminating             4 IPs
Continue   0             4             Process repeats until completion     4 IPs

Expected Observation in Terminal 3 (Load Test): The output should continuously show version strings (v1.0.0 or v2.0.0) without any connection errors or 502/503 responses. The service ensures that traffic is only sent to the 4 or 5 pods that report Ready.


4. TROUBLESHOOTING AND NINJA COMMANDS

4.1 Checking Endpoint Stalling

If the rollout stalls, the Readiness Probe on the new Pods is almost certainly failing: with maxUnavailable: 0, the Deployment will not terminate any old Pod until the surged new Pod reports Ready. (Note that setting both maxUnavailable and maxSurge to 0 is invalid and is rejected by the API server.)

# Check the status of the Service Endpoints (should never drop below 4)
kubectl get endpoints api-svc -o jsonpath='{.subsets[*].addresses[*].ip}'
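The bounds the Deployment controller enforces follow directly from the spec; a quick sanity check of what "should never drop below 4" means for this lab's settings:

```shell
# Rollout bounds implied by this lab's Deployment spec.
replicas=4
maxSurge=1         # absolute value from the manifest
maxUnavailable=0   # absolute value from the manifest

echo "max total pods during rollout: $((replicas + maxSurge))"
echo "min available pods during rollout: $((replicas - maxUnavailable))"
```

This prints a maximum of 5 total pods and a minimum of 4 available pods at any point in the rollout, matching the Endpoints counts observed in Terminal 2.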

4.2 Force a Failed Rollout Investigation

If the new Pods are stuck in CrashLoopBackOff, the rollout stalls: the old Pods keep serving (thanks to maxUnavailable: 0), but the Deployment never completes. After progressDeadlineSeconds (default 600 seconds) the Progressing condition is set to False with reason ProgressDeadlineExceeded; Kubernetes does not roll back automatically.

# Check the internal progress of the rollout
kubectl rollout history deployment api-router
kubectl rollout status deployment api-router
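To see why the rollout is stuck without scrolling through describe output, you can query the Progressing condition directly; on a stuck rollout past the progress deadline, the reason reads ProgressDeadlineExceeded:

kubectl get deployment api-router \
  -o jsonpath='{.status.conditions[?(@.type=="Progressing")].reason}{"\n"}'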

Fix: Rollback immediately.

kubectl rollout undo deployment/api-router