Project Lab 04: Hardware Isolation & Topology-Aware Scheduling
In large-scale clusters, nodes are often not identical. You may have a subset of nodes with expensive GPUs or locally attached NVMe storage. Without proper scheduling logic, standard "cattle" pods can consume resources on these specialized nodes, preventing critical workloads from running.
This lab demonstrates how to implement a Hardware Isolation strategy and a Zonal High-Availability pattern.
Reference Material:
- docs/03-scheduling/manual-scheduling-static-pods.md
- docs/03-scheduling/taints-tolerations-affinity.md
1. OBJECTIVE: DEDICATED COMPUTE TIERS
The goal is to configure the cluster so that:
- Standard Apps never land on GPU-enabled nodes.
- ML-Training Apps are attracted to GPU-enabled nodes and tolerate their "lock."
- High Availability is enforced so replicas of the ML app never share a single physical node or become unbalanced across Availability Zones.
2. PHASE 1: NODE PREPARATION (TAINTS & LABELS)
We will simulate hardware specialization by applying Taints (to repel) and Labels (to attract) to specific nodes.
2.1 Identify and Taint GPU Nodes
Assume worker-2 is our GPU-enabled node. We apply a NoSchedule taint.
# 1. Taint the node to repel standard workloads
kubectl taint nodes worker-2 hardware=gpu:NoSchedule
# 2. Label the node for Affinity rules
kubectl label nodes worker-2 accelerator=nvidia-tesla-a100
kubectl label nodes worker-2 topology.kubernetes.io/zone=us-east-1a
# 3. Label other nodes for zonal context
kubectl label nodes worker-1 topology.kubernetes.io/zone=us-east-1b
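Before moving on, it is worth confirming that the taint and labels actually landed. A quick check, using the same node name as above (requires access to the lab cluster):

```shell
# Confirm the NoSchedule taint is present on the GPU node
kubectl describe node worker-2 | grep -A 2 "Taints:"

# Confirm the accelerator and zone labels
kubectl get node worker-2 --show-labels
```

If the taint line reads `hardware=gpu:NoSchedule` and the labels include `accelerator=nvidia-tesla-a100`, the node is ready for Phase 2.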
3. PHASE 2: IMPLEMENTING THE ISOLATED WORKLOAD
We will deploy an ML-Training application that is designed to run only on our specialized hardware.
3.1 The Training Manifest (ml-training.yaml)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-processor
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ml-processor
  template:
    metadata:
      labels:
        app: ml-processor
    spec:
      # 1. TOLERATION: Allows the pod to stay on the tainted node
      tolerations:
      - key: "hardware"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
      # 2. NODE AFFINITY: Hard requirement to land on specific hardware
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: accelerator
                operator: In
                values:
                - nvidia-tesla-a100
        # 3. POD ANTI-AFFINITY: Ensure replicas don't land on the same node
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: ml-processor
            topologyKey: "kubernetes.io/hostname"
      # 4. TOPOLOGY SPREAD: Ensure pods are spread across zones
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: "topology.kubernetes.io/zone"
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: ml-processor
      containers:
      - name: processor
        image: bitnami/pytorch:latest
        resources:
          requests:
            cpu: "500m"
            memory: "1Gi"
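The verification steps later in the lab assume this Deployment is running, so apply the manifest now. A minimal deploy step, assuming the manifest is saved as ml-training.yaml and you have access to the lab cluster:

```shell
# Validate the manifest client-side first, then apply it
kubectl apply --dry-run=client -f ml-training.yaml
kubectl apply -f ml-training.yaml

# Watch initial placement
kubectl get pods -l app=ml-processor -o wide
```

The client-side dry run catches YAML and schema mistakes before anything touches the cluster.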
4. PHASE 3: THE "CATTLE" TEST
We will now verify that a standard application (without tolerations) cannot schedule onto our GPU node.
4.1 Deploy Standard Nginx
kubectl create deployment web-frontend --image=nginx --replicas=10
4.2 Audit Placement
# Check which nodes are hosting the frontend
kubectl get pods -o wide -l app=web-frontend
Expected Observation: None of the web-frontend pods should land on worker-2 because they lack the required toleration for the hardware=gpu taint.
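A scriptable version of the same audit, listing only the node name for each frontend pod (worker-2 should never appear in the output):

```shell
# Count web-frontend pods per node; worker-2 should be absent
kubectl get pods -l app=web-frontend \
  -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort | uniq -c
```

This jsonpath form is easier to drop into a CI check than eyeballing the `-o wide` table.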
5. VERIFICATION & AUDIT
5.1 Confirm Specialized Placement
kubectl get pods -o wide -l app=ml-processor
Success Criteria: One ml-processor pod is Running on worker-2 (matching the toleration and affinity). Because the hard Pod Anti-Affinity forbids co-located replicas and only one node matches the Node Affinity, the second replica stays Pending; the next step digs into why.
5.2 Investigate Scheduling Latency
With only one GPU node, any replica beyond the first is unschedulable. Scaling up makes this easy to observe:
kubectl scale deployment ml-processor --replicas=3
kubectl get pods -l app=ml-processor
Observation: Every replica after the first stays in Pending.
Why? Run kubectl describe pod <pending-pod-name>. You will see FailedScheduling because the Pod Anti-Affinity rule (topologyKey: kubernetes.io/hostname) prevents a second pod from landing on the same node, and no other node matches the Node Affinity for nvidia-tesla-a100.
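Rather than describing Pending pods one at a time, you can pull every recent scheduling failure in one shot (run against the lab cluster):

```shell
# Surface the scheduler's explanation for all Pending pods at once
kubectl get events --field-selector reason=FailedScheduling \
  --sort-by=.lastTimestamp
```

This is handy in larger clusters where several Deployments may be stuck for different reasons.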
6. TROUBLESHOOTING & NINJA COMMANDS
6.1 Audit Taints and Labels across the Cluster
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints,ZONE:.metadata.labels.topology\.kubernetes\.io/zone'
6.2 Removing a Taint at Runtime
If you remove the taint while the standard app is running:
kubectl taint nodes worker-2 hardware:NoSchedule-
The running pods will not move. Scheduling decisions are made only at admission time, so removing a taint never triggers a reschedule of pods that are already placed. (The affinity rules behave the same way: the IgnoredDuringExecution suffix means they are evaluated only when a pod is scheduled, not afterwards.) New pods created by a scale-up event or a restart will now be eligible for worker-2.
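To return the cluster to its isolated state after experimenting, re-apply the same taint from Phase 1:

```shell
# Restore the NoSchedule taint so standard workloads are repelled again
kubectl taint nodes worker-2 hardware=gpu:NoSchedule
```

Pods that slipped onto worker-2 while the taint was absent will keep running; a NoSchedule taint only blocks new placements (you would need NoExecute to evict them).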
7. KEY TAKEAWAYS FOR PRODUCTION
- Taints are for Exclusion: Use them to protect specialized resources from general-purpose workloads.
- Affinity is for Attraction: Use it to ensure workloads find the specific hardware they need to perform.
- Anti-Affinity is for Reliability: Always use it for production replicas to prevent a single node failure from taking down your entire service.
- Spread is for Balance: Use Topology Spread Constraints to survive Availability Zone outages.