Project Lab 04: Hardware Isolation & Topology-Aware Scheduling
In large-scale clusters, nodes are often not identical. You may have a subset of nodes with expensive GPUs or locally attached NVMe storage. Without proper scheduling logic, standard "cattle" pods can consume resources on these specialized nodes, preventing critical workloads from running.
This lab demonstrates how to implement a Hardware Isolation strategy and a Zonal High-Availability pattern.
Reference Material:
- docs/03-scheduling/manual-scheduling-static-pods.md
- docs/03-scheduling/taints-tolerations-affinity.md
1. OBJECTIVE: DEDICATED COMPUTE TIERS
The goal is to configure the cluster so that:
- Standard Apps never land on GPU-enabled nodes.
- ML-Training Apps are attracted to GPU-enabled nodes and tolerate their "lock."
- High Availability is enforced so replicas of the ML app never share a single physical node or become unbalanced across Availability Zones.
2. PHASE 1: NODE PREPARATION (TAINTS & LABELS)
We will simulate hardware specialization by applying Taints (to repel) and Labels (to attract) to specific nodes.
2.1 Identify and Taint GPU Nodes
Assume worker-2 is our GPU-enabled node. We apply a NoSchedule taint.
# 1. Taint the node to repel standard workloads
kubectl taint nodes worker-2 hardware=gpu:NoSchedule
# 2. Label the node for Affinity rules
kubectl label nodes worker-2 accelerator=nvidia-tesla-a100
kubectl label nodes worker-2 topology.kubernetes.io/zone=us-east-1a
# 3. Label other nodes for zonal context
kubectl label nodes worker-1 topology.kubernetes.io/zone=us-east-1b
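Before moving on, it is worth confirming that the taint and labels actually landed. A quick check, using the same node name as above (requires access to the lab cluster):

```shell
# Confirm the NoSchedule taint is present on the GPU node
kubectl describe node worker-2 | grep -A 2 "Taints:"

# Confirm the accelerator and zone labels
kubectl get node worker-2 --show-labels
```

If the taint line reads `hardware=gpu:NoSchedule` and the labels include `accelerator=nvidia-tesla-a100`, the node is ready for Phase 2.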
3. PHASE 2: IMPLEMENTING THE ISOLATED WORKLOAD
We will deploy an ML-Training application that is designed to run only on our specialized hardware.
3.1 The Training Manifest (ml-training.yaml)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-processor
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ml-processor
  template:
    metadata:
      labels:
        app: ml-processor
    spec:
      # 1. TOLERATION: Allows the pod to stay on the tainted node
      tolerations:
      - key: "hardware"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
      # 2. NODE AFFINITY: Hard requirement to land on specific hardware
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: accelerator
                operator: In
                values:
                - nvidia-tesla-a100
        # 3. POD ANTI-AFFINITY: Ensure replicas don't land on the same node
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: ml-processor
            topologyKey: "kubernetes.io/hostname"
      # 4. TOPOLOGY SPREAD: Ensure pods are spread across zones
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: "topology.kubernetes.io/zone"
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: ml-processor
      containers:
      - name: processor
        image: bitnami/pytorch:latest
        resources:
          requests:
            cpu: "500m"
            memory: "1Gi"
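The verification steps later in the lab assume this Deployment is running, so apply the manifest now. A minimal deploy step, assuming the manifest is saved as ml-training.yaml and you have access to the lab cluster:

```shell
# Validate the manifest client-side first, then apply it
kubectl apply --dry-run=client -f ml-training.yaml
kubectl apply -f ml-training.yaml

# Watch initial placement
kubectl get pods -l app=ml-processor -o wide
```

The client-side dry run catches YAML and schema mistakes before anything touches the cluster.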
4. PHASE 3: THE "CATTLE" TEST
We will now verify that a standard application (without tolerations) cannot schedule onto our GPU node.
4.1 Deploy Standard Nginx
kubectl create deployment web-frontend --image=nginx --replicas=10
4.2 Audit Placement
# Check which nodes are hosting the frontend
kubectl get pods -o wide -l app=web-frontend
Expected Observation: None of the web-frontend pods should land on worker-2 because they lack the required toleration for the hardware=gpu taint.
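A scriptable version of the same audit, listing only the node name for each frontend pod (worker-2 should never appear in the output):

```shell
# Count web-frontend pods per node; worker-2 should be absent
kubectl get pods -l app=web-frontend \
  -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort | uniq -c
```

This jsonpath form is easier to drop into a CI check than eyeballing the `-o wide` table.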
5. VERIFICATION & AUDIT
5.1 Confirm Specialized Placement
kubectl get pods -o wide -l app=ml-processor
Success Criteria: One ml-processor pod is Running on worker-2 (matching the toleration and affinity). Because the hard Pod Anti-Affinity forbids co-located replicas and only one node matches the Node Affinity, the second replica stays Pending; the next step digs into why.
5.2 Investigate Scheduling Latency
With only one GPU node, any replica beyond the first is unschedulable. Scaling up makes this easy to observe:
kubectl scale deployment ml-processor --replicas=3
kubectl get pods -l app=ml-processor
Observation: Every replica after the first stays in Pending.
Why? Run kubectl describe pod <pending-pod-name>. You will see FailedScheduling because the Pod Anti-Affinity rule (topologyKey: kubernetes.io/hostname) prevents a second pod from landing on the same node, and no other node matches the Node Affinity for nvidia-tesla-a100.
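Rather than describing Pending pods one at a time, you can pull every recent scheduling failure in one shot (run against the lab cluster):

```shell
# Surface the scheduler's explanation for all Pending pods at once
kubectl get events --field-selector reason=FailedScheduling \
  --sort-by=.lastTimestamp
```

This is handy in larger clusters where several Deployments may be stuck for different reasons.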
6. TROUBLESHOOTING & NINJA COMMANDS
6.1 Audit Taints and Labels across the Cluster
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints,ZONE:.metadata.labels.topology\.kubernetes\.io/zone'
6.2 Removing a Taint at Runtime
If you remove the taint while the standard app is running:
kubectl taint nodes worker-2 hardware:NoSchedule-
The running pods will not move. Scheduling decisions are made only at admission time, so removing a taint never triggers a reschedule of pods that are already placed. (The affinity rules behave the same way: the IgnoredDuringExecution suffix means they are evaluated only when a pod is scheduled, not afterwards.) New pods created by a scale-up event or a restart will now be eligible for worker-2.
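To return the cluster to its isolated state after experimenting, re-apply the same taint from Phase 1:

```shell
# Restore the NoSchedule taint so standard workloads are repelled again
kubectl taint nodes worker-2 hardware=gpu:NoSchedule
```

Pods that slipped onto worker-2 while the taint was absent will keep running; a NoSchedule taint only blocks new placements (you would need NoExecute to evict them).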
7. KEY TAKEAWAYS FOR PRODUCTION
- Taints are for Exclusion: Use them to protect specialized resources from general-purpose workloads.
- Affinity is for Attraction: Use it to ensure workloads find the specific hardware they need to perform.
- Anti-Affinity is for Reliability: Always use it for production replicas to prevent a single node failure from taking down your entire service.
- Spread is for Balance: Use Topology Spread Constraints to survive Availability Zone outages.