Advanced Scheduling: Taints, Affinity, and Spread
The Kubernetes Scheduler is a sophisticated engine that balances resource requests against node capabilities. However, for production workloads, you must guide the scheduler to ensure high availability, data locality, and hardware isolation.
1. NODE MAINTENANCE & EVICTION FLOW
Before modifying scheduling rules, an architect must understand how to safely remove nodes from the cluster.
1.1 Cordon vs. Drain
- Cordon: Sets `spec.unschedulable: true` on the Node object. The scheduler stops placing new pods there, but existing pods are untouched.
- Drain: Executes a cordon, then uses the Eviction API to gracefully terminate the pods already running on the node.
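In practice, cordoning is fully reversible. A minimal round trip looks like this (the node name is illustrative):

```bash
# Mark the node unschedulable (existing pods keep running)
kubectl cordon worker-01

# Verify: STATUS shows "Ready,SchedulingDisabled"
kubectl get node worker-01

# Re-enable scheduling after maintenance
kubectl uncordon worker-01
```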
1.2 The Eviction Process
When you run kubectl drain:
- Cordon: The node is marked unschedulable.
- Filter: `kubectl drain` builds the list of pods to evict, skipping mirror pods and (with the flag) DaemonSet pods.
- PDB Check: Each eviction request is checked against PodDisruptionBudgets (PDBs); the drain hangs or fails if an eviction would violate one.
- Evict: The kubelet sends `SIGTERM` to the pod's containers.
- Wait: The kubelet honors `terminationGracePeriodSeconds`, then sends `SIGKILL` if containers are still running.
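Because drains block on PDBs, it helps to see what one looks like. A minimal sketch (the `billing-pdb` name and `app: billing-service` selector are illustrative) that refuses any voluntary eviction dropping the service below two ready replicas:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: billing-pdb
spec:
  minAvailable: 2          # Evictions that would leave fewer than 2 ready pods are refused
  selector:
    matchLabels:
      app: billing-service
```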
Production Drain Command:
```bash
# --force: Deletes 'naked' pods (not managed by a controller such as a ReplicaSet or Job)
# --ignore-daemonsets: Required because DaemonSet pods cannot be 'evicted' off a node
# --delete-emptydir-data: Acknowledges local data loss for pods using emptyDir
kubectl drain worker-01 --force --ignore-daemonsets --delete-emptydir-data
```
2. TAINTS AND TOLERATIONS (Node Repulsion)
Taints allow a node to repel a set of pods. They are the "Lock" on a node; Tolerations are the "Key" on the pod.
2.1 Taint Effects Internals
- NoSchedule: Strong repulsion. New pods cannot schedule unless they carry a matching toleration; existing pods stay.
- PreferNoSchedule: Soft repulsion. The scheduler avoids the node but will use it if the cluster is at capacity.
- NoExecute: Eviction trigger. When applied to a node, any running pod without a matching toleration is evicted.
2.2 The tolerationSeconds Feature
When using NoExecute, pods can specify how long they stay on a failing/tainted node before being evicted. This is critical for riding out transient network blips without unnecessary pod churn.
```yaml
spec:
  tolerations:
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 300  # Stay for 5 mins if the node goes unreachable
```
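Even if you never write these tolerations yourself, the DefaultTolerationSeconds admission plugin injects `node.kubernetes.io/not-ready` and `node.kubernetes.io/unreachable` tolerations with `tolerationSeconds: 300` into pods. You can inspect the effective values (the pod name is illustrative):

```bash
# Print the tolerations the API server actually stored for the pod
kubectl get pod ml-workload -o jsonpath='{.spec.tolerations}'
```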
2.3 Production Use Case: Dedicated GPU Nodes
Taint the Node:
```bash
kubectl taint nodes gpu-node-01 hardware=nvidia-a100:NoSchedule
```
Pod Manifest (The Key):
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ml-workload
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:12.0-base
  tolerations:
  - key: "hardware"
    operator: "Equal"
    value: "nvidia-a100"
    effect: "NoSchedule"
```
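Note that a toleration only unlocks the tainted node; it does not force the pod onto it. To guarantee the workload lands on GPU hardware, pair the toleration with a node selector. A sketch, assuming the nodes also carry a `hardware=nvidia-a100` label:

```yaml
spec:
  nodeSelector:
    hardware: nvidia-a100    # Attract: only labeled GPU nodes qualify
  tolerations:
  - key: "hardware"          # Tolerate: the GPU taint no longer repels this pod
    operator: "Equal"
    value: "nvidia-a100"
    effect: "NoSchedule"
```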
3. NODE AFFINITY (Node Attraction)
Node Affinity is the successor to nodeSelector. It allows for complex logic (AND/OR/NOT) and soft preferences.
3.1 Hard vs. Soft Rules
- requiredDuringSchedulingIgnoredDuringExecution: Hard requirement. If no node matches, the pod stays `Pending`.
- preferredDuringSchedulingIgnoredDuringExecution: Soft preference. You assign a weight (1-100); the scheduler adds the weights of matching terms to each node's score and picks the highest.
3.2 Production Example: Zonal Isolation
```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values: ["us-east-1a", "us-east-1b"]
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: capacity-type
            operator: In
            values: ["on-demand"]  # Prefer on-demand over spot
```
4. INTER-POD AFFINITY & ANTI-AFFINITY
This logic is based on labels of Pods rather than labels of Nodes.
4.1 Pod Affinity (Co-location)
Use Case: Place a web app in the same rack/zone as its Redis cache to minimize latency.
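A sketch of that co-location (the `app: redis-cache` label is illustrative): the web pod asks to land in the same zone as any pod carrying the cache label.

```yaml
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: redis-cache
        topologyKey: "topology.kubernetes.io/zone"  # Same zone as the cache
```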
4.2 Pod Anti-Affinity (High Availability)
Use Case: Ensure that two replicas of the same service never run on the same node or in the same zone.
Architect's Note: Pod Anti-Affinity is computationally expensive. In clusters with >500 nodes, it can significantly slow down scheduling.
4.3 Production Manifest: HA Across Nodes
```yaml
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: billing-service
        topologyKey: "kubernetes.io/hostname"  # Never put two 'billing' pods on one node
```
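When the hard rule is too strict (e.g., more replicas than nodes would leave the surplus pods `Pending`), the soft variant expresses the same intent as a weighted preference:

```yaml
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: billing-service
          topologyKey: "kubernetes.io/hostname"  # Spread if possible, co-locate if necessary
```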
5. TOPOLOGY SPREAD CONSTRAINTS
This is the most modern and flexible way to achieve "Even Distribution." It solves the "all-or-nothing" problem of Anti-Affinity.
5.1 Key Concepts
- maxSkew: The degree to which pods may be unevenly distributed. `maxSkew: 1` means the difference in pod count between any two zones can be at most 1.
- topologyKey: The domain to spread across (e.g., `topology.kubernetes.io/zone`).
- whenUnsatisfiable: `DoNotSchedule` is a hard rule; `ScheduleAnyway` still tries to balance but prioritizes placement over evenness.
5.2 Production Example: Perfect Zonal Balance
If you have 3 zones, this ensures your 6 pods are distributed 2-2-2.
```yaml
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: "topology.kubernetes.io/zone"
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: ecommerce-web
```
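Constraints can also be stacked. A common production pattern, sketched here with the same illustrative label, spreads across zones as a hard rule and across individual nodes as a soft one:

```yaml
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: "topology.kubernetes.io/zone"
    whenUnsatisfiable: DoNotSchedule     # Hard: zones must stay balanced
    labelSelector:
      matchLabels:
        app: ecommerce-web
  - maxSkew: 1
    topologyKey: "kubernetes.io/hostname"
    whenUnsatisfiable: ScheduleAnyway    # Soft: prefer node-level balance too
    labelSelector:
      matchLabels:
        app: ecommerce-web
```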
6. ARCHITECTURAL COMPARISON SUMMARY
| Feature | Primary Purpose | Logic | Strategy |
|---|---|---|---|
| Taints | Exclude pods from nodes. | Repulsion | Node-centric lock. |
| Tolerations | Allow pods on tainted nodes. | Immunity | Pod-centric key. |
| Node Affinity | Attract pods to specific nodes. | Attraction | Hardware/Cloud-label awareness. |
| Pod Affinity | Keep related pods together. | Locality | Performance / Low-latency. |
| Pod Anti-Affinity | Keep pods apart. | Isolation | High Availability / Fault-tolerance. |
| Topology Spread | Evenly distribute pods. | Balancing | Multi-AZ / Multi-Rack reliability. |
The "IgnoredDuringExecution" Nuance
Almost all affinity rules contain the suffix IgnoredDuringExecution.
- Scenario: A pod is scheduled to a node because it carries the label `disk=ssd`. While the pod is running, an admin changes the node label to `disk=hdd`.
- Result: The pod is NOT evicted; affinity is only evaluated at scheduling time.
- Exception: Only taints with the `NoExecute` effect cause a running pod to be evicted when the rules change.
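You can observe this behavior directly (node and label names are illustrative): re-labeling the node out from under a scheduled pod does not disturb it.

```bash
# Change the label a running pod was scheduled against
kubectl label node worker-01 disk=hdd --overwrite

# The pod is still Running - affinity is IgnoredDuringExecution
kubectl get pods -o wide --field-selector spec.nodeName=worker-01
```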
7. DEBUGGING COMMANDS
1. Identify why a pod is stuck in Pending:
```bash
kubectl describe pod <pod-name>
# Look for "FailedScheduling" events, e.g.:
# "0/3 nodes are available: 1 node(s) had taint {key: val}, 2 node(s) didn't match pod affinity/anti-affinity."
```
2. Check Node Labels (The source of affinity):
```bash
kubectl get nodes --show-labels
```
3. Test a Drain (Without actually doing it):
```bash
kubectl drain <node> --dry-run=client
```
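4. Inspect and clear taints when a taint is the suspect (the trailing `-` removes it):

```bash
# List every node's taints in one view
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'

# Remove a taint - note the trailing '-'
kubectl taint nodes gpu-node-01 hardware=nvidia-a100:NoSchedule-
```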