
Advanced Scheduling: Taints, Affinity, and Spread

The Kubernetes Scheduler is a sophisticated engine that balances resource requests against node capabilities. However, for production workloads, you must guide the scheduler to ensure high availability, data locality, and hardware isolation.


1. NODE MAINTENANCE & EVICTION FLOW

Before modifying scheduling rules, an architect must understand how to safely remove nodes from the cluster.

1.1 Cordon vs. Drain

  • Cordon: Sets spec.unschedulable: true on the Node object. The scheduler will stop placing new pods there, but existing pods are untouched.
  • Drain: Executes a cordon, then uses the Eviction API to gracefully terminate pods.
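Both operations map to single commands; a minimal take-out/bring-back sequence (the node name worker-01 is illustrative):

```shell
# Stop new pods landing on the node (existing pods keep running)
kubectl cordon worker-01

# Verify: the node's STATUS column now shows SchedulingDisabled
kubectl get node worker-01

# Bring the node back into the scheduling pool
kubectl uncordon worker-01
```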

1.2 The Eviction Process

When you run kubectl drain:

  1. Cordon: Node is marked unschedulable.
  2. Filter: kubectl identifies the pods to evict (skipping Mirror Pods; DaemonSet pods are skipped only with --ignore-daemonsets).
  3. PDB Check: The API server rejects an eviction (HTTP 429) that would violate a PodDisruptionBudget (PDB); the drain retries and appears to hang.
  4. Evict: The Kubelet sends SIGTERM to the pod's containers.
  5. Wait: The Kubelet honors terminationGracePeriodSeconds, then sends SIGKILL.
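The PDB referenced above is its own API object. A minimal sketch (the app: billing-service label is illustrative) that guarantees at least 2 replicas survive any voluntary disruption such as a drain:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: billing-pdb
spec:
  minAvailable: 2          # Evictions that would drop the ready count below 2 are rejected
  selector:
    matchLabels:
      app: billing-service
```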

Production Drain Command:

# --force: Evicts 'naked' pods (not managed by a ReplicaSet, Job, StatefulSet, or DaemonSet)
# --ignore-daemonsets: Required because DaemonSet pods would be instantly recreated by their controller
# --delete-emptydir-data: Acknowledges local data loss for pods using emptyDir volumes
kubectl drain worker-01 --force --ignore-daemonsets --delete-emptydir-data

2. TAINTS AND TOLERATIONS (Node Repulsion)

Taints allow a node to repel a set of pods. They are the "Lock" on a node; Tolerations are the "Key" on the pod.

2.1 Taint Effects Internals

  1. NoSchedule: Strong repulsion. New pods cannot schedule unless they have a matching toleration. Existing pods stay.
  2. PreferNoSchedule: Soft repulsion. Scheduler avoids the node but will use it if the cluster is at capacity.
  3. NoExecute: Eviction trigger. If you apply this to a node, any running pod without a matching toleration is evicted immediately (or after its tolerationSeconds expires).
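Taints are managed with kubectl taint; note the trailing '-' syntax for removal (node and key names are illustrative):

```shell
# Add a NoExecute taint: running pods without a toleration are evicted
kubectl taint nodes worker-01 maintenance=true:NoExecute

# Inspect the taints currently on the node
kubectl describe node worker-01 | grep -A3 Taints

# Remove the taint (same spec, trailing '-')
kubectl taint nodes worker-01 maintenance=true:NoExecute-
```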

2.2 The tolerationSeconds Feature

When using NoExecute, pods can specify how long they stay on a failing/tainted node before being evicted. This is critical for network blips.

spec:
  tolerations:
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 300  # Stay for 5 mins if node goes unreachable

2.3 Production Use Case: Dedicated GPU Nodes

Taint the Node:

kubectl taint nodes gpu-node-01 hardware=nvidia-a100:NoSchedule

Pod Manifest (The Key):

apiVersion: v1
kind: Pod
metadata:
  name: ml-workload
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:12.0-base
  tolerations:
  - key: "hardware"
    operator: "Equal"
    value: "nvidia-a100"
    effect: "NoSchedule"
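Note that a toleration only permits scheduling onto the tainted node; it does not force it. To actually pin GPU workloads to GPU nodes, pair the toleration with a node selector (a sketch, assuming the GPU nodes also carry an illustrative hardware=nvidia-a100 label):

```yaml
spec:
  nodeSelector:
    hardware: nvidia-a100   # Attract: only consider GPU-labelled nodes
  tolerations:
  - key: "hardware"         # Permit: unlock the NoSchedule taint
    operator: "Equal"
    value: "nvidia-a100"
    effect: "NoSchedule"
```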

3. NODE AFFINITY (Node Attraction)

Node Affinity is the successor to nodeSelector. It allows for complex logic (AND/OR/NOT) and soft preferences.

3.1 Hard vs. Soft Rules

  • requiredDuringSchedulingIgnoredDuringExecution: Hard requirement. If no node matches, pod stays Pending.
  • preferredDuringSchedulingIgnoredDuringExecution: Soft preference. You assign a Weight (1-100). The scheduler adds weights to nodes that match and picks the highest score.

3.2 Production Example: Zonal Isolation

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values: ["us-east-1a", "us-east-1b"]
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: capacity-type
            operator: In
            values: ["on-demand"]  # Prefer on-demand over spot

4. INTER-POD AFFINITY & ANTI-AFFINITY

This logic is based on labels of Pods rather than labels of Nodes.

4.1 Pod Affinity (Co-location)

Use Case: Place a web app in the same rack/zone as its Redis cache to minimize latency.
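A minimal sketch of that co-location rule, assuming the cache pods carry an illustrative app: redis-cache label:

```yaml
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: redis-cache
        topologyKey: "topology.kubernetes.io/zone"  # Land in the same zone as the cache
```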

4.2 Pod Anti-Affinity (High Availability)

Use Case: Ensure that two replicas of the same service never run on the same node or in the same zone.

Architect's Note: Pod Anti-Affinity is computationally expensive. In clusters with >500 nodes, it can significantly slow down scheduling.
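On large clusters, the soft (preferred) form trades strict isolation for scheduling throughput; a sketch with an illustrative app: billing-service label:

```yaml
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: billing-service
          topologyKey: "kubernetes.io/hostname"  # Spread if possible, co-locate if necessary
```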

4.3 Production Manifest: HA Across Nodes

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: billing-service
        topologyKey: "kubernetes.io/hostname"  # Do not put two 'billing' pods on one node

5. TOPOLOGY SPREAD CONSTRAINTS

This is the most modern and flexible way to achieve "Even Distribution." It solves the "all-or-nothing" problem of Anti-Affinity.

5.1 Key Concepts

  • maxSkew: The degree to which pods may be unevenly distributed. A maxSkew: 1 means the difference in pod count between any two zones can be at most 1.
  • topologyKey: The domain (e.g., topology.kubernetes.io/zone).
  • whenUnsatisfiable:
    • DoNotSchedule: Hard rule. The pod stays Pending if the skew cannot be satisfied.
    • ScheduleAnyway: Soft rule. The scheduler deprioritizes nodes that increase skew but still places the pod.

5.2 Production Example: Perfect Zonal Balance

If you have 3 zones, this ensures your 6 pods are distributed 2-2-2.

spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: "topology.kubernetes.io/zone"
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: ecommerce-web

6. ARCHITECTURAL COMPARISON SUMMARY

Feature            | Primary Purpose                  | Logic      | Strategy
Taints             | Exclude pods from nodes.         | Repulsion  | Node-centric lock.
Tolerations        | Allow pods on tainted nodes.     | Immunity   | Pod-centric key.
Node Affinity      | Attract pods to specific nodes.  | Attraction | Hardware/Cloud-label awareness.
Pod Affinity       | Keep related pods together.      | Locality   | Performance / Low-latency.
Pod Anti-Affinity  | Keep pods apart.                 | Isolation  | High Availability / Fault-tolerance.
Topology Spread    | Evenly distribute pods.          | Balancing  | Multi-AZ / Multi-Rack reliability.

The "IgnoredDuringExecution" Nuance

Almost all affinity rules contain the suffix IgnoredDuringExecution.

  • Scenario: A pod is scheduled to a node because it has the label disk=ssd. While the pod is running, an admin changes the node label to disk=hdd.
  • Result: The pod will NOT be evicted.
  • Exception: Only Taints with the NoExecute effect evict running pods when node state changes: a pod without a matching toleration is removed as soon as the taint appears.

7. DEBUGGING COMMANDS

1. Identify why a pod is stuck in Pending:

kubectl describe pod <pod-name>
# Look for "FailedScheduling" events.
# It will say "0/3 nodes are available: 1 node(s) had taint {key: val}, 2 node(s) didn't match pod affinity/anti-affinity."

2. Check Node Labels (The source of affinity):

kubectl get nodes --show-labels
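To check which nodes would satisfy a given affinity rule, filter by label directly (the label values here are illustrative):

```shell
# List only the nodes in a specific zone
kubectl get nodes -l topology.kubernetes.io/zone=us-east-1a

# Add a column showing one label's value per node
kubectl get nodes -L capacity-type
```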

3. Test a Drain (Without actually doing it):

kubectl drain <node> --dry-run=client