Advanced Scheduling: Taints, Affinity, and Spread
The Kubernetes Scheduler is a sophisticated engine that balances resource requests against node capabilities. However, for production workloads, you must guide the scheduler to ensure high availability, data locality, and hardware isolation.
1. NODE MAINTENANCE & EVICTION FLOW
Before modifying scheduling rules, an architect must understand how to safely remove nodes from the cluster.
1.1 Cordon vs. Drain
- Cordon: Sets `spec.unschedulable: true` on the Node object. The scheduler stops placing new pods there, but existing pods are untouched.
- Drain: Executes a cordon, then uses the Eviction API to gracefully terminate the pods already running on the node.
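In practice, cordoning is fully reversible. A minimal round trip looks like this (the node name is illustrative):

```bash
# Mark the node unschedulable (existing pods keep running)
kubectl cordon worker-01

# Verify: STATUS shows "Ready,SchedulingDisabled"
kubectl get node worker-01

# Re-enable scheduling after maintenance
kubectl uncordon worker-01
```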
1.2 The Eviction Process
When you run kubectl drain:
- Cordon: The node is marked unschedulable.
- Filter: `kubectl drain` builds the list of pods to evict, skipping mirror pods and (with the flag) DaemonSet pods.
- PDB Check: Each eviction request is checked against PodDisruptionBudgets (PDBs); the drain hangs or fails if an eviction would violate one.
- Evict: The kubelet sends `SIGTERM` to the pod's containers.
- Wait: The kubelet honors `terminationGracePeriodSeconds`, then sends `SIGKILL` if containers are still running.
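Because drains block on PDBs, it helps to see what one looks like. A minimal sketch (the `billing-pdb` name and `app: billing-service` selector are illustrative) that refuses any voluntary eviction dropping the service below two ready replicas:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: billing-pdb
spec:
  minAvailable: 2          # Evictions that would leave fewer than 2 ready pods are refused
  selector:
    matchLabels:
      app: billing-service
```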
Production Drain Command:
```bash
# --force: Deletes 'naked' pods (not managed by a controller such as a ReplicaSet or Job)
# --ignore-daemonsets: Required because DaemonSet pods cannot be 'evicted' off a node
# --delete-emptydir-data: Acknowledges local data loss for pods using emptyDir
kubectl drain worker-01 --force --ignore-daemonsets --delete-emptydir-data
```
2. TAINTS AND TOLERATIONS (Node Repulsion)
Taints allow a node to repel a set of pods. They are the "Lock" on a node; Tolerations are the "Key" on the pod.
2.1 Taint Effects Internals
- NoSchedule: Strong repulsion. New pods cannot schedule unless they carry a matching toleration; existing pods stay.
- PreferNoSchedule: Soft repulsion. The scheduler avoids the node but will use it if the cluster is at capacity.
- NoExecute: Eviction trigger. When applied to a node, any running pod without a matching toleration is evicted.
2.2 The tolerationSeconds Feature
When using NoExecute, pods can specify how long they stay on a failing/tainted node before being evicted. This is critical for riding out transient network blips without unnecessary pod churn.
```yaml
spec:
  tolerations:
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 300  # Stay for 5 mins if the node goes unreachable
```
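Even if you never write these tolerations yourself, the DefaultTolerationSeconds admission plugin injects `node.kubernetes.io/not-ready` and `node.kubernetes.io/unreachable` tolerations with `tolerationSeconds: 300` into pods. You can inspect the effective values (the pod name is illustrative):

```bash
# Print the tolerations the API server actually stored for the pod
kubectl get pod ml-workload -o jsonpath='{.spec.tolerations}'
```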
2.3 Production Use Case: Dedicated GPU Nodes
Taint the Node:
```bash
kubectl taint nodes gpu-node-01 hardware=nvidia-a100:NoSchedule
```
Pod Manifest (The Key):
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ml-workload
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:12.0-base
  tolerations:
  - key: "hardware"
    operator: "Equal"
    value: "nvidia-a100"
    effect: "NoSchedule"
```
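Note that a toleration only unlocks the tainted node; it does not force the pod onto it. To guarantee the workload lands on GPU hardware, pair the toleration with a node selector. A sketch, assuming the nodes also carry a `hardware=nvidia-a100` label:

```yaml
spec:
  nodeSelector:
    hardware: nvidia-a100    # Attract: only labeled GPU nodes qualify
  tolerations:
  - key: "hardware"          # Tolerate: the GPU taint no longer repels this pod
    operator: "Equal"
    value: "nvidia-a100"
    effect: "NoSchedule"
```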
3. NODE AFFINITY (Node Attraction)
Node Affinity is the successor to nodeSelector. It allows for complex logic (AND/OR/NOT) and soft preferences.
3.1 Hard vs. Soft Rules
- requiredDuringSchedulingIgnoredDuringExecution: Hard requirement. If no node matches, the pod stays `Pending`.
- preferredDuringSchedulingIgnoredDuringExecution: Soft preference. You assign a weight (1-100); the scheduler adds the weights of matching terms to each node's score and picks the highest.
3.2 Production Example: Zonal Isolation
```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values: ["us-east-1a", "us-east-1b"]
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: capacity-type
            operator: In
            values: ["on-demand"]  # Prefer on-demand over spot
```
4. INTER-POD AFFINITY & ANTI-AFFINITY
This logic is based on labels of Pods rather than labels of Nodes.
4.1 Pod Affinity (Co-location)
Use Case: Place a web app in the same rack/zone as its Redis cache to minimize latency.
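A sketch of that co-location (the `app: redis-cache` label is illustrative): the web pod asks to land in the same zone as any pod carrying the cache label.

```yaml
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: redis-cache
        topologyKey: "topology.kubernetes.io/zone"  # Same zone as the cache
```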
4.2 Pod Anti-Affinity (High Availability)
Use Case: Ensure that two replicas of the same service never run on the same node or in the same zone.
Architect's Note: Pod Anti-Affinity is computationally expensive. In clusters with >500 nodes, it can significantly slow down scheduling.
4.3 Production Manifest: HA Across Nodes
```yaml
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: billing-service
        topologyKey: "kubernetes.io/hostname"  # Never put two 'billing' pods on one node
```
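When the hard rule is too strict (e.g., more replicas than nodes would leave the surplus pods `Pending`), the soft variant expresses the same intent as a weighted preference:

```yaml
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: billing-service
          topologyKey: "kubernetes.io/hostname"  # Spread if possible, co-locate if necessary
```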
5. TOPOLOGY SPREAD CONSTRAINTS
This is the most modern and flexible way to achieve "Even Distribution." It solves the "all-or-nothing" problem of Anti-Affinity.
5.1 Key Concepts
- maxSkew: The degree to which pods may be unevenly distributed. `maxSkew: 1` means the difference in pod count between any two zones can be at most 1.
- topologyKey: The domain to spread across (e.g., `topology.kubernetes.io/zone`).
- whenUnsatisfiable: `DoNotSchedule` is a hard rule; `ScheduleAnyway` still tries to balance but prioritizes placement over evenness.
5.2 Production Example: Perfect Zonal Balance
If you have 3 zones, this ensures your 6 pods are distributed 2-2-2.
```yaml
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: "topology.kubernetes.io/zone"
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: ecommerce-web
```
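Constraints can also be stacked. A common production pattern, sketched here with the same illustrative label, spreads across zones as a hard rule and across individual nodes as a soft one:

```yaml
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: "topology.kubernetes.io/zone"
    whenUnsatisfiable: DoNotSchedule     # Hard: zones must stay balanced
    labelSelector:
      matchLabels:
        app: ecommerce-web
  - maxSkew: 1
    topologyKey: "kubernetes.io/hostname"
    whenUnsatisfiable: ScheduleAnyway    # Soft: prefer node-level balance too
    labelSelector:
      matchLabels:
        app: ecommerce-web
```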
6. ARCHITECTURAL COMPARISON SUMMARY
| Feature | Primary Purpose | Logic | Strategy |
|---|---|---|---|
| Taints | Exclude pods from nodes. | Repulsion | Node-centric lock. |
| Tolerations | Allow pods on tainted nodes. | Immunity | Pod-centric key. |
| Node Affinity | Attract pods to specific nodes. | Attraction | Hardware/Cloud-label awareness. |
| Pod Affinity | Keep related pods together. | Locality | Performance / Low-latency. |
| Pod Anti-Affinity | Keep pods apart. | Isolation | High Availability / Fault-tolerance. |
| Topology Spread | Evenly distribute pods. | Balancing | Multi-AZ / Multi-Rack reliability. |
The "IgnoredDuringExecution" Nuance
Almost all affinity rules contain the suffix IgnoredDuringExecution.
- Scenario: A pod is scheduled to a node because it carries the label `disk=ssd`. While the pod is running, an admin changes the node label to `disk=hdd`.
- Result: The pod is NOT evicted; affinity is only evaluated at scheduling time.
- Exception: Only taints with the `NoExecute` effect cause a running pod to be evicted when the rules change.
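You can observe this behavior directly (node and label names are illustrative): re-labeling the node out from under a scheduled pod does not disturb it.

```bash
# Change the label a running pod was scheduled against
kubectl label node worker-01 disk=hdd --overwrite

# The pod is still Running - affinity is IgnoredDuringExecution
kubectl get pods -o wide --field-selector spec.nodeName=worker-01
```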
7. DEBUGGING COMMANDS
1. Identify why a pod is stuck in Pending:
```bash
kubectl describe pod <pod-name>
# Look for "FailedScheduling" events, e.g.:
# "0/3 nodes are available: 1 node(s) had taint {key: val}, 2 node(s) didn't match pod affinity/anti-affinity."
```
2. Check Node Labels (The source of affinity):
```bash
kubectl get nodes --show-labels
```
3. Test a Drain (Without actually doing it):
```bash
kubectl drain <node> --dry-run=client
```
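4. Inspect and clear taints when a taint is the suspect (the trailing `-` removes it):

```bash
# List every node's taints in one view
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'

# Remove a taint - note the trailing '-'
kubectl taint nodes gpu-node-01 hardware=nvidia-a100:NoSchedule-
```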