Priority Classes: Governing Cluster Entitlement

In a multi-tenant or resource-constrained cluster, not all workloads are equal. A payment processing API is more critical than a nightly log-aggregation Job. Priority Classes let you codify this business value into scheduling logic: higher-priority Pods are scheduled ahead of lower-priority ones, and when no Node has room the Scheduler may evict lower-priority "victim" Pods to make space. Placement is still not guaranteed; preemption only helps when removing lower-priority Pods would actually let the Pod fit.

1. ARCHITECTURE: THE PREEMPTION LOOP

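When a Pod is submitted to the API Server, it enters the Scheduling Queue, which is ordered by priority. If the Scheduler cannot find a Node that satisfies the Pod's requirements (CPU, memory, host ports, and other constraints), it triggers the Preemption Logic and looks for lower-priority Pods whose removal would make the Pod schedulable.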

1.1 The Victim Selection Algorithm

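The Scheduler does not just kill random Pods. It first finds candidate Nodes and then ranks them with a fixed sequence of tie-breakers:

Filtering: It identifies Nodes where evicting a set of lower-priority Pods would allow the pending Pod to fit. On each candidate it tries to spare as many Pods as possible, reprieving higher-priority Pods first.
PDB Awareness: It prefers the Node whose victims cause the fewest PodDisruptionBudget (PDB) violations. This is a "best-effort" check: if no other option exists, Pods are preempted even when their PDB is violated.
Priority Ranking: Among the remaining Nodes, it picks the one whose highest-priority victim has the lowest priority, then the one with the lowest sum of victim priorities.
Minimization: If Nodes are still tied, it picks the one that requires the fewest evictions, and finally the one whose highest-priority victims started most recently.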
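To make the PDB step concrete, here is a minimal sketch of a PodDisruptionBudget for a lower-priority workload. The name `lp-app-pdb` and the `app: lp-app` label are hypothetical. The Scheduler would prefer victims on Nodes where evicting them keeps this budget intact, but because the check is best-effort, the budget does not make these Pods immune to preemption.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: lp-app-pdb            # hypothetical name
spec:
  minAvailable: 2             # try to keep at least 2 matching Pods running
  selector:
    matchLabels:
      app: lp-app             # hypothetical label on the low-priority Pods
```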

1.2 The Nominated Node (`status.nominatedNodeName`)

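Unlike standard scheduling, preemption is a multi-step process:

Nomination: The Scheduler identifies a Node and sets status.nominatedNodeName on the pending Pod.
Termination: The Scheduler deletes the victim Pods through the API Server, and the Kubelet on that Node performs their graceful shutdown (SIGTERM, then SIGKILL once the grace period expires).
Binding: Once the victims have exited, the pending Pod goes through a normal scheduling cycle and is bound. The nominated Node is a reservation hint, not a guarantee: if another Node frees up first, or a higher-priority Pod arrives, the Pod may be bound elsewhere.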

2. PRIORITY VS. QoS (QUALITY OF SERVICE)

Senior Architects must never confuse these two concepts. They are enforced by different components at different lifecycle stages.

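| Feature | Pod Priority | QoS Class |
|---|---|---|
| Controlled By | kube-scheduler | kubelet |
| Stage | Placement (Initial Scheduling) | Runtime (Ongoing Execution) |
| Decision Point | Cluster is full; Pod can't start. | Node is under memory, disk, or PID pressure; Pod must die. |
| Action | Preemption: Kill lower-priority (LP) Pods to start the higher-priority (HP) Pod. | Eviction: Kill Pods, starting with those exceeding their requests, to save the Node. |
| Metric | `spec.priority` (Integer, resolved from the PriorityClass `value`) | `requests` vs `limits` (Logic) |

Note: The split is not absolute. During node-pressure eviction the kubelet also takes Pod priority into account when ranking Pods that exceed their requests.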
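Because the two settings are independent, a single Pod spec carries both. The sketch below is illustrative only: the Pod name and image are hypothetical, and the class name is taken from the hierarchy defined in section 4. `priorityClassName` is what the kube-scheduler acts on, while setting `requests` equal to `limits` for every container is what places the Pod in the Guaranteed QoS class that the kubelet acts on.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: payments-api                               # hypothetical name
spec:
  priorityClassName: tier-0-mission-critical       # scheduling-time priority
  containers:
  - name: api
    image: registry.example.com/payments-api:1.0   # hypothetical image
    resources:
      requests:                                    # requests == limits -> Guaranteed QoS
        cpu: "500m"
        memory: 256Mi
      limits:
        cpu: "500m"
        memory: 256Mi
```

Changing the class name would not change the QoS class, and loosening the limits would not change the priority.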

3. PRIORITY RANGES & RESERVED VALUES

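Kubernetes uses a 32-bit integer for priority. User-defined classes may use any value up to $1 \times 10^9$; larger values are reserved for the system.

| Range | Usage | Example |
|---|---|---|
| $2 \times 10^9$ + | System-Critical | `system-node-critical` (2000001000), `system-cluster-critical` (2000000000) |
| Above $1 \times 10^9$, below $2 \times 10^9$ | Internal Components | Reserved for future K8s core use. |
| $-2^{31}$ to $1 \times 10^9$ | User-Defined | Your application workloads. |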

3.1 Critical System Classes

system-node-critical: Used for Pods that are required for the Node to function (e.g., CNI plugins, Storage drivers).
system-cluster-critical: Used for Pods required for the Cluster to function (e.g., CoreDNS, API Aggregators).
Bible Rule: Never use these for application workloads. Pods at this priority can preempt almost everything else in the cluster, including components the control plane depends on, which can destabilize it.

4. BIBLE-GRADE YAML: PRODUCTION HIERARCHY

This manifest establishes a standard hierarchy: Tier-0 (Critical), Tier-1 (Production), Tier-2 (Batch).

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: tier-0-mission-critical
value: 1000000
globalDefault: false
description: "Mission critical APIs. Will preempt anything else."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: tier-1-production
value: 500000
globalDefault: false
description: "Standard production workloads."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: tier-2-background
value: 10000
globalDefault: true # All pods without a class get this priority
description: "Default for non-essential or dev workloads."
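A PriorityClass has no effect until a Pod references it. The following sketch shows a Deployment opting into the Tier-1 class through `priorityClassName` in its Pod template; the Deployment name, labels, and image are hypothetical. At admission time the class name is resolved into the integer `spec.priority` on each Pod.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api                                 # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: orders-api
  template:
    metadata:
      labels:
        app: orders-api
    spec:
      priorityClassName: tier-1-production         # must match an existing PriorityClass
      containers:
      - name: api
        image: registry.example.com/orders-api:1.0 # hypothetical image
```

If `priorityClassName` were omitted, these Pods would receive the `globalDefault` class (here `tier-2-background`). Referencing a class that does not exist causes the Pod to be rejected.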

4.1 Non-Preempting Priority Classes

Sometimes you want a Pod to go to the front of the queue but you don't want it to kill other Pods (e.g., a "nice-to-have" batch job that should run as soon as space is free).

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-non-preempting
value: 800000
preemptionPolicy: Never # Move to front of line, but don't kick anyone out
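As a usage sketch, a batch Job could reference this class as shown below. The Job name and image are hypothetical. The Pod would be placed ahead of lower-priority Pods in the scheduling queue but would wait for free capacity rather than evict anything; it can still be preempted itself by Pods of higher priority.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: report-builder                                 # hypothetical name
spec:
  template:
    spec:
      priorityClassName: high-priority-non-preempting  # queue-jumping without preemption
      restartPolicy: Never
      containers:
      - name: report
        image: registry.example.com/report-builder:1.0 # hypothetical image
```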

5. RESOURCE QUOTAS & PRIORITY

In multi-tenant clusters, "Priority" is a limited resource. You don't want a "Dev" user creating "Mission-Critical" pods to hijack all cluster resources.

Production Hardening: Use ResourceQuota with scopeSelector to limit who can use which Priority Class.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: critical-pods-quota
  namespace: dev-team
spec:
  hard:
    pods: "0" # Disable all pods in this namespace from using Tier-0
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values: ["tier-0-mission-critical"]
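The same mechanism can grant a bounded budget instead of a flat ban. The sketch below assumes a hypothetical `payments` namespace that is allowed a limited amount of Tier-0 capacity; the numbers are placeholders, not recommendations.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tier-0-budget          # hypothetical name
  namespace: payments          # hypothetical namespace
spec:
  hard:
    pods: "10"                 # at most 10 Tier-0 Pods in this namespace
    requests.cpu: "20"         # combined CPU requests of those Pods
    requests.memory: 40Gi      # combined memory requests of those Pods
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values: ["tier-0-mission-critical"]
```

Quotas are per namespace, so a namespace with no matching quota is not restricted by either manifest.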

6. PRODUCTION PITFALLS & REAL-WORLD WARNINGS

6.1 The "Graceful Termination" Delay

If Pod-A (HP) preempts Pod-B (LP), Pod-A is nominated for the Node but stays Pending and unbound until Pod-B has actually exited, which can take up to Pod-B's terminationGracePeriodSeconds.

Pitfall: If Pod-B has a 5-minute grace period and is slow to handle SIGTERM, your "Critical" Pod-A can sit in Pending for up to 5 minutes.
Architect Recommendation: Keep terminationGracePeriodSeconds low for lower-priority workloads.
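A minimal sketch of that recommendation, assuming a hypothetical background worker that can shut down cleanly within 15 seconds (the Kubernetes default is 30 seconds):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: log-aggregator                               # hypothetical name
spec:
  priorityClassName: tier-2-background
  terminationGracePeriodSeconds: 15                  # upper bound on how long a preemptor waits for this Pod
  containers:
  - name: worker
    image: registry.example.com/log-aggregator:1.0   # hypothetical image
```

The value should still be long enough for the workload to finish or checkpoint in-flight work.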

6.2 Preemption Starvation

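Preemption can become a recurring source of disruption when high-priority Pods are created over and over, for example by a Job that keeps failing and spawning replacement Pods, or by a controller that is scaled up and down aggressively. Each new Pod goes through the Scheduler and may preempt lower-priority Pods again. Note that a container in CrashLoopBackOff is restarted in place by the Kubelet and does not pass through the Scheduler, so on its own it does not cause new preemptions; it does, however, keep holding the capacity its victims were evicted for.

Monitor: kube_pod_status_reason{reason="Evicted"} from kube-state-metrics counts Kubelet evictions, not Scheduler preemptions. For preemption, watch the kube-scheduler metrics scheduler_preemption_attempts_total and scheduler_preemption_victims, along with Preempted events.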

6.3 Interactions with Cluster Autoscaler

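If you use Priority Classes, the Cluster Autoscaler treats Pods below its priority cutoff (the --expendable-pods-priority-cutoff flag, -10 by default) as expendable: they do not trigger a scale-up while Pending and do not block a scale-down.

The "Overprovisioning" Pattern: Create "Pause" pods with a very low priority that is still at or above the cutoff (e.g., -10). They sit in the cluster taking up space. When a real Pod needs room, it preempts a "Pause" pod and can start without waiting for a new Node; the displaced "Pause" pod goes Pending, which prompts the autoscaler to add a Node. This creates a "Buffer" of compute whose size is set by the number and requests of the "Pause" pods.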
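A sketch of the pattern follows. The `overprovisioning` names, the replica count, and the request sizes are assumptions to adapt to your cluster; the priority of -10 assumes the Cluster Autoscaler's default cutoff has not been changed.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning              # hypothetical name
value: -10                            # below every workload class, not below the autoscaler cutoff
globalDefault: false
description: "Placeholder Pods that reserve headroom and are preempted first."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning              # hypothetical name
spec:
  replicas: 2                         # buffer size = replicas x requests
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning
      terminationGracePeriodSeconds: 0   # release the reserved space immediately when preempted
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          requests:                   # the capacity each placeholder reserves
            cpu: "1"
            memory: 1Gi
```

The requests should be roughly the size of the Pods you expect to need room for; placeholders that are too small will not free enough space on a single Node.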

7. TROUBLESHOOTING & NINJA COMMANDS

7.1 Inspecting Nominated Nodes

If a Pod is stuck pending during preemption, check where it wants to go:

kubectl get pod <pod-name> -o custom-columns=NAME:.metadata.name,NOMINATED_NODE:.status.nominatedNodeName

7.2 Auditing Preemption Events

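Preemption is recorded as an event on the victim Pod, in the victim's namespace. Events are retained only for a limited time (one hour by default), so query them soon after the incident.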

kubectl get events -A --field-selector reason=Preempted

Sample Output:

LAST SEEN   TYPE      REASON      OBJECT             MESSAGE
12s         Normal    Preempted   pod/lp-app-xyz     Preempted by default/hp-api-123 on node worker-1

7.3 Checking All Priority Classes

kubectl get pc

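Note: Check the GLOBAL-DEFAULT column to confirm which class, if any, is acting as the fallback. Only one PriorityClass may set globalDefault: true; if none does, Pods without a class get priority 0.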
