
Priority Classes: Governing Cluster Entitlement

In a multi-tenant or resource-constrained cluster, not all workloads are equal. A payment processing API is more critical than a nightly log-aggregation Job. Priority Classes let you codify this business value into scheduling logic: mission-critical Pods are scheduled first and, when the cluster is full, can be placed by evicting lower-priority "victim" Pods.


1. ARCHITECTURE: THE PREEMPTION LOOP

When a Pod is submitted to the API Server, it enters the Scheduling Queue. If the Scheduler cannot find a Node with sufficient resources (CPU/Mem/Ports) for a high-priority Pod, it triggers the Preemption Logic.

1.1 The Victim Selection Algorithm

The Scheduler does not just kill random Pods. It follows a strict "Victim Selection" process:

  1. Filtering: It identifies candidate Nodes where evicting a set of lower-priority Pods would allow the high-priority Pod to fit. Victims are always on the same Node the high-priority Pod will use.
  2. PDB Awareness: Among the candidates, it prefers the Node whose victims cause the fewest PodDisruptionBudget (PDB) violations. This is a "best-effort" check: if no other candidate exists, victims are evicted even when a PDB is violated.
  3. Priority Ranking: If candidates are tied, it prefers the Node whose highest-priority victim has the lowest priority, then the Node with the lowest aggregate priority of victims.
  4. Minimization: If candidates are still tied, it picks the Node that requires the minimum number of evictions.
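
The PDB check only has something to work with if the lower-priority workload declares a budget. A minimal sketch, assuming a hypothetical low-priority app labelled `app: lp-app` that should keep at least two replicas running; the name, namespace, and label are illustrative:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: lp-app-pdb            # hypothetical name
  namespace: default
spec:
  minAvailable: 2             # the scheduler tries, but does not promise, to keep 2 Pods up
  selector:
    matchLabels:
      app: lp-app             # hypothetical label on the low-priority Pods
```

Because the check is best-effort, a PDB lowers the odds that these Pods are chosen as victims but does not make them immune to preemption.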

1.2 The Nominated Node (status.nominatedNodeName)

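Unlike standard scheduling, preemption is a multi-step process:

  1. Nomination: The Scheduler identifies a Node and sets the status.nominatedNodeName on the high-priority Pod.
  2. Termination: The Scheduler deletes the victim Pods through the API Server, and the Kubelet on that Node begins their graceful shutdown (SIGTERM, followed by SIGKILL once the grace period expires).
  3. Binding: Once the victims have exited, the high-priority Pod is scheduled again and, in the usual case, bound to the nominated Node. The nomination is not a guarantee: if another Node frees up first, or an even higher-priority Pod arrives, the Scheduler may place the Pod elsewhere.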
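
Between nomination and binding, the preemptor is still Pending and has no spec.nodeName. The excerpt below is an illustrative sketch of what `kubectl get pod -o yaml` might show at that moment, trimmed to the relevant fields; the Pod name, node name, and priority class are assumptions chosen to match the examples elsewhere in this guide:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hp-api-123                  # illustrative Pod name
  namespace: default
spec:
  priorityClassName: tier-0-mission-critical
  priority: 1000000                 # resolved from the PriorityClass when the Pod is admitted
  # nodeName is absent: the Pod has been nominated, not bound
status:
  phase: Pending
  nominatedNodeName: worker-1       # Node where victims are being terminated
```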

2. PRIORITY VS. QoS (QUALITY OF SERVICE)

Senior Architects must never confuse these two concepts. They are enforced by different components at different lifecycle stages.

| Feature | Pod Priority | QoS Class |
|---|---|---|
| Controlled By | kube-scheduler | kubelet |
| Stage | Placement (Initial Scheduling) | Runtime (Ongoing Execution) |
| Decision Point | Cluster is full; Pod can't start. | Node is OOM/DiskFull; Pod must die. |
| Action | Preemption: Kill LP Pods to start HP Pod. | Eviction: Kill unstable Pods to save the Node. |
| Metric | PriorityClass value (Integer) | requests vs limits (Logic) |
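
Because the two mechanisms are independent, a single Pod spec sets both. A minimal sketch, assuming a PriorityClass named tier-1-production exists (one is defined in section 4); the Pod name and image are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: payments-api                      # hypothetical name
spec:
  priorityClassName: tier-1-production    # read by kube-scheduler: queue order and preemption
  containers:
  - name: api
    image: registry.example.com/payments-api:1.0   # placeholder image
    resources:
      requests:                           # requests == limits for every container
        cpu: 500m                         # gives the Pod the Guaranteed QoS class,
        memory: 256Mi                     # which the kubelet considers under node pressure
      limits:
        cpu: 500m
        memory: 256Mi
```

Changing priorityClassName does not change the QoS class, and changing requests or limits does not change the priority.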

3. PRIORITY RANGES & RESERVED VALUES

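Kubernetes uses a 32-bit integer for priority. User-defined Priority Classes may use any value up to $1 \times 10^9$ (one billion); larger values are reserved for system use.

| Range | Usage | Example |
|---|---|---|
| $2 \times 10^9$ and above | System-Critical | system-cluster-critical (2000000000), system-node-critical (2000001000) |
| Above $1 \times 10^9$, below $2 \times 10^9$ | Internal Components | Reserved for K8s core use; not available to user-defined classes. |
| $-2^{31}$ to $1 \times 10^9$ | User-Defined | Your application workloads. |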

3.1 Critical System Classes

  • system-node-critical: Used for Pods that are required for the Node to function (e.g., CNI plugins, Storage drivers).
  • system-cluster-critical: Used for Pods required for the Cluster to function (e.g., CoreDNS, API Aggregators).
  • Bible Rule: Never use these for application workloads. An application at this priority can preempt the very components the Node or cluster depends on, which can destabilize the control plane.

4. BIBLE-GRADE YAML: PRODUCTION HIERARCHY

This manifest establishes a standard hierarchy: Tier-0 (Critical), Tier-1 (Production), Tier-2 (Batch).

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: tier-0-mission-critical
value: 1000000
globalDefault: false
description: "Mission critical APIs. Will preempt anything else."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: tier-1-production
value: 500000
globalDefault: false
description: "Standard production workloads."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: tier-2-background
value: 10000
globalDefault: true # All pods without a class get this priority
description: "Default for non-essential or dev workloads."
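
A Priority Class has no effect until a workload references it through spec.priorityClassName in its Pod template. A minimal sketch, assuming a hypothetical Deployment named payments-api with a placeholder image:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api                 # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      priorityClassName: tier-0-mission-critical   # must match an existing PriorityClass
      containers:
      - name: api
        image: registry.example.com/payments-api:1.0   # placeholder image
        resources:
          requests:
            cpu: 500m
            memory: 256Mi
```

If the named class does not exist, the Pods are rejected at admission, so apply the PriorityClass manifests before the workloads that use them.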

4.1 Non-Preempting Priority Classes

Sometimes you want a Pod to go to the front of the queue but you don't want it to kill other Pods (e.g., a "nice-to-have" batch job that should run as soon as space is free).

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-non-preempting
value: 800000
preemptionPolicy: Never # Move to front of line, but don't kick anyone out
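
As a sketch of how such a class might be used, the hypothetical Job below jumps ahead of lower-priority Pods in the scheduling queue but waits for free capacity instead of evicting anything. The Job name, image, and resource figures are assumptions:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: report-builder               # hypothetical name
spec:
  template:
    spec:
      priorityClassName: high-priority-non-preempting
      restartPolicy: Never
      containers:
      - name: report
        image: registry.example.com/report-builder:1.0   # placeholder image
        resources:
          requests:
            cpu: "1"
            memory: 1Gi
```

Note that preemptionPolicy: Never only stops this Pod from preempting others; a higher-priority preempting Pod can still evict it.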

5. RESOURCE QUOTAS & PRIORITY

In multi-tenant clusters, "Priority" is a limited resource. You don't want a "Dev" user creating "Mission-Critical" pods to hijack all cluster resources.

Production Hardening: Use ResourceQuota with scopeSelector to limit who can use which Priority Class.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: critical-pods-quota
  namespace: dev-team
spec:
  hard:
    pods: "0" # Disable all pods in this namespace from using Tier-0
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values: ["tier-0-mission-critical"]
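
The same scope can grant a bounded allowance instead of an outright ban. A minimal sketch, assuming a hypothetical payments namespace that is allowed a limited amount of Tier-0 capacity; the figures are illustrative, not recommendations:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tier-0-allowance             # hypothetical name
  namespace: payments                # hypothetical namespace
spec:
  hard:
    pods: "10"                       # at most 10 Tier-0 Pods in this namespace
    requests.cpu: "20"               # combined CPU requests of those Pods
    requests.memory: 40Gi            # combined memory requests of those Pods
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values: ["tier-0-mission-critical"]
```

The quota counts only Pods that reference the listed Priority Class, so Pods of other classes in the same namespace are unaffected by it.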

6. PRODUCTION PITFALLS & REAL-WORLD WARNINGS

6.1 The "Graceful Termination" Delay

If Pod-A (HP) preempts Pod-B (LP), Pod-A is nominated for the node but cannot be bound or started until Pod-B has terminated, which can take as long as Pod-B's terminationGracePeriodSeconds.

  • Pitfall: If Pod-B has a 5-minute grace period and is slow to exit, your "Critical" Pod-A can be stuck in Pending for up to 5 minutes.
  • Architect Recommendation: Keep terminationGracePeriodSeconds low for lower-priority workloads.
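
A minimal sketch of that recommendation, assuming a hypothetical background worker that uses the tier-2-background class from section 4; the 15-second value is an illustrative choice, not a rule:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: log-aggregator               # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: log-aggregator
  template:
    metadata:
      labels:
        app: log-aggregator
    spec:
      priorityClassName: tier-2-background
      terminationGracePeriodSeconds: 15   # upper bound on how long a preemptor waits for this Pod
      containers:
      - name: worker
        image: registry.example.com/log-aggregator:1.0   # placeholder image
```

The application still has to handle SIGTERM and exit promptly; the grace period only caps the wait before the container is killed.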

6.2 Preemption Starvation

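If a high-priority workload keeps creating new Pods (for example, a Job whose failed Pods are replaced, or a controller that repeatedly deletes and recreates its Pods), every new Pod passes through the Scheduler again and may preempt lower-priority Pods each time, causing cluster-wide instability. A container that restarts in place under CrashLoopBackOff stays bound to its Node and does not trigger new preemptions by itself.

  • Monitor: Watch the Scheduler's preemption metrics in your monitoring system (Prometheus scheduler_preemption_attempts_total and scheduler_preemption_victims) alongside Preempted events. Note that kube_pod_status_reason{reason="Evicted"} reflects Kubelet node-pressure evictions rather than Scheduler preemptions.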
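
One way to turn this into an alert is a Prometheus rule on the scheduler's own preemption counter. This is a sketch under assumptions: it presumes your Prometheus scrapes kube-scheduler metrics (often not the case on managed control planes), and the alert name, threshold, and durations are placeholders to tune:

```yaml
groups:
- name: scheduler-preemption
  rules:
  - alert: HighPreemptionRate        # hypothetical alert name
    # scheduler_preemption_attempts_total is a counter exposed by kube-scheduler
    expr: sum(rate(scheduler_preemption_attempts_total[10m])) > 0.1   # placeholder threshold
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "kube-scheduler is preempting Pods at a sustained rate"
```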

6.3 Interactions with Cluster Autoscaler

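The Cluster Autoscaler treats Pods whose priority is below its expendable cutoff (the --expendable-pods-priority-cutoff flag, -10 by default) as expendable: they do not trigger a scale-up and do not block a scale-down.

  • The "Overprovisioning" Pattern: Create "Pause" pods with a very low priority that is still at or above the cutoff (e.g., -10), with resource requests sized to the headroom you want. They sit in the cluster reserving space. When a real Pod needs room, it preempts a "Pause" pod; the displaced "Pause" pod goes Pending, which prompts the Cluster Autoscaler to add a Node. This creates a "Buffer" of compute that real workloads can claim immediately.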
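
A minimal sketch of the pattern, assuming the default cutoff of -10 and a buffer of two placeholder Pods that each reserve 1 CPU and 1Gi; the names, replica count, and sizes are illustrative:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning             # hypothetical name
value: -10                           # below every real workload, not below the default cutoff
globalDefault: false
description: "Placeholder Pods that any real workload may preempt."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 2                        # size of the buffer, in placeholder Pods
  selector:
    matchLabels:
      run: overprovisioning
  template:
    metadata:
      labels:
        run: overprovisioning
    spec:
      priorityClassName: overprovisioning
      terminationGracePeriodSeconds: 0   # free the space immediately when preempted
      containers:
      - name: reserve-resources
        image: registry.k8s.io/pause:3.9
        resources:
          requests:                  # the capacity each placeholder holds open
            cpu: "1"
            memory: 1Gi
```

Size the requests to the typical Pod you expect to land in the buffer, since a placeholder can only make room for workloads that fit in the space it vacates.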

7. TROUBLESHOOTING & NINJA COMMANDS

7.1 Inspecting Nominated Nodes

If a Pod is stuck pending during preemption, check where it wants to go:

kubectl get pod <pod-name> -o custom-columns=NAME:.metadata.name,NOMINATED_NODE:.status.nominatedNodeName

7.2 Auditing Preemption Events

Preemption is recorded as an event on the victim Pod, in the victim's namespace.

kubectl get events -A --field-selector reason=Preempted

Sample Output:

LAST SEEN   TYPE     REASON      OBJECT           MESSAGE
12s         Normal   Preempted   pod/lp-app-xyz   Preempted by default/hp-api-123 on node worker-1

7.3 Checking All Priority Classes

kubectl get pc

Note: Look for the GLOBAL-DEFAULT column to ensure only one class is acting as the fallback.