Priority Classes: Governing Cluster Entitlement
In a multi-tenant or resource-constrained cluster, not all workloads are equal. A payment processing API is more critical than a nightly log-aggregation Job. Priority Classes allow you to codify this business value into scheduling logic: higher-priority Pods are scheduled first, and when the cluster is full the Scheduler can evict lower-priority "victim" Pods to make room for them.
1. ARCHITECTURE: THE PREEMPTION LOOP
When a Pod is submitted to the API Server, it enters the Scheduling Queue. If the Scheduler cannot find a Node with sufficient resources (CPU/Mem/Ports) for a high-priority Pod, it triggers the Preemption Logic.
1.1 The Victim Selection Algorithm
The Scheduler does not just kill random Pods. It follows a strict "Victim Selection" process:
- Filtering: It identifies Nodes where evicting a set of lower-priority Pods would allow the high-priority Pod to fit, and on each Node it keeps as many of those Pods running as it can.
- PDB Awareness: It prefers the Node whose victims cause the fewest PodDisruptionBudget (PDB) violations. This is a "best-effort" check: if there is no other option, preemption proceeds even when a PDB is violated.
- Priority Ranking: Among the remaining Nodes, it prefers the one whose highest-priority victim has the lowest priority, then the one with the lowest aggregate priority of victims.
- Minimization: If Nodes are still tied, it picks the one that requires the minimum number of evictions.
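To make the PDB check concrete, a lower-priority workload could declare a budget like the one below. This is a minimal sketch: the name, namespace, label, and minAvailable value are hypothetical and not taken from any real cluster.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: batch-worker-pdb      # hypothetical name
  namespace: batch            # hypothetical namespace
spec:
  minAvailable: 2             # assumed value: keep at least 2 replicas running
  selector:
    matchLabels:
      app: batch-worker       # hypothetical label on the low-priority Pods
```

Because the Scheduler treats PDBs as best-effort during preemption, a budget like this steers victim selection toward other Pods but does not make the workload immune to preemption.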
1.2 The Nominated Node (status.nominatedNodeName)
Unlike standard scheduling, preemption is a multi-step process:
- Nomination: The Scheduler identifies a Node and sets status.nominatedNodeName on the high-priority Pod.
- Termination: The Scheduler deletes the victim Pods through the API Server, and the Kubelet on that Node begins their graceful shutdown (SIGTERM).
- Binding: Once the victims have exited, the high-priority Pod goes through a normal scheduling cycle and is bound, usually to the nominated Node. The nomination is not a guarantee: if another Node frees up first, or an even higher-priority Pod arrives, the Pod may be placed elsewhere.
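While the victims are still terminating, the preempting Pod looks roughly like the excerpt below. This is an illustrative, trimmed view of what kubectl get pod -o yaml might return, not captured output; the Pod and Node names reuse the examples that appear elsewhere in this document.

```yaml
# Illustrative excerpt; most fields are omitted
apiVersion: v1
kind: Pod
metadata:
  name: hp-api-123
  namespace: default
spec:
  priorityClassName: tier-0-mission-critical
  priority: 1000000                         # resolved from the PriorityClass at admission
  preemptionPolicy: PreemptLowerPriority    # the default policy
status:
  phase: Pending                            # not bound yet: spec.nodeName is still empty
  nominatedNodeName: worker-1               # where the Scheduler expects room to appear
```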
2. PRIORITY VS. QoS (QUALITY OF SERVICE)
Senior Architects must never confuse these two concepts. They are enforced by different components at different lifecycle stages.
| Feature | Pod Priority | QoS Class |
|---|---|---|
| Controlled By | kube-scheduler | kubelet |
| Stage | Placement (Initial Scheduling) | Runtime (Ongoing Execution) |
| Decision Point | Cluster is full; Pod can't start. | Node is OOM/DiskFull; Pod must die. |
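| Action | Preemption: Evict lower-priority (LP) Pods to start a higher-priority (HP) Pod. | Eviction: Evict Pods (those exceeding their requests first) to save the Node. |
| Metric | PriorityClass value (Integer, copied to spec.priority) | requests vs limits (Logic) |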
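Because the two mechanisms are independent, a single Pod spec can set both. The sketch below assumes the tier-0-mission-critical class defined in section 4; the Pod name, image, and resource sizes are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: payments-api                                # hypothetical name
spec:
  priorityClassName: tier-0-mission-critical        # read by kube-scheduler at placement time
  containers:
  - name: app
    image: registry.example.com/payments-api:1.0    # placeholder image
    resources:
      requests:
        cpu: 500m
        memory: 512Mi
      limits:                                       # limits equal to requests for every container
        cpu: 500m                                   # gives the Pod the Guaranteed QoS class,
        memory: 512Mi                               # which the kubelet uses at runtime
```

A high priority without matching requests and limits still leaves the Pod in a weaker QoS class under node pressure, which is why the two settings are usually chosen together.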
3. PRIORITY RANGES & RESERVED VALUES
Kubernetes uses a 32-bit integer for priority.
| Range | Usage | Example |
|---|---|---|
| $2 \times 10^9$ + | System-Critical | system-cluster-critical (2000000000), system-node-critical (2000001000) |
| Above $1 \times 10^9$, below $2 \times 10^9$ | Internal Components | Reserved for Kubernetes itself; user-defined classes cannot use these values. |
| $-2^{31}$ to $1 \times 10^9$ | User-Defined | Your application workloads. |
3.1 Critical System Classes
- system-node-critical: Used for Pods that are required for the Node to function (e.g., CNI plugins, Storage drivers).
- system-cluster-critical: Used for Pods required for the Cluster to function (e.g., CoreDNS, API Aggregators).
- Bible Rule: Never use these for application workloads. Pods at these priorities can preempt any user workload and are among the last to be evicted under node pressure, so a misbehaving application running at this level can destabilize the control plane.
4. BIBLE-GRADE YAML: PRODUCTION HIERARCHY
This manifest establishes a standard hierarchy: Tier-0 (Critical), Tier-1 (Production), Tier-2 (Batch).
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: tier-0-mission-critical
value: 1000000
globalDefault: false
description: "Mission critical APIs. Will preempt anything else."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: tier-1-production
value: 500000
globalDefault: false
description: "Standard production workloads."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: tier-2-background
value: 10000
globalDefault: true # All pods without a class get this priority
description: "Default for non-essential or dev workloads."
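A workload opts into one of these tiers through priorityClassName in its Pod template. The Deployment below is a minimal sketch: the name, labels, image, and resource requests are placeholders.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-api                    # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: checkout-api
  template:
    metadata:
      labels:
        app: checkout-api
    spec:
      priorityClassName: tier-1-production    # must match an existing PriorityClass
      containers:
      - name: app
        image: registry.example.com/checkout-api:1.0   # placeholder image
        resources:
          requests:
            cpu: 250m
            memory: 256Mi
```

The class is resolved to an integer priority when each Pod is created, so changing the default class later does not alter the priority of Pods that already exist.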
4.1 Non-Preempting Priority Classes
Sometimes you want a Pod to go to the front of the queue but you don't want it to kill other Pods (e.g., a "nice-to-have" batch job that should run as soon as space is free).
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-non-preempting
value: 800000
preemptionPolicy: Never # Move to front of line, but don't kick anyone out
5. RESOURCE QUOTAS & PRIORITY
In multi-tenant clusters, "Priority" is a limited resource. You don't want a "Dev" user creating "Mission-Critical" pods to hijack all cluster resources.
Production Hardening: Use ResourceQuota with scopeSelector to limit who can use which Priority Class.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: critical-pods-quota
  namespace: dev-team
spec:
  hard:
    pods: "0" # Disable all pods in this namespace from using Tier-0
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values: ["tier-0-mission-critical"]
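A quota like this only constrains the namespace it lives in. If you control the API Server configuration, one option is to flip the default so that a Priority Class can be used only in namespaces that carry a matching quota. The fragment below is a sketch of that admission configuration, assuming you can pass it through the API Server's --admission-control-config-file flag, which is often not possible on managed clusters.

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: "ResourceQuota"
  configuration:
    apiVersion: apiserver.config.k8s.io/v1
    kind: ResourceQuotaConfiguration
    limitedResources:
    - resource: pods
      matchScopes:
      - scopeName: PriorityClass
        operator: In
        values: ["tier-0-mission-critical"]   # class that now requires an explicit quota
```

With this in place, a namespace without a ResourceQuota scoped to tier-0-mission-critical cannot create Pods in that class at all, and namespaces that should use it get a quota with a non-zero pods limit.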
6. PRODUCTION PITFALLS & REAL-WORLD WARNINGS
6.1 The "Graceful Termination" Delay
If Pod-A (HP) preempts Pod-B (LP), Pod-A is nominated for the Node but stays Pending until Pod-B has terminated, which can take up to Pod-B's terminationGracePeriodSeconds.
- Pitfall: If Pod-B has a 5-minute grace period, your "Critical" Pod-A can be stuck in Pending for up to 5 minutes.
- Architect Recommendation: Keep terminationGracePeriodSeconds low for lower-priority workloads.
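As an illustration, the nightly log-aggregation Job from the introduction could be declared with a short grace period. The name, schedule, image, and the 15-second value are assumptions chosen for this sketch; pick a value that still lets your workload shut down cleanly.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: log-aggregation                    # hypothetical name
spec:
  schedule: "0 2 * * *"                    # assumed schedule: nightly at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          priorityClassName: tier-2-background
          terminationGracePeriodSeconds: 15   # short, so preemptors are not kept waiting
          restartPolicy: OnFailure
          containers:
          - name: aggregate
            image: registry.example.com/log-aggregator:1.0   # placeholder image
```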
6.2 Preemption Starvation
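If the Pods of a high-priority workload are repeatedly recreated (for example, failing Job Pods, or a Deployment whose Pods keep being evicted and replaced), each new Pod may preempt lower-priority Pods when it is scheduled, causing cluster-wide instability. Container restarts in CrashLoopBackOff happen in place on the same Node and do not trigger another round of preemption by themselves.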
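- Monitor: Watch for high preemption counts in your monitoring system (for example, the kube-scheduler metric scheduler_preemption_attempts_total in Prometheus). Note that the kube-state-metrics series kube_pod_status_reason{reason="Evicted"} tracks Kubelet node-pressure evictions rather than Scheduler preemptions.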
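One way to watch for this is an alert on the preemption counter that kube-scheduler exposes, scheduler_preemption_attempts_total. The Prometheus rule file below is a sketch: it assumes the scheduler's metrics endpoint is already being scraped, and the alert name, threshold, and durations are arbitrary starting points rather than recommended values.

```yaml
groups:
- name: scheduler-preemption             # hypothetical group name
  rules:
  - alert: HighPreemptionRate            # hypothetical alert name
    # Assumed threshold: more than 0.1 preemption attempts per second, sustained
    expr: sum(rate(scheduler_preemption_attempts_total[10m])) > 0.1
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "kube-scheduler is preempting Pods at a sustained high rate"
```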
6.3 Interactions with Cluster Autoscaler
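If you use Priority Classes, the Cluster Autoscaler treats Pods whose priority is below its --expendable-pods-priority-cutoff (default -10) as expendable: they do not trigger a scale-up and do not block a scale-down.
- The "Overprovisioning" Pattern: Create "Pause" pods with a very low priority that is still at or above the cutoff (e.g., -10). They will sit in the cluster taking up space. When a real Pod needs room, it preempts the "Pause" pod, which returns to Pending and prompts the Cluster Autoscaler to add a Node. This creates a "Buffer" of compute that is always available.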
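The overprovisioning pattern can be sketched with one PriorityClass and one Deployment of pause containers. The names, replica count, and resource requests below are assumptions; size the requests to the amount of headroom you want to keep free.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning                 # hypothetical name
value: -10                               # at the autoscaler's default cutoff, below every real workload
globalDefault: false
description: "Placeholder pods that reserve headroom and yield to any real workload."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning                 # hypothetical name
spec:
  replicas: 2                            # assumed buffer: two placeholder Pods
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: "1"                     # assumed size of each reserved slot
            memory: 1Gi
```

The buffer equals replicas multiplied by the per-Pod requests, so the cost of the pattern is the idle capacity those placeholder Pods hold.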
7. TROUBLESHOOTING & NINJA COMMANDS
7.1 Inspecting Nominated Nodes
If a Pod is stuck pending during preemption, check where it wants to go:
kubectl get pod <pod-name> -o custom-columns=NAME:.metadata.name,NOMINATED_NODE:.status.nominatedNodeName
7.2 Auditing Preemption Events
Preemption is recorded as an event in the victim Pod's namespace.
kubectl get events -A --field-selector reason=Preempted
Sample Output:
LAST SEEN TYPE REASON OBJECT MESSAGE
12s Normal Preempted pod/lp-app-xyz Preempted by default/hp-api-123 on node worker-1
7.3 Checking All Priority Classes
kubectl get pc
Note: Look for the GLOBAL-DEFAULT column to ensure only one class is acting as the fallback.