
Scheduler Internals & Static Pods

1. THE KUBERNETES SCHEDULER

The kube-scheduler is the default scheduler and a core control plane component. It watches for newly created Pods that have no Node assigned and is responsible for finding the "best" Node for each one.

A. The Scheduling Loop (Internals)

The scheduler does not just pick a random node. It follows a strict two-phase cycle for every unscheduled Pod:

1. Filtering (Predicates) - "Can it run here?" The scheduler removes nodes that do not meet hard constraints.

  • Resources: Does the node have enough free CPU/RAM? (NodeResourcesFit)
  • Taints: Does the pod have the matching toleration? (TaintToleration)
  • Affinity: Does the node match nodeAffinity rules? (NodeAffinity)
  • Ports: Is the requested hostPort available?

2. Scoring (Priorities) - "Should it run here?" The scheduler ranks the remaining eligible nodes to find the best fit.

  • Image Locality: Does the node already have the container image? (Saves bandwidth).
  • Least Allocated: Prefer nodes with fewer requested resources (spreading workloads).
  • Most Allocated: Prefer nodes with more requested resources (bin packing). Note: scoring is based on resource requests, not actual live load.

Flow: [ Pod Created ] -> [ API Server ] -> [ Scheduler (Filter -> Score) ] -> [ API Server (Bind) ] -> [ Kubelet (Run) ]
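
Both phases are pluggable. A hedged sketch of a KubeSchedulerConfiguration that switches the default spreading behavior to bin packing (field names from the kubescheduler.config.k8s.io/v1 API; the weights here are illustrative, not recommendations):

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated        # bin packing; the default is LeastAllocated (spread)
        resources:
        - name: cpu
          weight: 1
        - name: memory
          weight: 1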


2. MANUAL SCHEDULING (Bypassing the Scheduler)

Sometimes, you need to force a Pod onto a specific node, regardless of taints or resource calculations. You can bypass the Scheduler entirely by setting the nodeName field.

A. The Mechanism

When you set .spec.nodeName:

  1. The Scheduler ignores the Pod (because it looks like it is already scheduled).
  2. The Kubelet on the target node sees the Pod assigned to itself and starts it immediately.

B. Critical Warning

Bypassing Taints: Because the Scheduler is skipped, taints and tolerations are never evaluated. If you assign a Pod to a node via nodeName, it will run even if that node carries NoSchedule taints (as control-plane nodes do by default).
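
If the goal is to land on a tainted node without losing the Scheduler's filtering and resource checks, the supported alternative is a toleration. A sketch, assuming the standard control-plane taint key used by kubeadm clusters (the Pod name is hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: tolerant-debugger
spec:
  # Scheduler still runs: filtering, scoring, and resource fit all apply
  tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
  containers:
  - name: nginx
    image: nginx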

C. Manifest Implementation

apiVersion: v1
kind: Pod
metadata:
  name: sysadmin-debugger
spec:
  # HARD ASSIGNMENT: skips scheduler, affinity, and taints
  nodeName: worker-node-01
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"

D. Verification

# Verify where the pod landed
kubectl get pod sysadmin-debugger -o wide

Output:

NAME                READY   STATUS    RESTARTS   AGE   IP           NODE
sysadmin-debugger   1/1     Running   0          5s    10.244.1.5   worker-node-01

3. STATIC PODS (Kubelet-Managed)

Static Pods are the exception to the rule. They are not managed by the API Server or the Scheduler. They are managed directly by the Kubelet daemon on a specific node.

A. Architecture

  • Source of Truth: Files in a specific directory on the Node (default: /etc/kubernetes/manifests).
  • Manager: The Kubelet watches this directory. If you add a file, Kubelet creates the Pod. If you delete the file, Kubelet kills the Pod.
  • Resilience: Static Pods keep running (and are restarted by the Kubelet) even if the entire Control Plane (API Server/etcd) is down.
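
The directory comes from the Kubelet's own configuration file, not from the API Server. An excerpt sketch (field names from the kubelet.config.k8s.io/v1beta1 API; the path varies by distribution):

# /var/lib/kubelet/config.yaml (excerpt)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
staticPodPath: /etc/kubernetes/manifests   # the Kubelet watches this directory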

B. The "Mirror Pod"

The API Server needs to know these Pods exist so users can see them.

  1. Kubelet starts the Static Pod via the container runtime (e.g., containerd).
  2. Kubelet sends a request to API Server to create a Mirror Pod.
  3. Read-Only: You cannot edit a Mirror Pod. If you delete it via kubectl, the Kubelet immediately recreates it because the file still exists on disk.
  4. Naming Convention: <pod-name>-<node-name>
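
The naming rule is simply the Pod name from the manifest with the node name appended. A hypothetical illustration (the pod and node names are made up, not from any real cluster):

```shell
#!/bin/sh
# Mirror pod name = <pod-name-from-manifest>-<node-name>
POD_NAME="node-guardian"        # metadata.name in the static pod file
NODE_NAME="worker-1"            # the node whose Kubelet owns the file
MIRROR_POD="${POD_NAME}-${NODE_NAME}"
echo "$MIRROR_POD"              # node-guardian-worker-1
```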

C. Use Case: Bootstrapping the Control Plane

How does Kubernetes start Kubernetes? If you check /etc/kubernetes/manifests on a Control Plane node, you will see:

  • etcd.yaml
  • kube-apiserver.yaml
  • kube-controller-manager.yaml
  • kube-scheduler.yaml

These core components run as Static Pods. This is why kubectl get pods -n kube-system shows pods named etcd-control-plane.

D. Workflow: Creating a Static Pod

1. Locate the Manifest Directory: Check the Kubelet config (usually /var/lib/kubelet/config.yaml) for the staticPodPath field. Standard path is /etc/kubernetes/manifests.
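
Step 1 can be checked without opening the whole file. Sketched here against a temporary sample config so the commands run anywhere; on a real node you would point grep at /var/lib/kubelet/config.yaml instead:

```shell
#!/bin/sh
# Simulate a kubelet config file, then extract staticPodPath from it.
CONFIG=$(mktemp)
cat <<EOF > "$CONFIG"
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
staticPodPath: /etc/kubernetes/manifests
EOF
STATIC_PATH=$(grep '^staticPodPath:' "$CONFIG" | awk '{print $2}')
echo "$STATIC_PATH"             # /etc/kubernetes/manifests
rm -f "$CONFIG"
```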

2. Create the File (SSH into Node required):

# Run this ON THE NODE (worker-1)
cat <<EOF > /etc/kubernetes/manifests/site-reliability.yaml
apiVersion: v1
kind: Pod
metadata:
  name: node-guardian
  namespace: default
spec:
  containers:
  - name: nginx
    image: nginx:alpine
    ports:
    - containerPort: 80
EOF

3. Verify via Kubectl: Go back to your workstation.

kubectl get pods

Output:

NAME                     READY   STATUS    RESTARTS   AGE
node-guardian-worker-1   1/1     Running   0          10s

4. Attempting to Delete (The Failure):

kubectl delete pod node-guardian-worker-1
# Output: pod "node-guardian-worker-1" deleted

Wait 2 seconds...

kubectl get pods
# Output: node-guardian-worker-1   0/1   Pending   0   1s

Result: The Mirror Pod was deleted, but the Kubelet realized the file /etc/kubernetes/manifests/site-reliability.yaml still exists, so it restarted the pod and recreated the mirror.

5. The Only Way to Delete: You must SSH into the node and remove the file.

# On the node
rm /etc/kubernetes/manifests/site-reliability.yaml

4. DAEMONSETS VS. STATIC PODS

It is easy to confuse these two concepts as they both run pods on nodes.

Feature           Static Pod                         DaemonSet
Controlled By     Kubelet (local node)               DaemonSet controller (API Server)
Scheduling        None (file presence = run)         Scheduler (respects taints/affinity)
Update Strategy   Manual file edit on every node     RollingUpdate via API
Use Case          Bootstrapping (etcd, API Server)   Cluster logs (Fluentd), networking (CNI)
Visibility        Mirror Pod                         Standard Pod

5. TROUBLESHOOTING CHEATSHEET

  1. Pod is Pending but nodeName is set?
    • The Node name you typed does not exist. The Scheduler isn't checking, but the Kubelet on "ghost-node" isn't there to start it.
  2. Cannot delete a Pod?
    • Check the name. Does it end with the node name? It's likely a Static Pod. Find the manifest on the node.
  3. Static Pod not starting?
    • Check Kubelet logs on the node: journalctl -u kubelet -f.
    • Syntax errors in the YAML file in /etc/kubernetes/manifests will appear here.