Resource Management: Limits, Quotas & cgroups
In Kubernetes, managing compute resources is not just about avoiding node crashes; it is about predictable scheduling, preventing "noisy neighbors," and determining which Pods survive when a Node runs out of memory.
1. CORE CONCEPTS: REQUESTS & LIMITS
Every container in a Pod can specify two key parameters for CPU and Memory: Requests and Limits.
1.1 CPU and Memory Units
- CPU (Compressible Resource):
- Measured in cores or millicores (m).
- 1 CPU = 1 AWS vCPU = 1 GCP Core = 1 Intel Hyperthread.
- `1000m` = 1 CPU; `250m` = 1/4 of a CPU core.
- Memory (Incompressible Resource):
- Measured in bytes, usually expressed in Mebibytes (Mi) or Gibibytes (Gi).
- Bible Rule: Always use power-of-two units (`Mi`, `Gi`). Do not use `M` or `G` (power-of-ten): `512Mi` is 536,870,912 bytes, while `512M` is only 512,000,000 bytes, so mixing them causes a ~5% drift between what you think you requested and what the Linux kernel actually allocates.
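To make the drift concrete, here is a small sketch of the same memory request written both ways (the surrounding container is omitted):

```yaml
# Illustrative resources fragment: binary vs decimal memory units
resources:
  requests:
    memory: "512Mi"   # 512 * 1024^2 = 536,870,912 bytes (what you usually mean)
    # memory: "512M"  # 512 * 1000^2 = 512,000,000 bytes (~23Mi less)
```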
1.2 The "Request" (The Guarantee)
- Who uses it? The Kube-Scheduler.
- What does it do? It determines node placement. If a Pod requests `500m` CPU, the Scheduler finds a Node with at least `500m` of unallocated CPU capacity.
- Linux Implementation: Translates to `cpu.shares` in cgroups v1 (`cpu.weight` in cgroups v2). During CPU contention, this guarantees the container at least its proportional share of CPU time.
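As a sketch of the cgroup v1 translation, the kubelet scales shares at 1024 per core (shares = milliCPU × 1024 / 1000):

```yaml
# Sketch: how a CPU request maps to cgroup v1 cpu.shares
resources:
  requests:
    cpu: "500m"   # -> cpu.shares = 512
    # cpu: "2"    # -> cpu.shares = 2048
```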
1.3 The "Limit" (The Ceiling)
- Who uses it? The Kubelet (via Linux cgroups) on the Node.
- What does it do? It caps the maximum amount of resources the container can use.
1.4 What happens when you exceed them?
This is the most critical matrix in Kubernetes resource management:
| Resource | Exceeding Request | Exceeding Limit |
|---|---|---|
| CPU | Allowed (if node has idle CPU cycles). | Throttled. The container is paused by the Linux Completely Fair Scheduler (CFS) bandwidth controller until the next period. It does not crash, but latency spikes. |
| Memory | Allowed (if node has free memory). | OOMKilled. The Linux kernel instantly kills the primary process with Exit Code 137 (SIGKILL). Kubelet will restart it (CrashLoopBackOff). |
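The CPU throttling above comes from CFS bandwidth control: with the default 100ms period, the limit is converted into a quota of runnable microseconds per period. A rough sketch of the mapping:

```yaml
# Sketch: how a CPU limit maps to cgroup v1 CFS bandwidth
# (cpu.cfs_period_us = 100000 by default; quota = limit_in_cores * period)
resources:
  limits:
    cpu: "500m"   # -> cpu.cfs_quota_us = 50000 (runnable 50ms out of every 100ms)
```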
2. QUALITY OF SERVICE (QoS) CLASSES
Kubernetes automatically assigns a QoS class to every Pod based on its Requests and Limits. This class dictates which Pod gets evicted first when a Node experiences Memory Pressure.
| QoS Class | Condition | Eviction Priority |
|---|---|---|
| Guaranteed | Every container sets Requests EQUAL to Limits for both CPU and Memory. | Lowest Priority (Safest). These are the last pods to be killed. Use for production databases/APIs. |
| Burstable | Requests are LESS THAN Limits, or only some containers have them set. | Medium Priority. Killed if the node runs out of memory and no BestEffort pods exist. |
| BestEffort | NO Requests and NO Limits are set anywhere in the Pod. | Highest Priority (First to die). Under memory pressure, the kubelet evicts these first and the kernel's OOM killer targets them first. |
Production Pod Manifest (Guaranteed QoS)
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: payment-api
spec:
  containers:
  - name: app
    image: enterprise/payment:v2
    resources:
      # Requests == Limits ensures "Guaranteed" QoS
      requests:
        memory: "1Gi"
        cpu: "1000m"
      limits:
        memory: "1Gi"
        cpu: "1000m"
```
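For contrast, a minimal Burstable spec sets Requests below Limits. The name and image here are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: report-worker          # illustrative name
spec:
  containers:
  - name: app
    image: enterprise/report:v1   # illustrative image
    resources:
      requests:
        memory: "256Mi"   # the Scheduler places the Pod using these...
        cpu: "250m"
      limits:
        memory: "1Gi"     # ...but the container may burst up to these
        cpu: "1000m"
# status.qosClass will be "Burstable"
```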
3. LIMITRANGE (Namespace Defaulting)
The Problem: Developers forget to set Requests and Limits. If a developer deploys a BestEffort pod with a memory leak, it can crash the Node.
The Solution: A LimitRange automatically injects default CPU/Memory requests and limits into Pods created in a specific namespace. It can also enforce minimum and maximum sizes.
LimitRange Architecture
- Scope: Namespace-level.
- Enforcement Agent: Admission Controller.
- Behavior: If a Pod is submitted without a resources block, the LimitRange admission controller mutates the Pod spec to inject the `default` values before it is persisted to etcd.
Production LimitRange Manifest
```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: standard-tier-limits
  namespace: dev-team-alpha
spec:
  limits:
  - type: Container
    # 1. Hard Boundaries (Admission will REJECT pods outside these bounds)
    max:
      cpu: "2"
      memory: "4Gi"
    min:
      cpu: "100m"
      memory: "128Mi"
    # 2. Defaults (Injected if developer forgets to specify)
    default:
      cpu: "500m"       # Default Limit
      memory: "512Mi"   # Default Limit
    defaultRequest:
      cpu: "250m"       # Default Request
      memory: "256Mi"   # Default Request
    # 3. Ratio Control (Prevents someone requesting 100m but setting limit to 10 CPU)
    maxLimitRequestRatio:
      cpu: "4"
```
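Assuming a LimitRange with the defaults above is in place, a bare container submitted to the namespace is persisted with the defaults injected, roughly like this (the image name is illustrative):

```yaml
# Submitted:
# containers:
# - name: app
#   image: myapp:v1          # no resources block at all
#
# Persisted after LimitRange admission, approximately:
containers:
- name: app
  image: myapp:v1
  resources:
    requests:
      cpu: "250m"       # from defaultRequest
      memory: "256Mi"   # from defaultRequest
    limits:
      cpu: "500m"       # from default
      memory: "512Mi"   # from default
```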
4. RESOURCEQUOTA (Namespace Budgeting)
The Problem: A LimitRange ensures individual Pods are sized correctly, but what stops a developer from deploying 1,000 correctly-sized Pods and bankrupting your AWS account?
The Solution: A ResourceQuota puts a hard ceiling on the aggregate total of resources that can exist in a Namespace.
ResourceQuota Architecture
- Scope: Namespace-level.
- Enforcement Agent: Admission Controller.
- Behavior: If a new Pod would cause the Namespace to exceed its quota, the API Server rejects it with an HTTP 403 Forbidden (`Exceeded quota`).
Production ResourceQuota Manifest
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: dev-team-alpha-budget
  namespace: dev-team-alpha
spec:
  hard:
    # Compute Quotas
    requests.cpu: "20"            # Max 20 total CPU requested across all pods
    requests.memory: "40Gi"
    limits.cpu: "40"
    limits.memory: "80Gi"
    # Object Count Quotas
    pods: "50"                    # Max 50 pods total
    services.loadbalancers: "2"   # Max 2 expensive cloud LoadBalancers
    persistentvolumeclaims: "10"
```
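Quotas can also be scoped to a QoS class. As a hedged sketch (the quota name is illustrative), this caps how many BestEffort pods can exist in the namespace at all:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: besteffort-cap   # illustrative name
  namespace: dev-team-alpha
spec:
  hard:
    pods: "5"            # at most 5 BestEffort pods
  scopes:
  - BestEffort           # this quota only counts BestEffort pods
```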
Quota Verification & Debugging
To see how much of the budget a team is using:
```bash
kubectl describe quota dev-team-alpha-budget -n dev-team-alpha
```
Sample Output:
```
Name:            dev-team-alpha-budget
Namespace:       dev-team-alpha
Resource         Used  Hard
--------         ----  ----
limits.cpu       12    40
limits.memory    24Gi  80Gi
pods             15    50
requests.cpu     6     20
requests.memory  12Gi  40Gi
```
5. MONITORING & TROUBLESHOOTING CHEATSHEET
5.1 Checking Live Usage (kubectl top)
The kubectl top command relies on the Metrics Server (which aggregates data from the Kubelet's cAdvisor).
Note: This shows actual usage, not the requested amount.
```bash
# See which Nodes are hottest
kubectl top nodes

# See which Pods are consuming the most CPU
kubectl top pods --sort-by=cpu -A

# See which Containers inside a specific Pod are eating memory
kubectl top pod my-database --containers
```
5.2 Debugging "Pending" Pods (Insufficient CPU/Memory)
If a Pod is stuck in Pending, the Scheduler cannot find a node with enough unallocated Requests.
- Check Pod Events:

```bash
kubectl describe pod <pod-name> | grep -A 5 Events:
# Output: Warning  FailedScheduling  0/5 nodes are available: 5 Insufficient cpu.
```

- Check Node Allocation:

```bash
kubectl describe node worker-01 | grep -A 8 "Allocated resources:"
```

Sample Output:

```
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource  Requests    Limits
  --------  --------    ------
  cpu       950m (95%)  2500m (250%)
  memory    6Gi (80%)   12Gi (150%)
```

Notice in the output above: the Node is "overcommitted" on Limits (250%), which is fine. But it is at 95% of CPU Requests. If your new pod requests `100m` CPU, it will not fit on this node.
5.3 Debugging OOMKills
If a Pod keeps restarting, check if the kernel killed it for memory usage:
```bash
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# If it returns "OOMKilled", raise the memory Limit in the Deployment YAML (or fix the leak).
```
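The usual fix is a Deployment fragment along these lines (a sketch; names reuse the earlier example): raise the memory Limit, and keep Requests equal to Limits if you want to preserve Guaranteed QoS:

```yaml
# Fragment of a Deployment's pod template; names are illustrative
containers:
- name: app
  image: enterprise/payment:v2
  resources:
    requests:
      memory: "2Gi"   # was 1Gi; raised after observing OOMKills
      cpu: "1000m"
    limits:
      memory: "2Gi"   # keep equal to requests for Guaranteed QoS
      cpu: "1000m"
```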