Pod Security: Contexts, Capabilities, and Standards

In the Kubernetes shared-responsibility model, the cluster provides orchestration, while the Pod's security context defines the boundary between the container and the host kernel. Hardening this boundary is a primary defense against container breakouts and lateral movement.


1. ARCHITECTURE: KERNEL-LEVEL ENFORCEMENT

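When you define a securityContext, the API server validates and stores it as part of the Pod spec. The kubelet on the scheduled node then passes those settings through the Container Runtime Interface (CRI) to the container runtime, which applies the corresponding Linux kernel features (user IDs, capabilities, seccomp filters, no_new_privs) as it creates the container process.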

1.1 The Security Boundary Layers

  • Namespaces: Isolate resources (PID, Net, Mount).
  • Cgroups: Limit resources (CPU, Mem).
  • Security Context: Configures user identity (UID/GID) and kernel-level permissions (Capabilities/Seccomp).

2. SECURITY CONTEXT: THE SPECIFICATION

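Security contexts are defined at two levels. For fields that exist at both levels (such as runAsUser, runAsGroup, runAsNonRoot, and seccompProfile), the container-level value overrides the Pod-level value.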
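
The precedence rule can be illustrated with a small Python sketch. The function name and the list of shared fields are assumptions made for illustration (the list is partial), and plain dictionaries stand in for the two securityContext blocks:

```python
# Fields that exist in both the Pod-level and container-level securityContext.
# Partial list, chosen for illustration.
SHARED_FIELDS = ("runAsUser", "runAsGroup", "runAsNonRoot",
                 "seccompProfile", "seLinuxOptions")


def effective_security_context(pod_sc, container_sc):
    """Return the settings that apply to one container (hypothetical helper)."""
    effective = {}
    for field in SHARED_FIELDS:
        if field in container_sc:      # the container-level value wins
            effective[field] = container_sc[field]
        elif field in pod_sc:          # otherwise inherit from the Pod
            effective[field] = pod_sc[field]
    # Container-only fields (readOnlyRootFilesystem, capabilities, ...) pass through.
    for field, value in container_sc.items():
        effective.setdefault(field, value)
    return effective


pod_sc = {"runAsUser": 10001, "runAsNonRoot": True, "fsGroup": 10001}
container_sc = {"runAsUser": 20002, "readOnlyRootFilesystem": True}
print(effective_security_context(pod_sc, container_sc))
# {'runAsUser': 20002, 'runAsNonRoot': True, 'readOnlyRootFilesystem': True}
```

fsGroup does not appear in the result because it is a Pod-only field that governs volume ownership rather than a per-container setting.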

2.1 Pod-Level (spec.securityContext)

Applies to all containers, including Init Containers.

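  • fsGroup (Recursive Ownership): When a volume is mounted, the kubelet recursively changes the group ownership and permissions of its contents to this GID (for volume types that support ownership management).
    • Architect Note: On large volumes with millions of files, this can delay Pod startup long enough to cause timeout errors. Use fsGroupChangePolicy: OnRootMismatch to skip the recursive change when the volume root already has the expected ownership and permissions.
  • runAsNonRoot: The kubelet checks at container start that the process will not run as UID 0. If runAsUser is unset and the image runs as root (or names a non-numeric user that the kubelet cannot verify), the container fails to start.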
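
As a rough model of the runAsNonRoot decision, the following Python sketch mirrors the behavior described above. It is a simplification under stated assumptions, not the kubelet's code: the function name, the return shape, and the reason strings are all hypothetical.

```python
def non_root_check(run_as_user, image_user):
    """Model the runAsNonRoot: true check for one container (hypothetical helper).

    run_as_user: the effective runAsUser (int), or None if unset.
    image_user:  the USER recorded in the image config ("" if none).
    Returns (ok, reason).
    """
    if run_as_user is not None:
        if run_as_user == 0:
            return False, "runAsUser is 0"
        return True, "runAsUser is a non-zero UID"

    user = image_user.split(":")[0]  # drop an optional ":group" suffix
    if user == "":
        return False, "image has no USER and will run as root"
    if not user.isdigit():
        # A name such as "app" cannot be mapped to a UID without reading the image.
        return False, "image USER is non-numeric and cannot be verified"
    if int(user) == 0:
        return False, "image will run as root"
    return True, "image USER is a non-zero UID"


print(non_root_check(None, "root"))   # rejected: non-numeric user
print(non_root_check(10001, "root"))  # accepted: runAsUser overrides the image
```

The practical takeaway is to set a numeric runAsUser in the manifest, or a numeric USER in the image, so the check has something it can verify.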

2.2 Container-Level (spec.containers[].securityContext)

Applies granular locks to the specific process.

  • allowPrivilegeEscalation: Controls the no_new_privs flag on the container process. When set to false, the flag is enabled, so setuid binaries (like sudo) and file capabilities cannot grant the process more privileges than its parent.
  • readOnlyRootFilesystem: Mounts the container's root filesystem as read-only. This blocks exploits that try to download a payload onto the container filesystem and execute it (e.g., /tmp/miner). Any writable volume mounted into the container remains writable.

2.3 The Comparison Matrix

Feature                   Level     Linux Primitive     Production Default
runAsUser                 Pod/Cont  setuid              > 10000
runAsNonRoot              Pod/Cont  Kubelet UID check   true
privileged                Cont      Full device access  false
allowPrivilegeEscalation  Cont      no_new_privs        false
readOnlyRootFilesystem    Cont      mount -o ro         true
capabilities              Cont      capabilities(7)     drop: [ALL]

3. LINUX CAPABILITIES: GRANULAR ROOT

Root (UID 0) is traditionally "all or nothing." Linux capabilities break this power into about 40 granular units. By default, the container runtime grants containers a subset of them (including NET_RAW and CHOWN).

3.1 The "Drop ALL" Strategy

A "Bible-grade" manifest always drops all default capabilities and adds back exactly what is needed.

  • NET_BIND_SERVICE: Needed to bind to ports < 1024.
  • NET_RAW: Needed for ping. (Danger: Enables ARP spoofing).

securityContext:
  capabilities:
    drop:
      - ALL
    add:
      - NET_BIND_SERVICE # Allow binding to port 80/443 as non-root
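
To make the effect of drop and add concrete, here is a minimal Python sketch. It assumes the default capability set commonly shipped by containerd and Docker (the actual set depends on the runtime and its configuration), and the function name is hypothetical. It models only the set requested from the runtime, not what a non-root process ends up holding in its effective set.

```python
# Assumption: the default set used by common runtimes; verify against your runtime.
RUNTIME_DEFAULT_CAPS = frozenset({
    "AUDIT_WRITE", "CHOWN", "DAC_OVERRIDE", "FOWNER", "FSETID", "KILL",
    "MKNOD", "NET_BIND_SERVICE", "NET_RAW", "SETFCAP", "SETGID", "SETPCAP",
    "SETUID", "SYS_CHROOT",
})


def requested_capabilities(drop=(), add=(), defaults=RUNTIME_DEFAULT_CAPS):
    """Apply a capabilities.drop / capabilities.add pair to the default set."""
    remaining = set() if "ALL" in drop else set(defaults) - set(drop)
    return sorted(remaining | set(add))


print(requested_capabilities(drop=["ALL"], add=["NET_BIND_SERVICE"]))
# ['NET_BIND_SERVICE']
print("NET_RAW" in requested_capabilities())
# True
```

With no securityContext at all, NET_RAW is still present, which is why dropping ALL first is the safer starting point.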

4. POD SECURITY STANDARDS (PSS)

Since PodSecurityPolicy (PSP) was deprecated in v1.21 and removed in v1.25, Kubernetes relies on the Pod Security Standards (PSS). The standards define three profiles, and the built-in Pod Security Admission controller validates Pods against them.

4.1 The Three Profiles

  1. Privileged: Unrestricted. Used for system-level Pods (CNI, Storage Drivers).
  2. Baseline: Prevents known privilege escalations. Minimum standard for internal apps.
  3. Restricted: Heavily hardened. Follows current best practices for multi-tenant isolation.
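
To give a feel for what Restricted demands, the Python sketch below checks a Pod spec (as a plain dictionary) against a subset of its rules. It is an illustration only: the function name is hypothetical, the rule list is incomplete (volume types, for example, are not checked), and the real evaluation is done by the admission controller.

```python
def restricted_violations(pod_spec):
    """Return violations for a SUBSET of the Restricted profile (hypothetical helper)."""
    violations = []
    pod_sc = pod_spec.get("securityContext", {})

    for flag in ("hostPID", "hostNetwork", "hostIPC"):
        if pod_spec.get(flag):
            violations.append(f"{flag} must not be true")

    containers = pod_spec.get("initContainers", []) + pod_spec.get("containers", [])
    for c in containers:
        name = c.get("name", "?")
        sc = c.get("securityContext", {})
        if sc.get("privileged"):
            violations.append(f"{name}: privileged must not be true")
        if sc.get("allowPrivilegeEscalation") is not False:
            violations.append(f"{name}: allowPrivilegeEscalation must be false")
        # Container-level runAsNonRoot overrides the Pod-level value.
        if not sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot", False)):
            violations.append(f"{name}: runAsNonRoot must be true")
        caps = sc.get("capabilities", {})
        if "ALL" not in caps.get("drop", []):
            violations.append(f"{name}: capabilities.drop must include ALL")
        if set(caps.get("add", [])) - {"NET_BIND_SERVICE"}:
            violations.append(f"{name}: only NET_BIND_SERVICE may be added")
        seccomp = sc.get("seccompProfile", pod_sc.get("seccompProfile", {})).get("type")
        if seccomp not in ("RuntimeDefault", "Localhost"):
            violations.append(f"{name}: seccompProfile must be RuntimeDefault or Localhost")
    return violations


weak = {"hostPID": True, "containers": [{"name": "app"}]}
for violation in restricted_violations(weak):
    print(violation)
```

A Pod with no securityContext fails several rules at once, which is why Restricted usually requires explicit changes to existing manifests.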

4.2 Enforcement Logic
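
Pod Security Admission evaluates each Pod against the level configured for each of three modes: enforce rejects a violating Pod, warn admits it but returns a warning to the user, and audit admits it but annotates the audit log entry. The Python sketch below models that decision. The function name and return shape are hypothetical, and it assumes the common default that a mode with no label is treated as privileged.

```python
LEVELS = {"privileged": 0, "baseline": 1, "restricted": 2}
PREFIX = "pod-security.kubernetes.io/"


def admission_outcome(namespace_labels, pod_level):
    """Model the three modes for one Pod (hypothetical helper).

    pod_level: the strictest profile the Pod satisfies.
    """
    def violates(mode):
        # Assumption: an unlabeled mode defaults to 'privileged' (no restriction).
        required = namespace_labels.get(PREFIX + mode, "privileged")
        return LEVELS[pod_level] < LEVELS[required]

    return {
        "allowed": not violates("enforce"),     # enforce: reject on violation
        "warning": violates("warn"),            # warn: admit, warn the user
        "audit_annotation": violates("audit"),  # audit: admit, annotate audit log
    }


labels = {PREFIX + "enforce": "baseline", PREFIX + "warn": "restricted"}
print(admission_outcome(labels, "baseline"))
# {'allowed': True, 'warning': True, 'audit_annotation': False}
```

Running warn or audit one level stricter than enforce, as in this example, is a common way to preview the impact of tightening a namespace.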

4.3 Namespace Configuration

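Pod Security Admission is configured per namespace via labels. Each mode (enforce, audit, warn) can be set to a different level, for example enforcing baseline while warning on restricted.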

apiVersion: v1
kind: Namespace
metadata:
  name: prod-restricted
  labels:
    # REJECT any pod that doesn't meet 'restricted'
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    # WARN the user if it doesn't meet 'restricted' (useful for migrations)
    pod-security.kubernetes.io/warn: restricted

5. BIBLE-GRADE MANIFEST: THE "GOLD STANDARD" POD

This manifest represents a production-hardened Pod that passes the Restricted PSS profile.

apiVersion: v1
kind: Pod
metadata:
  name: production-hardened-app
  labels:
    security: hardened
spec:
  # 1. Pod-level identity
  securityContext:
    runAsNonRoot: true
    runAsUser: 10001
    runAsGroup: 10001
    fsGroup: 10001
    seccompProfile:
      type: RuntimeDefault # Enforce the runtime's default syscall filter

  containers:
    - name: app
      image: my-app:v1.2.0
      securityContext:
        # 2. Immutable infrastructure
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
        runAsNonRoot: true

        # 3. Capability reduction
        capabilities:
          drop:
            - ALL
          # add: [ "NET_BIND_SERVICE" ] # Only if necessary

      # 4. Writable paths (since rootfs is RO)
      volumeMounts:
        - name: tmp-dir
          mountPath: /tmp
        - name: logs
          mountPath: /var/log/app

  volumes:
    - name: tmp-dir
      emptyDir: {}
    - name: logs
      emptyDir: {}

6. PRODUCTION DEBUGGING & PITFALLS

6.1 Inspecting Active Capabilities

How do you know what capabilities a process actually has?

  1. Exec into the Pod: kubectl exec -it <pod> -- sh
  2. Check Status: cat /proc/1/status | grep Cap
  3. Decode: Use capsh --decode=<HexValue> on your local machine to see the list.
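
If capsh is not available, the hex masks can be decoded by hand, since each bit position corresponds to one capability number from the kernel's capability header. The Python sketch below does this. The function names are hypothetical, and the name table covers capabilities 0 through 40, so bits added by newer kernels would be ignored.

```python
# Index in this list == capability number in the kernel's capability header.
CAP_NAMES = [
    "CHOWN", "DAC_OVERRIDE", "DAC_READ_SEARCH", "FOWNER", "FSETID", "KILL",
    "SETGID", "SETUID", "SETPCAP", "LINUX_IMMUTABLE", "NET_BIND_SERVICE",
    "NET_BROADCAST", "NET_ADMIN", "NET_RAW", "IPC_LOCK", "IPC_OWNER",
    "SYS_MODULE", "SYS_RAWIO", "SYS_CHROOT", "SYS_PTRACE", "SYS_PACCT",
    "SYS_ADMIN", "SYS_BOOT", "SYS_NICE", "SYS_RESOURCE", "SYS_TIME",
    "SYS_TTY_CONFIG", "MKNOD", "LEASE", "AUDIT_WRITE", "AUDIT_CONTROL",
    "SETFCAP", "MAC_OVERRIDE", "MAC_ADMIN", "SYSLOG", "WAKE_ALARM",
    "BLOCK_SUSPEND", "AUDIT_READ", "PERFMON", "BPF", "CHECKPOINT_RESTORE",
]


def decode_caps(hex_mask):
    """Turn a hex mask such as a CapEff value into capability names."""
    mask = int(hex_mask, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]


def caps_from_status(status_text):
    """Decode every Cap* line found in the text of /proc/<pid>/status."""
    result = {}
    for line in status_text.splitlines():
        if line.startswith("Cap"):
            field, value = line.split(":", 1)
            result[field] = decode_caps(value.strip())
    return result


print(decode_caps("0000000000000400"))
# ['NET_BIND_SERVICE']
```

An all-zero CapEff on a process whose CapBnd is non-zero typically indicates a non-root process: the capability is permitted by the bounding set but is not currently effective.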

6.2 The nsenter Escape

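If a container runs with privileged: true and hostPID: true, an attacker inside it can run nsenter against the host's PID 1 (for example, nsenter -t 1 -m -u -i -n -p) and move from the container's namespaces into the host's. hostPID on its own already exposes every host process to the container.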

  • Detection: Audit your cluster for any Pods with hostPID, hostNetwork, or hostIPC set to true.
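
One way to run such an audit is to export the Pod list as JSON and scan it offline. The Python sketch below assumes the input has the shape produced by kubectl get pods -A -o json (a list object with an items array). The function name is hypothetical, and it also flags privileged containers, since those matter for the same escape.

```python
HOST_FLAGS = ("hostPID", "hostNetwork", "hostIPC")


def risky_pods(pod_list):
    """Return (namespace, name, reasons) for Pods that share host namespaces
    or run a privileged container (hypothetical helper)."""
    findings = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        spec = pod.get("spec", {})
        reasons = [flag for flag in HOST_FLAGS if spec.get(flag)]
        containers = (
            (spec.get("initContainers") or [])
            + (spec.get("containers") or [])
            + (spec.get("ephemeralContainers") or [])
        )
        for c in containers:
            if (c.get("securityContext") or {}).get("privileged"):
                reasons.append(f"privileged container {c.get('name')}")
        if reasons:
            findings.append((meta.get("namespace"), meta.get("name"), reasons))
    return findings


# Inline sample standing in for real cluster output.
sample = {"items": [
    {"metadata": {"namespace": "kube-system", "name": "node-agent"},
     "spec": {"hostPID": True,
              "containers": [{"name": "agent",
                              "securityContext": {"privileged": True}}]}},
    {"metadata": {"namespace": "prod", "name": "web"},
     "spec": {"containers": [{"name": "app"}]}},
]}
print(risky_pods(sample))
# [('kube-system', 'node-agent', ['hostPID', 'privileged container agent'])]
```

Against a real cluster, the same function could be fed the parsed output of the kubectl command above, for example by saving it to a file and loading it with json.load.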

6.3 Common Pitfall: fsGroup Latency

Problem: A Pod with a large 1 TB volume takes 10 minutes to start.

Why: The kubelet is recursively running chown on every file in the volume to match the fsGroup.

The Fix:

securityContext:
  fsGroup: 2000
  fsGroupChangePolicy: "OnRootMismatch" # Only chown if the root directory differs

6.4 Debugging with kubectl debug

To troubleshoot a failing "Restricted" Pod without compromising security:

# Spin up a debug container sharing the same process namespace
kubectl debug -it <pod-name> --image=busybox --target=app

7. SUMMARY: THE SECURITY CHECKLIST

  1. Drop ALL Capabilities: Add back only what is proven necessary.
  2. Read-Only Root: Forced immutability stops most payload drops.
  3. Non-Root UID: Ensure your application does not rely on UID 0.
  4. No Privilege Escalation: Block setuid binaries.
  5. PSS Enforcement: Label namespaces with enforce: restricted to prevent accidental drift into insecure configurations.