Pod Security: Contexts, Capabilities, and Standards

In the Kubernetes shared-responsibility model, the cluster provides orchestration, while the Pod's security context defines the boundary between the container and the host kernel. Hardening this boundary is a primary defense against container breakouts and lateral movement.

1. ARCHITECTURE: KERNEL-LEVEL ENFORCEMENT

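When you define a securityContext, the API server validates and stores it. The kubelet then passes those settings through the Container Runtime Interface (CRI) to the container runtime, which applies the corresponding Linux kernel features as it creates the container process (around the clone() and execve() system calls).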

1.1 The Security Boundary Layers

Namespaces: Isolate resources (PID, Net, Mount).
Cgroups: Limit resources (CPU, Mem).
Security Context: Configures user identity (UID/GID) and kernel-level permissions (Capabilities/Seccomp).
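
As a rough illustration of how these three layers surface in a Pod spec, the sketch below annotates which fields map to which layer. The Pod name, image, and resource values are placeholders, not taken from a real deployment.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: boundary-layers-demo        # hypothetical name
spec:
  # Namespaces: leaving these false (the default) keeps the Pod out of the host's namespaces
  hostPID: false
  hostNetwork: false
  hostIPC: false
  containers:
  - name: app
    image: registry.example.com/app:1.0   # placeholder image
    # Cgroups: limits are enforced through the container's cgroup
    resources:
      limits:
        cpu: "500m"
        memory: 256Mi
    # Security context: identity and kernel-level permissions
    securityContext:
      runAsUser: 10001
      capabilities:
        drop: ["ALL"]
      seccompProfile:
        type: RuntimeDefault
```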

2. SECURITY CONTEXT: THE SPECIFICATION

Security contexts are defined at two levels. When the same field is set at both levels, the container-level value overrides the Pod-level value for that container.
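
A minimal sketch of the override behavior, assuming a two-container Pod. The names, images, and UIDs are illustrative only.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: override-demo               # hypothetical name
spec:
  securityContext:
    runAsUser: 10001                # Pod-level default for every container
  containers:
  - name: inherits
    image: registry.example.com/app:1.0      # placeholder image
    # No container-level runAsUser, so this container runs as 10001
  - name: overrides
    image: registry.example.com/sidecar:1.0  # placeholder image
    securityContext:
      runAsUser: 20002              # container-level value wins for this container only
```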

2.1 Pod-Level (`spec.securityContext`)

Applies to all containers, including Init Containers.

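fsGroup (Recursive Ownership): When a volume that supports ownership management is mounted, the kubelet recursively changes the group ownership and permissions of its contents (chown and chmod) to match this GID.
- Architect Note: On large volumes with millions of files, this can slow Pod startup enough to cause timeout errors. Use fsGroupChangePolicy: OnRootMismatch to skip the recursive change when the volume root already has the expected ownership and permissions.
runAsNonRoot: The kubelet checks at container start that the process will not run as UID 0. If the image runs as root, or declares a non-numeric USER that the kubelet cannot verify, the container is refused: the API server accepts the Pod, but it stays in CreateContainerConfigError.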
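
A minimal sketch of satisfying the runAsNonRoot check, assuming an image whose Dockerfile declares a named user (for example, USER appuser). The kubelet cannot prove that a name maps to a non-zero UID, so pinning a numeric runAsUser is one way to let the check pass. The names and UID are illustrative.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nonroot-demo                # hypothetical name
spec:
  securityContext:
    runAsNonRoot: true              # kubelet refuses to start containers that would run as UID 0
    runAsUser: 10001                # explicit numeric UID the kubelet can verify
    runAsGroup: 10001
  containers:
  - name: app
    image: registry.example.com/app:1.0   # placeholder image
```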

2.2 Container-Level (`spec.containers[].securityContext`)

Applies granular locks to the specific process.

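allowPrivilegeEscalation: Controls the no_new_privs kernel flag. If false, the flag is set, so setuid binaries (like sudo) and file capabilities cannot give the process more privileges than its parent. It is effectively always true when the container is privileged or has CAP_SYS_ADMIN.
readOnlyRootFilesystem: Mounts the container's root filesystem read-only. This blocks exploits that try to download malware onto the image filesystem and execute it (e.g., /tmp/miner). Any path that must stay writable needs an explicit volume, and that volume is writable to an attacker too, so keep such mounts to a minimum.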
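
One way to pair a read-only root filesystem with a small amount of writable space, sketched under the assumption that the application only needs scratch space in /tmp. The size limit and names are illustrative.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: readonly-root-demo          # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0   # placeholder image
    securityContext:
      allowPrivilegeEscalation: false     # sets no_new_privs on the container process
      readOnlyRootFilesystem: true
    volumeMounts:
    - name: scratch
      mountPath: /tmp                     # scratch space the application may write to
  volumes:
  - name: scratch
    emptyDir:
      sizeLimit: 64Mi                     # the Pod is evicted if usage exceeds this limit
```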

2.3 The Comparison Matrix

Feature	Level	Linux Primitive	Production Default
`runAsUser`	Pod/Container	`setuid(2)`	`> 10000`
`runAsNonRoot`	Pod/Container	Kubelet UID check	`true`
`privileged`	Container	All capabilities + host device access	`false`
`allowPrivilegeEscalation`	Container	`no_new_privs`	`false`
`readOnlyRootFilesystem`	Container	Read-only root mount	`true`
`capabilities`	Container	`capabilities(7)`	`drop: [ALL]`

3. LINUX CAPABILITIES: GRANULAR ROOT

Root (UID 0) is traditionally "all or nothing." Linux capabilities break this power into roughly 40 granular units. By default, the container runtime grants a subset of them (including NET_RAW and CHOWN), even to containers that do not need them.

3.1 The "Drop ALL" Strategy

A "Bible-grade" manifest always drops all default capabilities and adds back exactly what is needed.

NET_BIND_SERVICE: Needed to bind to ports < 1024.
NET_RAW: Needed for raw and packet sockets, which traditional ping implementations use. (Danger: Enables ARP spoofing).

securityContext:
  capabilities:
    drop:
      - ALL
    add:
      - NET_BIND_SERVICE # Allow binding to port 80/443 as non-root

4. POD SECURITY STANDARDS (PSS)

Since the deprecation of PodSecurityPolicy (PSP) in v1.21 and its removal in v1.25, Kubernetes relies on the Pod Security Standards (PSS). The standards define three profiles, and the built-in Pod Security Admission controller validates Pods against them.

4.1 The Three Profiles

Privileged: Unrestricted. Used for system-level Pods (CNI, Storage Drivers).
Baseline: Prevents known privilege escalations. Minimum standard for internal apps.
Restricted: Heavily hardened. Follows current best practices for multi-tenant isolation.

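4.2 Enforcement Logic

Pod Security Admission evaluates each Pod against the level configured for its namespace in up to three modes: enforce rejects violating Pods, audit records the violation in the audit log, and warn returns a warning to the user while still admitting the Pod.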

4.3 Namespace Configuration

Pod Security Admission is configured per namespace via labels. You can set different levels for the enforce, audit, and warn modes.

apiVersion: v1
kind: Namespace
metadata:
  name: prod-restricted
  labels:
    # REJECT any pod that doesn't meet 'restricted'
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    # WARN the user if it doesn't meet 'restricted' (useful for migrations)
    pod-security.kubernetes.io/warn: restricted
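
To illustrate different levels per mode, here is a sketch of a staged migration: the namespace enforces baseline today while surfacing restricted violations as warnings and audit annotations. The namespace name is hypothetical.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-a-migrating            # hypothetical name
  labels:
    # Reject only Pods that violate 'baseline'
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/enforce-version: latest
    # Surface 'restricted' violations without blocking the Pod
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```

Once the warnings stop appearing, the enforce label can be raised to restricted.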

5. BIBLE-GRADE MANIFEST: THE "GOLD STANDARD" POD

This manifest represents a production-hardened Pod that passes the Restricted PSS profile.

apiVersion: v1
kind: Pod
metadata:
  name: production-hardened-app
  labels:
    security: hardened
spec:
  # 1. Pod-level identity
  securityContext:
    runAsNonRoot: true
    runAsUser: 10001
    runAsGroup: 10001
    fsGroup: 10001
    seccompProfile:
      type: RuntimeDefault # Enforce the runtime's default syscall filter

  containers:
  - name: app
    image: my-app:v1.2.0
    securityContext:
      # 2. Immutable infrastructure
      readOnlyRootFilesystem: true
      allowPrivilegeEscalation: false
      runAsNonRoot: true
      
      # 3. Capability reduction
      capabilities:
        drop:
          - ALL
        # add: [ "NET_BIND_SERVICE" ] # Only if necessary

    # 4. Writable paths (since rootfs is RO)
    volumeMounts:
    - name: tmp-dir
      mountPath: /tmp
    - name: logs
      mountPath: /var/log/app

  volumes:
  - name: tmp-dir
    emptyDir: {}
  - name: logs
    emptyDir: {}

6. PRODUCTION DEBUGGING & PITFALLS

6.1 Inspecting Active Capabilities

How do you know what capabilities a process actually has?

Exec into the Pod: kubectl exec -it <pod> -- sh
Check Status: grep Cap /proc/1/status (CapEff is the effective set; CapBnd is the bounding set)
Decode: Use capsh --decode=<HexValue> on your local machine to see the list.

6.2 The `nsenter` Escape

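If a container runs with privileged: true and hostPID: true, an attacker can use nsenter (for example, nsenter --target 1 --mount --uts --ipc --net --pid) to jump from the container's namespaces into the host's namespaces. Either setting alone already weakens isolation significantly.

Detection: Audit your cluster for any Pods with hostPID, hostNetwork, or hostIPC set to true, and for containers with privileged: true.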
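
For reference when auditing, the fields to look for resemble the following deliberately insecure sketch. It is an anti-pattern, not something to deploy; the name and image are placeholders.

```yaml
# ANTI-PATTERN: do not deploy. Shown only so the risky fields are recognizable in an audit.
apiVersion: v1
kind: Pod
metadata:
  name: host-access-antipattern     # hypothetical name
spec:
  hostPID: true                     # container can see host processes
  hostNetwork: true                 # container shares the host network namespace
  hostIPC: true                     # container shares the host IPC namespace
  containers:
  - name: shell
    image: registry.example.com/tools:1.0   # placeholder image
    securityContext:
      privileged: true              # all capabilities plus host device access
```

Both the Baseline and Restricted profiles reject every one of these fields.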

6.3 Common Pitfall: `fsGroup` Latency

Problem: A Pod with a large 1TB volume takes 10 minutes to start.
Why: The Kubelet is recursively running chown on every file in the volume to match the fsGroup.
The Fix:

securityContext:
  fsGroup: 2000
  fsGroupChangePolicy: "OnRootMismatch" # Only chown if the root directory differs

6.4 Debugging with `kubectl debug`

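To troubleshoot a failing "Restricted" Pod without weakening its security settings, attach an ephemeral debug container. Ephemeral containers are also subject to Pod Security Admission, so in a namespace that enforces restricted the debug container needs a compliant security context (recent kubectl versions offer --profile=restricted for this).

# Add an ephemeral debug container that shares the process namespace of the 'app' container
kubectl debug -it <pod-name> --image=busybox --target=app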

7. SUMMARY: THE SECURITY CHECKLIST

Drop ALL Capabilities: Add back only what is proven necessary.
Read-Only Root: Forced immutability stops most payload drops.
Non-Root UID: Ensure your application does not rely on UID 0.
No Privilege Escalation: Block setuid binaries.
PSS Enforcement: Label namespaces with enforce: restricted to prevent accidental drift into insecure configurations.
