Virtual Machines vs Containers & Linux Primitives
Before diving into Kubernetes, it is essential to understand the underlying technology that makes it possible: Containers. A container is not a "real" object in the Linux kernel; it is a construct created by combining specific Linux kernel features (Namespaces, Cgroups, and Union Filesystems).
1. Virtual Machines vs Containers
The primary difference lies in what they abstract and what they share.
Virtual Machines (Hardware Virtualization)
VMs virtualize the hardware. A Hypervisor (Type 1 like ESXi or Type 2 like VirtualBox) sits between the hardware and the OS.
- Heavyweight: Each VM has its own full Guest Kernel and OS images (GBs in size).
- Isolation: Hard isolation via hardware virtualization instructions (VT-x/AMD-V). A kernel panic in one VM does not affect the host.
- Boot time: Minutes (BIOS -> Bootloader -> Kernel -> Init).
Containers (OS Virtualization)
Containers virtualize the Operating System. They share the Host Kernel.
- Lightweight: No Guest OS. Uses the host's kernel syscalls. Images are MBs in size.
- Isolation: Soft isolation using Kernel Namespaces. A kernel panic triggered by a container crashes the entire host.
- Boot time: Milliseconds (It is just starting a standard Linux process).
+---------------------+      +---------------------+
|        App A        |      |        App A        |
+---------------------+      +---------------------+
|  Guest OS (Kernel)  |      |      Bins/Libs      |
+---------------------+      +---------------------+
|     Hypervisor      |      |  Container Engine   |
+---------------------+      +---------------------+
|       Host OS       |      |  Host OS (Kernel)   |
+---------------------+      +---------------------+
|      Hardware       |      |      Hardware       |
+---------------------+      +---------------------+
   VIRTUAL MACHINE                CONTAINER
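The "a container is just a process" point is easy to verify from the host. A minimal sketch, assuming Docker is installed (the nginx image is only an example):

```shell
# Start a container, then locate its main process in the HOST process table.
CID=$(docker run -d --rm nginx)
PID=$(docker inspect --format '{{.State.Pid}}' "$CID")
ps -fp "$PID"        # nginx appears as an ordinary host process
docker stop "$CID"
```

There is no hypervisor boundary to cross: the kernel scheduling this process is the same kernel scheduling everything else on the host.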
2. Linux Primitives (The "Magic" Behind Containers)
If you strictly look at the Linux Kernel code, there is no such thing as a "Container". There are only normal processes with restricted views of the system. This restriction is achieved via Namespaces and Cgroups.
A. Namespaces (Isolation - What you SEE)
Namespaces limit what a process can SEE. If you run ps aux inside a container, you only see PID 1 and its children, not the host's processes.
| Namespace | Flag | What it isolates |
|---|---|---|
| PID | CLONE_NEWPID | Process IDs. Inside container, the app is PID 1. On host, it might be PID 12345. |
| NET | CLONE_NEWNET | Network stack (Interfaces, IP, Route table, localhost, iptables). |
| MNT | CLONE_NEWNS | Mount points (Filesystem view). / in container is different from / on host. |
| UTS | CLONE_NEWUTS | Hostname and domain name. |
| IPC | CLONE_NEWIPC | Inter-Process Communication (Shared memory, semaphores). |
| USER | CLONE_NEWUSER | User and Group IDs. (Map root inside container to non-root outside). |
| TIME | CLONE_NEWTIME | (Linux 5.6+) Clock offsets. Allows different time inside container (e.g. testing). |
You can manually create a namespace without Docker using the unshare command:
# Create a process with its own PID namespace
sudo unshare --fork --pid --mount-proc /bin/bash
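A non-interactive variant makes the isolation visible immediately: running ps inside the new PID namespace shows only the processes created in it (a sketch; requires root):

```shell
# ps inside a fresh PID namespace sees only itself, not the host's processes.
# --mount-proc remounts /proc so ps reads the new namespace's process list.
sudo unshare --fork --pid --mount-proc ps aux
# Typically prints one or two lines (ps itself, near PID 1)
# instead of the hundreds of processes visible on the host.
```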
Inspecting namespaces from the host (debugging):
# List all namespace types for a process (e.g. container PID 12345)
ls -l /proc/12345/ns
# Sample: net -> 'net:[4026532281]' — inode identifies the namespace
# List processes in a given namespace (e.g. PID namespace of container)
lsns -p 12345
# Or list all namespaces on the system
lsns
Sample lsns output (host):
NS TYPE NPROCS PID USER COMMAND
4026531834 pid 128 1 root /sbin/init
4026532281 net 1 12345 root nginx
This confirms the container process has its own net namespace (different from init).
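To debug inside a container that ships no networking tools, you can join its namespaces from the host with nsenter and use the host's binaries (a sketch; PID 12345 is the example container PID from above, and root is required):

```shell
# Run host tools inside the container's network namespace.
sudo nsenter -t 12345 -n ip addr      # the container's interfaces and IPs
sudo nsenter -t 12345 -n ss -tlnp     # its listening sockets, via the host's ss
# Other flags join other namespaces: -m (mount), -p (PID), -u (UTS).
```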
B. Cgroups (Control Groups - Resource Limiting - What you USE)
Cgroups limit what a process can USE. They account for and limit resources like CPU, Memory, I/O, and PIDs.
- Location: /sys/fs/cgroup/<controller>/ (v1) or the unified tree at /sys/fs/cgroup/ (v2).
- Key Functions:
  - Resource Limiting: "This container can only use 512MB RAM."
  - Prioritization: "Give this container more CPU shares (1024) than that one (512)."
  - Accounting: "How much CPU has this container used?" (This is where kubectl top gets its data.)
  - Control: Freezing/pausing processes.
Cgroups v1 vs v2 (cgroup2 unified hierarchy):
- v1: Multiple hierarchies (cpu, memory, blkio, etc. can be mounted separately). Still common on older hosts.
- v2: Single hierarchy under /sys/fs/cgroup/. All controllers live in one tree. Required for rootless containers and preferred for new systems. Kubernetes 1.25+ fully supports cgroup v2; on v2 hosts, configure the kubelet to use the systemd cgroup driver (cgroupDriver: systemd).
- Check which you have: stat -fc %T /sys/fs/cgroup/ — cgroup2fs means v2; tmpfs with controllers mounted under it means v1.
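The limiting and accounting functions can be exercised by hand on a cgroup v2 host. A minimal sketch, assuming root and the unified hierarchy (the group name demo is arbitrary):

```shell
# Create a cgroup, cap its memory, and move the current shell into it.
sudo mkdir /sys/fs/cgroup/demo
echo 64M | sudo tee /sys/fs/cgroup/demo/memory.max      # resource limiting
echo $$  | sudo tee /sys/fs/cgroup/demo/cgroup.procs    # join the group
cat /sys/fs/cgroup/demo/memory.current                  # accounting, in bytes
# Cleanup: move the shell back to the root cgroup, then remove the empty group.
echo $$ | sudo tee /sys/fs/cgroup/cgroup.procs
sudo rmdir /sys/fs/cgroup/demo
```

This is exactly what a container runtime does on your behalf, just with a generated group name under its own subtree.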
Inspecting cgroups for a container (debugging OOM / limits):
# Get container's PID from Docker
docker inspect --format '{{.State.Pid}}' <container_name>
# View cgroup membership (v1: multiple lines; v2: single path)
cat /proc/<pid>/cgroup
Sample output (cgroup v2):
0::/docker/abc123def456...
Sample output (cgroup v1):
5:memory:/docker/abc123...
4:blkio:/docker/abc123...
...
Memory limit is enforced under the memory controller; OOM events appear in dmesg or container exit code 137.
If a container exceeds its Cgroup memory limit, the Kernel invokes the OOM Killer (Out of Memory Killer) to terminate the process based on oom_score. This is why you see OOMKilled (Exit Code 137) in Kubernetes.
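The OOMKilled path can be reproduced directly. A sketch assuming Docker; the image and sizes are illustrative:

```shell
# Try to allocate ~256 MB inside a container capped at 64 MB.
# --memory-swap equal to --memory disables swap, so the limit bites immediately.
docker run --rm --memory=64m --memory-swap=64m python:3-alpine \
  python -c "x = bytearray(256 * 1024 * 1024)"
echo $?   # typically 137 = 128 + 9 (SIGKILL from the kernel OOM killer)
```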
C. Union File Systems (OverlayFS)
Containers need to start fast and save space. They use union mounts (specifically the overlay2 storage driver in modern Docker).
- LowerDir (Image Layers): Read-Only. Shared among all containers using the image.
- UpperDir (Container Layer): Read-Write (Ephemeral). Unique to the container.
- WorkDir: Internal directory for OverlayFS to merge changes.
- MergedDir: The unified view the container sees.
Copy-On-Write (CoW): If you modify a file that exists in the image layer (LowerDir), the kernel copies it up to the container layer (UpperDir) first, then modifies it. The original image remains untouched. This minimizes I/O and storage.
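Copy-up can be observed with a hand-built overlay mount. A sketch, assuming root and OverlayFS support (the /tmp/ovl paths are arbitrary):

```shell
# Build the four overlay directories and seed the read-only "image" layer.
mkdir -p /tmp/ovl/lower /tmp/ovl/upper /tmp/ovl/work /tmp/ovl/merged
echo "from image" > /tmp/ovl/lower/file.txt
sudo mount -t overlay overlay \
  -o lowerdir=/tmp/ovl/lower,upperdir=/tmp/ovl/upper,workdir=/tmp/ovl/work \
  /tmp/ovl/merged
echo "modified" | sudo tee /tmp/ovl/merged/file.txt   # write triggers copy-up
cat /tmp/ovl/lower/file.txt    # still "from image" — the lower layer is untouched
ls /tmp/ovl/upper/             # file.txt — the copied-up version lives here
sudo umount /tmp/ovl/merged
```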
Inspecting overlay mounts (debugging storage):
# From host: find mount for a container's merged view
docker inspect --format '{{range .Mounts}}{{.Source}} -> {{.Destination}}{{"\n"}}{{end}}' <name>
# Overlay layers (LowerDir/UpperDir) are visible in /var/lib/docker/overlay2/<id>/
mount | grep overlay
3. Container Security Primitives (The "Attack Surface")
Because containers share the kernel, "Isolation" is not enough. We need "Restriction" to prevent a compromised container from attacking the kernel.
A. Capabilities (man 7 capabilities)
By default, "root" inside a container is NOT "true root". Docker drops dangerous capabilities.
- Dropped capabilities: CAP_SYS_ADMIN (broad system control), CAP_NET_ADMIN (network configuration), CAP_SYS_MODULE (load kernel modules).
- Retained capabilities: CAP_CHOWN, CAP_NET_BIND_SERVICE (bind ports < 1024), CAP_KILL.
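The retained/dropped split is easy to demonstrate. A sketch assuming Docker (alpine is only an example image):

```shell
# CAP_CHOWN is retained by default, so chown inside the container succeeds:
docker run --rm alpine chown nobody /tmp
# Drop it, and the identical command fails with "Operation not permitted":
docker run --rm --cap-drop=CHOWN alpine chown nobody /tmp
# Inspect the full capability bitmask of PID 1 inside a container:
docker run --rm alpine grep Cap /proc/1/status
```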
Setting privileged: true in Kubernetes restores ALL capabilities and disables the seccomp and AppArmor confinement described below. It effectively turns the container into a process running on the host with little more than a different mount namespace. Never use this in production unless absolutely necessary.
B. Seccomp (Secure Computing Mode)
Seccomp acts as a firewall for Syscalls. It restricts which system calls a process can make to the kernel.
- Default Profile: Docker's default seccomp profile blocks about 44 syscalls (out of 300+), including reboot(), swapoff(), and the legacy _sysctl().
- Kubernetes: Does not apply a seccomp profile by default unless configured; set seccompProfile: type: RuntimeDefault in the Pod securityContext to apply the runtime's default profile.
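The active seccomp mode of any process is visible in /proc (0 = disabled, 1 = strict, 2 = filter), which makes the default profile observable. A sketch; the container commands assume Docker:

```shell
# On the host, most processes run without a seccomp filter:
grep Seccomp: /proc/self/status        # usually "Seccomp: 0"
# Inside a default container, Docker applies its filter-mode profile:
docker run --rm alpine grep Seccomp: /proc/1/status    # "Seccomp: 2"
# Opting out (NOT recommended outside debugging):
docker run --rm --security-opt seccomp=unconfined alpine grep Seccomp: /proc/1/status
```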
C. AppArmor / SELinux
Mandatory Access Control (MAC) systems that add another layer of file system and network access control.
- AppArmor: Path-based profiles (e.g., "Nginx can only read /etc/nginx").
- SELinux: Label-based system (contexts).
D. Read-Only Root Filesystem
- Concept: Mount the container's root filesystem as read-only (readOnlyRootFilesystem: true in Kubernetes). The container can write only where volumes are mounted.
- Benefit: Limits persistence of malware or tampering; many production Pods combine this with an emptyDir or other volume for the single directory that must be writable.
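The Docker equivalent of readOnlyRootFilesystem makes the effect easy to try. A sketch assuming Docker:

```shell
# The root filesystem is read-only; only the tmpfs mounted at /tmp is writable.
docker run --rm --read-only --tmpfs /tmp alpine sh -c \
  'echo ok > /tmp/scratch && echo "/tmp writable"; echo x > /etc/nope'
# The second write fails: "can't create /etc/nope: Read-only file system"
```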
4. Container Runtimes (CRI)
Kubernetes doesn't speak "Docker". It speaks CRI (Container Runtime Interface).
- High-Level Runtime: Management (Image pulling, unpacking, API).
- Examples: containerd, CRI-O, Docker (via dockershim - removed in v1.24).
- Low-Level Runtime (OCI): The actual execution (Creating namespaces/cgroups).
- Examples: runc (Standard), gVisor (Secure sandbox), Kata Containers (VM-based isolation).
The Flow:
Kubelet → CRI Plugin → containerd → runc → Kernel
OCI (Open Container Initiative): runc implements the OCI Runtime Spec (how to create namespaces, cgroups, and run the process). Container images follow the OCI Image Spec (layers, config, manifest). Kubernetes does not depend on Docker; it depends on any CRI-compatible runtime (containerd, CRI-O) that can run OCI bundles.
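The layering is concrete: an OCI bundle is just a rootfs directory plus a config.json, and runc can run it with no engine involved. A sketch assuming Docker and runc are installed (paths and the container name demo are illustrative):

```shell
# Build an OCI bundle by hand from an image's filesystem.
mkdir -p /tmp/bundle/rootfs
docker export "$(docker create alpine)" | tar -C /tmp/bundle/rootfs -xf -
cd /tmp/bundle
runc spec            # writes a default config.json (the OCI Runtime Spec)
sudo runc run demo   # runc creates the namespaces/cgroups and execs the process
```

This is the same code path containerd drives for every Kubernetes Pod; the engine's job is image distribution, supervision, and the API on top.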