Docker & Kubernetes Internals: Under the Hood¶
Sources: CI/CD with Docker and Kubernetes (Semaphore), Everything Kubernetes (Stratoscale), Docker and Kubernetes for Java Developers, Cloud Container Engine Kubernetes Basics (Huawei), Container Management: Kubernetes vs Docker Swarm vs Mesos vs Amazon ECS
1. What Makes a Container: Linux Kernel Primitives¶
A container is not a virtual machine. It is a group of processes isolated by Linux kernel features — no hypervisor, no separate kernel, shared host OS.
block-beta
columns 3
block:vm["Virtual Machine"]:1
columns 1
A["App A"]
B["Guest OS (full kernel)"]
C["Hypervisor (KVM/VMware)"]
D["Host Hardware"]
end
space
block:ct["Container"]:1
columns 1
E["App B (process)"]
F["Linux Namespaces + cgroups"]
G["Host Kernel (shared)"]
H["Host Hardware"]
end
Linux Namespaces — Isolation Boundaries¶
Each container process gets its own view of system resources through namespace isolation:
flowchart TD
HOST["Host Kernel"]
HOST --> PID["PID Namespace\nisolated process tree\ncontainer PID 1 = init"]
HOST --> NET["NET Namespace\nisolated network stack\nvirtual eth pair (veth)"]
HOST --> MNT["MNT Namespace\nisolated mount points\nrootfs overlay"]
HOST --> UTS["UTS Namespace\nisolated hostname\n/etc/hostname per container"]
HOST --> IPC["IPC Namespace\nisolated SysV IPC\nPOSIX message queues"]
HOST --> USER["USER Namespace\nisolated UID/GID mapping\nrootless containers"]
HOST --> CG["cgroups (not a namespace)\nCPU/memory/IO resource limits\nenforced by kernel scheduler"]
cgroups enforce resource budgets. The kernel's CFS bandwidth controller enforces the CPU quota: a container limited to 500m CPU (half a core) may run for 50ms in each 100ms quota period; once that runtime is spent, the kernel throttles the container's threads until the next period begins.
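A toy model of this quota accounting makes the throttling concrete. This is a simplification that assumes a single-threaded, CPU-bound burst; the real scheduler also handles multiple runnable threads and slack reclaiming:

```python
def wall_time_ms(cpu_work_ms: float, quota_ms: float = 50.0, period_ms: float = 100.0) -> float:
    # Each period grants quota_ms of runtime; once spent, the task is
    # throttled until the next period starts.
    full_periods = int(cpu_work_ms // quota_ms)
    remainder = cpu_work_ms - full_periods * quota_ms
    if remainder == 0:
        # work ends exactly when the last granted slice is exhausted
        return (full_periods - 1) * period_ms + quota_ms if full_periods else 0.0
    return full_periods * period_ms + remainder

wall_time_ms(200)  # 350.0 — 200ms of CPU work stretches to 350ms of wall time at 500m CPU
```

The burst runs 0-50ms, is throttled until 100ms, runs again, and so on — exactly the latency spikes CPU-limited services exhibit under load.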
2. Docker Image Layers: OverlayFS Union Filesystem¶
Docker images are content-addressed stacks of read-only layers merged by OverlayFS (or AUFS on older systems) into a single unified view.
block-beta
columns 1
block:ul["upperdir (writable container layer — CoW)"]
W["writes, new files, modifications go here"]
end
block:l4["layer 4 (read-only): APP entrypoint binary"]
L4["sha256:a1b2c3... (content hash)"]
end
block:l3["layer 3 (read-only): pip/npm packages"]
L3["sha256:d4e5f6..."]
end
block:l2["layer 2 (read-only): runtime (JDK/Python)"]
L2["sha256:7c8d9e..."]
end
block:l1["layer 1 (read-only): base OS (debian/alpine)"]
L1["sha256:0a1b2c..."]
end
block:merge["OverlayFS merged view"]
M["union of all layers: upperdir shadows lowerdir on write"]
end
Copy-on-Write (CoW) Semantics¶
When a container writes to a file that exists in a lower read-only layer:
sequenceDiagram
participant P as Container Process
participant OFS as OverlayFS
participant U as upperdir (writable)
participant L as lowerdir (read-only layer)
P->>OFS: open("/etc/nginx/nginx.conf", O_RDWR)
OFS->>L: stat file in lowerdir
L-->>OFS: file found (inode, blocks)
OFS->>U: copy file blocks to upperdir
U-->>OFS: copy complete
OFS-->>P: fd pointing to upperdir copy
P->>U: write new content
Note over L: original unchanged forever
Note over U: modified version lives in container layer
When the container is destroyed, upperdir is discarded. The lower layers (image) are immutable and shared across all containers using the same image — this is why 10 containers from the same image share layer storage.
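The sharing is easy to quantify with a toy disk accounting (layer sizes below are made up for illustration):

```python
def disk_usage_mb(layer_sizes_mb: list[int], containers: int, upperdir_mb: int) -> int:
    # read-only image layers are stored once and shared by all containers;
    # only the writable upperdir costs storage per container
    return sum(layer_sizes_mb) + containers * upperdir_mb

disk_usage_mb([120, 350, 80], containers=10, upperdir_mb=5)  # 600
# naive full copies would instead cost 10 * (550 + 5) = 5550 MB
```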
Build Cache Invalidation¶
flowchart LR
D["Dockerfile instruction"] --> H["compute instruction hash\n(command text + parent layer hash)"]
H --> C{cache hit?}
C -- yes --> REUSE["reuse cached layer\nno rebuild"]
C -- no --> BUILD["execute instruction\ncreate new layer\ninvalidate all downstream layers"]
BUILD --> STORE["store layer in\n/var/lib/docker/overlay2/"]
COPY instructions invalidate cache when file content changes (checksum comparison). This is why COPY requirements.txt . + RUN pip install should precede COPY . . — changing app source code won't re-run slow dependency installs.
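A sketch of this cache-key chaining — a simplification of the real builder's cache logic, with illustrative hashes and digests — shows why a change only invalidates downstream layers:

```python
import hashlib

def layer_cache_key(parent_key: str, instruction: str, content_digest: str = "") -> str:
    # A layer is reused only if the instruction text, the parent layer,
    # and (for COPY/ADD) the checksummed file content are all unchanged.
    h = hashlib.sha256()
    h.update(parent_key.encode())
    h.update(instruction.encode())
    h.update(content_digest.encode())  # empty for RUN/ENV-style instructions
    return h.hexdigest()

base = layer_cache_key("", "FROM python:3.12-slim")
deps = layer_cache_key(base, "COPY requirements.txt .", content_digest="d4e5f6")
# Editing app source changes only the digest of the final COPY,
# so the slow dependency layer above it stays cached:
app_v1 = layer_cache_key(deps, "COPY . .", content_digest="111111")
app_v2 = layer_cache_key(deps, "COPY . .", content_digest="222222")
```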
3. Container Runtime Stack¶
flowchart TD
DC["docker CLI / kubectl"] --> DS["dockerd daemon (Docker API)"]
DS --> CT["containerd (OCI lifecycle manager)"]
CT --> RU["runc (low-level OCI runtime)"]
RU --> CL["clone() syscall\nLinux namespaces created"]
RU --> CG2["cgroups v2 hierarchy\nresource limits applied"]
CL --> FS["OverlayFS mount\nimage layers + upperdir"]
FS --> C["Container process running\nas isolated pid 1"]
containerd manages image pulls (from registry), snapshot management (OverlayFS layers), and delegates actual process spawning to runc via the OCI runtime spec. Kubernetes communicates with containerd via the Container Runtime Interface (CRI) gRPC protocol.
4. Kubernetes Control Plane: Full Internal Flow¶
flowchart TD
K["kubectl apply -f deployment.yaml"]
K --> API["API Server\n(kube-apiserver)\nHTTPS REST endpoint\nadmission webhooks\nOPA/Gatekeeper validation"]
API --> ETCD["etcd\ndistributed KV store\nRaft consensus\nsource of truth for all cluster state"]
ETCD --> CM["Controller Manager\n(kube-controller-manager)\nlist/watch via API Server\n(never reads etcd directly)\nDeployment controller, RS controller\nEndpoint controller, etc."]
CM --> SCHED["Scheduler\n(kube-scheduler)\nwatches unbound pods\nfilters nodes (predicates)\nscores nodes (priorities)\nbinds pod via API Server (sets spec.nodeName)"]
SCHED --> KL["kubelet on selected node\nwatches pod spec via API Server\ncalls containerd via CRI gRPC"]
KL --> RT["containerd → runc\nnamespace + cgroup setup\nOverlayFS mount"]
RT --> POD["Pod running on node\ncontainers started"]
Admission Webhook Chain¶
sequenceDiagram
participant U as kubectl
participant API as API Server
participant MUT as Mutating Webhook (e.g., Istio sidecar injector)
participant VAL as Validating Webhook (e.g., OPA Gatekeeper)
participant ETCD as etcd
U->>API: POST /apis/apps/v1/deployments
API->>API: authentication (mTLS/OIDC token)
API->>API: authorization (RBAC check)
API->>MUT: MutatingAdmissionWebhook (can modify object)
MUT-->>API: patched object (e.g., sidecar container injected)
API->>VAL: ValidatingAdmissionWebhook (can only approve/reject)
VAL-->>API: 200 OK / 403 Forbidden
API->>ETCD: persist object
ETCD-->>API: resourceVersion assigned
API-->>U: 201 Created
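A minimal mutating-webhook handler, sketched in Python. The AdmissionReview v1 response shape (echoed uid, base64-encoded JSONPatch) is the real API contract; the sidecar image is a placeholder:

```python
import base64, json

def mutate(review: dict) -> dict:
    # JSONPatch appending a sidecar container; image is illustrative only
    patch = [{"op": "add",
              "path": "/spec/template/spec/containers/-",
              "value": {"name": "sidecar", "image": "example/proxy:latest"}}]
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": review["request"]["uid"],   # must echo the request's uid
            "allowed": True,
            "patchType": "JSONPatch",
            "patch": base64.b64encode(json.dumps(patch).encode()).decode(),
        },
    }
```

The API Server applies the decoded patch to the object before handing it to the validating webhooks.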
5. etcd: The Cluster's Single Source of Truth¶
stateDiagram-v2
[*] --> Follower
Follower --> Candidate: election timeout (150-300ms)\nno heartbeat from leader
Candidate --> Leader: majority votes (quorum = N/2+1)
Candidate --> Follower: higher term discovered
Leader --> Follower: higher term or partition
Leader --> Leader: heartbeat AppendEntries every 50ms
Every Kubernetes API write is persisted to etcd, and each change surfaces as a watch event. Controllers don't poll — they register list/watch streams with the API Server, which streams each delta to all watchers (controllers, kubelet, kube-proxy) in near real time. Reconciliation itself is level-triggered, not edge-triggered: every controller continuously drives currentState toward desiredState, so a missed event is repaired on the next resync.
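The level-triggered loop reduces to a small idempotent function. A sketch over a toy state model (not client-go):

```python
def reconcile(desired: dict, current: dict) -> list[str]:
    # level-triggered: compare the FULL desired vs current state each pass,
    # so a missed watch event is repaired on the next resync (idempotent)
    actions = [f"apply {name}" for name, spec in desired.items()
               if current.get(name) != spec]
    actions += [f"delete {name}" for name in current if name not in desired]
    return actions

reconcile({"web": {"replicas": 3}}, {})                        # ['apply web']
reconcile({"web": {"replicas": 3}}, {"web": {"replicas": 3}})  # [] — converged, no-op
```

Re-running it on an already-converged state is a no-op, which is why controllers can crash, restart, and resync without corrupting the cluster.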
6. ReplicaSet Label Selector Mechanics¶
A ReplicaSet does NOT track which pods it created by UUID. It performs a label selector query continuously:
flowchart LR
RS["ReplicaSet\nspec.replicas: 3\nselector:\n matchLabels:\n app: nginx\n version: v2"]
RS --> Q["LIST pods WHERE\napp=nginx AND version=v2\n(like SQL SELECT)"]
Q --> COUNT["count matching pods"]
COUNT --> CMP{count == 3?}
CMP -- "count < 3" --> CREATE["create new pod\nfrom spec.template"]
CMP -- "count > 3" --> DELETE["delete oldest extra pod"]
CMP -- "count == 3" --> IDLE["no action — desired state met"]
This label-based ownership means: if you manually label an unrelated pod with app: nginx, version: v2, the ReplicaSet will adopt it and potentially delete one of your intentional pods to maintain count=3.
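The loop above, sketched as a function. This is a toy pod model; it assumes list order stands in for the controller's real deletion ranking (which prefers not-ready and newer pods):

```python
def reconcile_replicaset(selector: dict, replicas: int, pods: list[dict]) -> dict:
    # ownership is purely the label query — any matching pod counts
    matching = [p for p in pods
                if all(p["labels"].get(k) == v for k, v in selector.items())]
    diff = replicas - len(matching)
    if diff > 0:
        return {"create": diff}
    if diff < 0:
        # surplus pods (including manually labeled strays) get deleted —
        # possibly one of your intended pods, not the stray
        return {"delete": [p["name"] for p in matching[:-diff]]}
    return {}

pods = [
    {"name": "nginx-1", "labels": {"app": "nginx", "version": "v2"}},
    {"name": "nginx-2", "labels": {"app": "nginx", "version": "v2"}},
    {"name": "nginx-3", "labels": {"app": "nginx", "version": "v2"}},
    {"name": "stray",   "labels": {"app": "nginx", "version": "v2"}},  # manually labeled
]
reconcile_replicaset({"app": "nginx", "version": "v2"}, 3, pods)  # {'delete': ['nginx-1']}
```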
7. Rolling Update: Deployment Controller State Machine¶
stateDiagram-v2
direction LR
[*] --> Stable_v1: initial state\n3 pods on RS-v1
Stable_v1 --> Transitioning: kubectl set image\nnew RS-v2 created (0 replicas)
Transitioning --> Progressing: scale RS-v2 up by MaxSurge\nscale RS-v1 down by MaxUnavailable
Progressing --> Progressing: repeat until\nRS-v2=3, RS-v1=0
Progressing --> Stable_v2: all pods Ready\nRS-v1 kept at 0 (for rollback)
Stable_v2 --> Stable_v1: kubectl rollout undo\nRS-v1 scaled back up
MaxSurge=1, MaxUnavailable=0 (zero-downtime):

- At no point can the total Ready pods drop below desired (3)
- One extra pod is created (4 total briefly), then one old pod deleted
- Each new pod must pass its readiness probe before the rollout proceeds
sequenceDiagram
participant DC as Deployment Controller
participant RSv1 as ReplicaSet v1 (3 pods)
participant RSv2 as ReplicaSet v2 (0 pods)
DC->>RSv2: scale to 1
RSv2->>RSv2: pod v2-1 starts, passes readinessProbe
DC->>RSv1: scale to 2
RSv1->>RSv1: pod v1-3 terminated
DC->>RSv2: scale to 2
RSv2->>RSv2: pod v2-2 starts, passes readinessProbe
DC->>RSv1: scale to 1
DC->>RSv2: scale to 3
RSv2->>RSv2: pod v2-3 passes readinessProbe
DC->>RSv1: scale to 0
Note over RSv1: RS-v1 kept (revision history for rollback)
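The invariants in the sequence above can be checked mechanically. A sketch, where the pairs are illustrative (Ready, total) snapshots between steps:

```python
def rolling_step_ok(ready: int, total: int, desired: int = 3,
                    max_surge: int = 1, max_unavailable: int = 0) -> bool:
    # Two invariants the Deployment controller never violates:
    # never more than desired + maxSurge pods in total,
    # never fewer than desired - maxUnavailable Ready pods.
    return total <= desired + max_surge and ready >= desired - max_unavailable

# (ready, total) snapshots from the rollout above:
steps = [(3, 4), (3, 3), (3, 4), (3, 3), (4, 4), (3, 3)]
all(rolling_step_ok(r, t) for r, t in steps)  # True
```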
8. Service Networking: kube-proxy iptables/IPVS¶
A Service is a stable virtual IP (ClusterIP) that load-balances to a dynamic set of pod IPs. No proxy process sits in the data path: kube-proxy programs iptables DNAT rules (or IPVS virtual servers in ipvs mode), and the kernel forwards packets directly.
flowchart TD
PKT["packet to ClusterIP:80\n(e.g., 10.96.43.21:80)"]
PKT --> PREROUTING["iptables PREROUTING chain"]
PREROUTING --> KS["KUBE-SERVICES chain\nmatch destination IP:port"]
KS --> SVC["KUBE-SVC-XXXXX chain\n(per-Service chain)\nstatistical load balance\n(1/N probability each rule)"]
SVC --> SEP1["KUBE-SEP-AAAA\nDNAT to pod-1-IP:8080\n(e.g., 192.168.1.5:8080)"]
SVC --> SEP2["KUBE-SEP-BBBB\nDNAT to pod-2-IP:8080"]
SVC --> SEP3["KUBE-SEP-CCCC\nDNAT to pod-3-IP:8080"]
SEP1 --> POD["Pod receives packet\non real IP:port"]
The Endpoint controller continuously watches pod events. When a pod fails readiness, its IP is removed from the Endpoints object, and kube-proxy removes that DNAT rule — traffic stops before pod termination.
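The "1/N probability each rule" shorthand hides a subtlety: since iptables rules are evaluated sequentially, kube-proxy programs rule i with probability 1/(N−i) so that each endpoint ends up with an equal overall share. A quick check of that arithmetic:

```python
def rule_probabilities(n: int) -> list[float]:
    # rule i (0-based) matches with probability 1/(n - i); the last is 1/1
    return [1.0 / (n - i) for i in range(n)]

def traffic_share(probs: list[float]) -> list[float]:
    # overall share of endpoint i = P(no earlier rule matched) * P(rule i matches)
    shares, remaining = [], 1.0
    for p in probs:
        shares.append(remaining * p)
        remaining *= 1.0 - p
    return shares

traffic_share(rule_probabilities(3))  # each endpoint ends up with 1/3
```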
IPVS Mode (high pod count)¶
At 10,000+ services, iptables rule matching degrades — an O(N) linear scan per connection setup. IPVS uses kernel hash tables for O(1) lookup:
block-beta
columns 2
block:ip["iptables mode"]:1
I1["rule 1: match SVC-A → pod-1"]
I2["rule 2: match SVC-A → pod-2"]
I3["rule ...10000 rules scanned linearly"]
end
block:ipv["IPVS mode"]:1
V1["IPVS virtual server table (hash)"]
V2["O(1) lookup → backend real server"]
V3["LB algorithms: rr, lc, sh, dh, wlc"]
end
9. Ingress Controller: L7 HTTP Routing Internals¶
flowchart TD
EXT["External traffic\nHTTPS :443"]
EXT --> ING["Ingress Controller\n(nginx/envoy pod)\nTLS termination\n(cert-manager managed TLS secret)"]
ING --> RR["nginx upstream routing rules\ngenerated from Ingress resource"]
RR --> SVC_A["Service A (ClusterIP)\n/api/* → service-api:8080"]
RR --> SVC_B["Service B (ClusterIP)\n/static/* → service-static:3000"]
SVC_A --> PA["Pod A instances\n(kube-proxy DNAT)"]
SVC_B --> PB["Pod B instances\n(kube-proxy DNAT)"]
The nginx ingress controller runs a watch loop on Ingress objects. When an Ingress is created or modified, the controller regenerates the nginx upstream configuration and triggers a graceful reload (nginx -s reload signals the master process; old workers drain in-flight connections before exiting), so no connections are dropped.
10. Persistent Volumes: CSI Driver Architecture¶
The Container Storage Interface (CSI) decouples Kubernetes from storage vendor implementations:
flowchart TD
PVC["PVC: request 10Gi ReadWriteOnce\nstorageClassName: fast-ssd"]
PVC --> SC["StorageClass\nprovisioner: ebs.csi.aws.com\nreclaimPolicy: Delete\nvolumeBindingMode: WaitForFirstConsumer"]
SC --> PROV["CSI external-provisioner sidecar\ncalls CreateVolume RPC"]
PROV --> DRIVER["CSI driver (aws-ebs-csi-driver)\ncreates EBS volume via AWS API"]
DRIVER --> PV["PV object created\nspec.csi.volumeHandle: vol-0abc123\nstatus: Available"]
PV --> BIND["PVC bound to PV\n(1:1 binding, immutable)"]
BIND --> POD["Pod spec: volumeMounts\nCSI attaches EBS to node\n(NodeStage + NodePublish RPCs)\nblock device mounted at /data"]
PV access modes map to storage system capabilities:
- ReadWriteOnce (RWO): one node mounts read/write — EBS, local SSD
- ReadWriteMany (RWX): multiple nodes mount read/write — NFS, CephFS
- ReadOnlyMany (ROX): multiple nodes read-only — shared config data
11. RBAC: Subject → Role → Resource Binding¶
flowchart LR
SA["ServiceAccount: app-reader\n(namespace: production)"]
SA --> RB["RoleBinding: app-reader-binding\nsubject: ServiceAccount/app-reader\nroleRef: Role/pod-reader"]
RB --> R["Role: pod-reader\nrules:\n- apiGroups: [\"\"]\n resources: [pods]\n verbs: [get, list, watch]"]
R --> AUTH["API Server RBAC authorizer\nrequest: GET /api/v1/namespaces/production/pods\n→ ALLOW"]
R --> DENY["request: DELETE /api/v1/namespaces/production/pods/foo\n→ DENY 403"]
ClusterRole vs Role: Role is namespace-scoped; ClusterRole applies cluster-wide (e.g., node access, PV management). ClusterRoleBinding grants cluster-wide permissions; RoleBinding scopes a ClusterRole to a namespace.
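The authorizer's core decision is a purely additive match. A sketch — real RBAC also evaluates apiGroups, resourceNames, wildcards, and aggregated ClusterRoles:

```python
def rbac_allows(rules: list[dict], verb: str, resource: str) -> bool:
    # allowed iff ANY rule in ANY bound role matches; RBAC has no deny rules
    return any(verb in r["verbs"] and resource in r["resources"] for r in rules)

pod_reader = [{"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "list", "watch"]}]
rbac_allows(pod_reader, "get", "pods")     # True  -> 200
rbac_allows(pod_reader, "delete", "pods")  # False -> 403 Forbidden
```

The absence of deny rules is why least-privilege is the only workable RBAC strategy: you cannot subtract permissions, only avoid granting them.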
12. StatefulSet vs Deployment: Identity Preservation¶
stateDiagram-v2
direction LR
state "Deployment (stateless)" as DEP {
p1: pod-abc12
p2: pod-def34
p3: pod-ghi56
note: random names, any pod replaceable
}
state "StatefulSet (stateful)" as STS {
s0: mysql-0 (persistent identity)
s1: mysql-1
s2: mysql-2
note: ordered creation 0→1→2\nordered deletion 2→1→0\nstable DNS: mysql-0.mysql.ns.svc.cluster.local\npersistent PVC bound to each ordinal
}
StatefulSets guarantee:
1. Stable network identity: $(podname).$(servicename).$(namespace).svc.cluster.local
2. Stable storage: PVC data-mysql-0 persists across pod restarts (not deleted on pod delete)
3. Ordered rolling updates: pod N+1 not updated until pod N is Running+Ready
13. Kubernetes vs Docker Swarm vs Mesos: Scheduler Architecture¶
block-beta
columns 3
block:K8s["Kubernetes"]:1
columns 1
KCP["Control Plane:\nAPI Server + etcd + Scheduler\n+ Controller Manager"]
KN["Worker Nodes: kubelet + kube-proxy"]
KS["Scheduling: predicate filter\n(resource fit, taints, affinity)\n+ priority scoring\n(bin-packing vs spreading)"]
end
block:SW["Docker Swarm"]:1
columns 1
SM["Manager Nodes:\nRaft consensus\norchestrate service tasks"]
SN["Worker Nodes: receive tasks"]
SS["Scheduling: spread strategy\nby default (even distribution)\nnaive compared to K8s"]
end
block:MS["Mesos + Marathon"]:1
columns 1
MM["Mesos Master:\ntwo-level scheduling\noffer-based resource delegation"]
MN["Mesos Agents: report resources"]
MSS["Marathon Scheduler:\nreceives resource offers\nlaunches Docker executor"]
end
Kubernetes two-phase scheduling (predicates → priorities):
1. Predicates (hard filters): NodeResourcesFit, PodFitsHostPorts, NodeAffinity, TaintToleration — eliminates ineligible nodes
2. Priorities (soft scoring): LeastRequestedPriority (bin-pack), BalancedResourceAllocation, InterPodAffinity — scores remaining nodes 0-100, highest wins
14. Blue/Green and Canary Deployments: Label Selector Switch¶
Both advanced deployment strategies exploit K8s label selector mechanics — no special controller needed.
Blue/Green¶
sequenceDiagram
participant SVC as Service (selector: version=blue)
participant BLUE as Deployment-blue (3 pods, version=blue)
participant GREEN as Deployment-green (3 pods, version=green)
Note over SVC,BLUE: 100% traffic → blue
Note over GREEN: green deployed in parallel, not receiving traffic
SVC->>SVC: patch selector: version=green
Note over SVC,GREEN: 100% traffic switches to green instantly
Note over BLUE: blue kept for instant rollback\ndelete when green stable
Rollback: patch selector back to version=blue — instantaneous, no pod restart.
Canary¶
flowchart LR
SVC["Service selector:\napp: frontend"]
SVC --> ST["Stable deployment\napp: frontend\n9 replicas\n→ 90% traffic"]
SVC --> CN["Canary deployment\napp: frontend\n1 replica\n→ 10% traffic\n(proportional to replica count)"]
CN --> MON["Monitor error rates\nlatency via Prometheus"]
MON -- "healthy" --> SCALE["scale canary to 10\nscale stable to 0"]
MON -- "bad metrics" --> ROLLBACK["delete canary deployment"]
15. CI/CD Pipeline: Container Lifecycle in Automation¶
flowchart LR
GIT["git push\nfeature branch"]
GIT --> CI["CI pipeline\n(Semaphore/Jenkins/GitHub Actions)"]
CI --> BUILD["docker build -t app:$GIT_SHA .\n(layer cache from registry)"]
BUILD --> TEST["docker run --rm app:$GIT_SHA\nnpm test / pytest / go test"]
TEST --> PUSH["docker push registry/app:$GIT_SHA\n(push only new/changed layers)"]
PUSH --> STAGING["kubectl set image deployment/app\napp=registry/app:$GIT_SHA\n--namespace=staging"]
STAGING --> SMOKE["smoke tests\nreadiness probe gate"]
SMOKE -- "pass" --> PROD["kubectl set image deployment/app\napp=registry/app:$GIT_SHA\n--namespace=production\n(rolling update)"]
SMOKE -- "fail" --> RB["kubectl rollout undo\ndeployment/app"]
Build-once, promote principle: the same image (tagged with $GIT_SHA) flows through staging → production. No rebuilds between environments — the content-addressed image digest guarantees that production runs exactly the bits that passed staging.
16. Pod Lifecycle State Machine¶
stateDiagram-v2
[*] --> Pending: pod created, scheduled to node
Pending --> Init: init containers start (sequential)
Init --> Running: all init containers exit 0\nmain containers start
Running --> Succeeded: all containers exit 0 (Job)
Running --> Failed: container exits non-zero\nrestartPolicy=Never
Running --> Running: container restarts\n(restartPolicy=Always/OnFailure)\nExponential backoff: 10s→20s→40s→...→5min
Running --> Terminating: SIGTERM sent\nterminationGracePeriodSeconds countdown
Terminating --> [*]: SIGKILL if grace period exceeded
Probe types and failure effects:

- livenessProbe fails → container is killed and restarted (CrashLoopBackOff if it keeps failing)
- readinessProbe fails → pod IP removed from Endpoints (no traffic; pod stays Running)
- startupProbe fails → container is killed; while it is still probing, liveness and readiness checks are suppressed, preventing kills during slow startup
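The restart backoff from the diagram (10s doubling to a 5-minute cap) in one line:

```python
def crashloop_backoff(restart_count: int) -> float:
    # delay before the Nth restart: 10s doubling each time, capped at 300s
    return min(10.0 * 2 ** restart_count, 300.0)

[crashloop_backoff(i) for i in range(6)]  # [10.0, 20.0, 40.0, 80.0, 160.0, 300.0]
```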
17. Network Policy: eBPF/iptables Packet Filter Architecture¶
flowchart TD
POD_A["Pod A (namespace: prod)\nip: 192.168.1.5"]
POD_B["Pod B (namespace: prod)\nip: 192.168.1.6"]
POD_C["Pod C (namespace: test)\nip: 192.168.2.7"]
NP["NetworkPolicy on Pod B:\nspec.ingress:\n- from:\n - podSelector: {app: trusted}\n namespaceSelector: {env: prod}"]
POD_A -- "app=trusted label → ALLOW" --> NP
NP --> POD_B
POD_C -- "namespace=test → DENY" --> NP
NetworkPolicy is enforced by the CNI plugin (Calico, Cilium, WeaveNet). Cilium uses eBPF programs attached to network interfaces — no iptables rules, O(1) verdict lookup via BPF hash maps. Calico uses iptables chains injected per-NetworkPolicy.
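The semantics of spec.ingress.from — selectors inside one entry are AND-ed, separate entries are OR-ed — can be sketched directly (equality-only selectors; real selectors also support matchExpressions):

```python
def ingress_allowed(from_entries: list[dict], peer_pod_labels: dict, peer_ns_labels: dict) -> bool:
    def matches(selector: dict, labels: dict) -> bool:
        return all(labels.get(k) == v for k, v in selector.items())
    # selectors inside one entry are AND-ed; entries in the list are OR-ed
    return any(matches(e.get("podSelector", {}), peer_pod_labels)
               and matches(e.get("namespaceSelector", {}), peer_ns_labels)
               for e in from_entries)

policy = [{"podSelector": {"app": "trusted"}, "namespaceSelector": {"env": "prod"}}]
ingress_allowed(policy, {"app": "trusted"}, {"env": "prod"})  # True  (Pod A)
ingress_allowed(policy, {"app": "trusted"}, {"env": "test"})  # False (Pod C)
```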
18. Auto Scaling: HPA Control Loop¶
flowchart TD
HPA["HPA Controller\n(kube-controller-manager)\nchecks every 15s"]
HPA --> MS["Metrics Server\naggregates kubelet /metrics/resource"]
MS --> CPU["current avg CPU utilization\nacross all pods"]
CPU --> CALC["desired replicas =\nceil(currentReplicas × (currentUtil / targetUtil))\ne.g., 3 × (80% / 50%) = ceil(4.8) = 5"]
CALC --> CMP{within min/max bounds?}
CMP -- yes --> SCALE["patch Deployment.spec.replicas = 5"]
CMP -- no --> CLAMP["clamp to minReplicas or maxReplicas"]
Scale-down stabilization: HPA waits 5 minutes before scaling down to prevent thrashing (yo-yo scaling). Scale-up has no stabilization delay — it acts immediately.
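The controller's core formula with clamping, sketched — the real HPA also skips scaling when the utilization ratio is within a ~10% tolerance band, modeled here, and handles per-pod readiness:

```python
import math

def desired_replicas(current: int, current_util: float, target_util: float,
                     min_replicas: int, max_replicas: int) -> int:
    ratio = current_util / target_util
    if abs(ratio - 1.0) <= 0.1:           # tolerance band: close enough, don't scale
        return current
    desired = math.ceil(current * ratio)  # e.g. ceil(3 * 80/50) = 5
    return max(min_replicas, min(max_replicas, desired))

desired_replicas(3, 80, 50, 1, 10)  # 5
desired_replicas(5, 52, 50, 1, 10)  # 5 (within tolerance, no change)
```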
Summary: Data Flow Through the Full Stack¶
flowchart TD
DEV["Developer: git push"]
DEV --> CICD["CI/CD pipeline\nbuilds Docker image layers\n(OverlayFS content-addressed)"]
CICD --> REG["Container Registry\nstores layer blobs by sha256"]
REG --> KUBCTL["kubectl apply\nDeployment spec → API Server"]
KUBCTL --> ETCD["etcd (Raft)\ndesired state persisted"]
ETCD --> CTRL["Deployment/RS Controller\nwatches etcd list/watch stream"]
CTRL --> SCHED["Scheduler\npredicate filter + priority score\nselects node, writes nodeName"]
SCHED --> KUBELET["kubelet on node\nwatches pod spec"]
KUBELET --> CRI["containerd via CRI gRPC\npull image layers from registry"]
CRI --> OFS["OverlayFS\nstacks read-only layers\n+ writable upperdir"]
OFS --> NS["Linux namespaces created\n(PID, NET, MNT, UTS, IPC)"]
NS --> CG["cgroups applied\n(CPU quota, memory limit)"]
CG --> RUN["Container process running\nas isolated pid 1"]
RUN --> SVC["kube-proxy iptables DNAT rules\nClusterIP → pod IPs\nload balanced"]
SVC --> ING["Ingress controller (nginx)\nL7 HTTP routing\nTLS termination"]
ING --> USER["User request served"]
Every layer of abstraction — from kubectl apply to packets reaching a container — traverses this precise path: API Server admission, etcd Raft commit, controller reconciliation, scheduler placement, kubelet CRI invocation, containerd OverlayFS setup, namespace/cgroup isolation, and kube-proxy iptables routing.