Kubernetes Orchestration Internals: Under the Hood¶
Sources: Container Management: Kubernetes vs Docker Swarm, Mesos + Marathon, Amazon ECS (eBook); Everything Kubernetes: A Practical Guide (Stratoscale); Cloud Container Engine — Kubernetes Basics (Huawei CCE, 2025)
1. The Control Plane: etcd as the Ground Truth¶
Every decision Kubernetes makes flows through a single source of truth: etcd, a distributed key-value store implementing the Raft consensus algorithm. When you kubectl apply a manifest, the journey begins not at the scheduler or kubelet — it begins at etcd.
flowchart TD
CLI["kubectl apply -f pod.yaml"] -->|HTTPS/TLS| API["kube-apiserver\n:6443"]
API -->|Authenticate + Authorize| AUTHN["RBAC / ServiceAccount\nToken Validation"]
AUTHN -->|Admission Controllers| ADM["MutatingWebhook\nValidatingWebhook\nResourceQuota"]
ADM -->|Write desired state| ETCD[("etcd\nRaft Cluster\n:2379")]
ETCD -->|Watch notification| CTRL["kube-controller-manager\nDeployment Controller"]
CTRL -->|Create ReplicaSet/Pod objects| ETCD
ETCD -->|Watch: unscheduled pods| SCHED["kube-scheduler"]
SCHED -->|Binding decision| ETCD
ETCD -->|Watch: pod bound to node| KUBELET["kubelet (node agent)"]
KUBELET -->|Pull image + start container| CRI["CRI: containerd / CRI-O"]
CRI --> CGROUP["Linux cgroups + namespaces\nPID/Net/Mount isolation"]
The apiserver never calls the scheduler or kubelet directly. Everything is event-driven watch loops: each component watches the API server (which streams changes out of etcd) for objects whose state it is responsible for reconciling.
etcd Raft Internals¶
etcd stores Kubernetes objects as serialized protobuf under keys like /registry/pods/default/nginx-abc123. Raft ensures that writes are committed to a quorum (⌊n/2⌋ + 1) before returning success to the API server.
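The quorum arithmetic is simple enough to sketch directly (a toy helper, not etcd code):

```python
def quorum(members: int) -> int:
    """Minimum number of voters needed to commit a Raft log entry."""
    return members // 2 + 1

# A write is durable once acknowledged by a quorum of the cluster;
# the remaining members may fail without losing committed data.
for n in (1, 3, 5, 7):
    print(f"{n}-member cluster: quorum = {quorum(n)}, tolerates {n - quorum(n)} failures")
```

This is why etcd clusters are sized with odd member counts: a 4-member cluster needs a quorum of 3 and tolerates only one failure, the same as a 3-member cluster.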
sequenceDiagram
participant API as kube-apiserver
participant L as etcd Leader
participant F1 as etcd Follower 1
participant F2 as etcd Follower 2
API->>L: PUT /registry/pods/default/nginx (proto bytes)
L->>L: Append to local WAL (Write-Ahead Log)
par Raft AppendEntries
L->>F1: AppendEntries RPC (log index N)
L->>F2: AppendEntries RPC (log index N)
end
F1-->>L: ACK (success)
F2-->>L: ACK (success)
L->>L: Commit entry (quorum reached: 2/3)
L-->>API: 200 OK (etcd revision R)
L->>F1: Commit notification
L->>F2: Commit notification
If the leader dies mid-write, the cluster elects a new leader from the up-to-date followers and preserves only committed log entries; an entry that never reached a quorum was never acknowledged to the API server and may be discarded.
2. Scheduler Internals: Predicate and Priority Pipeline¶
The scheduler watches etcd for Pending pods (pods with no spec.nodeName). When found, it runs a two-phase pipeline to select a node.
flowchart LR
subgraph FILTER["Phase 1: Filter (Predicates)"]
P1["NodeResourcesFit\n(CPU/Memory requests)"]
P2["NodeAffinity\n(label selectors)"]
P3["PodTopologySpread\n(zone distribution)"]
P4["TaintToleration\n(node taints)"]
P5["VolumeBinding\n(PVC nodeAffinity)"]
end
subgraph SCORE["Phase 2: Score (Priorities)"]
S1["LeastAllocated\n(spread load)"]
S2["NodeAffinity score\n(preferred weight)"]
S3["InterPodAffinity\n(co-location bonus)"]
S4["ImageLocality\n(image already pulled)"]
end
PENDING["Pending Pod"] --> FILTER
FILTER -->|Feasible nodes| SCORE
SCORE -->|Highest score wins| BIND["Binding:\nPatch pod.spec.nodeName"]
BIND --> ETCD[("etcd")]
Filter phase is O(nodes) — each predicate runs against all nodes. Infeasible nodes are eliminated immediately. Score phase normalizes each plugin's scores 0–100 and applies configured weights. The final score is a weighted sum.
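A minimal sketch of the two-phase pipeline (toy predicates and a single hypothetical scoring plugin, not the real plugin framework):

```python
def schedule(pod, nodes, predicates, priorities):
    """Filter then score: predicates eliminate nodes, priorities rank the rest."""
    # Phase 1 (filter): drop any node failing a predicate.
    feasible = [n for n in nodes if all(p(pod, n) for p in predicates)]
    if not feasible:
        raise RuntimeError(f"0/{len(nodes)} nodes are available")
    # Phase 2 (score): weighted sum of per-plugin scores (normalized 0-100).
    best = max(feasible, key=lambda n: sum(w * fn(pod, n) for fn, w in priorities))
    return best["name"]

# Toy plugins (hypothetical, for illustration only):
fits = lambda pod, n: n["free_cpu"] >= pod["cpu"]
least_allocated = lambda pod, n: 100 * (n["free_cpu"] - pod["cpu"]) / n["cap_cpu"]

nodes = [{"name": "n1", "free_cpu": 1.0, "cap_cpu": 4.0},
         {"name": "n2", "free_cpu": 3.0, "cap_cpu": 4.0}]
print(schedule({"cpu": 0.5}, nodes, [fits], [(least_allocated, 1.0)]))  # n2
```

Both phases touch every node, so keeping predicates cheap matters far more than keeping priorities cheap: a predicate failure short-circuits before any scoring happens.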
Resource Bin-Packing vs. Spreading¶
LeastAllocated scores nodes higher when they have more free resources — this spreads pods. MostAllocated scores nodes higher when they have less free resources — this bin-packs. The scheduler's plugin framework lets you swap these.
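The two policies are mirror images of each other; a sketch of the scoring shapes, simplified to a single resource (not the in-tree plugin code):

```python
def least_allocated_score(requested: float, capacity: float) -> float:
    """Higher when more is free: spreads pods across nodes."""
    return 100 * (capacity - requested) / capacity

def most_allocated_score(requested: float, capacity: float) -> float:
    """Higher when more is used: bin-packs pods onto fewer nodes."""
    return 100 * requested / capacity

# A node with 2 of 8 CPUs requested:
print(least_allocated_score(2, 8))  # 75.0 (attractive to a spreading scheduler)
print(most_allocated_score(2, 8))   # 25.0 (unattractive to a bin-packing one)
```

Bin-packing maximizes node utilization (useful for autoscaled clusters where empty nodes can be released); spreading minimizes blast radius when a node dies.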
stateDiagram-v2
[*] --> PodCreated: kubectl apply
PodCreated --> Pending: Pod object in etcd\nspec.nodeName = ""
Pending --> Scheduled: Scheduler writes\nbinding to etcd
Scheduled --> ContainerCreating: kubelet picks up pod\nstarts image pull
ContainerCreating --> Running: All containers started
Running --> Succeeded: All containers exit 0
Running --> Failed: Container exit code != 0\nrestartPolicy=Never
Running --> CrashLoopBackOff: Repeated failures\nexponential backoff
Running --> Terminating: kubectl delete / preStop hook
Terminating --> [*]: SIGTERM → grace period → SIGKILL
3. kubelet: The Node Agent's Internal Loop¶
The kubelet is the most complex component — it runs on every node and bridges the Kubernetes API with the container runtime (containerd/CRI-O) and Linux kernel.
flowchart TD
WATCH["kubelet watches\nAPI server for pod specs\n(bound to this node)"] --> ADMIT["Pod Admission\n- resource limits check\n- QoS class assignment"]
ADMIT --> CGROUPMGR["cgroup Manager\nCreate cgroup hierarchy:\n/kubepods/burstable/podUID/containerUID"]
CGROUPMGR --> CRI_CALL["CRI gRPC call:\nRunPodSandbox (pause container)\nCreateContainer\nStartContainer"]
CRI_CALL --> CNI["CNI Plugin Call:\nip netns create\nveth pair creation\nbridge/overlay attachment"]
CNI --> PROBES["Probe Manager\nliveness: HTTP/TCP/Exec\nreadiness: HTTP/TCP/Exec\nstartup: HTTP/TCP/Exec"]
PROBES --> STATUS["Status Manager\nPatch pod.status back\nto API server"]
STATUS --> EVICT["Eviction Manager\nMonitor memory.available\nnodefs.available\nimagefs.available"]
cgroup Hierarchy for Pod QoS¶
Kubernetes assigns each pod a QoS class based on resource requests/limits:
/sys/fs/cgroup/memory/kubepods/
├── guaranteed/ ← requests == limits for ALL containers
│ └── pod<UID>/
│ └── <containerID>/ memory.limit_in_bytes = N
├── burstable/ ← some containers have requests < limits
│ └── pod<UID>/
│ └── <containerID>/ memory.limit_in_bytes = limit
└── besteffort/ ← no requests or limits set
└── pod<UID>/
└── <containerID>/ memory.limit_in_bytes = node max
OOM kill order: BestEffort pods are killed first (oom_score_adj = 1000), Burstable next (oom_score_adj decreases as the container's memory request grows relative to node capacity), Guaranteed last (oom_score_adj = -998).
block-beta
columns 3
A["Guaranteed QoS\nOOM adj: -998\nEviction: LAST"]:1
B["Burstable QoS\nOOM adj: 2..999\nEviction: MIDDLE"]:1
C["BestEffort QoS\nOOM adj: 1000\nEviction: FIRST"]:1
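A sketch of the OOM-score policy described above (the Burstable formula follows kubelet's documented shape; treat the exact constants as an approximation):

```python
def oom_score_adj(qos: str, mem_request: int = 0, node_capacity: int = 1) -> int:
    """Per-container oom_score_adj by QoS class (values follow the text above)."""
    if qos == "Guaranteed":
        return -998          # nearly immune to the kernel OOM killer
    if qos == "BestEffort":
        return 1000          # first in line to be killed
    # Burstable: the larger the memory request relative to node capacity,
    # the lower the score (less likely to be killed first); clamped to 2..999.
    return min(max(2, 1000 - (1000 * mem_request) // node_capacity), 999)

# A Burstable container requesting 4 GiB on a 16 GiB node:
print(oom_score_adj("Burstable", mem_request=4 << 30, node_capacity=16 << 30))  # 750
```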
4. Container Runtime Interface (CRI): The Abstraction Layer¶
kubelet speaks gRPC to the CRI shim — it never calls Docker or containerd directly. The CRI defines two services: RuntimeService (pods/containers) and ImageService (pull/list/remove).
sequenceDiagram
participant KL as kubelet
participant CRI as CRI runtime (containerd / CRI-O)
participant RUNC as runc (OCI runtime)
participant KERNEL as Linux Kernel
KL->>CRI: RunPodSandbox(PodSandboxConfig)
CRI->>KERNEL: clone(CLONE_NEWNET | CLONE_NEWUTS | CLONE_NEWIPC)
KERNEL-->>CRI: pause container PID
CRI-->>KL: PodSandboxID
KL->>CRI: PullImage(ImageSpec)
CRI->>CRI: Pull OCI layers → overlay2 mount
CRI-->>KL: ImageRef
KL->>CRI: CreateContainer(PodSandboxID, ContainerConfig)
CRI->>RUNC: runc create (OCI spec JSON)
RUNC->>KERNEL: mount overlay filesystem\nsetup cgroups\nsetup seccomp/apparmor
RUNC-->>CRI: container ID
KL->>CRI: StartContainer(ContainerID)
CRI->>RUNC: runc start
RUNC->>KERNEL: execve(entrypoint)
KERNEL-->>CRI: PID 1 in container namespace
Container Image Layers: Copy-on-Write Filesystem¶
flowchart BT
subgraph OVERLAYfs["OverlayFS Mount"]
UPPER["upperdir (read-write layer)\n/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/\nsnapshots/42/fs/"]
WORK["workdir (atomic ops)"]
LOWER3["lower layer 3: App binaries\n(sha256:abc...)"]
LOWER2["lower layer 2: pip packages\n(sha256:def...)"]
LOWER1["lower layer 1: base OS\n(sha256:ghi...)"]
end
LOWER1 --> LOWER2 --> LOWER3 --> UPPER
UPPER -->|"merged view"| CONTAINER["Container sees unified /"]
When a container writes to a read-only lower layer, the kernel performs a copy-up: the file is copied to upperdir before modification. This means first-write latency includes the copy cost.
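The copy-up semantics can be modeled with a toy in-memory overlay (an illustration of the merge rules, not real OverlayFS code):

```python
class Overlay:
    """Toy model of an OverlayFS merged view with copy-up on write."""
    def __init__(self, lowers):
        self.lowers = lowers          # read-only image layers, bottom -> top
        self.upper = {}               # per-container writable layer (upperdir)

    def read(self, path):
        if path in self.upper:        # upperdir shadows everything below it
            return self.upper[path]
        for layer in reversed(self.lowers):   # topmost lower layer wins
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        # Copy-up: lower layers are never modified; writes land in upperdir.
        self.upper[path] = data

base = {"/etc/os-release": "debian"}
app  = {"/app/main.py": "print('hi')"}
fs = Overlay([base, app])
fs.write("/etc/os-release", "patched")
print(fs.read("/etc/os-release"))   # patched (served from upperdir)
print(base["/etc/os-release"])      # debian  (lower layer untouched)
```

Because lower layers are content-addressed and immutable, many containers on the same node share one cached copy of each image layer; only upperdir is per-container.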
5. Kubernetes Networking: CNI and kube-proxy¶
CNI Plugin Execution Flow¶
When the CRI creates a pod sandbox, it calls CNI plugins via exec (not gRPC):
sequenceDiagram
participant CRI as containerd
participant CNI as CNI Plugin (Calico/Flannel)
participant NETNS as Linux netns
participant BRIDGE as cni0 bridge / VXLAN
CRI->>CNI: exec (ADD, netns path, pod name)
CNI->>NETNS: ip netns exec <podNS> ip link add eth0 type veth
CNI->>BRIDGE: Add veth peer to bridge / VTEP
CNI->>NETNS: Assign pod CIDR IP to eth0
CNI->>CNI: Install iptables/IPVS rules for pod IP
CNI-->>CRI: Result JSON (IP, gateway, routes)
Each pod gets its own network namespace — a complete, isolated TCP/IP stack. The pause container holds the namespace open so application containers can join it (in Docker terms, --net=container:&lt;pause-container-id&gt;).
kube-proxy: Service to Pod Load Balancing¶
kube-proxy translates abstract Service VIPs into real pod endpoints using either iptables or IPVS:
flowchart LR
CLIENT["Pod X\n10.0.0.5"] -->|dst: 10.247.124.252:8080| IPTABLES["iptables PREROUTING\nDNAT chain"]
IPTABLES -->|Statistically select endpoint\n33% each| EP1["Pod 1: 172.16.3.6:80"]
IPTABLES --> EP2["Pod 2: 172.16.2.132:80"]
IPTABLES --> EP3["Pod 3: 172.16.3.10:80"]
subgraph RULES["iptables rules (kube-proxy maintains)"]
R1["-A KUBE-SVC-xxx -m statistic --mode random\n--probability 0.33 -j KUBE-SEP-1"]
R2["-A KUBE-SVC-xxx -m statistic --mode random\n--probability 0.5 -j KUBE-SEP-2"]
R3["-A KUBE-SVC-xxx -j KUBE-SEP-3"]
end
IPVS mode (recommended in production): instead of long iptables chains that are traversed rule by rule (O(n) per packet, with full-table rewrites on every endpoint change), IPVS maintains a hash table in the kernel's IP Virtual Server module — O(1) lookup per connection regardless of service count.
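Why the probabilities in the rules above are 0.33, then 0.5, then 1: each rule must select uniformly among the endpoints that remain after earlier rules missed. A sketch of the arithmetic:

```python
def endpoint_probabilities(n: int) -> list:
    """Per-rule --probability values giving each of n endpoints equal weight.

    Rule i fires with probability 1/(n - i) *given* that all earlier rules
    missed, so every endpoint ends up with overall probability 1/n.
    """
    return [1 / (n - i) for i in range(n)]

print([round(p, 3) for p in endpoint_probabilities(3)])  # [0.333, 0.5, 1.0]
```

The last rule always has probability 1: if every earlier statistic match missed, the packet must still be DNAT-ed somewhere.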
DNS Resolution Internals (CoreDNS)¶
sequenceDiagram
participant POD as Pod
participant STUB as /etc/resolv.conf\nsearch default.svc.cluster.local
participant COREDNS as CoreDNS (10.96.0.10)
participant ETCD_DNS as kube-apiserver\n(watched Service objects)
POD->>STUB: gethostbyname("nginx")
STUB->>COREDNS: Query: nginx.default.svc.cluster.local A?
COREDNS->>ETCD_DNS: Get Service nginx in namespace default
ETCD_DNS-->>COREDNS: ClusterIP = 10.247.124.252
COREDNS-->>STUB: A 10.247.124.252 (TTL 30s)
STUB-->>POD: 10.247.124.252
6. Controllers: The Reconciliation Loop¶
All Kubernetes controllers share the same architectural pattern: informers (cached watches) feeding work queues, with reconcilers running in goroutines.
flowchart TD
ETCD[("etcd")] -->|Watch stream| INFORMER["Informer (shared cache)\nList+Watch API objects\nlocal in-memory store"]
INFORMER -->|Add/Update/Delete events| QUEUE["Rate-limited Work Queue\n(per controller)"]
QUEUE --> RECONCILE["Reconcile Loop\nactualState vs desiredState"]
RECONCILE -->|Create/Update/Delete objects| API["kube-apiserver"]
API --> ETCD
RECONCILE -->|Requeue on transient error| QUEUE
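The informer/work-queue pattern reduces to a small level-triggered loop; a toy sketch, with in-memory dicts standing in for the API server and the real world:

```python
import queue

def reconcile_loop(work, desired, actual, max_items):
    """Level-triggered reconciliation: compare desired vs actual and converge."""
    for _ in range(max_items):
        key = work.get()                 # e.g. "namespace/name" from an informer event
        want, have = desired.get(key), actual.get(key)
        if want == have:
            continue                     # already converged, nothing to do
        if want is None:
            actual.pop(key, None)        # object deleted -> garbage collect
        else:
            actual[key] = want           # create/update toward desired state

desired = {"default/nginx": {"replicas": 3}}
actual = {}
q = queue.Queue()
q.put("default/nginx")                   # informer enqueued an Add event
reconcile_loop(q, desired, actual, max_items=1)
print(actual)  # {'default/nginx': {'replicas': 3}}
```

The key property is that the loop is level-triggered, not edge-triggered: it compares full states rather than replaying events, so a missed or duplicated event is harmless — the next reconcile converges anyway.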
Deployment Controller Deep Dive¶
When you update a Deployment's image, the Deployment Controller orchestrates a rolling update by managing ReplicaSets:
sequenceDiagram
participant DC as Deployment Controller
participant RS_OLD as ReplicaSet v1 (3 replicas)
participant RS_NEW as ReplicaSet v2 (0 replicas)
participant ETCD as etcd
Note over DC: maxSurge=1, maxUnavailable=0
DC->>RS_NEW: Scale up to 1 replica
RS_NEW-->>DC: 1 pod Running (v2)
DC->>RS_OLD: Scale down to 2 replicas
RS_OLD-->>DC: 2 pods Running (v1)
DC->>RS_NEW: Scale up to 2 replicas
RS_NEW-->>DC: 2 pods Running (v2)
DC->>RS_OLD: Scale down to 1 replica
DC->>RS_NEW: Scale up to 3 replicas
RS_NEW-->>DC: 3 pods Running (v2)
DC->>RS_OLD: Scale down to 0 replicas
Note over DC: Rolling update complete
The old ReplicaSet is retained (scaled to 0) to enable kubectl rollout undo — which simply scales the old RS back up.
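The scale-up/scale-down interleaving above can be simulated directly; a sketch of the rollout arithmetic (not the controller's actual code):

```python
def rolling_update(replicas: int, max_surge: int = 1, max_unavailable: int = 0):
    """Yield (action, new_rs, old_rs) scale steps for a rolling update."""
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    new, old = 0, replicas
    steps = []
    while old > 0 or new < replicas:
        # Scale up: total pods may exceed `replicas` by at most max_surge.
        up = min(replicas - new, replicas + max_surge - (new + old))
        if up > 0:
            new += up
            steps.append(("scale-up new", new, old))
        # Scale down: ready pods may drop below `replicas` by at most max_unavailable.
        down = min(old, new + old - (replicas - max_unavailable))
        if down > 0:
            old -= down
            steps.append(("scale-down old", new, old))
    return steps

for step in rolling_update(3):
    print(step)
```

With maxSurge=1/maxUnavailable=0 this emits exactly the alternating 1-up/1-down sequence in the diagram; larger maxSurge values widen each step.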
7. StatefulSets: Stable Identity for Stateful Workloads¶
StatefulSets differ from Deployments in three critical ways:
1. Stable network identity: pod-0, pod-1, pod-2 — names are deterministic
2. Ordered operations: pods start/stop in strict order (0→1→2 up, 2→1→0 down)
3. Persistent volume binding: each pod gets its own PVC bound permanently
stateDiagram-v2
[*] --> pod0_Pending: StatefulSet created
pod0_Pending --> pod0_Running: pod-0 scheduled + started
pod0_Running --> pod1_Pending: pod-0 Ready → start pod-1
pod1_Pending --> pod1_Running: pod-1 scheduled + started
pod1_Running --> pod2_Pending: pod-1 Ready → start pod-2
pod2_Pending --> pod2_Running: pod-2 Ready
pod2_Running --> [*]: All replicas ready
state pod0_Running {
[*] --> VolumeMount: PVC data-pod-0 bound
VolumeMount --> NetworkID: DNS: pod-0.&lt;svc&gt;.&lt;ns&gt;.svc.cluster.local
}
Headless Service DNS for StatefulSets¶
A StatefulSet requires a headless service (clusterIP: None). CoreDNS creates A records for each pod individually:
kafka-0.kafka.kafka-ns.svc.cluster.local → 172.16.0.10
kafka-1.kafka.kafka-ns.svc.cluster.local → 172.16.0.11
kafka-2.kafka.kafka-ns.svc.cluster.local → 172.16.0.12
Kafka brokers use these stable DNS names in their advertised.listeners — this is why StatefulSets are essential for stateful distributed systems.
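Generating those per-pod records is pure string assembly; a sketch (the cluster domain `cluster.local` is the common default, not guaranteed):

```python
def statefulset_dns(name: str, service: str, namespace: str, replicas: int,
                    zone: str = "cluster.local"):
    """Per-pod A-record names behind a headless Service (clusterIP: None)."""
    return [f"{name}-{i}.{service}.{namespace}.svc.{zone}" for i in range(replicas)]

for fqdn in statefulset_dns("kafka", "kafka", "kafka-ns", 3):
    print(fqdn)
```

Because pod names are ordinal (`<statefulset>-0`, `<statefulset>-1`, ...), these FQDNs survive pod rescheduling: the name stays constant while the backing IP changes.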
8. Persistent Volume Subsystem: The Binding Protocol¶
sequenceDiagram
participant DEV as Developer (PVC)
participant CTRL as PersistentVolume Controller
participant BINDER as Volume Binder (Scheduler)
participant CSI as CSI Driver (EBS/Ceph/NFS)
participant KUBELET as kubelet (node)
DEV->>API: Create PVC (storage: 10Gi, ReadWriteOnce)
CTRL->>CTRL: Find matching PV (capacity ≥ 10Gi,\naccessMode match, storageClass match)
alt Static Binding
CTRL->>PVC: Bind to existing PV
else Dynamic Provisioning
CTRL->>CSI: CreateVolume (10Gi, zone=us-east-1a)
CSI-->>CTRL: VolumeID, access endpoint
CTRL->>PV: Create PV object with VolumeID
CTRL->>PVC: Bind PVC → PV
end
BINDER->>BINDER: VolumeBinding predicate:\nfilter nodes compatible with PV topology
KUBELET->>CSI: NodeStageVolume (format if needed)
KUBELET->>CSI: NodePublishVolume (bind-mount into pod)
CSI-->>KUBELET: Volume mounted at /var/lib/kubelet/pods/<UID>/volumes/
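The static-binding match in the `alt` branch above can be sketched as a filter plus "smallest satisfying PV" selection (a simplification of the real controller's logic):

```python
def find_matching_pv(pvc, pvs):
    """Static binding: pick the smallest unbound PV that satisfies the claim."""
    candidates = [
        pv for pv in pvs
        if pv.get("claimRef") is None                           # not already bound
        and pv["capacity"] >= pvc["request"]                    # big enough
        and pvc["accessMode"] in pv["accessModes"]              # RWO/RWX compatible
        and pv.get("storageClass") == pvc.get("storageClass")   # class must match
    ]
    return min(candidates, key=lambda pv: pv["capacity"], default=None)

pvs = [{"name": "pv-small", "capacity": 5, "accessModes": ["ReadWriteOnce"],
        "storageClass": "gp2", "claimRef": None},
       {"name": "pv-big", "capacity": 20, "accessModes": ["ReadWriteOnce"],
        "storageClass": "gp2", "claimRef": None}]
pvc = {"request": 10, "accessMode": "ReadWriteOnce", "storageClass": "gp2"}
print(find_matching_pv(pvc, pvs)["name"])  # pv-big
```

Choosing the smallest satisfying PV minimizes wasted capacity; if nothing matches, the controller falls through to the dynamic-provisioning branch instead.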
9. Kubernetes vs Competing Orchestrators¶
Architecture Comparison Matrix¶
block-beta
columns 4
H1["Feature"]:1 H2["Kubernetes"]:1 H3["Docker Swarm"]:1 H4["Mesos+Marathon"]:1
R1["State Store"]:1 E1["etcd (Raft)"]:1 E2["Raft (Managers)"]:1 E3["ZooKeeper"]:1
R2["Scheduling"]:1 S1["Predicate+Priority\nplugin framework"]:1 S2["Spread by default\nsimple constraints"]:1 S3["2-level: Mesos offers\n→ Marathon accepts"]:1
R3["Service Discovery"]:1 D1["CoreDNS +\nkube-proxy"]:1 D2["DNS + VIP\ningress LB"]:1 D3["Marathon-LB\nMesos-DNS"]:1
R4["Networking"]:1 N1["CNI plugins\n(Calico/Flannel)"]:1 N2["Overlay VXLAN\nbuilt-in"]:1 N3["No built-in\nuser-defined"]:1
R5["Config Format"]:1 C1["YAML (rich types)\nCRD extensible"]:1 C2["docker-compose\nYAML"]:1 C3["Marathon JSON\nAPI"]:1
Mesos Two-Level Scheduling¶
Mesos uses a resource offer model that is fundamentally different from Kubernetes's centralized scheduling:
sequenceDiagram
participant MESOS as Mesos Master
participant AGENT as Mesos Agent (node)
participant MARATHON as Marathon Framework
participant APP as App Task
AGENT->>MESOS: RegisterSlave (CPUs=8, MEM=16G)
MESOS->>MARATHON: ResourceOffer (4 CPUs, 8G, node-1)
MARATHON->>MARATHON: Does any pending task fit?
MARATHON->>MESOS: LaunchTask (2 CPUs, 4G, docker image)
MESOS->>AGENT: LaunchTask
AGENT->>APP: docker run (with cgroup limits)
APP-->>AGENT: RUNNING
AGENT-->>MESOS: StatusUpdate: RUNNING
MESOS-->>MARATHON: StatusUpdate: RUNNING
The two-level model allows multiple independent frameworks (Marathon, Spark, Flink) to share the same cluster resources — Mesos is a datacenter-level resource abstraction.
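The framework side of one offer cycle can be sketched as first-fit acceptance (a toy model, not Marathon's actual logic):

```python
def handle_offer(offer, pending):
    """Two-level scheduling, framework side: accept pending tasks that fit."""
    launched, cpus, mem = [], offer["cpus"], offer["mem"]
    for task in pending:
        if task["cpus"] <= cpus and task["mem"] <= mem:
            launched.append(task)             # LaunchTask sent back to the master
            cpus -= task["cpus"]
            mem -= task["mem"]
    return launched  # any unused portion of the offer is declined

offer = {"cpus": 4, "mem": 8192, "node": "node-1"}
pending = [{"name": "api", "cpus": 2, "mem": 4096},
           {"name": "batch", "cpus": 4, "mem": 4096}]
print([t["name"] for t in handle_offer(offer, pending)])  # ['api']
```

Note the inversion versus Kubernetes: the master decides *who* gets resources, but each framework decides *what* to run on them, which is what lets Marathon, Spark, and Flink coexist on one cluster.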
10. Auto Scaling Internals: HPA and VPA¶
Horizontal Pod Autoscaler (HPA) Control Loop¶
flowchart TD
METRICS["metrics-server\nCPU/memory from kubelet\nCustom metrics from Prometheus adapter"] --> HPA["HPA Controller\n(reconcile every 15s)"]
HPA -->|desiredReplicas = ceil(current × metric/target)| CALC["Scale Decision\nmin/max clamp applied"]
CALC -->|scale up immediately| DEPLOY["Deployment / ReplicaSet"]
CALC -->|scale down: wait 5min cooldown| DEPLOY
DEPLOY --> PODS["Pod count changes"]
PODS --> METRICS
subgraph FORMULA["HPA Scaling Formula"]
F1["desiredReplicas =\nceil(currentReplicas × (currentMetricValue / desiredMetricValue))"]
F2["e.g.: 3 pods × (80% CPU / 50% target) = ceil(4.8) = 5 pods"]
end
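The formula in the diagram, with the min/max clamp applied (a direct transcription, not the HPA controller's full stabilization logic):

```python
import math

def desired_replicas(current: int, metric: float, target: float,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """HPA core formula: ceil(current * metric/target), clamped to [min, max]."""
    desired = math.ceil(current * metric / target)
    return max(min_replicas, min(max_replicas, desired))

# The example from the diagram: 3 pods at 80% CPU against a 50% target.
print(desired_replicas(3, metric=80, target=50))  # 5
```

The ceil() matters: scaling always rounds up, so the metric per pod lands at or below the target after the scale event rather than hovering just above it.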
RBAC Authorization: Token Flow¶
sequenceDiagram
participant POD as Pod
participant MOUNT as /var/run/secrets/kubernetes.io/serviceaccount/token
participant API as kube-apiserver
participant AUTHZ as RBAC Authorizer
Note over MOUNT: TokenRequest API issues\nbound service account token\n(projected volume, TTL=1hr)
POD->>API: GET /api/v1/namespaces/default/pods\nAuthorization: Bearer <token>
API->>API: TokenReview: verify JWT signature\n(bound to pod UID + node)
API->>AUTHZ: SubjectAccessReview:\nuser=system:serviceaccount:default:my-sa\nverb=get, resource=pods
AUTHZ->>AUTHZ: Walk RoleBinding → Role → PolicyRule
AUTHZ-->>API: allowed=true
API-->>POD: 200 OK (pod list)
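The RoleBinding → Role → PolicyRule walk is a nested lookup; a toy sketch with flattened objects (real RBAC also handles ClusterRoles, API groups, and resourceNames):

```python
def rbac_allowed(subject, verb, resource, bindings, roles):
    """Walk RoleBinding -> Role -> PolicyRule, as in the SubjectAccessReview above."""
    for binding in bindings:
        if subject not in binding["subjects"]:
            continue
        for rule in roles[binding["roleRef"]]:
            if (verb in rule["verbs"] or "*" in rule["verbs"]) and \
               (resource in rule["resources"] or "*" in rule["resources"]):
                return True
    return False  # RBAC is deny-by-default: no matching rule means 403

roles = {"pod-reader": [{"verbs": ["get", "list"], "resources": ["pods"]}]}
bindings = [{"subjects": ["system:serviceaccount:default:my-sa"],
             "roleRef": "pod-reader"}]
print(rbac_allowed("system:serviceaccount:default:my-sa", "get", "pods",
                   bindings, roles))  # True
```

There are no deny rules in RBAC: authorization is purely additive, and anything not explicitly granted by some rule is rejected.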
11. Ingress: External Traffic Routing¶
flowchart LR
INTERNET["External Client"] -->|443/TLS| LB["Cloud LoadBalancer\n(NodePort 30443)"]
LB --> NGINX_POD["nginx-ingress-controller pod\n(DaemonSet or Deployment)"]
NGINX_POD -->|Watch Ingress objects| ETCD[("etcd")]
NGINX_POD -->|Reload nginx.conf| NGINX["nginx process\nupstream blocks\nSSL termination"]
NGINX -->|/api → svc-api:8080| SVC_A["Service: svc-api"]
NGINX -->|/web → svc-web:80| SVC_B["Service: svc-web"]
SVC_A --> PODS_A["API Pods"]
SVC_B --> PODS_B["Web Pods"]
subgraph RELOAD["nginx.conf upstream generation"]
U1["upstream svc-api {\n server 172.16.0.5:8080;\n server 172.16.0.6:8080;\n}"]
end
The ingress controller watches Ingress objects via informer; each add/update triggers nginx config regeneration and a graceful reload (nginx -s reload — no connection drops via master/worker hot-reload).
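Regenerating the upstream block is simple templating; a sketch of the kind of config the controller emits (format simplified relative to real ingress-nginx output):

```python
def render_upstream(service: str, endpoints: list) -> str:
    """Rebuild an nginx upstream block from the current endpoint list."""
    servers = "\n".join(f"    server {ep};" for ep in endpoints)
    return f"upstream {service} {{\n{servers}\n}}"

# Endpoints come from the informer's cache of the Service's EndpointSlices.
print(render_upstream("svc-api", ["172.16.0.5:8080", "172.16.0.6:8080"]))
```

Every endpoint add/remove changes this text, which is why the controller batches updates and debounces reloads rather than reloading nginx on every single pod event.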
12. Node Failure and Pod Rescheduling¶
sequenceDiagram
participant NODE as Worker Node
participant KL as kubelet (on node)
participant API as kube-apiserver
participant NCM as Node Controller
KL->>API: NodeHeartbeat (every 10s)
Note over NODE: Node crashes / network partition
NCM->>NCM: No heartbeat for 40s (node-monitor-grace-period)
NCM->>API: Patch node.status.conditions:\nReady=Unknown
NCM->>NCM: Wait 5min (pod-eviction-timeout)
NCM->>API: Delete pods on Unknown node
API->>ETCD: Delete pod objects
SCHED["kube-scheduler"]-->API: Watch: pods Pending (no node)
SCHED->>SCHED: Filter/Score healthy nodes
SCHED->>API: Bind pod → new node
API->>ETCD: Update pod.spec.nodeName
KUBELET2["kubelet (new node)"]-->API: Watch: pod bound to me
KUBELET2->>CRI2["containerd (new node)"]: Start containers
The 5-minute default eviction timeout means a node failure takes ~6 minutes to detect + reschedule. Tuning node-monitor-period, node-monitor-grace-period, and pod-eviction-timeout trades false-positive evictions against recovery speed.
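A back-of-the-envelope helper for that recovery timeline (the final rescheduling/startup term is a rough assumption, not a Kubernetes default):

```python
def recovery_estimate(heartbeat_interval: int = 10,
                      node_monitor_grace: int = 40,
                      pod_eviction_timeout: int = 300,
                      reschedule_and_start: int = 30) -> int:
    """Rough worst-case seconds from node death to pods running elsewhere."""
    return (heartbeat_interval      # last heartbeat may have just been sent
            + node_monitor_grace    # Node Controller waits before marking Unknown
            + pod_eviction_timeout  # grace before evicting the node's pods
            + reschedule_and_start) # schedule + image pull + container start

print(recovery_estimate())  # 380 seconds, roughly 6.3 minutes
```

Shrinking pod-eviction-timeout dominates any tuning effort, but the cost is evicting pods from nodes that were merely partitioned for a few minutes.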
Summary: Data Flow Map¶
flowchart TD
USER["User / CI System"] -->|kubectl / GitOps| API["kube-apiserver\n(validation + auth)"]
API <-->|Read/Write proto| ETCD[("etcd cluster\nRaft consensus")]
ETCD -->|Watch events| CTRL["Controller Manager\n(Deployment/RS/StatefulSet/Job)"]
ETCD -->|Watch unscheduled pods| SCHED["Scheduler\n(filter + score)"]
SCHED -->|Binding| ETCD
ETCD -->|Watch node's pods| KUBELET["kubelet (per node)"]
KUBELET -->|gRPC CRI| CONTAINERD["containerd"]
CONTAINERD -->|OCI spec| RUNC["runc → Linux namespaces\ncgroups, seccomp"]
KUBELET -->|exec| CNI["CNI plugin\n(network namespace)"]
KUBELET -->|Probe HTTP/TCP| APP["Application containers"]
KUBELET -->|Status patch| API
API -->|Watch services| KPROXY["kube-proxy\n(iptables/IPVS)"]
KPROXY -->|Route packets| APP
Every API call is authenticated, authorized, admitted, persisted to etcd, and then propagated through watch streams to the relevant controllers and node agents — no component holds authoritative state except etcd.