Linux Kernel Development — Under the Hood¶
Based on Linux Kernel Development, 3rd Edition by Robert Love
This document maps the internal machinery of the Linux kernel: how processes flow through memory, how the scheduler dispatches CPU time, how system calls cross the privilege boundary, how synchronization primitives prevent races, and how the memory subsystem carves up physical RAM. The goal is to visualize every data path exactly — through struct fields, kernel stacks, CPU registers, and memory zones.
1. Process Model: task_struct and the Task List¶
Every runnable entity in the kernel — process or thread — is represented by struct task_struct, defined in <linux/sched.h>. At ~1.7 KB on 32-bit, it contains the complete runtime state of a process.
block-beta
columns 3
A["task_struct\n(~1.7KB, slab-allocated)"] B["thread_info\n(bottom of kernel stack)"] C["kernel stack\n(8KB default)"]
A --> B
B --> C
D["pid_t pid"] E["long state"] F["struct mm_struct *mm"]
G["struct files_struct *files"] H["struct signal_struct *signal"] I["struct list_head tasks"]
A --> D
A --> E
A --> F
A --> G
A --> H
A --> I
task_struct Allocation: Slab Allocator + thread_info¶
Before kernel 2.6, task_struct lived at the end of the kernel stack. Now it's slab-allocated, and struct thread_info anchors to the stack bottom:
flowchart TB
subgraph "Kernel Stack (8KB)"
direction TB
TI["struct thread_info\n(at stack bottom/top depending on arch)"]
SP["↑ stack grows here\n(local vars, saved regs)"]
KF["kernel frame N\nkernel frame N-1\n..."]
end
TI -->|"thread_info.task →"| TS["struct task_struct\n(slab heap, anywhere in mem)"]
SP --> KF
On x86 (register-poor), current is computed by masking the stack pointer:
movl $-8192, %eax ; andl %esp, %eax
On PowerPC (register-rich), current is simply register r2.
Process Lifecycle State Machine¶
stateDiagram-v2
[*] --> TASK_RUNNING_QUEUED : fork() / clone() creates child
TASK_RUNNING_QUEUED --> TASK_RUNNING_CPU : schedule() → context_switch()
TASK_RUNNING_CPU --> TASK_RUNNING_QUEUED : preempted by higher-priority task
TASK_RUNNING_CPU --> TASK_INTERRUPTIBLE : wait for event (e.g., I/O)
TASK_RUNNING_CPU --> TASK_UNINTERRUPTIBLE : wait, must not be interrupted
TASK_INTERRUPTIBLE --> TASK_RUNNING_QUEUED : event occurs OR signal received
TASK_UNINTERRUPTIBLE --> TASK_RUNNING_QUEUED : event occurs (signals ignored)
TASK_RUNNING_CPU --> __TASK_TRACED : ptrace attach
TASK_RUNNING_CPU --> __TASK_STOPPED : SIGSTOP / SIGTSTP received
TASK_RUNNING_CPU --> EXIT_ZOMBIE : do_exit()
EXIT_ZOMBIE --> [*] : parent calls wait() / wait4()
The TASK_UNINTERRUPTIBLE state is why you see processes in state D in ps that cannot be SIGKILLed — they hold a semaphore waiting for a kernel event (typically I/O), and waking them prematurely would corrupt kernel state.
fork() / clone() Internal Data Flow¶
sequenceDiagram
participant U as User Space
participant SC as sys_clone() / sys_fork()
participant CF as copy_process()
participant SA as Slab Allocator
participant MM as mm_subsystem
U->>SC: fork() syscall → int 0x80 / syscall
SC->>CF: copy_process(clone_flags, ...)
CF->>SA: dup_task_struct() → alloc new task_struct + thread_info
SA-->>CF: child task_struct
CF->>MM: copy_mm() — copy or share address space
CF->>CF: copy_files(), copy_fs(), copy_sighand()
CF->>CF: assign new PID (alloc_pid())
CF-->>SC: child task_struct ready
SC->>SC: wake_up_new_task(child) → place on runqueue
SC-->>U: returns child PID to parent, 0 to child
fork() uses copy-on-write (COW) — the child shares parent's page tables marked read-only. Only on first write does the kernel copy the actual page.
2. Process Scheduling: CFS and the Runqueue¶
The Completely Fair Scheduler (CFS) replaced the O(1) scheduler in 2.6.23. It uses a red-black tree (rbtree) keyed by virtual runtime (vruntime) — how long the task has run, weighted by its priority.
CFS Data Structures¶
flowchart LR
subgraph "Per-CPU runqueue (struct rq)"
RBT["rbtree root\n(ordered by vruntime)"]
CURR["curr → task_struct"]
CFS_RQ["struct cfs_rq\n.min_vruntime\n.nr_running"]
end
RBT -->|"leftmost = next task"| T1["task A\nvruntime=100ns"]
RBT --> T2["task B\nvruntime=150ns"]
RBT --> T3["task C\nvruntime=200ns"]
T1 -->|"sched_entity"| TS1["task_struct A"]
T2 -->|"sched_entity"| TS2["task_struct B"]
vruntime Calculation¶
Every task gets a time slice proportional to its weight, and vruntime accumulates at a weight-scaled rate: delta_vruntime = delta_exec × (NICE_0_LOAD / task_weight), where NICE_0_LOAD (1024) is the weight of a nice-0 task.
Higher-priority tasks have higher task_weight, so their vruntime grows slower — they appear at the left of the rbtree and get selected more often.
flowchart TD
A["schedule() called\n(voluntary or preemption)"] --> B["Pick leftmost node\nfrom CFS rbtree"]
B --> C["context_switch(prev, next)"]
C --> D["switch_mm() — load CR3\nwith new page table base"]
C --> E["switch_to() — save/restore\nregs, stack pointer, FPU state"]
E --> F["next task runs"]
F --> G["update_curr()\naccumulates vruntime"]
G --> H{"vruntime exceeds\nideal_runtime?"}
H -->|yes| I["set TIF_NEED_RESCHED flag"]
H -->|no| F
I --> A
Context Switch: What Actually Happens¶
sequenceDiagram
participant K as Kernel (schedule())
participant MMU as MMU / TLB
participant CPU as CPU Registers
participant FPU as FPU State
K->>K: preempt_disable()
K->>MMU: switch_mm(old_mm, new_mm)\nloads CR3 → flushes TLB (unless ASID)
K->>CPU: switch_to(prev, next)\nsaves: ESP, EBP, EBX, ESI, EDI\nrestores: next task's saved regs
K->>CPU: jump to next task's saved EIP
note over CPU: CPU now executing new task
K->>FPU: lazy FPU save — only on first FP instruction\n(via #NM exception)
K->>K: preempt_enable()
3. System Calls: The Privilege Boundary Crossing¶
System calls are the only legitimate way user space crosses into kernel mode. On x86-64, entry uses the syscall instruction (replacing the slower int 0x80).
flowchart TD
U["User space: read(fd, buf, count)\n→ glibc wrapper\n→ mov eax, __NR_read\n→ syscall instruction"]
U --> MSR["CPU: saves RIP, RFLAGS to MSR\nloads kernel RIP from LSTAR MSR\nsets CPL=0 (ring 0)"]
MSR --> SE["system_call entry point\n(arch/x86/entry/entry_64.S)\nSAVE_ALL registers on kernel stack"]
SE --> DT["sys_call_table[rax]\nlookup — 64-bit indexed array\nof function pointers"]
DT --> SC["sys_read(fd, buf, count)\nkernel implementation"]
SC --> VR["access_ok() (formerly verify_area())\ncheck user pointer validity\ncopy_from_user() / copy_to_user()"]
VR --> RET["syscall_return\nRESTORE_ALL\nsysret instruction\nreturns to user CPL=3"]
Kernel Stack During a System Call¶
block-beta
columns 1
A["pt_regs (saved user registers)\nRIP, CS, RFLAGS, RSP, SS"]
B["sys_read() stack frame\nlocal vars: file *, ssize_t, etc."]
C["vfs_read() frame"]
D["file->f_op->read() frame\n(e.g., ext4_file_read_iter)"]
E["→ stack grows downward toward thread_info"]
Each task_struct has exactly one kernel stack (8KB, or 4KB on CONFIG_4KSTACKS). With interrupt stacks enabled, IRQ handlers get their own per-CPU stack, preventing overflow.
4. Interrupts and Bottom Halves¶
Hardware interrupts are handled in two phases: the top half (urgent work that runs immediately in IRQ context and cannot sleep) and the bottom half (deferred work that runs later in a less restrictive context).
flowchart TD
HW["Hardware IRQ fires\n(e.g., NIC receives packet)"]
HW --> IDT["CPU: saves state\nlooks up IDT → irq_handler entry\ndisables same IRQ line"]
IDT --> TH["Top Half: irq_handler_t\n(registered via request_irq())\nACK hardware, minimal work\nmark bottom half pending"]
TH --> BH{"Bottom Half\nMechanism?"}
BH --> SQ["Softirq\n(static, fixed priorities)\nruns in ksoftirqd or do_softirq()\nNET_RX_SOFTIRQ, TASKLET_SOFTIRQ"]
BH --> TK["Tasklet\n(built on softirq)\ndynamic, serialized per-CPU\ntasklet_schedule()"]
BH --> WQ["Work Queue\n(runs in kernel thread context)\ncan sleep, uses kworker threads\nschedule_work()"]
Interrupt Context vs. Process Context¶
flowchart LR
subgraph "Interrupt Context (IRQ)"
I1["Cannot sleep"]
I2["Cannot call schedule()"]
I3["Cannot acquire mutex"]
I4["Uses per-CPU interrupt stack"]
I5["preempt_count > 0"]
end
subgraph "Process Context (syscall, kthread)"
P1["Can sleep"]
P2["Can call schedule()"]
P3["Can acquire mutex"]
P4["Uses task's kernel stack"]
P5["current is valid"]
end
5. Kernel Synchronization¶
The kernel faces concurrent access from: multiple CPUs (SMP), kernel preemption, and interrupts. Three layers of primitives protect shared state.
Atomic Operations¶
atomic_t wraps a volatile int. On x86, operations like atomic_inc compile to lock addl $1, (%eax) — the LOCK prefix makes the memory bus operation indivisible.
sequenceDiagram
participant CPU1
participant CPU2
participant MEM as Memory Bus
CPU1->>MEM: LOCK prefix → assert LOCK# signal
note over MEM: bus locked — CPU2 stalls
CPU1->>MEM: read-modify-write (atomic_inc)
CPU1->>MEM: release LOCK# signal
MEM-->>CPU2: bus released → CPU2 can proceed
Atomicity vs. Ordering distinction: Atomic ops guarantee indivisibility but NOT ordering across CPUs. For ordering, memory barriers (smp_mb(), smp_rmb(), smp_wmb()) are needed — they compile to mfence/lfence/sfence on x86.
Spinlocks¶
Used when the critical section is short and the lock holder will release quickly. On SMP, a non-holder busy-waits in a tight loop. On UP (uniprocessor), the lock itself compiles away — only the preemption disable (plus the IRQ disable for the _irq variants) remains.
flowchart TD
A["spin_lock(lock)"]
A --> B{"lock->slock == 0?"}
B -->|no| C["CPU spins: busy-wait\n(disable interrupts on local CPU\nif spin_lock_irqsave)"]
C --> B
B -->|yes| D["atomic compare-and-swap\nset lock->slock = 1"]
D --> E["critical section\nno sleep allowed"]
E --> F["spin_unlock(lock)\nset lock->slock = 0\nmemory barrier"]
Reader-Writer Spinlocks and Semaphores¶
flowchart LR
subgraph "rwlock_t"
R["Multiple concurrent readers\nrwlock_t.lock == readers_count"]
W["Single exclusive writer\nrwlock_t.lock == 0x80000000"]
R -->|"contention"| W
end
subgraph "struct semaphore"
SEM["count (atomic)\n= 0 → blocked\n> 0 → available"]
WQ_S["wait_queue_head_t\n(sleeping waiters)"]
SEM --> WQ_S
end
subgraph "struct mutex"
MX["atomic_t count\n= 1 → unlocked\n= 0 → locked\n< 0 → locked + waiters"]
MX_WQ["wait_list\n(MCS-style fair queue)"]
MX --> MX_WQ
end
Key difference: spinlocks = busy-wait (never sleep, IRQ/atomic-safe); semaphores/mutexes = sleep (process context only, can be held across schedule()).
RCU (Read-Copy-Update)¶
RCU is used for data that is read far more than written (e.g., routing tables, process credential structures):
sequenceDiagram
participant R1 as Reader CPU1
participant R2 as Reader CPU2
participant W as Writer
R1->>R1: rcu_read_lock() — preempt_disable
R1->>R1: p = rcu_dereference(gp) — reads pointer
W->>W: new_p = kmalloc(...)
W->>W: populate new_p
W->>W: rcu_assign_pointer(gp, new_p) — memory barrier + store
R2->>R2: rcu_read_lock()
R2->>R2: sees new_p (or old_p — both valid)
R1->>R1: rcu_read_unlock()
W->>W: synchronize_rcu() — wait for grace period\n(all CPUs pass quiescent state)
W->>W: kfree(old_p) — safe, no readers hold it
6. Memory Management: Zones, Pages, and the Allocator Hierarchy¶
Linux divides physical RAM into zones based on DMA constraints and high-memory architecture limits.
Memory Zone Layout (x86 32-bit example)¶
block-beta
columns 1
A["ZONE_DMA: 0–16MB\n(legacy ISA DMA devices)"]
B["ZONE_NORMAL: 16MB–896MB\n(direct-mapped to kernel virtual space\n0xC0000000 + phys_addr)"]
C["ZONE_HIGHMEM: 896MB–end of RAM\n(not permanently mapped in kernel VA space\nrequires kmap() / kmap_atomic())"]
Allocator Hierarchy¶
flowchart TD
APP["kmalloc(size, GFP_KERNEL)"]
APP --> SL["Slab / SLUB Allocator\n(object caches, cache coloring,\nper-CPU magazines)"]
SL --> PA["Page Allocator (Buddy System)\nalloc_pages(gfp_mask, order)\nallocates 2^order contiguous pages"]
PA --> ZA["Zone Allocator\nselects ZONE_NORMAL / ZONE_DMA\nbased on gfp_mask flags"]
ZA --> PF["Page Frame (struct page)\n~40 bytes per page (≈1% overhead with 4KB pages)"]
Buddy Allocator Internal State¶
The buddy system tracks free pages in 11 free lists, for orders 0–10 (1, 2, 4, 8, ... 1024 pages):
flowchart LR
subgraph "free_area[order]"
O0["order 0: list of 4KB free pages"]
O1["order 1: list of 8KB free pairs"]
O2["order 2: 16KB quads"]
O9["order 9: 2MB blocks"]
O10["order 10: 4MB blocks"]
end
REQ["alloc_pages(GFP_KERNEL, 0)\nwants 1 page (order 0)"]
REQ --> O0
O0 -->|"empty → split from order 1"| O1
O1 -->|"split: give 1 page, buddy goes to order 0 free list"| O0
On free, if the freed page's buddy (adjacent same-size block) is also free, they are coalesced into a higher-order block — preventing fragmentation.
Slab Allocator: Per-Type Object Caches¶
flowchart TD
subgraph "kmem_cache for task_struct"
CP["Cache: 'task_struct'\nobj_size = 1.7KB\nalignment = 64B (cache-colored)"]
SL1["Slab 1: 4 pages\n2 objects in-use, 2 free"]
SL2["Slab 2: 4 pages\n4 objects in-use, 0 free"]
SL3["Slab 3: 4 pages\n0 objects in-use (free to release)"]
CP --> SL1
CP --> SL2
CP --> SL3
end
subgraph "Per-CPU array_cache"
AC["array_cache[cpu]\nbatch of recently freed objects\n(hot cache line objects)"]
end
SL1 --> AC
AC -->|"alloc: pop from array"| NEW["new task_struct\n(same cache line as last free)"]
Cache coloring: slab start offsets are varied so that objects from different slabs land on different L1 cache sets — preventing cache line thrashing.
High Memory Mapping¶
sequenceDiagram
participant K as Kernel code
participant PA as Page Allocator
participant PKMAP as Permanent Kernel Map\n(pkmap area 0xFFC00000)
participant TMP as kmap_atomic fixmap
K->>PA: alloc_pages(__GFP_HIGHMEM, 0)
PA-->>K: struct page * (no virtual address)
K->>PKMAP: kmap(page)\n→ find free pkmap slot\n→ set PTE in pkmap page table
PKMAP-->>K: virtual address in 0xFFC00000–0xFFFFFFFF
note over K: use the memory
K->>PKMAP: kunmap(page) — clear PTE
note over K: For interrupt context:
K->>TMP: kmap_atomic(page, KM_USER0)\ndisables preemption\nuses fixmap reserved slot
TMP-->>K: virtual address
K->>TMP: kunmap_atomic(addr, KM_USER0)
7. Virtual Filesystem (VFS): Abstraction Layer¶
VFS provides a uniform interface between system calls (open, read, write) and concrete filesystems (ext4, btrfs, tmpfs, NFS).
VFS Object Hierarchy¶
flowchart TD
SC["open(path, flags)\nread(fd, buf, n)"]
SC --> FD["fd → struct file\n(per-open-file instance)\nf_pos, f_flags, f_op"]
FD --> IN["struct inode\n(per filesystem object)\ni_ino, i_size, i_mode\ni_op, i_fop, i_mapping"]
IN --> SB["struct super_block\n(per mounted filesystem)\ns_op, s_type, s_blocksize"]
IN --> DE["struct dentry\n(path component cache)\nd_name, d_inode, d_parent\nd_op, d_subdirs"]
DE --> DCA["dentry cache (dcache)\nhash table keyed by (parent, name)\nLRU eviction on memory pressure"]
read() System Call — Full Data Path¶
sequenceDiagram
participant U as User Buffer (ring 3)
participant SC as sys_read()
participant VFS as vfs_read()
participant FOP as file->f_op->read_iter()
participant PC as Page Cache (radix tree)
participant BL as Block Layer
participant DISK as Block Device
U->>SC: syscall: read(fd, ubuf, count)
SC->>SC: fget(fd) → struct file *
SC->>VFS: vfs_read(file, ubuf, count, &pos)
VFS->>FOP: call_read_iter() → file->f_op->read_iter()
FOP->>PC: generic_file_read_iter()\nlook up page in address_space radix tree
PC-->>FOP: page found (cache hit) → copy_to_user()
note over PC: on cache miss:
FOP->>BL: submit_bio() — build struct bio\n(block I/O descriptor: sectors, pages, direction)
BL->>BL: I/O scheduler (CFQ/deadline/noop)\nmerge/sort requests in dispatch queue
BL->>DISK: issue to driver → DMA transfer
DISK-->>BL: IRQ on completion
BL-->>PC: bio_endio() → mark page uptodate\nwake up waiting process
PC-->>FOP: page now in cache → copy_to_user()
FOP-->>U: data in user buffer
Page Cache: Radix Tree Structure¶
flowchart TD
subgraph "address_space (per inode)"
RT["radix_tree_root\n(keyed by page index = file_offset / PAGE_SIZE)"]
P0["page index 0\n(bytes 0–4095)"]
P1["page index 1\n(bytes 4096–8191)"]
PN["page index N"]
RT --> P0
RT --> P1
RT --> PN
end
P0 --> FLAGS["page flags:\nPG_uptodate (data valid)\nPG_dirty (needs writeback)\nPG_locked (I/O in progress)\nPG_lru (on LRU list)"]
8. Block I/O Layer¶
The block layer abstracts storage devices and provides scheduling. The key structure is struct bio — a scatter-gather I/O descriptor.
flowchart LR
subgraph "struct bio"
BD["bi_bdev → block_device"]
SE["bi_sector: starting LBA"]
VC["bi_vcnt: number of segments"]
IO["bi_io_vec[]: array of\n{page *, offset, len}"]
END["bi_end_io: completion callback"]
end
BD --> DQ["struct request_queue\n(per device)\ndispatch_queue, elevator_queue"]
DQ --> EL["I/O Scheduler (elevator)\nCFQ: per-process fair queues\nDeadline: prevents starvation\nNoop: FIFO for SSDs"]
EL --> DRV["Device Driver\n(e.g., ahci, nvme)\nDMA descriptor ring"]
DRV --> HW["Hardware\n(disk, SSD, RAID)"]
9. Kernel Data Structures¶
Intrusive Linked List (list_head)¶
Unlike typical userspace linked lists, where the list node contains (or points to) the data, the kernel embeds a list_head node inside the data structure itself:
flowchart LR
subgraph "fox struct 1"
F1D["fox.data"]
F1L["fox.list\n(list_head)\n.next → "]
end
subgraph "fox struct 2"
F2L[".prev ←\nfox.list\n(list_head)\n.next → "]
F2D["fox.data"]
end
subgraph "fox struct 3"
F3L[".prev ←\nfox.list\n(list_head)"]
F3D["fox.data"]
end
F1L --> F2L
F2L --> F3L
F3L -->|"circular"| F1L
list_entry(ptr, type, member) recovers the containing struct via container_of(): subtracts the compile-time offset of member from ptr.
kfifo: Lock-Free SPSC Queue¶
flowchart LR
subgraph "struct kfifo (power-of-2 size)"
BUF["buffer[size]"]
IN["in offset (enqueue position)"]
OUT["out offset (dequeue position)"]
SZ["size (mask = size-1)"]
end
IN -->|"modulo size via bitmask"| WRAP["wraps around automatically\nno conditional needed"]
OUT --> WRAP
NOTE["invariant: out ≤ in\nfree = size - (in - out)\nused = in - out"]
10. Per-CPU Data and Cache Architecture¶
Per-CPU data eliminates locks for CPU-local counters and caches by ensuring each processor works on its own copy.
flowchart TD
subgraph "System with 4 CPUs"
CPU0["CPU 0\nmy_var @ .data.percpu + __per_cpu_offset[0]"]
CPU1["CPU 1\nmy_var @ .data.percpu + __per_cpu_offset[1]"]
CPU2["CPU 2\nmy_var @ .data.percpu + __per_cpu_offset[2]"]
CPU3["CPU 3\nmy_var @ .data.percpu + __per_cpu_offset[3]"]
end
ACCESS["get_cpu_var(my_var)\n1. preempt_disable()\n2. smp_processor_id() → cpu_id\n3. &__get_cpu_var + per_cpu_offset[cpu_id]"]
ACCESS --> CPU0
ACCESS --> CPU1
ACCESS --> CPU2
ACCESS --> CPU3
CACHE["L1 cache lines:\nEach CPU hits only its own\ndata → no false sharing\n(percpu interface ensures\ncache-line alignment)"]
Why it matters: Without per-CPU data, a global counter requires an atomic op or spinlock — on a 64-CPU system, the cache line bounces between CPUs (cache thrashing). Per-CPU: each CPU's L1 cache holds its own copy permanently → near-zero overhead.
11. Timers and Time Management¶
flowchart TD
HW_TIMER["Hardware: PIT / HPET / APIC Timer\nfires at HZ frequency (100–1000 Hz)"]
HW_TIMER --> TI["timer_interrupt()\n(hardirq context)\nincrement jiffies\nincrement xtime (wall clock)\ncheck expired kernel timers"]
TI --> KT["Kernel Timer subsystem\n(struct timer_list)\nhashed into timer wheels:\ntv1 (soon) → tv2 → tv3 → tv4 → tv5 (far)"]
KT --> EXP["Expired timers: run handler\nin softirq context (TIMER_SOFTIRQ)"]
TI --> HR["hrtimer subsystem\n(nanosecond precision)\nred-black tree ordered by expiry"]
The jiffy is the basic unit of kernel time: the jiffies counter advances HZ times per second. High-resolution timers (hrtimer) bypass jiffies entirely, reading hardware directly via the clocksource abstraction.
12. Summary: Data Flow Map¶
flowchart TD
USR["User Process\n(ring 3, virtual address space)"]
USR -->|"syscall instruction"| KRN["Kernel\n(ring 0, kernel virtual space)"]
KRN --> SCH["Scheduler\n(CFS rbtree, vruntime)"]
KRN --> MEM["Memory Manager\n(buddy + slab + page cache)"]
KRN --> VFS["VFS Layer\n(dentry, inode, file)"]
KRN --> IRQ["Interrupt Subsystem\n(top half → bottom half)"]
KRN --> SYNC["Synchronization\n(spinlock, mutex, RCU, atomic)"]
SCH -->|"context_switch()"| CPU["CPU Registers + TLB"]
MEM -->|"alloc_pages()"| PHYS["Physical RAM\n(zones: DMA, NORMAL, HIGHMEM)"]
VFS -->|"bio submission"| BLK["Block Layer\n(I/O scheduler → DMA)"]
SYNC -->|"memory barriers"| CPU
IRQ -->|"softirq / workqueue"| KRN
Every read() call traces a path: user registers → syscall → sys_read frame on kernel stack → VFS → page cache radix tree lookup → (on miss) bio construction → I/O scheduler → DMA → IRQ → bio_endio → copy_to_user → sysret → user registers. The kernel stack, current, page frame flags, and zone allocator are the connective tissue throughout.