Operating Systems Internals: Kernel Data Structures and Scheduling¶
Under the Hood — How the kernel manages processes, schedules threads, handles memory, and services I/O requests. Internals of task_struct, the CFS scheduler, virtual memory areas, page fault handling, VFS inode/dentry caches, and system call dispatch. Sources: Understanding the Linux Kernel (Bovet & Cesati), Operating System Concepts (Silberschatz/Galvin — "Dinosaur Book"), The Linux Programming Interface (Kerrisk), Learning the UNIX Operating System.
1. Linux Kernel Process Model: task_struct¶
Every process and thread in Linux is represented by a task_struct — a large structure (several hundred fields) defined in include/linux/sched.h.
flowchart TD
subgraph TASK_STRUCT["task_struct (partial)"]
STATE["state: TASK_RUNNING | TASK_INTERRUPTIBLE\n| TASK_UNINTERRUPTIBLE | TASK_STOPPED\nexit_state: EXIT_ZOMBIE | EXIT_DEAD"]
PID["pid: pid_t (process ID)\ntgid: pid_t (thread group ID = main thread's pid)"]
PARENT["real_parent: *task_struct\nparent: *task_struct\nchildren: list_head\nsibling: list_head"]
MM["mm: *mm_struct\n(process memory descriptor\nor NULL for kernel threads)"]
FILES["files: *files_struct\n(file descriptor table)"]
SIGNALS["sighand: *sighand_struct\nsignal: *signal_struct\npending: sigpending"]
SCHED_INFO["sched_entity se\npolicy: SCHED_NORMAL | SCHED_FIFO | SCHED_RR\nprio: int (0..139; RT priorities occupy 0..99,\nnice -20..+19 maps to 100..139)"]
CREDS["cred: *cred\n(uid, gid, capabilities, LSM labels)"]
end
Process State Machine¶
stateDiagram-v2
[*] --> CREATED: fork() / clone()
CREATED --> RUNNING: scheduler selects
RUNNING --> READY: preempted (timer IRQ\nor higher-prio task)
READY --> RUNNING: scheduler selects
RUNNING --> SLEEPING_I: wait for I/O\nor event (interruptible)
RUNNING --> SLEEPING_U: kernel wait\n(uninterruptible)
SLEEPING_I --> READY: event arrives\nor signal received
SLEEPING_U --> READY: I/O complete
RUNNING --> ZOMBIE: exit() called\nparent not wait()ed yet
RUNNING --> STOPPED: SIGSTOP / SIGTSTP
STOPPED --> READY: SIGCONT
ZOMBIE --> [*]: parent calls wait()
2. CFS Scheduler: Completely Fair Scheduler Internals¶
flowchart TD
subgraph CFS_INTERNALS["CFS Scheduler (kernel/sched/fair.c)"]
RB_TREE["Per-CPU rq.cfs_rq.tasks_timeline\nRed-black tree ordered by vruntime\nO(log N) enqueue/dequeue"]
VRUNTIME["vruntime = virtual runtime\n= actual_runtime × (NICE_0_LOAD / weight)\nLower-prio tasks accumulate vruntime faster\nAlways schedule leftmost node (lowest vruntime)"]
SCHED_LAT["sched_latency_ns = 6ms (default)\n= scheduling period for all runnable tasks\nTime slice per task = sched_latency / nr_running"]
MIN_GRAN["sched_min_granularity_ns = 0.75ms\nFloor on time slice\n(prevents starvation with many tasks)"]
end
subgraph WAKEUP["Wakeup Preemption"]
WAKE["Task wakes up after sleeping\nvruntime catch-up (GENTLE_FAIR_SLEEPERS):\n vruntime = max(vruntime, min_vruntime - sched_latency/2)\nPreempt current if new task's vruntime\n< current.vruntime - sched_wakeup_granularity"]
end
RB_TREE --> VRUNTIME --> SCHED_LAT --> MIN_GRAN
CFS Timer Interrupt Flow¶
sequenceDiagram
participant HW as Hardware Timer (HPET/LAPIC)
participant IRQ as IRQ Handler
participant CFS as CFS Scheduler
participant SWITCH as Context Switch
HW->>IRQ: Timer interrupt (periodic tick at CONFIG_HZ,\nor on demand in tickless/NO_HZ mode)
IRQ->>CFS: update_curr(cfs_rq)
CFS->>CFS: delta_exec = now - curr.exec_start
CFS->>CFS: curr.vruntime += delta_exec × (NICE_0_LOAD / curr.weight)
CFS->>CFS: check_preempt_tick: delta_exec > ideal_runtime?
alt Preempt condition met
CFS->>IRQ: set_tsk_need_resched(curr)
IRQ->>SWITCH: schedule() called
SWITCH->>SWITCH: pick_next_task: leftmost rbtree node
SWITCH->>SWITCH: context_switch:\n switch_mm (CR3 reload if different process)\n switch_to (__switch_to: save/restore regs)
end
3. Virtual Memory: mm_struct and VMA Tree¶
flowchart TD
subgraph MM_STRUCT["mm_struct (process memory descriptor)"]
MMAP["mmap: *vm_area_struct\nLinked list + rbtree of VMAs"]
MM_MMAP_TREE["mm_rb: rb_root\nRed-black tree for fast VMA lookup\n(replaced by a maple tree in kernels ≥ 6.1)"]
PGD["pgd: pgd_t*\nPointer to page global directory\n(loaded into CR3 on context switch)"]
MM_STATS["total_vm: unsigned long\nrss: resident set size\nmmap_lock (formerly mmap_sem): rw_semaphore"]
end
subgraph VMA["vm_area_struct (one per mapping)"]
VM_START["vm_start: unsigned long\nvm_end: unsigned long"]
VM_FLAGS["vm_flags: VM_READ | VM_WRITE | VM_EXEC\n| VM_SHARED | VM_MAYWRITE"]
VM_FILE["vm_file: *file\n(NULL for anonymous mappings)"]
VM_PGOFF["vm_pgoff: unsigned long\noffset in file (in pages)"]
VM_OPS["vm_ops: *vm_operations_struct\n.fault, .map_pages, .close"]
end
MM_STRUCT --> VMA
Process Address Space Layout (x86-64 Linux)¶
flowchart TD
subgraph VAS["Virtual Address Space (128TB user space)"]
KERNEL["0xFFFF800000000000 +\nKernel virtual address space\n(not user-accessible)"]
STACK["0x7FFFFFFFFFFF\n↓ Stack grows down\nMain thread stack (8MB default)"]
MMAP_REGION["Memory-mapped region\nmmap() calls, .so libraries\nThread stacks"]
HEAP["Heap\nbrk() / mmap(MAP_ANON)\nmalloc allocates here"]
BSS["BSS segment\nUninit globals (zero pages)"]
DATA["Data segment\nInitialized globals"]
TEXT["0x400000\nText segment\nRead-only code"]
end
4. Page Fault Handler: Demand Paging Internals¶
sequenceDiagram
participant CPU
participant MMU
participant PF as do_page_fault()
participant VMA_FIND as find_vma()
participant FAULT as handle_mm_fault()
participant SWAP as Swap/File I/O
CPU->>MMU: Access virtual address va
MMU->>MMU: PTE.Present = 0
MMU->>CPU: #PF exception (error_code: write, user/kernel)
CPU->>PF: do_page_fault(regs, error_code)
PF->>VMA_FIND: find_vma(mm, va)
alt va not in any VMA
VMA_FIND-->>PF: NULL
PF->>CPU: SIGSEGV
end
alt Permission violation (write to read-only VMA)
PF->>CPU: SIGSEGV
end
PF->>FAULT: handle_mm_fault(vma, va, flags)
alt Anonymous page (stack/heap)
FAULT->>FAULT: Allocate physical frame (alloc_page)
FAULT->>FAULT: Zero-fill page (security)
FAULT->>FAULT: Update PTE: physical addr | Present | RW
end
alt File-backed page (mmap of .so or file)
FAULT->>SWAP: Read page from file/swap
SWAP-->>FAULT: Page data in page cache
FAULT->>FAULT: Update PTE
end
alt Copy-on-Write (write to a shared CoW page)
FAULT->>FAULT: If page refcount == 1: reuse it,\njust make the PTE writable (no copy)
FAULT->>FAULT: Else: allocate new frame,\ncopy the shared page's contents
FAULT->>FAULT: Update faulting PTE → new frame | RW
FAULT->>FAULT: Drop reference on the old shared page
end
FAULT-->>CPU: Return, retry faulting instruction
5. File System: VFS Inode/Dentry Cache¶
flowchart TD
subgraph VFS["Virtual File System Layer"]
SYSCALL["open(path, flags) system call"]
NAMEI["path_lookup / filename_lookup\nWalks dentry cache"]
DCACHE["Dentry Cache (dcache)\nHash table: {parent dentry, name} → dentry\n~O(1) lookup for cached paths"]
ICACHE["Inode Cache\nHash table: {superblock, inode number} → inode\nContains: size, mtime, uid, gid, block map"]
FILE_OBJ["struct file\nf_pos (seek position)\nf_op (read, write, mmap ops)\nf_inode pointer"]
FD_TABLE["process fd table\nfd[0]=stdin, fd[1]=stdout,\nfd[2]=stderr, fd[n]=file*"]
end
subgraph CONCRETE_FS["Concrete FS (ext4/xfs/btrfs)"]
EXT4_INODE["ext4 inode\nDirect blocks: 12 × 4KB = 48KB\nSingle indirect: +1024 blocks\nDouble indirect: +1024² blocks\nExtent tree (modern default)"]
end
SYSCALL --> NAMEI --> DCACHE --> ICACHE --> FILE_OBJ --> FD_TABLE
ICACHE --> CONCRETE_FS
read() System Call: Path from Syscall to Disk¶
sequenceDiagram
participant APP as Application
participant KERN as Kernel (VFS)
participant PAGECACHE as Page Cache
participant BLOCK as Block Layer
participant DISK as NVMe / HDD
APP->>KERN: read(fd, buf, count)
KERN->>KERN: copy fd → file* → inode*
KERN->>KERN: file->f_op->read_iter()
KERN->>PAGECACHE: find_get_page(mapping, page_index)
alt Page in cache
PAGECACHE-->>KERN: struct page* (data ready)
KERN->>APP: copy_to_user(buf, page_data, count)
end
alt Page not in cache (cache miss)
PAGECACHE->>KERN: miss
KERN->>PAGECACHE: add_to_page_cache_locked(new_page)
KERN->>BLOCK: submit_bio(BIO_READ, sector, page)
BLOCK->>BLOCK: I/O scheduler: merge + reorder requests
BLOCK->>DISK: NVMe queue submission
DISK-->>BLOCK: DMA complete, interrupt
BLOCK->>KERN: bio_end_io callback
KERN->>KERN: unlock page, wake waiters
KERN->>APP: copy_to_user()
end
6. System Call Dispatch: User Space → Kernel¶
flowchart TD
subgraph USER["User Space"]
LIBC_FUNC["libc: read(fd, buf, count)"]
SYSCALL_INSTR["SYSCALL instruction (x86-64)\nor INT 0x80 (legacy 32-bit)"]
LIBC_FUNC --> SYSCALL_INSTR
end
subgraph KERNEL["Kernel Mode Transition"]
SWAPGS["SWAPGS: swap GS with MSR_KERNEL_GS_BASE\n(load kernel per-CPU data)"]
SAVE_REGS["Save user regs to kernel stack\n(pt_regs structure)"]
LOOKUP["syscall_table[rax]\nrax = syscall number (e.g., 0=read, 1=write)"]
DISPATCH["Call sys_read / sys_write / etc."]
RESTORE["Restore user regs\nSWAPGS again\nSYSRET instruction"]
SWAPGS --> SAVE_REGS --> LOOKUP --> DISPATCH --> RESTORE
end
SYSCALL_INSTR --> SWAPGS
RESTORE --> USER
Syscall Entry via VDSO (Virtual Dynamic Shared Object)¶
flowchart LR
GETTIMEOFDAY["gettimeofday() in user code"]
VDSO_PAGE["vDSO page\n(mapped by kernel into every process)\nContains: clock_gettime, gettimeofday, getcpu\nReads timekeeping data from the read-only vvar page"]
CLOCKSOURCE["Kernel clocksource\n(TSC, HPET)\nUpdated by kernel, mapped read-only to user"]
NO_SYSCALL["No kernel-mode transition!\nDirect memory read from the vvar data page\n~20 cycles vs ~1000 for a real syscall"]
GETTIMEOFDAY --> VDSO_PAGE --> CLOCKSOURCE --> NO_SYSCALL
7. IPC Mechanisms: Pipes, Sockets, Shared Memory¶
flowchart TD
subgraph PIPE["Anonymous Pipe"]
WRITE_END["fd[1]: write end\nfile→inode→pipe_inode_info"]
RING_BUF["Kernel ring buffer\n65536 bytes (16 pages)\nCircular buffer of struct pipe_buffer"]
READ_END["fd[0]: read end\nBlocks in TASK_INTERRUPTIBLE if empty"]
WRITE_END --> RING_BUF --> READ_END
end
subgraph SOCKET["Unix Domain Socket"]
SRV_SOCK["Server socket\nunix_stream_ops\nbind() to a filesystem path"]
CLI_SOCK["Client socket\nconnect() to path"]
SK_BUFF["sk_buff chain\nqueued on receiver's sk_receive_queue\naccounted against socket buffer memory"]
SRV_SOCK <--> SK_BUFF <--> CLI_SOCK
end
subgraph SHM["POSIX Shared Memory"]
SHM_OPEN["shm_open(): creates tmpfs file\nin /dev/shm/"]
MMAP_BOTH["Both processes mmap() same file\nPTEs in both mm_structs\npoint to same physical pages"]
SEM["sem_post/wait\nfutex in shared memory\nfor synchronization"]
SHM_OPEN --> MMAP_BOTH --> SEM
end
8. Linux Kernel Synchronization Primitives¶
flowchart TD
subgraph SPIN["Spinlocks (spin_lock/spin_unlock)"]
SP_IMPL["Atomic compare-and-swap on the lock word\n(ticket spinlocks historically, queued/MCS today)\nBusy-waits (polling)\nNO sleep allowed\nUsed in interrupt context\nor critical sections < a few µs"]
SP_IRQ["spin_lock_irqsave:\nAlso disables interrupts\nPrevents IRQ handler deadlock"]
end
subgraph MUTEX["Mutex (mutex_lock/unlock)"]
MX_IMPL["Sleeping lock\nPuts contending task in TASK_UNINTERRUPTIBLE\nContext switch to another task\nWakeup when lock released\nOptimistic spinning before sleeping (adaptive)"]
end
subgraph RCU["RCU (Read-Copy-Update)"]
RCU_IMPL["Readers: rcu_read_lock/unlock\nZero-overhead (just preempt disable)\nWriters:\n 1. Copy old object\n 2. Modify copy\n 3. Atomic pointer swap (rcu_assign_pointer)\n 4. synchronize_rcu(): wait for all old readers\n 5. Free old object"]
RCU_USE["Used for: networking (struct sock),\nrouting tables, VFS dentry cache\nRead-heavy workloads: readers never block"]
end
subgraph SEMAPHORE["Semaphore (down/up)"]
SEM_IMPL["Counting semaphore\ndown(): decrement count, sleep if 0\nup(): increment count, wake waiter\nUsed for: access count limits\n(e.g., max N concurrent requests)"]
end
9. Memory Allocators: slab/slub Allocator Internals¶
flowchart TD
subgraph PAGE_ALLOC["Buddy System (page allocator)"]
BUDDY["buddy_system:\nFree pages grouped in 2^order blocks\nAlloc 2^n pages: find smallest suitable block\nSplit larger blocks\nMerge free buddies on free"]
ORDER["Order 0 = 4KB (1 page)\nOrder 1 = 8KB (2 pages)\n...\nOrder 10 = 4MB (1024 pages, default max)"]
end
subgraph SLAB["SLAB/SLUB Allocator (kmalloc)"]
SLUB_CACHE["kmem_cache for each object size\n(e.g., task_struct, dentry, sk_buff)"]
SLAB_PER_CPU["Per-CPU slab freelist\nLock-free fast path\nNo global lock for common case"]
PARTIAL["Partial pages list\nPages with some free slots\nFilled from global list when per-CPU empty"]
FULL["Full pages list\nAll slots allocated"]
SLUB_CACHE --> SLAB_PER_CPU --> PARTIAL --> FULL
end
subgraph BUDDY_TO_SLAB["Integration"]
CONNECT["slab allocates buddy pages\nSlices them into fixed-size objects\nkmalloc(size) → find appropriate kmem_cache\nPop from freelist (O(1))"]
end
PAGE_ALLOC --> SLAB
SLAB --> BUDDY_TO_SLAB
10. Process Creation: fork() and exec() Internals¶
sequenceDiagram
participant PARENT
participant KERNEL
participant CHILD
PARENT->>KERNEL: fork() / clone()
KERNEL->>KERNEL: Allocate task_struct (from slab)
KERNEL->>KERNEL: Copy mm_struct (shallow: CoW)
Note over KERNEL: All PTEs marked read-only\nfor CoW page sharing
KERNEL->>KERNEL: Copy file descriptor table (dup fds)
KERNEL->>KERNEL: Copy signal handlers
KERNEL->>KERNEL: Assign new PID
KERNEL-->>CHILD: Return 0 (child)
KERNEL-->>PARENT: Return child PID
CHILD->>KERNEL: execve("/bin/ls", argv, envp)
KERNEL->>KERNEL: Open ELF binary
KERNEL->>KERNEL: Load PT_LOAD segments → new VMAs
KERNEL->>KERNEL: Map ld.so (program interpreter)
KERNEL->>KERNEL: Setup new stack (argv, envp, auxv)
KERNEL->>KERNEL: Reset signal handlers
KERNEL->>KERNEL: Replace mm (munmap old address space)
KERNEL-->>CHILD: Jump to ld.so entry point
11. Linux Scheduling Classes (Priority Order)¶
flowchart TD
subgraph SCHED_CLASSES["Scheduling Classes (stop → dl → rt → cfs → idle)"]
STOP["stop_sched_class\nHighest priority\nMigration threads\nstop_machine() calls"]
DL["dl_sched_class (SCHED_DEADLINE)\nEDF: earliest deadline first\nBandwidth admission control\nReal-time tasks with deadlines"]
RT["rt_sched_class (SCHED_FIFO, SCHED_RR)\nFixed priority 1-99\nSCHED_FIFO runs until it yields/blocks;\nSCHED_RR round-robins within a priority\nPreempts all lower classes"]
CFS["fair_sched_class (SCHED_NORMAL, SCHED_BATCH)\nCFS with vruntime rbtree\nMost user processes"]
IDLE["idle_sched_class\nCPU idle loop\nhlt instruction\nC-states for power saving"]
STOP --> DL --> RT --> CFS --> IDLE
end
12. Kernel Module Loading: insmod Internals¶
sequenceDiagram
participant USER as insmod/modprobe
participant KERNEL
participant VERIFIER
USER->>KERNEL: init_module(module_image, len, params)
KERNEL->>KERNEL: Allocate kernel memory (vmalloc)
KERNEL->>KERNEL: Copy module ELF from user space
KERNEL->>VERIFIER: Verify ELF (magic, sections)
KERNEL->>KERNEL: Apply relocations (R_X86_64_PC32 etc.)
KERNEL->>KERNEL: Resolve symbols from kernel symbol table
Note over KERNEL: ksymtab: exported symbols (EXPORT_SYMBOL)
KERNEL->>KERNEL: Check module license (GPL required for some symbols)
KERNEL->>KERNEL: Run module->init() function
KERNEL->>KERNEL: Register in modules list
KERNEL-->>USER: 0 (success)