Systems Programming Internals: Memory Models, Syscalls & Concurrency Primitives¶
Under the Hood: How the CPU memory model orders stores/loads, how system calls cross the user/kernel boundary, how mutexes and futexes implement blocking in the kernel, how Rust's borrow checker enforces memory safety at compile time, how POSIX threads map to kernel scheduler entities.
1. CPU Memory Model: Store/Load Ordering¶
Modern CPUs reorder memory operations for performance. The programmer's sequential mental model does not match hardware reality.
flowchart TD
subgraph "x86-TSO Memory Model"
CPU1["CPU 1\nStore buffer: [x=1]\n(not yet visible to CPU2!)"]
CPU2["CPU 2\nStore buffer: [y=1]\n(not yet visible to CPU1!)"]
L1_1["L1 Cache (CPU1)"]
L1_2["L1 Cache (CPU2)"]
LLC["L3 Cache (Shared)"]
MEM["Main Memory"]
CPU1 -->|STORE| L1_1
CPU2 -->|STORE| L1_2
L1_1 --> LLC --> MEM
L1_2 --> LLC --> MEM
end
subgraph "Classic Reordering Bug (TSO allows)"
T1["Thread 1:\n x = 1\n r1 = y"]
T2["Thread 2:\n y = 1\n r2 = x"]
RESULT["Possible outcome: r1=0, r2=0\n(each thread reads the stale value\nbefore the other's store drains)\nImpossible under sequential consistency!"]
T1 --> RESULT
T2 --> RESULT
end
Memory Fences and Barriers¶
flowchart LR
MFENCE["MFENCE (x86):\n Drain store buffer\n Wait for all pending stores\n to be globally visible\n = full barrier"]
SFENCE["SFENCE:\n Store barrier only\n (stores ordered before fence\n visible before stores after)"]
LFENCE["LFENCE:\n Load barrier only\n (serializes loads;\n also blocks speculation)"]
LOCK["LOCK prefix:\n Implicit full barrier\n (LOCK ADD, XCHG, CMPXCHG)\n → Read-modify-write atomic"]
MFENCE --> SFENCE
MFENCE --> LFENCE
MFENCE --> LOCK
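The store-buffering litmus test above can be reproduced with C11 atomics. This is a minimal sketch (the function name `litmus_violation` is ours, not a library API): with `memory_order_seq_cst` the outcome r1=0, r2=0 is forbidden by the language, and on x86-64 the seq_cst store compiles to a locked XCHG or MOV+MFENCE, which drains the store buffer.

```c
// Store-buffering litmus test with C11 atomics.
// With seq_cst ordering, r1 == 0 && r2 == 0 can never be observed;
// with relaxed stores/loads, x86-TSO is allowed to produce it.
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static atomic_int x, y;
static int r1, r2;

static void *thread1(void *arg) {
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_seq_cst);  // x = 1
    r1 = atomic_load_explicit(&y, memory_order_seq_cst); // r1 = y
    return NULL;
}

static void *thread2(void *arg) {
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_seq_cst);  // y = 1
    r2 = atomic_load_explicit(&x, memory_order_seq_cst); // r2 = x
    return NULL;
}

// One iteration of the test; returns true if the forbidden
// outcome (both threads read 0) was observed.
bool litmus_violation(void) {
    atomic_store(&x, 0);
    atomic_store(&y, 0);
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return r1 == 0 && r2 == 0;
}
```

Swapping the orderings to `memory_order_relaxed` and running many iterations is a practical way to actually observe the reordering on real hardware.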
2. System Call: User Space → Kernel Crossing¶
sequenceDiagram
participant APP as User Space Process
participant CPU as CPU Hardware
participant KERN as Linux Kernel
APP->>CPU: SYSCALL instruction\n(x86-64: rax=syscall_number\nrdi,rsi,rdx,r10,r8,r9 = args)
Note over CPU: Save registers to kernel stack\nSwitch CS to kernel segment (ring 0)\nLoad kernel stack pointer\nJump to LSTAR MSR (syscall entry)
CPU->>KERN: entry_SYSCALL_64:\n 1. Save user regs to pt_regs\n 2. Check seccomp filters\n 3. sys_call_table[rax](args)\n → e.g., sys_write(fd,buf,count)
Note over KERN: Execute in kernel mode\n(full memory access, all privileges)
KERN->>CPU: SYSRET instruction\n Restore user registers\n Switch CS back to user segment (ring 3)
CPU-->>APP: Return value in rax
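The register convention in the diagram can be exercised directly through glibc's `syscall(2)` wrapper, which places the number in rax and the arguments in rdi/rsi/rdx before executing SYSCALL. A sketch (`raw_write` is a name introduced here; the syscall numbers differ per architecture, which is why `SYS_write` comes from a header):

```c
// Invoking a system call "by number": glibc's syscall() loads
// rax = number, rdi/rsi/rdx = args, then executes SYSCALL.
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

long raw_write(int fd, const void *buf, unsigned long count) {
    // On x86-64: rax = SYS_write (1), rdi = fd, rsi = buf, rdx = count.
    // Return value comes back in rax (negative errno on failure).
    return syscall(SYS_write, fd, buf, count);
}
```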
vDSO: Avoiding Syscall Overhead¶
For frequently called syscalls with no side effects (gettimeofday, clock_gettime), Linux maps kernel code directly into user process address space via vDSO (virtual Dynamic Shared Object):
gettimeofday() call in user space:
NORMAL: SYSCALL → kernel → read clock → SYSRET (~100ns)
vDSO: Call vDSO function directly (no ring switch!)
Reads time from shared memory page (mapped by kernel)
~10ns — 10× faster
3. Mutex Implementation: Futex¶
A futex (fast userspace mutex) avoids syscalls in the uncontended case:
sequenceDiagram
participant T1 as Thread 1 (holder)
participant T2 as Thread 2 (waiter)
participant FUTEX as Kernel Futex Table
Note over T1: Lock (uncontended):
T1->>T1: CAS(futex_word, 0, 1)\n[user space — no syscall!]
Note over T1: Success → lock acquired
Note over T2: Lock (contended):
T2->>T2: CAS(futex_word, 0, 1)\nFails! (T1 holds lock)
T2->>T2: CAS(futex_word, 1, 2)\n(mark contention: value=2)
T2->>FUTEX: syscall: futex(FUTEX_WAIT, &futex_word, 2)\n[enter kernel, add to wait queue, sleep]
Note over T1: Unlock:
T1->>T1: XCHG(futex_word, 0)\nReturns 2 (contended)
T1->>FUTEX: syscall: futex(FUTEX_WAKE, &futex_word, 1)\n[only needed when contended!]
FUTEX->>T2: Wake from wait queue
Note over T2: Retry CAS → acquire lock
Fast path: Lock/unlock with no waiters = 1 atomic CAS + 0 syscalls.
Slow path: Only when contended, syscall enters kernel to sleep/wake.
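The three-state scheme in the diagram (0 = unlocked, 1 = locked, 2 = locked with waiters) follows Ulrich Drepper's "Futexes Are Tricky". A minimal Linux sketch, not production code (the `fmutex_t` type and function names are ours; there is no error handling or spin-before-sleep):

```c
// Futex-backed mutex: CAS fast path in user space, kernel sleep
// only under contention. Linux-only (SYS_futex).
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

typedef struct { atomic_int word; } fmutex_t;

static long futex(atomic_int *addr, int op, int val) {
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

void fmutex_lock(fmutex_t *m) {
    int c = 0;
    // Fast path: 0 -> 1 with a single CAS, no syscall.
    if (atomic_compare_exchange_strong(&m->word, &c, 1))
        return;
    // Slow path: mark contended (word = 2), sleep in the kernel,
    // re-check after every wakeup.
    if (c != 2)
        c = atomic_exchange(&m->word, 2);
    while (c != 0) {
        futex(&m->word, FUTEX_WAIT, 2);   // sleep while word == 2
        c = atomic_exchange(&m->word, 2); // retry: 0 means we got it
    }
}

void fmutex_unlock(fmutex_t *m) {
    // Fast path: old value 1 means nobody waited, no syscall needed.
    if (atomic_exchange(&m->word, 0) == 2)
        futex(&m->word, FUTEX_WAKE, 1);   // wake exactly one waiter
}
```

Note the FUTEX_WAIT value check: the kernel atomically re-reads the word and returns immediately if it no longer equals 2, which closes the race between the failed CAS and going to sleep.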
Spinlock vs Mutex Tradeoff¶
flowchart LR
SPIN["Spinlock\n while(!CAS(lock,0,1)) { PAUSE }\n Pros: No context switch overhead\n Cons: Burns CPU while waiting\n Use: Short critical sections (<5μs)\n In interrupt context (sleeping forbidden)"]
MUTEX["Mutex (via futex)\n CAS fast path → sleep if contended\n Pros: Yields CPU to other threads\n Cons: Context switch overhead (~1-5μs)\n Use: Long critical sections\n When contention is common"]
SPIN --> MUTEX
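The spinlock side of the tradeoff is a few lines of C11 (a sketch; `spinlock_t` and the function names are ours): busy-wait on a CAS, with the PAUSE hint to ease power draw and SMT contention while spinning.

```c
// Test-and-set spinlock with a PAUSE hint in the wait loop.
#include <stdatomic.h>
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define cpu_relax() _mm_pause()   // PAUSE instruction
#else
#define cpu_relax() ((void)0)
#endif

typedef struct { atomic_flag f; } spinlock_t;

void spin_lock(spinlock_t *l) {
    // acquire ordering: later reads/writes cannot move above the lock
    while (atomic_flag_test_and_set_explicit(&l->f, memory_order_acquire))
        cpu_relax();              // burn CPU until the holder releases
}

void spin_unlock(spinlock_t *l) {
    // release ordering: earlier writes visible before the flag clears
    atomic_flag_clear_explicit(&l->f, memory_order_release);
}
```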
4. Rust Borrow Checker: Ownership Rules at Compile Time¶
flowchart TD
subgraph "Rust Ownership Rules"
R1["Rule 1: Each value has exactly one owner\n let x = String::from('hello');\n x is owner of heap memory"]
R2["Rule 2: Owner drop → memory freed\n (no GC: RAII destructor at scope end\n → compiler inserts Drop::drop())"]
R3["Rule 3: Only one &mut ref OR\n multiple & refs (never both)\n Enforced at compile time!"]
R1 --> R2 --> R3
end
subgraph "Move vs Copy Semantics"
MOVE["Move (heap types: String, Vec, Box):\n let y = x; // x MOVED to y\n println!({x}) // ERROR: x moved!\n → zero-cost, just copy pointer bits"]
COPY["Copy (stack types: i32, bool, f64, refs):\n let y = x; // x COPIED to y\n println!({x}) // OK: x still valid\n → impl Copy trait"]
MOVE --> COPY
end
Lifetime Annotations: Borrow Scope Enforcement¶
sequenceDiagram
participant CODE as Source Code
participant BC as Borrow Checker
Note over CODE: fn longest<'a>(x: &'a str, y: &'a str) -> &'a str
CODE->>BC: What lifetime does the return value have?
Note over BC: 'a = intersection of lifetimes of x and y\nReturn ref valid as long as BOTH x and y valid
Note over BC: Verify all call sites:\n returned ref used only within that intersection
BC-->>CODE: OK or Error: "does not live long enough"
Borrow checker eliminates at compile time:
- Use-after-free: owner dropped while borrow exists → compile error
- Data races: &mut reference while any other reference active → compile error
- Dangling references: reference to value that left scope → compile error
5. POSIX Threads: Kernel Mapping and Scheduling¶
flowchart TD
subgraph "Thread Implementation: Linux"
PTHREAD["pthread_create()\n→ clone() syscall:\n CLONE_VM: share address space\n CLONE_FS: share file system\n CLONE_FILES: share file descriptors\n CLONE_SIGHAND: share signal handlers"]
TASK["kernel: New task_struct\n(same process, new stack, new TID)\nShares mm_struct (page tables)\nScheduled independently by CFS"]
SCHED["CFS (Completely Fair Scheduler):\n vruntime = actual_runtime × (weight_nice0 / weight_this)\n Always run task with lowest vruntime\n Red-black tree ordered by vruntime"]
PTHREAD --> TASK --> SCHED
end
subgraph "Thread-Local Storage (TLS)"
TLS["__thread int x; (C/C++)\nor: thread_local int x; (C++11)\n→ each thread gets own x variable\nAccessed via FS register offset (x86-64)\nFS segment base = thread's TCB address\nx at offset [TCB + N]"]
end
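The per-thread-copy behavior of TLS is easy to demonstrate (a sketch; `bump_in_new_thread` is a name of ours). Each pthread sees its own zero-initialized `counter`, located at its own FS-relative offset on x86-64:

```c
// Thread-local storage: every thread gets a private copy of `counter`.
#include <pthread.h>
#include <stddef.h>

static _Thread_local int counter = 0;  // C11; __thread in GNU C

static void *bump(void *out) {
    counter += 1;                // touches only THIS thread's copy
    *(int *)out = counter;
    return NULL;
}

// Increment the TLS counter in a fresh thread; report what it saw.
int bump_in_new_thread(void) {
    int seen = 0;
    pthread_t t;
    pthread_create(&t, NULL, bump, &seen);
    pthread_join(t, NULL);
    return seen;   // the new thread started from its own 0, so: 1
}
```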
Context Switch: What Gets Saved¶
flowchart LR
subgraph "Registers Saved on Context Switch"
CALLEE["Callee-saved (compiler):\nrbx, rbp, r12-r15\n(caller responsible for rax, rcx, rdx, rsi, rdi, r8-r11)"]
SPECIAL["Special registers:\nrip (instruction pointer)\nrsp (stack pointer)\nrflags (condition codes)"]
FPU["FPU/SIMD state:\nxmm0-xmm15, ymm0-ymm15\nSaved via XSAVE/XSAVEOPT (large, costly)\nHistorically lazy-saved; modern Linux saves eagerly"]
CALLEE --> SPECIAL --> FPU
end
6. Memory Allocator Internals: glibc malloc¶
flowchart TD
subgraph "glibc ptmalloc Heap Layout"
FASTBIN["Fastbins (8-160 bytes):\n LIFO singly-linked list per size class\n No coalescing — fastest path\n Stays in CPU cache (recently freed)"]
SMALLBIN["Small bins (16-512 bytes):\n Doubly-linked list per bin\n One exact size per bin, FIFO order"]
LARGEBIN["Large bins (>512 bytes):\n Sorted by size + address\n Best-fit with skip list for O(log N) search"]
TOPCHUNK["Top chunk (the 'wilderness'):\n Remainder at the top of the heap\n Extended via sbrk()/mmap()\n Freed neighbors coalesce into top"]
FASTBIN --> SMALLBIN --> LARGEBIN --> TOPCHUNK
end
subgraph "Chunk Header"
HEADER["Each allocation has 16-byte header:\n size | PREV_INUSE | IS_MMAPPED | NON_MAIN_ARENA\n prev_size (if previous chunk free)\nUser data starts here (returned pointer)\nFooter: size repeated (for backward coalescing)"]
end
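The chunk rounding implied by the header layout is observable through glibc's `malloc_usable_size()` (glibc-specific; `usable` is a helper name of ours): the returned capacity is at least the request, rounded up to the chunk granularity.

```c
// Observe glibc chunk rounding: the usable size of an allocation
// is >= the request, padded up to the chunk size class.
#include <malloc.h>   // malloc_usable_size (glibc extension)
#include <stdlib.h>

size_t usable(size_t request) {
    void *p = malloc(request);
    size_t n = malloc_usable_size(p);  // real capacity of the chunk
    free(p);
    return n;
}
```

For example, on 64-bit glibc a 10-byte request typically lands in a 32-byte chunk with 24 usable bytes, since user data overlaps the next chunk's prev_size field.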
TCMalloc: Thread-Caching for Low Contention¶
flowchart LR
subgraph "TCMalloc Levels"
THREAD_CACHE["Thread Cache:\n Per-thread free lists (0-256KB)\n No locks needed!\n 256 size classes (8B, 16B, 32B...)"]
CENTRAL_CACHE["Central Cache:\n Spans of pages\n Lock only when thread cache empty/full\n Transfer batch of objects at once"]
PAGE_HEAP["Page Heap:\n mmap() from OS\n 256KB+ allocations bypass thread cache\n Pagemap: addr → span metadata"]
THREAD_CACHE --> CENTRAL_CACHE --> PAGE_HEAP
end
7. POSIX File I/O: Kernel Paths¶
sequenceDiagram
participant APP as Application
participant VFS as VFS Layer
participant FS as ext4 Filesystem
participant BCACHE as Block Cache (page cache)
participant DISK as NVMe SSD
APP->>VFS: read(fd, buf, 4096)
VFS->>BCACHE: Lookup page cache:\n (inode, page_offset) → page?
BCACHE->>VFS: Cache HIT → copy to user buf
VFS-->>APP: Returns 4096 bytes (no I/O!)
Note over APP: Cache MISS scenario:
VFS->>FS: readpage(inode, page_index)
FS->>DISK: bio_submit (block number)
Note over APP: Process sleeps (IO_WAIT)
DISK-->>FS: DMA transfer → page cache page
FS->>BCACHE: Page ready
BCACHE->>VFS: copy_to_user(buf, page)
VFS-->>APP: Returns (woken up)
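A round trip through this path is a few lines of C (a sketch; `roundtrip` and the temp path in the usage note are ours). The `write()` lands in the page cache as a dirty page, and the `read()` that follows is served from that same page without touching the disk:

```c
// write() then read() on the same file: the read is a page-cache hit.
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

// Write `len` bytes to a fresh file, read them back into `out`.
ssize_t roundtrip(const char *path, const char *data, size_t len,
                  char *out) {
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) return -1;
    write(fd, data, len);            // dirties a page-cache page
    lseek(fd, 0, SEEK_SET);
    ssize_t n = read(fd, out, len);  // served from the page cache
    close(fd);
    unlink(path);
    return n;
}
```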
O_DIRECT: Bypassing Page Cache¶
// O_DIRECT: bypass the page cache entirely
int fd = open("data.bin", O_RDWR | O_DIRECT);
// Requires a buffer aligned to the logical block size (512 or 4096 bytes):
void *buf;
posix_memalign(&buf, 4096, 4096);
// Transfers go directly: user buffer ↔ disk (via DMA)
// Used by: databases (which manage their own caches)
//          backup tools (to avoid evicting hot pages)
8. Signal Handling: Delivery and Async-Safety¶
sequenceDiagram
participant APP as User Process
participant KERN as Kernel
participant HANDLER as Signal Handler
Note over APP: Running user code
KERN->>APP: Deliver SIGINT:\n 1. Save user register state to stack\n 2. Modify stack to call signal handler\n 3. Return to signal handler
Note over KERN,APP: Push signal frame on user stack:\n pt_regs (saved registers)\n signal info\n ucontext (FPU state)
APP->>HANDLER: Execute SIGINT handler
Note over HANDLER: ASYNC-SIGNAL-SAFE functions only!\n (malloc is NOT safe: re-entrant deadlock)\n Safe: write(), _exit(), signal(), kill()\nUnsafe: printf(), malloc(), pthread_mutex_lock()
HANDLER->>APP: sigreturn() syscall\n → kernel restores registers from stack
Note over APP: Resume from exact instruction\nwhere SIGINT was delivered
9. NUMA Architecture: Memory Locality¶
flowchart TD
subgraph "NUMA System (2-socket server)"
NODE0["NUMA Node 0\nCPU 0-15\nLocal RAM: 64GB\n(Access: ~70ns)"]
NODE1["NUMA Node 1\nCPU 16-31\nLocal RAM: 64GB\n(Access: ~70ns)"]
INTERCO["QPI/UPI Interconnect\n(Remote access: ~140ns\n= 2× penalty!)"]
NODE0 <--> INTERCO <--> NODE1
end
subgraph "NUMA-Aware Allocation"
POLICY["numactl --membind=0 ./process\n Allocate memory only on node 0\n (process pinned to node 0 CPUs)\n→ Always local access, no cross-node\nnuma_alloc_local(): first-touch policy"]
end
Summary: Systems Programming Primitives¶
| Primitive | Implementation | Kernel Involvement |
|---|---|---|
| Mutex lock (uncontended) | CAS in user space | None (fast path) |
| Mutex lock (contended) | futex(FUTEX_WAIT) | Yes — process blocked |
| Thread create | clone(CLONE_VM \| ...) | Yes — new task_struct |
| Context switch | save regs, load regs | Yes — CFS scheduler |
| Memory alloc (<256B) | Thread cache LIFO | None (tcmalloc) |
| Memory alloc (large) | mmap(MAP_ANON) | Yes — page table update |
| File read (cached) | copy from page cache | Minimal |
| File read (uncached) | bio submit + sleep | Yes — I/O scheduler |
| Signal delivery | Modify stack → handler | Yes — return to usermode |
| System call | SYSCALL instruction | Yes — ring 0 transition |