Database Systems Internals: Under the Hood¶
Synthesized from: Elmasri & Navathe Fundamentals of Database Systems 6th ed, Korotkevitch Pro SQL Server Internals 2nd ed, MySQL database design references, and supporting comp(85/230/305-322) database references.
1. Storage Engine Architecture — Pages, Extents, and Buffer Pool¶
Every relational database manages storage through a page-based abstraction layer. Understanding page anatomy is the foundation for all performance analysis.
InnoDB Page Layout (16 KB default)¶
+------------------+ offset 0
| File Header | 38 bytes: page type, LSN, space_id, page_no, checksum
+------------------+
| Page Header | 56 bytes: slot count, free space, garbage ptr, level
+------------------+
| Infimum Record | virtual lower bound record (fixed)
+------------------+
| User Records | actual row data (grows toward free space)
| ↓ |
+------------------+
| Free Space | unallocated area between records and directory
| ↑ |
+------------------+
| Page Directory | 2-byte slots, each points to record (grows upward)
| (Slot Array) | binary searchable: O(log n) slot scan
+------------------+
| File Trailer | 8 bytes: LSN checksum verification
+------------------+ offset 16383
Page types: FIL_PAGE_INDEX (B+Tree node), FIL_PAGE_UNDO_LOG, FIL_PAGE_INODE, FIL_PAGE_IBUF_BITMAP, FIL_PAGE_TYPE_SYS, FIL_PAGE_BLOB.
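The 38-byte file header can be decoded with a few lines of Python. A minimal sketch, assuming the documented field order (checksum, page_no, prev, next, LSN, type, flush_LSN, space_id) and big-endian on-disk byte order; the synthetic page below is illustrative, not a real tablespace dump.

```python
import struct

# Offsets within the 38-byte InnoDB file header (FIL header).
# Layout: checksum(4) page_no(4) prev(4) next(4) lsn(8) type(2)
#         flush_lsn(8) space_id(4) -- big-endian on disk.
FIL_HEADER = struct.Struct(">IIIIQHQI")  # 4+4+4+4+8+2+8+4 = 38 bytes

FIL_PAGE_INDEX = 17855  # page type code for a B+Tree node (0x45BF)

def parse_fil_header(page: bytes) -> dict:
    """Decode the file header of a 16 KB InnoDB page."""
    checksum, page_no, prev, nxt, lsn, ptype, flush_lsn, space_id = \
        FIL_HEADER.unpack_from(page, 0)
    return {"checksum": checksum, "page_no": page_no, "prev": prev,
            "next": nxt, "lsn": lsn, "type": ptype,
            "flush_lsn": flush_lsn, "space_id": space_id}

# Build a synthetic page to demonstrate (not real tablespace contents).
page = bytearray(16384)
FIL_HEADER.pack_into(page, 0, 0xDEADBEEF, 3, 2, 4, 8192, FIL_PAGE_INDEX, 0, 1)
hdr = parse_fil_header(bytes(page))
print(hdr["page_no"], hdr["type"])  # 3 17855
```

The prev/next fields are what chain index pages of the same B+Tree level into a doubly linked list.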
Buffer Pool Architecture¶
flowchart TD
subgraph Buffer_Pool["InnoDB Buffer Pool (e.g. 8 GB)"]
direction TB
LRU_NEW["LRU New Sublist (5/8)\nHot pages - recently accessed"]
LRU_OLD["LRU Old Sublist (3/8)\nCold pages - aging out"]
FREE["Free Page List"]
FLUSH["Flush List\n(dirty pages ordered by LSN)"]
LRU_NEW <-->|"midpoint insertion\nnewly read → old head"| LRU_OLD
LRU_OLD -->|"page re-accessed after\ninnodb_old_blocks_time (1s)\n→ promoted"| LRU_NEW
LRU_OLD -->|"eviction"| FREE
end
SQL["SQL Query"] --> BUF_LOOKUP{"Page in\nbuffer pool?"}
BUF_LOOKUP -->|"hit"| RETURN["Return page\nno disk I/O"]
BUF_LOOKUP -->|"miss"| FREE
FREE -->|"read page from\ntablespace file"| LRU_OLD
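The midpoint-insertion behavior above can be sketched as a toy LRU with two sublists; `MidpointLRU` and its parameters are illustrative simplifications, not InnoDB's actual data structures.

```python
import time
from collections import OrderedDict

class MidpointLRU:
    """Toy midpoint-insertion LRU: newly read pages land at the head of
    the old sublist; a page is promoted to the new sublist only when
    re-accessed after old_blocks_time seconds (mimicking
    innodb_old_blocks_time's scan resistance)."""

    def __init__(self, capacity=8, old_fraction=3/8, old_blocks_time=1.0):
        self.capacity = capacity
        self.old_blocks_time = old_blocks_time
        self.new = OrderedDict()   # page_no -> None (most recent first)
        self.old = OrderedDict()   # page_no -> first-access timestamp

    def access(self, page_no, now=None):
        now = time.monotonic() if now is None else now
        if page_no in self.new:            # hot page: move to head
            self.new.move_to_end(page_no, last=False)
        elif page_no in self.old:          # promote only if it aged enough
            if now - self.old[page_no] >= self.old_blocks_time:
                del self.old[page_no]
                self.new[page_no] = None
                self.new.move_to_end(page_no, last=False)
        else:                              # miss: insert at old-sublist head
            self.old[page_no] = now
            self.old.move_to_end(page_no, last=False)
            while len(self.new) + len(self.old) > self.capacity:
                self.old.popitem()         # evict from old-sublist tail

lru = MidpointLRU(capacity=4, old_blocks_time=1.0)
lru.access(1, now=0.0)   # miss -> old sublist
lru.access(1, now=0.5)   # re-accessed too soon: stays old (scan resistance)
lru.access(1, now=1.5)   # aged past threshold: promoted to new sublist
print(1 in lru.new)      # True
```

This is why a one-off table scan cannot flush the hot working set: scanned pages touch only the old sublist and age out.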
Double-write buffer: Before flushing dirty pages to tablespace, InnoDB writes them sequentially to the doublewrite buffer (2 MB, 128 pages). This protects against partial page writes (torn pages) during crash — recovery reads from doublewrite if page checksum fails.
2. B+Tree Index Internals¶
Node Structure and Split Algorithm¶
flowchart TD
Root["Root Node\nPage 4\n[K1=50, K2=150]\nPtrs: [P1, P2, P3]"]
Root -->|"P1: key < 50"| L1["Leaf Page\n[10,20,30,40]\nPrev←→Next ptrs"]
Root -->|"P2: 50 ≤ key < 150"| L2["Leaf Page\n[50,80,100,120]\nPrev←→Next ptrs"]
Root -->|"P3: key ≥ 150"| L3["Leaf Page\n[150,200,250]\nPrev←→Next ptrs"]
L1 <-->|"sibling links\nfor range scans"| L2
L2 <-->|"sibling links"| L3
Page Split on INSERT¶
When a leaf page is full, the insert triggers a split. Under random inserts, average page utilization settles near ln 2 ≈ 69%, leaving room for later inserts and updates:
sequenceDiagram
participant TX as Transaction
participant BP as Buffer Pool
participant BT as B+Tree
TX->>BT: INSERT (key=75)
BT->>BP: Find leaf page containing 75
Note over BP: Page has 15/15 records — FULL
BT->>BP: Allocate new page N
BT->>BT: Split: move upper half (keys ≥ 70) to N\nInsert separator key 70 into parent
Note over BT: Parent also full? → recursive split upward
BT->>BP: Write both pages to flush list (dirty)
BT->>TX: Insert complete (key 75 lands in new page N)
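The split step can be sketched as a simple midpoint split over a sorted key list (real InnoDB chooses the split point heuristically around the insert position, and moves full records, not bare keys):

```python
import bisect

def split_leaf(keys, new_key, capacity):
    """Toy B+Tree leaf split: insert new_key into a full sorted leaf,
    move the upper half to a fresh page, and return the separator key
    that must be pushed into the parent node."""
    assert len(keys) == capacity            # page is full, split required
    merged = list(keys)
    bisect.insort(merged, new_key)
    mid = len(merged) // 2
    left, right = merged[:mid], merged[mid:]
    separator = right[0]                    # smallest key of the new page
    return left, right, separator

left, right, sep = split_leaf([50, 60, 70, 80, 90, 100], 75, capacity=6)
print(left, right, sep)   # [50, 60, 70] [75, 80, 90, 100] 75
```

If inserting the separator fills the parent too, the same routine recurses upward, which is how the tree gains a level.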
Clustered index (InnoDB primary key): Entire row stored in B+Tree leaf. Physical order = PK order. Fragmentation occurs on random PK inserts (UUID PKs = worst case).
Secondary index: Leaf stores (secondary_key, primary_key). Point lookup: secondary index B+Tree → PK value → clustered index B+Tree (two B+Tree traversals = "bookmark lookup").
Index Fill Factor and Fragmentation¶
flowchart LR
A["Sequential INSERT\n(AUTO_INCREMENT PK)\nPages fill left-to-right\nFill factor ~95%"] -->|"ANALYZE TABLE"| B["Fragmentation: ~0%"]
C["Random INSERT\n(UUID PK or random hash)\nPage splits everywhere\nFill factor ~50-69%"] -->|"ANALYZE TABLE"| D["Fragmentation: 30-50%\nALTER TABLE FORCE or\nOPTIMIZE TABLE to rebuild"]
3. InnoDB MVCC — Undo Log Chain¶
MVCC (Multi-Version Concurrency Control) enables readers to never block writers. Every row version is chained through undo logs.
Row Version Chain¶
flowchart LR
CURRENT["Clustered Index Leaf\nROW: id=5, salary=75000\nDB_TRX_ID=1005\nDB_ROLL_PTR → undo"]
UNDO1["Undo Log Segment\nOld version: salary=70000\nDB_TRX_ID=998\nROLL_PTR → prev undo"]
UNDO2["Undo Log Segment\nOld version: salary=65000\nDB_TRX_ID=750\nROLL_PTR → null"]
CURRENT -->|"DB_ROLL_PTR\n(7-byte rollback ptr)"| UNDO1
UNDO1 -->|"prev rollback ptr"| UNDO2
Read View Mechanism¶
sequenceDiagram
participant TX100 as Transaction 100 (long-running read)
participant TX1005 as Transaction 1005 (writer)
participant TRX_SYS as trx_sys (active list)
Note over TX100: BEGIN, ReadView created\nup_limit_id=998, low_limit_id=1000\nids_list=[998,999]
TX1005->>TRX_SYS: UPDATE row (trx_id=1005)
TX1005->>TRX_SYS: COMMIT
TX100->>TRX_SYS: SELECT salary FROM employees WHERE id=5
Note over TX100: Row has DB_TRX_ID=1005\n1005 >= low_limit_id(1000) → INVISIBLE\nWalk undo chain → find DB_TRX_ID=750 < up_limit_id(998)\n→ VISIBLE: return salary=65000
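The visibility rules the diagram applies can be written out as a small function; the tuple layout and names are simplifications of InnoDB's ReadView, not its actual API:

```python
def visible(trx_id, up_limit_id, low_limit_id, active_ids, my_trx_id=None):
    """ReadView visibility check (simplified InnoDB rules):
    - this transaction's own changes are always visible
    - trx_id <  up_limit_id  (oldest active at snapshot) -> committed before -> visible
    - trx_id >= low_limit_id (next id to be assigned)    -> started after   -> invisible
    - otherwise visible only if NOT active when the snapshot was taken
    """
    if trx_id == my_trx_id:
        return True
    if trx_id < up_limit_id:
        return True
    if trx_id >= low_limit_id:
        return False
    return trx_id not in active_ids

def read_version(version_chain, view):
    """Walk the undo chain (newest first) to the first visible version."""
    for trx_id, value in version_chain:
        if visible(trx_id, *view):
            return value
    return None  # no visible version (row created after the snapshot)

# Version chain for id=5 from the diagram, newest first.
chain = [(1005, 75000), (998, 70000), (750, 65000)]
view = (998, 1000, {998, 999})   # up_limit_id, low_limit_id, active ids
print(read_version(chain, view))  # 65000
```

Walking the chain is exactly why long undo chains (hot rows under heavy update churn) slow down old snapshots.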
Purge thread: Background thread (srv_purge_coordinator_thread) cleans undo logs when no active ReadView needs them. Long-running transactions prevent purge → undo tablespace grows unboundedly (classic ibdata1 bloat problem).
4. Transaction Log (WAL) and Crash Recovery¶
Write-Ahead Logging Protocol¶
flowchart TD
TX["Transaction: UPDATE row"] --> UNDO_WRITE["1. Write UNDO log record\n(before-image of row)"]
UNDO_WRITE --> BUFFER["2. Modify page in buffer pool\n(dirty page, not written yet)"]
BUFFER --> REDO_WRITE["3. Write REDO log (WAL)\nPhysiological record: {type, space_id,\npage_no, offset, after-image}\nfsync to redo log file"]
REDO_WRITE --> COMMIT["4. COMMIT: write commit log record\nfsync (innodb_flush_log_at_trx_commit=1)\n→ durability guaranteed"]
COMMIT --> FLUSH["5. Background: flush dirty\npages from buffer pool to .ibd\n(checkpoint advances LSN)"]
LSN (Log Sequence Number): monotonically increasing byte offset into redo log. Every page header stores FIL_PAGE_LSN = LSN of last modification. Pages with page_LSN < checkpoint_LSN are guaranteed durable.
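The WAL ordering rule — redo appended and fsynced before a commit is acknowledged — can be sketched as follows; `RedoLog` is a toy, with JSON lines standing in for InnoDB's binary physiological records:

```python
import json
import os
import tempfile

class RedoLog:
    """Minimal WAL sketch: a commit is durable once its records are
    appended and fsynced; data pages may be flushed lazily afterwards."""

    def __init__(self, path):
        self.f = open(path, "ab")
        self.lsn = 0                      # byte offset into the log

    def append(self, record: dict) -> int:
        data = (json.dumps(record) + "\n").encode()
        self.f.write(data)
        self.lsn += len(data)             # LSN = monotonically growing offset
        return self.lsn

    def commit(self):
        self.f.flush()
        os.fsync(self.f.fileno())         # the durability point

path = os.path.join(tempfile.mkdtemp(), "redo.log")
log = RedoLog(path)
log.append({"space_id": 1, "page_no": 3, "offset": 120,
            "after": "salary=75000"})
log.append({"type": "COMMIT", "trx_id": 1005})
log.commit()                              # survives a crash from here on
print(log.lsn > 0)  # True
```

Note the order: the dirty page has not been written anywhere yet; only the log is on disk. That is the whole WAL contract.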
Crash Recovery — ARIES Algorithm¶
sequenceDiagram
participant Recovery as InnoDB Recovery
participant RedoLog as Redo Log
participant UndoLog as Undo Log
Note over Recovery: Phase 1: ANALYSIS\nScan redo log from last checkpoint\nBuild dirty page table, active TX table
Recovery->>RedoLog: Phase 2: REDO (Roll Forward)\nReplay ALL log records from checkpoint LSN\nEven uncommitted TXs are redone\n(brings DB to crash-moment state)
Recovery->>UndoLog: Phase 3: UNDO (Roll Back)\nFor each uncommitted TX in active TX table\nApply undo records in reverse LSN order\n(atomicity: partial TXs rolled back)
Note over Recovery: Database consistent\nNormal operation resumes
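The three phases can be simulated over a toy log of dict records (a sketch of the ARIES idea, not InnoDB's actual recovery code):

```python
def recover(log, checkpoint_lsn):
    """Toy ARIES recovery over a list of log records:
    analysis -> redo everything since the checkpoint -> undo losers."""
    # Phase 1: ANALYSIS -- find transactions with no COMMIT record.
    committed = {r["trx"] for r in log if r["op"] == "COMMIT"}
    # Phase 2: REDO -- replay every update since the checkpoint,
    # including those of uncommitted ("loser") transactions.
    db = {}
    for r in log:
        if r["op"] == "UPDATE" and r["lsn"] >= checkpoint_lsn:
            db[r["key"]] = r["after"]
    # Phase 3: UNDO -- roll back losers in reverse LSN order
    # using the before-images.
    for r in reversed(log):
        if r["op"] == "UPDATE" and r["trx"] not in committed:
            db[r["key"]] = r["before"]
    return db

log = [
    {"lsn": 10, "trx": 1, "op": "UPDATE", "key": "a", "before": 0, "after": 1},
    {"lsn": 20, "trx": 2, "op": "UPDATE", "key": "b", "before": 0, "after": 2},
    {"lsn": 30, "trx": 1, "op": "COMMIT"},
    # crash here: trx 2 never committed
]
print(recover(log, checkpoint_lsn=0))  # {'a': 1, 'b': 0}
```

Committed trx 1 survives the crash (durability); uncommitted trx 2 is rolled back (atomicity).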
5. Query Execution Pipeline¶
Query Lifecycle¶
flowchart TD
A["SQL String:\nSELECT u.name, COUNT(o.id)\nFROM users u JOIN orders o ON u.id=o.user_id\nWHERE u.region='US' GROUP BY u.id"]
A --> B["Parser\nLex/Yacc → AST\nSyntax validation\nIdentifier resolution"]
B --> C["Semantic Analyzer\nTable/column existence\nPermission check\nType coercion"]
C --> D["Query Rewriter\nView expansion\nSubquery → JOIN\nIN → EXISTS transformation"]
D --> E["Cost-Based Optimizer\nEnumerate join orders\nIndex access path selection\nCardinality estimation"]
E --> F["Execution Plan\nIterator tree of operators\nEach operator: open/next/close"]
F --> G["Execution Engine\nVolcano/Iterator model\nPull-based evaluation"]
G --> H["Result rows to client"]
Cost-Based Optimizer — Index Selection¶
flowchart TD
A["Predicate: WHERE region='US' AND created_at > '2024-01-01'"]
A --> B["Statistics lookup\ninnodb_index_stats table\ncardinality per index"]
B --> C["Option 1: Full table scan\nCost = n_rows × row_read_cost\n= 1,000,000 × 1.0 = 1,000,000"]
B --> D["Option 2: idx_region\nSELECTIVITY = 200k/1M = 20%\nRange scan cost = 200,000 + 200,000 bookmark lookups\n= 400,000"]
B --> E["Option 3: idx_region_created (composite)\nSELECTIVITY = 2k/1M = 0.2%\nCost = 2,000 (index only, no bookmark lookup)\n= 2,000 ✓ CHOSEN"]
E --> F["Execution plan: index range scan on idx_region_created\nCovering index if SELECT columns ⊆ index columns"]
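The comparison above reduces to a small cost model; `plan_cost` encodes the same simplified assumptions as the diagram (one read per matching index entry, plus one bookmark lookup per row unless the index covers the query), not the optimizer's real cost functions:

```python
def plan_cost(n_rows, selectivity, covering, row_read_cost=1.0):
    """Toy cost model: index range scan pays one read per matching
    entry, doubled by bookmark lookups when the index is not covering."""
    matching = n_rows * selectivity * row_read_cost
    return matching if covering else matching * 2

N = 1_000_000
candidates = {
    "full_scan":          N * 1.0,                          # every row
    "idx_region":         plan_cost(N, 0.20, covering=False),
    "idx_region_created": plan_cost(N, 0.002, covering=True),
}
best = min(candidates, key=candidates.get)
print(best, candidates[best])   # idx_region_created 2000.0
```

The takeaway matches the diagram: selectivity dominates, and a covering composite index wins by three orders of magnitude over the full scan.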
Volcano Iterator Model¶
sequenceDiagram
participant Client
participant HashAgg as HashAggregate.next()
participant HashJoin as HashJoin.next()
participant Scan as IndexScan.next()
Client->>HashAgg: next()
HashAgg->>HashJoin: next() [loop: build hash table from orders]
HashJoin->>Scan: next() [probe side: fetch user rows]
Scan-->>HashJoin: row {id=1, name="Alice", region="US"}
HashJoin-->>HashAgg: joined row {name="Alice", order_count=5}
HashAgg-->>Client: aggregated row {name="Alice", count=5}
Hash join internals: Build phase reads smaller table into in-memory hash table (hash(join_key) → bucket → row). Probe phase reads larger table, hashes join key, probes buckets. If hash table exceeds join_buffer_size → spill to disk (grace hash join with partitioning).
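The build/probe phases can be sketched in a few lines (in-memory only — the grace-hash spill path is omitted):

```python
from collections import defaultdict

def hash_join(build_rows, probe_rows, build_key, probe_key):
    """In-memory hash join: build a hash table on the smaller input,
    then probe it once per row of the larger input -- O(|R| + |S|)."""
    table = defaultdict(list)
    for row in build_rows:                     # build phase
        table[row[build_key]].append(row)
    out = []
    for row in probe_rows:                     # probe phase
        for match in table.get(row[probe_key], []):
            out.append({**match, **row})
    return out

users  = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
orders = [{"user_id": 1, "amount": 50}, {"user_id": 1, "amount": 30},
          {"user_id": 2, "amount": 99}]
joined = hash_join(users, orders, "id", "user_id")
print(len(joined))   # 3
```

Building on the smaller side is what keeps memory bounded; when even that side exceeds the join buffer, both inputs are hash-partitioned to disk and each partition pair is joined separately.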
6. SQL Server Storage Internals (Pro SQL Server Internals)¶
Data Page (8 KB)¶
+-------------------+ 0
| Page Header | 96 bytes: pageID, type, freeCount, slotCount, nextPage, prevPage
+-------------------+
| Row 0 | Variable-length: null bitmap + fixed cols + var-length ptr array + var data
| Row 1 |
| ... |
| Row N |
+-------------------+
| Free Space |
+-------------------+
| Row Offset Array | 2 bytes per row, grows from bottom
| [N offset] | slot[i] = byte offset of row i within page
+-------------------+ 8191
Row Structure (Variable Length)¶
flowchart LR
A["Status Bits (1B)\nhas_nulls, has_var_cols"] --> B["Fixed-length data\ncols in schema order\ne.g. int(4B) + tinyint(1B)"]
B --> C["Null bitmap\n1 bit per nullable col"]
C --> D["Variable col count (2B)"]
D --> E["Variable col offset array\n2B per var col\n→ end offset of each var col"]
E --> F["Variable-length data\nVARCHAR/NVARCHAR contents"]
Forwarded records: When UPDATE increases row size beyond page capacity, row moved to new page. Original slot gets an 8-byte forwarding pointer. Heap scans follow forwarding pointers — performance degrades. Fix: ALTER TABLE REBUILD (clustered index) eliminates forwarded records.
SQL Server Lock Hierarchy¶
flowchart TD
DB["Database Lock\n(IS, S, IX, SIX, X)"] --> TABLE["Table Lock\n(IS, S, IX, SIX, X)"]
TABLE --> PAGE["Page Lock\n(IS, S, IX, SIX, X)"]
PAGE --> ROW["Row (Key) Lock\n(S, U, X, RangeS-S, RangeI-N, ...)"]
Lock escalation: SQL Server escalates row/page locks to a table lock when a single statement acquires roughly 5,000 locks on one table (or under lock-memory pressure), reducing lock-manager overhead. Can cause blocking. Disable with ALTER TABLE ... SET (LOCK_ESCALATION = DISABLE).
NOLOCK hint / READ UNCOMMITTED: Reads pages without acquiring shared locks → dirty reads possible (sees uncommitted data, phantom rows, even rolled-back data mid-flight).
7. Transaction Isolation Levels and Anomaly Prevention¶
flowchart TD
subgraph Isolation_Levels
RU["READ UNCOMMITTED\nNo locks acquired on read\nDirty read ✓, Phantom ✓"]
RC["READ COMMITTED\nShared lock acquired + released after read\nDirty read ✗, Non-repeatable read ✓"]
RR["REPEATABLE READ (MySQL default)\nLock-based definition: shared locks held until TX end\nDirty ✗, Non-repeatable ✗, Phantom ✓\nInnoDB: MVCC snapshot reads instead;\nnext-key (gap) locks prevent phantoms too"]
SER["SERIALIZABLE\nRange locks / predicate locks\nAll anomalies ✗"]
SER_SI["SNAPSHOT ISOLATION (SQL Server)\nMVCC ReadView per TX\nDirty ✗, Non-repeatable ✗, Phantom ✗\nWrite skew still possible"]
end
Gap locks (InnoDB RR): Lock the gap before a record, preventing INSERT into range. If WHERE id BETWEEN 10 AND 20, gap locks on all gaps in that range prevent concurrent INSERTs. Eliminates phantoms without full serializable isolation.
Deadlock detection: Each lock manager maintains a "waits-for graph". A detection pass runs cycle detection (DFS). On cycle found, victim selection: the smallest transaction (by rows modified / undo log size) is rolled back. InnoDB reports the last deadlock in the LATEST DETECTED DEADLOCK section of SHOW ENGINE INNODB STATUS.
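Cycle detection on the waits-for graph is a standard DFS with gray/black coloring; a sketch, not InnoDB's lock_sys code:

```python
def find_deadlock(waits_for):
    """DFS cycle detection on a waits-for graph {waiter: {holders...}}.
    Returns one transaction on a cycle (a candidate victim) or None."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {tx: WHITE for tx in waits_for}

    def dfs(tx):
        color[tx] = GRAY                      # on the current DFS path
        for holder in waits_for.get(tx, ()):
            if color.get(holder, WHITE) == GRAY:
                return holder                 # back edge -> cycle found
            if color.get(holder, WHITE) == WHITE:
                found = dfs(holder)
                if found is not None:
                    return found
        color[tx] = BLACK                     # fully explored, no cycle
        return None

    for tx in list(waits_for):
        if color[tx] == WHITE:
            victim = dfs(tx)
            if victim is not None:
                return victim
    return None

# T1 waits on a lock held by T2, T2 on T3, T3 on T1: deadlock.
graph = {"T1": {"T2"}, "T2": {"T3"}, "T3": {"T1"}}
print(find_deadlock(graph) is not None)            # True
print(find_deadlock({"T1": {"T2"}, "T2": set()}))  # None
```

A real engine then picks the cheapest-to-undo transaction on the cycle and rolls it back, breaking the wait.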
8. Index Types and Access Patterns¶
flowchart TD
subgraph Index_Types
BTREE["B+Tree Index\nOrdered, range-scannable\nInnoDB default\nO(log n) point, O(log n + k) range"]
HASH["Hash Index\nMEMORY engine; InnoDB builds an\ninternal adaptive hash index\nO(1) exact match\nNo range scan support"]
FULLTEXT["Full-Text Index\nInverted index: term → {docid, position}\nMySQL: FTS_DOC_ID column\nTF-IDF relevance ranking"]
SPATIAL["Spatial Index (R-Tree)\nMBR nesting, bounding box queries\nMySQL geometry types\nST_Contains, ST_Distance"]
BITMAP["Bitmap Index (Oracle/columnar)\nBit vector per distinct value\nEfficient for low-cardinality cols\nAND/OR = bitwise ops"]
end
Covering Index — Eliminating Bookmark Lookup¶
flowchart LR
A["Query: SELECT name, email\nFROM users WHERE region='US'"]
A --> B{Index covers\nname, email, region?}
B -->|"No (idx_region only)"| C["Index Range Scan → 200k PKs\n200k Bookmark Lookups to clustered index\n200k random I/Os ← SLOW"]
B -->|"Yes (idx_region_name_email)"| D["Index Only Scan\nAll data in index leaf\n0 bookmark lookups ← FAST"]
9. Join Algorithms — Memory and CPU Paths¶
flowchart TD
subgraph Nested_Loop
A["For each row R in outer table\n For each row S in inner table\n IF R.key == S.key → emit\nCost: O(|R| × |S|)\nGood when inner table small or indexed"]
end
subgraph Hash_Join
B["Build phase: load smaller table\ninto hash table keyed by join col\nProbe phase: scan larger table\nhash lookup per row\nCost: O(|R| + |S|)\nRequires join_buffer_size memory"]
end
subgraph Sort_Merge_Join
C["Sort both inputs on join key\nTwo-pointer merge scan\nCost: O(|R|log|R| + |S|log|S|)\nGood if inputs already sorted (index)"]
end
Block Nested Loop (MySQL pre-8.0): Read the outer table in chunks of join_buffer_size, scanning the inner table once per chunk. Reduces the number of inner-table scans from one per outer row to one per chunk (|outer| / chunk_size scans).
Hash Anti-Join (for NOT IN / NOT EXISTS): Build hash set from subquery result; for each probe row, emit only if no hash match found.
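A sketch of the anti-join (note that SQL's NOT IN has different semantics when the subquery can return NULL; this models the NOT EXISTS form):

```python
def hash_anti_join(probe_rows, sub_keys, key):
    """Hash anti-join (NOT EXISTS): build a hash set from the subquery
    result, then emit only probe rows with no match in the set."""
    excluded = set(sub_keys)                  # build phase
    return [r for r in probe_rows if r[key] not in excluded]

# Customers that placed no orders:
customers = [{"id": 1}, {"id": 2}, {"id": 3}]
customers_with_orders = [1, 3]
print(hash_anti_join(customers, customers_with_orders, "id"))  # [{'id': 2}]
```

Same O(|R| + |S|) shape as the inner hash join; only the emit condition is inverted.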
10. Write Path — Checkpoint and Redo Log Cycle¶
flowchart LR
subgraph Memory
BP["Buffer Pool\nDirty Pages\nFlush List (LSN ordered)"]
LOGBUF["Log Buffer\n(in-memory redo ring)"]
end
subgraph Disk
REDO["ib_logfile0\nib_logfile1\n(circular ring, e.g. 2×512MB)"]
IBD[".ibd tablespace files"]
DWB["Doublewrite Buffer\n(sequential write area)"]
end
TX["TX COMMIT"] --> LOGBUF
LOGBUF -->|"fsync every commit\nor group commit batch"| REDO
BP -->|"Page Cleaner thread\nflushes dirty pages\nwhen free page shortage\nor checkpoint age threshold"| DWB
DWB -->|"atomic write\ncopy to actual page location"| IBD
REDO -->|"space reclaimed\nafter checkpoint\nadvances past log records"| REDO
Checkpoint age: total redo capacity (innodb_log_file_size × innodb_log_files_in_group, e.g. 2) × ~0.75 ≈ maximum dirty data before aggressive flushing kicks in. If the redo log fills (checkpoint age ≈ log size), all writes STALL until the checkpoint advances. Symptom: "InnoDB: page_cleaner: 1000ms intended loop took 5000ms" in the error log.
11. Partitioning Internals¶
flowchart TD
A["INSERT INTO orders (id, region, amount)\nVALUES (1001, 'APAC', 500)"]
A --> B["Partition function evaluation\nRANGE: PARTITION BY RANGE(YEAR(created_at))\nHASH: PARTITION BY HASH(customer_id) PARTITIONS 16\nLIST: PARTITION BY LIST(region)"]
B --> C["Partition pruning at query time\nWHERE region='APAC'\n→ only scan p_APAC partition\nOther partitions skipped entirely"]
C --> D["Each partition:\nIndependent tablespace (.ibd)\nSeparate B+Tree root\nSeparate buffer pool pages\nSeparate statistics"]
Partition pruning requires predicates on partition column in WHERE clause. Without pruning, all partitions scanned — worse than non-partitioned table (more B+Tree roots to descend). Partition key must be part of every unique/primary key.
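Routing and pruning for a LIST-partitioned table can be sketched as dictionary lookups; the scheme and partition names are hypothetical:

```python
def partition_for(row, scheme):
    """Route a row to its partition: LIST partitioning by region here."""
    return scheme.get(row["region"], "p_other")

def prune(predicate_region, scheme):
    """With an equality predicate on the partition column, a single
    partition survives; without one, every partition must be scanned."""
    if predicate_region is None:
        return sorted(set(scheme.values()) | {"p_other"})
    return [scheme.get(predicate_region, "p_other")]

scheme = {"APAC": "p_apac", "US": "p_us", "EU": "p_eu"}
print(partition_for({"region": "APAC"}, scheme))  # p_apac
print(prune("APAC", scheme))                      # ['p_apac']
print(len(prune(None, scheme)))                   # 4
```

The `prune(None, ...)` case is the failure mode described above: every partition's B+Tree must be descended, which is slower than one larger tree.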
12. Column Store vs Row Store Internals¶
flowchart LR
subgraph Row_Store["Row Store (OLTP)"]
direction TB
R1["Page: [id=1,name=Alice,age=25,dept=Eng]"]
R2["Page: [id=2,name=Bob,age=30,dept=Mkt]"]
R3["Page: [id=3,name=Carol,age=28,dept=Eng]"]
R1 --- R2 --- R3
N1["SELECT name WHERE id=1\n→ read 1 page, return name\nGood for point lookups"]
end
subgraph Col_Store["Column Store (OLAP)"]
direction TB
C1["Column file: id=[1,2,3,4,5...]"]
C2["Column file: age=[25,30,28,35,22...]"]
C3["Column file: dept=[Eng,Mkt,Eng,Eng,Mkt...]"]
C1 --- C2 --- C3
N2["SELECT AVG(age) WHERE dept='Eng'\n→ read only dept + age columns\nVectorized SIMD scan\nDict encoding: Eng=1,Mkt=2\nRun-length: [1,1,1,2,2...]"]
end
Dictionary encoding: Replace string column values with integer codes. dept column stored as uint8 array, dictionary: {0: "Engineering", 1: "Marketing"}. Compression ratio 10-100x for low-cardinality columns.
Vectorized execution: Process batches of ~1024 rows at once using SIMD (AVX2: 256-bit registers, 8 × 32-bit lanes per instruction). Columnar layout enables this — all values of one column are contiguous in memory → CPU prefetch is efficient and L1/L2 cache is well utilized.
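Dictionary and run-length encoding from the diagram, sketched in Python (real column stores bit-pack the code stream rather than using Python lists):

```python
def dict_encode(values):
    """Dictionary encoding: map each distinct string to a small int code."""
    dictionary, codes = {}, []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        codes.append(dictionary[v])
    return dictionary, codes

def rle_encode(codes):
    """Run-length encoding on the code stream: [code, run_length] pairs."""
    runs = []
    for c in codes:
        if runs and runs[-1][0] == c:
            runs[-1][1] += 1
        else:
            runs.append([c, 1])
    return runs

dept = ["Eng", "Eng", "Eng", "Mkt", "Mkt", "Eng"]
d, codes = dict_encode(dept)
print(d)                 # {'Eng': 0, 'Mkt': 1}
print(rle_encode(codes)) # [[0, 3], [1, 2], [0, 1]]
```

Predicates like dept='Eng' then run directly on the codes (compare against 0) without decompressing, which is where much of the columnar speedup comes from.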
Database Performance Numbers Reference¶
| Operation | InnoDB | SQL Server | Notes |
|---|---|---|---|
| Buffer pool hit | ~100 ns | ~100 ns | DRAM access |
| Page read from NVMe SSD | ~50 µs | ~50 µs | Sequential: ~10 µs |
| B+Tree point lookup (3 levels) | 3 × 50 µs = 150 µs | similar | Upper levels usually cached |
| Full table scan (1M rows) | 0.5-5 sec | similar | Depends on I/O bandwidth |
| Hash join (2 × 1M rows) | 1-10 sec | similar | Spill to disk if no memory |
| Index rebuild (100M rows) | 10-60 min | similar | Online rebuild: longer |
| Checkpoint flush stall | 0-5 sec | 0-5 sec | Tunable via log file size |
| Row lock acquisition | ~1 µs | ~1 µs | In-memory hash lookup |
| Deadlock detection | ~1 ms | ~1 ms | DFS cycle detection |
Summary — Data Flow Map¶
flowchart TD
SQL["SQL Query arrives"] --> PARSE["Parse → AST"]
PARSE --> OPT["Cost-based optimizer\nStatistics → access path"]
OPT --> EXEC["Iterator execution tree\nVolcano pull model"]
EXEC --> BP["Buffer Pool lookup\n(page granularity)"]
BP -->|"miss"| DISK["Disk read\n.ibd page → buffer pool"]
BP -->|"hit"| ROW["Row extraction\nslot array → row offset\nnull bitmap → field values"]
ROW -->|"MVCC"| UNDO["Undo chain walk\nReadView visibility check"]
UNDO --> RESULT["Return row to iterator"]
WRITE["DML: INSERT/UPDATE/DELETE"] --> UNDOW["Write undo log"]
UNDOW --> MODP["Modify buffer pool page\n(dirty)"]
MODP --> REDOW["Append redo log record"]
REDOW --> COMMIT["COMMIT → fsync redo log"]
COMMIT --> BGFLUSH["Background: page cleaner\nflushes dirty pages"]
The complete lifecycle of every byte: SQL text → AST → cost-estimated plan → iterator pull → buffer pool page → MVCC undo chain → field bytes. On write: undo log (before-image) → dirty page (in-memory) → redo log (WAL, durable) → background flush to .ibd. The WAL guarantee is the bedrock: data is durable the moment redo log is fsynced, regardless of whether the actual data page is on disk.