Computer Architecture Internals: From Gates to Pipeline Hazards¶
Under the Hood — How instructions flow through pipeline stages, how caches exploit spatial/temporal locality, how memory hierarchies hide latency, and how multiprocessors maintain coherence. Sources: Computer Organization and Architecture (Stallings, 9th ed.), The Architecture of Computer Hardware, System Software, and Networking (Englander), Digital Logic and Computer Design (Mano), PC Hardware: A Beginner's Guide, Code: The Hidden Language of Computer Hardware and Software (Petzold).
1. Instruction Pipeline: The 5-Stage RISC Model¶
The fundamental insight of pipelining: while one instruction executes, the next is being decoded, and the one after is being fetched. In the ideal, hazard-free case, throughput approaches 1 instruction/cycle asymptotically.
flowchart LR
subgraph CYCLE_1["Clock Cycle 1"]
IF1["IF\nInstr 1"]
end
subgraph CYCLE_2["Clock Cycle 2"]
ID2["ID\nInstr 1"]
IF2["IF\nInstr 2"]
end
subgraph CYCLE_3["Clock Cycle 3"]
EX3["EX\nInstr 1"]
ID3["ID\nInstr 2"]
IF3["IF\nInstr 3"]
end
subgraph CYCLE_4["Clock Cycle 4"]
MEM4["MEM\nInstr 1"]
EX4["EX\nInstr 2"]
ID4["ID\nInstr 3"]
IF4["IF\nInstr 4"]
end
subgraph CYCLE_5["Clock Cycle 5"]
WB5["WB\nInstr 1"]
MEM5["MEM\nInstr 2"]
EX5["EX\nInstr 3"]
ID5["ID\nInstr 4"]
IF5["IF\nInstr 5"]
end
CYCLE_1 --> CYCLE_2 --> CYCLE_3 --> CYCLE_4 --> CYCLE_5
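The fill/drain arithmetic shown above is easy to model directly. A minimal sketch, assuming an ideal hazard-free pipeline (function names are illustrative):

```python
def pipeline_cycles(n_instructions: int, stages: int = 5) -> int:
    """Total cycles on an ideal pipeline: `stages` cycles to fill,
    then one instruction completes per cycle."""
    return stages + (n_instructions - 1)

def speedup(n: int, stages: int = 5) -> float:
    """Speedup over an unpipelined machine spending `stages` cycles
    per instruction; approaches `stages` as n grows."""
    return (n * stages) / pipeline_cycles(n, stages)

# 5 instructions finish in 9 cycles: fill in cycles 1-5, drain through 9
print(pipeline_cycles(5))        # 9
print(round(speedup(10**6), 3))  # asymptotically approaches 5
```

This is why deep pipelines pay off only under sustained instruction flow: the fill/drain overhead is amortized across long runs but dominates short ones.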
Pipeline Register Content Between Stages¶
flowchart LR
PC["PC\nProgram Counter"] --> IF_STAGE["IF Stage\nInstruction Memory\nPC → IR"]
IF_STAGE --> IF_ID["IF/ID\nRegister\nIR, PC+4"]
IF_ID --> ID_STAGE["ID Stage\nRegister File Read\nControl Signal Decode"]
ID_STAGE --> ID_EX["ID/EX\nRegister\nRS, RT, Imm,\nControl bits"]
ID_EX --> EX_STAGE["EX Stage\nALU Operation\nAddress Calc"]
EX_STAGE --> EX_MEM["EX/MEM\nRegister\nALU_result,\nBranch target,\nZero flag"]
EX_MEM --> MEM_STAGE["MEM Stage\nData Memory\nLoad/Store"]
MEM_STAGE --> MEM_WB["MEM/WB\nRegister\nRead data,\nALU result"]
MEM_WB --> WB_STAGE["WB Stage\nWrite to\nRegister File"]
2. Pipeline Hazards: Data, Control, Structural¶
Data Hazard: RAW (Read After Write)¶
sequenceDiagram
participant IF
participant ID
participant EX
participant MEM
participant WB
Note over IF,WB: ADD R1, R2, R3 — writes R1 in WB (cycle 5)
Note over IF,WB: SUB R4, R1, R5 — reads R1 in ID (cycle 3) ← STALE!
rect rgb(80, 20, 20)
Note over ID: Read R1 before ADD writes it
Note over ID: RAW Hazard detected
end
Note over IF,WB: STALL: Insert 2 bubble NOPs (assumes the register file writes in the first half-cycle and is read in the second)
Note over IF,WB: OR: Forwarding from EX/MEM → EX input
Forwarding (bypassing) paths:
flowchart TD
EX_MEM_REG["EX/MEM Register\nALU_result (from ADD)"] -->|"Forward to EX input"| MUX_A["MUX A\nEX stage ALU input"]
MEM_WB_REG["MEM/WB Register\nALU_result or Mem_data"] -->|"Forward to EX input"| MUX_A
REG_FILE["Register File\nNormal read path"] --> MUX_A
MUX_A --> ALU["ALU\nCurrent instruction"]
HAZARD_UNIT["Hazard Detection Unit\nCompares EX/MEM.Rd, MEM/WB.Rd\nwith ID/EX.Rs, ID/EX.Rt"] -->|"ForwardA, ForwardB\nselect signals"| MUX_A
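The hazard unit's comparisons above can be sketched in a few lines. A minimal model of the classic MIPS ForwardA mux select, where the younger EX/MEM result takes priority over MEM/WB (the return-value encoding is illustrative):

```python
def forward_select(ex_mem_rd, ex_mem_regwrite,
                   mem_wb_rd, mem_wb_regwrite, id_ex_rs):
    """Select the source for one ALU input.
    R0 is hardwired to zero, so it is never forwarded."""
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == id_ex_rs:
        return "EX_MEM"     # bypass the just-computed ALU result
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == id_ex_rs:
        return "MEM_WB"     # bypass the older result about to retire
    return "REG_FILE"       # no hazard: normal register read

# ADD R1,R2,R3 just finished EX (result in EX/MEM);
# SUB R4,R1,R5 is now in EX needing R1 -> forward from EX/MEM
print(forward_select(ex_mem_rd=1, ex_mem_regwrite=True,
                     mem_wb_rd=0, mem_wb_regwrite=False,
                     id_ex_rs=1))   # EX_MEM
```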
Control Hazard: Branch Penalty¶
stateDiagram-v2
[*] --> Fetch_Branch: BEQ instruction fetched
Fetch_Branch --> Decode_Branch: Pipeline progresses
Decode_Branch --> Evaluate_Branch: EX stage - compute condition
Evaluate_Branch --> Branch_Taken: Zero=1, jump to target
Evaluate_Branch --> Branch_Not_Taken: Zero=0, continue
Branch_Taken --> Flush_2: Flush 2 instructions already fetched
Branch_Not_Taken --> Continue: No flush needed
Flush_2 --> [*]: 2-cycle branch penalty
Branch prediction strategies:
flowchart TD
subgraph STATIC["Static Prediction"]
ST1["Predict not-taken\nAlways fall through\nAccuracy: ~50%"]
ST2["Predict backward taken\nForward not taken\nLoop optimization\nAccuracy: ~65%"]
end
subgraph DYNAMIC["Dynamic Prediction (BTB + BHT)"]
BTB["Branch Target Buffer\nPC-indexed cache\nStores target address"]
BHT["Branch History Table\n2-bit saturating counter\n00=strong NT, 11=strong T"]
PHT["Pattern History Table\nGlobal/local history shift reg\nMaps history→prediction"]
BTB --> BHT --> PHT
end
subgraph TOURNAMENT["Tournament Predictor (modern)"]
LOCAL["Local predictor\nPer-branch history"]
GLOBAL["Global predictor\nGlobal history register"]
META["Meta-predictor\nChooses which to trust"]
LOCAL --> META
GLOBAL --> META
end
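The 2-bit saturating counter in the BHT can be modeled directly. A minimal sketch; the initial weakly-not-taken state is an arbitrary choice:

```python
class TwoBitPredictor:
    """One BHT entry: 0=strong NT, 1=weak NT, 2=weak T, 3=strong T."""
    def __init__(self):
        self.state = 1  # start weakly not-taken (arbitrary)

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool):
        # Saturating: never moves below 0 or above 3
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

pred = TwoBitPredictor()
outcomes = [True] * 9 + [False]   # a loop branch: taken 9 times, exits once
hits = 0
for actual in outcomes:
    hits += (pred.predict() == actual)
    pred.update(actual)
print(hits, "/", len(outcomes))   # 8 / 10
```

The 2-bit hysteresis is the point: a 1-bit predictor would mispredict twice per loop execution (once at exit, once on re-entry), while the saturating counter mispredicts only once here.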
3. Cache Architecture: SRAM Sets, Ways, Tags¶
flowchart TD
subgraph CACHE_STRUCT["4-Way Set-Associative Cache"]
INDEX["Address bits [11:6]\nIndex → Select set (64 sets)"]
TAG_BITS["Address bits [31:12]\nTag → Compare all 4 ways"]
OFFSET["Address bits [5:0]\nBlock offset (64B line)"]
subgraph SET_N["Set N (4 ways)"]
W0["Way 0\nValid | Tag | Data[64B]"]
W1["Way 1\nValid | Tag | Data[64B]"]
W2["Way 2\nValid | Tag | Data[64B]"]
W3["Way 3\nValid | Tag | Data[64B]"]
end
INDEX --> SET_N
TAG_BITS -->|"Compare"| W0 & W1 & W2 & W3
HIT["Tag match + Valid\n= Cache Hit\nReturn data[offset]"]
MISS["No match\n= Cache Miss\nFetch from L2/L3/DRAM"]
W0 & W1 & W2 & W3 --> HIT
W0 & W1 & W2 & W3 --> MISS
end
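The tag/index/offset split above is plain bit slicing. A sketch using the diagram's field boundaries; this geometry works out to 64 sets × 4 ways × 64 B = 16 KB:

```python
OFFSET_BITS = 6   # 64 B line  -> bits [5:0]
INDEX_BITS  = 6   # 64 sets    -> bits [11:6]

def split_address(addr: int):
    """Decompose a 32-bit address into (tag, index, offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index  = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag    = addr >> (OFFSET_BITS + INDEX_BITS)   # bits [31:12]
    return tag, index, offset

tag, index, offset = split_address(0x1234ABCD)
print(hex(tag), index, offset)   # 0x1234a 47 13
```

Note why the offset occupies the *low* bits: consecutive addresses then land in the same line, which is exactly how the cache exploits spatial locality.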
Cache Miss Penalty and Memory Hierarchy¶
flowchart LR
CPU["CPU\nRegisters\n~0 cycles"]
L1["L1 Cache\n32KB I + 32KB D\n4 cycles\nSRAM"]
L2["L2 Cache\n256KB unified\n12 cycles\nSRAM"]
L3["L3 Cache\n8MB shared\n40 cycles\nSRAM"]
DRAM["Main Memory\nDRAM\n100-300 cycles\n4ns access + tRCD + tCL"]
DISK["NVMe SSD\n50-100µs\nor HDD: 5-10ms"]
CPU -->|"L1 hit 95%"| L1
L1 -->|"L1 miss → L2"| L2
L2 -->|"L2 miss → L3"| L3
L3 -->|"LLC miss → DRAM"| DRAM
DRAM -->|"Page fault → Disk"| DISK
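The hierarchy's payoff can be quantified with the standard AMAT (Average Memory Access Time) recurrence: each level's miss penalty is the AMAT of the level below. The L1 hit rate comes from the diagram; the lower-level local miss rates here are illustrative assumptions, not measurements:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time = hit_time + miss_rate * miss_penalty."""
    return hit_time + miss_rate * miss_penalty

dram = 200                      # cycles (assumed round-trip)
l3 = amat(40, 0.50, dram)       # L3 local miss rate: assumed 50%
l2 = amat(12, 0.40, l3)         # L2 local miss rate: assumed 40%
l1 = amat(4, 0.05, l2)          # 95% L1 hit rate, per the diagram
print(round(l1, 1))             # 7.4
```

Despite a 200-cycle DRAM penalty, the average access costs only ~7 cycles: each level filters most of the misses from the level above.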
DRAM Timing Internals: Why Access is 100+ Cycles¶
sequenceDiagram
participant CPU
participant MC as Memory Controller
participant DRAM
CPU->>MC: Read address 0x1A2B3C4D
MC->>DRAM: ACTIVATE (RAS): Row 0x1A2B (tRCD=14ns)
Note over DRAM: Sense amplifiers latch the selected row into the row buffer
MC->>DRAM: READ (CAS): Column 0x3C (CL=14ns)
Note over DRAM: Column amplifiers drive data onto bus
DRAM-->>MC: Data burst (8 transfers × 8B = 64B cache line)
MC-->>CPU: Cache line loaded (total ~100ns = ~400 cycles @ 4GHz)
MC->>DRAM: PRECHARGE (tRP=14ns): Close row, precharge bitlines for the next ACTIVATE
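The timing parameters compose differently depending on row-buffer state, which is why DRAM latency is a range rather than a single number. A sketch using the diagram's figures (the 4 ns burst duration is an assumption):

```python
tRCD, tCL, tRP = 14, 14, 14   # ns, per the diagram above
burst = 4                     # ns for the 8-beat data burst (assumed)

row_hit      = tCL + burst                # row already open: CAS only
row_miss     = tRCD + tCL + burst         # bank idle: activate first
row_conflict = tRP + tRCD + tCL + burst   # wrong row open: precharge first

print(row_hit, row_miss, row_conflict)    # 18 32 46
```

Memory controllers reorder requests largely to turn row conflicts into row hits, a nearly 3x latency difference per access.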
4. Virtual Memory: TLB, Page Tables, Page Faults¶
flowchart TD
subgraph VA_TRANSLATION["Virtual Address Translation (x86-64 4-level paging)"]
VA["Virtual Address\n[63:48] sign extend\n[47:39] PML4 index\n[38:30] PDP index\n[29:21] PD index\n[20:12] PT index\n[11:0] page offset"]
CR3["CR3 register\nPhysical addr of PML4"]
PML4["PML4 Table\nPhysical page, 512 entries"]
PDP["PDP Table\nPhysical page, 512 entries"]
PD["Page Directory\nPhysical page, 512 entries"]
PT["Page Table\nPTE: Physical addr\n+ Present/RW/User/NX bits"]
PHYS["Physical Address\nPT.physical_page + offset"]
CR3 --> PML4
VA --> PML4 --> PDP --> PD --> PT --> PHYS
end
subgraph TLB["TLB (Translation Lookaside Buffer)"]
TLB_ENTRY["Fully associative SRAM\nVPN → PPN + flags\n~64 L1-TLB entries\n~1024 L2-TLB entries"]
TLB_HIT["TLB Hit\n~1 cycle"]
TLB_MISS["TLB Miss\nHardware Page Table Walk\n~100 cycles + DRAM accesses"]
TLB_ENTRY --> TLB_HIT
TLB_ENTRY --> TLB_MISS
end
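The index extraction for the 4-level walk is plain bit slicing over the field layout shown above. A minimal sketch:

```python
def walk_indices(va: int):
    """Extract the four 9-bit table indices and the 12-bit page offset
    from a 48-bit x86-64 virtual address."""
    offset = va & 0xFFF              # bits [11:0]
    pt     = (va >> 12) & 0x1FF      # bits [20:12]
    pd     = (va >> 21) & 0x1FF      # bits [29:21]
    pdp    = (va >> 30) & 0x1FF      # bits [38:30]
    pml4   = (va >> 39) & 0x1FF      # bits [47:39]
    return pml4, pdp, pd, pt, offset

print(walk_indices(0x7F123456789A))
```

Each 9-bit field indexes a 512-entry table of 8-byte entries, so every table fills exactly one 4 KB page: 512 × 8 B = 4096 B. That equality is what makes the 4-level scheme self-similar.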
Page Fault Handler Flow¶
sequenceDiagram
participant CPU
participant MMU
participant OS as OS Kernel
participant DISK
CPU->>MMU: Access virtual address VA
MMU->>MMU: TLB miss → page table walk
MMU->>MMU: PTE.Present = 0 → Page Not Present
MMU->>CPU: #PF Page Fault Exception (vector 14)
CPU->>OS: Save context, enter kernel mode
OS->>OS: Find VMA for VA (is access legal?)
OS->>DISK: Read page from swap/file (blocking I/O)
DISK-->>OS: Page data loaded
OS->>OS: Allocate physical frame
OS->>OS: Update PTE: physical addr + Present=1
OS->>MMU: Invalidate stale TLB entry (INVLPG locally; IPI-based shootdown on other cores)
OS->>CPU: Return from exception, retry instruction
5. CPU Microarchitecture: Out-of-Order Execution Internals¶
flowchart TD
subgraph FRONTEND["Front End"]
FETCH["Fetch Unit\nI-cache line\n→ 4-8 instructions"]
DECODE["Decode Unit\nx86 → micro-ops (µops)\n1 CISC → 1-4 µops"]
RENAME["Register Rename\nPhysical RF (168 entries)\nEliminate WAR/WAW hazards\nROB allocation"]
end
subgraph BACKEND["Back End (OoO Engine)"]
ROB["Reorder Buffer (ROB)\nCircular queue\nIn-order commit\n~224 entries"]
RS["Reservation Stations\n~97 entries\nWait for operands"]
DISPATCH["Dispatch\nIssue µops to\nexecution units\nwhen operands ready"]
subgraph EXEC_UNITS["Execution Units"]
ALU1["ALU 0\nInteger"]
ALU2["ALU 1\nInteger"]
FPU["FPU\nFloating Point\nAVX/SSE"]
LOAD["Load Unit\nD-cache read"]
STORE["Store Unit\nStore buffer"]
end
end
FETCH --> DECODE --> RENAME --> ROB & RS
RS --> DISPATCH --> EXEC_UNITS
EXEC_UNITS -->|"Complete\nWrite result"| ROB
ROB -->|"In-order retire\nCommit to ARF"| ARF["Architectural\nRegister File"]
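How renaming eliminates WAR/WAW hazards can be shown with a toy register alias table (RAT); the physical-register naming here is illustrative:

```python
def rename(instructions, n_arch=4):
    """Toy rename stage: map each architectural destination to a fresh
    physical register. Only true RAW dependencies survive.
    instructions: list of (dest, src1, src2) architectural reg numbers."""
    rat = {r: f"p{r}" for r in range(n_arch)}   # register alias table
    next_phys = n_arch
    out = []
    for dest, s1, s2 in instructions:
        ps1, ps2 = rat[s1], rat[s2]     # sources read the CURRENT mapping
        rat[dest] = f"p{next_phys}"     # destination gets a fresh register
        next_phys += 1
        out.append((rat[dest], ps1, ps2))
    return out

# R1 is written twice (WAW) and read in between (RAW);
# after renaming the two writes target different physical registers.
prog = [(1, 2, 3), (0, 1, 2), (1, 3, 3)]
print(rename(prog))
```

After renaming, the third instruction's write to R1 (now p6) no longer conflicts with the second instruction's read of the old R1 (p4), so the OoO engine may execute them in either order.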
6. Multiprocessor Cache Coherence: MESI Protocol¶
stateDiagram-v2
[*] --> Invalid: Cache line not cached
Invalid --> Exclusive: Local read miss\nLoad from memory\nNo other caches have it
Invalid --> Shared: Local read miss\nAnother cache also has it
Invalid --> Modified: Local write miss\nInvalidate others\nOwn copy exclusively
Exclusive --> Modified: Local write\nNo bus transaction needed
Exclusive --> Shared: Another CPU reads\nDowngrade to Shared
Exclusive --> Invalid: Another CPU writes\nBus invalidation
Shared --> Modified: Local write\nBus invalidates others\nBecome sole owner
Shared --> Invalid: Another CPU writes\nBus invalidation
Modified --> Invalid: Another CPU reads/writes\nWrite-back to memory first\nThen transition
Modified --> Shared: Another CPU reads\nWrite-back to memory\nShare clean copy
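The state machine above can be captured as a transition table. A minimal sketch; the event names are illustrative, and local hits in M/E/S that stay in place are omitted:

```python
# (state, event) -> next state, following the MESI transitions above.
# Local events: read_miss_*, write, write_miss.
# Snooped events: bus_read, bus_write (another CPU's traffic for this line).
MESI = {
    ("I", "read_miss_exclusive"): "E",   # no other cache has it
    ("I", "read_miss_shared"):    "S",   # another cache also has it
    ("I", "write_miss"):          "M",   # RFO: invalidate others
    ("E", "write"):               "M",   # silent upgrade, no bus traffic
    ("E", "bus_read"):            "S",
    ("E", "bus_write"):           "I",
    ("S", "write"):               "M",   # bus invalidates other sharers
    ("S", "bus_write"):           "I",
    ("M", "bus_read"):            "S",   # write back, share clean copy
    ("M", "bus_write"):           "I",   # write back, then invalidate
}

state = "I"
for event in ["read_miss_exclusive", "write", "bus_read", "bus_write"]:
    state = MESI[(state, event)]
    print(event, "->", state)
```

Walking that event sequence traces a full line lifetime: I→E→M→S→I. The E state exists precisely so the E→M "write" transition needs no bus transaction.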
Cache Coherence Bus Snooping¶
sequenceDiagram
participant CPU0
participant BUS as Shared Bus / Interconnect
participant CPU1
participant MEM as Main Memory
Note over CPU0,CPU1: CPU0 holds M state for cache line A
CPU1->>BUS: BusRd (Read) for line A
CPU0->>BUS: Snoop: see BusRd for MY Modified line
CPU0->>MEM: Write-back modified data (M→I or M→S)
MEM-->>CPU1: Supply clean data
Note over CPU0,CPU1: Both now in S state
CPU0->>BUS: BusRdX (Read Exclusive / Write intent)
BUS->>CPU1: Invalidation signal
CPU1->>CPU1: Transition S→I
Note over CPU0: Line already valid in S — BusRdX acts as an upgrade, no data transfer
Note over CPU0: CPU0 transitions S→M, now the sole writable owner
7. I/O Architecture: DMA, Interrupts, and Bus Protocols¶
flowchart TD
subgraph CPU_SIDE["CPU + Memory Side"]
CPU2["CPU Core"]
NORTH["Northbridge / MCH\nMemory Controller Hub"]
DRAM2["DRAM\nDIMM Slots"]
end
subgraph IO_SIDE["I/O Side"]
PCIE["PCIe Root Complex\nGen 4: 16 GT/s per lane\nx16 = 32 GB/s"]
SOUTH["Southbridge / ICH\nI/O Controller Hub"]
USB["USB 3.0\n5 Gbps"]
SATA["SATA / NVMe\nStorage"]
GBE["Gigabit Ethernet\n125 MB/s"]
end
CPU2 <--> NORTH
NORTH <--> DRAM2
NORTH <--> PCIE
PCIE <--> SOUTH
SOUTH <--> USB & SATA & GBE
subgraph DMA_FLOW["DMA Transfer Flow"]
DRV["Device Driver\nPrograms DMA descriptor:\nsrc addr, dst addr, length"]
DMAC["DMA Controller\nMaster on bus\nDirectly writes to DRAM"]
IRQ["Interrupt to CPU\nTransfer complete\nCPU picks up result"]
DRV --> DMAC --> DRAM2 --> IRQ --> CPU2
end
8. RISC vs. CISC: Decode Complexity Tradeoff¶
flowchart LR
subgraph CISC["CISC (x86)"]
C_INSTR["Variable-length instructions\n1-15 bytes\nMany addressing modes\nMemory operands allowed"]
C_DECODE["Complex Decoder\nMicrocode ROM\nCISC → µops translation\n4-6 decoder stages"]
C_UOPS["Internal µops\nRISC-like 3-operand\nHomogeneous execution"]
C_INSTR --> C_DECODE --> C_UOPS
end
subgraph RISC["RISC (ARM/MIPS/RISC-V)"]
R_INSTR["Fixed-length instructions\n32 bits (most)\nLoad/store architecture\nRegister operands only"]
R_DECODE["Simple Decoder\n1-2 cycles\nDirect to execution"]
R_EXEC["Execution Units\nDirectly execute\ndecoded instruction"]
R_INSTR --> R_DECODE --> R_EXEC
end
NOTE["Modern x86 CPUs internally\noperate as RISC µop machines\n— CISC is a compatibility\nencoding layer on top"]
9. Floating Point: IEEE 754 Internal Representation¶
flowchart LR
subgraph FP32["32-bit Float (single)"]
SIGN["Bit 31\nSign (1=neg)"]
EXP["Bits 30-23\n8-bit exponent\nBias=127"]
MANT["Bits 22-0\n23-bit mantissa\nImplied leading 1"]
end
subgraph FP64["64-bit Double"]
SIGN2["Bit 63\nSign"]
EXP2["Bits 62-52\n11-bit exponent\nBias=1023"]
MANT2["Bits 51-0\n52-bit mantissa"]
end
subgraph SPECIAL["Special Values"]
INF["Exponent=all 1s\nMantissa=0\n→ ±Infinity"]
NAN["Exponent=all 1s\nMantissa≠0\n→ NaN (quiet/signaling)"]
DENORM["Exponent=0\nMantissa≠0\n→ Denormal (subnormal)\nGradual underflow,\nreduced precision"]
ZERO["Exponent=0\nMantissa=0\n→ ±0"]
end
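The bit layouts above can be inspected directly by reinterpreting a float's bytes. A small sketch for the 64-bit (1/11/52) format using the standard library:

```python
import struct

def decompose(x: float):
    """Split a double into (sign, biased exponent, mantissa bits)
    per the 1/11/52 layout above."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign     = bits >> 63
    exponent = (bits >> 52) & 0x7FF
    mantissa = bits & ((1 << 52) - 1)
    return sign, exponent, mantissa

print(decompose(1.0))   # (0, 1023, 0): biased exponent 1023, implied leading 1
print(decompose(-2.5))  # sign=1; -2.5 = -1.25 x 2^1
s, e, m = decompose(float("inf"))
print(e == 0x7FF and m == 0)   # True: all-ones exponent, zero mantissa
```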
FPU Pipeline: Fused Multiply-Add (FMA)¶
flowchart LR
A["Operand A\nExponent + Mantissa"] --> MULTIPLY["Mantissa\nMultiply\n53×53 bit product"]
B["Operand B"] --> MULTIPLY
C["Operand C\nAddend"] --> ALIGN["Exponent\nAlignment\nRight-shift smaller"]
ALIGN --> ADDER["105-bit\nAdder"]
MULTIPLY --> ADDER
ADDER --> NORMALIZE["Normalize\nLeading 1\nShift left/right"]
NORMALIZE --> ROUND["Round\nRTZ/RNE/RUP/RDN\n(IEEE 754 modes)"]
ROUND --> RESULT["Result\nIEEE 754 float"]
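Why fusing matters: a separate multiply-then-add rounds the product *before* the addition, so low-order product bits can vanish; FMA rounds only once at the end. A sketch using exact rational arithmetic to expose the term a single rounding would keep (operand values chosen to force the effect):

```python
from fractions import Fraction

a = 1.0 + 2.0**-27            # exactly representable in binary64
b = 1.0 + 2.0**-27
c = -(1.0 + 2.0**-26)

# a*b = 1 + 2^-26 + 2^-54 exactly, but the 2^-54 term is below
# the ulp at 1.0 (2^-52), so rounding the product discards it.
separate = (a * b) + c                              # rounded twice
exact = Fraction(a) * Fraction(b) + Fraction(c)     # what FMA would keep

print(separate)                      # 0.0 — the residual vanished
print(exact == Fraction(1, 2**54))   # True — single rounding keeps it
```

This single-rounding property is why FMA hardware carries the full 105-bit product into the adder, and why compensated algorithms (e.g. exact dot products) depend on it.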
10. Computer Architecture Design Principles (Patterson & Hennessy)¶
mindmap
root((Architecture Design Principles))
Simplicity
Fixed instruction length
Regularity in ISA
Favor 3-register format
Common Case
Optimize frequent operations
Not the worst case
SPEC benchmarks guide
Smaller = Faster
Fewer registers → faster
Smaller caches → lower latency
But miss rate increases
Good Design
Compromise when needed
Branch delay slots
MIPS: filled by compiler
Parallelism
Instruction-level pipelining
Data-level SIMD
Thread-level multicore
Locality
Temporal: reuse recently accessed
Spatial: access nearby addresses
Cache exploits both
11. Modern CPU Microarchitecture Numbers (Intel Ice Lake / AMD Zen 3)¶
| Resource | Intel Ice Lake | AMD Zen 3 |
|---|---|---|
| ROB entries | 352 | 256 |
| Reservation stations | 160 | 128 |
| Physical int registers | 280 | 192 |
| L1-I cache | 32KB, 8-way | 32KB, 8-way |
| L1-D cache | 48KB, 12-way | 32KB, 8-way |
| L2 cache | 512KB, 16-way | 512KB, 8-way |
| L3 cache | 12MB, 12-way | 32MB, 16-way |
| Decode width | 5-wide | 4-wide (up to 6 µops) |
| Branch predictor | TAGE-based | Hybrid TAGE |
| FMA latency | 4 cycles | 4 cycles |
| L1-D latency | 5 cycles | 4 cycles |
| DRAM latency | ~70ns | ~70ns |
The relentless growth of the out-of-order window (ROB, reservation stations, physical registers) is the primary lever for extracting ILP from sequential code: each doubling of the window lets the scheduler find independent work across more loop iterations, function calls, and unrelated statement sequences.