Machine Learning & AI Internals: Under the Hood¶
Source synthesis: ML/AI reference books (comp 8, 195–196, 224–225, 233, 254–255, 263, 280, 371, 434, 444, 449–450, 458, 469) covering neural network backpropagation mechanics, gradient descent optimizers, transformer attention internals, convolutional network computation graphs, and GPU memory management.
1. Neural Network Forward Pass — Computation Graph¶
flowchart LR
subgraph Layer Math
X["Input x\n(batch_size × features)"]
W["Weight matrix W\n(features × hidden)"]
B["Bias b\n(hidden,)"]
Z["z = xW + b\n(matrix multiply + broadcast)"]
A["a = σ(z)\n(element-wise activation)"]
L["Loss L = CrossEntropy(a, y)"]
end
X -->|"GEMM on GPU\ncuBLAS sgemm"| Z
W --> Z
B -->|"broadcast"| Z
Z --> A
A --> L
subgraph MemLayout["Memory Layout (row-major)"]
W_mem["W: [f0_h0, f0_h1, ..., f0_hn,\n f1_h0, f1_h1, ...]"]
Cache["Cache efficiency:\nGEMM tiles fit in L2\nFP16: 2× throughput vs FP32\nTF32 Tensor Core (A100): 156 TFLOPS"]
end
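The forward graph above can be run end to end in a few lines of NumPy; the shapes, the ReLU standing in for a generic σ, and the toy softmax head are illustrative choices, not from the source:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, features, hidden, classes = 4, 8, 16, 3

x = rng.standard_normal((batch, features))
W = rng.standard_normal((features, hidden))
b = np.zeros(hidden)

z = x @ W + b                  # GEMM + bias broadcast -> (batch, hidden)
a = np.maximum(z, 0.0)         # element-wise activation (ReLU here)

W2 = rng.standard_normal((hidden, classes))     # toy classification head
logits = a @ W2
p = np.exp(logits - logits.max(axis=1, keepdims=True))
p /= p.sum(axis=1, keepdims=True)               # row-wise softmax
y = np.eye(classes)[rng.integers(0, classes, batch)]
loss = -np.mean(np.sum(y * np.log(p), axis=1))  # cross-entropy
```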
Tensor Core GEMM Pipeline¶
flowchart TD
subgraph A100 GPU GEMM
Global["Global Memory\n(HBM2: 2TB/s)\nMatrices A, B, C"]
Shared["Shared Memory (SRAM)\n(192 KB L1/shared per SM)\nA tile (16×16) + B tile (16×16)\n→ loaded once, reused K times"]
TensorCore["Tensor Core\nD = A×B + C\n(16×16×16 matrix op per warp-level MMA instruction)\nFP16 inputs → FP32 accumulate"]
Registers["Register File\nOutput tile accumulated here"]
WriteBack["Write C tile back to Global Memory"]
end
Global -->|"async memcpy (cp.async)"| Shared
Shared --> TensorCore --> Registers --> WriteBack --> Global
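The tiling idea can be sketched on the CPU (tile size and shapes are arbitrary here): each A/B tile is "loaded" once per K-step and reused for a whole partial product, which is what staging tiles in shared memory buys on the GPU.

```python
import numpy as np

def tiled_matmul(A, B, tile=16):
    """C = A @ B computed output-tile by output-tile, mimicking the
    shared-memory blocking in the pipeline above (pure-NumPy sketch)."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=np.float32)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            acc = np.zeros((min(tile, M - i), min(tile, N - j)), dtype=np.float32)
            for k in range(0, K, tile):
                # stage one A tile and one B tile, reuse for a full partial product
                acc += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
            C[i:i+tile, j:j+tile] = acc  # write the accumulated output tile back
    return C
```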
2. Backpropagation — Chain Rule Mechanics¶
flowchart BT
subgraph Forward pass
x --> z1["z1 = W1x+b1"] --> a1["a1 = ReLU(z1)"] --> z2["z2 = W2a1+b2"] --> y_hat["ŷ = softmax(z2)"] --> L["L = -Σ y log ŷ"]
end
subgraph Backward["Backward pass (chain rule)"]
dL -->|"∂L/∂z2 = ŷ - y\n(softmax+xentropy shortcut)"| dz2["∂L/∂z2"]
dz2 -->|"∂L/∂W2 = a1ᵀ · ∂L/∂z2"| dW2["grad W2"]
dz2 -->|"∂L/∂a1 = W2ᵀ · ∂L/∂z2"| da1["∂L/∂a1"]
da1 -->|"∂L/∂z1 = da1 ⊙ (z1>0)\n(ReLU gate)"| dz1["∂L/∂z1"]
dz1 -->|"∂L/∂W1 = xᵀ · ∂L/∂z1"| dW1["grad W1"]
end
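These chain-rule edges map one-to-one onto code. A NumPy sketch with made-up sizes, checked against a finite difference:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 4))          # batch of 5
y = np.eye(3)[rng.integers(0, 3, 5)]     # one-hot targets
W1, b1 = rng.standard_normal((4, 8)) * 0.1, np.zeros(8)
W2, b2 = rng.standard_normal((8, 3)) * 0.1, np.zeros(3)

# forward
z1 = x @ W1 + b1
a1 = np.maximum(z1, 0)
z2 = a1 @ W2 + b2
p = np.exp(z2 - z2.max(axis=1, keepdims=True))
p /= p.sum(axis=1, keepdims=True)
loss = -np.mean(np.sum(y * np.log(p), axis=1))

# backward (exactly the edges in the diagram)
dz2 = (p - y) / len(x)            # softmax+xentropy shortcut
dW2 = a1.T @ dz2
da1 = dz2 @ W2.T
dz1 = da1 * (z1 > 0)              # ReLU gate
dW1 = x.T @ dz1
```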
Autograd Tape (PyTorch)¶
flowchart TD
subgraph PyTorch Autograd
T1["Tensor x\n(leaf, requires_grad=True)\n.grad_fn = None\n.is_leaf = True"]
T2["z = x * W\n.grad_fn = MulBackward\n.saved_tensors = (x, W)"]
T3["a = z.relu()\n.grad_fn = ReluBackward\n.saved_tensors = (z,)"]
T4["L = a.sum()\n.grad_fn = SumBackward"]
T1 --> T2 --> T3 --> T4
subgraph Backward call
B1["L.backward()\n→ SumBackward.backward(grad=1.0)"]
B2["ReluBackward.backward(grad=1.0)\n→ grad *= (z > 0)"]
B3["MulBackward.backward(grad=...)\n→ grad_x = grad * W\n grad_W = grad * x"]
B4["x.grad += grad_x\nW.grad += grad_W"]
B1 --> B2 --> B3 --> B4
end
end
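A toy tape in the same spirit (illustrative Python, not PyTorch's actual implementation; real autograd does a topological sort over the graph rather than this naive recursion):

```python
import numpy as np

class Tensor:
    """Minimal tape node: each op records a backward closure plus its
    parents, mirroring grad_fn / saved_tensors in the diagram."""
    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)
        self.grad = np.zeros_like(self.data)
        self.grad_fn = None          # (backward_closure, parent_tensors)

    def mul(self, other):
        out = Tensor(self.data * other.data)
        def backward(g):             # saved tensors: self.data, other.data
            return g * other.data, g * self.data
        out.grad_fn = (backward, (self, other))
        return out

    def relu(self):
        out = Tensor(np.maximum(self.data, 0))
        def backward(g):             # saved tensor: self.data (for the mask)
            return (g * (self.data > 0),)
        out.grad_fn = (backward, (self,))
        return out

    def sum(self):
        out = Tensor(self.data.sum())
        shape = self.data.shape
        def backward(g):
            return (g * np.ones(shape),)
        out.grad_fn = (backward, (self,))
        return out

    def backward(self, g=1.0):
        if self.grad_fn is None:     # leaf: accumulate into .grad
            self.grad = self.grad + g
            return
        fn, parents = self.grad_fn
        for parent, pg in zip(parents, fn(g)):
            parent.backward(pg)

x = Tensor([[1.0, -2.0], [3.0, 0.5]])
W = Tensor([[2.0, 2.0], [2.0, 2.0]])
L = x.mul(W).relu().sum()
L.backward()                         # gradient flows only through x*W > 0
```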
3. Gradient Descent Optimizers — Internal State¶
flowchart LR
subgraph SGD with Momentum
m_t["m_t = beta*m_{t-1} + (1-beta)*grad\n(exponential moving avg of grad)"]
theta_t["theta_t = theta_{t-1} - alpha*m_t\n(beta=0.9, alpha=0.01)"]
State["State per param:\nm (momentum buffer)"]
end
subgraph Adam
g["g_t = grad L (gradient)"]
m1["m_t = beta1*m_{t-1} + (1-beta1)*g_t\n(1st moment, beta1=0.9)"]
v1["v_t = beta2*v_{t-1} + (1-beta2)*g_t^2\n(2nd moment, beta2=0.999)"]
bc["m_hat_t = m_t/(1-beta1^t)\nv_hat_t = v_t/(1-beta2^t)\n(bias correction for cold start)"]
up["theta_t = theta_{t-1} - alpha*m_hat_t/(sqrt(v_hat_t) + eps)\n(eps=1e-8, alpha=0.001)"]
g --> m1
g --> v1
m1 --> bc
v1 --> bc
bc --> up
State2["State per param:\nm, v (2× memory vs SGD)"]
end
subgraph AdamW["AdamW (decoupled weight decay)"]
WD["theta_t = theta_{t-1} - alpha*m_hat_t/(sqrt(v_hat_t)+eps) - alpha*lambda*theta_{t-1}\n(weight decay applied directly\nNOT through gradient)"]
Note2["L2 regularization via gradient:\n∇L += λθ (changes Adam moments)\nAdamW: decay separate from Adam update\n→ better generalization"]
end
4. Transformer — Self-Attention Internals¶
flowchart TD
subgraph Scaled Dot-Product Attention
X["Input X\n(seq_len × d_model)"]
Wq["W_Q (d_model × d_k)"]
Wk["W_K (d_model × d_k)"]
Wv["W_V (d_model × d_v)"]
Q["Q = XW_Q\n(seq_len × d_k)"]
K["K = XW_K"]
V["V = XW_V"]
Score["Attention scores:\nA = QKᵀ / √d_k\n(seq_len × seq_len)"]
Mask["+ Mask (causal: upper-triangle = -∞)"]
Softmax["Softmax(A) row-wise\n→ attention weights"]
Out["Output = Softmax(A) · V\n(seq_len × d_v)"]
X --> Q & K & V
Wq --> Q
Wk --> K
Wv --> V
Q --> Score
K --> Score
Score --> Mask --> Softmax --> Out
V --> Out
end
subgraph Multi-Head Attention
H["h heads in parallel\nhead_i = Attention(XW_Q_i, XW_K_i, XW_V_i)\nd_k = d_model/h (e.g. 64 for h=12, d_model=768)"]
Concat["Concat(head_1,...,head_h)\n→ (seq_len × d_model)"]
Wo["× W_O → final output"]
H --> Concat --> Wo
end
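Single-head scaled dot-product attention with a causal mask, directly from the diagram (NumPy, toy sizes):

```python
import numpy as np

def attention(X, Wq, Wk, Wv, causal=True):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    A = Q @ K.T / np.sqrt(d_k)                   # (seq, seq) scores
    if causal:
        # mask future positions: strictly upper triangle -> -inf
        A = np.where(np.tri(len(X), dtype=bool), A, -np.inf)
    A = np.exp(A - A.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)           # row-wise softmax
    return A @ V, A

rng = np.random.default_rng(0)
seq, d_model, d_k = 6, 16, 8
X = rng.standard_normal((seq, d_model))
Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) for _ in range(3))
out, weights = attention(X, Wq, Wk, Wv)
```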
FlashAttention — Memory-Efficient Tiling¶
flowchart LR
subgraph Naive Attention
S["S = QKᵀ\n(seq_len × seq_len) in HBM\nO(N²) memory: 1024²×4B = 4MB"]
P["P = softmax(S)\n(stored in HBM)"]
O["O = PV\n(HBM read+write)"]
S --> P --> O
Note1["HBM bandwidth bottleneck:\n3× full seq² materialization"]
end
subgraph FlashAttention
Tiles["Tile Q into blocks of Br rows\nTile K,V into blocks of Bc cols\nBr×Bc fits in SRAM"]
Online["Online softmax:\ntrack max(score) per row\nupdate running sum\n→ exact softmax without storing S"]
SRAM["All computation in SRAM\n(~20MB total on-chip SRAM on A100\nvs O(N²) HBM traffic)\n→ 2–4× faster for long sequences"]
Tiles --> Online --> SRAM
end
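The online-softmax trick can be demonstrated in isolation (one attention row; the block size is a hypothetical stand-in for Bc): stream over score blocks keeping only a running max, running sum, and running weighted output, and the result matches the full softmax exactly.

```python
import numpy as np

def online_softmax_weighted_sum(scores, values, block=4):
    """Exact softmax(scores) @ values without materializing the softmax row."""
    m = -np.inf                          # running max
    s = 0.0                              # running sum of exp(score - m)
    out = np.zeros(values.shape[1])
    for i in range(0, len(scores), block):
        sc, va = scores[i:i+block], values[i:i+block]
        m_new = max(m, sc.max())
        corr = np.exp(m - m_new)         # rescale old running statistics
        s = s * corr + np.exp(sc - m_new).sum()
        out = out * corr + np.exp(sc - m_new) @ va
        m = m_new
    return out / s
```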
5. Convolutional Neural Network — Computation Path¶
flowchart TD
subgraph Conv Layer Internals
Input["Input: (N, C_in, H, W)\ne.g. (32, 3, 224, 224) = 32 RGB images"]
Filter["Filter: (C_out, C_in, kH, kW)\ne.g. (64, 3, 3, 3) = 64 filters, 3×3"]
Im2Col["im2col transform:\nunfold each filter position\n→ (N×H_out×W_out) × (C_in×kH×kW)\ne.g. (32×112×112) × (3×3×3)\n(3×3 filter, stride 2, pad 1 → 112)"]
GEMM2["GEMM:\n(C_out) × (C_in×kH×kW)\n× (C_in×kH×kW) × (N×H_out×W_out)\n→ (C_out) × (N×H_out×W_out)"]
Output["Output: (N, C_out, H_out, W_out)\nH_out = (H - kH + 2p)/s + 1"]
end
Input --> Im2Col --> GEMM2 --> Output
Filter --> GEMM2
subgraph BN + ReLU Fusion
BN["BatchNorm:\nμ_B = mean(x), σ²_B = var(x)\nx̂ = (x - μ_B)/√(σ²_B+ε)\ny = γx̂ + β\n(γ,β learned; μ,σ² running averages for inference)"]
ReLU2["ReLU: max(0, y)\n(CUDNN fuses BN+ReLU → single kernel\nno intermediate tensor allocation)"]
BN --> ReLU2
end
6. GPU Memory Hierarchy & Tensor Management¶
flowchart TD
subgraph A100 Memory Hierarchy
HBM["HBM2 (80GB)\n2TB/s bandwidth\n~400 cycles latency\nParameters, activations, gradients"]
L2["L2 Cache (40MB)\n~5TB/s\nShared across all SMs"]
Shared["Shared Memory / L1\n(192KB per SM, 108 SMs)\n~19TB/s\nTiled GEMM, reduction"]
Registers["Register File\n(65536 × 32-bit per SM)\nfast local variables"]
end
Registers --> Shared --> L2 --> HBM
subgraph Alloc["GPU Memory Allocation (PyTorch)"]
Allocator["PyTorch caching allocator\n(cudaMalloc reserved pool)"]
Block["Block sizes: power-of-2 bins\n(512B → 1KB → ... → 1GB)\nreuse without cudaMalloc overhead"]
Frag["Fragmentation: large allocations\nsplit from pool; small blocks recycled"]
GC["torch.cuda.empty_cache()\nreturns pool blocks to CUDA\n(doesn't free reserved memory)"]
end
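The binning idea in miniature (a deliberate simplification; the real caching allocator also splits blocks and keeps separate small/large pools):

```python
def bin_size(nbytes, smallest=512):
    """Round a request up to the next power-of-two bin so freed blocks
    can be reused without another cudaMalloc (sketch only)."""
    size = smallest
    while size < nbytes:
        size *= 2
    return size
```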
Mixed Precision Training (FP16 + FP32)¶
flowchart LR
subgraph AMP["AMP (Automatic Mixed Precision)"]
FP32_P["FP32 Master Weights\n(optimizer state)"]
FP16_P["FP16 Weights\n(half precision copy\nfor forward/backward)"]
FP16_A["FP16 Activations\n(2× memory saving)"]
FP32_G["FP32 Gradients\n(accumulate in FP32\nto prevent underflow)"]
GradScaler["GradScaler\nscale = 2^16 initial\nscale loss before backward\nunscale before optimizer step\n→ prevents FP16 gradient underflow"]
end
FP32_P -->|"cast to FP16"| FP16_P
FP16_P --> FP16_A
FP16_A -->|"backward → FP16 grad"| FP32_G
GradScaler --> FP32_G
FP32_G -->|"optimizer updates"| FP32_P
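Why the scaler exists, in four lines (NumPy; 1e-8 is an arbitrary gradient chosen to sit below FP16's smallest subnormal, ~6e-8):

```python
import numpy as np

true_grad = 1e-8                          # below FP16's subnormal range
naive = np.float16(true_grad)             # underflows to 0.0 -> update lost
scale = 2.0 ** 16
scaled = np.float16(true_grad * scale)    # ~6.55e-4: representable in FP16
recovered = np.float32(scaled) / scale    # unscale in FP32 before the optimizer step
```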
7. Large Language Model — KV Cache Internals¶
flowchart TD
subgraph Autoregressive Generation
Prompt["Prompt tokens [t1, t2, t3, t4]"]
Prefill["Prefill phase:\nall prompt tokens processed together\n→ K,V matrices computed and cached\nfor all prompt positions"]
KVCache["KV Cache (per layer, per head):\nK_cache: (seq_len, n_heads, d_k)\nV_cache: (seq_len, n_heads, d_v)\nGPU memory: ~2×seq×layers×heads×d_k×2B"]
Decode["Decode phase (token by token):\nnew token t5 → compute Q5\nattend to cached K[1..4] + K5\n→ only 1 new K,V per step\n→ O(1) compute per new token"]
Next["Generate t5, t6, t7... autoregressively"]
end
Prompt --> Prefill --> KVCache --> Decode --> Next
subgraph KV Cache Memory Budget
Budget["Llama-2 70B, seq=4096:\nlayers=80, heads=8 (GQA), d_k=128\nK+V = 2×4096×80×8×128×2B = 1.3GB\nFull model: 140GB → KV cache: ~1% overhead"]
GQA["Grouped Query Attention (GQA):\nn_kv_heads < n_q_heads\n(e.g. 8 KV heads shared by 64 Q heads)\n→ 8× smaller KV cache vs MHA"]
end
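The memory-budget arithmetic from the diagram as a function (FP16 assumed via bytes_per_elem=2):

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, d_head, bytes_per_elem=2):
    """K and V caches across all layers (the leading 2 = K plus V)."""
    return 2 * seq_len * n_layers * n_kv_heads * d_head * bytes_per_elem

# the Llama-2 70B figure from the diagram (GQA: 8 KV heads)
gqa = kv_cache_bytes(4096, 80, 8, 128)    # ~1.34 GB
mha = kv_cache_bytes(4096, 80, 64, 128)   # 8x larger without GQA
```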
8. Convolutional Backprop — Gradient Flow¶
sequenceDiagram
participant Loss
participant Conv2 as Conv Layer 2
participant Pool as MaxPool
participant Conv1 as Conv Layer 1
Loss->>Conv2: ∂L/∂output (upstream gradient)
Note over Conv2: ∂L/∂W2 = im2col(input)ᵀ · ∂L/∂output<br/>(weight gradient: correlate input with upstream grad)
Note over Conv2: ∂L/∂input = ∂L/∂output ⋆ flip(W2)<br/>(full convolution with flipped filter)
Conv2->>Pool: ∂L/∂pool_output
Note over Pool: MaxPool backward:<br/>route gradient only through max-index positions<br/>(switch stored during forward pass)
Pool->>Conv1: ∂L/∂conv1_output
Note over Conv1: same pattern: ∂L/∂W1, ∂L/∂x
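MaxPool's gradient routing in isolation (NumPy sketch, 2×2 windows, stride 2): the forward pass stores a boolean "switch" per window, and backward sends each upstream gradient only to the stored argmax position.

```python
import numpy as np

def maxpool2x2_forward(x):
    """2x2 max pool; also returns the argmax 'switches' for backward routing."""
    H, W = x.shape
    out = np.zeros((H // 2, W // 2))
    switch = np.zeros_like(x, dtype=bool)
    for i in range(0, H, 2):
        for j in range(0, W, 2):
            win = x[i:i+2, j:j+2]
            k = np.unravel_index(np.argmax(win), (2, 2))
            out[i // 2, j // 2] = win[k]
            switch[i + k[0], j + k[1]] = True
    return out, switch

def maxpool2x2_backward(grad_out, switch):
    """Route each upstream gradient only to the position that was the max."""
    grad_in = np.zeros(switch.shape)
    H, W = switch.shape
    for i in range(0, H, 2):
        for j in range(0, W, 2):
            win = switch[i:i+2, j:j+2]
            grad_in[i:i+2, j:j+2][win] = grad_out[i // 2, j // 2]
    return grad_in
```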
9. Decision Tree & Random Forest Internals¶
flowchart TD
subgraph CART["CART Algorithm (Classification)"]
Root["Root node: all N samples"]
Split["Find best split:\nfor each feature f, threshold t:\n Gini = Σ p_k(1-p_k) per child\n impurity decrease = Gini_parent - weighted avg child Gini\n→ pick (f*, t*) maximizing the decrease\n(= minimizing weighted child Gini)"]
Left["Left child: {x | x_f* < t*}"]
Right["Right child: {x | x_f* ≥ t*}"]
Stop["Stop: max_depth reached\nor min_samples_leaf\nor all samples same class"]
Leaf["Leaf: predict majority class\n(or mean for regression)"]
Root --> Split --> Left & Right
Left --> Stop --> Leaf
end
subgraph Random Forest Ensemble
Bootstrap["Bootstrap sampling:\nn_estimators=100 trees\neach trained on random sample\nwith replacement (63.2% unique)"]
FeatureSample["Feature sampling per split:\nmax_features = √p (classification)\nmax_features = p/3 (regression)\n→ decorrelates trees"]
Aggregate["Aggregation:\nclassification: majority vote\nregression: mean\n→ variance reduction (Var[mean] = Var[X]/n)"]
Bootstrap --> FeatureSample --> Aggregate
end
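The split search on a single feature, as a sketch (NumPy; toy data, exhaustive threshold scan as in CART):

```python
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)          # equals sum p_k(1 - p_k)

def best_split(x, y):
    """Scan candidate thresholds on one feature; pick the one minimizing
    the size-weighted child Gini (equivalently, maximizing impurity decrease)."""
    best_t, best_score = None, np.inf
    for t in np.unique(x)[1:]:
        left, right = y[x < t], y[x >= t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
t, score = best_split(x, y)              # perfectly separable toy data
```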
10. Support Vector Machine — Kernel Trick Internals¶
flowchart LR
subgraph SVM Hard Margin
Sep["Find hyperplane w·x + b = 0\nmaximize margin = 2/‖w‖\nsubject to: y_i(w·x_i + b) ≥ 1"]
Primal["Primal: min ½‖w‖²\n(quadratic programming)"]
Dual["Dual (Lagrangian):\nmax Σα_i - ½ΣΣα_i α_j y_i y_j (x_i·x_j)\nsubject to: α_i ≥ 0, Σα_i y_i = 0\n→ only dot products appear!"]
end
subgraph Kernel Trick
Linear["K(x,z) = x·z\n(linear SVM)"]
Poly["K(x,z) = (x·z + c)^d\n(polynomial: implicit degree-d features)"]
RBF["K(x,z) = exp(-γ‖x-z‖²)\n(RBF: infinite-dim feature space\n→ always linearly separable)"]
Note1["Replace x_i·x_j → K(x_i,x_j)\nNever compute Φ(x) explicitly\n→ dual uses only pairwise kernel evaluations"]
end
subgraph KKT Conditions
KKT["α_i > 0 → support vector (on margin)\nα_i = 0 → interior point (not on margin)\nPrediction: f(x) = sign(Σ α_i y_i K(x_i, x) + b)"]
end
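The kernel trick can be sanity-checked numerically: for the homogeneous polynomial kernel K(x,z) = (x·z)² on 2-D inputs, the implicit feature map is Φ(x) = (x₁², √2·x₁x₂, x₂²), and the kernel equals Φ(x)·Φ(z) without ever building Φ explicitly (this particular map is a standard textbook example, not from the source diagram):

```python
import numpy as np

def K(x, z):
    """Degree-2 homogeneous polynomial kernel: one dot product, squared."""
    return (x @ z) ** 2

def phi(x):
    """The explicit 3-D feature map that K computes implicitly."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
```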
11. Embedding Space — Word2Vec Skip-gram Internals¶
flowchart TD
subgraph Skip-gram Training
Center["Center word: 'cat'\n(one-hot: index 2000 of 50000 vocab)"]
WIn["W_in: (vocab × embed_dim)\nrow 2000 = embedding of 'cat'"]
Hidden["hidden = W_in[2000]\n(embed_dim=300 vector)"]
WOut["W_out: (embed_dim × vocab)"]
Scores["scores = hidden · W_out\n(50000 dot products)"]
Softmax2["softmax → P(context | center)"]
Loss2["loss = -log P('sits'|'cat') - log P('on'|'cat') ...\n(for window=2)"]
end
Center --> WIn --> Hidden --> Scores --> Softmax2 --> Loss2
subgraph Negative Sampling Optimization
NS["Full softmax: 50,000 outputs per step\nNegative sampling:\n- 1 positive context word (gradient push)\n- k=5 negative samples (from a noise distribution)\n- objective: maximize P(positive), minimize P(negatives)\n→ per-step output cost drops from V=50,000 to k+1=6 (~8,000×)"]
Dist["Noise distribution: P_n(w) ∝ U(w)^0.75\n(unigram frequency raised to 3/4\nsmooths toward rare words)"]
end
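The negative-sampling objective for one center word, as a sketch (NumPy; vectors are made-up stand-ins for rows of W_in/W_out):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def neg_sampling_loss(h, pos_vec, neg_vecs):
    """Skip-gram with negative sampling: push the positive context's score
    up and k noise words' scores down -- k+1 sigmoids instead of a
    vocab-sized softmax."""
    pos = np.log(sigmoid(h @ pos_vec))               # positive pair
    neg = np.sum(np.log(sigmoid(-neg_vecs @ h)))     # k negative pairs
    return -(pos + neg)

h_good = np.array([1.0, 0.0])                 # hidden vector aligned with context
pos = np.array([5.0, 0.0])                    # positive context vector
negs = np.array([[-5.0, 0.0], [0.0, 5.0]])    # k=2 noise vectors
```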
12. Model Quantization — INT8 / INT4 Internals¶
flowchart TD
subgraph PTQ["Post-Training Quantization (PTQ)"]
FP32_W["FP32 weights\n(e.g. 1.23, -0.45, 2.67)"]
CalibRange["Calibration:\nrun representative data\nmeasure min/max or percentile\nper-channel or per-tensor"]
Scale["scale = (max - min) / 255\nzero_point = round(-min / scale)\n(asymmetric INT8)"]
INT8_W["INT8 weights\nw_q = clamp(round(w/scale + zp), 0, 255)\n(4× memory reduction vs FP32)"]
DequantGEMM["GEMM in INT8:\naccumulate in INT32\n→ dequantize output back to FP32/BF16"]
end
FP32_W --> CalibRange --> Scale --> INT8_W --> DequantGEMM
subgraph GPTQ["GPTQ (Layer-wise Quantization)"]
Hessian["Compute Hessian H = 2XX^T\n(second-order info about weight sensitivity)"]
OBQ["Optimal Brain Quantization:\nquantize one weight at a time\ncompensate remaining weights:\nδW = -(q_err / [H⁻¹]_ii) · [H⁻¹]_{i,:}\n→ minimize quantization error row by row"]
INT4["INT4 weights (half the bits of INT8)\n8× memory reduction vs FP32\nGPTQ 4-bit LLM: 70B model fits in 2×A100 40GB"]
end
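The asymmetric INT8 scheme from the diagram as a round trip (NumPy; per-tensor min/max calibration, toy weights):

```python
import numpy as np

def quantize_int8(w):
    """Asymmetric per-tensor INT8 quantization: FP32 -> uint8 codes."""
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / 255.0
    zp = round(-lo / scale)                       # zero point maps 0.0 exactly-ish
    q = np.clip(np.round(w / scale + zp), 0, 255).astype(np.uint8)
    return q, scale, zp

def dequantize(q, scale, zp):
    return (q.astype(np.float32) - zp) * scale

w = np.array([1.23, -0.45, 2.67, 0.0], dtype=np.float32)
q, scale, zp = quantize_int8(w)
w_hat = dequantize(q, scale, zp)
# round-trip error is bounded by (about) one quantization step
```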
13. Training Pipeline — Data Flow¶
sequenceDiagram
participant Disk as Dataset (disk)
participant Loader as DataLoader (worker processes)
participant Pin as Pinned Memory Buffer
participant GPU as GPU (CUDA)
participant Model as Forward Pass
participant Optim as Optimizer
Disk->>Loader: read batch (multiprocessing, num_workers=4)
Loader->>Loader: augment (RandomCrop, Normalize, etc.)
Loader->>Pin: copy to pinned (page-locked) memory
Note over Pin,GPU: async transfer: CUDA DMA engine<br/>(overlaps with previous GPU compute)
Pin->>GPU: cudaMemcpyAsync (H2D, CUDA stream)
GPU->>Model: forward pass (CUDA kernels dispatched)
Model->>Model: loss computation
Model->>GPU: backward pass (autograd)
GPU->>Optim: gradients in GPU memory
Optim->>GPU: weight update (in-place on GPU)
Note over Loader,GPU: Double buffering:<br/>while GPU processes batch N,<br/>CPU loads batch N+1 → overlap I/O + compute
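The double-buffering idea in miniature (pure Python with a worker thread; a deliberate simplification of DataLoader's multi-process pipeline, and the queue depth is an arbitrary choice):

```python
import queue
import threading

def prefetching_loader(batches, depth=2):
    """Generator that stages batch N+1 on a worker thread while the consumer
    processes batch N -- the CPU/GPU overlap from the diagram, in miniature."""
    q = queue.Queue(maxsize=depth)
    DONE = object()

    def worker():
        for b in batches:
            q.put(b)            # blocks once `depth` batches are staged
        q.put(DONE)

    threading.Thread(target=worker, daemon=True).start()
    while (b := q.get()) is not DONE:
        yield b
```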
14. Performance Numbers¶
block-beta
columns 2
block:gpu_compute["GPU Compute (A100 SXM)"]:1
g1["FP16 Tensor Core: 312 TFLOPS"]
g2["BF16 Tensor Core: 312 TFLOPS"]
g3["INT8 Tensor Core: 624 TOPS"]
g4["FP32 (non-TC): 19.5 TFLOPS"]
end
block:model_size["Typical Model Sizes"]:1
m1["BERT-base: 110M params, 440MB FP32"]
m2["GPT-2 (1.5B): 6GB FP32, 3GB FP16"]
m3["Llama-2 70B: 280GB FP32, 140GB FP16, 35GB INT4"]
m4["ViT-L/16: 307M params, 1.2GB FP32"]
end
block:training["Training Throughput"]:1
t1["A100 (FP16): ~2000 tokens/sec (GPT-2)"]
t2["FlashAttention 2: 70% of peak A100 MFU"]
t3["DDP 8-GPU: ~7.5× speedup (communication overhead)"]
t4["Gradient accumulation: simulate 8× batch with 1 GPU"]
end
block:inference["Inference Latency"]:1
i1["KV cache reuse: O(1) per new token"]
i2["Prefill bottleneck: compute-bound (GEMM)"]
i3["Decode bottleneck: memory-bound (KV cache BW)"]
i4["Speculative decoding: 2–4× speedup (draft model)"]
end
Key Takeaways¶
- Autograd tape builds a DAG of `grad_fn` nodes during the forward pass; `.backward()` traverses it in reverse, applying the chain rule at each node using saved tensors
- Adam maintains 2 moment buffers per parameter (m, v) — memory cost is 3× weights (weights + m + v), vs SGD which is 1×; AdamW decouples L2 decay to avoid contaminating moment estimates
- FlashAttention avoids materializing the O(N²) attention matrix in HBM by tiling and running online softmax — critical for long sequences (>2K tokens)
- KV cache enables O(1) decode steps — memory grows linearly with sequence length; GQA reduces KV heads to save cache memory at inference
- INT8 GEMM accumulates in INT32 to prevent overflow, then dequantizes — the scale/zero-point are computed per-channel for better accuracy than per-tensor
- im2col + GEMM is how convolutions are actually implemented — converts the sliding window operation into a large matrix multiply, enabling Tensor Core acceleration
- Negative sampling in Word2Vec replaces the O(vocab) softmax with O(k) sigmoid evaluations — it optimizes a different (NCE-style) objective rather than an exact equivalent, but preserves embedding quality while cutting per-step output cost by roughly vocab/(k+1)