Mathematical Computing Internals: Numerical Methods, Linear Algebra & Optimization¶
Under the Hood: How floating-point arithmetic loses precision, how matrix decompositions factor linear systems, how gradient descent navigates loss landscapes, how FFT reduces O(N²) to O(N log N), how numerical integration accumulates error — the exact algorithms, mathematical guarantees, and computational mechanics.
1. Floating-Point Arithmetic: IEEE 754 Internals¶
flowchart TD
subgraph "IEEE 754 Double Precision (64-bit)"
BITS["64 bits:\n bit 63: sign (1 bit)\n bits 52-62: exponent (11 bits) biased by 1023\n bits 0-51: mantissa / fraction (52 bits)"]
VALUE["value = (-1)^sign × 2^(exp-1023) × 1.mantissa\n(implicit leading 1 in normalized form)"]
RANGE["Range: ±5×10⁻³²⁴ to ±1.8×10³⁰⁸\nPrecision: ~15-17 significant decimal digits\nEpsilon: 2⁻⁵² ≈ 2.22×10⁻¹⁶"]
BITS --> VALUE --> RANGE
end
subgraph "Special Values"
INF["+∞: exp=all_ones, mantissa=0\n1/0 = +∞ (not exception)"]
NAN["NaN: exp=all_ones, mantissa≠0\n0/0 = NaN, sqrt(-1) = NaN\nNaN ≠ NaN (always!)"]
DENORM["Denormalized: exp=0\n→ implicit leading 0 instead of 1\nGradual underflow near 0"]
INF --> NAN --> DENORM
end
Catastrophic Cancellation¶
flowchart TD
subgraph "Example: Quadratic Formula"
EQ["x = (-b ± sqrt(b²-4ac)) / 2a\nb=1000, a=0.001, c=0.001"]
NAIVE["Naive: x₁ = (-1000 + sqrt(1000000-0.000004)) / 0.002\n = (-1000 + 999.999999998) / 0.002\n = -2×10⁻⁹ / 0.002 ← subtraction cancels ~12 of 16 digits\n only a few significant digits survive"]
STABLE["Stable: x₂ = (-b - sqrt(b²-4ac)) / 2a\n(same signs add: no cancellation)\nthen x₁ = c / (a × x₂) (from Vieta's formulas)\n = 0.001 / (0.001 × -999999.999999)\n ≈ -1×10⁻⁶ ← full precision"]
EQ --> NAIVE
EQ --> STABLE
end
Machine epsilon: The smallest ε such that 1.0 + ε ≠ 1.0 in floating point. For doubles, ε ≈ 2.22×10⁻¹⁶. Adding 1e16 to 1.0 gives 1e16 (the 1 is lost!).
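Both effects are easy to reproduce. A short sketch in Python (the quadratic coefficients are the ones from the diagram; `quadratic_roots` is an illustrative helper, not a library function):

```python
import math

# Machine epsilon: halve until 1.0 + eps/2 rounds back to 1.0.
eps = 1.0
while 1.0 + eps / 2 != 1.0:
    eps /= 2
# eps == 2**-52 ≈ 2.22e-16 for float64

# Absorption: near 1e16 the spacing between doubles is 2.0, so +1.0 vanishes.
assert 1e16 + 1.0 == 1e16

def quadratic_roots(a, b, c):
    """Cancellation-free quadratic solver: compute the large-magnitude
    root first (same signs add), recover the small root via Vieta's
    x1 * x2 = c / a."""
    d = math.sqrt(b * b - 4 * a * c)
    x2 = (-b - math.copysign(d, b)) / (2 * a)  # no cancellation here
    x1 = c / (a * x2)                          # Vieta's formulas
    return x1, x2

x1, x2 = quadratic_roots(0.001, 1000.0, 0.001)
# x1 ≈ -1e-6 (the root the naive formula mangles), x2 ≈ -1e6
```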
2. Matrix Decompositions: LU, QR, SVD¶
LU Decomposition (Gaussian Elimination)¶
Factors matrix A = L × U where L is lower triangular and U is upper triangular. The factorization costs O(N³); each subsequent solve of Ax = b is then only O(N²) via forward/back substitution.
flowchart LR
subgraph "LU Factorization: Gaussian Elimination"
A["A = [4 3]\n [6 3]"]
ELIM["Elimination step:\n m₂₁ = a₂₁/a₁₁ = 6/4 = 1.5\n Row 2 -= 1.5 × Row 1:\n [6,3] - 1.5×[4,3] = [0, -1.5]"]
LU["L = [1 0 ]\n [1.5 1 ]\nU = [4 3 ]\n [0 -1.5]"]
SOLVE["Solve Ax=b: LUx=b\n1. Ly=b (forward sub, O(N²))\n2. Ux=y (backward sub, O(N²))"]
A --> ELIM --> LU --> SOLVE
end
Partial Pivoting: Numerical Stability¶
flowchart TD
PROB["Without pivoting:\n small pivot a₁₁ → large multiplier m\n amplifies rounding errors\n Condition number blows up"]
FIX["Partial Pivoting:\n before each elimination step,\n swap rows to maximize |pivot|\n Permutation matrix P tracks swaps\n PA = LU"]
PROB --> FIX
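A sketch of PA = LU with partial pivoting in numpy (illustrative, not a production routine; because rows are swapped, the factors differ from the unpivoted 2×2 example above, but P·A = L·U still holds):

```python
import numpy as np

def lu_partial_pivot(A):
    """PA = LU via Gaussian elimination with partial pivoting.
    Returns (P, L, U) with P a permutation matrix."""
    A = A.astype(float).copy()
    n = A.shape[0]
    P = np.eye(n)
    L = np.eye(n)
    for k in range(n - 1):
        # Swap rows so the largest |entry| in column k becomes the pivot.
        p = k + np.argmax(np.abs(A[k:, k]))
        if p != k:
            A[[k, p]] = A[[p, k]]
            P[[k, p]] = P[[p, k]]
            L[[k, p], :k] = L[[p, k], :k]  # carry already-computed multipliers
        for i in range(k + 1, n):
            m = A[i, k] / A[k, k]     # pivoting guarantees |m| <= 1
            L[i, k] = m
            A[i, k:] -= m * A[k, k:]
    return P, L, np.triu(A)

A = np.array([[4.0, 3.0], [6.0, 3.0]])
P, L, U = lu_partial_pivot(A)
# P @ A == L @ U, and all multipliers in L are bounded by 1
```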
SVD: Singular Value Decomposition¶
A = U × Σ × Vᵀ — the fundamental matrix decomposition:
flowchart TD
subgraph "SVD Components"
U["U: m×m orthogonal\n(left singular vectors\n= principal directions of output space)"]
SIGMA["Σ: m×n diagonal\n(singular values σ₁≥σ₂≥...≥σₙ≥0\n= importance of each direction)"]
VT["Vᵀ: n×n orthogonal\n(right singular vectors\n= principal directions of input space)"]
U --> SIGMA --> VT
end
subgraph "Applications"
PCA["PCA: SVD of data matrix\n(data - mean)\nSingular vectors = principal components\nσᵢ² proportional to variance explained"]
LSQ["Least Squares: Ax≈b\nx = V × Σ⁺ × Uᵀ × b\n(pseudoinverse: replace σᵢ with 1/σᵢ\nif σᵢ > threshold, else 0)"]
COMPRESS["Image compression:\nkeep top k singular values\nA_k = Σᵢ₌₁ᵏ σᵢ uᵢ vᵢᵀ\nRank-k approximation\n(Eckart-Young theorem: optimal)"]
PCA --> LSQ --> COMPRESS
end
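The rank-k approximation from the Eckart-Young theorem is a one-liner with `np.linalg.svd`. A sketch on synthetic data (a rank-2 matrix plus small noise, so the top-2 truncation should recover it almost exactly):

```python
import numpy as np

rng = np.random.default_rng(0)
# Rank-2 matrix plus tiny noise: truncated SVD at k=2 recovers it well.
B = rng.standard_normal((8, 2)) @ rng.standard_normal((2, 6))
A = B + 1e-6 * rng.standard_normal((8, 6))

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # s sorted descending
k = 2
A_k = U[:, :k] * s[:k] @ Vt[:k]   # A_k = sum_{i<k} s_i u_i v_i^T

# Frobenius error of the optimal rank-k approximation equals
# sqrt of the sum of the squared discarded singular values.
err = np.linalg.norm(A - A_k)
```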
3. Fast Fourier Transform: Butterfly Network¶
DFT: X[k] = Σₙ₌₀ᴺ⁻¹ x[n] × e^(-2πi×k×n/N) — naively O(N²).
FFT (Cooley-Tukey) reduces to O(N log N) by divide and conquer:
flowchart TD
subgraph "FFT Divide and Conquer (N=8)"
X["DFT of x[0..7]"]
XE["DFT of x[0,2,4,6] (even)\n= X_even[k]"]
XO["DFT of x[1,3,5,7] (odd)\n= X_odd[k]"]
COMBINE["Combine:\n X[k] = X_even[k] + W^k × X_odd[k]\n X[k+N/2] = X_even[k] - W^k × X_odd[k]\n W = e^(-2πi/N) (twiddle factor)"]
X --> XE --> COMBINE
X --> XO --> COMBINE
end
subgraph "Butterfly Diagram"
B["Two inputs: a, b\nOne twiddle: W\nOutputs:\n a + W×b (sum)\n a - W×b (diff)\nAll (N/2)·log₂(N) butterflies\ncomputed in-place"]
end
FFT Memory Access Pattern¶
flowchart LR
subgraph "Bit-Reversal Permutation (pre-processing)"
P["Input re-ordered by bit-reversed index:\nn=0(000)→0, n=1(001)→4, n=2(010)→2,\nn=3(011)→6, n=4(100)→1, ...\nAllows in-place butterfly with\nsequential memory access in passes"]
end
subgraph "Pass Structure"
PASS1["Pass 1: butterflies span 2 elements (N/2 butterflies)"]
PASS2["Pass 2: butterflies span 4 elements (N/4 groups)"]
PASS3["Pass 3: butterflies span 8 elements (N/8 groups)"]
PASSn["... log₂(N) passes total"]
PASS1 --> PASS2 --> PASS3 --> PASSn
end
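The bit-reversal permutation plus the pass structure above fit in a short in-place routine. A sketch of an iterative radix-2 Cooley-Tukey FFT (illustrative; real libraries add mixed radices, cache blocking, and SIMD):

```python
import cmath

def fft(x):
    """In-place iterative radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    x = list(x)
    # Bit-reversal permutation so butterflies can run in place.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            x[i], x[j] = x[j], x[i]
    # log2(n) passes; the pass with span `length` has n/length groups.
    length = 2
    while length <= n:
        w_step = cmath.exp(-2j * cmath.pi / length)  # twiddle factor W
        for start in range(0, n, length):
            w = 1.0
            for k in range(start, start + length // 2):
                a, b = x[k], w * x[k + length // 2]
                x[k], x[k + length // 2] = a + b, a - b  # butterfly
                w *= w_step
        length *= 2
    return x
```

Sanity checks: the FFT of a unit impulse is all ones, and the FFT of a constant signal concentrates everything in bin 0.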
4. Numerical Integration: Error Analysis¶
Trapezoidal Rule vs Simpson's Rule¶
flowchart TD
subgraph "Quadrature Rules"
TRAP["Trapezoidal Rule:\n∫ₐᵇ f(x)dx ≈ h/2 × (f(a) + 2f(a+h) + ... + f(b))\nError: O(h³) per interval, O(h²) global"]
SIMP["Simpson's Rule:\n∫ₐᵇ f(x)dx ≈ h/3 × (f(a) + 4f(a+h) + 2f(a+2h) + ...)\nError: O(h⁵) per interval, O(h⁴) global\n(fits a parabola, not a line)"]
GAUSS["Gaussian Quadrature:\nOptimally place N nodes + weights\nExact for polynomials of degree ≤ 2N-1\n(using Legendre polynomial roots as nodes)"]
TRAP --> SIMP --> GAUSS
end
subgraph "Adaptive Quadrature"
ADAPT["Adaptive: estimate local error\n compute I(a,b) with coarse N\n compute I(a,b) with fine 2N\n if |I_coarse - I_fine| < tol: done\n else: bisect [a,b] and recurse\nConcentrates effort on\ncurvy/singular regions"]
end
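The adaptive scheme above, instantiated with Simpson's rule as the base quadrature (a common pairing, shown here as a sketch; the factor 15 is the standard Richardson estimate for an O(h⁴) rule):

```python
import math

def simpson(f, a, b):
    """One Simpson's-rule panel on [a, b]."""
    m = 0.5 * (a + b)
    return (b - a) / 6 * (f(a) + 4 * f(m) + f(b))

def adaptive(f, a, b, tol=1e-10):
    """Bisect wherever the coarse and fine estimates disagree,
    concentrating panels in curvy regions."""
    m = 0.5 * (a + b)
    coarse = simpson(f, a, b)
    fine = simpson(f, a, m) + simpson(f, m, b)
    if abs(fine - coarse) < 15 * tol:   # Richardson factor for O(h^4)
        return fine
    return adaptive(f, a, m, tol / 2) + adaptive(f, m, b, tol / 2)

val = adaptive(math.sin, 0.0, math.pi)  # exact answer: 2
```

Since Simpson's rule is exact for cubics, smooth low-degree integrands terminate after a single coarse/fine comparison.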
5. Linear Programming: Simplex Method Internals¶
Optimize cᵀx subject to Ax ≤ b, x ≥ 0.
flowchart TD
subgraph "Simplex Algorithm"
INIT["Start at a basic feasible solution\n(corner of feasible polytope)"]
CHECK["Check optimality:\nare all reduced costs ≥ 0?\n(if so, the current vertex is optimal)"]
PIVOT["Select entering variable:\nmost negative reduced cost\n(steepest edge heuristic)"]
RATIO["Ratio test:\nminimum ratio min(bᵢ/aᵢⱼ) for aᵢⱼ>0\n→ leaving variable (maintains feasibility;\nBland's rule prevents cycling)"]
MOVE["Pivot: swap entering ↔ leaving\n(Gauss-Jordan elimination on tableau)\nMove to adjacent vertex"]
INIT --> CHECK
CHECK -->|not optimal| PIVOT --> RATIO --> MOVE --> CHECK
CHECK -->|optimal| DONE["Optimal solution found"]
end
Worst case vs average case: Simplex is exponential in the worst case (the Klee-Minty cube) but behaves polynomially on virtually all practical instances. Interior-point methods are polynomial (O(N³·⁵)) in theory, yet simplex is often faster in practice for sparse LPs.
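The loop in the flowchart can be sketched as a minimal tableau simplex. This is illustrative only: it assumes b ≥ 0 (so the slack variables form the starting basic feasible solution), uses the most-negative-reduced-cost entering rule, and omits anti-cycling safeguards:

```python
def simplex(c, A, b):
    """Maximize c·x subject to A x <= b, x >= 0, assuming b >= 0.
    Tableau rows: [A | I | b]; objective row holds reduced costs."""
    m, n = len(A), len(c)
    T = [A[i] + [1.0 if j == i else 0.0 for j in range(m)] + [b[i]]
         for i in range(m)]
    z = [-ci for ci in c] + [0.0] * m + [0.0]
    basis = list(range(n, n + m))          # slacks are basic initially
    while True:
        # Entering variable: most negative reduced cost.
        j = min(range(n + m), key=lambda k: z[k])
        if z[j] >= -1e-12:
            break                          # all reduced costs >= 0: optimal
        # Ratio test: leaving row keeps the next vertex feasible.
        ratios = [(T[i][-1] / T[i][j], i) for i in range(m) if T[i][j] > 1e-12]
        if not ratios:
            raise ValueError("unbounded LP")
        _, r = min(ratios)
        # Gauss-Jordan pivot on T[r][j]: move to the adjacent vertex.
        piv = T[r][j]
        T[r] = [v / piv for v in T[r]]
        for i in range(m):
            if i != r and abs(T[i][j]) > 1e-12:
                f = T[i][j]
                T[i] = [a - f * p for a, p in zip(T[i], T[r])]
        f = z[j]
        z = [a - f * p for a, p in zip(z, T[r])]
        basis[r] = j
    x = [0.0] * n
    for i, bj in enumerate(basis):
        if bj < n:
            x[bj] = T[i][-1]
    return x, z[-1]                        # solution vector, objective value

# maximize 3x + 2y  s.t.  x + y <= 4,  x + 3y <= 6  →  x=(4,0), obj=12
x, opt = simplex([3.0, 2.0], [[1.0, 1.0], [1.0, 3.0]], [4.0, 6.0])
```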
6. Gradient Descent: Loss Landscape Navigation¶
flowchart TD
subgraph "Gradient Descent Variants"
BGD["Batch GD:\n grad = (1/N) Σ ∇L(xᵢ, yᵢ, θ)\n Uses ALL N samples per step\n Exact gradient, slow per step"]
SGD["Stochastic GD:\n grad = ∇L(x₁, y₁, θ)\n Uses 1 sample per step\n Noisy gradient, fast per step\n Noise helps escape local minima!"]
MINI["Mini-batch GD:\n grad = (1/B) Σᵦ ∇L(xᵢ, yᵢ, θ)\n B = 32-512 samples\n GPU-vectorized, practical standard"]
BGD --> MINI
SGD --> MINI
end
subgraph "Adam Optimizer Internal State"
ADAM["θ_t+1 = θ_t - α × m̂_t / (√v̂_t + ε)\nwhere:\n g_t = ∇L (current gradient)\n m_t = β₁×m_{t-1} + (1-β₁)×g_t (1st moment/momentum)\n v_t = β₂×v_{t-1} + (1-β₂)×g_t² (2nd moment/RMS)\n m̂_t = m_t / (1-β₁ᵗ) (bias correction)\n v̂_t = v_t / (1-β₂ᵗ) (bias correction)\nTypical: β₁=0.9, β₂=0.999, α=0.001, ε=1e-8"]
end
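The Adam state equations above translate directly into code. A sketch of one update step, driven on a toy quadratic loss (f(θ) = θ², gradient 2θ) to show convergence:

```python
import numpy as np

def adam_step(theta, grad, m, v, t,
              alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update, mirroring the state equations in the diagram."""
    m = beta1 * m + (1 - beta1) * grad          # 1st moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2     # 2nd moment (RMS)
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(theta) = theta^2 from theta = 1; gradient is 2*theta.
theta, m, v = np.array([1.0]), 0.0, 0.0
for t in range(1, 5001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
# theta has descended close to the minimum at 0
```

Note the step size is roughly α regardless of gradient magnitude, since m̂/√v̂ normalizes to about ±1 for a consistent gradient direction.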
Learning Rate Scheduling¶
flowchart LR
subgraph "Warmup + Cosine Decay Schedule"
WU["Warmup (0→T_warm):\n lr = lr_max × t/T_warm\n Avoids large gradient steps\n at random initialization"]
COS["Cosine Decay (T_warm→T_max):\n lr = lr_min + 0.5×(lr_max-lr_min)×\n (1 + cos(π×(t-T_warm)/(T_max-T_warm)))\n Smooth decay to lr_min\n(used in BERT, GPT training)"]
WU --> COS
end
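The schedule is a small pure function of the step count. A sketch with hypothetical hyperparameters (lr_max, lr_min, and horizon values chosen only for illustration):

```python
import math

def lr_schedule(t, lr_max=3e-4, lr_min=3e-5, t_warm=1000, t_max=100_000):
    """Linear warmup to lr_max, then cosine decay to lr_min."""
    if t < t_warm:
        return lr_max * t / t_warm          # warmup ramp
    progress = (t - t_warm) / (t_max - t_warm)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```

At t = t_warm the cosine term equals 1, so the two pieces join continuously at lr_max; at t = t_max the cosine term is -1 and the rate bottoms out at lr_min.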
7. Monte Carlo Methods: Convergence and Variance Reduction¶
flowchart TD
subgraph "Basic Monte Carlo Integration"
MC["∫ f(x) dx over [a,b]\n≈ (b-a)/N × Σᵢ f(xᵢ) where xᵢ ~ Uniform[a,b]\nError: O(1/√N) — rate independent of dimension!\n(vs Simpson's O(h⁴) degrading in higher dims)"]
end
subgraph "Variance Reduction Techniques"
IS["Importance Sampling:\nSample from q(x) ∝ |f(x)|\nEstimator: (1/N) Σ f(xᵢ)/q(xᵢ)\nLower variance when q matches f's peaks"]
CV["Control Variates:\ng(x) known integral, correlated with f\nestimate f-g, add E[g]\nVar[f-g] < Var[f] when Cov(f,g)>0"]
QUASI["Quasi-Monte Carlo:\nLow-discrepancy sequences (Halton, Sobol)\nbetter than uniform random sampling\n≈O((log N)ᵈ/N) error for smooth f\n(vs O(1/√N) for random)"]
IS --> CV --> QUASI
end
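A sketch comparing plain Monte Carlo with importance sampling on ∫₀¹ eˣ dx = e − 1. The proposal q(x) = 2(1+x)/3 is an illustrative choice that roughly tracks eˣ on [0, 1], sampled by inverting its CDF:

```python
import numpy as np

rng = np.random.default_rng(42)
N = 100_000

# Plain Monte Carlo: interval length (b - a) = 1, so just the sample mean.
samples = np.exp(rng.random(N))
est = samples.mean()
stderr = samples.std() / np.sqrt(N)          # O(1/sqrt(N)) error

# Importance sampling with q(x) = 2(1+x)/3.
# Inverse CDF: solve (2x + x^2)/3 = u  ->  x = sqrt(1 + 3u) - 1.
u = rng.random(N)
xs = np.sqrt(1 + 3 * u) - 1
w = np.exp(xs) / (2 * (1 + xs) / 3)          # f(x) / q(x)
est_is = w.mean()
stderr_is = w.std() / np.sqrt(N)             # smaller: f/q is flatter than f
```

Both estimators are unbiased; the importance-sampled one has lower variance because eˣ/q(x) varies far less over [0, 1] than eˣ itself.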
8. Graph Algorithms: Mathematical Foundations¶
Shortest Path: Dijkstra and Bellman-Ford¶
flowchart TD
subgraph "Dijkstra Internal Mechanics"
INIT["dist[source]=0, dist[v]=∞ for all v\nmin-heap H = {(0, source)}"]
LOOP["Extract min (u, d_u) from H\nFor each neighbor v:\n if d_u + w(u,v) < dist[v]:\n dist[v] = d_u + w(u,v)\n push (dist[v], v) to H"]
TERM["Terminate when H empty\nCorrect for non-negative weights\nO((V+E) log V) with binary heap"]
INIT --> LOOP --> TERM
end
subgraph "Bellman-Ford: Negative Weights"
BF["Relax ALL edges V-1 times:\n for i in 1..V-1:\n for each edge (u,v,w):\n dist[v] = min(dist[v], dist[u]+w)\nV-th relaxation: if any dist improves\n→ negative cycle detected\nO(VE) time"]
end
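Dijkstra's loop maps directly onto `heapq` with lazy deletion (stale heap entries are skipped rather than decreased in place). A sketch on a small hypothetical adjacency list:

```python
import heapq

def dijkstra(graph, source):
    """graph: {u: [(v, w), ...]} with non-negative edge weights w.
    Returns dict of shortest distances from source."""
    dist = {source: 0}
    heap = [(0, source)]
    done = set()
    while heap:
        d_u, u = heapq.heappop(heap)
        if u in done:
            continue                    # stale entry: lazy deletion
        done.add(u)
        for v, w in graph.get(u, []):
            nd = d_u + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

g = {"S": [("A", 1), ("B", 4)], "A": [("B", 2), ("T", 6)], "B": [("T", 1)]}
d = dijkstra(g, "S")
# shortest S→T path is S→A→B→T with total weight 1 + 2 + 1 = 4
```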
Maximum Flow: Ford-Fulkerson and Augmenting Paths¶
flowchart LR
subgraph "Residual Graph"
G["Original graph:\nCapacity(S→A)=10, Cap(A→T)=10"]
RG["Residual graph:\nForward: S→A cap=10-flow\nBackward: A→S cap=flow\n(allows 'undoing' flow)"]
G --> RG
end
subgraph "Edmonds-Karp (BFS augmentation)"
BFS["BFS in residual graph\nFind shortest augmenting path S→T\nAugment by bottleneck capacity\nRepeat until no path found\nO(VE²) — polynomial!"]
end
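A sketch of Edmonds-Karp over a dict-of-dicts capacity map (the example graph is hypothetical). The residual structure above appears explicitly: every edge gets a zero-capacity reverse twin, and augmenting decrements forward capacity while incrementing the backward one:

```python
from collections import deque

def edmonds_karp(cap, s, t):
    """cap[u][v] = capacity of edge u->v. Returns the max-flow value."""
    # Residual capacities, including zero-capacity reverse edges.
    res = {u: dict(nbrs) for u, nbrs in cap.items()}
    for u in cap:
        for v in cap[u]:
            res.setdefault(v, {}).setdefault(u, 0)
    flow = 0
    while True:
        # BFS: shortest augmenting path in the residual graph.
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow                 # no augmenting path: flow is maximal
        # Recover the path, augment by its bottleneck capacity.
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= bottleneck     # use forward residual capacity
            res[v][u] += bottleneck     # allow later 'undoing' of this flow
        flow += bottleneck

cap = {"S": {"A": 10, "B": 10}, "A": {"B": 2, "T": 4}, "B": {"T": 9}}
# max flow S→T is 13: 4 via A→T, 2 via A→B→T, 7 via S→B→T
```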
9. Eigenvalues: Power Iteration and QR Algorithm¶
flowchart TD
subgraph "Power Iteration (Dominant Eigenvalue)"
PI["Start: random vector v₀\nIterate: v_{t+1} = A×v_t / ||A×v_t||\nConverges to eigenvector of largest |λ|\nRate: |λ₂/λ₁|ᵗ — fast if eigenvalues well-separated"]
end
subgraph "QR Algorithm (All Eigenvalues)"
QR["Initialize: A₀ = A\nRepeat:\n Factor: Aₜ = Qₜ × Rₜ (QR decomposition)\n Update: A_{t+1} = Rₜ × Qₜ (reversed product)\nConverges: Aₜ → upper triangular\nDiagonal = eigenvalues\nAccumulated Q₀Q₁···Qₜ = Schur vectors\n(eigenvectors when A is symmetric)\nO(N³) per iteration, fast convergence"]
end
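Power iteration is a few lines of numpy. A sketch on a symmetric 2×2 matrix with eigenvalues 3 and 1 (so the convergence rate is |λ₂/λ₁|ᵗ = (1/3)ᵗ, essentially exact after a few hundred iterations):

```python
import numpy as np

def power_iteration(A, iters=200, seed=0):
    """Dominant eigenpair of A via repeated multiply-and-normalize."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])  # random start (almost surely not
    for _ in range(iters):               # orthogonal to the top eigenvector)
        v = A @ v
        v /= np.linalg.norm(v)
    lam = v @ A @ v                      # Rayleigh quotient for unit v
    return lam, v

A = np.array([[2.0, 1.0], [1.0, 2.0]])   # eigenvalues 3 and 1
lam, v = power_iteration(A)
# lam ≈ 3, v ≈ ±[1, 1]/√2
```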
10. Information Theory Foundations¶
flowchart TD
subgraph "Entropy and KL Divergence"
ENTROPY["Shannon Entropy:\nH(X) = -Σ P(x) log₂ P(x) [bits]\nMeasures: average information per symbol\nUniform dist: H = log₂(N) [maximum]\nOne certain outcome: H = 0 [minimum]"]
CROSS["Cross-Entropy:\nH(P,Q) = -Σ P(x) log Q(x)\n= H(P) + KL(P||Q)\nLoss function in ML classification:\n= -log Q(correct_class)\n(minimizing cross-entropy = maximizing\nlog-likelihood of true distribution P)"]
KL["KL Divergence:\nKL(P||Q) = Σ P(x) log(P(x)/Q(x))\nNon-symmetric: KL(P||Q) ≠ KL(Q||P)\nKL(P||Q)=0 iff P=Q\nUsed in VAE loss, RL policy optimization"]
ENTROPY --> CROSS
ENTROPY --> KL
end
subgraph "Data Compression: Huffman Coding"
HUFF["Build min-heap of symbol frequencies\nRepeatedly merge two lowest-freq nodes\nLeft branch: 0, right branch: 1\nFrequent symbols → short codes\nOptimal prefix-free code\nApproaches entropy H(X) bits/symbol"]
end
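The heap-merge construction above, plus a check of the entropy bound H(X) ≤ average code length < H(X) + 1, sketched on the classic six-symbol frequency example (illustrative data):

```python
import heapq
import math

def huffman_codes(freq):
    """Optimal prefix-free code from {symbol: frequency}."""
    # Heap items: (frequency, tiebreak, tree); a tree is a symbol or a pair.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)   # two lowest-frequency nodes
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (t1, t2)))
        count += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")   # left branch: 0
            walk(tree[1], prefix + "1")   # right branch: 1
        else:
            codes[tree] = prefix or "0"   # single-symbol edge case
    walk(heap[0][2], "")
    return codes

freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
codes = huffman_codes(freq)

total = sum(freq.values())
probs = {s: f / total for s, f in freq.items()}
H = -sum(p * math.log2(p) for p in probs.values())          # Shannon entropy
avg_len = sum(probs[s] * len(codes[s]) for s in probs)      # bits/symbol
# H <= avg_len < H + 1: Huffman is within one bit of the entropy
```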
11. Numerical Stability: Condition Numbers¶
flowchart TD
subgraph "Condition Number κ(A)"
DEF["κ(A) = ||A|| × ||A⁻¹||\n= σ_max / σ_min (ratio of largest to smallest singular value)\nMeasures: how much relative error in b\namplifies relative error in x for Ax=b"]
INTERP["κ(A) = 1: perfectly conditioned (orthogonal matrix)\nκ(A) = 100: losing ~2 decimal digits of precision\nκ(A) = 10¹⁵: near singular — useless results\nRule: lose ≈ log₁₀(κ) decimal digits"]
DEF --> INTERP
end
subgraph "Preconditioning"
PREC["Replace Ax=b with M⁻¹Ax = M⁻¹b\nChoose M ≈ A so M⁻¹A ≈ I\n→ κ(M⁻¹A) << κ(A)\nCommon: diagonal scaling M=diag(A)\nJacobi preconditioner\nImproves convergence of iterative solvers (CG, GMRES)"]
end
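κ(A) drops out of the SVD directly. A sketch using the Hilbert matrix H[i,j] = 1/(i+j+1), a standard example of severe ill-conditioning, together with the digits-lost rule of thumb:

```python
import numpy as np

def cond(A):
    """kappa(A) = sigma_max / sigma_min from the singular values."""
    s = np.linalg.svd(A, compute_uv=False)   # sorted descending
    return s[0] / s[-1]

n = 8
H = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])
kappa = cond(H)
digits_lost = np.log10(kappa)   # rule of thumb: decimal digits lost
# kappa(H_8) ≈ 1.5e10 → expect roughly 10 of 16 digits lost solving Hx = b
```

For contrast, any orthogonal matrix (all singular values equal to 1) gives κ = 1: the perfectly conditioned case.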
Summary: Mathematical Computing Properties¶
| Algorithm | Complexity | Numerical Issues | Mitigation |
|---|---|---|---|
| Gaussian elimination | O(N³) | Pivot blowup | Partial pivoting |
| FFT | O(N log N) | Round-off accumulation | Double precision |
| Conjugate Gradient | O(N√κ) iterations | κ large → slow convergence | Preconditioning |
| Power iteration | O(N² per iter) | Slow if λ₁≈λ₂ | Deflation |
| QR eigenvalue | O(N³ per iter) | Non-Hermitian → complex eig | Hessenberg reduction |
| Monte Carlo | O(N) samples, O(1/√N) error | Variance proportional to f² | Importance sampling |
| Backprop | O(N params) | Vanishing/exploding gradients | Layer norm, gradient clip |