Distributed Computing: Principles, Algorithms, and Systems — Under the Hood¶
Source: Distributed Computing: Principles, Algorithms, and Systems — Ajay D. Kshemkalyani & Mukesh Singhal, Cambridge University Press, 2008 (756 pages)
1. The Distributed Execution Model: Events, Causality, and Global State¶
A distributed system is not a sequential machine. There is no global clock — each process has its own local clock and communicates only by message passing. The fundamental unit of analysis is an event: a local computation, a send, or a receive.
flowchart LR
subgraph P1["Process P1"]
E1["e₁¹ (local)"] --> E2["e₂¹ (send m1)"] --> E3["e₃¹ (receive m2)"] --> E4["e₄¹ (local)"]
end
subgraph P2["Process P2"]
F1["f₁² (local)"] --> F2["f₂² (receive m1)"] --> F3["f₃² (send m2)"] --> F4["f₄² (local)"]
end
E2 -->|"m1 (message in transit)"| F2
F3 -->|"m2"| E3
subgraph HAPPENS_BEFORE["Happens-Before (→) Relation"]
HB1["e₂¹ → f₂² (send before receive)"]
HB2["f₂² → f₃² (same process, sequential)"]
HB3["f₃² → e₃¹ (send before receive)"]
HB4["e₂¹ → e₃¹ (transitivity)"]
end
Lamport's happens-before (→) is a partial order — not all event pairs are comparable. Two events a and b are concurrent (a ∥ b) if neither a → b nor b → a. This is the root cause of distributed system complexity: concurrent events cannot be linearized without coordination.
2. Logical Clocks: Assigning Timestamps Without a Global Clock¶
Lamport Scalar Clocks¶
Each process Pi maintains a counter C[i]. The rules:
1. Before every event: C[i] += 1
2. On send: piggyback C[i] in message
3. On receive(msg with timestamp t): C[i] = max(C[i], t) + 1
sequenceDiagram
participant P1 as P1 (C=0)
participant P2 as P2 (C=0)
participant P3 as P3 (C=0)
Note over P1: C=1: local event e₁
P1->>P2: m1 (timestamp=1)
Note over P2: C=max(0,1)+1=2: receive m1
Note over P2: C=3: local event f₂
P2->>P3: m2 (timestamp=3)
Note over P3: C=max(0,3)+1=4: receive m2
Note over P1: C=2: local event e₂
Note over P1,P3: Lamport: a→b ⟹ C(a) < C(b)\nBut C(a) < C(b) does NOT imply a→b\n(concurrent events can still receive ordered timestamps)
Limitation: Scalar clocks cannot detect concurrency. If C(a) < C(b), it could mean a → b OR a ∥ b.
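As a concrete illustration of the three rules above, here is a minimal Python sketch (class and variable names are illustrative, not from the book):

```python
class LamportClock:
    """Scalar logical clock: one counter per process."""

    def __init__(self):
        self.c = 0

    def local_event(self):
        self.c += 1               # rule 1: tick before every event
        return self.c

    def send(self):
        self.c += 1               # sending is itself an event
        return self.c             # rule 2: piggyback this timestamp on the message

    def receive(self, msg_timestamp):
        # rule 3: jump past the sender's clock, then tick
        self.c = max(self.c, msg_timestamp) + 1
        return self.c


# Two processes exchanging one message, as in the diagram:
p1, p2 = LamportClock(), LamportClock()
t = p1.send()                     # p1: C=1
print(p2.receive(t))              # p2: max(0,1)+1 = 2
print(p1.local_event())           # p1: C=2 -- concurrent with p2's receive
```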
Vector Clocks: Capturing Full Causality¶
Process Pi maintains vector VC[i][1..n]. Rules:
1. Before event at Pi: VC[i][i] += 1
2. On send: piggyback entire vector VC[i]
3. On receive at Pi from Pj with timestamp VT: VC[i][k] = max(VC[i][k], VT[k]) for all k, then VC[i][i] += 1
sequenceDiagram
participant P1 as P1 VC=[0,0,0]
participant P2 as P2 VC=[0,0,0]
participant P3 as P3 VC=[0,0,0]
Note over P1: VC=[1,0,0]: send m1
P1->>P2: m1 (VC=[1,0,0])
Note over P2: VC=[1,1,0]: receive m1, then +1 own
Note over P2: VC=[1,2,0]: send m2
P2->>P3: m2 (VC=[1,2,0])
Note over P3: VC=[1,2,1]: receive
Note over P1: VC=[2,0,0]: independent local event
Note over P1,P3: e@P1=[2,0,0], f@P3=[1,2,1]\ne∥f: neither dominates the other component-wise
Vector clock comparison: VC(a) < VC(b) iff VC(a)[k] ≤ VC(b)[k] for all k and strict for at least one. This is a necessary and sufficient condition for a → b. Concurrent events are detected when neither dominates.
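A small Python sketch of the vector-clock rules and the comparison test, reusing the events from the diagram above (names are illustrative):

```python
class VectorClock:
    """Vector clock for process `pid` in a system of n processes."""

    def __init__(self, pid, n):
        self.pid = pid
        self.vc = [0] * n

    def local_event(self):
        self.vc[self.pid] += 1

    def send(self):
        self.vc[self.pid] += 1
        return list(self.vc)          # piggyback a copy of the whole vector

    def receive(self, vt):
        # component-wise max with the piggybacked vector, then tick own entry
        self.vc = [max(a, b) for a, b in zip(self.vc, vt)]
        self.vc[self.pid] += 1


def happened_before(a, b):
    """a -> b  iff  a <= b component-wise and a != b."""
    return all(x <= y for x, y in zip(a, b)) and a != b

def concurrent(a, b):
    return not happened_before(a, b) and not happened_before(b, a)


# Events from the diagram: e at P1, f at P3
e = [2, 0, 0]
f = [1, 2, 1]
print(concurrent(e, f))   # True: neither vector dominates the other
```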
3. Global State and Consistent Cuts¶
A global state is a tuple of local process states and channel states. A consistent cut is a global state where for every message received, the corresponding send is also included.
flowchart LR
subgraph TIME["Space-Time Diagram"]
direction TB
P1_LINE["P1: e₁ ——● e₂ ——● e₃ ——●"]
P2_LINE["P2: f₁ ——● f₂ ——● f₃ ——●"]
P3_LINE["P3: g₁ ——● g₂ ——● g₃ ——●"]
end
subgraph CUTS["Cut Comparison"]
CUT_A["Consistent Cut C₁:\ne₂, f₁, g₃\nAll received messages\nalso have senders included"]
CUT_B["Inconsistent Cut C₂:\ne₁, f₂, g₁\nf₂ = receive(m from P1)\nbut P1 cut at e₁ (before send)\n→ INCONSISTENT"]
end
Chandy-Lamport Snapshot Algorithm¶
Records a consistent global state without freezing the system:
sequenceDiagram
participant P1 as P1 (initiator)
participant P2 as P2
participant P3 as P3
P1->>P1: Record own state S1
P1->>P2: MARKER (on channel c12)
P1->>P3: MARKER (on channel c13)
Note over P1: Begin recording messages on incoming channels
P2->>P2: Receive MARKER from P1\n→ Record state S2\nRecord c12 state = {} (empty: MARKER was first)
P2->>P3: MARKER (on channel c23)
Note over P2: Record incoming msgs from P3 (not yet seen MARKER from P3)
P3->>P3: Receive MARKER from P1\n→ Record state S3
P3->>P3: Receive MARKER from P2\n→ Record c23 state = {msgs since S3 snapshot}
Note over P1,P3: Global snapshot = (S1, S2, S3, c12={}, c13={}, c23={msgs})
Note over P1,P3: This is a consistent cut:\neach received message also has its send in snapshot
Key insight: The MARKER acts as a timestamp separator. Everything before the MARKER on a channel belongs to the snapshot; everything after does not. The algorithm is non-intrusive — normal computation continues.
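A simplified sketch of the marker rules at a single process, assuming one snapshot at a time and FIFO channels (all names and callbacks are illustrative):

```python
class SnapshotProcess:
    """Chandy-Lamport marker handling at one process (single snapshot, sketch)."""

    def __init__(self, pid, incoming_channels):
        self.pid = pid
        self.state = None             # recorded local state
        self.recording = {}           # channel -> messages logged for the snapshot
        self.incoming = incoming_channels

    def take_snapshot(self, local_state, send_marker_on_all_outgoing):
        self.state = local_state
        # start recording every incoming channel, then flood MARKERs
        self.recording = {c: [] for c in self.incoming}
        send_marker_on_all_outgoing()

    def on_receive(self, channel, msg, local_state, send_marker_on_all_outgoing):
        if msg == "MARKER":
            if self.state is None:
                # first marker seen: record own state; this channel's state is empty
                self.take_snapshot(local_state, send_marker_on_all_outgoing)
            # the marker closes the channel: what was logged so far is its state
            return channel, self.recording.pop(channel, [])
        if channel in self.recording:
            # a normal message that was in flight when the snapshot started
            self.recording[channel].append(msg)
        return None
```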
4. Distributed Mutual Exclusion: Algorithms and Complexity¶
Lamport's Algorithm (1978)¶
Uses Lamport clocks to totally order requests. Every site maintains a request queue sorted by (timestamp, site_id).
sequenceDiagram
participant Si as Site Si
participant Sj as Site Sj (all other sites)
participant CS as Critical Section
Si->>Sj: REQUEST(tsi, i) broadcast to all
Si->>Si: Add (tsi, i) to own queue
Sj-->>Si: REPLY(tsj) after adding to own queue
Note over Si: Si enters CS when:\n1. (tsi,i) is at head of queue\n2. Received REPLY from ALL other sites
Si->>CS: Enter Critical Section
CS->>Si: Exit CS
Si->>Sj: RELEASE broadcast
Sj->>Sj: Remove (tsi,i) from queue
Message complexity: 3(N-1) messages per CS entry (N-1 REQUESTs + N-1 REPLYs + N-1 RELEASEs). Synchronization delay: T, one average message delay (the RELEASE must reach the next site in the queue before it can enter).
Ricart-Agrawala Algorithm (1981): Optimized¶
Eliminates explicit RELEASE messages by merging them into deferred REPLYs:
sequenceDiagram
participant Si as Si (wants CS)
participant Sj as Sj (in CS or wants CS)
participant Sk as Sk (idle)
Si->>Sj: REQUEST(tsi, i)
Si->>Sk: REQUEST(tsi, i)
Sk-->>Si: REPLY immediately (Sk doesn't want CS)
Note over Sj: Sj in CS → defer REPLY to Si
Sj->>CS: (finishes CS)
Sj-->>Si: REPLY (deferred)
Note over Si: Received N-1 REPLYs → Enter CS
Si->>CS: Enter CS
Message complexity: 2(N-1) messages (N-1 REQUESTs + N-1 REPLYs). Correctness: Total order on requests via timestamps ensures no two sites are in CS simultaneously.
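The core of Ricart-Agrawala is deciding whether to reply immediately or defer; here is a sketch of that rule (state names and the helper are illustrative):

```python
def should_defer(my_state, my_ts, my_id, req_ts, req_id):
    """Return True if the incoming REQUEST must be deferred.

    my_state: 'RELEASED', 'WANTED', or 'HELD'
    (my_ts, my_id): timestamp of our own outstanding request (if any)
    (req_ts, req_id): timestamp of the incoming request
    Ties on timestamp are broken by site id, giving a total order.
    """
    if my_state == "HELD":
        return True
    if my_state == "WANTED":
        # defer iff our own request is ordered before the incoming one
        return (my_ts, my_id) < (req_ts, req_id)
    return False                      # RELEASED: reply immediately


# Sj holds the CS -> defers; Sk is idle -> replies immediately
print(should_defer("HELD", None, 2, req_ts=7, req_id=1))      # True
print(should_defer("RELEASED", None, 3, req_ts=7, req_id=1))  # False
```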
Maekawa's Quorum-Based Algorithm: √N Messages¶
Instead of broadcasting to all N sites, each site broadcasts to only its quorum set of size ~√N:
flowchart TD
subgraph QUORUM_SETS["Quorum Sets for N=9 sites (3×3 grid)"]
Q1["R(S1) = {S1, S2, S3, S4, S7}"]
Q2["R(S5) = {S5, S2, S8, S4, S6}"]
Q3["Any two quorum sets intersect:\nR(S1) ∩ R(S5) = {S2, S4}"]
end
DEADLOCK["Deadlock Problem:\nquorum members grant (lock for) requests in arrival order,\nso Si, Sj, Sk can each lock part of the others' quorums\n→ circular wait"]
RESOLVE["Resolution:\nFAILED / INQUIRE / YIELD messages\n→ 5√N messages worst case"]
QUORUM_SETS --> DEADLOCK --> RESOLVE
Why quorums guarantee safety: Any two quorum sets intersect, and each site grants permission to only one request at a time, so two requesters can never both collect grants from their entire quorums simultaneously. The intersection site acts as the "common voter" serializing access.
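A sketch of the grid-based quorum construction for N = K² sites (each site's row plus its column), showing that any two quorums intersect; the 0-based site numbering is illustrative (site 0 corresponds to S1 above):

```python
def grid_quorum(site, k):
    """Quorum of `site` in a k x k grid: its whole row plus its whole column."""
    row, col = divmod(site, k)
    row_members = {row * k + c for c in range(k)}
    col_members = {r * k + col for r in range(k)}
    # size 2k-1; Maekawa's optimal projective-plane construction achieves ~sqrt(N)
    return row_members | col_members


k = 3                                   # N = 9 sites, numbered 0..8
q0 = grid_quorum(0, k)                  # {0,1,2,3,6}  ~ R(S1)
q4 = grid_quorum(4, k)                  # {1,3,4,5,7}  ~ R(S5)
print(sorted(q0 & q4))                  # [1, 3]  ~ {S2, S4}

# Any two quorums intersect: row(i) crosses column(j) in at least one site
assert all(grid_quorum(i, k) & grid_quorum(j, k)
           for i in range(k * k) for j in range(k * k))
```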
5. Deadlock Detection: Wait-For Graphs¶
Centralized Deadlock Detection¶
flowchart TD
subgraph WFG["Wait-For Graph (WFG)"]
P1(["P1"]) -->|"waiting for resource held by"| P2(["P2"])
P2 -->|waiting for| P3(["P3"])
P3 -->|waiting for| P1
NOTE["Cycle P1→P2→P3→P1 = DEADLOCK"]
end
subgraph CENTRALIZED["Centralized Algorithm"]
LOCAL1["Local WFG at Site 1\nP1 → P2"] -->|periodic update| CONTROLLER["Central Deadlock\nDetector"]
LOCAL2["Local WFG at Site 2\nP2 → P3"] --> CONTROLLER
LOCAL3["Local WFG at Site 3\nP3 → P1"] --> CONTROLLER
CONTROLLER -->|"Cycle detection\n(DFS/BFS)"| RESOLVE["Kill youngest\nor cheapest process"]
end
Chandy-Misra-Haas Algorithm for Distributed Deadlock (AND model)¶
sequenceDiagram
participant P1 as P1 (blocked)
participant P2 as P2 (blocked, intermediary)
participant P3 as P3 (blocked)
Note over P1: P1 is blocked waiting for P2
P1->>P2: PROBE(P1, P1, P2) [initiator, sender, receiver]
Note over P2: P2 is also blocked → propagate PROBE
P2->>P3: PROBE(P1, P2, P3)
Note over P3: P3 is blocked, waiting for P1
P3->>P1: PROBE(P1, P3, P1)
Note over P1: Receive PROBE with initiator=P1\n→ DEADLOCK DETECTED
Note over P1: P1 initiates victim selection:\nkill the process with max PID in cycle
Complexity: O(e) messages where e = edges in WFG. Probes propagate along waiting edges — the PROBE returns to its initiator only if there is a cycle.
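A sketch of probe handling in the AND model (class and callback names are illustrative):

```python
class CmhProcess:
    """Chandy-Misra-Haas edge-chasing probes, AND model (sketch)."""

    def __init__(self, pid, wait_for, send):
        self.pid = pid
        self.blocked = False
        self.wait_for = wait_for     # processes this process is waiting for
        self.send = send             # send(dest, probe) callback
        self.seen = set()            # initiators whose probe we already forwarded

    def detect(self):
        """Called by a blocked process that suspects a deadlock."""
        self.blocked = True
        for q in self.wait_for:
            # probe = (initiator, sender, receiver)
            self.send(q, (self.pid, self.pid, q))

    def on_probe(self, probe):
        initiator, _, _ = probe
        if initiator == self.pid:
            return "DEADLOCK"        # probe returned to its initiator: cycle exists
        if self.blocked and initiator not in self.seen:
            self.seen.add(initiator)
            for q in self.wait_for:  # propagate along waiting edges
                self.send(q, (initiator, self.pid, q))
        return None
```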
6. Consensus and Agreement: The FLP Impossibility¶
The Fischer-Lynch-Paterson (FLP) impossibility theorem (1985) is arguably the most important result in distributed systems theory:
No deterministic algorithm can solve consensus in an asynchronous distributed system in which even a single process may crash.
stateDiagram-v2
[*] --> BIVALENT_INITIAL: Initial state\n(some process may crash)
BIVALENT_INITIAL --> DECISION_ATTEMPT: Execute one step
DECISION_ATTEMPT --> BIVALENT_AGAIN: Step taken by potentially\ncrashed process — indistinguishable\nfrom live-but-slow process
BIVALENT_AGAIN --> DECISION_ATTEMPT: Repeat forever
note right of BIVALENT_AGAIN: Bivalent = both 0 and 1\nstill reachable\n\nMonovalent = only 0 or 1\nreachable (decided)\n\nFLP: Can never deterministically\ntransition bivalent→monovalent\nin async system with failures
Why this matters: Any system claiming Byzantine/crash fault tolerance in an asynchronous network must do one of the following:
1. Use randomization (randomized consensus, e.g. Ben-Or's algorithm)
2. Use partial synchrony assumptions (Paxos, Raft — assume messages eventually arrive)
3. Solve a weaker problem (k-set consensus, approximate agreement)
Paxos: Consensus Under Partial Synchrony¶
Paxos assumes eventually synchronous channels — messages may be delayed but eventually arrive. It uses two phases:
sequenceDiagram
participant PROPOSER as Proposer (Leader)
participant A1 as Acceptor 1
participant A2 as Acceptor 2
participant A3 as Acceptor 3
Note over PROPOSER: Phase 1a: Prepare
PROPOSER->>A1: PREPARE(n=5)
PROPOSER->>A2: PREPARE(n=5)
PROPOSER->>A3: PREPARE(n=5)
Note over A1,A3: Accept if n > highest_promised
A1-->>PROPOSER: PROMISE(n=5, last_accepted=null)
A2-->>PROPOSER: PROMISE(n=5, last_accepted=(n=3, v=42))
A3-->>PROPOSER: PROMISE(n=5, last_accepted=null)
Note over PROPOSER: Phase 2a: Accept\nIf any PROMISE had prior value,\nmust use that value (v=42)\nOtherwise propose own value
PROPOSER->>A1: ACCEPT(n=5, v=42)
PROPOSER->>A2: ACCEPT(n=5, v=42)
PROPOSER->>A3: ACCEPT(n=5, v=42)
A1-->>PROPOSER: ACCEPTED(n=5)
A2-->>PROPOSER: ACCEPTED(n=5)
Note over PROPOSER: Quorum (2/3) → COMMIT v=42
PROPOSER->>A1: COMMIT(v=42)
PROPOSER->>A2: COMMIT(v=42)
PROPOSER->>A3: COMMIT(v=42)
Why the proposer must adopt the highest-numbered value returned in Phase 1b: If an acceptor has already accepted a value v in a prior round, v may already have been chosen by a quorum. The new proposer must re-propose v to prevent two different values from being chosen in different rounds.
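A sketch of that Phase 2a value-selection rule (names are illustrative):

```python
def choose_value(promises, my_value):
    """promises: list of (promised_n, last_accepted), where last_accepted is
    None or an (accepted_n, accepted_value) pair. Returns the value to propose."""
    accepted = [p for _, p in promises if p is not None]
    if not accepted:
        return my_value                          # free to propose our own value
    # must adopt the value of the highest-numbered prior acceptance
    _, value = max(accepted, key=lambda nv: nv[0])
    return value


# The run from the diagram: one acceptor already accepted (n=3, v=42)
promises = [(5, None), (5, (3, 42)), (5, None)]
print(choose_value(promises, my_value="my-op"))  # -> 42
```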
Raft: Understandable Paxos¶
stateDiagram-v2
[*] --> Follower: Node starts
Follower --> Candidate: Election timeout\n(no heartbeat from leader)
Candidate --> Leader: Receives votes from majority
Candidate --> Follower: Discovers leader\nor higher term
Leader --> Follower: Discovers higher term
Leader --> Leader: Sends periodic heartbeats\n(interval much shorter than the\n150-300ms election timeout)
state Leader {
[*] --> AppendEntries_RPC
AppendEntries_RPC --> Commit: Majority acknowledgment
Commit --> [*]
}
Raft log commitment: An entry is committed when the leader receives acknowledgment from a majority of nodes. Only then does the leader reply to the client. Uncommitted entries may be overwritten by a new leader — committed entries never are.
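A sketch of the leader's commit-index rule, following the Raft paper's restriction that only entries from the current term are committed by counting replicas (function and parameter names are illustrative):

```python
def advance_commit_index(match_index, commit_index, log_terms, current_term):
    """match_index: highest replicated index per follower (leader counts itself).
    log_terms[i]: term of log entry i+1 (indices are 1-based in Raft).
    Returns the new commit index."""
    n_nodes = len(match_index) + 1                   # followers + the leader
    for idx in range(len(log_terms), commit_index, -1):
        replicas = 1 + sum(1 for m in match_index if m >= idx)   # leader + acks
        # only current-term entries are committed by counting replicas
        if replicas * 2 > n_nodes and log_terms[idx - 1] == current_term:
            return idx
    return commit_index


# 3-node cluster: followers replicated up to index 4 and 2, leader term = 3
print(advance_commit_index([4, 2], commit_index=1,
                           log_terms=[1, 1, 3, 3], current_term=3))   # -> 4
```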
7. Failure Detectors: The Theory Behind Heartbeats¶
Chandra and Toueg (1996) formalized failure detectors as oracle modules:
flowchart TD
subgraph CLASSES["Failure Detector Classes (by two properties)"]
subgraph COMPLETENESS["Completeness\n(crashed processes are eventually suspected by correct processes)"]
STRONG["Strong: every crashed process\neventually suspected by ALL correct processes"]
WEAK["Weak: every crashed process\neventually suspected by SOME correct process"]
end
subgraph ACCURACY["Accuracy\n(how often correct processes are wrongly suspected)"]
STRONG_ACC["Strong: no correct process ever suspected"]
WEAK_ACC["Weak: some correct process is never suspected"]
EVENTUAL_S["Eventual Strong: after some time,\nno correct process ever suspected"]
EVENTUAL_W["Eventual Weak: after some time,\nsome correct process never suspected"]
end
end
STRONG --> PERFECT["Perfect Failure Detector\n(Strong Completeness + Strong Accuracy)\nRequires synchrony"]
EVENTUAL_S --> EVENTUALLY_PERFECT["Eventually Perfect ◇P\n(Strong Completeness +\nEventual Strong Accuracy)\nSufficient for consensus!"]
EVENTUAL_W --> EVENTUALLY_WEAK["Eventually Weak ◇W\n(weakest detector for consensus)"]
Theorem (Chandra-Toueg): Consensus is solvable with failure detector class ◇W (eventually weak) even in asynchronous systems. This justifies why Zookeeper, etcd, and Raft can solve consensus despite the FLP impossibility — they use timeouts (implementing ◇P) and accept false suspicions temporarily.
Phi Accrual Failure Detector (used in Akka, Cassandra)¶
Instead of binary "alive/dead", outputs a suspicion level φ:
flowchart TD
HEARTBEATS["Inter-arrival intervals of heartbeats\n[t₁, t₂, t₃, ..., tₙ]"] --> STATS["Compute distribution:\nmean μ, std dev σ\n(exponential distribution assumed)"]
STATS --> PHI["φ(t) = -log₁₀(P(T > t))\nwhere t = time since last heartbeat,\nT ~ Exponential(1/μ)"]
PHI -->|"φ < threshold (e.g. 8)"| ALIVE["Process considered ALIVE"]
PHI -->|"φ ≥ threshold"| SUSPECTED["Process SUSPECTED\n(application decides action)"]
NOTE["φ = 8 → P(false suspicion) ≈ 10⁻⁸\nφ = 10 → P(false suspicion) ≈ 10⁻¹⁰\nAdjust threshold for network conditions"]
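A sketch of the φ computation under the exponential model described above (function names are illustrative):

```python
import math

def phi(time_since_last_heartbeat, intervals):
    """Suspicion level assuming exponential inter-arrival times.

    intervals: recent observed heartbeat inter-arrival times (seconds).
    P(T > t) = exp(-t / mean)  =>  phi = -log10(P) = t / (mean * ln(10)).
    """
    mean = sum(intervals) / len(intervals)
    p_later = math.exp(-time_since_last_heartbeat / mean)
    return -math.log10(p_later) if p_later > 0 else float("inf")


intervals = [1.0, 1.1, 0.9, 1.0, 1.0]   # heartbeats roughly every second
print(round(phi(0.5, intervals), 2))     # small phi: almost certainly alive
print(round(phi(20.0, intervals), 2))    # phi > 8: suspect the process
```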
8. Distributed Transactions: 2PC and 3PC Internals¶
Two-Phase Commit (2PC): Blocking Protocol¶
sequenceDiagram
participant COORD as Coordinator
participant P1 as Participant 1
participant P2 as Participant 2
participant P3 as Participant 3
Note over COORD: Phase 1: Voting
COORD->>P1: PREPARE (canCommit?)
COORD->>P2: PREPARE
COORD->>P3: PREPARE
P1-->>COORD: YES (logged PREPARED to WAL)
P2-->>COORD: YES
P3-->>COORD: NO (e.g., constraint violation)
Note over COORD: Any NO → ABORT
COORD->>P1: ABORT
COORD->>P2: ABORT
COORD->>P3: ABORT
Note over COORD,P3: BLOCKING SCENARIO:\nIf coordinator crashes after\nparticipants voted YES but before\nsending COMMIT/ABORT,\nparticipants are BLOCKED waiting forever
2PC blocking problem: A participant that has voted YES is uncertain — it cannot unilaterally abort (another participant may have committed) nor commit (another may have aborted). It must wait for the coordinator to recover.
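A sketch of the coordinator's Phase 1 decision rule, with a comment marking where the blocking window lies (names are illustrative):

```python
def coordinator_decide(votes):
    """votes: dict participant -> 'YES' or 'NO' collected in Phase 1.
    A single NO (or a missing vote after timeout) forces ABORT."""
    if votes and all(v == "YES" for v in votes.values()):
        return "COMMIT"
    return "ABORT"


votes = {"P1": "YES", "P2": "YES", "P3": "NO"}
print(coordinator_decide(votes))          # -> 'ABORT'

# Blocking window: if the coordinator crashes *after* participants logged
# their YES votes but *before* this decision is delivered, a YES-voter can
# neither commit nor abort on its own and must wait for coordinator recovery.
```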
Three-Phase Commit (3PC): Non-Blocking¶
3PC adds a pre-commit phase that allows participants to detect coordinator failure and safely commit:
sequenceDiagram
participant COORD as Coordinator
participant P1 as Participant 1
participant P2 as Participant 2
COORD->>P1: PREPARE
COORD->>P2: PREPARE
P1-->>COORD: VOTE_COMMIT
P2-->>COORD: VOTE_COMMIT
COORD->>P1: PRE-COMMIT ← NEW PHASE
COORD->>P2: PRE-COMMIT
P1-->>COORD: ACK
P2-->>COORD: ACK
COORD->>P1: COMMIT
COORD->>P2: COMMIT
Note over P1,P2: If coordinator crashes after PRE-COMMIT:\nnew coordinator queries participants.\nAll saw PRE-COMMIT → safe to COMMIT\n(all voted YES, none can abort)
Non-blocking guarantee: 3PC is non-blocking only in a synchronous system with reliable failure detection and no network partitions; it tolerates crash failures of participants and the coordinator, but a partition can still drive the two sides to different decisions. Tolerating partitions requires quorum-based consensus (Paxos/Raft). 3PC's "pre-commit" phase means: "I know everyone voted YES, so commit is safe."
9. Peer-to-Peer Overlay Networks: Chord DHT Internals¶
Consistent Hashing Ring¶
flowchart LR
subgraph RING["Chord Ring (m=6 bits, 2⁶=64 positions)"]
N0["Node 0\n(holds keys 57-0)"]
N14["Node 14\n(holds keys 1-14)"]
N32["Node 32\n(holds keys 15-32)"]
N45["Node 45\n(holds keys 33-45)"]
N51["Node 51\n(holds keys 46-51)"]
N57["Node 57\n(holds keys 52-57)"]
end
N0 --> N14 --> N32 --> N45 --> N51 --> N57 --> N0
Chord Finger Table: O(log N) Lookup¶
Each node n maintains a finger table where finger[i] = successor((n + 2^(i-1)) mod 2^m):
flowchart TD
subgraph FT_N0["Finger Table of Node 0 (m=6)"]
F1["finger[1] = succ(0+1) = N14"]
F2["finger[2] = succ(0+2) = N14"]
F3["finger[3] = succ(0+4) = N14"]
F4["finger[4] = succ(0+8) = N14"]
F5["finger[5] = succ(0+16) = N32"]
F6["finger[6] = succ(0+32) = N32"]
end
subgraph LOOKUP["Lookup key k=40 from N0"]
STEP1["N0: closest preceding finger of key 40:\nfinger[6]=N32, and 32 ∈ (0, 40) → use N32"]
STEP2["N0 forwards to N32"]
STEP3["N32: 40 ∈ (32, successor(32)=45]\n→ successor N45 is responsible"]
STEP4["N32 forwards to N45"]
STEP5["N45: 40 ∈ (32, 45] → N45 is responsible. FOUND."]
end
FT_N0 --> LOOKUP
Lookup complexity: O(log N) hops with O(log N) distinct finger-table entries per node. A node join or leave requires O(log² N) messages to repair the affected finger tables.
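A sketch of finger-table construction and iterative lookup on the 6-bit ring above, with nodes {0, 14, 32, 45, 51, 57} (helper names are illustrative):

```python
M = 6
RING = 1 << M                                # 64 identifiers
NODES = sorted([0, 14, 32, 45, 51, 57])

def successor(ident):
    """First node at or after `ident`, wrapping around the ring."""
    ident %= RING
    return next((n for n in NODES if n >= ident), NODES[0])

def between(x, a, b, inclusive_right=False):
    """Is x in the circular interval (a, b) or (a, b]?"""
    x, a, b = x % RING, a % RING, b % RING
    if a < b:
        return a < x <= b if inclusive_right else a < x < b
    return x > a or (x <= b if inclusive_right else x < b)

def fingers(n):
    """finger[i] = successor((n + 2^(i-1)) mod 2^m), i = 1..m."""
    return [successor(n + (1 << i)) for i in range(M)]

def find_successor(n, key, path=None):
    path = (path or []) + [n]
    succ = successor(n + 1)                  # n's immediate successor
    if between(key, n, succ, inclusive_right=True):
        return succ, path + [succ]
    # otherwise jump to the closest preceding finger of `key`
    nxt = next((f for f in reversed(fingers(n))
                if f != n and between(f, n, key)), succ)
    return find_successor(nxt, key, path)

print(fingers(0))                # [14, 14, 14, 14, 32, 32]
print(find_successor(0, 40))     # (45, [0, 32, 45]): N0 -> N32 -> N45 holds key 40
```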
10. Spanning Trees and Broadcast/Convergecast¶
Distributed algorithms for information dissemination rely on spanning trees — tree overlays connecting all processes with no cycles.
sequenceDiagram
participant ROOT as Root (P1)
participant P2 as P2
participant P3 as P3
participant P4 as P4 (leaf)
Note over ROOT: Broadcast: root sends message
ROOT->>P2: FLOOD(msg, P1)
ROOT->>P3: FLOOD(msg, P1)
P2->>P4: FLOOD(msg, P1) [P2 is P4's parent in tree]
Note over P4: Received — leaf node, no children
Note over P4,ROOT: Convergecast: aggregate result back up
P4-->>P2: RESULT(partial: subtree of P4)
P3-->>ROOT: RESULT(partial: subtree of P3)
P2-->>ROOT: RESULT(partial: subtree of P2 + P4)
Note over ROOT: Combine all results → global aggregate (min/max/sum)
Message complexity: Broadcast = O(n) messages on the tree. Convergecast = O(n) messages. Building the spanning tree = O(m + n log n) messages (Gallager-Humblet-Spira (GHS) MST algorithm).
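A sketch of broadcast and convergecast over the spanning tree from the diagram, aggregating a sum at the root (the tree structure and local values are illustrative):

```python
# Spanning tree from the diagram: P1 is the root, P4 hangs off P2.
CHILDREN = {"P1": ["P2", "P3"], "P2": ["P4"], "P3": [], "P4": []}
LOCAL_VALUE = {"P1": 3, "P2": 7, "P3": 2, "P4": 5}

def broadcast(node, msg):
    """Flood `msg` down the tree: one message per tree edge, O(n) total."""
    print(f"{node} received {msg!r}")
    for child in CHILDREN[node]:
        broadcast(child, msg)

def convergecast(node):
    """Aggregate subtree sums back up: leaves answer first, root combines all."""
    return LOCAL_VALUE[node] + sum(convergecast(c) for c in CHILDREN[node])

broadcast("P1", "compute-sum")
print(convergecast("P1"))        # 17: global aggregate collected at the root
```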
11. Byzantine Fault Tolerance: Dealing with Liars¶
In the Byzantine Generals Problem, up to f processes may send arbitrary/malicious messages. The system can tolerate Byzantine failures only if n ≥ 3f + 1.
sequenceDiagram
participant G as General (Commander)
participant L1 as Lieutenant 1 (loyal)
participant L2 as Lieutenant 2 (loyal)
participant T as Traitor
G->>L1: ATTACK
G->>L2: ATTACK
G->>T: ATTACK
T->>L1: (claims General said) RETREAT
T->>L2: (claims General said) ATTACK
Note over L1,L2: In OM(1) the lieutenants also relay the order they received to each other
L1->>L2: ATTACK (relaying General's order)
L2->>L1: ATTACK (relaying General's order)
Note over L1: Received: ATTACK (from G), ATTACK (from L2), RETREAT (from T)\nmajority(ATTACK, ATTACK, RETREAT) = ATTACK
Note over L1,L2: With n=4, f=1: n ≥ 3(1)+1=4 ✓\nAlgorithm: OM(m) runs m+1 rounds\nwhere m = number of traitors\nEach loyal Lt eventually decides on majority vote
Why n ≥ 3f + 1: With only n = 3f, the f traitors can present conflicting values so that loyal generals cannot tell which side is lying, and different loyal generals may decide differently. With n ≥ 3f + 1 there are at least 2f + 1 loyal generals, so the loyal majority outvotes the traitors in every round of OM(m).
Practical BFT: PBFT Protocol¶
sequenceDiagram
participant CLIENT as Client
participant PRIMARY as Primary Replica
participant R1 as Replica 1
participant R2 as Replica 2
participant R3 as Replica 3 (faulty)
CLIENT->>PRIMARY: Request(op)
Note over PRIMARY: Phase 1: Pre-prepare
PRIMARY->>R1: PRE-PREPARE(v, n, digest(m))
PRIMARY->>R2: PRE-PREPARE(v, n, digest(m))
PRIMARY->>R3: PRE-PREPARE(v, n, digest(m))
Note over R1,R3: Phase 2: Prepare (multicast to all)
R1->>R2: PREPARE(v, n, digest, i=1)
R2->>R1: PREPARE(v, n, digest, i=2)
R3--xR1: PREPARE (faulty: wrong digest)
Note over R1: Received 2f PREPARE msgs → PREPARED
Note over R1,R3: Phase 3: Commit
R1->>R2: COMMIT(v, n, digest, i=1)
R2->>R1: COMMIT(v, n, digest, i=2)
Note over R1: 2f+1 COMMIT msgs → EXECUTE and REPLY to client
R1-->>CLIENT: Reply(result)
R2-->>CLIENT: Reply(result)
Note over CLIENT: Accept reply after f+1 identical replies\n(guarantees at least one reply from honest replica)
PBFT complexity: O(n²) messages per request (all replicas multicast to all). This limits PBFT to small replica counts (~dozens). Modern systems like HotStuff reduce this to O(n) via threshold signatures.
12. Distributed Shared Memory and Memory Consistency Models¶
flowchart TD
subgraph MODELS["Memory Consistency Models (weakest to strongest)"]
EVENTUAL["Eventual Consistency\nReplicas diverge temporarily\n→ eventually converge (DNS, DynamoDB)"]
CAUSAL["Causal Consistency\nCausally related writes seen in order\nConcurrent writes may differ (COPS)"]
SEQUENTIAL["Sequential Consistency\n(Lamport)\nAll processes see same total order\nNot necessarily real-time"]
LINEARIZABLE["Linearizability\n(Herlihy-Wing)\nSequential + real-time order\netcd, Zookeeper, Spanner"]
end
EVENTUAL --> CAUSAL --> SEQUENTIAL --> LINEARIZABLE
LINEARIZABLE -->|"Higher cost\n(requires coordination)"| PERF["Lower Performance"]
EVENTUAL -->|"Lower cost\n(async replication)"| PERF2["Higher Performance"]
CAP Theorem Internals¶
flowchart TD
CAP["CAP Theorem:\nIn the presence of network Partition,\nchoose Consistency OR Availability\n(but not both)"]
subgraph CP["CP Systems (Consistent + Partition-tolerant)"]
ETCd["etcd/ZooKeeper\n(Raft/Paxos)\nRejects writes if no quorum\n→ Unavailable during partition"]
SPANNER["Google Spanner\n(TrueTime + Paxos)\nStrongly consistent,\nbut can't serve during partition"]
end
subgraph AP["AP Systems (Available + Partition-tolerant)"]
CASSANDRA["Apache Cassandra\n(Eventual consistency)\nAlways accepts writes\nReads may see stale data"]
DYNAMO["Amazon Dynamo / DynamoDB\n(tunable consistency)\nSloppy quorums allow temporary divergence"]
end
CAP --> CP
CAP --> AP
13. Complete Data Flow: Message Through a Distributed System¶
sequenceDiagram
participant CLIENT as Client
participant LB as Load Balancer
participant SVC_A as Service A (replica 1)
participant SVC_A2 as Service A (replica 2)
participant DB_COORD as DB Coordinator (Paxos leader)
participant DB_F1 as DB Follower 1
participant DB_F2 as DB Follower 2
CLIENT->>LB: HTTP Request
LB->>SVC_A: Forward (round-robin)
SVC_A->>DB_COORD: BEGIN TRANSACTION\nWRITE(key=k, val=v)
DB_COORD->>DB_F1: AppendEntries(log entry)
DB_COORD->>DB_F2: AppendEntries(log entry)
DB_F1-->>DB_COORD: ACK (quorum: 2/3)
DB_COORD->>DB_COORD: Commit (mark durable in WAL)
DB_COORD-->>SVC_A: COMMIT OK (with vector clock VC=[5,3,2])
SVC_A-->>CLIENT: 200 OK
Note over SVC_A2: Lagging replica\nVC=[5,2,2] (not yet seen entry)
CLIENT->>LB: Read same key
LB->>SVC_A2: Forward
SVC_A2->>DB_F2: READ(k)
DB_F2-->>SVC_A2: Old value (DB_F2 not yet applied commit)
Note over SVC_A2: Stale read — AP/eventual consistency\nFor strong consistency: SVC_A2 must\nread from leader (DB_COORD)
Summary: Complexity Landscape¶
block-beta
columns 3
H1["Algorithm"]:1 H2["Messages per Operation"]:1 H3["Synchronization Delay"]:1
R1["Lamport ME"]:1 M1["3(N-1)"]:1 D1["T (one round trip)"]:1
R2["Ricart-Agrawala"]:1 M2["2(N-1)"]:1 D2["T"]:1
R3["Maekawa Quorum"]:1 M3["5√N (with deadlock resolution)"]:1 D3["2T"]:1
R4["Paxos Consensus"]:1 M4["4(N-1) typical"]:1 D4["2 round trips"]:1
R5["PBFT Byzantine"]:1 M5["O(N²)"]:1 D5["3 round trips"]:1
R6["Chord Lookup"]:1 M6["O(log N) hops"]:1 D6["O(log N) × T"]:1
R7["Chandy-Lamport Snapshot"]:1 M7["O(e) — one per channel"]:1 D7["Non-blocking"]:1
The fundamental tradeoff in distributed systems: stronger guarantees (consistency, mutual exclusion, consensus) require more messages and higher latency. Every real system (etcd, Kafka, Cassandra, ZooKeeper) is a carefully engineered point in this design space, trading correctness strength against performance.