Microservices Internals: Under the Hood¶
Source synthesis: Microservices reference books (comp 107, 112, 147–151, 163–164, 370) covering service mesh, API gateway, inter-service communication, distributed tracing, service discovery, and resilience patterns.
1. Service Mesh Architecture — Data Plane vs Control Plane¶
flowchart TD
subgraph ControlPlane["Control Plane (Istio / Linkerd)"]
Pilot["Pilot / istiod\n- xDS server (ADS)\n- watches k8s API\n- computes Envoy config\n- pushes LDS/RDS/CDS/EDS"]
Citadel["Citadel / SPIFFE\n- issues SVID x509 certs\n- cert rotation every 24h\n- SPIFFE ID: spiffe://cluster.local/ns/default/sa/myapp"]
Galley["Galley / config validator\n- validates VirtualService\n- MeshConfig\n- DestinationRule"]
end
subgraph DataPlane["Data Plane (Envoy sidecars)"]
App1["Service A\n:8080"]
Sidecar1["Envoy Proxy\n(iptables redirect\nall traffic through :15001)"]
App2["Service B\n:8080"]
Sidecar2["Envoy Proxy"]
end
Pilot -->|"xDS push (gRPC stream)"| Sidecar1
Pilot -->|"xDS push"| Sidecar2
Citadel -->|"mTLS cert"| Sidecar1
Citadel -->|"mTLS cert"| Sidecar2
App1 <-->|"loopback"| Sidecar1
Sidecar1 <-->|"mTLS\nHTTP/2 / gRPC"| Sidecar2
Sidecar2 <-->|"loopback"| App2
iptables Traffic Interception¶
flowchart LR
subgraph Pod Netns
App["App process\n:8080"]
Envoy["Envoy\n:15001 (outbound)\n:15006 (inbound)"]
IP["iptables rules\n(injected by istio-init)"]
end
Out["Outbound call to 10.244.1.7:8080"]
App -->|"connect() → 10.244.1.7:8080"| IP
IP -->|"REDIRECT --to-port 15001\n(OUTPUT chain, ISTIO_OUTPUT)"| Envoy
Envoy -->|"original dst via SO_ORIGINAL_DST\n→ route decision\n→ upstream TLS"| Out
In["Inbound from 10.244.1.5"]
In -->|"PREROUTING: REDIRECT --to-port 15006"| Envoy
Envoy -->|"policy check + telemetry\n→ forward to :8080"| App
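The SO_ORIGINAL_DST getsockopt call returns the pre-REDIRECT destination as a raw `sockaddr_in`. A minimal Python sketch of decoding that 16-byte structure — the `parse_original_dst` helper is hypothetical; Envoy does this in C++ inside its original-destination listener filter:

```python
import socket
import struct

def parse_original_dst(raw: bytes) -> tuple:
    """Decode the sockaddr_in returned by getsockopt(SOL_IP, SO_ORIGINAL_DST).

    Layout: sin_family (u16, host byte order), sin_port (u16, network order),
    sin_addr (4 bytes), then 8 bytes of zero padding.
    """
    family = struct.unpack_from("=H", raw, 0)[0]
    assert family == socket.AF_INET, "sketch handles IPv4 only"
    port = struct.unpack_from("!H", raw, 2)[0]     # network byte order
    ip = socket.inet_ntoa(raw[4:8])
    return ip, port
```

After REDIRECT, `getsockname()` on the accepted socket shows only the proxy port (15001); SO_ORIGINAL_DST is the one place the address the app actually dialed survives.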
2. Envoy xDS — Dynamic Configuration Protocol¶
sequenceDiagram
participant Envoy
participant istiod as istiod (xDS server)
participant K8s as Kubernetes API
K8s-->>istiod: Service/Endpoints/VirtualService watch events
istiod->>istiod: compute xDS config snapshot
Envoy->>istiod: DiscoveryRequest{node_id, resource_names, version_info}
istiod-->>Envoy: DiscoveryResponse{version, resources:[LDS listeners]}
Envoy-->>istiod: ACK (version matches)
Note over Envoy: LDS: Listeners (ports to bind)
Note over Envoy: RDS: Route configs (Host+Path → Cluster)
Note over Envoy: CDS: Cluster configs (load balancing policy, circuit breaker)
Note over Envoy: EDS: Endpoint addresses (pod IPs + weights + health)
istiod-->>Envoy: CDS push (new cluster added)
Envoy-->>istiod: ACK
istiod-->>Envoy: EDS push (pod IP changed)
Envoy-->>istiod: ACK
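The ACK/NACK bookkeeping above can be sketched as a tiny client-side state machine. This is a hypothetical simplification of the state-of-the-world variant (real Envoy also tracks a per-stream nonce and per-type-URL versions):

```python
class XdsState:
    """Tracks the last successfully applied config version."""

    def __init__(self):
        self.applied_version = ""  # empty until the first successful apply

    def on_discovery_response(self, version: str, resources, apply):
        try:
            apply(resources)  # validate + atomically swap in the new config
        except Exception as err:
            # NACK: keep serving the previous config, echo the last good version
            return {"version_info": self.applied_version, "error_detail": str(err)}
        self.applied_version = version
        return {"version_info": version}  # ACK
```

On NACK the server knows which version the proxy is still running, so it can retry or roll back without the data plane ever serving a broken config.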
3. API Gateway — Request Processing Pipeline¶
flowchart TD
Client["Client\n(mobile / browser)"]
GW["API Gateway\n(Kong / AWS API GW / Nginx)"]
subgraph Gateway Pipeline
TLS_Term["TLS Termination\n(certificate at edge)"]
Auth["Authentication\n- JWT validation (RS256 pubkey)\n- API key lookup (hash → secret store)\n- OAuth2 token introspection"]
RateLimit["Rate Limiting\n- token bucket per key (Redis)\n- sliding window counter\n- 429 Too Many Requests"]
Transform["Request Transform\n- header injection (X-User-Id)\n- path rewrite (/v1/users → /users)\n- body schema validation"]
Route["Routing\n- path prefix match\n- host-based routing\n- canary weight split"]
LB["Load Balancing\n- round-robin / least-conn\n- health check (active probes)\n- circuit breaker"]
Upstream["Upstream Services\n(microservices)"]
Cache["Response Cache\n(CDN / Varnish / Redis)\nCache-Control headers"]
end
Client --> GW --> TLS_Term --> Auth --> RateLimit --> Transform --> Route --> LB --> Upstream
Upstream -->|"response"| Cache -->|"cached or passthrough"| Client
Token Bucket Rate Limiter Internals¶
flowchart LR
subgraph Redis Token Bucket
Key["key: ratelimit:{api_key}\nfields:\n tokens: 95.0\n last_refill: 1709123456789"]
Refill["Refill:\ntokens += rate × (now - last_refill)\ntokens = min(tokens, capacity)"]
Consume["Consume:\nif tokens >= 1:\n tokens -= 1 → ALLOW\nelse:\n → 429 DENY"]
Script["Lua script (atomic EVAL)\n→ no race condition\n→ single RTT to Redis"]
end
Request -->|"EVAL lua, key"| Script
Script --> Refill --> Consume
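The refill-then-consume arithmetic from the Lua script can be expressed in a few lines of Python. A minimal in-process sketch (the Redis version runs the same math atomically inside EVAL; the class name and injectable clock are illustrative):

```python
class TokenBucket:
    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate               # tokens added per second
        self.capacity = capacity
        self.tokens = float(capacity)  # start full
        self.last_refill = now

    def allow(self, now: float) -> bool:
        # Refill: lazily add tokens for elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + self.rate * (now - self.last_refill))
        self.last_refill = now
        # Consume: one token per request, else deny (maps to HTTP 429)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

With rate=1 and capacity=2, two back-to-back requests pass, the third is denied, and after 1.5 s of refill a fourth passes again.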
4. Service Discovery — Client-Side vs Server-Side¶
flowchart TD
subgraph CSD["Client-Side Discovery (Eureka / Consul)"]
SvcA["Service A"]
Registry["Service Registry\n(Consul / Eureka)\nhealth-checked store\nof {name → [ip:port]}"]
SvcB_1["Service B instance 1\n10.244.1.5:8080"]
SvcB_2["Service B instance 2\n10.244.1.7:8080"]
LB_Client["Client-side LB\n(Ribbon / gRPC client LB)\nround-robin / p2c"]
SvcA -->|"1. lookup service-b"| Registry
Registry -->|"2. return [10.244.1.5, 10.244.1.7]"| SvcA
SvcA --> LB_Client
LB_Client -->|"3. pick instance"| SvcB_1
end
subgraph SSD["Server-Side Discovery (Kubernetes Service)"]
SvcC["Service C"]
ClusterIP["ClusterIP 10.96.0.10:80\n(kube-proxy iptables DNAT)"]
SvcD_1["Service D pod 1"]
SvcD_2["Service D pod 2"]
SvcC -->|"connect ClusterIP"| ClusterIP
ClusterIP -->|"random DNAT"| SvcD_1 & SvcD_2
end
subgraph Consul Internal
Agent["consul agent\n(local sidecar)"]
Server["consul server\n(Raft cluster)"]
HCheck["health check:\nHTTP GET /health → 200?\nTCP connect?\nScript output?"]
Agent -->|"gossip protocol\n(SWIM)\nfailure detection"| Server
Agent --> HCheck
end
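Client-side discovery is a registry lookup plus a local pick policy. A minimal round-robin sketch over a registry snapshot (the registry dict and service names are illustrative; real clients like Ribbon also filter by health and re-resolve on change):

```python
import itertools

class ClientSideLB:
    """Round-robin over instances returned by a service-registry lookup."""

    def __init__(self, registry: dict):
        self.registry = registry   # {service_name: ["ip:port", ...]}
        self._cursors = {}         # per-service round-robin position

    def pick(self, service: str) -> str:
        instances = self.registry[service]                 # steps 1+2: lookup
        cursor = self._cursors.setdefault(
            service, itertools.cycle(range(len(instances))))
        return instances[next(cursor)]                     # step 3: pick

registry = {"service-b": ["10.244.1.5:8080", "10.244.1.7:8080"]}
lb = ClientSideLB(registry)
```

The sketch assumes the instance list is stable between picks; a production client invalidates its cursor when the registry watch delivers a membership change.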
5. gRPC Internals — Transport & Protocol Buffers¶
flowchart TD
subgraph gRPC Stack
AppCode["Application Code\ngrpc.Dial() + stub.Method()"]
Stub["Generated Stub\n(protoc-gen-go/grpc)"]
Channel["gRPC Channel\n- connection pool\n- load balancing policy\n- name resolver (DNS/xDS)"]
HTTP2["HTTP/2 Transport\n- multiplexed streams\n- header compression (HPACK)\n- flow control (per-stream + connection)\n- stream ID (odd=client)"]
TLS["TLS 1.3\n(or plaintext h2c)"]
TCP["TCP Socket"]
end
subgraph Protobuf Encoding
Msg["Message{id: 1, name: 'Alice', score: 99.5}"]
Enc["Wire format:\n08 01 — field 1, varint, value 1\n12 05 41 6c 69 63 65 — field 2, len, 'Alice'\n1d 00 00 c7 42 — field 3, fixed32, 99.5"]
Note1["Tag = (field_number << 3) | wire_type\nVarint: base-128, LSB first, MSB=continuation\nNo field names, no nulls — extremely compact"]
end
AppCode --> Stub --> Channel --> HTTP2 --> TLS --> TCP
Msg --> Enc
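The byte sequence above can be reproduced by hand in ~15 lines of Python, with no protobuf library involved. Field numbers and values match the example message:

```python
import struct

def varint(n: int) -> bytes:
    """Base-128 varint: 7 bits per byte, LSB group first, MSB = continuation."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def tag(field_number: int, wire_type: int) -> bytes:
    return varint((field_number << 3) | wire_type)

# Message{id: 1, name: "Alice", score: 99.5}
msg = (
    tag(1, 0) + varint(1)                    # 08 01           — varint
    + tag(2, 2) + varint(5) + b"Alice"       # 12 05 41 6c ... — length-delimited
    + tag(3, 5) + struct.pack("<f", 99.5)    # 1d 00 00 c7 42  — fixed32 (LE float)
)
```

`msg.hex()` yields exactly the wire bytes shown in the diagram, which is a handy way to sanity-check hand-decoded dumps.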
gRPC Streaming — Backpressure Flow¶
sequenceDiagram
participant Client
participant H2C as HTTP/2 Connection
participant Server
Client->>H2C: SETTINGS (initial_window_size=65535)
Server->>H2C: SETTINGS (initial_window_size=65535)
Client->>H2C: HEADERS frame (stream_id=1, :path=/svc/Method)
Client->>H2C: DATA frame (stream_id=1, payload=1000B)
Note over H2C: client window -= 1000 (64535 remaining)
Server->>H2C: WINDOW_UPDATE (stream_id=1, increment=1000)
Note over H2C: client window restored → can send more
Server->>H2C: DATA frame (response chunk)
Server->>H2C: DATA frame (response chunk)
Server->>H2C: HEADERS frame (END_STREAM, grpc-status=0)
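The window accounting in the sequence above is plain per-stream arithmetic. A simplified sketch — real HTTP/2 also tracks a connection-level window and caps windows at 2^31−1:

```python
class StreamFlowControl:
    DEFAULT_INITIAL_WINDOW = 65535  # SETTINGS_INITIAL_WINDOW_SIZE default (RFC 9113)

    def __init__(self, initial: int = DEFAULT_INITIAL_WINDOW):
        self.send_window = initial

    def can_send(self, nbytes: int) -> bool:
        return nbytes <= self.send_window

    def on_data_sent(self, nbytes: int):
        if not self.can_send(nbytes):
            # Backpressure: sender must park until a WINDOW_UPDATE arrives
            raise RuntimeError("flow-control blocked: wait for WINDOW_UPDATE")
        self.send_window -= nbytes

    def on_window_update(self, increment: int):
        self.send_window += increment
```

This is the mechanism gRPC streaming rides on: a slow receiver simply stops issuing WINDOW_UPDATE frames and the sender stalls without buffering unboundedly.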
6. Distributed Tracing — OpenTelemetry Internals¶
flowchart TD
subgraph Trace Propagation
Req["HTTP Request\nW3C Trace Context headers:\ntraceparent: 00-{traceId}-{spanId}-01\ntracestate: vendor-specific"]
SvcA["Service A\n- extract context\n- start span (spanId=aaaa)\n- inject into outbound headers"]
SvcB["Service B\n- extract parent spanId=aaaa\n- start child span (spanId=bbbb)\n- record attributes+events"]
SvcC["Service C\n- child span (spanId=cccc)"]
Req --> SvcA -->|"HTTP with traceparent"| SvcB -->|"gRPC metadata"| SvcC
end
subgraph OTLP Export Pipeline
SDK["OTel SDK\n- Tracer → start/end spans\n- SpanProcessor (BatchSpanProcessor)\n- in-memory ring buffer"]
Collector["OTel Collector\n- receives OTLP (gRPC/HTTP)\n- tail sampling processor\n- batch exporter"]
Backend["Jaeger / Zipkin / Tempo\n- trace storage\n- span index\n- dependency graph"]
SDK -->|"OTLP gRPC (async batch)"| Collector
Collector -->|"Jaeger Thrift / OTLP"| Backend
end
subgraph Span Data Model
Span["Span {\n traceId: 128-bit\n spanId: 64-bit\n parentSpanId: 64-bit\n name: 'GET /users'\n kind: CLIENT/SERVER/PRODUCER/CONSUMER\n startTime, endTime (UnixNano)\n attributes: {http.method, http.status_code}\n events: [{name, timestamp, attrs}]\n status: OK / ERROR\n}"]
end
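Context propagation boils down to parsing and re-serializing the traceparent header at each hop. A minimal sketch of W3C Trace Context extract/inject (helper names are illustrative; real SDKs also reject all-zero trace/span IDs, and the flags byte is a bit mask of which the low bit means "sampled"):

```python
import re

_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def extract(headers: dict):
    m = _TRACEPARENT.match(headers.get("traceparent", ""))
    if not m:
        return None  # no valid parent → start a new root trace
    trace_id, parent_span_id, flags = m.groups()
    return {"trace_id": trace_id,
            "parent_span_id": parent_span_id,
            "sampled": bool(int(flags, 16) & 0x01)}

def inject(ctx: dict, span_id: str, headers: dict):
    flags = "01" if ctx["sampled"] else "00"
    headers["traceparent"] = f"00-{ctx['trace_id']}-{span_id}-{flags}"
```

Note the asymmetry: the trace_id is carried unchanged end-to-end, while each service swaps in its own span_id before forwarding — that is what links child spans to parents in the backend.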
7. Circuit Breaker — State Machine Internals¶
stateDiagram-v2
[*] --> Closed : initial state
Closed --> Open : failure rate > threshold\n(e.g. 50% of last 10 calls fail)
Open --> HalfOpen : timeout elapsed\n(e.g. 30 seconds)
HalfOpen --> Closed : probe request succeeds
HalfOpen --> Open : probe request fails
note right of Closed
Requests pass through normally
Failure counter incremented on error
Sliding window: last N calls or time window
end note
note right of Open
All requests FAIL FAST immediately
No network calls made
Error returned to caller instantly
end note
note right of HalfOpen
Single probe request allowed
Determines if backend recovered
end note
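The three-state machine can be sketched directly. This is a count-based-window simplification of what libraries like Resilience4j implement — class and exception names are illustrative, and real breakers add half-open permit counts and thread safety:

```python
import time
from collections import deque

class CircuitOpenError(Exception):
    """Raised when the breaker fast-fails without calling the backend."""

class CircuitBreaker:
    def __init__(self, window=10, threshold=0.5, open_timeout=30.0,
                 clock=time.monotonic):
        self.calls = deque(maxlen=window)  # sliding window of outcomes
        self.threshold = threshold
        self.open_timeout = open_timeout
        self.clock = clock
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn):
        now = self.clock()
        if self.state == "OPEN":
            if now - self.opened_at < self.open_timeout:
                raise CircuitOpenError("fast fail")  # no network call made
            self.state = "HALF_OPEN"                 # timeout elapsed → one probe
        try:
            result = fn()
        except Exception:
            self._record(False, now)
            raise
        self._record(True, now)
        return result

    def _record(self, ok, now):
        if self.state == "HALF_OPEN":
            self.state = "CLOSED" if ok else "OPEN"  # probe decides
            self.opened_at = now
            self.calls.clear()
            return
        self.calls.append(ok)
        if len(self.calls) == self.calls.maxlen:
            if self.calls.count(False) / len(self.calls) >= self.threshold:
                self.state = "OPEN"
                self.opened_at = now
```

The injectable clock makes the OPEN→HALF_OPEN timeout testable without sleeping.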
Resilience4j Sliding Window¶
flowchart LR
subgraph CountWin["Count-Based Window (size=10)"]
W["Ring buffer [F,S,F,S,S,F,S,S,S,F]\n(F=fail, S=success)\nfailureRate = count(F)/10 = 40%"]
Threshold["threshold=50% → CLOSED (below threshold)"]
end
subgraph TimeWin["Time-Based Window (5 seconds)"]
T["Epoch buckets (1 per second):\n[t-5: 3F 7S]\n[t-4: 1F 4S]\n[t-3: 5F 2S]\n[t-2: 2F 8S]\n[t-1: 4F 6S]\naggregated failureRate = 15/42 = 36%"]
end
subgraph Bulkhead
B["Semaphore bulkhead:\nmaxConcurrentCalls=10\nmaxWaitDuration=0ms\n→ immediate rejection if saturated\n(isolates one service from starving others)"]
end
8. Saga Pattern — Distributed Transaction Internals¶
sequenceDiagram
participant Orchestrator as Saga Orchestrator
participant Order as Order Service
participant Payment as Payment Service
participant Inventory as Inventory Service
participant Notify as Notification Service
Note over Orchestrator: Orchestration-based Saga (central coordinator sends commands, services reply with events)
Orchestrator->>Order: CreateOrder command
Order-->>Orchestrator: OrderCreated event
Orchestrator->>Payment: ReservePayment command
Payment-->>Orchestrator: PaymentReserved event
Orchestrator->>Inventory: ReserveStock command
Inventory-->>Orchestrator: StockReservationFailed event (out of stock)
Note over Orchestrator: ROLLBACK: compensating transactions
Orchestrator->>Payment: CancelPaymentReservation (compensating)
Payment-->>Orchestrator: PaymentCancelled
Orchestrator->>Order: RejectOrder (compensating)
Order-->>Orchestrator: OrderRejected
Note over Orchestrator: Saga completed (with rollback)
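An orchestrator is essentially a loop that records the compensation for each completed step and replays them in reverse on failure. A minimal sketch matching the sequence above (the lambdas stand in for service calls; a production orchestrator also persists its position so it can resume after a crash):

```python
def run_saga(steps):
    """steps: list of (action, compensation) pairs, executed in order.

    On any failure, runs the compensations of the completed steps in
    reverse. Compensations must be idempotent: they may be retried.
    """
    completed = []
    for action, compensation in steps:
        try:
            action()
        except Exception:
            for comp in reversed(completed):
                comp()
            return False           # saga rolled back
        completed.append(compensation)
    return True                    # saga committed

log = []
def reserve_stock():
    raise RuntimeError("StockReservationFailed")  # out of stock

order_saga = [
    (lambda: log.append("OrderCreated"),    lambda: log.append("OrderRejected")),
    (lambda: log.append("PaymentReserved"), lambda: log.append("PaymentCancelled")),
    (reserve_stock,                         lambda: None),
]
```

Running `run_saga(order_saga)` reproduces the diagram: stock reservation fails, then payment is cancelled and the order rejected, in that order.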
Saga vs 2PC Comparison¶
flowchart LR
subgraph 2PC
C2["Coordinator"]
P1["Participant 1\n(DB lock held\nduring prepare phase)"]
P2["Participant 2\n(DB lock held)"]
C2 -->|"Phase 1: PREPARE"| P1 & P2
P1 & P2 -->|"VOTE_COMMIT"| C2
C2 -->|"Phase 2: COMMIT"| P1 & P2
Note2["Problem: coordinator crash\nduring phase 2 → participants\nblocked forever holding locks"]
end
subgraph Saga
So["Orchestrator (stateful)"]
S1["Service 1: local tx\n(no distributed lock)"]
S2["Service 2: local tx"]
So --> S1 --> S2
Note_s["Eventual consistency\nCompensating txs for rollback\nNo cross-service locks\nAT-LEAST-ONCE delivery via MQ"]
end
9. Event-Driven Microservices — Outbox Pattern¶
flowchart TD
subgraph OrderSvc["Service A (Order)"]
Tx["DB Transaction\n(single local tx)"]
Orders["orders table\nINSERT order_id=123"]
Outbox["outbox table\nINSERT {event_type=OrderCreated,\npayload=JSON,\nstatus=PENDING}"]
Tx --> Orders
Tx --> Outbox
end
subgraph Outbox Relay
Poller["Debezium CDC\n(read WAL/binlog)\nor polling thread\n→ reads PENDING outbox rows"]
MQ["Message Broker\n(Kafka / RabbitMQ)\npublish OrderCreated event"]
Mark["UPDATE outbox SET status=PUBLISHED"]
end
subgraph InvSvc["Service B (Inventory)"]
Consumer["Kafka consumer\nidempotency check:\n(event_id already processed?)"]
Idempotency["processed_events table\n(event_id → UNIQUE constraint)"]
InventoryUpdate["UPDATE inventory\n(reserve stock)"]
end
Outbox --> Poller --> MQ --> Consumer
Consumer --> Idempotency
Idempotency -->|"not seen → process"| InventoryUpdate
Poller --> Mark
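The consumer-side half of the pattern is a dedupe check keyed by event ID. A minimal sketch where a set stands in for the processed_events table — in a real system the insert into that table and the inventory update share one DB transaction, and the UNIQUE constraint rejects replays:

```python
class IdempotentConsumer:
    def __init__(self):
        self.processed_events = set()  # stand-in for a table with UNIQUE(event_id)
        self.reserved = []             # stand-in for inventory updates

    def handle(self, event: dict) -> bool:
        event_id = event["event_id"]
        if event_id in self.processed_events:
            return False               # duplicate (at-least-once redelivery) → skip
        # Production: INSERT event_id + UPDATE inventory in ONE transaction,
        # so a crash between the two cannot split them.
        self.processed_events.add(event_id)
        self.reserved.append(event["payload"]["order_id"])
        return True
```

This is why the outbox relay may publish the same event twice without harm: the broker guarantees delivery, the consumer guarantees single effect.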
10. CQRS — Command Query Responsibility Segregation¶
flowchart LR
subgraph WriteSide["Write Side (Commands)"]
Cmd["Command: CreateOrder{userId, items}"]
Handler["CommandHandler\n- validate business rules\n- apply domain events\n- save to EventStore (append-only)"]
EventStore["Event Store\n(append-only log)\nOrderCreated\nOrderShipped\nOrderCancelled"]
EventBus["Event Bus (Kafka)\n→ fan out to projections"]
end
subgraph ReadSide["Read Side (Queries)"]
Projection1["Order Summary\nProjection\n→ PostgreSQL read model\n(denormalized for fast SELECT)"]
Projection2["User Orders\nProjection\n→ Redis cache\n(precomputed list)"]
Query["Query: GetOrder{orderId}\n→ read from projection DB\n(no event replay needed)"]
end
Cmd --> Handler --> EventStore --> EventBus
EventBus --> Projection1 & Projection2
Query --> Projection1
subgraph Event Sourcing Replay
Replay["Rebuild projection:\nreplay ALL events from EventStore\n→ recompute state\n(snapshot every N events\n→ replay from snapshot)"]
end
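Replay is a fold of events into state; a snapshot just lets the fold start later. A minimal sketch of rebuilding an order-status projection (event shapes are illustrative):

```python
def apply_event(state: dict, event: dict) -> dict:
    """Pure fold step: (state, event) -> new state."""
    order_id = event["order_id"]
    if event["type"] == "OrderCreated":
        state[order_id] = "CREATED"
    elif event["type"] == "OrderShipped":
        state[order_id] = "SHIPPED"
    elif event["type"] == "OrderCancelled":
        state[order_id] = "CANCELLED"
    return state

def rebuild(events: list, snapshot=None) -> dict:
    """Replay from a snapshot {version, state} or from the beginning."""
    state = dict(snapshot["state"]) if snapshot else {}
    start = snapshot["version"] if snapshot else 0
    for event in events[start:]:   # O(events since snapshot), not O(all events)
        apply_event(state, event)
    return state
```

Because `apply_event` is pure and deterministic, replaying from a snapshot at version N is guaranteed to produce the same state as a full replay.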
11. Health Check & Liveness Internals¶
flowchart TD
subgraph Spring Boot Actuator / Kubernetes Probes
Live["Liveness Probe\nGET /actuator/health/liveness\n→ 200 OK: process running\n→ non-200: kubelet restarts container\n(never checks downstream deps!)"]
Ready["Readiness Probe\nGET /actuator/health/readiness\n→ 200 OK: ready to serve traffic\n→ non-200: removed from Service EP\n(checks DB, cache, dependencies)"]
Start["Startup Probe\nGET /actuator/health/startup\n→ disables liveness until first success\n(slow startup apps: avoid false restarts)"]
end
subgraph Health Aggregation
Composite["HealthIndicator tree\nCompositeHealthContributor"]
DB["DataSourceHealthIndicator\nSELECT 1\n→ UP / DOWN"]
Redis_H["RedisHealthIndicator\nPING\n→ UP / DOWN"]
Disk["DiskSpaceHealthIndicator\nfree space check"]
Composite --> DB & Redis_H & Disk
DB -->|"any DOWN → overall DOWN"| Composite
end
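The composite indicator is a fold over child checks in which any DOWN wins. A minimal sketch of the aggregation rule (the function name is illustrative; Spring's CompositeHealthContributor additionally supports pluggable status-ordering rules):

```python
def aggregate_health(indicators: dict) -> dict:
    """indicators: {name: zero-arg callable returning 'UP' or 'DOWN'}."""
    components = {}
    for name, check in indicators.items():
        try:
            components[name] = check()
        except Exception:
            components[name] = "DOWN"   # a throwing check counts as DOWN
    status = "UP" if all(s == "UP" for s in components.values()) else "DOWN"
    return {"status": status, "components": components}
```

Wired into a readiness endpoint, a single DOWN dependency flips the overall status and takes the pod out of the Service endpoints until it recovers.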
12. Service Mesh mTLS — Certificate Lifecycle¶
sequenceDiagram
participant Envoy as Envoy Sidecar
participant Agent as Istio Agent (pilot-agent)
participant istiod as istiod (Citadel)
participant Workload as App Container
Note over Agent: Pod starts → pilot-agent starts first
Agent->>Agent: generate private key (ECDSA P-256)
Agent->>Agent: create CSR (SPIFFE ID in SAN)
Agent->>istiod: gRPC CreateCertificate(CSR)
istiod->>istiod: validate k8s ServiceAccount JWT
istiod->>istiod: sign cert with cluster root CA
istiod-->>Agent: signed SVID cert (24h TTL)
Agent->>Envoy: push cert via SDS (Secret Discovery Service)
Note over Envoy: TLS listener now has cert+key
Agent->>Agent: rotate cert before expiry (~80% of TTL, ≈19h)
Agent->>istiod: re-CSR → new cert pushed via SDS (no downtime, hot swap)
Note over Envoy: mTLS handshake with peer
Envoy->>Envoy: verify peer SVID\n(SPIFFE ID: spiffe://cluster.local/ns/foo/sa/bar)\n→ authorization policy check
13. Performance & Overhead Summary¶
block-beta
columns 2
block:sidecar["Sidecar Proxy Overhead"]:1
s1["CPU: ~0.5–2% per request (Envoy)"]
s2["Memory: ~50–100MB per sidecar"]
s3["Latency: +0.3–1ms p50 (local loopback)"]
s4["p99 latency: +2–5ms (mTLS handshake amortized)"]
end
block:discovery["Service Discovery Latency"]:1
d1["Consul health check: 10s interval default"]
d2["DNS TTL: 5–30s (stale pods visible)"]
d3["k8s Endpoints update: ~1–5s after pod ready"]
d4["xDS push to Envoy: ~1–3s after EP change"]
end
block:grpc["gRPC vs REST"]:1
g1["Protobuf encoding: 3–10x smaller than JSON"]
g2["HTTP/2 multiplexing: 1 TCP conn, N streams"]
g3["gRPC streaming: server push (no polling)"]
g4["gRPC latency: ~10–50% lower than REST/JSON"]
end
block:saga["Saga Overhead"]:1
sa1["Outbox polling: 100ms–1s delay (CDC faster)"]
sa2["Compensating tx: idempotency check O(1) with index"]
sa3["Event store replay: O(events) — snapshot every 100 events"]
sa4["2PC lock hold: entire prepare+commit round trip"]
end
Key Takeaways¶
- Envoy interception uses iptables REDIRECT (not TPROXY) — all traffic routes through localhost ports 15001/15006; SO_ORIGINAL_DST recovers the real destination
- The xDS protocol ships config over gRPC streams — state-of-the-world or incremental (delta-xDS, which sends only changed resources); Envoy ACKs each version and keeps serving the last good config on NACK
- Protobuf varint encoding packs field number + wire type into a single byte for most fields — typical message is 3–10× smaller than equivalent JSON
- Circuit breaker half-open allows exactly one probe request — all others still fast-fail until the probe succeeds
- Saga compensating transactions must be idempotent — the orchestrator may re-send commands on retry (at-least-once delivery from Kafka)
- Outbox pattern makes the DB write and event publish atomic in one local transaction, giving at-least-once delivery; consumer-side idempotency (unique event_id) turns that into exactly-once processing. Debezium CDC tails the WAL/binlog with minimal overhead
- CQRS projections are rebuilt from event store replay — snapshots every N events reduce replay time from O(all events) to O(events since snapshot)