
Cloud & AWS Internals: Hypervisors, Virtual Networks & Managed Services

Under the Hood: How EC2 instances boot on bare metal, how S3 stores objects across failure domains, how VPCs route packets through virtual switches, how Lambda cold starts work — the exact hardware, network, and storage mechanics behind cloud infrastructure.


1. Hypervisor Architecture: EC2 on Nitro

AWS Nitro is a custom hypervisor that offloads I/O and security to dedicated hardware cards rather than to a host OS.

flowchart TD
    subgraph "Traditional Hypervisor (Xen)"
        DOM0["Dom0 (privileged VM)\nRuns host OS\nHandles all I/O\nConsumes 10-20% CPU"]
        DOMX["DomX (guest VM)\nEC2 instance"]
        NET["Network I/O via Dom0\n→ latency + CPU overhead"]
        DOM0 --> DOMX
        DOM0 --> NET
    end
    subgraph "Nitro Hypervisor"
        NH["Nitro Hypervisor\n(bare metal, <2% overhead)\nOnly CPU + memory virtualization"]
        NIC["Nitro Card: Network\nDedicated FPGA/ASIC\nSR-IOV: guest accesses NIC directly"]
        EBS["Nitro Card: EBS\nNVMe over PCIe to storage\nEncryption in hardware"]
        SEC["Nitro Security Chip\nBoot attestation\nTPM-based instance identity"]
        NH --> NIC
        NH --> EBS
        NH --> SEC
    end

VM Boot Sequence on Nitro

sequenceDiagram
    participant PH as Physical Host
    participant NC as Nitro Controller
    participant NH as Nitro Hypervisor
    participant VM as Guest VM (EC2)

    PH->>NC: Provision request (instance type, AMI, network config)
    NC->>NH: Create vCPU+memory allocation
    Note over NH: Allocate EPT (Extended Page Tables)\nfor guest physical→host physical mapping
    NH->>VM: Virtual CPU VMLAUNCH instruction
    Note over VM: Boots via UEFI/SeaBIOS
    Note over VM: Kernel detects Nitro NVMe driver
    VM->>NC: NVMe over PCIe: fetch EBS blocks for root volume
    Note over VM: initrd → systemd → user space
    VM->>NC: ENA driver (SR-IOV): attach to ENI
    NC-->>VM: IP assigned via DHCP (VPC DHCP server)
    Note over VM: Instance ready

2. VPC Network Architecture: Virtual Switches and Routing

flowchart TD
    subgraph "AWS Region: us-east-1"
        subgraph "VPC: 10.0.0.0/16"
            subgraph "AZ-1a"
                PUB["Public Subnet 10.0.1.0/24"]
                PRIV["Private Subnet 10.0.2.0/24"]
                EC2A["EC2: 10.0.1.5"]
                EC2B["EC2: 10.0.2.10"]
                PUB --> EC2A
                PRIV --> EC2B
            end
            IGW["Internet Gateway\n(VPC attachment)"]
            NGW["NAT Gateway\n10.0.1.20 (Elastic IP)"]
            RTB_PUB["Route Table (public):\n0.0.0.0/0 → IGW"]
            RTB_PRIV["Route Table (private):\n0.0.0.0/0 → NGW"]
        end
        IGW -->|EIP| Internet["Internet"]
        EC2A --> PUB --> RTB_PUB --> IGW
        EC2B --> PRIV --> RTB_PRIV --> NGW --> IGW
    end

Packet Flow: EC2 to Internet

sequenceDiagram
    participant EC2 as EC2 10.0.1.5
    participant HyperV as Nitro Hypervisor
    participant VSwitch as Virtual Switch (VPC)
    participant IGW as Internet Gateway
    participant Internet as Internet Host 1.2.3.4

    EC2->>HyperV: Send packet\nsrc=10.0.1.5:44321\ndst=1.2.3.4:443
    Note over HyperV: SG egress check:\n443/TCP allowed?
    Note over HyperV: VPC route lookup:\n0.0.0.0/0 → IGW
    HyperV->>VSwitch: Encapsulate in VxLAN/Nitro overlay\nVNI=vpc-id tunnel to physical host running IGW
    VSwitch->>IGW: Inner packet + VPC metadata
    Note over IGW: SNAT: src=10.0.1.5\n→ src=52.x.x.x (EIP)\nConnection tracked in NAT table
    IGW->>Internet: src=52.x.x.x:44321, dst=1.2.3.4:443
    Internet->>IGW: dst=52.x.x.x:44321
    Note over IGW: DNAT lookup: 52.x.x.x:44321\n→ 10.0.1.5:44321
    IGW->>EC2: dst=10.0.1.5:44321
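
The SNAT/DNAT rewrite in the flow above can be sketched as a 1:1 translation table (the IGW maps one private IP to one Elastic IP without port rewriting). The `Packet` and `NatTable` names are illustrative, not an AWS API:

```python
# Hypothetical sketch of the IGW's 1:1 NAT table. Illustrative only.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Packet:
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int

class NatTable:
    def __init__(self, private_ip: str, elastic_ip: str):
        self.private_ip = private_ip
        self.elastic_ip = elastic_ip

    def snat(self, pkt: Packet) -> Packet:
        # Outbound: rewrite the private source address to the Elastic IP.
        assert pkt.src_ip == self.private_ip
        return replace(pkt, src_ip=self.elastic_ip)

    def dnat(self, pkt: Packet) -> Packet:
        # Inbound: rewrite the Elastic IP destination back to the private IP.
        assert pkt.dst_ip == self.elastic_ip
        return replace(pkt, dst_ip=self.private_ip)

nat = NatTable("10.0.1.5", "52.0.0.1")
out = nat.snat(Packet("10.0.1.5", 44321, "1.2.3.4", 443))
back = nat.dnat(Packet("1.2.3.4", 443, "52.0.0.1", 44321))
```

Because the mapping is one-to-one, ports pass through unchanged, which is why the diagram shows the same source port (44321) on both sides of the gateway.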

Security Groups: Stateful Packet Inspection

stateDiagram-v2
    [*] --> Evaluate_Egress: Outbound packet
    Evaluate_Egress --> Allowed: Rule match (allow)
    Evaluate_Egress --> Dropped: No rule match (default deny)
    Allowed --> ConnTrack: Add to connection tracking table
    ConnTrack --> PassThrough: Return traffic (automatic)\nno inbound rule needed

SGs are stateful: connection state is tracked in the Nitro networking layer (conceptually similar to Linux conntrack), so inbound rules are checked only for new connections, never for the return traffic of tracked flows.
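
The stateful semantics above can be sketched in a few lines: an outbound flow that passes the egress rules is recorded in a tracking table, and return packets matching the reversed 5-tuple are admitted even with no inbound rule. This is an illustrative model, not the actual Nitro implementation:

```python
# Minimal model of stateful security-group behavior.
conntrack = set()
egress_rules = {("tcp", 443)}   # allow outbound HTTPS only
ingress_rules = set()           # no inbound rules at all

def send(proto, src, sport, dst, dport):
    if (proto, dport) not in egress_rules:
        return "DROP"
    conntrack.add((proto, src, sport, dst, dport))  # track the new flow
    return "ALLOW"

def receive(proto, src, sport, dst, dport):
    # Return traffic matches a tracked flow (reversed 5-tuple): allowed.
    if (proto, dst, dport, src, sport) in conntrack:
        return "ALLOW (tracked)"
    if (proto, dport) in ingress_rules:
        return "ALLOW (rule)"
    return "DROP"

send("tcp", "10.0.1.5", 44321, "1.2.3.4", 443)       # outbound HTTPS
receive("tcp", "1.2.3.4", 443, "10.0.1.5", 44321)    # reply: allowed
receive("tcp", "9.9.9.9", 40000, "10.0.1.5", 22)     # unsolicited: dropped
```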


3. S3 Internals: Object Storage Architecture

flowchart TD
    subgraph "S3 Storage Hierarchy"
        PUT["PUT /bucket/key — 5MB object"]
        FE["S3 Frontend Fleet\n(per-region, anycast)\nAuthentication + rate limiting"]
        INDEX["Index Service\nBucket+key → object metadata\n(partition key: bucket/key hash)\nStored in DynamoDB-like service"]
        STORE["Storage Fleet\nErasure coding: RS(6,2)\n6 data shards + 2 parity\nAny 6 of 8 can reconstruct"]
        AZ1["AZ-1: shards 1,3,5,7"]
        AZ2["AZ-2: shards 2,4,6,8"]
        PUT --> FE --> INDEX
        FE --> STORE
        STORE --> AZ1
        STORE --> AZ2
    end

Reed-Solomon Erasure Coding (RS 6+2)

S3 splits objects into 6 data chunks and computes 2 parity chunks using Reed-Solomon coding over GF(2⁸):

Object → [d1, d2, d3, d4, d5, d6]  (data chunks)
         [p1, p2]                   (parity: p_i = linear combination of d_j over GF(2⁸))

Reconstruction: Any 6 of 8 shards sufficient.
               Solve system of linear equations over GF(2⁸)
               Tolerates: 2 simultaneous shard failures = 2 AZ failures
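
The "any k of n shards reconstruct" property can be demonstrated with a deliberately simplified single-parity scheme: k data shards plus one XOR parity shard, so any k of k+1 survive one loss. Real RS(6,2) uses two independent parities over GF(2⁸) and tolerates two losses, but the reconstruction idea is the same:

```python
# Simplified single-parity stand-in for Reed-Solomon erasure coding.
def split(data: bytes, k: int) -> list[bytes]:
    size = -(-len(data) // k)                  # ceil division
    data = data.ljust(k * size, b"\0")         # pad to k equal shards
    return [data[i * size:(i + 1) * size] for i in range(k)]

def xor_parity(shards: list[bytes]) -> bytes:
    parity = bytearray(len(shards[0]))
    for shard in shards:
        for i, b in enumerate(shard):
            parity[i] ^= b
    return bytes(parity)

def reconstruct(shards: list, parity: bytes) -> list:
    # Exactly one missing shard is rebuilt by XORing the survivors + parity.
    missing = [i for i, s in enumerate(shards) if s is None]
    assert len(missing) == 1, "single parity tolerates one loss"
    rebuilt = xor_parity([s for s in shards if s is not None] + [parity])
    shards[missing[0]] = rebuilt
    return shards

data = b"hello world, this is an s3 object"
shards = split(data, 6)
parity = xor_parity(shards)
damaged = shards[:3] + [None] + shards[4:]     # lose shard 3 (one AZ's disk)
restored = b"".join(reconstruct(damaged, parity)).rstrip(b"\0")
```

With two parity shards computed as independent linear combinations over GF(2⁸), the same recovery generalizes to solving a small linear system, as the diagram notes.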
sequenceDiagram
    participant Client as S3 Client
    participant FE as S3 Frontend
    participant IDX as Index Service
    participant ST1 as Storage Node 1 (AZ-1)
    participant ST2 as Storage Node 2 (AZ-2)

    Client->>FE: GET /bucket/large-object
    FE->>IDX: Lookup(bucket, key) → object_id, chunk_locations
    IDX-->>FE: chunks: [node1:c1, node2:c2, node3:c3, node4:c4, node5:c5, node6:c6]
    Note over FE: Parallel fetch all 6 chunks
    FE->>ST1: Fetch c1, c3, c5 (parallel)
    FE->>ST2: Fetch c2, c4, c6 (parallel)
    ST1-->>FE: c1, c3, c5
    Note over ST2: Node crashes!
    ST2-->>FE: c2, c4 (only 2/3)

    Note over FE: 5 chunks received (need 6)\nFetch parity p1 from another node
    FE->>ST1: Fetch p1
    ST1-->>FE: p1
    Note over FE: Reconstruct c6 from c1..c5, p1\nvia RS decode (Gaussian elimination over GF(2⁸))
    FE->>Client: Stream reassembled object

S3 Consistency Model

Since December 2020, S3 provides strong read-after-write consistency for all operations. Internally, the index service uses a serializable metadata store, so a GET after a PUT is guaranteed to see the new object (previously, overwrite PUTs were only eventually consistent).


4. Lambda Cold Start Internals

stateDiagram-v2
    [*] --> Cold: Invocation (no warm container)
    Cold --> Download: Download container image\n(if not cached on worker host)
    Download --> Init_Sandbox: Create MicroVM (Firecracker)\nAllocate memory + vCPUs
    Init_Sandbox --> Run_Init: Run function init code\nimport modules, connect DB
    Run_Init --> Warm: Function ready (warm)
    Warm --> Execute: Invoke handler
    Execute --> Warm: Reuse container (next invocation)
    Warm --> Frozen: No invocations for ~15 min
    Frozen --> [*]: Container destroyed

Firecracker MicroVM

AWS Lambda uses Firecracker (open-source KVM-based microVM):

flowchart LR
    subgraph "Lambda Worker Host"
        FC["Firecracker VMM\n(virtual machine monitor)\nminimal device model:\nonly virtio-net + virtio-block\nNo USB, no PCI bus, no BIOS\n→ 125ms boot time"]
        GUEST["Guest: Amazon Linux 2 mini-kernel\n+ Python/Node/Java runtime\n+ customer code"]
        VSOCK["vsock socket:\nhost ↔ guest IPC\nfor invocation payload delivery"]
        FC --> GUEST
        FC --> VSOCK
    end
    subgraph "Lambda Control Plane"
        CP["Invocation Dispatcher\nPicks warm slot or cold-start\nSends payload via vsock"]
    end
    CP --> VSOCK

Cold start breakdown (Python 3.11, 256MB):
- Firecracker boot: ~125ms
- Amazon Linux init: ~50ms
- Python interpreter start: ~100ms
- Customer import statements: variable (0ms–2000ms)
- Total: 250ms–2500ms (vs warm: <1ms overhead)


5. DynamoDB Internals: Partitioning and Replication

flowchart TD
    subgraph "DynamoDB Request Path"
        REQ["PutItem(PK='user#123', SK='profile')"]
        RF["Request Router\nHash(PK) → partition number\npartition_key = hash(PK) mod num_partitions"]
        PART["Storage Node (partition owner)\nLeader of Paxos group"]
        REP1["Replica 1 (AZ-1)"]
        REP2["Replica 2 (AZ-2)"]
        REP3["Replica 3 (AZ-3)"]
        RF --> PART
        PART -->|replicate| REP1
        PART -->|replicate| REP2
        PART -->|replicate| REP3
        Note["Write acknowledged after\n2 of 3 replicas confirm\n(quorum write)"]
    end

DynamoDB LSM-Tree Storage Engine

Each DynamoDB partition uses an LSM-tree (Log-Structured Merge-Tree) under the hood:

flowchart TD
    subgraph "DynamoDB Storage Layer (per partition)"
        WAL["Write-Ahead Log\n(append-only, sequential)\n→ durability guarantee\nbefore memtable write"]
        MEM["MemTable\n(in-memory BTree, sorted by PK+SK)\n→ fast writes"]
        L0["Level-0 SSTables\n(flushed from MemTable)\nsmall, may overlap"]
        L1["Level-1 SSTables\n(compacted, non-overlapping)\n10MB each"]
        L2["Level-2 SSTables\n(100MB each)"]
        BF["Bloom Filter\n(per SSTable, 10 bits/key)\n→ skip irrelevant SSTables on read"]
        IDX["Block Index\n(sparse: one entry per 4KB block)\n→ binary search to block"]
        WAL --> MEM --> L0
        L0 -->|compaction| L1 -->|compaction| L2
        L0 --> BF
        L0 --> IDX
    end
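
The write/read path in the diagram can be modeled in miniature: append to the WAL, insert into the memtable, flush full memtables to immutable sorted tables, and serve reads newest-first. This is an illustrative toy, not DynamoDB's engine:

```python
# Toy LSM-tree: WAL -> memtable -> sorted SSTables, reads check newest first.
import bisect

class LSMTree:
    def __init__(self, memtable_limit=4):
        self.wal = []          # append-only log (durability stand-in)
        self.memtable = {}     # in-memory map, sorted on flush
        self.sstables = []     # immutable sorted (key, value) lists
        self.limit = memtable_limit

    def put(self, key, value):
        self.wal.append((key, value))       # 1. WAL first, for durability
        self.memtable[key] = value          # 2. then the memtable
        if len(self.memtable) >= self.limit:
            self.flush()

    def flush(self):
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:            # newest data wins
            return self.memtable[key]
        for table in reversed(self.sstables):
            keys = [k for k, _ in table]
            i = bisect.bisect_left(keys, key)   # binary search, like the block index
            if i < len(keys) and keys[i] == key:
                return table[i][1]
        return None

db = LSMTree()
for n in range(10):
    db.put(f"user#{n}", {"seq": n})
db.put("user#3", {"seq": "updated"})        # overwrite lands in the memtable
```

A real engine adds Bloom filters to skip SSTables that cannot contain the key, and background compaction to merge overlapping tables, as shown in the diagram.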

DynamoDB Auto-Partitioning (Adaptive Capacity)

When a partition exceeds 1000 WCU/s or 3000 RCU/s, DynamoDB splits it:

partition_id=abc → [abc_low, abc_high]
split_point = median key in partition
all keys < median → abc_low
all keys ≥ median → abc_high
Transparent to application: router table updated atomically
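
The routing and split described above can be sketched directly. The mod-based routing mirrors the document's simplification of DynamoDB's hash partitioning; the function names are illustrative:

```python
# Hash routing plus median split for a hot partition.
import hashlib

def route(pk: str, num_partitions: int) -> int:
    # Simplified: hash(PK) mod num_partitions picks the owning partition.
    digest = hashlib.md5(pk.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

def split_partition(keys: list) -> tuple:
    # Hot-partition split: the median key becomes the boundary.
    keys = sorted(keys)
    median = keys[len(keys) // 2]
    low = [k for k in keys if k < median]
    high = [k for k in keys if k >= median]
    return low, high

keys = [f"user#{i:03d}" for i in range(10)]
low, high = split_partition(keys)
```

Since the split only moves the boundary in the router table, clients keep addressing items by primary key and never observe the repartitioning.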

6. EBS: Block Storage Internals

sequenceDiagram
    participant EC2 as EC2 Instance
    participant Nitro as Nitro NVMe Card
    participant EBS as EBS Storage Fleet

    EC2->>Nitro: NVMe write(LBA=0x1000, data=4KB, queue_depth=32)
    Note over Nitro: Hardware NVMe queue\nNo host CPU involvement
    Nitro->>EBS: TCP over dedicated EBS network\n(encrypted in hardware with a KMS data key)\nWrite(volume_id, offset, data)
    Note over EBS: Stripe data across multiple nodes\nReplicate within the same AZ\n(EBS replication is intra-AZ only)
    EBS-->>Nitro: ACK (after 2 replicas confirm)
    Nitro-->>EC2: NVMe completion queue entry

EBS gp3 throughput: 125 MB/s baseline, up to 1000 MB/s (provisioned). The Nitro card handles all NVMe protocol, encryption (AES-256 in hardware), and TCP networking to EBS fleet — zero host CPU for I/O.


7. IAM Policy Evaluation Engine

flowchart TD
    REQ["API Call: s3:GetObject\non arn:aws:s3:::my-bucket/file"]

    P1["1. Is the caller authenticated?\n(STS token valid, not expired?)"]
    P2["2. Explicit DENY?\n(Any policy with Deny effect matches?)"]
    P3["3. Organizational SCPs allow?"]
    P4["4. Resource-based policy\nallows cross-account access?"]
    P5["5. Identity-based policy allows?"]
    P6["6. Permissions boundary allows?"]
    P7["7. Session policy (STS assume-role) allows?"]

    ALLOW["ALLOW"]
    DENY["DENY (default)"]

    REQ --> P1
    P1 -->|no| DENY
    P1 -->|yes| P2
    P2 -->|explicit deny found| DENY
    P2 -->|no deny| P3
    P3 -->|not allowed by SCP| DENY
    P3 -->|allowed| P4
    P4 -->|resource policy allows| ALLOW
    P4 -->|no resource policy match| P5
    P5 -->|identity policy allows| P6
    P5 -->|no allow| DENY
    P6 -->|within boundary| P7
    P6 -->|outside boundary| DENY
    P7 -->|session policy allows| ALLOW
    P7 -->|no allow| DENY

Condition evaluation: within a single Condition block, multiple condition operators and keys are ANDed; multiple values for one key are ORed. Context keys such as aws:RequestedRegion, aws:SourceVpc, and aws:CurrentTime are injected at evaluation time by the service control plane.
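
The core decision order in the flowchart (explicit deny always wins, an explicit allow is required, default is deny) can be sketched for the simplified single-account, identity-policy-only case. The statement format here is a minimal stand-in for real IAM JSON:

```python
# Simplified IAM evaluation: deny > allow > implicit default deny.
import fnmatch

def evaluate(policies, action, resource):
    decision = "DENY"                          # implicit default deny
    for statement in policies:
        if (fnmatch.fnmatch(action, statement["Action"])
                and fnmatch.fnmatch(resource, statement["Resource"])):
            if statement["Effect"] == "Deny":
                return "DENY"                  # explicit deny is final
            decision = "ALLOW"                 # explicit allow (unless denied)
    return decision

policies = [
    {"Effect": "Allow", "Action": "s3:*",
     "Resource": "arn:aws:s3:::my-bucket/*"},
    {"Effect": "Deny", "Action": "s3:DeleteObject", "Resource": "*"},
]
```

The full engine in the flowchart layers SCPs, permissions boundaries, resource-based policies, and session policies on top, but each layer preserves this ordering: any explicit deny anywhere short-circuits to DENY.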


8. RDS Multi-AZ: Synchronous Replication Internals

sequenceDiagram
    participant App as Application
    participant Primary as RDS Primary (AZ-1)
    participant Standby as RDS Standby (AZ-2)
    participant EBS_P as EBS Primary Volume
    participant EBS_S as EBS Standby Volume

    App->>Primary: INSERT INTO orders(...)
    Primary->>EBS_P: Write WAL + data pages
    Primary->>Standby: Synchronous WAL shipping\n(PostgreSQL streaming replication)
    Standby->>EBS_S: Apply WAL → replicate pages
    Standby-->>Primary: WAL position confirmed
    Primary-->>App: COMMIT OK

    Note over Primary: Primary instance failure
    Note over Primary: EBS primary unavailable
    Note over Standby: Automatic failover triggered\n(Route 53 CNAME update: ~60-120s)
    App->>Standby: Connection via CNAME endpoint\n(Standby promoted to primary)
    Standby-->>App: Requests served

Read Replica Architecture (Asynchronous)

Read replicas use asynchronous log shipping. Unlike Multi-AZ (synchronous replication with automatic failover), read replicas can lag by seconds to minutes and are used for read scaling, not HA:

Primary → WAL chunks → replica_1 (async, may lag)
                     → replica_2 (async, may lag)
                     → replica_3 (cross-region, higher lag)

9. CloudFront CDN Internals: Edge Caching

flowchart TD
    subgraph "CloudFront Request Flow"
        USER["User in Tokyo"]
        EDGE["CloudFront Edge\n(Tokyo PoP)\n220+ PoPs globally"]
        REG_EDGE["Regional Edge Cache\n(Osaka — larger cache tier)"]
        ORIGIN["Origin: S3 bucket in us-east-1"]

        USER -->|1. DNS: cf-id.cloudfront.net\nresolves to nearest PoP| EDGE
        EDGE -->|2. Cache HIT| USER
        EDGE -->|3. Cache MISS| REG_EDGE
        REG_EDGE -->|4. Cache HIT in regional cache| EDGE
        REG_EDGE -->|5. Cache MISS → origin fetch| ORIGIN
        ORIGIN -->|6. Response + Cache-Control headers| REG_EDGE
        REG_EDGE -->|cache + forward| EDGE
        EDGE -->|cache + respond| USER
    end

Cache Key Composition

Default cache key = host + path + query string params (configured)
Vary headers (Accept-Encoding: gzip,br) → separate cache variants
CloudFront Functions: modify cache key in edge compute (sub-ms JS runtime)
Lambda@Edge: full Node.js/Python runtime (viewer triggers: 5s max; origin triggers: 30s max)
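
The cache-key composition above can be sketched as a function: keep only the configured query parameters, normalize their order so equivalent URLs do not fragment the cache, and fork a variant per negotiated encoding. This is an illustrative model, not CloudFront's internal format:

```python
# Hypothetical cache-key builder mirroring the composition rules above.
from urllib.parse import urlencode

def cache_key(host, path, query, cached_params, accept_encoding=""):
    # Keep only configured params, sorted so order doesn't split the cache.
    kept = sorted((k, v) for k, v in query.items() if k in cached_params)
    # One cached variant per negotiated encoding (Vary: Accept-Encoding).
    variant = ("br" if "br" in accept_encoding
               else "gzip" if "gzip" in accept_encoding
               else "identity")
    return f"{host}{path}?{urlencode(kept)}#enc={variant}"

k1 = cache_key("cdn.example.com", "/img/logo.png",
               {"w": "200", "session": "abc"}, {"w"}, "gzip, br")
k2 = cache_key("cdn.example.com", "/img/logo.png",
               {"session": "xyz", "w": "200"}, {"w"}, "br")
```

Dropping the `session` parameter from the key is what lets all users share one cached copy of the image; including it would give every session a private, nearly useless cache entry.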

10. AWS Auto Scaling: Control Loop Mechanics

flowchart TD
    subgraph "Target Tracking Scaling"
        METRIC["CloudWatch Metric\ne.g., ALBRequestCountPerTarget = 1500\ntarget = 1000 req/target"]
        CALC["desired = ceil(current_instances × metric / target)\n= ceil(2 × 1500 / 1000)\n= 3 instances"]
        ASG["Auto Scaling Group\nLaunch 1 more instance\n(via Launch Template)"]
        COOLDOWN["Cooldown: 300s\nNo scale actions during cooldown\n(prevents thrash)"]
        METRIC --> CALC --> ASG --> COOLDOWN
    end
    subgraph "Instance Launch Flow"
        LT["Launch Template:\nAMI, instance type, SG, IAM role"]
        EC2["EC2 RunInstances API"]
        USERDATA["User Data Script\n(cloud-init runs on boot)\nInstall app, start service"]
        ALB["Register with ALB target group\nHealth check: HTTP /health 200"]
        LT --> EC2 --> USERDATA --> ALB
    end
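
The target-tracking math and cooldown gate above can be sketched in a few lines. This is an illustrative model of the control loop, not the Auto Scaling service itself:

```python
# Target tracking: scale current capacity by metric/target ratio, round up;
# a cooldown gate suppresses repeat actions to prevent thrash.
import math

def desired_capacity(current, metric, target):
    return max(1, math.ceil(current * metric / target))

def should_scale(now, last_action, cooldown=300):
    return now - last_action >= cooldown

# 2 instances each seeing 1500 req against a 1000 req/target goal -> 3.
plan = desired_capacity(2, 1500, 1000)
```

Rounding up biases the loop toward slight over-provisioning, which is deliberate: scaling out too little leaves the metric above target and triggers another action after the cooldown.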

11. AWS Service Internals Summary

block-beta
    columns 3
    block:Compute
        EC2["EC2\nNitro KVM hypervisor\nSR-IOV NIC"]
        Lambda["Lambda\nFirecracker microVM\n125ms cold boot"]
        ECS["ECS/EKS\nECS agent / kubelet\non EC2 or Fargate"]
    end
    block:Storage
        S3["S3\nRS(6,2) erasure coding\nStrong consistency"]
        EBS["EBS\nNVMe over TCP\nAES-256 hardware"]
        ElastiCache["ElastiCache\nRedis replica groups\nCluster mode sharding"]
    end
    block:Network
        VPC["VPC\nVirtual switches\nOverlay (Nitro)"]
        CF["CloudFront\nEdge cache\n220+ PoPs"]
        ALB["ALB\nL7 load balancer\nWeighted target groups"]
    end
    block:Database
        RDS["RDS Multi-AZ\nSync WAL replication\nAuto failover 60-120s"]
        DynamoDB["DynamoDB\nLSM-tree + Paxos\nAuto-partition"]
        Aurora["Aurora\n6-way replication\n3 AZs, 6 copies"]
    end

AWS Shared Responsibility Model: Technical Boundaries

| Layer | AWS Responsible | Customer Responsible |
|---|---|---|
| Physical hardware | ✅ Nitro cards, BIOS, firmware | |
| Hypervisor | ✅ Nitro hypervisor isolation | |
| Host OS | ✅ Patching, updates | |
| Guest OS | | ✅ Patch EC2 AMI |
| Network ACLs | ✅ VPC infrastructure | ✅ Configure rules |
| Data at rest | ✅ Hardware AES-256 option | ✅ Enable encryption |
| IAM permissions | ✅ Policy engine | ✅ Write least-privilege policies |
| Application code | | ✅ Vulnerabilities are yours |