Java Internals: Under the Hood¶
Synthesized from: Bloch, Effective Java (3rd ed.); Oaks, Java Performance (2nd ed.); Evans & Verburg, The Well-Grounded Java Developer; Goetz et al., Java Concurrency in Practice; and other Java references.
1. JVM Architecture — Class Loading to Execution¶
JVM Runtime Areas¶
flowchart TD
subgraph JVM_Process["JVM Process Memory"]
subgraph PerThread["Per-Thread (one per Java thread)"]
PC["PC Register\n(current bytecode offset)"]
STACK["JVM Stack\nstack frames: locals, operand stack, frame data"]
NATIVE["Native Method Stack\n(C stack for JNI calls)"]
end
subgraph Shared["Shared Across All Threads"]
HEAP["Heap\nEden, S0, S1 (Young Gen)\nOld Gen (Tenured)\nGC-managed objects"]
METASPACE["Metaspace (Java 8+)\nClass metadata, method bytecode\nInterned strings (Java 7+: heap)\nNative memory (not GC'd by default)"]
CODECACHE["JIT Code Cache\nCompiled native code\n~256MB default"]
end
end
Class Loading Lifecycle¶
flowchart TD
A[".class file or JAR"] --> B["Loading\nBootstrapClassLoader: rt.jar, java.*\nExtClassLoader: ext/*.jar\nAppClassLoader: classpath\nCustom ClassLoader: URLs, dynamic"]
B --> C["Linking: Verification\nBytecode structure valid?\nType safety checks\nControl flow verification"]
C --> D["Linking: Preparation\nAllocate static fields\nInitialize to defaults (0, null, false)"]
D --> E["Linking: Resolution\nSymbolic refs → direct refs\n(field offsets, method table slots)"]
E --> F["Initialization\nRun static initializers <clinit>\n= static { ... } blocks\nThread-safe: class-level lock"]
F --> G["Class ready for use"]
Parent delegation model: ClassLoader first asks parent before searching own path. Prevents class substitution attacks (can't shadow java.lang.String). Can be broken intentionally (e.g., OSGi, Tomcat's WebAppClassLoader loads webapp classes first).
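A minimal sketch of a child-first loader in the style of Tomcat's WebAppClassLoader; `ChildFirstLoader` is a hypothetical name, and a real implementation would read class bytes from its own repository rather than `getResourceAsStream`. Note the `java.` guard: even a child-first loader must delegate core classes, both because the spec requires it and because `defineClass` rejects `java.*` names.

```java
import java.io.IOException;
import java.io.InputStream;

public class ChildFirstLoader extends ClassLoader {
    public ChildFirstLoader(ClassLoader parent) { super(parent); }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            Class<?> c = findLoadedClass(name);           // 1. already defined by this loader?
            if (c == null && !name.startsWith("java.")) { // never shadow core classes
                try {
                    c = findClass(name);                  // 2. child first: search own path
                } catch (ClassNotFoundException ignored) { }
            }
            if (c == null) {
                c = super.loadClass(name, false);         // 3. fall back to parent delegation
            }
            if (resolve) resolveClass(c);
            return c;
        }
    }

    @Override
    protected Class<?> findClass(String name) throws ClassNotFoundException {
        String path = name.replace('.', '/') + ".class";
        try (InputStream in = getResourceAsStream(path)) {
            if (in == null) throw new ClassNotFoundException(name);
            byte[] bytes = in.readAllBytes();
            return defineClass(name, bytes, 0, bytes.length); // define under THIS loader
        } catch (IOException e) {
            throw new ClassNotFoundException(name, e);
        }
    }
}
```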
2. JVM Stack Frame Layout¶
Stack Frame for method: int compute(int x, int y)
+---------------------------+
| Local Variable Array |
| [0] = this (instance) | (only for instance methods)
| [1] = x (int arg) |
| [2] = y (int arg) |
| [3] = temp local int |
+---------------------------+
| Operand Stack | LIFO, max depth from .class Code attr
| (grows as opcodes push) |
+---------------------------+
| Frame Data |
| constant_pool ref | → runtime constant pool of class
| method return address | → caller's PC after return
| exception table ptr | → [start_pc, end_pc, handler_pc, catch_type]
+---------------------------+
3. JIT Compilation — Tiered Compilation¶
Execution Tiers (Java 8+ HotSpot)¶
flowchart TD
A["Method invoked first time\nTier 0: Interpreter\n~100 ns/bytecode"]
A -->|"invocation count > C1 threshold\n(~2000)"| B["Tier 1-3: C1 (Client Compiler)\nLight optimization: inlining small methods\ninvocation/backedge counters inserted\n~5-10× faster than interpreter"]
B -->|"OSR or invocation count > C2 threshold\n(~15000)"| C["Tier 4: C2 (Server Compiler)\nAggressive optimization:\n- Inlining (up to 35-byte callee default)\n- Escape analysis → stack allocation\n- Loop unrolling, vectorization\n- Devirtualization via CHA\n~50-100× faster than interpreter"]
C -->|"Deoptimization trigger:\ntype assumption violated\n(new subclass loaded)"| A
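The tier transitions above can be watched directly. A hedged sketch: `JitWarmup` is a made-up class, but `-XX:+PrintCompilation` is a real HotSpot flag, and the iteration count is chosen to comfortably cross the default C1 (~2000) and C2 (~15000) invocation thresholds.

```java
// Run with: java -XX:+PrintCompilation JitWarmup
// and watch compute() appear first as a C1 compile, then a C2 recompile.
public class JitWarmup {
    static int compute(int x) {
        int acc = 0;
        for (int i = 0; i < 100; i++) acc += (x ^ i) % 7; // enough work to profile
        return acc;
    }

    public static void main(String[] args) {
        long sum = 0;
        // 20k invocations exceed both tier thresholds; backedge counters
        // in the inner loop can also trigger OSR compilation.
        for (int i = 0; i < 20_000; i++) sum += compute(i);
        System.out.println(sum); // keep the result live so the loop isn't eliminated
    }
}
```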
Escape Analysis — Heap Allocation Elimination¶
// This code:
int process() {
    Point p = new Point(1, 2); // escapes? NO — only used locally
    int sum = p.x + p.y;
    return sum;
}
// After escape analysis + scalar replacement:
int process() {
    int p_x = 1; // Point fields promoted to stack scalars
    int p_y = 2; // No heap allocation!
    int sum = p_x + p_y;
    return sum;
}
flowchart TD
A["new Object()"] --> B{Escape analysis}
B -->|"Object escapes:\npassed to other method,\nstored in field/array,\nreturned"| C["Heap allocate\n(TLAB or Eden)"]
B -->|"Does NOT escape:\nlocal scope only"| D["Stack allocate\n(scalar replacement)\nZero GC pressure"]
B -->|"Escapes only to same thread"| E["Thread-local TLAB alloc\n(still heap, but no lock)"]
4. Garbage Collection — Generational GC¶
Object Lifecycle Through Generations¶
flowchart LR
ALLOC["new Object()\n→ bump pointer in TLAB\n(Thread Local Allocation Buffer)\n~1 ns allocation"]
ALLOC --> EDEN["Eden Space\n~80% of Young Gen\nMost objects die here"]
EDEN -->|"Minor GC\n(copy surviving objects)"| S0["Survivor 0 (S0)\nage=1"]
S0 -->|"Minor GC\nage < tenure threshold"| S1["Survivor 1 (S1)\nage=2"]
S1 -->|"age >= tenure threshold\n(default 15)"| OLD["Old Gen (Tenured)\nlong-lived objects"]
OLD -->|"Major/Full GC"| COLLECT["Mark-Sweep-Compact\nor G1/ZGC concurrent"]
TLAB — Thread-Local Allocation Buffer¶
flowchart TD
subgraph Eden_Space["Eden Space"]
TLAB1["Thread 1 TLAB\n[////used////|free.......]\ntop ptr moves right on alloc\nno lock needed!"]
TLAB2["Thread 2 TLAB\n[////used////|free......]"]
TLAB3["Thread 3 TLAB"]
end
T1["Thread 1: new Object()\nbump TLAB1.top += sizeof(obj)\n~1 ns, no synchronization"]
T1 --> TLAB1
When TLAB fills: Thread requests new TLAB from Eden via CAS on Eden.top. Minor GC reclaims entire Eden+Survivors — very fast (only live objects copied, dead objects simply abandoned).
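TLAB behavior can be observed and tuned with real HotSpot flags; `MyApp` below is a placeholder for your main class.

```shell
# JDK 9+: per-thread TLAB statistics at each GC via unified logging
java -Xlog:gc+tlab=debug MyApp
# JDK 8 equivalent
java -XX:+PrintTLAB MyApp
# Fix the TLAB size and disable adaptive resizing (for experiments only)
java -XX:TLABSize=512k -XX:-ResizeTLAB MyApp
```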
G1 GC Architecture¶
flowchart TD
subgraph G1_Heap["G1 Heap (e.g. 4 GB, 2048 regions × 2MB)"]
direction LR
E1["E (Eden)"]
E2["E"]
S1["S (Survivor)"]
O1["O (Old)"]
O2["O"]
H1["H (Humongous\n> 50% region size)"]
F1["Free"]
F2["Free"]
end
YOUNG_GC["Young GC (STW, frequent)\nEvacuate Eden+Survivor → new S regions\nUpdate remembered sets"]
CONC["Concurrent Marking (concurrent with app)\nRoot scan (STW ~few ms)\nConcurrent mark traversal\nRemark (STW ~few ms)\nCleanup (STW ~few ms)"]
MIXED["Mixed GC\nEvacuate young + some old regions\nPrioritize high-garbage old regions\n(Garbage First = G1 name reason)"]
Remembered Sets (RSet): Each region tracks which OTHER regions hold references INTO it. Avoids full heap scan during young GC — only scan RSets of young regions to find old→young pointers.
5. Java Memory Model (JMM) — Happens-Before¶
JMM Rules¶
flowchart TD
A["Happens-Before relationships\n(define visibility guarantees)"] --> B["Program order:\neach action in thread happens-before\nthe next action in same thread"]
A --> C["Monitor lock:\nunlock(m) happens-before\nnext lock(m) by any thread"]
A --> D["volatile write:\nwrite to volatile field happens-before\nall subsequent reads of same field"]
A --> E["Thread start:\nThread.start() happens-before\nany action in started thread"]
A --> F["Thread join:\nall actions in T happen-before\nT.join() returns in another thread"]
volatile — What Hardware Does¶
// Shared fields:
int data = 0;          // plain field
volatile int flag = 0; // volatile field

// Writer thread:
data = 42;  // regular store — may sit in the CPU store buffer
flag = 1;   // volatile store → StoreStore fence before, StoreLoad fence after
            // (on x86: MFENCE or a locked instruction flushes the store buffer)

// Reader thread:
while (flag == 0) { } // volatile load → LoadLoad + LoadStore fence after
int x = data;         // guaranteed to see 42
On x86 (TSO): volatile load = regular load. volatile store = LOCK XCHG or MFENCE. On ARM: DMB SY (full barrier) for both.
6. Java Thread and Monitor Internals¶
Object Header and Lock States¶
Object header (64-bit JVM, without compressed oops):
+--[mark word: 8 bytes]--+--[klass pointer: 8 bytes (4 with CompressedOops)]--+
Mark word states:
Unlocked: [hash:31 | 0 | age:4 | 0 | 01]
Biased: [thread_id:54 | epoch:2 | age:4 | 1 | 01]
Lightweight: [stack_lock_ptr:62 | 00]
Heavyweight: [monitor_ptr:62 | 10]
GC mark: [... | 11]
Lock Escalation Path¶
stateDiagram-v2
[*] --> Unlocked
Unlocked --> Biased: First thread locks\n(no CAS needed, just write threadID)
Biased --> Unlocked: Thread exits synchronized block
Biased --> Lightweight: Different thread tries to lock\n(bias revocation at STW safepoint)
Lightweight --> Lightweight: Same thread re-enters (recursive)
Lightweight --> Heavyweight: CAS fails (contention)\nInflate and allocate ObjectMonitor
Heavyweight --> Heavyweight: wait()/notify()
Heavyweight --> Unlocked: All threads release
ObjectMonitor (heavyweight):
class ObjectMonitor {
void* _owner; // owning thread
jint _count; // recursive lock depth
jint _waiters; // threads in wait()
ObjectWaiter* _WaitSet; // circular list of waiting threads
ObjectWaiter* _EntryList; // threads waiting to acquire lock
};
wait(): releases lock, moves thread to _WaitSet, thread parked (OS-level pthread_cond_wait). notify(): moves one thread from _WaitSet to _EntryList. notifyAll(): moves all.
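The mechanics above dictate the canonical usage pattern: re-check the condition in a `while` loop, because `wait()` can wake spuriously and because `notify()` only moves a thread to `_EntryList` — it must still re-acquire the lock and may find the condition false again. A minimal sketch (`BoundedBuffer` is an illustrative class, not a JDK one):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class BoundedBuffer<T> {
    private final Deque<T> items = new ArrayDeque<>();
    private final int capacity;

    public BoundedBuffer(int capacity) { this.capacity = capacity; }

    public synchronized void put(T item) throws InterruptedException {
        while (items.size() == capacity) wait(); // releases lock, parks in _WaitSet
        items.addLast(item);
        notifyAll();                             // moves waiters to _EntryList
    }

    public synchronized T take() throws InterruptedException {
        while (items.isEmpty()) wait();          // while, not if: spurious wakeups
        T item = items.removeFirst();
        notifyAll();
        return item;
    }
}
```

In production code `java.util.concurrent.ArrayBlockingQueue` provides the same semantics without hand-rolled monitor code.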
7. Java Collections — Internal Data Structures¶
HashMap Internals (Java 8+)¶
flowchart TD
subgraph HashMap_Structure
BA["Node[] table\n(bucket array, power of 2 size)"]
B0["table[0]: null"]
B1["table[1]: Node{hash,key,val,next}"]
B2["table[2]: Node → Node (chain)"]
B7["table[7]: TreeNode (red-black tree\nwhen chain ≥ 8)"]
end
PUT["put(key, val):\nh = key.hashCode()\nh ^= (h >>> 16)\n(spread high bits into low bits)\ni = h & (n-1) // = h % n for power-of-2 n\ninsert at table[i]"]
PUT --> B2
Treeification: When a bucket chain reaches length ≥ 8 AND table.length ≥ 64, the chain is converted to TreeNode (red-black tree), turning O(n) worst-case lookup into O(log n). Converted back to a plain chain when the bucket shrinks to ≤ 6 entries during a resize.
Load factor 0.75: Resize threshold = capacity × 0.75. Balances memory vs collision probability. At 0.75 load, expected chain length ≈ 0-1 under uniform hash distribution.
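The hash-spreading and masking steps can be reproduced outside HashMap; `HashMapIndex` is an illustrative class whose `hash()` mirrors the spreading function `java.util.HashMap` uses internally.

```java
public class HashMapIndex {
    // Same spreading as HashMap.hash(): XOR the high 16 bits into the low 16,
    // so power-of-two masking still sees the high-bit entropy.
    static int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    static int indexFor(Object key, int tableLength) {
        // tableLength is always a power of two, so & is a cheap modulo
        return hash(key) & (tableLength - 1);
    }

    public static void main(String[] args) {
        System.out.println(indexFor("hello", 16)); // a bucket index in [0, 15]
    }
}
```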
ConcurrentHashMap (Java 8)¶
flowchart TD
subgraph CHM["ConcurrentHashMap (Java 8)"]
direction LR
SEG0["table[0]\nCAS on null bins\nsynchronized on bin head for collision"]
SEG1["table[1]"]
SEG2["table[2] - ForwardingNode\n(during resize: points to nextTable)"]
SEG3["table[3]"]
end
WRITE["put(k,v):\n1. Find bin i = (n-1) & hash(k)\n2. if table[i] == null: CAS insert (no lock)\n3. if ForwardingNode: help transfer resize\n4. else: synchronized(table[i]) { insert/update }"]
NOTE["No global lock!\nContention isolated to individual bins\nEffective concurrency scales with table length"]
size() returns approximate count. Exact count uses CounterCell[] (striped counter, like LongAdder) to avoid contention on single counter during concurrent increments.
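The same striping idea is exposed directly as `java.util.concurrent.atomic.LongAdder`: each thread increments its own cell, and `sum()` totals the cells on read, so concurrent writers never fight over one cache line.

```java
import java.util.concurrent.atomic.LongAdder;

public class StripedCounterDemo {
    public static void main(String[] args) throws InterruptedException {
        LongAdder adder = new LongAdder(); // cells array, like CHM's CounterCell[]
        Thread[] threads = new Thread[4];
        for (int t = 0; t < 4; t++) {
            threads[t] = new Thread(() -> {
                // each thread mostly hits its own cell — no shared CAS hotspot
                for (int i = 0; i < 100_000; i++) adder.increment();
            });
            threads[t].start();
        }
        for (Thread t : threads) t.join();
        System.out.println(adder.sum()); // 400000 — sum of all cells
    }
}
```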
8. Java Serialization and Reflection Internals¶
Reflection Method Invocation Path¶
flowchart TD
A["m.invoke(obj, 42)"] --> B["MethodAccessor.invoke()\nFirst 15 calls: interpreted accessor\n(delegation chain in Java)"]
B -->|"invocation count > 15"| C["sun.reflect.MethodAccessorGenerator\nGenerates bytecode for accessor class\nat runtime via ASM-like bytecode emission\nInstantiates via defineClass()"]
C --> D["Generated class: invoke() =\ncast obj to Foo\ncall obj.bar((int)args[0])\nreturn result"]
D --> E["Native code called\nno more reflection overhead"]
Reflection overhead: the first ~15 invocations cost ~500 ns each. After the generated accessor is JIT-compiled: ~5-10 ns (comparable to a virtual call). Alternatively, MethodHandles.lookup().findVirtual() yields a MethodHandle, which the JIT optimizes more predictably than core reflection.
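A sketch of the MethodHandle alternative: resolve the target once, then invoke through a handle. With `invokeExact`, argument and return types must match the handle's type exactly at the call site (hence the mandatory `(int)` cast).

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class HandleDemo {
    public static void main(String[] args) throws Throwable {
        // Resolve String.length() : (String)int — access-checked once, at lookup
        MethodHandle length = MethodHandles.lookup()
                .findVirtual(String.class, "length", MethodType.methodType(int.class));
        // invokeExact: no boxing, no varargs array, statically typed call site
        int n = (int) length.invokeExact("hello");
        System.out.println(n); // 5
    }
}
```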
9. JVM Safepoints and Stop-The-World¶
flowchart TD
A["JVM needs safepoint:\n(GC, deoptimization, class redefinition,\nbiased lock revocation, thread dump)"]
A --> B["Set safepoint request flag\nin global polling page"]
B --> C["All threads:\n- Executing bytecode: check safepoint poll at backedges\n- Executing JIT code: poll instruction at loop backedges/method returns\n- In native (JNI): set flag, checked on return to Java\n- Blocked on monitor/IO: already 'at safepoint'"]
C --> D["All threads reach safepoint\n(last one triggers continuation)"]
D --> E["VM operation executes\n(GC, etc.)"]
E --> F["Threads released\ncontinue execution"]
Time-to-safepoint (TTSP): the delay until all threads reach a safepoint. Long-running JNI code, tight counted loops without safepoint polls (before JDK 10's loop strip mining), or large object allocations can extend TTSP. Symptom: `Application time: 0.0` followed by a large GC pause.
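TTSP can be measured with real HotSpot diagnostics; `MyApp` is a placeholder for your main class.

```shell
# JDK 9+: per-safepoint timing, including time-to-safepoint
java -Xlog:safepoint MyApp
# Report threads that take longer than 500 ms to reach a safepoint
java -XX:+SafepointTimeout -XX:SafepointTimeoutDelay=500 MyApp
```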
10. Java NIO and Direct ByteBuffer¶
flowchart TD
subgraph Java_Heap["Java Heap"]
BB["HeapByteBuffer\ndata stored in byte[] on heap\nGC may relocate → copy needed for I/O"]
end
subgraph Off_Heap["Off-Heap (C memory)"]
DBB["DirectByteBuffer\ndata stored outside GC heap\nvia malloc/mmap\naddress stored as long in Java object"]
end
subgraph Kernel["Kernel Space"]
SOCK["Socket buffer (sk_buff)"]
end
BB -->|"write(HeapByteBuffer)\nkernel must copy: heap → native buf → kernel"| SOCK
DBB -->|"write(DirectByteBuffer)\nzero-copy: native buf address directly\npassed to sendfile/write syscall"| SOCK
ByteBuffer.allocateDirect(n) → Unsafe.allocateMemory(n) → malloc(n) in C. Address stored as long address in DirectByteBuffer. GC cannot relocate it (off-heap). Freed when DirectByteBuffer GC'd → Cleaner (PhantomReference) callback calls free().
Memory-mapped files (FileChannel.map()): mmap() syscall → pages mapped directly into JVM process address space → zero-copy reads/writes via DirectByteBuffer accessing OS page cache.
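A minimal sketch of the mapped-file path: the `MappedByteBuffer` returned by `FileChannel.map()` is a direct buffer whose loads and stores go straight to the OS page cache, with no per-access syscall.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapDemo {
    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("mmap", ".bin");
        try (FileChannel ch = FileChannel.open(path,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // mmap() one 4 KiB page of the file into the JVM address space
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            buf.putInt(0, 42);                 // store lands in the page cache
            System.out.println(buf.getInt(0)); // 42 — read back, no read() syscall
        } finally {
            Files.deleteIfExists(path);
        }
    }
}
```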
11. String Interning and Compact Strings¶
String Representation (Java 9+ Compact Strings)¶
// Java 9+: String uses byte[] + coder field
class String {
byte[] value; // LATIN1: 1 byte/char; UTF16: 2 bytes/char
byte coder; // 0=LATIN1, 1=UTF16
int hash; // cached hashCode (0 = not computed)
}
// "hello" → value=[104,101,108,108,111], coder=0 (LATIN1)
// "日本語" → value=[...UTF16 bytes...], coder=1
String pool (interned strings): The pool's hash table (the StringTable) lives in native JVM memory; the interned String objects themselves live on the heap since Java 7 (previously in PermGen). String.intern() adds a string to the pool. String literals are interned automatically when first resolved (ldc).
flowchart LR
A["String literal \"hello\"\nin bytecode (ldc opcode)"] --> B["JVM string pool lookup\n(hash → bucket → compare)"]
B -->|"found"| C["Return existing interned\nString object reference"]
B -->|"not found"| D["Add to pool\nReturn new String reference"]
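The pool lookup above is observable from plain Java: literals resolve to the pooled instance, `new String(...)` always allocates a fresh object, and `intern()` returns the pooled one.

```java
public class InternDemo {
    public static void main(String[] args) {
        String a = "hello";             // ldc → pooled instance
        String b = new String("hello"); // fresh heap object, same contents
        System.out.println(a == b);          // false: different identities
        System.out.println(a.equals(b));     // true: same characters
        System.out.println(a == b.intern()); // true: intern() returns pooled instance
    }
}
```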
12. JVM Startup and ClassData Sharing (CDS)¶
sequenceDiagram
participant JVM
participant ClassLoader
participant CDS as CDS Archive
JVM->>JVM: Parse JVM flags, initialize subsystems
JVM->>ClassLoader: Load bootstrap classes (java.lang.*)
alt CDS enabled (-Xshare:on)
ClassLoader->>CDS: Map shared archive (mmap)\n(pre-loaded class metadata, interned strings)
CDS-->>ClassLoader: Memory-mapped at fixed address\n(instant class availability, no parse/verify overhead)
else CDS disabled
ClassLoader->>ClassLoader: Parse rt.jar, verify bytecode\n(adds ~100ms startup overhead)
end
JVM->>JVM: Initialize runtime: GC, JIT compiler, thread scheduler
JVM->>JVM: Load application main class → execute main()
AppCDS (Application Class-Data Sharing): Also archives application classes. Startup time reduction: 20-50% for typical Spring Boot apps. GraalVM Native Image takes this further — compiles entire app to native binary, eliminating JVM startup entirely.
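With dynamic CDS archives (JDK 13+, JEP 350) this takes two commands; `app.jar` and `app.jsa` are placeholder names.

```shell
# 1st run: record every loaded application class into an archive at exit
java -XX:ArchiveClassesAtExit=app.jsa -jar app.jar
# Later runs: mmap the archive — class metadata is available without parse/verify
java -XX:SharedArchiveFile=app.jsa -jar app.jar
```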
JVM Performance Numbers¶
| Operation | Time | Notes |
|---|---|---|
| TLAB object allocation | ~1 ns | bump pointer, no lock |
| Eden allocation (no TLAB) | ~10 ns | CAS on Eden.top |
| Minor GC (Young) | 1-50 ms | proportional to live objects in Young |
| G1 Mixed GC pause | 50-200 ms | depends on -XX:MaxGCPauseMillis |
| Full GC (old CMS) | 500ms-30s | proportional to heap size |
| ZGC/Shenandoah pause | <1-10 ms | concurrent marking |
| Virtual method call | ~5-10 ns | vtable dispatch |
| Interface method call | ~10-20 ns | itable search |
| Monomorphic JIT call | ~0-1 ns | inlined |
| synchronized block (uncontended) | ~5-20 ns | biased or thin lock |
| synchronized block (contended) | ~1-10 µs | OS mutex + context switch |
| Thread.start() | ~50-200 µs | OS thread creation |
| Class loading (cold) | ~1-50 ms | parse + verify + initialize |
| Reflection invoke (first 15x) | ~500 ns | interpreted |
| Reflection invoke (after inflate) | ~5-10 ns | JIT compiled accessor |