10 May 2025

Inside the JVM: How Java Garbage Collectors Track, Mark, and Sweep Objects

by kan01234

🧬 Inside the JVM: How Java Garbage Collectors Track, Mark, and Sweep Objects

Garbage Collection (GC) isn’t just about freeing memory — it’s a fundamental part of how the JVM manages object lifecycle and reachability. To truly understand memory performance in Java applications, especially when handling large datasets (like our CSV export pipeline), we need to go beyond “heap is full = GC runs”.

Let’s break down how GC really works — from object references to pointer marking, and color-based marking phases used internally by collectors like G1GC and Shenandoah.

🧭 GC Roots & Object Graph Traversal

GC begins with a set of known GC roots — objects that are always reachable:

Local variables on the stack
Static fields
JNI references
Thread objects
System classes

From these roots, GC walks the object reference graph, recursively following pointers from root objects to others.

If an object isn’t reachable from any root, it is garbage and can be collected.

🎨 Object Coloring: White, Gray, Black

Most modern GCs (including G1GC) use a tri-color marking algorithm to track which objects are live.

White: Initially, all objects are white — assumed unreachable.
Gray: Objects found reachable but whose children haven’t yet been scanned.
Black: Reachable objects whose children have been scanned.

✅ Marking Phase (Tri-color Marking)

All objects start as white.
GC starts with GC roots → marks them gray (reachable but not yet fully scanned).
For each gray object:
- Mark its references (children) gray
- Mark the object itself black (fully scanned)
At the end, only white objects remain unreachable → eligible for collection.

This avoids cycles, and ensures even deeply nested object graphs are scanned safely.

📌 Reference Types and Reachability

Java has multiple reference levels, and GC behavior depends on them:

Reference Type	Collected When Memory Needed?	Use Case
Strong Reference	❌ Never	Default (`new Object()`)
Soft Reference	✅ When memory is low	Caches
Weak Reference	✅ On next GC	Maps, metadata
Phantom Reference	✅ After finalization	Cleanup hooks

If your batch job holds strong references to large lists or maps (e.g., List<Entity>), those will not be collected even under pressure.

🪄 Barriers & Pointer Traversal

GCs that run concurrently with your app (like G1GC, Shenandoah, ZGC) must handle pointer updates during GC.

They use read and write barriers:

Write barrier: Intercepts field assignments to track reference changes
Read barrier: (used by Shenandoah, ZGC) intercepts reads to remap pointers during GC relocation

These barriers are critical for concurrent and low-pause collectors to work correctly while your app continues to allocate and modify memory.

⚙️ G1GC Behavior: Region-Based Heap and Incremental GC

G1GC divides the heap into equal-sized regions (1MB–32MB each) instead of strict Young/Old generations.

Key Phases:

Initial Mark: Marks GC roots; piggybacks on a minor GC
Concurrent Mark: Traces object graph concurrently
Remark: Finalizes marking in a STW (stop-the-world) pause
Cleanup: Reclaims unused regions, re-evaluates liveness

Each region is tagged as Eden, Survivor, or Old — but G1 is flexible and can reassign region roles.

Key Feature:

G1GC chooses which regions to collect based on garbage density (garbage/region size) → hence the name Garbage-First.

🌈 Which GCs Use Tri-Color Marking?

Tri-color marking is central to modern concurrent GCs — it’s how they track object reachability without stopping your application. But not all collectors use it.

✅ GCs that use tri-color marking:

GC	Uses Tri-Color?	Notes
G1GC	✅ Yes	During concurrent marking, uses tri-color to identify reachable objects.
Shenandoah	✅ Yes	Employs concurrent tri-color marking with barriers to minimize pause time.
ZGC	✅ Yes	Uses a color-in-pointer scheme; conceptually tri-color with advanced pointer tagging.
CMS (deprecated)	✅ Yes	Legacy concurrent GC using tri-color for live object discovery.
Serial GC	❌ No	Performs full STW (stop-the-world) marking; doesn’t need tri-color.
Parallel GC	❌ No	Optimized for throughput with STW collection; skips incremental marking logic.

Tri-color is especially useful when the GC runs concurrently with application threads. The algorithm prevents issues like:

Collecting objects still in use (e.g., due to race conditions)
Missing new references created mid-GC

By coloring objects white (unreachable), gray (reachable but not fully scanned), and black (fully scanned), the GC guarantees memory safety during concurrent traversals.

📉 Example: Why You Might Have Long GCs in a Batch Job

If you load 10M records from DB:

You create millions of objects → large Eden occupancy
GC kicks in frequently to handle allocation pressure
If Camel or your code holds onto references (e.g., in Exchange headers), objects get promoted to Old Gen
Full GCs (stop-the-world) start happening once Old Gen is full → latency spikes, possible OOM

🧠 Final Thought: GC Is a Tool, Not Magic

Knowing the GC is not just about tuning flags — it’s about knowing:

How your code allocates and retains memory
How the JVM finds and reclaims unused memory
What references are keeping your objects alive

You can’t fix memory problems just by increasing heap size. The solution often lies in rethinking object lifecycles, reducing reference retention, and allowing the GC to do its job effectively.

tags: memory-optimization - java - performance