Garbage Collection Algorithms & Java Garbage Collectors

What is Garbage Collection?

Garbage collection is an automatic memory management feature in many modern programming languages.

The difficulty in garbage collection is not the actual process of collecting the garbage, it is the problem of finding the garbage in the first place. An object is considered to be garbage when no references to that object exist. But how can we know that there are no references to an object exist?

Languages like C or C++ for example, the memory management is mainly handled by the developer. After the object is created, you have to delete or this part of the memory won’t be released automatically.

Garbage Collection Algorithms:

  1. Reference counting.
  2. Mark & Sweep.
  3. Copy Collection.
  4. Generational Collection.

Reference Counting:

  • It’s about keeping track in each object of the total number of references to that object. That is, to add a special field to each object called a reference count. The idea is that the reference count field is not accessible to the program. Instead, the reference count field is updated by the virtual machine itself.
  • With a new reference ++refCount.
  • When a reference disappears —refCount.
Object p = new Integer(1); --> p refCount = 1.
Object q = new Integer(2); --> q refCount = 1.
p = q; --> p refCount = 0 & q refCount = 2.
  • A disadvantage of reference counting is that it does not detect cycles.
    • A cycle is two or more objects that refer to one another. For example, a parent object that has a reference to its child object, which has a reference back to its parent. These objects will never have a reference count of zero even though they may be unreachable by the roots of the executing program.
  • Another disadvantage is the overhead of incrementing and decrementing the reference count each time.

Mark & Sweep:

  • It’s about not reclaiming  unreferenced objects immediately. Instead, garbage is allowed to accumulate until all available memory has been exhausted. When that happens, the execution of the program is suspended temporarily till all unreferenced objects have been reclaimed, then the execution of the program can resume.
  • Mark & Sweep is the first GC algorithm to detect the cycles.
  • It is called a tracing garbage collector because it traces out the entire collection of objects. The objects that a program can access directly are those objects which are referenced by local variables or by any static variables. These variables are called the roots.
  • The algorithm consists of two phases:
    • Mark phase: it finds and marks all accessible objects as alive.
    • Sweep phase: it scans through the heap and reclaims all the unmarked objects.
  • In order to distinguish the live objects from garbage, a special boolean field is added to each object called, marked.
  • A disadvantage for this algorithm is that it freezes the application till the mark & sweep ends.

Copy Garbage Collection:

  • It’s about copying the reachable objects to the other half of the memory and clear the first one.
  • It starts from a set of roots, and traverse all of the reachable memory-allocated objects, copying them from one half of memory into the other half.
  • The area of memory that objects are copied from is called old space and the area of memory that objects are copied to is called new space.
  • When the reachable data is copied, it compacts it so that it is in a contiguous chunk. So, the holes in memory that the garbage data occupied are squeezed.
  • After the copy and compaction, it end up with a compacted copy of the data in new space data and a (hopefully) large, contiguous area of memory in new space in which can quickly and easily allocate new objects.
  • The next time the garbage collection is done, the roles of old space and new space will be reversed.
  • A disadvantage with copy GC is that almost double the memory space is required, becuase during GC phase both the memory half’s are touched (due to ping <-> pong copy).

Generational Garbage Collection:

  • The opposing theory, the “strong generational hypothesis”, which states that the older an object is, the more likely it is to die. Object lifetime distribution does not fall smoothly, and if an object has survived a few collections, it is likely to live quite long. This is the main idea behind the Generational Garbage Collection.
  • If the GC can concentrate on collection of young objects and do not touch too often older ones, the amount of data that has to be analyzed and copied is considerably reduced. Therefore significant efficiency in garbage collection performance could be reached.
  • It divides the heap into two or more regions, called generations.
  • New Objects are always allocated in the youngest generation.
  • The garbage collection algorithm scans the youngest generation most frequently, and performs scanning of successive generation more rarely.
  • Most objects in youngest generation are deallocated during the next scan. However, those objects that survive a few scans or reach a certain age are advanced to the next generation.

Java Garbage Collectors:

  1. Serial GC:
    • It is using Mark & Sweep algorithm.
    • Use this command (-XX:+UseSerialGC) to enable this GC.
    • It is suitable for a small memory and a small number of CPU cores.
  2. Parallel GC (The default GC):
    • It’s the same as the Serial GC but the parallel GC uses several threads, and therefore, faster.
    • Use this command (-XX:+UseParallelGC) to enable this GC.
    • It is useful when there is enough memory and a large number of cores.
  3. Concurrent Mark & Sweep Collector:
    • Designed for applications that prefer shorter garbage collection pauses and that can afford to share processor resources with the garbage collector while the application is running.
    • Use this command (-XX:+UseConcMarkSweepGC) to enable this GC.
    • Similar to the other available collectors, the CMS collector is generational; thus both minor and major collections occur.
    • The CMS collector attempts to reduce pause times due to major collections by using separate garbage collector threads to trace the reachable objects concurrently with the execution of the application threads.
    • During each major collection cycle, the CMS collector pauses all the application threads for a brief period at the beginning of the collection and again toward the middle of the collection. The second pause tends to be the longer of the two pauses.
    • Multiple threads are used to do the collection work during both pauses. The remainder of the collection (including most of the tracing of live objects and sweeping of unreachable objects is done with one or more garbage collector threads that run concurrently with the application. Minor collections can interleave with an ongoing major cycle, and are done in a manner similar to the parallel collector (in particular, the application threads are stopped during minor collections).
  4. Garbage-First Garbage Collector (G1):
    • The heap is partitioned into a set of equally sized heap regions, each a contiguous range of virtual memory.
    • G1 performs a concurrent global marking phase to determine the liveness of objects throughout the heap. After the marking phase completes, G1 knows which regions are mostly empty.
    • It collects those regions first, which often yields a large amount of free space.
    • This is why this method of garbage collection is called Garbage-First. As the name suggests, G1 concentrates its collection and compaction activity on the areas of the heap that are likely to be full of reclaimable objects, that is, garbage.
    • It’s mainly targeted for multiprocessor machines with large memories (>6 GB memory).