
How To Handle Garbage Collection in Spark Streaming

In this post we will look at how to handle garbage collection in Spark Streaming. Garbage collection is a crucial point of concern in Spark Streaming because a streaming application runs continuously, processing data in micro-batches.

Stream processing puts heavy pressure on the standard JVM garbage collector because of the large number of short-lived objects created at run time. This often causes frequent GC pauses, which in turn increase the latency of real-time applications.

This often goes unnoticed and can be difficult to trace and fix. An effective mitigation is to use the Concurrent Mark Sweep (CMS) garbage collector for both the driver and the executors: CMS reduces pause times by running most of its garbage collection work concurrently with the application.

For the driver program, enable CMS by passing an additional option to the spark-submit command:

--driver-java-options -XX:+UseConcMarkSweepGC

For the executors, CMS garbage collection can be switched on by setting the parameter spark.executor.extraJavaOptions to -XX:+UseConcMarkSweepGC.
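Putting the two settings together, a complete spark-submit invocation might look like the sketch below. The main class and jar names are placeholders for your own streaming application; the two GC options are the ones described above.

```shell
# Enable CMS garbage collection on both the driver and the executors.
# com.example.StreamingApp and streaming-app.jar are hypothetical placeholders.
spark-submit \
  --driver-java-options "-XX:+UseConcMarkSweepGC" \
  --conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC" \
  --class com.example.StreamingApp \
  streaming-app.jar
```

Note that spark.executor.extraJavaOptions must be set at submit time (or in spark-defaults.conf); it cannot be changed from inside the application once the executors have started.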

Additional Read - Free Platform To Practice Big Data
