




How To Fix Spark Error - org.apache.spark.SparkException: Task not Serializable



In this post, we will see how to fix the Spark error - org.apache.spark.SparkException: Task not Serializable. The error shows up in the Spark job failure log as -


Exception in thread "main" org.apache.spark.SparkException: Task not serializable
Caused by: java.io.NotSerializableException

The Basics:

When a Spark job is submitted, the Spark driver sends the workers instructions describing what they (the workers) need to perform - the code instructions. These code instructions can be broken down into two parts -

  • The static parts of the code - These are the parts that are already compiled and shipped to the workers.
  • The run-time parts of the code, e.g. instances of classes. These are created dynamically by the Spark driver during runtime, so the workers do not already have a copy of them. Before they can be sent to the workers, they must be serialized, and only then shipped.
When we execute any code within an RDD closure (map, filter, etc.), everything required to run that piece of code has to be serialized (as a package) and sent to the executors to be run. Even objects (or fields of an object) that are merely referenced get serialized in this exercise. If any part of that package cannot be serialized, we get the NotSerializableException. To be more precise about the use cases -

  • Such errors are caused when a variable that is initialized in the Spark driver is used in logic executed on one of the executors/workers. (As an example of this scenario, imagine an anonymous function,

str: String => this.doSomething(str)

which accesses a reference (this) that is not defined within its own scope, so the whole enclosing object gets pulled into the closure.)

  • Or data needs to be sent back and forth among the executors. When Spark tries to serialize the data (object) to send it over to a worker, it fails if the data (object) is not serializable.
  If you are using UDFs, there can be two scenarios - either you instantiate objects inside the UDF (once for every row of data, which is wasteful and hurts performance) or you instantiate the objects outside the UDF. The second option is better for performance, but because the objects are now created outside the UDF, the serialization issue comes up, since methods cannot be serialized on their own. (A minimal sketch of how such a failure typically looks is shown right after the next list.)

  First things first, be aware that -

  • Any data that needs to be transferred in Spark has to be serialized first. The transfer of data can be due to -
    • Shuffling Operation
    • Collecting some results
    • Broadcasting to executors
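
To see how this typically manifests, here is a minimal Scala sketch of a driver-side object being referenced inside an RDD closure. The Formatter class, its doSomething method and the sc SparkContext are illustrative placeholders, not code from any particular project -

// A class that does NOT extend Serializable
class Formatter {
  def doSomething(str: String): String = str.toUpperCase
}

val formatter = new Formatter()                  // created on the driver
val rdd = sc.parallelize(Seq("a", "b", "c"))

// The closure captures `formatter`, so Spark tries to serialize it along with the task
// and fails with "Task not serializable", caused by java.io.NotSerializableException.
val result = rdd.map(str => formatter.doSomething(str)).collect()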
  As a solution for this serialization issue, first understand the general principle behind the fix - we "pack" the non-serializable parts into an object. The contents of that object are then static (unchanging) and hence can be serialized easily. Use the pointers below for the next steps -

  • Check whether the code logic (which requires data transfer) lives inside a "method". If yes, know that methods cannot be serialized on their own, so Spark attempts to serialize the "class" that holds the "method". If you can make the whole class Serializable, that should solve the problem.
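
A minimal sketch of this fix, reusing the hypothetical Formatter class from the earlier example (and assuming rdd already exists) -

// The whole class is now serializable, so Spark can ship the instance with the task
class Formatter extends Serializable {
  def doSomething(str: String): String = str.toUpperCase
}

val formatter = new Formatter()
val result = rdd.map(str => formatter.doSomething(str)).collect()   // serializes cleanly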
 

  • In case you don't own the "class" that holds the method, or you have no authority to edit the class, you can mix in Serializable when instantiating it. One example is below (Scala) -

val a = new A with Serializable
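
With this, the instance a can then be referenced inside the closure, e.g. rdd.map(str => a.doSomething(str)). Note that this trick only helps if everything A itself holds (its fields) is serializable - the mixin marks the instance as Serializable, but it cannot make non-serializable members serializable.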

  • If your class is not serializable, you can try to serialize the objects just before using them in the closure and de-serialize them afterwards, inside the closure.
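
A rough sketch of this idea - ThirdPartyHelper and process are hypothetical names, and this assumes the object itself can be written out with standard Java serialization; the point is to control exactly what gets serialized instead of letting the closure drag in its whole surrounding context -

import java.io._

val helper = new ThirdPartyHelper()

// Serialize explicitly on the driver - a plain Array[Byte] always ships fine
val bos = new ByteArrayOutputStream()
new ObjectOutputStream(bos).writeObject(helper)
val helperBytes = bos.toByteArray

val result = rdd.mapPartitions { iter =>
  // De-serialize once per partition, on the executor side
  val ois = new ObjectInputStream(new ByteArrayInputStream(helperBytes))
  val localHelper = ois.readObject().asInstanceOf[ThirdPartyHelper]
  iter.map(str => localHelper.process(str))
}.collect()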
 

  • Another alternative is to use a "function" instead of a "method", since functions are objects in Scala and Spark will be able to serialize them.
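
A minimal sketch combining this with the "pack it in an object" principle from above (TextUtils and doSomething are illustrative names, and rdd is assumed to exist) -

object TextUtils extends Serializable {
  // A function value, not a method - function values are objects and serialize on their own
  val doSomething: String => String = str => str.toUpperCase
}

// No enclosing class gets dragged into the closure
val result = rdd.map(TextUtils.doSomething).collect()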
Give these points a try and see if they help.


 

