How to Improve Spark Application Performance – Part 1
This post explains how to improve Spark application performance. Performance tuning is a very important aspect of Spark programming. In this post, I will try to explain some Spark tuning techniques. Below are some of the steps and techniques that I follow to improve Spark application performance -
- For Machine Learning or Natural Language Processing use cases, you can always try using SparkR, PySpark etc. to do the initial data operations – like data cleaning, quality checks etc. If the native Spark libraries are not sufficient, you can also try writing Spark UDFs. The main objective of this step is to reduce the dataset size and thereby retain only the meaningful portion of the data. Once you are done with this step, you can switch to native R or Python because of the extensive machine learning libraries and packages they offer.
- When the task involves lots of custom transformations and you need to write custom UDFs, it is always best to resort to Scala instead of Python, R, or even Java.
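As a sketch of the Scala route, a custom UDF can be registered and applied through the DataFrame API. The `normalize` function, the column names, and the sample data here are all hypothetical, purely for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder.appName("UdfSketch").getOrCreate()
import spark.implicits._

// Hypothetical cleaning step: trim whitespace and lower-case a text column,
// guarding against nulls.
val normalize = udf((s: String) => if (s == null) "" else s.trim.toLowerCase)

val df = Seq(" Foo ", "BAR", null).toDF("raw")
df.withColumn("clean", normalize($"raw")).show()
```

Because this UDF is written in Scala, it runs inside the JVM executors directly, avoiding the Python/R serialization round-trip that makes non-JVM UDFs slower.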
- Choose Dataset or DataFrame over RDD.
- If you absolutely need to use RDDs, avoid Python and R. The best choice is Scala or Java.
Dynamic Resource Allocation:
- Spark has a built-in capability for effective utilization of resources. If multiple applications run on the Spark cluster and share resources, Spark provides a mechanism to allocate resources as and when needed, including releasing resources when an application no longer needs them.
Set spark.dynamicAllocation.enabled = true to enable this feature (note that dynamic allocation also requires the external shuffle service, spark.shuffle.service.enabled = true). Resource managers such as YARN and Mesos can take full advantage of it.
Below are some of the configurations that you can test and check for performance improvement:
- spark.scheduler.mode = FAIR – for better sharing of resources across multiple users.
- --executor-cores – specify the maximum number of cores per executor that your application will need. This keeps the application from taking up all the resources on the cluster.
- spark.cores.max – specify based on the application's needs.
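Putting these settings together, a SparkConf for dynamic allocation might look like the sketch below. The executor counts and core caps are hypothetical placeholders – tune them for your own cluster:

```scala
import org.apache.spark.SparkConf

// Hypothetical tuning values; adjust min/maxExecutors and cores.max
// to match your cluster's capacity.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")    // required by dynamic allocation
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "20")
  .set("spark.scheduler.mode", "FAIR")             // fair sharing across users
  .set("spark.cores.max", "16")                    // standalone/Mesos total-core cap
```

The same keys can equally be passed on the command line via spark-submit --conf, which is often more convenient than hard-coding them.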
- Use Parquet or ORC – both are good in terms of efficiency.
- Avoid small files. Many small files increase the network and scheduling overhead of a Spark job.
- If the application is writing data, use the maxRecordsPerFile option.
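A minimal write sketch using this option – the DataFrame `df`, the output path, and the record cap are hypothetical:

```scala
// Cap each output file at 500,000 records so a single task cannot
// produce one oversized file; Spark rolls over to a new file instead.
df.write
  .option("maxRecordsPerFile", 500000)
  .parquet("/tmp/output/events")
```

Pairing this with coalesce (covered below) gives you control over both the number and the size of the output files.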
- For serialization, use Kryo instead of Java serialization. You can enable Kryo serialization by setting spark.serializer=org.apache.spark.serializer.KryoSerializer. You will also need to explicitly register the classes you would like Kryo to serialize via the spark.kryo.classesToRegister configuration. There are also a number of advanced parameters for controlling this in greater detail, described in the official Kryo documentation.
To register your classes, use the SparkConf that you just created and pass in the names of your classes: conf.registerKryoClasses(Array(classOf[SampleClass1], classOf[SampleClass2]))
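Putting the serializer setting and the class registration together, a self-contained SparkConf sketch might look like this – SampleClass1 and SampleClass2 stand in for your own application types:

```scala
import org.apache.spark.SparkConf

// Placeholder types; replace with the classes your job actually shuffles.
case class SampleClass1(id: Long)
case class SampleClass2(name: String)

val conf = new SparkConf()
  .setAppName("KryoExample")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[SampleClass1], classOf[SampleClass2]))
```

registerKryoClasses is a convenience that fills in spark.kryo.classesToRegister for you; registering classes lets Kryo write a compact numeric ID instead of the full class name with every serialized object.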
- Serialization format has a large impact on shuffle performance
- Filter or aggregate data before any shuffle operation.
- Introduce filtering as early as possible in your chain of transformations.
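These two points can be sketched together. In the hypothetical example below, `events` and `users` are DataFrames and the column names are made up; the idea is simply to shrink the data before the join (a shuffle operation):

```scala
// Filter early, then pre-aggregate, so far less data crosses the
// network during the shuffle that the join triggers.
val recent = events
  .filter($"event_date" >= "2020-01-01")   // drop unneeded rows first
  .groupBy($"user_id")
  .count()                                 // shrink to one row per user

val joined = recent.join(users, Seq("user_id"))
```

Had the filter come after the join, every row of `events` would have been shuffled and joined before being discarded.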
- Ensure each CPU core has 2 – 4 tasks in the cluster.
- Set spark.default.parallelism & spark.sql.shuffle.partitions according to the number of cores
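Following the 2–4 tasks-per-core rule above, a cluster with, say, 50 cores would suggest a value in the 100–200 range. The numbers below are illustrative, not a recommendation:

```scala
// Controls the number of partitions produced by DataFrame/SQL shuffles
// (joins, aggregations). Default is 200; tune it to your core count.
spark.conf.set("spark.sql.shuffle.partitions", "200")

// spark.default.parallelism governs RDD operations and is normally set
// on the SparkConf or at submit time rather than changed at runtime:
//   spark-submit --conf spark.default.parallelism=200 ...
```

Too few partitions under-utilizes the cluster; too many adds scheduling overhead and produces many tiny files – which circles back to the small-files advice above.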
- Prefer coalesce over repartition when reducing the number of partitions – coalesce merges existing partitions without a full shuffle, whereas repartition redistributes every row.
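A quick sketch of the difference – `df` and the output path are hypothetical:

```scala
// Merge down to 10 partitions without a full shuffle; each task writes
// one output file.
df.coalesce(10).write.parquet("/tmp/output/compact")

// By contrast, repartition(10) would shuffle every row across the
// cluster before writing – more expensive, but it rebalances skewed
// partitions, which coalesce does not.
// df.repartition(10).write.parquet("/tmp/output/balanced")
```

Use coalesce when you only need fewer partitions; fall back to repartition when the partitions are badly skewed or you need more of them.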
- Try custom partitioning of RDDs if jobs are running slowly.
- If you reuse an RDD, DataFrame, or Dataset, cache it. But avoid caching if the data is used only once.
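A minimal caching sketch – `raw` and the column names are hypothetical:

```scala
// The filtered result is used by two separate actions, so cache it.
val cleaned = raw.filter($"status" === "ok")
cleaned.cache()                              // lazy: materialized on first action

val total = cleaned.count()                  // first action populates the cache
val byDay = cleaned.groupBy($"day").count()  // second action reuses cached data

cleaned.unpersist()                          // release memory when done
```

Note that cache() is lazy – nothing is stored until an action runs – and that unpersisting reused data you no longer need frees executor memory for other stages.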
- Avoid Cartesian joins.
- If possible, use broadcast joins.
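A broadcast join ships the small side of the join to every executor so the large side never has to be shuffled. In this sketch, `largeDf`, `smallLookupDf`, and the join column are hypothetical:

```scala
import org.apache.spark.sql.functions.broadcast

// Hint Spark to replicate the small lookup table to all executors;
// the large fact table is then joined locally on each node with no
// shuffle of its rows.
val joined = largeDf.join(broadcast(smallLookupDf), Seq("country_code"))
```

Spark also broadcasts automatically when a table is below spark.sql.autoBroadcastJoinThreshold, but the explicit broadcast() hint is useful when the optimizer cannot estimate the small table's size.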