
Spark-Submit Command Line Arguments



In this post, I will explain the Spark-Submit command line arguments (options). We will touch upon the important arguments used in the spark-submit command, and at the end I will collate all of these arguments and show a complete spark-submit command using them. This post is written against Spark 2.x.

  • --class: The main class of your application if it is written in Scala or Java (e.g. org.com.sparkProject.examples.MyApp)

--class org.com.sparkProject.examples.MyApp

  • --name : Name of the application. Note that this name is overridden if it is also set within the main class of the Spark application. When Spark runs in cluster mode on YARN, the YARN application is created well before the SparkContext, so the app name has to be set through this spark-submit argument, i.e. --name. In client mode, however, the app name can also be set inside the program itself, e.g. via SparkSession.builder.appName("MyApp") (see the snippet below).

--name SparkApp
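For client mode, a minimal PySpark sketch of setting the name in code (the name string is just an illustration):

    from pyspark.sql import SparkSession

    # In client mode the appName set here takes effect; in YARN cluster mode
    # the --name argument of spark-submit should be used instead.
    spark = SparkSession.builder.appName("MyApp").getOrCreate()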

  • --master: Possible options are -
    • Standalone - spark://host:port: the URL and port of the Spark standalone cluster (e.g. spark://10.21.195.82:7077). It does not use any external resource manager such as Mesos or YARN.
    • YARN/Mesos/Kubernetes - used if you choose YARN, Mesos or Kubernetes as the resource manager. For YARN, for example, this value is simply "yarn".
    • local - used for executing your code on your local machine. If you pass local, Spark runs in a single thread (without leveraging any parallelism). On a multi-core machine you can either specify the exact number of cores with local[n], where n is the number of cores to use, or let Spark spin up as many threads as there are cores with local[*]. E.g. local[8] means run locally on 8 cores; local[*] means use all available cores.
 

  • --deploy-mode: Denotes whether you want to deploy your driver on the worker nodes (cluster) or locally as an external client (client); the default is client. To understand the difference between cluster and client deployments, read this post.
    • Cluster mode - the driver runs on one of the worker nodes. This mode is preferred for production runs of Spark applications or jobs.
    • Client mode - the driver runs on the local machine (your laptop/desktop terminal). This mode is used for testing, debugging, or verifying issue fixes in a Spark application or job. Although the driver runs locally, all the executors still run inside the cluster.
 

  • --conf: Arbitrary Spark configuration properties to set at runtime for your application. Values are given in key=value format. e.g.

    --conf "spark.eventlog.enabled=true"

  • --executor-memory - Defines how much memory to allocate to each executor running the application. The default (spark.executor.memory) is 1 GB. e.g.

    --executor-memory 2G
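This option is shorthand for the spark.executor.memory property, so the same setting could equally be passed through --conf:

    --conf "spark.executor.memory=2G"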

  • --driver-memory - Defines how much memory to allocate to the driver of the application. The default is 1024 MB (1 GB).

--driver-memory 3G

  • --num-executors: Number of executors requested for the job. If dynamic allocation is enabled, the initial number of executors will be at least the number specified here. e.g.

--num-executors 12
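If you rely on dynamic allocation instead of a fixed executor count, it is normally switched on through configuration properties; a minimal sketch, assuming illustrative bounds of 2 and 20 executors:

    --conf "spark.dynamicAllocation.enabled=true" \
    --conf "spark.dynamicAllocation.minExecutors=2" \
    --conf "spark.dynamicAllocation.maxExecutors=20" \
    --conf "spark.shuffle.service.enabled=true"   # the external shuffle service is required for dynamic allocation on YARN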

  • application-jar: The path to the bundled (assembled) jar containing your application and all its dependencies. The URL must be globally visible inside your cluster, for instance an hdfs:// path or a file:// path that is present on all nodes. e.g.

/AA/BB/target/spark-project-1.0-SNAPSHOT.jar
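A cluster-visible HDFS location, for instance, could look like the following (the path is purely illustrative):

    hdfs:///projects/spark/spark-project-1.0-SNAPSHOT.jar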

  • application-arguments: Arguments passed to the main method of your main class, if any. Note that -
    • Arguments placed before the .jar file are treated as options for spark-submit and the JVM,
    • Arguments placed after the jar file are passed to the Spark program itself.

/project/spark-project-1.0-SNAPSHOT.jar input1.txt input2.txt

  • --jars : Mention all the dependency jars (comma-separated, without spaces) needed to run the Spark job. Note that you need to give the full path of the jars if they are placed in different folders. e.g.

--jars cassandra-connector.jar,some-other-package-1.jar,some-other-package-2.jar

  • --files : If Spark needs any additional files for its execution, they should be passed using this option. Multiple files can be mentioned, separated by commas (see the sketch after this list).
    • In client deploy mode, the path must point to a local file.
    • In cluster deploy mode, the path can be either a local file or a URL globally visible inside your cluster.
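For example, to ship a properties file and a small lookup file along with the job (both file names are hypothetical):

    --files /AA/BB/app.properties,/AA/BB/lookup.csv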
 

  • --py-files - If the Spark application/job needs .zip, .egg, or .py files, you have to use this option. Two or more files should be comma-separated.
    • In client deploy mode, the path must point to a local file.
    • In cluster deploy mode, the path can be either a local file or a URL globally visible within the cluster.

--py-files dependency_files/egg.egg

Note the additional points below for PySpark jobs -

  • If you want to run the PySpark job in client mode, any library imported outside of your map functions must be installed on the host from which you execute spark-submit.
  • If you want to run the PySpark job in cluster mode, you have to ship the libraries using the --archives option of the spark-submit command (see the sketch below).
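A common way to do this on YARN is to pack the Python environment, ship it with --archives, and point the Python workers at the interpreter inside the unpacked archive. A rough sketch, assuming a packed environment named environment.tar.gz (all names are illustrative):

    --archives environment.tar.gz#environment \
    --conf "spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python" \
    --conf "spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python"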
Using most of the above, a basic skeleton for the spark-submit command becomes -


./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ...                       # other options
  <application-jar> \
  [application-arguments]   # <--- application arguments go here

  Let us now combine all the above arguments and construct a complete spark-submit command -

Spark-Submit Example 1 - Java/Scala Code:


export HADOOP_CONF_DIR=XXX

./bin/spark-submit \
  --class org.com.sparkProject.examples.MyApp \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 20G \
  --num-executors 50 \
  --conf "spark.eventLog.enabled=true" \
  --jars cassandra-connector.jar,some-other-package-1.jar,some-other-package-2.jar \
  /project/spark-project-1.0-SNAPSHOT.jar input1.txt input2.txt   # arguments to the program


Spark-Submit Example 2 - Python Code:

Now let us construct a similar spark-submit command for a PySpark job -


./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 5G \
  --executor-cores 8 \
  --py-files dependency_files/egg.egg \
  --archives dependencies.tar.gz \
  mainPythonCode.py value1 value2   # main Python Spark code file, followed by
                                    # the arguments (value1, value2) passed to the program

An example of how the arguments passed (value1, value2) can be handled inside the program:


import sys

# The first argument tells the program how many of its own arguments follow
n = int(sys.argv[1])

# Collect the remaining arguments (sys.argv[0] is the script name itself)
argspassed = []
for i in range(n):
    argspassed.append(sys.argv[2 + i])

print(argspassed)

Spark-Submit Example 3 - Local :


./bin/spark-submit \
  --class org.com.sparkProject.examples.MyApp \
  --master local[2] \
  /project/spark-project-1.0-SNAPSHOT.jar input.txt   # local[2] runs the job on 2 cores

Spark-Submit Example 4 - Standalone (Deploy Mode - Client) :


./bin/spark-submit \
  --class org.com.sparkProject.examples.MyApp \
  --master spark://<IP_Address>:<Port_No> \
  /project/spark-project-1.0-SNAPSHOT.jar input.txt

Spark-Submit Example 5 - Standalone (Deploy Mode - Cluster) :


./bin/spark-submit \
  --class org.com.sparkProject.examples.MyApp \
  --master spark://<IP_Address>:<Port_No> \
  --deploy-mode cluster \
  /project/spark-project-1.0-SNAPSHOT.jar input.txt

Spark-Submit Example 6 - Deploy Mode - YARN Cluster :


export HADOOP_CONF_DIR=XXX

./bin/spark-submit \
  --class org.com.sparkProject.examples.MyApp \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 5G \
  --num-executors 10 \
  /project/spark-project-1.0-SNAPSHOT.jar input.txt

Spark-Submit Example 7 - Kubernetes Cluster :


export HADOOP_CONF_DIR=XXX

./bin/spark-submit \
  --class org.com.sparkProject.examples.MyApp \
  --master k8s://<IP_Address>:443 \
  --deploy-mode cluster \
  --executor-memory 5G \
  --num-executors 10 \
  /project/spark-project-1.0-SNAPSHOT.jar input.txt

