Best Practices for Dependency Problems in Spark



Resolving dependency problems in Spark is one of the main concerns for any engineer building a Spark application. This post offers some guidance on how to do that. When building and deploying Spark applications, all dependencies must use compatible versions.

1. Scala

build.sbt:

All packages must use the same major Scala version (2.10, 2.11, 2.12, etc.). The better practice is to declare the Scala version globally and let sbt's %% operator append it to the artifact names:


name := "Test Project" version := "1.0" scalaVersion := "2.11.7" libraryDependencies ++= Seq(
   "org.apache.spark" **%%** "spark-core" % "2.0.1",
   "org.apache.spark" **%%** "spark-streaming" % "2.0.1",
   "org.apache.bahir" **%%** "spark-streaming-twitter" % "2.0.1"
)
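
For reference, %% simply appends the project's Scala binary version to each artifact name, so with scalaVersion := "2.11.7" the block above resolves the same artifacts as writing the suffix by hand:


// %% appends the Scala binary version (_2.11 here) to each artifact name,
// so this is equivalent to the %% form above:
libraryDependencies ++= Seq(
   "org.apache.spark" % "spark-core_2.11" % "2.0.1",
   "org.apache.spark" % "spark-streaming_2.11" % "2.0.1",
   "org.apache.bahir" % "spark-streaming-twitter_2.11" % "2.0.1"
)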

Maven:


<project>
  <groupId>com.example</groupId>
  <artifactId>simple-project</artifactId>
  <modelVersion>4.0.0</modelVersion>
  <name>Simple Project</name>
  <packaging>jar</packaging>
  <version>1.0</version>
  <properties>
    <spark.version>2.0.1</spark.version>
  </properties> 
  <dependencies>
    <dependency> <!-- Spark dependency -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency> 
    <dependency>
      <groupId>org.apache.bahir</groupId>
      <artifactId>spark-streaming-twitter_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>
  </dependencies>
</project>

2. Spark

 

build.sbt:

Always use a variable for the Spark version instead of repeating it for each dependency:


name := "Simple Project" version := "1.0" val **sparkVersion** = "2.0.1" libraryDependencies ++= Seq(
   "org.apache.spark" % "spark-core_2.11" % **sparkVersion**,
   "org.apache.spark" % "spark-streaming_2.10" % **sparkVersion**,
   "org.apache.bahir" % "spark-streaming-twitter_2.11" % **sparkVersion** )

Maven:

Notice the spark.version and scala.version properties:


<project>
  <groupId>com.example</groupId>
  <artifactId>simple-project</artifactId>
  <modelVersion>4.0.0</modelVersion>
  <name>Test Project</name>
  <packaging>jar</packaging>
  <version>1.0</version>
  <properties>
    <spark.version>2.0.1</spark.version>
    <scala.version>2.11</scala.version>
  </properties> 
  <dependencies>
    <dependency> <!-- Spark dependency -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_${**scala.version**}</artifactId>
      <version>${**spark.version**}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_${**scala.version**}</artifactId>
      <version>${**spark.version**}</version>
    </dependency> 
    <dependency>
      <groupId>org.apache.bahir</groupId>
      <artifactId>spark-streaming-twitter_${**scala.version**}</artifactId>
      <version>${**spark.version**}</version>
    </dependency>
  </dependencies>
</project>

3. Additional Best Practices

  • The Spark version in the dependencies should match the Spark version installed on the cluster, e.g. for a 2.0.1 cluster, build the jars against 2.0.1.

  • The Scala version used to build the jar should match the Scala version the deployed Spark was built with. The defaults are:

    • Spark 1.x -> Scala 2.10
    • Spark 2.x -> Scala 2.11
  • Additional packages bundled in the fat jar must be accessible on the worker nodes. spark-submit provides several options for this (see the example after this list):

    • --jars argument for spark-submit - to distribute local jar files.
    • --packages argument for spark-submit - to fetch dependencies from a Maven repository.
  • If you submit the Spark job in cluster mode, include the application jar in --jars as well.
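
As a rough illustration of the options above (the class name, jar names and master URL are placeholders, not from a real project):


# Distribute a local jar to the executors along with the application jar:
spark-submit \
  --class com.example.Main \
  --master yarn \
  --jars /path/to/extra-lib.jar \
  target/scala-2.11/test-project_2.11-1.0.jar

# Or let spark-submit fetch the dependency from a Maven repository instead:
spark-submit \
  --class com.example.Main \
  --master yarn \
  --packages org.apache.bahir:spark-streaming-twitter_2.11:2.0.1 \
  target/scala-2.11/test-project_2.11-1.0.jar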

  Additional Read: Explained – How to Improve Spark Application Performance – Part 1