How To Set Up Spark Scala SBT in Eclipse
This post explains how to set up Spark, Scala and SBT in Eclipse. Normally, running any Spark application, especially in Scala, is a bit lengthy, since you need to code, compile, build the jar and finally deploy or execute it.
Alternatively, we can use Eclipse. Eclipse is a FREE IDE that is widely used for writing Java or Scala code. So if we can integrate the Spark environment into Eclipse, it is a big help - we can quickly run, debug and unit test code changes without the pain of that lengthy process. I will explain a step-by-step process to do so -
Part 1 - Set Up The Environment:
1. Create and Verify The Folders
Create the folders below in the C drive. You can also use any other drive, but for this post I am considering the C drive for the set-up.
1.1. For Spark -
C:\Spark
1.2. For Hadoop -
C:\Hadoop\bin
1.3. For Java - Check where your Java JDK is installed. If Java is not already installed, install it from the Oracle website (https://java.com/en/download/help/windows_manual_download.xml). Ideally Java version 8 works fine without any issues so far, so try that. Let's assume Java is installed. Note down the Java JDK path. Typically it is something like C:\Program Files\Java\jdk1.8.0_191; it might be different based on what folder you chose, but whatever it is, note the path down. We will need all 3 of the above folder paths in our next steps.
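If you prefer doing this from the command line, below is a quick sketch of the folder creation and the Java check. It assumes the C drive layout above; your JDK folder name may differ.
:: create the Spark and Hadoop folders
mkdir C:\Spark
mkdir C:\Hadoop\bin
:: confirm the JDK is installed and note down its folder
java -version
dir "C:\Program Files\Java"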
2. Downloads
Download the following -
> Download Spark from https://spark.apache.org/downloads.html. Extract the files and place them in C:\Spark. e.g. if you have downloaded the Spark 2.2.1 version and extracted it, it will look something like -
C:\Spark\spark-2.2.1-bin-hadoop2.7
> Download winutils.exe from https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1/bin/winutils.exe and copy the winutils.exe file into C:\Hadoop\bin
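A quick sanity check that the files landed in the right place (the folder names assume the Spark 2.2.1 build used in this post):
:: Spark scripts such as spark-shell should be listed here
dir C:\Spark\spark-2.2.1-bin-hadoop2.7\bin
:: winutils.exe should be listed here
dir C:\Hadoop\bin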
3. Environment Variable Set-up
Let's set up the environment variables now. Open the Environment Variables window and create new variables, or edit them if they already exist. Based on what I have chosen, I will need to add the following environment variables -
SPARK_HOME - C:\Spark\spark-2.2.1-bin-hadoop2.7
HADOOP_HOME - C:\Hadoop
JAVA_HOME - C:\Program Files\Java\jdk1.8.0_191
These values are as per my folder structure. Please try to keep the same folder structure. For my case, it looks like below once I set up the environment variables -
Also add the Java and Spark bin dir locations to your Windows Path variable. For my case, the bin dir locations are below; I have added them to my Windows PATH variable.
C:\Program Files\Java\jdk1.8.0_191\bin
C:\Spark\spark-2.2.1-bin-hadoop2.7\bin
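If you prefer setting these from the command line, below is a sketch using setx. The values assume my folder structure; open a NEW command prompt afterwards for the changes to take effect.
:: set the variables for the current user
setx SPARK_HOME "C:\Spark\spark-2.2.1-bin-hadoop2.7"
setx HADOOP_HOME "C:\Hadoop"
setx JAVA_HOME "C:\Program Files\Java\jdk1.8.0_191"
:: verify from a new command prompt
echo %SPARK_HOME%
echo %HADOOP_HOME%
echo %JAVA_HOME%
where java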
4. Eclipse
I believe Eclipse might already be set up in your system. If not, you can install Eclipse from -
https://www.eclipse.org/downloads/packages/ Setting up Eclipse is quite straightforward, so I am going to safely skip this part.
5. Maven
Add the Maven plugin to your Eclipse using the "Help" --> "Install New Software" option. Use the link below -
http://download.eclipse.org/technology/m2e/releases/
6. Scala
This is important - the version of Scala to be installed needs to be EXACTLY the SAME as the Scala version mentioned by Spark. To verify, type spark-shell in the Windows command line and you will notice the compatible version of Scala mentioned. Snapshot of the message from my terminal -
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_191)
Type in expressions to have them evaluated.
Type :help for more information.
So I know my compatible Scala version is 2.11.8. Find your version at https://www.scala-lang.org/download/all.html, then download and install it. Verify the version -
scala -version
7. SBT
Download and install SBT from - https://www.scala-sbt.org/download.html
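To confirm SBT is installed and on the PATH, you can run the checks below from a command prompt. Note that the second command starts sbt itself, so it may create project and target folders in whatever directory you run it from.
:: confirm the sbt launcher is on the PATH
where sbt
:: prints the sbt version and the Scala version sbt runs on
sbt about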
8. Eclipse Plugin for Scala
Download and install the Scala plugin for Eclipse. Download from - https://marketplace.eclipse.org/content/scalastyle
Part 2 - Project-Specific Set-Up:
1. Folder
// 1. Make a project folder
mkdir SimpleApp
// Create the folder structure within the "SimpleApp" dir
cd SimpleApp
mkdir lib project target src
mkdir src\main src\test
mkdir src\main\java src\main\resources src\main\scala
mkdir src\test\java src\test\resources src\test\scala
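You can confirm the skeleton with the Windows tree command, run from inside the SimpleApp dir.
tree .
:: you should see lib, project, target and src, with main and test under src,
:: and java, resources and scala under each of those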
2. Build.sbt
Ensure the Scala version & Spark version used below match exactly what you see while running the "spark-shell" command.
// 2. Create a build.sbt file in the /SimpleApp dir with the content below.
// Use the correct Scala & Spark versions
name := "Simple Project"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.1"
3. SBT Project Building
// ALL "sbt" COMMANDS NEED TO BE RUN FROM /SimpleApp dir level
// 3 Go to SimpleApp dir & Run below -
> sbt
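Once sbt starts cleanly, a couple of commands are worth running from the /SimpleApp dir; the first run will download the Spark dependencies, which can take a while.
:: compile the sources against the dependencies declared in build.sbt
sbt compile
:: build the application jar under target\scala-2.11\
sbt package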
4. Eclipse Plugin
- Create a plugins.sbt file in /SimpleApp/project/ with the content below. Also refer - https://github.com/sbt/sbteclipse
addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "5.2.4")
// Reload so the necessary files for Eclipse are downloaded and generated
> sbt
> reload
> eclipse
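If the plugin ran correctly, sbteclipse generates the Eclipse project descriptors inside the /SimpleApp dir; a quick way to check is below.
:: run from the /SimpleApp dir after the eclipse command finishes
dir /b .project .classpath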
5. Eclipse
- Open Eclipse
- Open Scala Perspective
- Import the Project i.e. /SimpleApp dir
- Create a New Scala Object in src/main/scala in Eclipse
- Copy the below code
import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "D:\\WorkDirectory\\README.md" // Should be some file on your system
    val spark = SparkSession.builder.appName("Simple Application").master("local").getOrCreate()
    val logData = spark.read.textFile(logFile).cache()
    val numAs = logData.filter(line => line.contains("spark")).count()
    val numBs = logData.filter(line => line.contains("pyspark")).count()
    //println(s"Lines with word=spark: $numAs, Lines with word=pyspark: $numBs")
    println(s"Lines with word=spark : $numAs")
    println(s"Lines with word=pyspark : $numBs")
    println("===============")
    spark.stop()
  }
}
- Go to Eclipse --> Project --> Properties --> Scala Compiler --> Scala Installation (select the Scala version as in build.sbt)
- Run
- You can see the Spark Output in the Eclipse Console. It will look something like below -
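As a side note, once the project builds in sbt, you can also run the same app outside Eclipse with spark-submit. Below is a sketch, assuming the jar name that sbt derives from the build.sbt above; check your target\scala-2.11 folder for the exact file name.
:: run from the /SimpleApp dir
sbt package
spark-submit --class SimpleApp --master local target\scala-2.11\simple-project_2.11-1.0.jar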
This marks the end of the objective of this post. Do read the other posts from this blog.
Additional Read -