




How To Set Up Spark Scala SBT in Eclipse



This post explains how to set up Spark with Scala and SBT in Eclipse. Normally, running any Spark application, especially in Scala, is a bit lengthy: you need to code, compile, build a jar and finally deploy or execute it.

Alternatively, we can use Eclipse. Eclipse is a free IDE widely used for writing Java or Scala code. So if we can integrate the Spark environment into Eclipse, it is a big help: we can quickly run, debug and unit test code changes without the pain of that lengthy process. I will explain the step-by-step process to do so -

Part 1 - Set Up The Environment:

 

1. Create and Verify The Folders


Create the below folders in the C drive. You can also use any other drive, but for this post I am using the C drive for the set-up.

1.1. For Spark - C:\Spark
1.2. For Hadoop - C:\Hadoop\bin
1.3. For Java - Check where your Java JDK is installed. If Java is not already installed, install it from the Oracle website (https://java.com/en/download/help/windows_manual_download.xml). Java version 8 has worked fine without any issues so far, so try that. Assuming Java is installed, note down the Java JDK path. Typically it is something like C:\Program Files\Java\jdk1.8.0_191; it might differ based on which folder you chose, but whatever it is, note the path down.

We will need all of the above 3 folder names in our next steps.
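If you prefer the command line, the Spark and Hadoop folders can be created from a Windows command prompt or PowerShell (a minimal sketch; adjust the drive letter if you are not using C):

# Create the base folders used in this post
mkdir C:\Spark
mkdir C:\Hadoop\bin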

2. Downloads


Download the following -

> Download Spark from - https://spark.apache.org/downloads.html

Extract the files and place them in C:\Spark. E.g. if you have downloaded and extracted Spark version 2.2.1, it will look something like C:\Spark\spark-2.2.1-bin-hadoop2.7.

>   Download winutils.exe from - https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1/bin/winutils.exe

Copy the winutils.exe file into C:\Hadoop\bin.
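As a quick sanity check (paths assume the 2.2.1 build and the folders used in this post), you can confirm both downloads landed where expected:

# Confirm the extracted Spark directory and winutils.exe are in place
dir C:\Spark\spark-2.2.1-bin-hadoop2.7
dir C:\Hadoop\bin\winutils.exe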

 

3. Environment Variable Set-up


Let's set up the environment variables now. Open the Environment Variables window, and create new entries or edit them if already available. Based on the folders I have chosen, I need to add the following environment variables:

SPARK_HOME - C:\Spark\spark-2.2.1-bin-hadoop2.7
HADOOP_HOME - C:\Hadoop
JAVA_HOME - C:\Program Files\Java\jdk1.8.0_191

These values are as per my folder structure; please try to keep the same folder structure. For my case, it looks like the screenshot below once the environment variables are set up.

(Screenshot: the three environment variables after set-up)

Also add the Java and Spark bin directory locations to your Windows PATH variable. For my case the locations are below; I have added both to my PATH variable:

C:\Program Files\Java\jdk1.8.0_191\bin
C:\Spark\spark-2.2.1-bin-hadoop2.7\bin
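If you would rather do this from the command line, setx persists user-level variables (a minimal sketch; substitute the paths from your own set-up, and note the changes only take effect in newly opened terminals):

# Persist the three variables for the current user
setx SPARK_HOME "C:\Spark\spark-2.2.1-bin-hadoop2.7"
setx HADOOP_HOME "C:\Hadoop"
setx JAVA_HOME "C:\Program Files\Java\jdk1.8.0_191"
# The two PATH additions (the bin directories above) are safer to make through the GUI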

4. Eclipse


I believe Eclipse might already be set up on your system. If not, you can install Eclipse from - https://www.eclipse.org/downloads/packages/ Setting up Eclipse is quite straightforward, so I am going to safely skip this part.

5. Maven


Add the Maven plugin to your Eclipse using the "Help" --> "Install New Software" option. Use the below link - http://download.eclipse.org/technology/m2e/releases/

6. Scala


This is important - the version of the Scala software to be installed needs to be EXACTLY the SAME as the Scala version mentioned by Spark. To verify, type spark-shell in the Windows command line and you will notice the compatible Scala version mentioned. A snapshot of the message from my terminal -

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_191)
Type in expressions to have them evaluated.
Type :help for more information.

So I know my compatible Scala version is 2.11.8. Find your version at https://www.scala-lang.org/download/all.html, then download and install it. Verify the installed version with scala -version.
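Put together, the two checks look like this (run them from a freshly opened command prompt so the PATH changes are picked up):

# The start-up banner shows the Scala version bundled with Spark
spark-shell

# Shows the version of the standalone Scala installation - the two should match
scala -version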

7. SBT - Scala Build Tool


Download and install from - https://www.scala-sbt.org/download.html  
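Once installed, you can confirm SBT is on the PATH with its built-in about command (a quick check; any reasonably recent SBT version should be fine here):

# Prints the sbt version and the Scala version sbt itself runs on
sbt about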

8. Eclipse Plugin for Scala


Download and install the Scala plugin for Eclipse. Download from - https://marketplace.eclipse.org/content/scalastyle

Part 2 - Project-Specific Set-Up:

 

1. Folder



# 1. Make a project folder
mkdir SimpleApp

# Create the folder structure within the "SimpleApp" dir (run from PowerShell)
cd SimpleApp
mkdir lib, project, target, src
mkdir src/main, src/test
mkdir src/main/java, src/main/resources, src/main/scala
mkdir src/test/java, src/test/resources, src/test/scala

 

2. Build.sbt


Ensure the Scala version and Spark version used below match exactly what you see while running the "spark-shell" command.


// 2. Create build.sbt file in /SimpleApp dir with below content. 
// Use correct Scala & spark version
name := "Simple Project"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.1"

 

3. SBT Project Building



// ALL "sbt" COMMANDS NEED TO BE RUN FROM /SimpleApp dir level
// 3 Go to SimpleApp dir & Run below -
> sbt
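
Inside the sbt shell you can already run compile to confirm that the Spark dependency declared in build.sbt resolves (a quick sanity check; the first run downloads quite a few jars):

# Run inside the sbt shell started above
> compile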

 

4. Eclipse Plugin


 

  • Create a plugins.sbt file in /SimpleApp/project/ and copy the below contents. Also refer - https://github.com/sbt/sbteclipse

addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "5.2.4")

  • Reload and run the eclipse task so that the Eclipse project files (.project and .classpath) needed for the import in the next step get generated -

> sbt
> reload
> eclipse

5. Eclipse


  • Open Eclipse
  • Open Scala Perspective
  • Import the Project i.e. /SimpleApp dir
  • Create a New Scala Object in src/main/scala in Eclipse
  • Copy the below code
 


import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "D:\\WorkDirectory\\README.md" // Should be some file on your system
    val spark = SparkSession.builder.appName("Simple Application").master("local").getOrCreate()
    val logData = spark.read.textFile(logFile).cache()
    val numAs = logData.filter(line => line.contains("spark")).count()
    val numBs = logData.filter(line => line.contains("pyspark")).count()
    println(s"Lines with word=spark : $numAs")
    println(s"Lines with word=pyspark : $numBs")
    println("===============")
    spark.stop()
  }
}


 

  • Go to Eclipse --> Project --> Properties --> Scala Compiler --> Scala Installation (select the Scala version as in build.sbt)
  • Run
  • You can see the Spark Output in the Eclipse Console. It will look something like below -
(Screenshot: Spark output in the Eclipse console)

This marks the end of the objective of this post. Do read other posts from this blog.
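As a side note, once the application runs fine in Eclipse, the same project can also be packaged and submitted outside the IDE - the lengthier route mentioned at the start. A rough sketch (the jar name assumes sbt's default naming for the build.sbt above):

# From the /SimpleApp dir: build the jar, then hand it to spark-submit
sbt package
spark-submit --class SimpleApp --master local target\scala-2.11\simple-project_2.11-1.0.jar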