How to Set Up Apache Spark & PySpark on Windows 10

This post explains how to set up Apache Spark and PySpark on Windows 10. We will also look at some of the common errors people face during the setup. Follow the steps below in order, and it should work for you.

1. Create and Verify The Folders:

Create the folders below on the C drive. You can use any other drive, but this post uses the C drive for the setup.

1.1. For Spark - C:\Spark
1.2. For Hadoop - C:\Hadoop\bin
1.3. For Java - Check where your Java JDK is installed. If Java is not already installed, install it from the Oracle website. Java 8 has worked without any issues so far, so try that version. Once Java is installed, note down the JDK path. Typically it looks like C:\Program Files\Java\jdk1.8.0_191, but it may differ depending on the folder you chose during installation.

We will need all three folder paths above in the next steps.
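If you prefer to create the Spark and Hadoop folders from a script, here is a minimal sketch. The function name create_setup_folders and the root argument are my own choices for illustration; pass "C:/" to get the layout used in this post.

```python
from pathlib import Path


def create_setup_folders(root):
    """Create <root>/Spark and <root>/Hadoop/bin and return both paths."""
    root = Path(root)
    folders = [root / "Spark", root / "Hadoop" / "bin"]
    for folder in folders:
        # parents=True also creates Hadoop; exist_ok=True makes reruns harmless
        folder.mkdir(parents=True, exist_ok=True)
    return folders


if __name__ == "__main__":
    # "C:/" matches this post; change it if you use another drive
    for folder in create_setup_folders("C:/"):
        print(folder, "exists:", folder.is_dir())
```

Running it a second time does nothing harmful, so it also works as a quick check that the folders are really there.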


2. Download the Following:

> Download Spark, extract the files, and place them in C:\Spark. For example, I downloaded Spark 2.2.1, and after extraction it looks like C:\Spark\spark-2.2.1-bin-hadoop2.7

> Download winutils.exe and copy it into C:\Hadoop\bin
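A quick way to confirm that the downloads landed in the right place is to check for two files: spark-shell.cmd inside the Spark bin folder and winutils.exe inside C:\Hadoop\bin. The sketch below assumes the folder names from this post, and missing_files is a hypothetical helper name.

```python
from pathlib import Path


def missing_files(spark_home, hadoop_home):
    """Return the expected files that are not present, as strings."""
    expected = [
        Path(spark_home) / "bin" / "spark-shell.cmd",
        Path(hadoop_home) / "bin" / "winutils.exe",
    ]
    return [str(path) for path in expected if not path.is_file()]


if __name__ == "__main__":
    # Paths taken from this post; adjust them to your own layout
    missing = missing_files(r"C:\Spark\spark-2.2.1-bin-hadoop2.7", r"C:\Hadoop")
    if missing:
        print("Missing:", missing)
    else:
        print("All files are in place")
```

If spark-shell.cmd is reported missing, the archive was probably extracted one level too deep or too shallow.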


3. Environment Variable Set-up:

Let's set up the environment variables now. Open the Environment Variables window and create each variable below, or edit it if it already exists. Based on the folders I chose, I need to add the following:

SPARK_HOME - C:\Spark\spark-2.2.1-bin-hadoop2.7
HADOOP_HOME - C:\Hadoop
JAVA_HOME - C:\Program Files\Java\jdk1.8.0_191

Note that HADOOP_HOME points to C:\Hadoop and not to C:\Hadoop\bin, because winutils.exe is looked up in the bin folder under it. These values match my folder structure, so try to keep the same structure. Once the variables are set, it looks like this in my case: [Screenshot: environment variables]
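Since wrongly set variables are the usual cause of trouble, it can help to check them from Python before starting Spark. This is a minimal sketch; check_env and the three status strings are names I made up for illustration.

```python
import os
from pathlib import Path

REQUIRED = ("SPARK_HOME", "HADOOP_HOME", "JAVA_HOME")


def check_env(env=None):
    """Map each required variable to 'ok', 'not set', or 'folder not found'."""
    env = os.environ if env is None else env
    report = {}
    for name in REQUIRED:
        value = env.get(name)
        if not value:
            report[name] = "not set"
        elif not Path(value).is_dir():
            # The variable exists but points to a folder that is not there
            report[name] = "folder not found"
        else:
            report[name] = "ok"
    return report


if __name__ == "__main__":
    for name, status in check_env().items():
        print(name, "-", status)
```

Run it from a newly opened terminal, because windows that were already open do not pick up changes to environment variables.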

4. Run Spark:

If you have done the above steps correctly, you are ready to start Spark. In most cases, problems happen because the folder names are not set correctly in the environment variables, so double-check all the steps above and make sure everything is fine.

> Open a Windows Command Prompt or PowerShell window. Both are fine.
> Go to the Spark bin folder and copy the bin path - C:\Spark\spark-2.2.1-bin-hadoop2.7\bin
> Type - cd C:\Spark\spark-2.2.1-bin-hadoop2.7\bin
> Type - ls (in Command Prompt, use dir instead). It should show you all the Spark executable files.
> Type - spark-shell (in PowerShell, type .\spark-shell, because PowerShell does not run programs from the current folder by name alone).

You will see a screen like the one below, which confirms that Spark is running fine: [Screenshot: spark-shell]

5. PySpark:

If you reached this point, your Spark environment is ready on Windows. For PySpark, you also need Python; choose Python 3. Install Python and make sure it is added to the Windows PATH variable. Once that is done, repeat the steps from section 4, but run pyspark instead of spark-shell (in PowerShell, type .\pyspark): [Screenshot: pyspark]
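To confirm that Python is really on the PATH, you can ask Python's own shutil.which, which resolves a command name the same way the terminal does. This is only a sketch; find_on_path is a hypothetical helper, and its which parameter exists just so the lookup can be swapped out.

```python
import shutil
import sys


def find_on_path(command="python", which=shutil.which):
    """Return the full path that would run for `command`, or None."""
    return which(command)


if __name__ == "__main__":
    location = find_on_path()
    print("python on PATH:", location or "not found")
    # PySpark in this post is meant to run on Python 3
    print("Running Python {}.{}".format(*sys.version_info[:2]))
```

Once the pyspark prompt appears, typing spark.range(5).count() is a quick way to confirm that the session can actually run a job.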

6. Next Steps:

As a next step, you can also run Spark jobs using spark-submit.
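As a starting point, here is a small job you could hand to spark-submit. It is a sketch under my own assumptions: the file name sum_job.py, the app name SumJob, and the helper expected_total are all invented for illustration. It sums the numbers 0 to n-1 on Spark and prints the closed-form answer next to it so you can compare the two.

```python
"""sum_job.py - a hypothetical minimal job for spark-submit."""


def expected_total(n):
    """Closed-form sum of 0..n-1, used to cross-check Spark's result."""
    return n * (n - 1) // 2


def main(n=100):
    try:
        # spark-submit puts pyspark on the import path for us
        from pyspark.sql import SparkSession
    except ImportError:
        print("pyspark is not importable; run this file with spark-submit")
        return
    spark = SparkSession.builder.appName("SumJob").getOrCreate()
    try:
        total = spark.sparkContext.parallelize(range(n)).sum()
        print("Spark total:", total, "expected:", expected_total(n))
    finally:
        # Always release the session, even if the job fails
        spark.stop()


if __name__ == "__main__":
    main()
```

Save the file anywhere, then from the Spark bin folder run spark-submit followed by the full path to sum_job.py.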

7. Common Error:

The most common error is "The system cannot find the path specified". It happens when the environment variables and paths are not set up correctly. If you follow all the steps above correctly, this error should not appear. If you still face issues, do let me know in the comments.


