How to Distribute, Install, Manage or Ship Python Modules to Cluster Nodes in PySpark?



In this post, we will see how to distribute, manage, or ship Python modules to the cluster nodes in PySpark, or in other words, how to install Python dependencies on the Spark executors in the cluster.

In a production environment, Spark applications generally run in cluster mode under a cluster manager (Kubernetes, Mesos, YARN, etc.). The code is executed on the worker nodes, so you have to ensure that your code and all the libraries it uses are available there, i.e. that every node has the environment required to execute the code. Because Spark runs in a distributed computing environment, it is challenging to ensure everything goes smoothly. Note the points below -

  • If --deploy-mode is set to cluster (say on YARN), the Spark driver as well as the executors run inside the cluster on the worker nodes.
  • If --deploy-mode is set to client, only the Spark driver runs on the client machine or edge node. The executors still run in the cluster on the worker nodes.
If any library or package is missing on the executors in the cluster, the Spark job can fail at runtime with a ModuleNotFoundError. Let's see how we can handle this -
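To illustrate the problem, here is a minimal sketch where some_lib and its transform function stand in for any hypothetical third-party package that exists on the driver or edge node but not on the workers:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("missing-dep-demo").getOrCreate()

def uses_some_lib(x):
    import some_lib                # executed on the executors
    return some_lib.transform(x)   # hypothetical call, for illustration only

# Fails with ModuleNotFoundError raised on the executors
# if some_lib is not installed on the worker nodes.
spark.sparkContext.parallelize([1, 2, 3]).map(uses_some_lib).collect()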

Solution Option 1 :

  • We will use the --py-files argument of spark-submit to add the dependencies, i.e. .py, .zip or .egg files. These files are then distributed along with your Spark application. Alternatively, you can club all these files into a single .zip or .egg file. If you are not familiar with this flag on the Spark command line, read about it here - Spark-Submit Command Line Arguments.
Below are some examples of how to supply the additional dependency Python files along with your main PySpark program, followed by a sketch of how the main program can then import the shipped module.



pyspark --py-files <dependency_python_code_with_path>.py
pyspark --py-files <dependency_python_code_with_path>.zip




spark-submit --py-files <dependency_python_code_with_path>.py sparkMainProg.py
spark-submit --py-files <dependency_python_code_with_path>.zip sparkMainProg.py
spark-submit --py-files s3a://<dependency_python_code_with_path>.zip sparkMainProg.py
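For illustration, here is a minimal sketch of what the main program might look like. The module name helper and the function clean_record are hypothetical, standing in for whatever was shipped via --py-files (as helper.py or inside a .zip):

# sparkMainProg.py -- minimal sketch; "helper" and clean_record are hypothetical
# names for a module shipped via --py-files (as helper.py or inside a .zip).
from pyspark.sql import SparkSession
from helper import clean_record   # importable on driver and executors thanks to --py-files

spark = SparkSession.builder.appName("sparkMainProg").getOrCreate()

data = spark.sparkContext.parallelize(["  spark  ", "  python  "])
print(data.map(clean_record).collect())   # clean_record executes on the executors

spark.stop()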




   

Solution Option 2 :

  • Let's say you have a Spark program sparkMain.py which requires another Python file (A.py) from which it imports certain modules. This A.py file should be accessible to all the worker or executor nodes during execution, i.e. it has to be downloaded along with the Spark job on every node. The path passed for A.py can be a local file, an HDFS path, an FTP URI, etc.
 

  • We will use the following in sparkMain.py. During job execution, Spark distributes the file to each node, so we also have to add the directory where Spark places it to the Python path of the job. This is done as shown below -
 



sc.addFile("<path_to_the_A.py_file>/A.py")




import sys

from pyspark import SparkConf
from pyspark import SparkContext
from pyspark import SparkFiles

sys.path.insert(0, SparkFiles.getRootDirectory())


 

  • Once this is done, you can import A.py itself or call any function from it without errors, as sketched below.
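A minimal sketch of sparkMain.py putting the pieces together (my_func is a hypothetical function assumed to be defined in A.py):

import sys
from pyspark.sql import SparkSession
from pyspark import SparkFiles

spark = SparkSession.builder.appName("sparkMain").getOrCreate()
sc = spark.sparkContext

sc.addFile("<path_to_the_A.py_file>/A.py")          # distribute A.py to every node
sys.path.insert(0, SparkFiles.getRootDirectory())   # make its download location importable

import A                                            # no ModuleNotFoundError now
print(sc.parallelize([1, 2, 3]).map(A.my_func).collect())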
     

Solution Option 3 :

  • We can also use the addPyFile(path) option. This adds the dependency .py (or .zip) files to the Spark job, so that when the job is executed, the module or any of its functions can be imported from the additional Python files. Note that the path passed for the additional files can be a local file path, an HDFS path, an FTP URI, etc.
 

  • So basically we add the .py or .zip dependency on the SparkContext so that it is available to all the tasks executed in that context.
 

  • Let's say sparkProg.py is our main Spark program, which imports a module A (A.py) or uses some function from module A. We need to ensure that A.py is accessible to all the executors during the job.
 

  • Create an __init__.py file in the directory where you have your A.py file.
 

  • Create a .zip file with both A.py and __init__.py.


extraFile.zip --> it will contain A.py as well as __init__.py

extraFile.zip = (__init__.py + A.py)


 

  • Now you can refer to this dependency zip in your sparkProg.py as shown below.


from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()

spark = SparkSession.builder.config(conf=conf) \
    .appName("sparkProg") \
    .getOrCreate()

spark.sparkContext.addPyFile("/<path_to_the_zip_file>/extraFile.zip")


Alternatively, if the zip file has already been shipped with the job (for example via --files or sc.addFile), you can resolve its local path with SparkFiles and then register it; a short usage sketch follows.



from pyspark import SparkFiles

# If extraFile.zip was shipped with the job (e.g. --files or sc.addFile),
# SparkFiles.get() resolves its local path on this node.
spark.sparkContext.addPyFile(SparkFiles.get("extraFile.zip"))
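Either way, once the zip is registered with addPyFile, the module can be imported directly on the driver and inside tasks, with no sys.path changes. A minimal sketch (some_function is a hypothetical function assumed to be defined in A.py):

import A

rdd = spark.sparkContext.parallelize([1, 2, 3])
print(rdd.map(A.some_function).collect())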


 

Solution Option 4 :

Let's discuss the solution with respect to some standard packages like scipy, numpy, pandas, etc. If you use them in your PySpark program and run the Spark code on the cluster, you have to ensure that the worker nodes (executors) have access to these libraries or packages. The main idea is to create a virtual environment with all the required packages and then create a zip file out of it. That zip file is then submitted along with the PySpark job, so all the executors or worker nodes can use the additional packages like scipy, numpy, pandas, etc. from the zip file.

  • Create a virtual environment using virtualenv


$ virtualenv venv1


  • Install all required packages in the virtual environment


$ source venv1/bin/activate
(venv1)$ yum install -y gcc make python-devel
(venv1)$ pip install numpy
(venv1)$ pip install scipy


 

  • So now we have all the packages installed in the virtual environment. We will create a zip file with all these. Think of it as a suitcase containing all the packages that we just installed.


(venv1)$ zip -r venv1.zip venv1


 

  • Place the zip file on a shared store like HDFS or S3 so it is accessible to all nodes. In this example, we will use HDFS.


hdfs://<path_name>/venv1.zip
s3a://<bucket_name>/venv1.zip


  • When you submit the Spark job, the additional packages are copied from HDFS (or S3) to each worker, which can use them while executing the tasks.


# venv1.zip <--- dependency packages, unpacked on each node under the alias "environment"
# PYSPARK_PYTHON <--- Python interpreter from inside the shipped environment
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --archives hdfs://<path_name>/venv1.zip#environment \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/venv1/bin/python \
  sparkMainProg.py


 

  • If you are using the pyspark shell or notebooks, use the below -


export PYSPARK_DRIVER_PYTHON=python
export PYSPARK_PYTHON=./environment/venv1/bin/python
pyspark --archives venv1.zip#environment


 



import os
from pyspark.sql import SparkSession

os.environ['PYSPARK_PYTHON'] = "./environment/venv1/bin/python"
spark = SparkSession.builder.config(
    "spark.archives",  # 'spark.yarn.dist.archives' in YARN.
    "venv1.zip#environment").getOrCreate()


 

Solution Option 5 :

We can also use PEX to ship the Python packages. Note that a .pex file does not include a Python interpreter; it is assumed that the worker nodes have a Python interpreter pre-installed. PEX creates a self-contained, executable Python environment. To some extent it is comparable to conda or virtualenv, except for the executable part. We will perform the steps below.

  • This step generates a .pex file containing all the Python dependencies, which can be used by both the Spark driver and the executors.


pip install pyarrow pandas pex
pex pyspark pyarrow pandas -o MY_pex_env.pex


 

  • To ship the .pex file to the cluster, we will use the spark.files configuration (spark.yarn.dist.files in YARN) or the --files option.


export PYSPARK_DRIVER_PYTHON=python # DON'T SET this in cluster modes (YARN/Kubernetes)
export PYSPARK_PYTHON=./MY_pex_env.pex
spark-submit --files MY_pex_env.pex app.py


  • For Notebooks


import os
from pyspark.sql import SparkSession

os.environ['PYSPARK_PYTHON'] = "./MY_pex_env.pex"
spark = SparkSession.builder.config(
    "spark.files",  # 'spark.yarn.dist.files' in YARN.
    "MY_pex_env.pex").getOrCreate()


 

  • For pyspark shell,


export PYSPARK_DRIVER_PYTHON=python
export PYSPARK_PYTHON=./MY_pex_env.pex
pyspark --files MY_pex_env.pex


 

Solution Option 6:

  • We can also use Conda package management to ship additional or third-party Python packages, using conda-pack to create relocatable conda environments.
  • We will create an archive file with the conda Python environment and its dependencies, to be used by both the driver and the executors.
 



conda create -y -n pyspark_conda_env -c conda-forge pyarrow pandas conda-pack
conda activate pyspark_conda_env
conda pack -f -o CONDA_PACKAGES.tar.gz


  • Subsequently, ship this file along with the scripts, or in the code, by using the --archives option or the spark.archives configuration (spark.yarn.dist.archives in YARN). Spark automatically unpacks the archive on the executors.
 

  • For spark-submit script, use as follows:


export PYSPARK_DRIVER_PYTHON=python # DON'T SET this in cluster modes (YARN/Kubernetes)
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --archives CONDA_PACKAGES.tar.gz#environment app.py


 

  • For a Python notebook, use the below -


import os
from pyspark.sql import SparkSession
from app import main

os.environ['PYSPARK_PYTHON'] = "./environment/bin/python"
spark = SparkSession.builder.config(
    "spark.archives",  # 'spark.yarn.dist.archives' in YARN.
    "CONDA_PACKAGES.tar.gz#environment").getOrCreate()
main(spark)


  • For pyspark shell:


export PYSPARK_DRIVER_PYTHON=python
export PYSPARK_PYTHON=./environment/bin/python
pyspark --archives CONDA_PACKAGES.tar.gz#environment
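As a quick sanity check (a sketch), you can ask an executor which interpreter it is running; it should point into the unpacked conda environment:

import sys

# Runs on an executor; the path should end with environment/bin/python on the worker
print(spark.sparkContext.parallelize([0], 1).map(lambda _: sys.executable).collect())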


    I hope this helps.  



