How to Install Python Packages on AWS EMR Notebooks?



In this post, we will see how to install Python packages on AWS EMR Notebooks. AWS EMR Notebooks are based on Jupyter Notebook. Note the points below about packages installed ad hoc in this way -

  • The installed libraries are available only to the notebook user who installed them, and only during that notebook session. If another user needs the same libraries, or the same user needs them in a different session, they have to be re-installed.
  • The installed libraries don't interfere with the libraries already on the cluster, but they take precedence over them.
  • After you end the notebook session, these libraries are removed from the EMR cluster.
  • You can install and import Python libraries on the remote AWS cluster as and when required, and they are then available for use in EMR Notebooks.
  • With EMR Notebooks, you can choose between the Python 3, PySpark, Spark (Scala), and SparkR kernels.
  • You can also use EMR from a SageMaker notebook with the SageMaker Sparkmagic kernels.

Pre-Steps:

Let's follow the steps -

  • Log in to AWS.
 

  • Create a new notebook using the PySpark kernel, or use an existing notebook.
 

  • Open the EMR notebook and set the kernel to "PySpark", if that is not already done.
 

  • Check the existing session configuration -

    %%info
    
  You can modify the configuration as needed (the -f flag drops the current session and recreates it with the new settings) -


%%configure -f
{
    "conf": {
        "spark.executor.memory": "4G",
        "spark.dynamicAllocation.enabled": "false"
    }
}
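
Hand-written JSON in a %%configure cell is easy to get wrong, since one trailing comma or missing quote is enough to break it. As a minimal sketch, the cell body can be generated with Python's standard json module and then pasted below the %%configure -f line. The helper name build_configure_body is hypothetical (it is not part of EMR or sparkmagic), and the property values are only examples.

```python
import json


def build_configure_body(spark_conf):
    """Return JSON text to paste below a `%%configure -f` line.

    `spark_conf` maps Spark property names to string values.
    Hypothetical helper for illustration; not part of EMR or sparkmagic.
    """
    for key, value in spark_conf.items():
        if not isinstance(value, str):
            # json.dumps would write False as the JSON boolean `false`,
            # not as the string "false", so insist on strings up front.
            raise TypeError(
                f"value for {key!r} must be a string, got {type(value).__name__}"
            )
    return json.dumps({"conf": dict(spark_conf)}, indent=4)


body = build_configure_body({
    "spark.executor.memory": "4G",
    "spark.dynamicAllocation.enabled": "false",
})
print(body)
```

Running %%info again after the session restarts is a simple way to confirm that the settings were picked up.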

 

Installing Packages:

 

  • To install additional packages, use the commands below, specifying the package details. They use the install_pypi_package API. By default, the latest version of the library is installed along with all of its dependencies, but you can also specify an exact version.
 


sc.install_pypi_package("boto3")
sc.install_pypi_package("numpy==x.y.z")  # Install numpy version x.y.z
sc.install_pypi_package("numpy")  # Install the latest numpy version
sc.install_pypi_package("<package_name_with_version>", "https://pypi.org/simple")  # Install from a specific PyPI repo
sc.install_pypi_package("pandas", "https://pypi.org/simple")
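
When a notebook needs several packages, the individual calls above can be wrapped in a small loop. The following is only a sketch: package_spec and install_all are hypothetical helper names, and sc is assumed to be the SparkContext that the PySpark kernel provides. It relies solely on the one-argument and two-argument forms of install_pypi_package shown above.

```python
def package_spec(name, version=None):
    """Build a pip-style requirement: 'numpy' or 'numpy==<version>'."""
    return f"{name}=={version}" if version else name


def install_all(sc, packages, repo=None):
    """Install every entry of `packages` (a dict of name -> version or None).

    `sc` is assumed to be the SparkContext provided by the PySpark kernel.
    Returns the requirement strings that were passed to the install call.
    """
    installed = []
    for name, version in packages.items():
        spec = package_spec(name, version)
        if repo is None:
            sc.install_pypi_package(spec)        # default repository
        else:
            sc.install_pypi_package(spec, repo)  # explicit repository URL
        installed.append(spec)
    return installed


# In a notebook cell (version left as a placeholder, as in the commands above):
# install_all(sc, {"boto3": None, "numpy": "x.y.z"})
```

Keeping the package list in one dict also makes it easy to re-run the same installs in a new session, since the libraries do not persist between sessions.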

 

  • To check the installation, run the command below. It should list the newly installed packages.

sc.list_packages()
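
As a complementary check, you can ask Python whether the modules can actually be found for import in the environment where the cell runs. This is a minimal sketch using only the standard library; missing_modules is a hypothetical helper, and the module names are examples. Note that the import name can differ from the PyPI package name (for example, scikit-learn is imported as sklearn).

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found for import."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


# An empty list means every module was found; the result depends on
# what is installed in the environment running this cell.
print(missing_modules(["json", "boto3", "numpy"]))
```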

 

Using the Python Packages in EMR Notebook:

 

  • EMR Notebooks come with pre-packaged Python libraries that you can use without installing anything. If you want to install additional Python libraries, the EMR cluster must have access to the PyPI repository.
  • The pre-packaged local libraries are available only to the Python kernel.
  • They are not available to the Spark environment on the cluster.
  • This is the opposite of the notebook-scoped libraries described above, which are installed on the cluster.
  You can check the available local libraries -


%%local 

conda list
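
To see the split between the local environment and the cluster for yourself, one option is to print which Python interpreter a cell is running in: once in a cell that starts with %%local, and once in a normal PySpark cell. This is a sketch using only the standard library; describe_interpreter is a hypothetical helper, and the paths and versions you see depend entirely on your setup.

```python
import sys


def describe_interpreter():
    """Return basic facts about the Python interpreter running this code."""
    return {
        "executable": sys.executable,         # path to the Python binary
        "version": sys.version.split()[0],    # version number only
        "prefix": sys.prefix,                 # root of the environment
    }


print(describe_interpreter())
```

If the two cells report different executables or prefixes, they are running in different environments, which is why a library installed in one is not visible in the other.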

   

Uninstalling Packages:

 

  • Once you are done, you can uninstall a package with the command below -

sc.uninstall_package('<package_name>')
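
If a package is needed only for one piece of work inside a single cell, installation and cleanup can be paired so that the uninstall runs even when the code in between fails. This is a sketch rather than an EMR feature: temporary_packages is a hypothetical helper, and sc is assumed to be the SparkContext from the PySpark kernel.

```python
from contextlib import contextmanager


@contextmanager
def temporary_packages(sc, specs):
    """Install `specs` on entry and uninstall them again on exit.

    `sc` is assumed to be the SparkContext provided by the PySpark kernel.
    """
    installed = []
    try:
        for spec in specs:
            sc.install_pypi_package(spec)
            # The uninstall call is shown with a bare package name,
            # so drop any '==version' pin before remembering it.
            installed.append(spec.split("==")[0])
        yield installed
    finally:
        # Uninstall in reverse order, including after an exception.
        for name in reversed(installed):
            sc.uninstall_package(name)


# In a notebook cell:
# with temporary_packages(sc, ["boto3"]):
#     import boto3
```

Since the libraries are removed when the notebook session ends anyway, this is only useful for tidying up within a long-running session.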

  Hope this helps.  
