




How To Connect to a Database in PySpark?



In this post, we will see how to connect to a database in PySpark and the different parameters used in the connection. PySpark SQL can connect to databases using JDBC. This operation can load a table from an external database and produce the output in either of the below formats -

  • A DataFrame OR
  • A Spark SQL temp view
Note that this is different from the Spark SQL JDBC server, which is used to run Spark SQL queries from other applications.
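For instance, the temp-view output can be produced as sketched below. All connection values are placeholders for a hypothetical local PostgreSQL database, and the Spark calls are shown as comments since they require a live session and database:

```python
# Placeholder JDBC connection settings for a hypothetical local PostgreSQL DB
jdbc_url = "jdbc:postgresql://localhost:5432/dbname"
connection = {"user": "username", "password": "password"}

# With a live SparkSession `spark`, the table can be loaded and registered
# as a temp view, making it queryable with Spark SQL:
#
#   df = spark.read.jdbc(jdbc_url, "tablename", properties=connection)
#   df.createOrReplaceTempView("jdbc_table_view")
#   spark.sql("SELECT COUNT(*) FROM jdbc_table_view").show()
```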

Configuration for Database Jars:

  • Download the database-specific JDBC jar from the respective database vendor.
e.g. for PostgreSQL, download the PostgreSQL JDBC Driver from https://jdbc.postgresql.org/download.html

  • Add the directory containing the downloaded JDBC jar to the Spark classpath.

export SPARK_CLASSPATH=<path_to_database_jar>

A sample structure for making a JDBC connection from spark-shell is as follows -


./bin/spark-shell --driver-class-path <JARNAME_CONTAINING_THE_CLASS> --jars <DATABASE_JARNAME>

If using spark-submit, a sample structure is -


$ spark-submit --driver-class-path <COMPLETE_PATH_TO_DB_JAR> pysparkcode.py

There are different properties that can be used to make the JDBC connection.  

Important JDBC Properties:

Some of the important properties used to connect through JDBC are discussed below -  

  • url - The JDBC URL of the database we are trying to connect to, including the hostname, port, and database name.
e.g., url = 'jdbc:postgresql://localhost:5432/dbname'

  • dbtable - The table we are reading from or writing to.

  • customSchema - You can specify a custom schema (column names and types) to use instead of the one inferred from the JDBC source.

  • batchsize - Rows are written in batches of this size, which can improve insert performance over JDBC. The default is 1000.

  • fetchsize - The number of rows fetched per round trip when reading, which can improve read performance for JDBC drivers that default to a small fetch size.

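Putting these properties together, the sketch below bundles customSchema and fetchsize into a read configuration and batchsize into a write configuration. All connection values are placeholders; only the option names matter here, and the Spark calls are shown as comments since they need a live session:

```python
# Placeholder connection details for a hypothetical PostgreSQL database
read_options = {
    "url": "jdbc:postgresql://localhost:5432/dbname",
    "dbtable": "tablename",
    "user": "username",
    "password": "password",
    "fetchsize": "10000",  # rows fetched per round trip when reading
    "customSchema": "id DECIMAL(38, 0), name STRING",  # override inferred types
}

write_options = {
    "url": "jdbc:postgresql://localhost:5432/dbname",
    "dbtable": "tablename",
    "user": "username",
    "password": "password",
    "batchsize": "5000",  # rows per insert batch when writing (default 1000)
}

# With a live SparkSession `spark`:
#   df = spark.read.format("jdbc").options(**read_options).load()
#   df.write.format("jdbc").options(**write_options).save()
```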

Reading from a Database:

 

Option 1:


from pyspark.sql import SparkSession

spark = (SparkSession
           .builder
           .appName("PySpark App")
           .config("spark.jars", "<COMPLETE_PATH_TO_THE_DATABASE_JAR>")
           .getOrCreate())

df = (spark.read
        .format("jdbc")
        .option("url", "jdbc:postgresql://localhost:5432/TESTDATABASE")
        .option("dbtable", "tablename")
        .option("user", "username")
        .option("password", "password")
        .option("driver", "org.postgresql.Driver")
        # optional: specify the DataFrame column types explicitly
        .option("customSchema", "CUSTOMER_ID DECIMAL(38, 0), CUSTOMER_NAME STRING")
        .load())

df.printSchema()

Option 2:


df = spark.read.jdbc("jdbc:postgresql:dbserver", "tablename",
                     properties={"user": "username", "password": "password"})

Option 3:


from pyspark.sql import DataFrameReader

url = 'postgresql://localhost:5432/dbname'
properties = {'user': 'username', 'password': 'password'}

# sqlContext is the legacy SQLContext entry point (pre-created in older PySpark shells)
df = DataFrameReader(sqlContext).jdbc(url='jdbc:%s' % url, table='tablename', properties=properties)

Option 4:


url = 'postgresql://localhost:5432/dbname'
df = (sqlContext.read.format('jdbc')
        .options(url='jdbc:%s' % url, dbtable='tablename')
        .load())
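One more reading trick worth knowing: the dbtable option also accepts a parenthesised subquery with an alias in place of a plain table name, so the database filters the rows before Spark reads them. Table and column names below are placeholders, and the Spark call is shown as a comment since it needs a live session:

```python
# Hypothetical subquery pushdown via dbtable (placeholder names)
url = 'postgresql://localhost:5432/dbname'
subquery = "(SELECT id, name FROM tablename WHERE active = true) AS t"

# With a live session:
#   df = (sqlContext.read.format('jdbc')
#           .options(url='jdbc:%s' % url, dbtable=subquery)
#           .load())
```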

Writing Output to a Database:

 

Option 1:


 # Saving a DataFrame to a JDBC source
(df.write
   .format("jdbc")
   .option("url", "jdbc:postgresql:dbserver")
   .option("dbtable", "tablename")
   .option("user", "username")
   .option("password", "password")
   .save())
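When the target table already exists, the save behaviour can be controlled with mode(). The mode names below are standard DataFrameWriter save modes; the connection values are placeholders and the Spark call is shown as a comment since it needs a live session:

```python
# Standard DataFrameWriter save modes:
# "error" (the default) raises if the table exists, "append" inserts into it,
# "overwrite" replaces it, and "ignore" silently skips the write.
valid_modes = ["error", "append", "overwrite", "ignore"]

# With a live session (placeholder connection values):
#   (df.write
#      .format("jdbc")
#      .option("url", "jdbc:postgresql:dbserver")
#      .option("dbtable", "tablename")
#      .option("user", "username")
#      .option("password", "password")
#      .mode("append")
#      .save())
```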

Option 2:


df.write.jdbc("jdbc:postgresql:dbserver", "tablename",
              properties={"user": "username", "password": "password"})

Option 3:


 # Specify the column types to use when the target table is created

(df.write
   .option("createTableColumnTypes", "CUSTOMER CHAR(64), ADDRESS VARCHAR(1024)")
   .jdbc("jdbc:postgresql:dbserver", "tablename",
         properties={"user": "username", "password": "password"}))

Hope this helps.    
