How To Read Various File Formats in PySpark (JSON, Parquet, ORC, Avro, CSV)?

This post walks through sample code for reading various file formats in PySpark. We will cover the following file formats:

  • JSON
  • Parquet
  • ORC
  • Avro
  • CSV

We will use Spark SQL to load each file, read it, and then print some of its data. First, we build the basic SparkSession, which is needed in all the code blocks that follow.

from pyspark.sql import SparkSession

# Build (or reuse) the SparkSession shared by all the examples below
spark = (
    SparkSession.builder
    .appName("Various File Read")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)

# The SparkContext is used later to create an RDD
sc = spark.sparkContext


1. JSON File:

# JSON file
path = "anydir/customerData.json"
inputDF = spark.read.json(path)

# Visualize the schema using the printSchema() method
inputDF.printSchema()

# Create a temporary view using the DataFrame
inputDF.createOrReplaceTempView("customer")

# Use SQL statements
listDF = spark.sql("SELECT name FROM customer WHERE rank BETWEEN 1 AND 10")
listDF.show()

# A DataFrame can also be created for a JSON dataset from an RDD of JSON strings
jsonStrings = ['{"name":"Smith","address":{"city":"NYC","building":"rockstarforth"}}']
dataRDD = sc.parallelize(jsonStrings)
dataDf = spark.read.json(dataRDD)
dataDf.show()
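
By default, Spark's JSON reader expects one complete JSON object per line (the JSON Lines layout), not a pretty-printed document. The following is a minimal sketch for producing such a file to experiment with, using only the Python standard library. The sample records, the temporary directory, and the helper name write_json_lines are assumptions made for illustration; only the column names "name" and "rank" are taken from the SQL query above.

```python
import json
import os
import tempfile

# Hypothetical sample records; "name" and "rank" mirror the columns used in the SQL query
customers = [
    {"name": "Smith", "rank": 3},
    {"name": "Jones", "rank": 12},
]


def write_json_lines(records, path):
    """Write one compact JSON object per line (JSON Lines layout)."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


# Write into a fresh temporary directory so no existing file is touched
sample_path = os.path.join(tempfile.mkdtemp(), "customerData.json")
write_json_lines(customers, sample_path)

# Read the file back line by line to confirm each line parses on its own
with open(sample_path, encoding="utf-8") as handle:
    parsed = [json.loads(line) for line in handle]
print(parsed)
```

The resulting sample_path can be passed to spark.read.json(). If the input is instead a single JSON document spread over several lines, the reader's multiLine option, as in spark.read.option("multiLine", True).json(path), is the usual way to load it.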


2. Parquet File:

We will first read a JSON file, save it in Parquet format, and then read the Parquet file back.

inputDF = spark.read.json("somedir/customerdata.json")

# Save the DataFrame as a Parquet file, which preserves the schema information.
inputDF.write.parquet("input.parquet")

# Read the Parquet file written above. This gives a DataFrame.
dataParquet = spark.read.parquet("input.parquet")

# Parquet files can also be used to create a temporary view and then used in SQL statements.
dataParquet.createOrReplaceTempView("tableParquet")
students = spark.sql("SELECT name FROM tableParquet WHERE age >= 13 AND age <= 19")
students.show()
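
One practical detail about the save step: with the default save mode, writing to a path that already exists raises an error, so re-running the example against an existing input.parquet fails. The sketch below assumes that overwriting the previous output is acceptable; the helper name save_as_parquet is hypothetical, while DataFrameWriter.mode() and .parquet() are standard PySpark calls.

```python
def save_as_parquet(df, path, mode="overwrite"):
    """Write a DataFrame to `path` as Parquet using the given save mode.

    `df` is expected to be a PySpark DataFrame. Besides "overwrite", the
    save modes include "append", "ignore" and "error" (the default).
    """
    df.write.mode(mode).parquet(path)


# Usage with the DataFrame from this section:
# save_as_parquet(inputDF, "input.parquet")
```

Keeping the mode as a parameter makes it easy to switch back to the stricter default when overwriting data would be a mistake.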


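3. Avro File:

The Avro format is supported in Spark SQL from Spark 2.4 onward.

The spark-avro module is external, so it is not included in spark-submit or spark-shell by default. We need to add the Avro dependency, spark-avro_2.12, through --packages when submitting Spark jobs with spark-submit. The package version (2.4.4 here) should match your Spark version, and the _2.12 suffix should match the Scala version of your Spark build. Example below -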

./bin/spark-submit --packages org.apache.spark:spark-avro_2.12:2.4.4 ... 

If you want to add the Avro package to spark-shell, use the command below when launching it -

./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.4 ... 
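
When the session is created from a plain Python script rather than through spark-submit or spark-shell, the same package can be requested through the spark.jars.packages configuration property. The following is a minimal sketch under that assumption; the helper names avro_package and build_session_with_avro are hypothetical, and the property has to be set before the first SparkSession is created in the process.

```python
def avro_package(scala_version, spark_version):
    """Build the Maven coordinate for spark-avro (hypothetical helper)."""
    return f"org.apache.spark:spark-avro_{scala_version}:{spark_version}"


# Same coordinate as used with --packages above
coordinate = avro_package("2.12", "2.4.4")
print(coordinate)


def build_session_with_avro(package_coordinate):
    """Create a SparkSession that asks Spark to fetch the Avro package."""
    # Imported lazily so the helper above can be used without PySpark installed
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName("Various File Read")
        .config("spark.jars.packages", package_coordinate)
        .getOrCreate()
    )


# Usage: spark = build_session_with_avro(coordinate)
```

Building the coordinate from the two version numbers keeps the Scala suffix and the Spark version visible in one place, which is where mismatches typically creep in.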

First, let's create an Avro file:

inputDF = spark.read.json("somedir/customerdata.json")

`inputDF.select("name", "city").write.format("avro").save("customerdata.avro")`

Now use the code below to read the Avro file:

df = spark.read.format("avro").load("customerdata.avro")

4. ORC File:

orcfile = "FILE_PATH_OF_THE_ORC_FILE"
df = spark.read.format("orc").load(orcfile)

# The same read using the orc() shortcut
dataOrc = spark.read.option("inferSchema", True).orc(orcfile)

# To load from multiple paths, pass a list of paths
dataDF = spark.read.format("orc").load(["hdfs://localhost:8020/Dir1/*", "hdfs://localhost:8020/Dir2/*/part-r-*.orc"])


5. CSV Files:

Case 1: Let's say we have to define the schema of the CSV file ourselves.

from pyspark.sql.types import StructType, StructField, StringType

# One StructField per column: name, data type, nullable
idField = StructField("id", StringType(), True)
occupationField = StructField("Occupation", StringType(), True)

columnList = [idField, occupationField]
dfSchema = StructType(columnList)

# Let's print the schema details
print(dfSchema)

# Use the schema to read the CSV file
# (header=True is an assumption: it treats the first row of inputFile.csv as column names)
df = spark.read.csv('inputFile.csv', header=True, schema=dfSchema)

# Print the data
df.show()

# Print the schema
df.printSchema()

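Case 2: Let's say we want Spark to infer the schema instead of defining it ourselves.

# header=True is an assumption: it treats the first row of inputFile.csv as column names
df = spark.read.csv('inputFile.csv', header=True, inferSchema=True)

# Print the data
df.show()

# Print the schema
df.printSchema()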
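
To try both cases, a small input file is needed. The sketch below writes one with a header row using only the Python standard library. The row values and the temporary directory are assumptions made for illustration; only the column names "id" and "Occupation" come from the schema in Case 1.

```python
import csv
import os
import tempfile

# Hypothetical rows; only the column names come from the Case 1 schema
rows = [
    {"id": "1", "Occupation": "Engineer"},
    {"id": "2", "Occupation": "Teacher"},
]

# Write into a fresh temporary directory so no existing file is touched
csv_path = os.path.join(tempfile.mkdtemp(), "inputFile.csv")
with open(csv_path, "w", newline="", encoding="utf-8") as handle:
    writer = csv.DictWriter(handle, fieldnames=["id", "Occupation"])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back with the standard library to check the header and rows
with open(csv_path, newline="", encoding="utf-8") as handle:
    loaded = list(csv.DictReader(handle))
print(loaded)
```

Passing csv_path in place of 'inputFile.csv' lets either case run against this file. With an explicit schema, id is read as a string; with inferSchema, Spark would likely infer an integer type for it, which is a quick way to see the difference between the two cases.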

This concludes our concise summary of how to read various file formats in PySpark (JSON, Parquet, ORC, Avro, CSV). Hope this helps.
