
How To Read (Load) Data from Local, HDFS & Amazon S3 in Spark?

This post explains how to read (load) data from local files, HDFS, and Amazon S3 in Spark. Apache Spark can connect to different sources to read data. We will explore three common source filesystems: local files, HDFS, and Amazon S3.

Read from Local Files

A few points on using the local file system to read data in Spark:

  • The local file system is not distributed in nature.
  • The file or directory you are accessing has to be available on every node of the cluster.
  • Hence it is not an ideal option for reading files in big data workloads.
Nonetheless, the syntax is:

$ spark-shell

# Read as DataFrame
scala> val DF = spark.read.text("file:///home/data/testfile")

# Read as Dataset
scala> val DS = spark.read.textFile("file:///home/data/testfile")

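For reference, here is a quick way to see the difference between the two results in the same spark-shell session. This is a sketch assuming a plain-text file exists at the path shown; `spark` is the SparkSession that spark-shell pre-defines.

```scala
// spark.read.text(...)    -> DataFrame with a single string column named "value"
// spark.read.textFile(...) -> Dataset[String], one element per line
val df = spark.read.text("file:///home/data/testfile")
df.printSchema()
// root
//  |-- value: string (nullable = true)

val ds = spark.read.textFile("file:///home/data/testfile")
ds.filter(_.nonEmpty).show(5)  // Dataset[String] supports typed lambdas directly
```

In short, use `text` when you want to stay in the untyped DataFrame API and `textFile` when you want typed, per-line operations.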

Read from HDFS

  • HDFS is one of the most widely used and popular storage systems in the big data world.
  • You can put any structured, semi-structured or unstructured data in HDFS without worrying about the schema.
  • The schema needs to be handled only while reading the files from HDFS (the schema-on-read concept).
  • Note the HDFS file path URL in our code below:
    • hdfs:// - protocol type
    • localhost - hostname or IP address of the NameNode (may be different in your setup)
    • 9000 - port number
    • /user/hduser/data/testfile - complete path to the file you want to load
    • You can find the host and port values in Hadoop's core-site.xml config file. Check the fs.defaultFS parameter value.
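For context, the fs.defaultFS entry in core-site.xml typically looks like the fragment below; the host and port shown are examples and may differ in your installation.

```xml
<!-- $HADOOP_HOME/etc/hadoop/core-site.xml -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```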

$ spark-shell

# Read as Dataset
scala> val DS = spark.read.textFile("hdfs://localhost:9000/user/hduser/data/testfile")

Read from Amazon S3

  • S3 (Simple Storage Service) is an object storage service from Amazon.
  • It is very widely used by applications running on the AWS (Amazon Web Services) cloud.
  • Note the file path in the example below - com.Myawsbucket is the S3 bucket name.
  • You may come across both the s3:// and s3a:// schemes:
    • s3a:// stores data as regular S3 objects (non-HDFS), readable and writable by other tools outside Hadoop. It is the recommended connector in current Hadoop versions.
    • s3:// - in Apache Hadoop this referred to the legacy block-based filesystem, which stored files as HDFS-style blocks inside the S3 bucket and was not readable by other S3 tools.

$ spark-shell

# Read as Dataset
scala> val DS = spark.read.textFile("s3a://com.Myawsbucket/data/testfile")
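One practical caveat: the s3a:// connector needs the hadoop-aws module on the classpath and AWS credentials before the read above will succeed. A minimal sketch follows; the package version and the key values are placeholders, not from the original post.

```scala
// Launch spark-shell with the S3A connector on the classpath
// (the version shown is an example; match it to your Hadoop version):
//   $ spark-shell --packages org.apache.hadoop:hadoop-aws:3.3.4

// Supply credentials via the Hadoop configuration (placeholder values):
sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

val ds = spark.read.textFile("s3a://com.Myawsbucket/data/testfile")
```

Alternatively, credentials can come from environment variables or an instance profile; setting them explicitly is just the most portable option for a quick test.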

Hope you find this post helpful.