
How To Read (Load) Data from Local, HDFS & Amazon S3 in Spark?

This post explains how to read (load) data from local files, HDFS, and Amazon S3 in Spark. Apache Spark can connect to different sources to read data. We will explore three common source filesystems: local files, HDFS, and Amazon S3.

Read from Local Files

A few points on using the local file system to read data in Spark:

  • The local file system is not distributed in nature.
  • The file or directory you are accessing has to be available on every node of the cluster.
  • Hence it is not an ideal option for reading files in big data workloads.
Nonetheless, the syntax is:

$ spark-shell

# Read as DataFrame
scala> val DF = spark.read.text("file:///home/data/testfile")

# Read as Dataset
scala> val DS = spark.read.textFile("file:///home/data/testfile")

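For reference, here is a quick way to see the difference between the two results in the same spark-shell session. This is a sketch assuming a plain-text file exists at the path shown; `spark` is the SparkSession that spark-shell pre-defines.

```scala
// spark.read.text(...)    -> DataFrame with a single string column named "value"
// spark.read.textFile(...) -> Dataset[String], one element per line
val df = spark.read.text("file:///home/data/testfile")
df.printSchema()
// root
//  |-- value: string (nullable = true)

val ds = spark.read.textFile("file:///home/data/testfile")
ds.filter(_.nonEmpty).show(5)  // Dataset[String] supports typed lambdas directly
```

In short, use `text` when you want to stay in the untyped DataFrame API and `textFile` when you want typed, per-line operations.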

Read from HDFS

  • HDFS is one of the most widely used and popular storage systems in the big data world.
  • You can put any structured, semi-structured or unstructured data in HDFS without worrying about the schema.
  • The schema needs to be handled only while reading the files from HDFS (the schema-on-read concept).
  • Note the HDFS file path URL in our code below:
    • hdfs:// - protocol type
    • localhost - hostname or IP address of the NameNode (may be different in your setup)
    • 9000 - port number
    • /user/hduser/data/testfile - complete path to the file you want to load
    • You can find the host and port values in Hadoop's core-site.xml config file. Check the fs.defaultFS parameter value.
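For context, the fs.defaultFS entry in core-site.xml typically looks like the fragment below; the host and port shown are examples and may differ in your installation.

```xml
<!-- $HADOOP_HOME/etc/hadoop/core-site.xml -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```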

$ spark-shell

# Read as Dataset
scala> val DS = spark.read.textFile("hdfs://localhost:9000/user/hduser/data/testfile")

Read from Amazon S3

  • S3 (Simple Storage Service) is an object storage service from Amazon.
  • It is very widely used by applications running on the AWS (Amazon Web Services) cloud.
  • Note the file path in the example below - com.Myawsbucket is the S3 bucket name.
  • You may come across both the s3:// and s3a:// schemes:
    • s3a:// stores data as regular S3 objects (non-HDFS), readable and writable by other tools outside Hadoop. It is the recommended connector in current Hadoop versions.
    • s3:// - in Apache Hadoop this referred to the legacy block-based filesystem, which stored files as HDFS-style blocks inside the S3 bucket and was not readable by other S3 tools.

$ spark-shell

# Read as Dataset
scala> val DS = spark.read.textFile("s3a://com.Myawsbucket/data/testfile")
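One practical caveat: the s3a:// connector needs the hadoop-aws module on the classpath and AWS credentials before the read above will succeed. A minimal sketch follows; the package version and the key values are placeholders, not from the original post.

```scala
// Launch spark-shell with the S3A connector on the classpath
// (the version shown is an example; match it to your Hadoop version):
//   $ spark-shell --packages org.apache.hadoop:hadoop-aws:3.3.4

// Supply credentials via the Hadoop configuration (placeholder values):
sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

val ds = spark.read.textFile("s3a://com.Myawsbucket/data/testfile")
```

Alternatively, credentials can come from environment variables or an instance profile; setting them explicitly is just the most portable option for a quick test.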

Hope you find this post helpful.