How to Read (Load) Data from Local Files, HDFS & Amazon S3 in Spark



This post explains how to read (load) data from local files, HDFS, and Amazon S3 in Spark. Apache Spark can read data from many different sources; here we explore three common ones: the local file system, HDFS, and Amazon S3.

Read from Local Files


A few points on using the local file system to read data in Spark:

  • The local file system is not distributed in nature.
  • When Spark runs on a cluster, the file or directory you are accessing has to be available at the same path on every node.
  • Hence it is not an ideal option for reading files in Big Data workloads.
Nonetheless, the code syntax is:


$ spark-shell

// Read as a DataFrame (one string column named "value")
scala> val DF = spark.read.text("file:///home/data/testfile")

// Read as a Dataset[String]
scala> val DS = spark.read.textFile("file:///home/data/testfile")
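
The two reads above differ only in what they return. As a minimal sketch, assuming Spark is on the classpath and running in local mode, the standalone program below writes a throwaway sample file (the file contents and the object name `LocalReadSketch` are made up for illustration) and loads it both ways:

```scala
import java.nio.file.{Files, Path}
import java.util.Arrays
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object LocalReadSketch {
  // Writes a small throwaway file and returns its file:// URI (hypothetical sample data)
  def writeSample(): String = {
    val path: Path = Files.createTempFile("testfile", ".txt")
    Files.write(path, Arrays.asList("first line", "second line"))
    path.toUri.toString
  }

  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside this JVM, so the file is trivially visible to every executor
    val spark = SparkSession.builder()
      .appName("local-read-sketch")
      .master("local[*]")
      .getOrCreate()

    val uri = writeSample()

    // DataFrame: rows with a single string column named "value"
    val df: DataFrame = spark.read.text(uri)
    df.printSchema()

    // Dataset[String]: each element is one line of the file
    val ds: Dataset[String] = spark.read.textFile(uri)
    println(ds.filter(line => line.contains("first")).count())

    spark.stop()
  }
}
```

The Dataset[String] form is convenient when you want to work with lines as plain Scala strings, while the DataFrame form fits column-based operations.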

 

Read from HDFS


  • HDFS is one of the most widely used storage systems in the Big Data world.
  • You can put any structured, semi-structured, or unstructured data in HDFS without worrying about the schema.
  • The schema needs to be handled only when reading the files from HDFS (the schema-on-read concept).
  • Note the HDFS file path URL in our code below:
    • hdfs:// - protocol type (URI scheme)
    • localhost - host name or IP address of the NameNode (may be different in your case, e.g. 127.88.12.3)
    • 9000 - port number
    • /user/hduser/data/testfile - complete path to the file you want to load.
    • You can find the host and port values in the Hadoop core-site.xml config file; check the value of the fs.defaultFS parameter.
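
If you would rather not open core-site.xml by hand, the same value can usually be read from the Hadoop configuration that Spark has loaded. The sketch below assumes Spark runs in local mode; the helper `qualify` and the object name are hypothetical, introduced only to show how the pieces of the URL fit together:

```scala
import org.apache.spark.sql.SparkSession

object DefaultFsSketch {
  // Joins a filesystem URI such as hdfs://localhost:9000 with an absolute path
  def qualify(defaultFs: String, absolutePath: String): String =
    defaultFs.stripSuffix("/") + absolutePath

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("default-fs-sketch")
      .master("local[*]")
      .getOrCreate()

    // The Hadoop configuration includes core-site.xml when that file is on the classpath;
    // without it, the built-in default file:/// is returned
    val defaultFs = spark.sparkContext.hadoopConfiguration.get("fs.defaultFS")
    println(s"fs.defaultFS = $defaultFs")

    // Example values taken from this post
    println(qualify("hdfs://localhost:9000", "/user/hduser/data/testfile"))

    spark.stop()
  }
}
```

When fs.defaultFS already points at your NameNode, a path without a scheme, such as /user/hduser/data/testfile, is resolved against it, so the full hdfs:// prefix becomes optional.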

$ spark-shell

// Read as a Dataset[String]
scala> val DS = spark.read.textFile("hdfs://localhost:9000/user/hduser/data/testfile")
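
To illustrate the schema-on-read idea mentioned above, here is a small sketch in which the schema is supplied only at read time. The column names, the sample CSV rows, and the object name are assumptions made for this example; a temporary local file stands in for the HDFS file, and the same call should work with an hdfs:// URL:

```scala
import java.nio.file.Files
import java.util.Arrays
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object SchemaOnReadSketch {
  // Hypothetical schema: the stored file itself carries no schema information
  val schema: StructType = StructType(Seq(
    StructField("name", StringType, nullable = true),
    StructField("age", IntegerType, nullable = true)
  ))

  // Applies the schema while reading; nothing is validated when the file is written
  def readWithSchema(spark: SparkSession, uri: String): DataFrame =
    spark.read.schema(schema).csv(uri)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("schema-on-read-sketch")
      .master("local[*]")
      .getOrCreate()

    // Throwaway sample data standing in for a file on HDFS
    val path = Files.createTempFile("people", ".csv")
    Files.write(path, Arrays.asList("alice,30", "bob,25"))

    val people = readWithSchema(spark, path.toUri.toString)
    people.printSchema()
    people.show()

    spark.stop()
  }
}
```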

Read from Amazon S3


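  • Amazon S3 is an object storage service from Amazon that Spark can access like a file system through Hadoop's S3 connectors.
  • It is very widely used by applications running on the AWS (Amazon Web Services) cloud.
  • Note the file path in the example below: com.Myawsbucket is the S3 bucket name and data/testfile is the object key inside that bucket.
  • You may come across both s3:// and s3a:// URLs:
    • s3a:// reads and writes regular S3 objects (non-HDFS) that other tools can also read and write. It is the connector shipped with current Apache Hadoop releases, in the hadoop-aws module.
    • s3:// originally meant Hadoop's block-based file system, which stored files as HDFS-style blocks inside the bucket. That connector was removed in Hadoop 3; on Amazon EMR, s3:// is instead handled by EMRFS and reads regular objects.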

$ spark-shell

// Read as a Dataset[String]
scala> val DS = spark.read.textFile("s3a://com.Myawsbucket/data/testfile")
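
Reading from S3 normally also requires the hadoop-aws module on the classpath and valid AWS credentials. The sketch below shows one possible way to pass credentials to the S3A connector; it assumes they are available in the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables and that the job is launched with spark-submit (so no master is set in code). The helper `s3aSettings` and the object name are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object S3ReadSketch {
  // Maps credentials onto S3A connector properties; the "spark.hadoop." prefix
  // tells Spark to copy each setting into the Hadoop configuration
  def s3aSettings(accessKey: String, secretKey: String): Map[String, String] = Map(
    "spark.hadoop.fs.s3a.access.key" -> accessKey,
    "spark.hadoop.fs.s3a.secret.key" -> secretKey
  )

  def main(args: Array[String]): Unit = {
    // Assumption: both environment variables are set; sys.env(...) throws otherwise
    val settings = s3aSettings(sys.env("AWS_ACCESS_KEY_ID"), sys.env("AWS_SECRET_ACCESS_KEY"))

    val builder = SparkSession.builder().appName("s3-read-sketch")
    val spark = settings.foldLeft(builder) { case (b, (k, v)) => b.config(k, v) }.getOrCreate()

    // Bucket and key taken from the example in this post
    val ds = spark.read.textFile("s3a://com.Myawsbucket/data/testfile")
    println(ds.count())

    spark.stop()
  }
}
```

On clusters where credentials come from an instance profile or another provider, these two settings are typically unnecessary and can be left out.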

Hope you find this post helpful.