This post explains how to read (load) data from Local, HDFS, and Amazon S3 files in Spark. Apache Spark can connect to a variety of sources to read data; here we will explore three common source filesystems: Local Files, HDFS, and Amazon S3.
To read a file from the local filesystem, launch spark-shell and prefix the path with file://:
$ spark-shell
# Read as Dataframe
scala> val DF = spark.read.text("file:///home/data/testfile")
# Read as Dataset
scala> val DS = spark.read.textFile("file:///home/data/testfile")
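The two reads above return different types: spark.read.text gives a DataFrame with a single string column named value, while spark.read.textFile gives a Dataset[String]. A quick way to confirm this from the shell (a minimal check, assuming the reads above succeeded):
# Inspect the loaded data
scala> DF.printSchema()   // shows a single "value" column of type string
scala> DS.count()         // number of lines in the file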
To read a file from HDFS, you need the full HDFS URL, for example hdfs://localhost:9000/user/hduser/data/testfile. It is made up of the following parts:
- hdfs:// - protocol type
- localhost - ip address (may be different in your case, e.g. 127.88.12.3)
- 9000 - port number
- /user/hduser/data/testfile - complete path to the file you want to load
You can find the ip address and port number in the core-site.xml config file. Check the fs.defaultFS parameter value.
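If you are unsure which host and port your cluster uses, one way to check (a sketch, assuming spark-shell picks up your Hadoop configuration, e.g. via HADOOP_CONF_DIR) is to print the fs.defaultFS value from within the shell:
# Check the configured default filesystem
scala> spark.sparkContext.hadoopConfiguration.get("fs.defaultFS")
// e.g. res0: String = hdfs://localhost:9000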
Now launch spark-shell and read the file using the complete HDFS URL:
$ spark-shell
# Read as Dataset
scala> val DS = spark.read.textFile("hdfs://localhost:9000/user/hduser/data/testfile")
To read a file from Amazon S3, use the s3a:// scheme along with the bucket name:
$ spark-shell
# Read as Dataset
scala> val DS = spark.read.textFile("s3a://com.Myawsbucket/data/testfile")
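Note that reading from S3 generally requires the hadoop-aws library on the classpath and valid AWS credentials. One way to supply them (a sketch; the package version and the access/secret key values are placeholders you would replace with your own) is:
$ spark-shell --packages org.apache.hadoop:hadoop-aws:3.3.4
# Set the S3A credentials before issuing the read (placeholder values)
scala> spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
scala> spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")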
Hope you find this post helpful. Additional Reads -