




Apache Spark Tricky Interview Questions Part 5



Apache Spark Tricky Interview Questions Part 5. This post continues our interview question series for Apache Spark. If you have not already, read the earlier parts (links at the end of the post). We will keep publishing more posts in this series. Stay tuned.

Can we process Binary data in Spark?

Spark 3.0 adds a "binaryFile" data source for reading binary files. Each binary file is converted into a single DataFrame row containing the raw content along with metadata about the file. The binary file reader can load files of any type - images, PDFs, ZIPs, GZIP archives, tarballs, etc. Example code below -


Python:

spark.read.format("binaryFile").option("pathGlobFilter", "\*.png").load("/path/to/data")

Scala:

spark.read.format("binaryFile").option("pathGlobFilter", "\*.png").load("/path/to/data")

Java:
spark.read().format("binaryFile").option("pathGlobFilter", "\*.png").load("/path/to/data");

R:
read.df("/path/to/data", source = "binaryFile", pathGlobFilter = "\*.png")

The resulting DataFrame has the following schema -
path: StringType
modificationTime: TimestampType
length: LongType
content: BinaryType
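
For example, here is a minimal PySpark sketch that loads PNG files and inspects the metadata columns (the "/path/to/images" directory is a hypothetical location):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BinaryFileExample").getOrCreate()

# Read all PNG files under the (hypothetical) input directory
binary_df = (spark.read.format("binaryFile")
    .option("pathGlobFilter", "*.png")
    .load("/path/to/images"))

# Inspect the metadata columns without pulling the binary content
binary_df.select("path", "modificationTime", "length").show(truncate=False)
binary_df.printSchema()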

How can you compress a file in Spark?

You can save a DataFrame (or any other Spark output) in a compressed format, as shown below -


df.write.format("com.databricks.spark.csv")
.option("header", "true")
.option("codec", "org.apache.hadoop.io.compress.GzipCodec")
.save("datafile.csv.gz")

How do you submit a Spark JAR stored in AWS S3 to a local Spark installation?


export AWS_ACCESS_KEY_ID=access_key
export AWS_SECRET_ACCESS_KEY=secret_key

./bin/spark-submit \
--master local[2] \
--class org.apache.spark.examples.TestCode \
s3a://<AWS_Bucket_Jar_Location>/<Jar_File_Name>.jar

You will also need to download the following AWS JAR files and place them in Spark's jars folder:
https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk
https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws
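
As an alternative to exporting environment variables, the S3A credentials can also be passed through the Hadoop configuration when the session is built. A minimal PySpark sketch, assuming placeholder credential values:

from pyspark.sql import SparkSession

# spark.hadoop.* properties are forwarded to the underlying Hadoop configuration,
# so fs.s3a.access.key / fs.s3a.secret.key become visible to the S3A connector
spark = (SparkSession.builder
    .appName("S3AccessExample")
    .config("spark.hadoop.fs.s3a.access.key", "access_key")    # placeholder value
    .config("spark.hadoop.fs.s3a.secret.key", "secret_key")    # placeholder value
    .getOrCreate())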

What is Spark Shuffling?

Spark shuffling is the process of redistributing data across executors, machines or nodes in the cluster. A shuffle is triggered by certain Spark transformations such as groupByKey(), reduceByKey(), aggregateByKey() and join() on DataFrames or RDDs. (You can read about the differences between these transformations in our earlier post here.) Shuffling involves data serialization/deserialization and I/O across disk and the network, so unless it is done efficiently it can be very expensive, and it often causes Spark jobs to linger, cutting down efficiency and performance. The default number of shuffle partitions is 200, which comes from the Spark SQL configuration spark.sql.shuffle.partitions. You can change the default shuffle partition count through the SparkSession object, as shown below.


spark.conf.set("spark.sql.shuffle.partitions",150)

What are the various Spark Cache or Persist Storage Levels?

Below details are based on Spark 3.0 (a short persist() usage sketch follows this list).

  • MEMORY_ONLY - Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
  • MEMORY_AND_DISK - Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
  • MEMORY_ONLY_SER (Java and Scala) - Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
  • MEMORY_AND_DISK_SER (Java and Scala) - Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
  • DISK_ONLY - Store the RDD partitions only on disk.
  • MEMORY_ONLY_2, MEMORY_AND_DISK_2 - Same as the levels above, but replicate each partition on two cluster nodes.
  • OFF_HEAP (experimental) - Similar to MEMORY_ONLY_SER, but store the data in off-heap memory. This requires off-heap memory to be enabled.
Other Interesting Reads -

 

 