Apache Spark Tricky Interview Questions Part 2



Apache Spark Tricky Interview Questions Part 2. This is a continuation of our interview question series for Apache Spark. If you have not already, check out the earlier parts (links at the end of the post). We will keep publishing more posts as the series continues. Stay tuned.

Does Apache Spark need Hadoop?

Spark does not necessarily require Hadoop. By default, Spark has no storage mechanism of its own. However, to run in a multi-node setup you need a resource manager such as YARN or Mesos and a distributed file system such as HDFS or S3. Also, if you are dealing with Parquet files, some of that functionality relies on Hadoop code (i.e. the handling of Parquet files).
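
For example, the minimal sketch below (app name is illustrative) runs Spark entirely in local mode, with no Hadoop cluster, YARN, or HDFS involved:

import org.apache.spark.sql.SparkSession

// Local mode: Spark runs in a single JVM with no Hadoop cluster required.
val spark = SparkSession.builder()
  .master("local[*]")                 // use all local cores
  .appName("spark-without-hadoop")    // illustrative app name
  .getOrCreate()

spark.range(10).show()                // simple sanity check

spark.stop()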

Explain the significance of “Stage Skipped” in the Apache Spark web UI?

Simply put, it means that these stages were already evaluated earlier, and their results are available without re-execution. It signifies that the data was fetched from cache (or from existing shuffle output) and there was no need to re-execute the stage. This is consistent with the DAG, which shows that the subsequent stage requires shuffling (reduceByKey). Whenever a shuffle is involved, Spark automatically persists the generated shuffle data, so later jobs that reuse it can skip those stages.
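
As a minimal illustration (the input path and app name are placeholders), re-running an action on a shuffled RDD shows the map-side stage as skipped the second time:

import org.apache.spark.{SparkConf, SparkContext}

// The input path and app name are placeholders.
val sc = new SparkContext(new SparkConf().setAppName("skipped-stages-demo").setMaster("local[*]"))

val counts = sc.textFile("/data/words.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)        // introduces a shuffle

counts.collect()             // first action: all stages execute
counts.collect()             // second action: the shuffle-map stage shows as "skipped" in the web UI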

How to pass an environment variable to a Spark job?

While submitting the Spark job using spark-submit, use the configs below -


--driver-java-options "-Dconfig.resource=app"
--files <folder_where_the_app_is_kept.conf>    ---> custom config file
--conf 'spark.executor.extraJavaOptions=-Dconfig.resource=app'
--conf 'spark.driver.extraJavaOptions=-Dconfig.resource=app'

The "--conf" used in the command will overwrite any previous one - verify this at sparkUI after job started under Environment tab.  

How to write/save a Spark DataFrame to Hive?

  • Use a HiveContext

import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)   // sc is the existing SparkContext

  • Save the Spark DataFrame as a Hive table

df.write.mode("<append|overwrite|ignore|errorIfExists>").saveAsTable("<Hive_schema_Name>.<Hive_table_Name>")
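
Putting the two steps together, here is a minimal end-to-end sketch; the input path, database name, and table name are placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// The input path, database name, and table name are placeholders.
val sc = new SparkContext(new SparkConf().setAppName("write-df-to-hive"))
val hiveContext = new HiveContext(sc)

val df = hiveContext.read.json("/data/input/events.json")

df.write
  .mode("overwrite")                    // or append / ignore / errorIfExists
  .saveAsTable("analytics_db.events")   // creates or replaces the Hive table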

How does a HashPartitioner work?

HashPartitioner takes a single constructor argument, numPartitions, which defines the number of partitions to create. Keys are then assigned to partitions based on their hash. The exact hash function depends on the API and the language in which the Spark application is developed - for example hashCode for Scala RDDs, MurmurHash 3 for Datasets, and portable_hash for PySpark RDDs. Note that if the distribution of keys is not uniform, you can end up in situations where part of your cluster sits idle. The HashPartitioner.getPartition method takes a key as its argument and returns the index of the partition the key belongs to; since the partitioner knows the valid range of indices, it always returns a number between 0 and numPartitions - 1.
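
A minimal sketch showing how partitions are assigned (the key/value data and app name are made up for illustration):

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// The key/value data is made up for illustration.
val sc = new SparkContext(new SparkConf().setAppName("hash-partitioner-demo").setMaster("local[*]"))
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3), ("a", 4)))

val partitioner = new HashPartitioner(4)          // numPartitions = 4
val partitioned = pairs.partitionBy(partitioner)

println(partitioned.getNumPartitions)             // 4
println(partitioner.getPartition("a"))            // partition index in [0, 4) derived from "a".hashCode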

How to choose - which type of cluster manager to use in Spark?

Use the standalone cluster manager if the requirement and usage are simple enough. It is the easiest to set up and provides more or less the same features as the other cluster managers if you are only running Spark. If you need to run Spark alongside other applications, or require richer resource management for managing job queues, use either YARN or Mesos. Both YARN and Mesos are external resource managers, and many commercial distributions such as Cloudera ship with YARN pre-installed. Ideally, run Spark on the same nodes as HDFS, which allows faster access to the data. You can install Mesos or the standalone cluster manager on the same nodes manually, while most Hadoop distributions already install YARN and HDFS together.
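
In practice the choice surfaces in the --master value passed to spark-submit; the host names and ports below are placeholders:

spark-submit --master spark://<master-host>:7077 ...     ---> standalone cluster manager
spark-submit --master yarn --deploy-mode cluster ...     ---> YARN
spark-submit --master mesos://<master-host>:5050 ...     ---> Mesos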

What is the Difference between spark.sql.shuffle.partitions and spark.default.parallelism?

spark.sql.shuffle.partitions configures the number of partitions used when shuffling data for joins or aggregations. spark.default.parallelism is the default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set explicitly by the user. Note that if the task you are performing is not a join or aggregation and you are using DataFrames, these settings do not take effect. You can, however, set the number of partitions yourself by calling df.repartition(numOfPartitions) in your code (don't forget to assign the result to a new val, since DataFrames are immutable).
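
A minimal sketch showing where each setting applies (the app name and partition counts are illustrative):

import org.apache.spark.sql.SparkSession

// The app name and partition counts are illustrative; set these before the session is created.
val spark = SparkSession.builder()
  .appName("partition-settings-demo")
  .master("local[*]")
  .config("spark.sql.shuffle.partitions", "200")   // used for DataFrame joins/aggregations
  .config("spark.default.parallelism", "8")        // default for RDD join/reduceByKey/parallelize
  .getOrCreate()

val df = spark.range(1000).toDF("id")

// Neither setting controls this narrow transformation, so repartition gives explicit control.
val repartitioned = df.filter("id % 2 = 0").repartition(50)
println(repartitioned.rdd.getNumPartitions)        // 50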

Other Interesting Read -

   

