Apache Spark Tricky Interview Questions Part 3



Apache Spark Tricky Interview Questions Part 3. This post continues our interview question series for Apache Spark. If you have not already, check out the earlier parts (links at the end of the post). We will keep publishing more posts in this series. Stay tuned.

How do you take the first 2000 rows in a Spark operation?

You can use the limit() function to take the first 2000 rows. Please note that it returns a new Dataset; it does not bring the rows back to the driver as a local collection.
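A minimal Scala sketch (the source path and options here are hypothetical, only to make the snippet self-contained):


import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("limit-example").getOrCreate()

// hypothetical source; replace with your own data
val df = spark.read.option("header", "true").csv("/path/to/data.csv")

// limit() is a transformation: it returns a new Dataset with at most 2000 rows
// and keeps the result distributed (nothing is collected to the driver here)
val first2000 = df.limit(2000)
first2000.show(5)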

What is the difference between the head() and limit() functions?

limit(n) returns a new Dataset containing the first n rows and remains a lazily evaluated, distributed transformation, whereas head(n) is an action that returns the first n rows to the driver as an array.
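For example, a short sketch assuming the same DataFrame df as above:


// limit(2000) is a transformation and returns a Dataset/DataFrame
val top2000 = df.limit(2000)

// head(2000) is an action and returns a local Array[Row] on the driver
val rows: Array[org.apache.spark.sql.Row] = df.head(2000)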

How do you calculate statistics such as the median and quantiles using Spark?

You can use the approxQuantile method.

Python:


df.approxQuantile("x", [0.5], 0.25)

Scala:


df.stat.approxQuantile("x", Array(0.5), 0.25)

The third parameter (0.25 here) is the relative error: the lower it is, the more accurate (and more expensive) the result. If you are using Spark SQL, you can use the approx_percentile function:


SELECT approx_percentile(10.0, array(0.98, 0.324, 0.1021), 100);

Difference between reduceByKey, groupByKey, aggregateByKey and combineByKey

groupByKey - The least preferred of the four. With groupByKey, all the values are sent over the network and collected on the reducer workers, which often causes out-of-memory or out-of-disk issues. groupByKey takes no merge function; it simply groups every value for each key.


// pseudocode: build a (key, count) pair RDD from the CSV source, then group
val pairRdd = sparkContext.textFile("<PATH>").map(line => (line.split(",")(0), 1))
pairRdd.groupByKey()

reduceByKey - With reduceByKey, data is combined on each partition by key, so each key produces at most one output record per partition, and only this pre-combined data is sent over the network. In effect it groups first and then aggregates, so it shuffles far less data than groupByKey() and is preferred over it. reduceByKey takes a single parameter: the merge function.


// same pair RDD as above: counts are merged per key on each partition
// before the shuffle
pairRdd.reduceByKey((x, y) => x + y)

aggregateByKey - Logically the same as reduceByKey(), but it lets you return a result of a different type: the input values can be of type X while the aggregated result is of type Y. You also supply an initial (zero) value for the aggregation. The format is aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]): it takes three parameters and uses two merge functions, one to merge values within a partition (seqOp) and one to merge results across partitions (combOp).

combineByKey is similar to aggregateByKey, except that instead of a fixed zero value it takes a function to build the initial value. combineByKey therefore takes three parameters, and all three are functions (createCombiner, mergeValue and mergeCombiners). See the sketch below.
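A short Scala sketch of both, assuming sparkContext refers to an active SparkContext as in the snippets above; the example data and the (sum, count) accumulator are hypothetical:


val pairs = sparkContext.parallelize(Seq(("a", 1), ("a", 3), ("b", 2)))

// aggregateByKey: zero value + seqOp (merge within a partition) + combOp (merge across partitions);
// input values are Int, the result is a different type, (sum, count)
val sumCount = pairs.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),            // seqOp
  (a, b)   => (a._1 + b._1, a._2 + b._2)           // combOp
)

// combineByKey: the "zero" is itself a function (createCombiner)
val sumCount2 = pairs.combineByKey(
  (v: Int) => (v, 1),                                           // createCombiner
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),        // mergeValue
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)  // mergeCombiners
)

val avgByKey = sumCount.mapValues { case (sum, count) => sum.toDouble / count }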

What is the difference between checkpointing and persisting to disk in Spark?

Lineage differences - Persist/cache keeps the lineage intact (the lineage is preserved even when data is served from the cache), while checkpoint breaks the lineage (it is completely discarded after the checkpoint).

Persisting/caching with StorageLevel.DISK_ONLY computes the RDD and stores it so that those steps do not have to be re-executed, but Spark still remembers the full lineage even though it no longer needs to replay it. The cached data or files are flushed or deleted once the Spark application completes.

Checkpointing, on the other hand, materializes the RDD physically (to HDFS) and destroys the lineage that created it. The checkpoint files are not deleted when the Spark application completes, which makes them reusable in subsequent runs. Checkpointing does cost extra computation: the RDD is computed for the job itself and computed again when it is written to the checkpoint directory, which is why it is usually cached first.
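A sketch of both in Scala (the checkpoint directory and the example data are hypothetical):


import org.apache.spark.storage.StorageLevel

// the checkpoint directory must be set before checkpoint() is called
sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")

// persist / cache: materialized on the first action, lineage is kept,
// blocks are cleaned up when the application ends
val persisted = sparkContext.parallelize(1 to 1000).map(_ * 2).persist(StorageLevel.DISK_ONLY)
persisted.count()
println(persisted.toDebugString)     // full lineage is still visible

// checkpoint: written to the checkpoint directory, lineage is truncated,
// files survive after the application finishes
val checkpointed = sparkContext.parallelize(1 to 1000).map(_ * 2)
checkpointed.checkpoint()
checkpointed.count()                 // the action that triggers the checkpoint
println(checkpointed.toDebugString)  // lineage now starts from the checkpoint file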

Explain the functionality of createOrReplaceTempView in Spark?

Unlike a traditional temp table, a temp view created by createOrReplaceTempView is not materialized, not even in memory, so the statements behind it have to be re-evaluated every time it is accessed. createOrReplaceTempView creates (or overwrites, if the view already exists) a lazily evaluated structure, a view that can be queried like a table. It is not persisted to memory; for that you have to explicitly cache the Dataset that underpins the view.
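For example, in Scala (the view name and columns here are hypothetical):


// register the DataFrame as a temp view; the view itself stores no data,
// so the underlying plan is re-evaluated on every access
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30").show()

// to avoid recomputation, cache the underlying Dataset explicitly ...
df.cache()
// ... or cache through SQL
spark.sql("CACHE TABLE people")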

Other Interesting Reads -

   

