Apache Spark Tricky Interview Questions Part 3. This post continues our interview question series for Apache Spark. If you have not already, watch the earlier parts (links at the end of the post). We will keep publishing more posts in this series. Stay tuned.
PySpark:
df.approxQuantile("x", [0.5], 0.25)
Scala:
df.stat.approxQuantile("x", Array(0.5), 0.25)
SQL:
SELECT approx_percentile(10.0, array(0.98, 0.324, 0.1021), 100);
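For context, here is a minimal, self-contained Scala sketch of the same approxQuantile call. The SparkSession setup, the toy DataFrame, and the object name are illustrative assumptions, not part of the original answer; relativeError = 0.25 trades accuracy for speed, while 0.0 would compute exact quantiles.

import org.apache.spark.sql.SparkSession

object ApproxQuantileExample {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumed setup, not from the original post).
    val spark = SparkSession.builder()
      .appName("ApproxQuantileExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy DataFrame with one numeric column "x" (illustrative data).
    val df = (1 to 100).map(_.toDouble).toDF("x")

    // approxQuantile(column, probabilities, relativeError)
    // probabilities = Array(0.5) asks for the approximate median.
    val median = df.stat.approxQuantile("x", Array(0.5), 0.25)
    println(s"Approximate median of x: ${median.mkString(", ")}")

    spark.stop()
  }
}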
GroupByKey - In groupByKey, all the values for a key are shuffled across the network before any aggregation happens, so the entire data set moves during the shuffle.

sparkContext.Csv(<VARIOUS_PARMS>)
  .groupByKey()
ReduceByKey - In reduceByKey, data is first combined on each partition based on the keys, so each key produces at most one output record per partition. Only this pre-combined data is then sent over the network. It is essentially a "combine locally, then aggregate" kind of operation, so it shuffles less data than groupByKey() and is preferred over it. reduceByKey takes only one parameter: a function used for merging values.
sparkContext.Csv(<VARIOUS_PARMS>)
  .reduceByKey((x, y) => x + y)
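To make the contrast concrete, below is a hedged Scala sketch that computes the same per-key sums with both operations. The sample pairs, the two-partition split, and the session setup are assumptions for illustration; the point is that reduceByKey pre-combines values on each partition before the shuffle, while groupByKey ships every value across the network first.

import org.apache.spark.sql.SparkSession

object GroupVsReduceByKey {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumed setup, not from the original post).
    val spark = SparkSession.builder()
      .appName("GroupVsReduceByKey")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Toy (key, value) pairs spread over 2 partitions (illustrative data).
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)), 2)

    // groupByKey: every value for a key is shuffled across the network,
    // and the summing happens only after the shuffle.
    val sumsViaGroup = pairs.groupByKey().mapValues(_.sum)

    // reduceByKey: values are merged per partition (map-side combine) first,
    // so only one partial sum per key per partition crosses the network.
    val sumsViaReduce = pairs.reduceByKey(_ + _)

    println(sumsViaGroup.collect().mkString(", "))  // per-key sums, e.g. (a,9), (b,6)
    println(sumsViaReduce.collect().mkString(", ")) // same result, less shuffled data

    spark.stop()
  }
}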
AggregateByKey - Logically the same as reduceByKey(), but it lets you return the result in a different type: the input values can be of type X while the aggregated result is of type Y. You can also provide an initial value for the aggregation. The format is aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]). aggregateByKey takes three parameters and uses two functions for merging: one to merge values within a partition and another to merge results across partitions. CombineByKey is similar to aggregateByKey, except that instead of a static zero value it takes a function (createCombiner) to build the initial value from the first element of each key in a partition. combineByKey takes three parameters and all three are functions.
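Here is a hedged Scala sketch of the parameter shapes described above, computing a per-key (sum, count) from Int inputs. The toy data and session setup are assumptions for illustration; note that aggregateByKey starts from a static zeroValue while combineByKey starts from a createCombiner function.

import org.apache.spark.sql.SparkSession

object AggregateVsCombineByKey {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumed setup, not from the original post).
    val spark = SparkSession.builder()
      .appName("AggregateVsCombineByKey")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Toy (key, value) pairs (illustrative data).
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)), 2)

    // aggregateByKey: input values are Int, result type is (sum, count).
    // zeroValue = (0, 0); seqOp merges one value into the accumulator within a partition;
    // combOp merges accumulators coming from different partitions.
    val sumCountAgg = pairs.aggregateByKey((0, 0))(
      (acc, v) => (acc._1 + v, acc._2 + 1),
      (a, b) => (a._1 + b._1, a._2 + b._2)
    )

    // combineByKey: same result, but the initial value comes from a function
    // (createCombiner) applied to the first value seen for each key in a partition.
    val sumCountComb = pairs.combineByKey(
      (v: Int) => (v, 1),                                            // createCombiner
      (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),         // mergeValue
      (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)   // mergeCombiners
    )

    println(sumCountAgg.collect().mkString(", "))  // per-key (sum, count), e.g. (a,(9,3)), (b,(6,2))
    println(sumCountComb.collect().mkString(", ")) // same result via combineByKey

    spark.stop()
  }
}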