Apache Spark Tricky Interview Questions Part 4



Apache Spark Tricky Interview Questions Part 4. This is a continuation of our interview question series for Apache Spark. If you have not already, check out the earlier parts (links at the end of the post). We will keep publishing more posts in this series. Stay tuned.

How to retrieve the current number of partitions of a DataFrame?

Call getNumPartitions() on the underlying RDD of the DataFrame.


df.rdd.getNumPartitions()    # PySpark

df.rdd.getNumPartitions      // Scala (parameterless method)

Alternatives


df.rdd.partitions.size

df.rdd.partitions.length
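
For example, a quick Scala sketch (the input path and the repartition factor are just assumptions for illustration):


// Hypothetical input path, used only to illustrate the call
val df = spark.read.parquet("/data/events")
println(df.rdd.getNumPartitions)                   // e.g. driven by the number of input splits

// After an explicit repartition the count reflects the new layout
println(df.repartition(8).rdd.getNumPartitions)    // 8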

When is a Spark DataFrame better than a Dataset?

First, keep in mind that a DataFrame is essentially a Dataset[Row]. A DataFrame has the advantage when the data has no fixed schema, e.g. a JSON file containing records of different types with different fields. In that case you can easily select the fields you need without knowing the whole schema, or use a dynamic configuration at runtime to decide which fields to access. In addition, the built-in DataFrame functions (e.g. df.<SOME FUNCTION>) are more efficient and better optimized than custom UDAFs or typed lambdas on a Dataset, because Spark cannot look inside a lambda to optimize it. Finally, some untyped functions (e.g. the statistical functions) are available for DataFrames but not for typed Datasets.
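
A rough Scala sketch of the difference (the Sale case class and the column names are invented for this illustration):


import org.apache.spark.sql.functions.sum
import spark.implicits._   // assumes an existing SparkSession named spark

// Hypothetical schema, used only for this sketch
case class Sale(region: String, amount: Double)

val ds = Seq(Sale("EU", 10.0), Sale("US", 25.0)).toDS()

// DataFrame-style built-in aggregation: Catalyst sees the column expressions and can optimize them
val byRegion = ds.groupBy("region").agg(sum("amount"))

// Dataset-style typed lambda: type-safe, but the function body is opaque to the optimizer
val doubled = ds.map(s => s.copy(amount = s.amount * 2))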

How to handle a skewed Dataset join in Spark?

If you know the join column and the value for which the skew occurs (e.g. column A = 1, 2, 3, 4, ...), you can split the query into two slices. Say the skew happens for column value = 3: one slice handles only the skewed data (columnA = 3) and the other handles the non-skewed data (every columnA value except 3).

select X.columnA from X join Y on X.columnA = Y.columnA where X.columnA = 3 and Y.columnA = 3;

select X.columnA from X join Y on X.columnA = Y.columnA where X.columnA <> 3;

As an alternative, you can add some randomization ("salting") to balance the join. Add a random element to the larger table and build a new join key from it. For the smaller table, replicate each row for every possible random value using Spark's explode/flatMap functions so the new join key still matches. Then join on the enhanced key (the original key plus the random component). Because of the randomness, the skewed key is spread across many partitions and the distribution becomes far more even, as shown in the sketch below.
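
A minimal salting sketch in Scala (the DataFrame names largeDf/smallDf, the join column columnA and the salt range of 10 are all assumptions for illustration):


import org.apache.spark.sql.functions._

val saltBuckets = 10   // assumed number of salt values

// Larger (skewed) side: tag every row with a random salt in [0, saltBuckets)
val largeSalted = largeDf.withColumn("salt", (rand() * saltBuckets).cast("int"))

// Smaller side: replicate each row once per salt value so every salted key can find a match
val smallSalted = smallDf.withColumn("salt", explode(array((0 until saltBuckets).map(lit): _*)))

// Join on the original key plus the salt; the hot key is now spread over saltBuckets partitions
val joined = largeSalted.join(smallSalted, Seq("columnA", "salt"))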

How to cache a Dataset inside a looping operation (say, after every 100th processing cycle)?

Let's say you need to cache the Dataset after every 100th processing cycle:


while (<SOME CONDITION>) {
    count += 1
    ds = <SOME FILTER OPERATION>
    if (count % 100 == 0) {
        ds.cache()
    }
    <SOME PROCESSING>
}

With large data, this pattern will slow down over time because each cache() consumes more memory. To avoid this, checkpoint the Dataset (or write the intermediate processed data to external storage) so that the lineage is truncated and the memory pressure stays bounded.


while (<SOME CONDITION>) {
    count += 1
    ds = <SOME FILTER OPERATION>
    if (count % 100 == 0) {
        ds.cache()
        ds = ds.checkpoint()   // or write ds out to external storage, e.g. a Parquet file
    }
    <SOME PROCESSING>
}
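
A more concrete, runnable sketch of the same pattern, assuming a SparkSession named spark; the checkpoint directory, input path, loop condition and filter column are all invented for illustration:


import org.apache.spark.sql.functions.col

// Assumed setup: a checkpoint directory must be configured before checkpoint() is called
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

var ds = spark.read.parquet("/data/input")    // hypothetical input
var count = 0

while (count < 1000) {                        // stand-in for <SOME CONDITION>
  count += 1
  ds = ds.filter(col("value") > count)        // stand-in for <SOME FILTER OPERATION>
  if (count % 100 == 0) {
    ds.cache()
    ds = ds.checkpoint()                      // truncates lineage; data is written to the checkpoint dir
  }
  // <SOME PROCESSING> goes here
}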

Other Interesting Reads -

     

