In this post , we will see How to Fix - Data Skewness in Spark using Salting Method. Data skew problem is basically related to an Uneven or Non-Uniform Distribution of data . In Real-Life Production scenarios, we often have to handle data which is far from ideal data. Hence it is imperative that we are equipped to handle such data scenarios. We will try to understand Data Skew from Two Table Join perspective.
e.g. Existing-Key + "_" + Range(1,10)
Explode(Existing-Key , Range(1,10)) -> x_1, x_2, .............,x_10
Let's Understand - What Exactly happens UNDER THE HOOD when you use the Salting method (Code is below)
Before Salting - ("=" represents records - Notice key x has more no. of records)
x ======================================
y ======
z ======
After Salting - ("=" represents records - Notice the Uniformity now across all keys)
x_1 =========
x_2 =========
x_3 =========
x_4 =========
x_5 =========
x_6 =========
x_7 =========
x_8 =========
x_9 =========
x_10 =========
y ==========
z ==========
# SALTING TECHNIQUE - SKEWNESS REMOVAL CODE
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, concat, explode, floor, lit, rand}
def removeDataSkew(bigLeftTable: DataFrame, Col_X: String, rightTable: DataFrame) = {
var df1 = bigLeftTable
.withColumn(Col_X, concat(
bigLeftTable.col(Col_X), lit("_"), lit(floor(rand(10000) * 10))))
var df2 = rightTable
.withColumn("newExplodedCol",
explode(array((0 to 10).map(lit(_)): _ *)))
(df1, df2) ===> THESE ARE THE NEW DFs FROM WHERE SKEWNESS IS REMOVED
}
# CALLING THE SALTING LOGIC TO REMOVE ORIGINAL DATA SKEW
# After removing Skewness, we get the Un-skewed New Dfs
val (newDf1, newDf2) = removeDataSkew(oldDf1, "_1", oldDf2) ==> "_1" represents the left column(the key)
# NEW JOIN AFTER REMOVING DATA SKEWNESS(THROUGH SALTING TECHNIQUE)
newDf1.join(newDf2,
newDf1.col("<col_name>")<=> newDf2.col("<col_name>")
)
.show(10,false)
Hope this post helps to understand How to Fix - Data Skewness in Spark using Salting Method. Please share the post, if it helped you.
Other Interesting Reads -
how to solve data skew in spark , spark data skew repartition , what is garbage collection in spark , why your spark applications are slow or failing, part 3, dynamic repartitioning in spark ,salting for data skewness , spark join, salted join, What is salting in spark , How does spark prevent data skew , Why Your Spark applications are slow or failing, What is data skew in spark,spark salting, salting in spark, spark skew,data skew in spark,spark skew join optimization,spark skewed join,spark data skew