




How To Fix Spark Error - "org.apache.spark.shuffle.FetchFailedException: Too large frame"



In this post, we will see how to fix the Spark error "org.apache.spark.shuffle.FetchFailedException: Too large frame". You might encounter this error while running a Spark operation, with the terminal showing something like the below -


Caused by: org.apache.spark.shuffle.FetchFailedException: Too large frame: xxxxxxxxxxx

Caused by: java.lang.IllegalArgumentException: Too large frame: xxxxxxxxxx

You might also observe slightly different variations of the exception, in the form below -


FetchFailedException: Adjusted frame length exceeds xxxxxxxxxxx

You might also observe this issue surfacing from Snappy (apart from the fetch failure). This issue generally occurs in some of the situations below (there could be more such situations) -

  • When you perform a join operation between tables in Spark, especially if one of the tables used in the join is very large. Such a join shuffles data, and if a shuffle block is huge and crosses the default threshold value of 2GB, it causes the above exception.
  • When the input data is skewed.
  • When one executor stops working in the middle of the job after producing some shuffle output. Another executor then tries to fetch metadata for that shuffle output, and the exception occurs because it cannot reach the stopped executor.
  To fix this issue, check the below set of points -

  • Firstly, check your Spark version. This issue normally appears in older Spark versions (< 2.4.x). If possible, move to the latest stable Spark release and check whether the same issue persists.
 

  • One obvious option is to modify/increase the number of shuffle partitions using spark.sql.shuffle.partitions=[num_tasks], e.g. raising the partition count from the default 200 to 2001. Check whether this exercise brings the partition size below 2GB. (I don't think it is a good idea to let the partition size grow beyond the 2GB limit.)
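For example, assuming an existing SparkSession named spark (a minimal sketch in Scala; 2001 is an illustrative value, not a recommendation):

// Raise the shuffle partition count so each shuffle block stays well under the 2GB frame limit
spark.conf.set("spark.sql.shuffle.partitions", "2001")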
 

  • Decrease spark.buffer.pageSize to 2m
 

  • Try setting spark.maxRemoteBlockSizeFetchToMem to a value below 2GB, so that remote blocks above that size are fetched to disk instead of memory.
 

  • Set spark.default.parallelism = spark.sql.shuffle.partitions (same value)
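A minimal sketch (Scala) combining the three settings above in one SparkSession builder; the concrete values are illustrative assumptions, not Spark defaults:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("shuffle-frame-tuning")
  .config("spark.buffer.pageSize", "2m")                        // smaller memory pages
  .config("spark.maxRemoteBlockSizeFetchToMem", "2147483135")   // just under 2GB, so bigger remote blocks go to disk
  .config("spark.sql.shuffle.partitions", "2001")
  .config("spark.default.parallelism", "2001")                  // keep RDD parallelism in step with SQL shuffle partitions
  .getOrCreate()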
 

  • If you are running Spark with YARN in cluster mode, check the log files on the failing nodes. Search the logs for the text "Killing container". If you notice the text "running beyond physical memory limits", try to increase the spark.yarn.executor.memoryOverhead value. You can use the below option in the spark-submit command for that -

./spark-submit \
  --conf 'spark.yarn.executor.memoryOverhead=xxxxxx' \
  ...

  • Sometimes a large block shuffle takes longer than the default 120 seconds, which can cause the executors to time out. So try to increase the spark.network.timeout value.
 

  • Increase the spark.core.connection.ack.wait.timeout value
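A minimal sketch (Scala) raising both timeouts together; 600s is an illustrative value, not a recommended setting:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("shuffle-timeout-tuning")
  .config("spark.network.timeout", "600s")                      // default is 120s
  .config("spark.core.connection.ack.wait.timeout", "600s")     // falls back to spark.network.timeout when unset
  .getOrCreate()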
 

  • If skewed data is causing this exception, you could try to overcome the data skew using techniques like the Salting Method. Read our post on how to use Salting here.
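As a rough illustration of the salting idea (largeDf, smallDf and the column name key are made-up placeholders, not from the original post): the join key on the large, skewed side gets a random salt suffix, while the small side is replicated once per salt value, so the hot key is spread across many shuffle partitions.

import org.apache.spark.sql.functions._

val numSalts = 16  // illustrative salt range

// Large, skewed table: append a random salt to the join key
val saltedLarge = largeDf
  .withColumn("salted_key", concat(col("key"), lit("_"), (rand() * numSalts).cast("int").cast("string")))

// Small table: replicate each row once per salt value
val saltedSmall = smallDf
  .withColumn("salt", explode(array((0 until numSalts).map(lit): _*)))
  .withColumn("salted_key", concat(col("key"), lit("_"), col("salt").cast("string")))

// Join on the salted key instead of the original key
val joined = saltedLarge.join(saltedSmall, "salted_key")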
  Try the above steps and see how it goes.
