How Spark Handles a Dataset Bigger than Available Memory



Spark can handle a dataset even when it is larger than the available RAM. Put simply, if a dataset doesn't fit into memory, Spark spills it to disk. Spark prefers to process data in-memory, but its capability is not restricted to memory alone. Likewise, if a cached dataset doesn't fit in memory, Spark can either (see the sketch after this list):

  • Recompute the missing partitions on the fly, or
  • Spill them to disk.
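Which of the two happens depends on the storage level you choose when caching. Here is a minimal Scala sketch of the choice; the Parquet path is a hypothetical stand-in for any dataset larger than your RAM:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheLargerThanMemory {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cache-larger-than-memory")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical path; substitute any dataset larger than your RAM.
    val df = spark.read.parquet("/data/events.parquet")

    // MEMORY_ONLY (the RDD default): partitions that don't fit are simply
    // not cached and get recomputed from lineage whenever they're needed.
    // MEMORY_AND_DISK (the Dataset/DataFrame default): partitions that
    // don't fit in memory are spilled to local disk instead.
    val cached = df.persist(StorageLevel.MEMORY_AND_DISK)

    cached.count() // the action that actually materializes the cache
    spark.stop()
  }
}
```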
So basically Spark loads as much as it can into RAM and spills the rest to disk. Note, however, that you rarely need to keep the whole dataset in memory, for two reasons:

  • Spark evaluates lazily, and
  • it doesn't start any work until you trigger an action (a short sketch follows this list).
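To illustrate that laziness, here is a sketch that reuses the spark session from the example above; the log path and the ERROR marker are assumptions for illustration:

```scala
import spark.implicits._

// Nothing is read or computed here: these lines only build a logical plan.
val logs   = spark.read.text("/data/app-logs")        // hypothetical path
val errors = logs.filter($"value".contains("ERROR"))  // still lazy

// Work begins only now, when an action fires; Spark reads and filters
// the data partition by partition, never all of it at once.
val numErrors = errors.count()
```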
Plus, in most cases you will have filter or aggregation types of operations, which cut the dataset down to something even smaller. Here is how it happens, step by step:

  • As soon as an action is triggered, Spark splits the dataset into several partitions.
  • It then processes each partition in memory, according to the transformations and the action.
  • If a partition is larger than the currently available RAM, Spark puts as much of the data in memory as it can, keeps the rest on disk, and processes it accordingly (see the sketch below).
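Putting those steps together in one sketch, again reusing the spark session from above; the column names level and date, like the file path, are illustrative assumptions:

```scala
import spark.implicits._

val events = spark.read.parquet("/data/events.parquet") // hypothetical path

// Step 1: the dataset is split into partitions (one task per partition).
println(events.rdd.getNumPartitions)

// Steps 2-3: each partition is processed in memory; the filter and the
// aggregation shrink the data long before it all has to coexist in RAM.
val dailyErrors = events
  .filter($"level" === "ERROR") // assumed column name
  .groupBy($"date")             // assumed column name
  .count()

dailyErrors.show() // the action that triggers the whole job
```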