How Spark Handles a Dataset Bigger than Available Memory



Spark can handle a dataset even when it is larger than the available RAM. Put simply, if a dataset doesn't fit into memory, Spark spills it to disk. Spark prefers to process data in-memory, but its capability is not restricted to memory alone. Likewise, if a cached dataset doesn't fit in memory, Spark can either (see the sketch after this list):

  • Recompute the missing partitions on the fly, or
  • Spill them to disk.
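Which of the two happens depends on the storage level you choose when caching. Here is a minimal Scala sketch of the choice; the Parquet path is a hypothetical stand-in for any dataset larger than your RAM:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheLargerThanMemory {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cache-larger-than-memory")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical path; substitute any dataset larger than your RAM.
    val df = spark.read.parquet("/data/events.parquet")

    // MEMORY_ONLY (the RDD default): partitions that don't fit are simply
    // not cached and get recomputed from lineage whenever they're needed.
    // MEMORY_AND_DISK (the Dataset/DataFrame default): partitions that
    // don't fit in memory are spilled to local disk instead.
    val cached = df.persist(StorageLevel.MEMORY_AND_DISK)

    cached.count() // the action that actually materializes the cache
    spark.stop()
  }
}
```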
So basically Spark loads as much as it can into RAM and spills the rest to disk. Note, however, that you rarely need to keep the whole dataset in memory, for two reasons:

  • Spark evaluates lazily, and
  • it doesn't start any work until you trigger an action (a short sketch follows this list).
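To illustrate that laziness, here is a sketch that reuses the spark session from the example above; the log path and the ERROR marker are assumptions for illustration:

```scala
import spark.implicits._

// Nothing is read or computed here: these lines only build a logical plan.
val logs   = spark.read.text("/data/app-logs")        // hypothetical path
val errors = logs.filter($"value".contains("ERROR"))  // still lazy

// Work begins only now, when an action fires; Spark reads and filters
// the data partition by partition, never all of it at once.
val numErrors = errors.count()
```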
Plus, in most cases you will have filter or aggregation types of operations, which cut the dataset down to something even smaller. Here is how it happens, step by step:

  • As soon as an action is triggered, Spark splits the dataset into several partitions.
  • It then processes each partition in memory, according to the transformations and the action.
  • If a partition is larger than the currently available RAM, Spark puts as much of the data in memory as it can, keeps the rest on disk, and processes it accordingly (see the sketch below).
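Putting those steps together in one sketch, again reusing the spark session from above; the column names level and date, like the file path, are illustrative assumptions:

```scala
import spark.implicits._

val events = spark.read.parquet("/data/events.parquet") // hypothetical path

// Step 1: the dataset is split into partitions (one task per partition).
println(events.rdd.getNumPartitions)

// Steps 2-3: each partition is processed in memory; the filter and the
// aggregation shrink the data long before it all has to coexist in RAM.
val dailyErrors = events
  .filter($"level" === "ERROR") // assumed column name
  .groupBy($"date")             // assumed column name
  .count()

dailyErrors.show() // the action that triggers the whole job
```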