Apache Spark Tricky Interview Questions Part 1
In this series, we will discuss some of the tricky questions generally asked in Spark interviews. We will publish further parts of this series in the future. Stay tuned.
Difference between repartition() and coalesce()
- With repartition() the number of partitions can be increased or decreased, but with coalesce() the number of partitions can only be decreased.
- The repartition algorithm does a full shuffle and creates new partitions with data that's distributed evenly.
- coalesce() avoids a full shuffle. Since the number of partitions is known to be decreasing, the executor can safely keep data on the minimum number of partitions, only moving data off the extra nodes onto the nodes that we keep. Basically, it minimizes data movement (see the sketch below).
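A minimal Scala sketch of the difference, assuming a local SparkSession; the DataFrame and partition counts are illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("partitions").getOrCreate()
import spark.implicits._

val df = (1 to 1000).toDF("n").repartition(8)  // full shuffle into 8 evenly sized partitions
println(df.rdd.getNumPartitions)               // 8

val fewer = df.coalesce(2)                     // merges partitions; avoids a full shuffle
println(fewer.rdd.getNumPartitions)            // 2

val more = df.coalesce(16)                     // coalesce cannot increase: stays at 8
println(more.rdd.getNumPartitions)             // 8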
What do the details & numbers on the progress bar signify in spark-shell?
[Stage 7:===========> (12123 + 5) / 50000]
- This is the Console Progress Bar.
- [Stage 7: indicates the stage the job is currently executing, and (12123 + 5) / 50000] reads as (numCompletedTasks + numActiveTasks) / totalNumOfTasksInThisStage.
- The bar itself (the ===> portion) tracks numCompletedTasks / totalNumOfTasksInThisStage (a small toggle snippet follows).
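As an aside, the bar is controlled by the spark.ui.showConsoleProgress setting; a minimal Scala sketch (the app name is illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("progress-demo")
  .config("spark.ui.showConsoleProgress", "true")  // set "false" to hide the bar
  .getOrCreate()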
How to check if a Spark DataFrame is empty?
In Scala:
df.rdd.isEmpty
In Python:
df.rdd.isEmpty()
df.rdd.isEmpty ----> Fastest execution
df.head(1).isEmpty -----> Slower execution
df != null && df.count > 0 -----> Slowest execution
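A runnable Scala sketch of the checks above, on a deliberately empty DataFrame (the column name is illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("empty-check").getOrCreate()
import spark.implicits._

val df = Seq.empty[Int].toDF("value")
println(df.rdd.isEmpty)      // true: the underlying RDD has no rows
println(df.head(1).isEmpty)  // true: head(1) returns an empty Array[Row]
println(df.count > 0)        // false: a full count, the most expensive check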
How to turn off INFO logging in Spark?
Edit your conf/log4j.properties file and change the following line:
log4j.rootCategory=INFO, console
**to**
log4j.rootCategory=ERROR, console
OR
import org.apache.log4j.Logger
import org.apache.log4j.Level

// Silence Spark's (org.apache.*) and Akka's loggers at runtime
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)
OR
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master('local') \
    .appName('foo') \
    .getOrCreate()
# Set the log level for this application at runtime
spark.sparkContext.setLogLevel('WARN')
Difference between Workers & Executors in a Spark cluster?
- Executors are worker-node processes in charge of running individual tasks in a given Spark job. They are launched at the beginning of a Spark application and usually run for its entire lifetime. Once they complete their assigned tasks, they send the results back to the driver. They also provide in-memory storage for RDDs that are cached by user programs, through the Block Manager.
- The Spark driver is the process where the main method runs. It first converts the user program into tasks and then schedules those tasks on the executors.
- A worker can hold many executors, for many applications, and one application can have executors on many workers. A worker instance is often called a slave, as it is a process that executes Spark tasks/jobs. The suggested mapping between a node (a physical or virtual machine) and a worker is:
Generally speaking, 1 node = 1 worker process. A worker node can, however, hold multiple executor processes if it has sufficient CPU, memory, and storage. With Mesos or YARN as the cluster manager, you can run multiple executors on the same machine with just one worker, which reduces the need to run multiple workers per machine. The standalone cluster manager, by contrast, currently allows only one executor per worker process on each physical machine. So if you have a very large machine and would like to run multiple executors on it, you have to start more than one worker process. This is what the SPARK_WORKER_INSTANCES setting in spark-env.sh is for; its default value is 1. If you do use this setting, make sure you also set SPARK_WORKER_CORES explicitly to limit the cores per worker, or else each worker will try to use all the cores (see the config sketch below).
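A hedged conf/spark-env.sh sketch for standalone mode; the core and memory values are illustrative assumptions for a 32-core, 128 GB machine, not recommendations:

# conf/spark-env.sh
SPARK_WORKER_INSTANCES=2   # run two worker processes on this machine
SPARK_WORKER_CORES=16      # cap each worker's cores, or each will claim all of them
SPARK_WORKER_MEMORY=48g    # memory each worker can hand out to its executors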
Difference between DataFrame, Dataset, and RDD in Spark
- A DataFrame is a table-like, two-dimensional structure in which each column contains data on one variable and each row contains one record. Because of its tabular format, a DataFrame carries additional metadata, which allows Spark to run certain optimizations on the finalized query. DataFrames, as collections of Dataset[Row], render a structured view onto your semi-structured data. For instance, say you have a huge IoT device event dataset expressed as JSON. Since JSON is a semi-structured format, it lends itself well to a strongly typed Dataset[DeviceIoTData].
- An RDD is Spark's native data structure. A Resilient Distributed Dataset is more of a black box of data that Spark cannot optimize, because the operations performed against it are not as constrained.
- A Dataset is a distributed collection of data. The Dataset interface, added in Spark 1.6, provides the benefits of RDDs (strong typing, the ability to use powerful lambda functions) together with the benefits of Spark SQL's optimized execution engine. A Dataset of Rows (Dataset[Row]) in Scala/Java is referred to as a DataFrame (a comparison sketch follows this list).
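A minimal Scala sketch contrasting the three APIs; DeviceIoTData here is a hypothetical case class standing in for the IoT example above:

import org.apache.spark.sql.SparkSession

case class DeviceIoTData(device: String, temp: Double)

val spark = SparkSession.builder().master("local[*]").appName("api-demo").getOrCreate()
import spark.implicits._

// RDD: a black box to Spark; no Catalyst optimization
val rdd = spark.sparkContext.parallelize(Seq(("sensor-1", 21.5)))

// DataFrame: Dataset[Row] with named columns; optimizable,
// but column references are checked only at runtime
val df = rdd.toDF("device", "temp")

// Dataset: strongly typed; field access is checked at compile time
val ds = df.as[DeviceIoTData]
ds.filter(_.temp > 20.0).show()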
Is Spark installation required on all slave nodes of the YARN cluster if running in YARN mode?
No, it is not mandatory to install Spark on all slave nodes if you submit jobs in YARN mode. This is because Spark runs on top of YARN, and YARN acts as the resource manager: Spark uses the YARN engine to get all the required resources. We have to install Spark only on the node from which jobs are submitted, as sketched below.
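For illustration, a hedged spark-submit sketch from that single node; the class name and jar path are placeholders:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  /path/to/my-app.jar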
What is the difference between HDFS and NAS (Network Attached Storage)?
- Network-attached storage (NAS) is a file-level storage architecture that makes stored data more accessible to networked devices, providing data access to a heterogeneous group of clients. NAS is one of the three main storage architectures, along with storage area networks (SAN) and direct-attached storage (DAS). NAS gives networks a single access point for storage with built-in security, management, and fault-tolerance capabilities. NAS can be either hardware or software that provides services for storing and accessing files.
- The Hadoop Distributed File System (HDFS), on the other hand, is a distributed file system designed to run on commodity hardware.
- In HDFS, data blocks are distributed across all the machines in a cluster (see the fsck sketch after this list).
- In NAS, data is stored on dedicated hardware; NAS is typically a high-end storage device, which comes at a high cost.
- HDFS is designed to work with Distributed Computation like MapReduce , where computation is moved to the data.
- NAS is not suitable for such Distributed computation.
- HDFS uses commodity hardware, which is cost-effective.
- NAS is not necessarily cost-effective.
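To see block distribution in practice, a hedged example using HDFS's standard fsck tool (the file path is a placeholder):

# Lists each block of the file and the DataNodes holding its replicas
hdfs fsck /user/data/events.log -files -blocks -locations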