Basics of Spark & Spark Architecture

Harshavardhan Reddy Peddireddy
2 min read · Jun 22, 2023


Spark is a fast in-memory computation engine. Compared to Hadoop, Spark is roughly 10x faster, and Spark has no storage of its own; it is purely an execution engine.

Why Spark?

• Speed

• Distribution

• Advanced analytics

• Real-time processing

• Powerful caching

• Fault tolerance

  • Spark's architecture is a master-slave architecture, like Hadoop's. Spark distributes data and processes it in parallel, which is why it is called parallel computation.

The Components of Spark

  1. Driver Program
  2. Cluster Manager
  3. Worker Node

Driver Program: The driver node hosts the SparkContext. We submit our application to the SparkContext; the SparkContext interacts with the driver program, and the driver program communicates the submitted code or application to the cluster manager. The driver program also holds metadata about the worker nodes and the data.
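
As a minimal sketch of this (the app name is illustrative), the SparkContext that the driver creates can be reached through a SparkSession:

from pyspark.sql import SparkSession

# Minimal sketch: the SparkSession, and the SparkContext inside it, run in the driver.
spark = SparkSession.builder.appName('driver-demo').getOrCreate()

sc = spark.sparkContext          # the entry point the driver uses to talk to the cluster
print(sc.applicationId)          # the ID the cluster manager assigned to this application
print(sc.defaultParallelism)     # the parallelism level the driver reports for the cluster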

Cluster Manager: The cluster manager takes instructions from the driver program and allocates the resources if they exist. In more detail: when we give instructions to the cluster manager through the driver program, the cluster manager checks whether worker nodes exist with the given configuration. If the configuration is satisfied, the cluster manager allocates worker nodes to the driver program.
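
As a hedged sketch of how an application states its resource needs to the cluster manager (the app name and numbers here are illustrative, not recommendations):

from pyspark.sql import SparkSession

# Sketch: these resource requests are what the cluster manager checks
# against the available worker nodes before allocating executors.
spark = (SparkSession.builder
         .appName('resource-demo')
         .config('spark.executor.instances', '3')   # how many executors we ask for
         .config('spark.executor.cores', '4')       # cores per executor
         .config('spark.executor.memory', '8g')     # memory per executor
         .getOrCreate())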

Worker Nodes: Worker nodes host executors, and these executors are JVM processes. Each executor has a cache and tasks. The cache is a powerful region of memory where we can store important DataFrames, and the actual work is performed in the tasks.
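
A small sketch of using the executor cache, assuming a SparkSession named spark already exists (the DataFrame is illustrative): caching keeps partitions in executor memory so repeated actions do not recompute them.

# Sketch: ask the executors to keep this DataFrame's partitions in memory.
df = spark.range(1_000_000)   # illustrative DataFrame
df.cache()                    # mark it for caching in executor memory
df.count()                    # the first action materializes the cache
df.count()                    # later actions reuse the cached partitions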

Working Principle:

(Figure: Spark architecture)

When the application code is submitted to the SparkContext, a Spark session is invoked. The driver program sends instructions to the cluster manager to get the resources (worker nodes) ready. In the application we define how many worker nodes we need and with what configuration. If the resources exist, the cluster manager allocates them, and at this point the cluster is ready to do work.

The input file path is given in the application, and that data is distributed in parallel to all worker nodes to perform the given work. At a regular heartbeat interval (10 seconds by default), the worker nodes send signals to the driver node about their working status. Because the data is distributed in parallel, the workers perform their tasks concurrently.
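
The heartbeat interval is configurable; a sketch, where '10s' matches Spark's documented default for spark.executor.heartbeatInterval:

from pyspark.sql import SparkSession

# Sketch: setting the executor heartbeat interval explicitly.
spark = (SparkSession.builder
         .appName('heartbeat-demo')
         .config('spark.executor.heartbeatInterval', '10s')
         .getOrCreate())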

Example:

Suppose we are given a 90 GB CSV file and we have 3 worker nodes; each worker node then gets roughly 30 GB of data. Each worker node's 30 GB is divided further into tasks according to the configured cores; the data is split into tasks inside the executor.
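
A sketch of inspecting how input data is split into tasks, assuming a SparkSession named spark already exists (the file path and partition count are illustrative; each partition becomes one task):

# Sketch: partitions map one-to-one to tasks on the executors.
df = spark.read.csv('/data/example_90gb.csv', header=True)   # illustrative path
print(df.rdd.getNumPartitions())   # how many tasks the read produces

df = df.repartition(12)            # e.g. 3 workers x 4 cores = 12 parallel tasks
print(df.rdd.getNumPartitions())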

Basic code to initialize Spark on an on-premise system:

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession running in local mode,
# with the Spark UI served on port 4056.
spark = SparkSession.builder \
    .master('local') \
    .appName('harsha') \
    .config('spark.ui.port', '4056') \
    .getOrCreate()

# Read a semicolon-separated CSV that has a header row,
# letting Spark infer the column types.
data = spark.read.csv(
    r"C:\Users\peddi\Downloads\staging_transport.csv",
    header=True,
    sep=';',
    inferSchema=True,
)
data.show()
