Data Analysis using PySpark

Selvan Kumar
3 min read · Feb 3, 2021

Apache Spark

Spark is an open source analytics engine used for large-scale data processing. It was introduced mainly to run much faster than Hadoop MapReduce, and it processes data in a distributed, parallel fashion.

Spark offers speed, generality, and ease of use.

Speed: In-memory computation is much faster than MapReduce's disk-based processing for complex applications.

Generality: A wide range of workloads on one system, including iterative algorithms, interactive queries, and streaming.

Ease of use: APIs in Scala, Python, and Java, plus libraries for SQL, machine learning, streaming, and graph processing. It can run on a Hadoop cluster or standalone.

Data Processing in Spark

Spark is an open source analytics engine for large-scale data processing that allows data to be processed in parallel across a cluster.

There are two types of processing in Spark:

  1. Batch Processing
  2. Stream Processing

In batch processing, data is first stored in a database and queries are then run over that stored data. In stream processing, data is processed in real time as it arrives, and the results are stored afterwards.
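
As a rough illustration using the DataFrame API, a batch job and a streaming job might be sketched as below. The file paths, the events.csv layout, and the event_type column are hypothetical placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-vs-stream").getOrCreate()

# Batch: the data is already stored; read it once, query it, and finish.
batch_df = spark.read.csv("events.csv", header=True, inferSchema=True)
batch_df.groupBy("event_type").count().show()

# Streaming: keep reading new files as they arrive and update the result.
stream_df = spark.readStream.schema(batch_df.schema).csv("incoming_events/")
query = (stream_df.groupBy("event_type").count()
         .writeStream.outputMode("complete").format("console").start())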

PySpark

Apache Spark is written in the Scala programming language. To support Python with Spark, the Apache Spark community released a tool named PySpark. Using PySpark, you can work with RDDs (Resilient Distributed Datasets) from the Python programming language as well.

Spark Context

SparkContext is the entry point that establishes the connection to the cluster. If you want to run an operation, you need a SparkContext. Simply put, it lives in the driver program and is connected to all the distributed worker (data) nodes.

SparkContext creation code:

from pyspark import SparkContext

# "local[*]" runs Spark locally on all available cores; the app name is just a label.
sc = SparkContext(master="local[*]", appName="DataAnalysis")

Now that the SparkContext is ready, you can create a collection of data called an RDD (Resilient Distributed Dataset). Computation on an RDD is automatically parallelized across the cluster.
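
A minimal sketch of creating an RDD from a local Python list (the numbers and the partition count are just example values):

# Distribute a local list across the cluster as an RDD with 2 partitions.
numbers = sc.parallelize([1, 2, 3, 4, 5], 2)
print(numbers.getNumPartitions())  # -> 2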

RDD

RDD stands for Resilient Distributed Datasets

  • Spark's primary abstraction
  • A distributed collection of elements
  • Parallelized across the cluster

There are two types of operations on an RDD:

  1. Transformation
  2. Action

Transformation:

A transformation is an operation on a dataset that produces a new dataset. Transformations are lazily evaluated: execution does not happen until an action is called.

Some of the transformations in PySpark are listed below; a short sketch follows the list.

  • Group
  • Sort
  • Reduce
  • Filter
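
A minimal sketch of chaining a few transformations on an RDD (the word list and variable names like words and counts are just illustrative):

# Transformations are lazy: nothing runs until an action is called.
words = sc.parallelize(["spark", "is", "fast", "spark", "is", "easy"])
pairs = words.map(lambda w: (w, 1))                  # map each word to a (word, 1) pair
filtered = pairs.filter(lambda kv: kv[0] != "is")    # drop the word "is"
counts = filtered.reduceByKey(lambda a, b: a + b)    # combine counts per word
# No computation has actually happened yet.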

Action:

An action triggers execution of all the transformations built up so far and returns the resulting values.

Some of the actions in PySpark are listed below; the sketch above continues after the list.

  • Collect
  • Show
  • Count
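
Continuing the sketch above, calling actions on counts finally executes the whole chain (the exact ordering of the collect output may vary):

# Actions trigger execution of the pending transformations.
print(counts.collect())   # e.g. [('spark', 2), ('fast', 1), ('easy', 1)]
print(counts.count())     # number of distinct words kept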

Shared Variables in Spark

There are two types of shared variables available in Spark:

  1. Accumulator
  2. Broadcast Variables

Accumulator:

An accumulator is used for writing: many different worker nodes add to it, and the aggregated value is collected back on a single node (the driver).
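
A minimal sketch of an accumulator used as a shared counter (the data and the count_negatives function are hypothetical):

# Workers add to the accumulator; the driver reads the aggregated result.
error_count = sc.accumulator(0)

def count_negatives(value):
    if value < 0:
        error_count.add(1)   # each worker task adds to the shared counter

sc.parallelize([1, -2, 3, -4]).foreach(count_negatives)
print(error_count.value)     # -> 2, read back on the driver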

Broadcast Variables:

A broadcast variable is used for reading: a value is shipped once from a single node (the driver) to all worker nodes, which cache it locally as a read-only copy.
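
A minimal sketch of a broadcast variable used as a shared lookup table (the dictionary contents are just example data):

# Ship a read-only lookup table to every worker once.
lookup = sc.broadcast({"a": 1, "b": 2, "c": 3})
codes = sc.parallelize(["a", "b", "c", "a"])
print(codes.map(lambda k: lookup.value[k]).collect())   # -> [1, 2, 3, 1]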

Spark Submit

spark-submit is a utility for submitting your Spark program to a Spark cluster. Once a client submits the user application code, the driver implicitly converts the code containing transformations and actions into a logical Directed Acyclic Graph (DAG).
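
A typical invocation looks something like the line below (the script name is a placeholder; --master local[4] simply runs the application locally on four cores):

spark-submit --master local[4] my_app.py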

Finally, I thank Eugune Kingsley Sir for the wonderful hands-on workshop on PySpark.
