PySpark - Introduction
  • Date: 2024-12-22


In this chapter, we will get acquainted with what Apache Spark is and how PySpark was developed.

Spark – Overview

Apache Spark is a lightning-fast real-time processing framework. It performs in-memory computations to analyze data in real time. It came into the picture because Apache Hadoop MapReduce performed batch processing only and lacked a real-time processing feature. Apache Spark was introduced to fill that gap: it can perform stream processing in real time and can also handle batch processing.

Apart from real-time and batch processing, Apache Spark also supports interactive queries and iterative algorithms. Apache Spark has its own cluster manager, on which it can host its applications. It can also leverage Apache Hadoop for both storage and processing: it uses HDFS (Hadoop Distributed File System) for storage and can run Spark applications on YARN.

PySpark – Overview

Apache Spark is written in the Scala programming language. To support Python with Spark, the Apache Spark community released a tool, PySpark. Using PySpark, you can work with RDDs in the Python programming language as well. This is made possible by a library called Py4J.

PySpark offers the PySpark shell, which links the Python API to the Spark core and initializes the Spark context. The majority of data scientists and analytics experts today use Python because of its rich library set. Integrating Python with Spark is a boon to them.
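
As a rough illustration of working with an RDD from Python, the sketch below builds an RDD from a local list, applies two transformations, and collects the result. The helper name `even_squares` is hypothetical and not part of PySpark. It assumes it is handed a Spark context, such as the `sc` variable that the PySpark shell creates on startup.

```python
def even_squares(sc, numbers):
    """Return the squares of the even values in `numbers`, computed via an RDD.

    `sc` is assumed to be a SparkContext; this helper is illustrative only.
    """
    rdd = sc.parallelize(numbers)              # turn the local data into an RDD
    evens = rdd.filter(lambda n: n % 2 == 0)   # transformation (evaluated lazily)
    squared = evens.map(lambda n: n * n)       # transformation (evaluated lazily)
    return squared.collect()                   # action: runs the job, returns a list


# In the PySpark shell, `sc` already exists, so this is enough:
#     even_squares(sc, [1, 2, 3, 4, 5, 6])
#
# In a standalone script you would create the context yourself, for example:
#     from pyspark import SparkContext
#     sc = SparkContext("local[*]", "intro-example")
#     print(even_squares(sc, range(1, 7)))
#     sc.stop()
```

The function takes the context as a parameter, so the same code can be used in the shell and in a script. Only the way `sc` is obtained differs.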
