PySpark - Serializers
Serialization is used for performance tuning in Apache Spark. All data that is sent over the network, written to disk, or persisted in memory must be serialized, so the choice of serializer plays an important role in these costly operations.
PySpark supports custom serializers for performance tuning. The following two serializers are supported by PySpark −
MarshalSerializer
Serializes objects using Python's Marshal Serializer. This serializer is faster than PickleSerializer, but supports fewer datatypes.
class pyspark.MarshalSerializer
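The datatype restriction can be seen with Python's built-in marshal module, which MarshalSerializer wraps internally. A minimal sketch (the Point class here is hypothetical, purely for illustration):

```python
import marshal

# marshal round-trips core built-in types: numbers, strings,
# bytes, tuples, lists, dicts, sets, None, booleans
data = marshal.dumps([1, 2.5, "text", (3, 4), {"k": b"v"}])
print(marshal.loads(data))

# ...but it rejects user-defined classes, which is why
# MarshalSerializer "supports fewer datatypes" than PickleSerializer
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

try:
    marshal.dumps(Point(1, 2))
except ValueError as exc:
    print("marshal failed:", exc)
```

If your RDD elements are plain built-in types, as in the example below, MarshalSerializer is a safe and faster choice; if they are instances of custom classes, it will fail at serialization time.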
PickleSerializer
Serializes objects using Python's Pickle Serializer. This serializer supports nearly any Python object, but may not be as fast as more specialized serializers.
class pyspark.PickleSerializer
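By contrast, the pickle module underlying PickleSerializer handles user-defined objects that marshal rejects. A small sketch with the same hypothetical Point class:

```python
import pickle

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

# pickle round-trips instances of user-defined classes,
# which marshal cannot serialize at all
p = pickle.loads(pickle.dumps(Point(1, 2)))
print(p.x, p.y)
```

This flexibility is why PickleSerializer is PySpark's default: it trades some speed for the ability to ship nearly any Python object between the driver and the executors.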
Let us see an example of PySpark serialization. Here, we serialize the data using MarshalSerializer.
--------------------------------------serializing.py-------------------------------------
from pyspark.context import SparkContext
from pyspark.serializers import MarshalSerializer

sc = SparkContext("local", "serialization app", serializer = MarshalSerializer())
print(sc.parallelize(list(range(1000))).map(lambda x: 2 * x).take(10))
sc.stop()
--------------------------------------serializing.py-------------------------------------
Command − The command is as follows −
$SPARK_HOME/bin/spark-submit serializing.py
Output − The output of the above command is −
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]