PySpark - Serializers
  • Date: 2024-09-17


Serialization is used for performance tuning on Apache Spark. All data that is sent over the network, written to disk, or persisted in memory must be serialized. Serialization therefore plays an important role in costly operations.

PySpark supports custom serializers for performance tuning. The following two serializers are supported by PySpark −

MarshalSerializer

Serializes objects using Python's Marshal serializer. This serializer is faster than PickleSerializer, but supports fewer datatypes.

class pyspark.MarshalSerializer
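The datatype limitation can be seen with Python's standard-library marshal module, which MarshalSerializer builds on. As a rough sketch (plain Python, no Spark required): built-in containers round-trip fine, but arbitrary class instances do not.

```python
import marshal

# Core built-in types (numbers, strings, tuples, lists, dicts)
# round-trip through marshal without loss.
data = [1, 2.5, "text", (1, 2), {"k": [3, 4]}]
restored = marshal.loads(marshal.dumps(data))
print(restored == data)  # True

# Instances of user-defined classes are not marshallable,
# which is why MarshalSerializer supports fewer datatypes.
class Point:
    pass

try:
    marshal.dumps(Point())
except ValueError as exc:
    print("cannot marshal:", exc)
```

If your RDDs contain only simple built-in types, this restriction is harmless and the speed advantage is free.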

PickleSerializer

Serializes objects using Python's Pickle serializer. This serializer supports nearly any Python object, but may not be as fast as more specialized serializers.

class pyspark.PickleSerializer

Let us see an example of PySpark serialization. Here, we serialize the data using MarshalSerializer.

--------------------------------------serializing.py-------------------------------------
from pyspark.context import SparkContext
from pyspark.serializers import MarshalSerializer
sc = SparkContext("local", "serialization app", serializer = MarshalSerializer())
print(sc.parallelize(list(range(1000))).map(lambda x: 2 * x).take(10))
sc.stop()
--------------------------------------serializing.py-------------------------------------

Command − The command is as follows −

$SPARK_HOME/bin/spark-submit serializing.py

Output − The output of the above command is −

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]