PySpark - StorageLevel
StorageLevel decides how an RDD should be stored. In Apache Spark, StorageLevel determines whether the RDD is stored in memory, on disk, or both. It also decides whether to serialize the RDD and whether to replicate the RDD partitions.
The following code block has the class definition of a StorageLevel −
class pyspark.StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication = 1)
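To make the meaning of each constructor argument concrete, here is a minimal pure-Python stand-in that only mirrors the argument order shown above (it is an illustration, not the real pyspark class):

```python
from collections import namedtuple

# Stand-in mirroring pyspark.StorageLevel's argument order:
# (useDisk, useMemory, useOffHeap, deserialized, replication)
StorageLevel = namedtuple(
    "StorageLevel",
    ["useDisk", "useMemory", "useOffHeap", "deserialized", "replication"],
)

# Equivalent of DISK_ONLY: spill partitions to disk only, one copy
disk_only = StorageLevel(True, False, False, False, 1)

print(disk_only.useDisk)      # True  -> partitions may be written to disk
print(disk_only.useMemory)    # False -> partitions are not cached in memory
print(disk_only.replication)  # 1     -> no extra replicas
```

Each predefined level in the list below is just a particular combination of these five values.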
To decide how an RDD is stored, the following predefined storage levels are available −
DISK_ONLY = StorageLevel(True, False, False, False, 1)
DISK_ONLY_2 = StorageLevel(True, False, False, False, 2)
MEMORY_AND_DISK = StorageLevel(True, True, False, False, 1)
MEMORY_AND_DISK_2 = StorageLevel(True, True, False, False, 2)
MEMORY_AND_DISK_SER = StorageLevel(True, True, False, False, 1)
MEMORY_AND_DISK_SER_2 = StorageLevel(True, True, False, False, 2)
MEMORY_ONLY = StorageLevel(False, True, False, False, 1)
MEMORY_ONLY_2 = StorageLevel(False, True, False, False, 2)
MEMORY_ONLY_SER = StorageLevel(False, True, False, False, 1)
MEMORY_ONLY_SER_2 = StorageLevel(False, True, False, False, 2)
OFF_HEAP = StorageLevel(True, True, True, False, 1)
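The flags of a level translate into the human-readable description that Spark prints for it. The following sketch approximates that mapping (the function name and logic here are illustrative, not PySpark's actual implementation):

```python
def describe_level(use_disk, use_memory, use_off_heap, deserialized, replication=1):
    """Build a human-readable description similar to what
    str(StorageLevel) prints in PySpark (an approximation)."""
    parts = []
    if use_disk:
        parts.append("Disk")
    if use_memory:
        parts.append("Memory")
    if use_off_heap:
        parts.append("OffHeap")
    parts.append("Deserialized" if deserialized else "Serialized")
    parts.append("%dx Replicated" % replication)
    return " ".join(parts)

# Flags of MEMORY_AND_DISK_2: memory + disk, serialized, 2 replicas
print(describe_level(True, True, False, False, 2))
# -> Disk Memory Serialized 2x Replicated
```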
Let us consider the following example of StorageLevel, where we use the storage level MEMORY_AND_DISK_2, which means the RDD partitions will have a replication of 2.
------------------------------------storagelevel.py-------------------------------------
from pyspark import SparkContext
import pyspark

sc = SparkContext("local", "storagelevel app")
rdd1 = sc.parallelize([1, 2])

# Persist the RDD with replication of 2 across memory and disk
rdd1.persist(pyspark.StorageLevel.MEMORY_AND_DISK_2)
print(rdd1.getStorageLevel())
------------------------------------storagelevel.py-------------------------------------
Command − The command is as follows −
$SPARK_HOME/bin/spark-submit storagelevel.py
Output − The output for the above command is given below −
Disk Memory Serialized 2x Replicated