PySpark - SparkContext
SparkContext is the entry point to any Spark functionality. When we run any Spark application, a driver program starts, which has the main function, and your SparkContext gets initiated here. The driver program then runs the operations inside the executors on worker nodes.
SparkContext uses Py4J to launch a JVM and creates a JavaSparkContext. By default, PySpark has SparkContext available as 'sc', so creating a new SparkContext will not work.
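For instance, once the PySpark shell is up, the default sc object can be inspected directly. The values shown below are only illustrative and depend on your installation −
>>> sc.version      # Spark version, e.g. '2.1.0'
>>> sc.appName      # 'PySparkShell' in the interactive shell
>>> sc.master       # e.g. 'local[*]' when running locally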
The following code block has the details of a PySpark class and the parameters, which a SparkContext can take.
class pyspark.SparkContext (
   master = None,
   appName = None,
   sparkHome = None,
   pyFiles = None,
   environment = None,
   batchSize = 0,
   serializer = PickleSerializer(),
   conf = None,
   gateway = None,
   jsc = None,
   profiler_cls = <class 'pyspark.profiler.BasicProfiler'>
)
Parameters
Following are the parameters of a SparkContext.
Master − It is the URL of the cluster it connects to.
appName − Name of your job.
sparkHome − Spark installation directory.
pyFiles − The .zip or .py files to send to the cluster and add to the PYTHONPATH.
Environment − Environment variables to set on worker nodes.
batchSize − The number of Python objects represented as a single Java object. Set 1 to disable batching, 0 to automatically choose the batch size based on object sizes, or -1 to use an unlimited batch size.
Serializer − RDD serializer.
Conf − An object of L{SparkConf} to set all the Spark properties.
Gateway − Use an existing gateway and JVM; otherwise initialize a new JVM.
JSC − The JavaSparkContext instance.
profiler_cls − A class of custom Profiler used to do profiling (the default is pyspark.profiler.BasicProfiler).
Among the above parameters, master and appName are mostly used. The first two lines of any PySpark program look as shown below −
from pyspark import SparkContext
sc = SparkContext("local", "First App")
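If more of the parameters listed above are needed, they can be passed as keyword arguments to the constructor. The following is only a sketch; the local[2] master, the dependencies.zip archive and the environment variable are illustrative placeholders, not part of the original example −
from pyspark import SparkContext

# Hypothetical example: the archive name and the environment variable are placeholders
sc = SparkContext(
   master = "local[2]",                 # run locally with two worker threads
   appName = "First App",
   pyFiles = ["dependencies.zip"],      # shipped to the cluster and added to PYTHONPATH
   environment = {"MY_ENV_VAR": "1"}    # environment variables for worker nodes
)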
SparkContext Example – PySpark Shell
Now that you know enough about SparkContext, let us run a simple example on the PySpark shell. In this example, we will be counting the number of lines with the character 'a' or 'b' in the README.md file. So, let us say there are 5 lines in a file and 3 lines have the character 'a'; then the output will be → Line with a: 3. The same will be done for the character 'b'.
Note − We are not creating any SparkContext object in the following example because, by default, Spark automatically creates the SparkContext object named sc when the PySpark shell starts. In case you try to create another SparkContext object, you will get the following error − "ValueError: Cannot run multiple SparkContexts at once".
>>> logFile = "file:///home/hadoop/spark-2.1.0-bin-hadoop2.7/README.md"
>>> logData = sc.textFile(logFile).cache()
>>> numAs = logData.filter(lambda s: 'a' in s).count()
>>> numBs = logData.filter(lambda s: 'b' in s).count()
>>> print("Lines with a: %i, lines with b: %i" % (numAs, numBs))
Lines with a: 62, lines with b: 30
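If you do need your own SparkContext in an environment where one is already running (for example, to avoid the ValueError mentioned in the note above), one option is to reuse or stop the existing context. This is a minimal sketch, not part of the original example −
from pyspark import SparkContext

# Reuse the context that is already running instead of constructing a second one
sc = SparkContext.getOrCreate()

# Or stop the active context first and then create a new one
sc.stop()
sc = SparkContext("local", "New App")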
SparkContext Example - Python Program
Let us run the same example using a Python program. Create a Python file called firstapp.py and enter the following code in that file.
----------------------------------------firstapp.py---------------------------------------
from pyspark import SparkContext

logFile = "file:///home/hadoop/spark-2.1.0-bin-hadoop2.7/README.md"
sc = SparkContext("local", "first app")
logData = sc.textFile(logFile).cache()
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
print("Lines with a: %i, lines with b: %i" % (numAs, numBs))
----------------------------------------firstapp.py---------------------------------------
Then we will execute the following command in the terminal to run this Python file. We will get the same output as above.
$SPARK_HOME/bin/spark-submit firstapp.py

Output: Lines with a: 62, lines with b: 30
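If firstapp.py depended on additional local Python modules, they could be shipped with the job through the --py-files option of spark-submit. The archive name below is only a placeholder −
$SPARK_HOME/bin/spark-submit --py-files dependencies.zip firstapp.py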