PySpark Tutorial
PySpark - SparkFiles
In Apache Spark, you can upload a file with sc.addFile (where sc is your default SparkContext) and get its local path on a worker with SparkFiles.get. SparkFiles thus resolves the paths of files added through SparkContext.addFile().
SparkFiles provides the following class methods −
get(filename)
getRootDirectory()
Let us understand them in detail.
get(filename)
It returns the absolute path of a file that was added through SparkContext.addFile().
getRootDirectory()
It returns the path to the root directory that contains the files added through SparkContext.addFile().
----------------------------------------sparkfiles.py------------------------------------
from pyspark import SparkContext
from pyspark import SparkFiles

# Path of the file to distribute, and the name used to look it up afterwards
finddistance = "/home/hadoop/examples_pyspark/finddistance.R"
finddistancename = "finddistance.R"

sc = SparkContext("local", "SparkFile App")
sc.addFile(finddistance)
print("Absolute Path -> %s" % SparkFiles.get(finddistancename))
----------------------------------------sparkfiles.py------------------------------------
Command − The command is as follows −
$SPARK_HOME/bin/spark-submit sparkfiles.py
Output − The output for the above command is −
Absolute Path -> /tmp/spark-f1170149-af01-4620-9805-f61c85fecee4/userFiles-641dfd0f-240b-4264-a650-4e06e7a57839/finddistance.R
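The script above resolves the path only on the driver. As a rough sketch of the other use described earlier, the following shows how getRootDirectory() might be printed alongside get(), and how a task could resolve the file on a worker. The names count_lines, run_demo and sample.txt are hypothetical and exist only for this illustration; the sample file is created on the fly so that the sketch does not depend on finddistance.R being present.

```python
import os
import tempfile


def count_lines(path):
    """Count the lines of a local file (hypothetical helper for this sketch)."""
    with open(path) as handle:
        return sum(1 for _ in handle)


def run_demo():
    # pyspark is imported here so that count_lines stays usable without Spark
    from pyspark import SparkContext, SparkFiles

    # Assumption: a small throwaway file stands in for a real data file
    src_dir = tempfile.mkdtemp()
    src_path = os.path.join(src_dir, "sample.txt")
    with open(src_path, "w") as handle:
        handle.write("a\nb\nc\n")

    sc = SparkContext("local", "SparkFiles Demo")
    try:
        sc.addFile(src_path)

        # Resolved on the driver
        print("Root Directory -> %s" % SparkFiles.getRootDirectory())
        print("Absolute Path -> %s" % SparkFiles.get("sample.txt"))

        # Resolved inside each task, that is, on whichever worker runs it
        counts = (
            sc.parallelize(range(2), 2)
            .map(lambda _: count_lines(SparkFiles.get("sample.txt")))
            .collect()
        )
        print("Line counts seen by tasks -> %s" % counts)
    finally:
        sc.stop()
```

To try it, add a call to run_demo() at the end of the file and submit it with spark-submit as shown above. SparkFiles.get is called inside the map function rather than once on the driver because the local path of an added file is only meaningful on the node that resolves it.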