Extracting features with PySpark
In this chapter, we will learn how to extract features with PySpark in Agile Data Science.
Overview of Spark
Apache Spark can be defined as a fast real-time processing framework. It performs computations to analyze data in real time. Spark was introduced as a real-time stream processing system, and it can also handle batch processing. It supports interactive queries and iterative algorithms.
Spark is written in the Scala programming language.
PySpark can be considered a combination of Python and Spark. It offers the PySpark shell, which links the Python API to the Spark core and initializes the Spark context. Most data scientists use PySpark for tracking features, as discussed in the previous chapter.
In this example, we will focus on the transformations to build a dataset called counts and save it to a particular file.
text_file = sc.textFile("hdfs://...")
counts = text_file.flatMap(lambda line: line.split(" ")) \
   .map(lambda word: (word, 1)) \
   .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
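To see what the three transformations compute without a Spark cluster, the same pipeline can be mimicked in plain Python. This is only a sketch of the semantics, not how Spark executes the job: the function name word_counts and the sample lines are made up for illustration, and everything runs in a single process rather than on distributed partitions.

```python
from collections import defaultdict
from itertools import chain


def word_counts(lines):
    """Mimic flatMap -> map -> reduceByKey on an in-memory list of lines."""
    # flatMap step: split every line on spaces and flatten into one stream of words
    words = chain.from_iterable(line.split(" ") for line in lines)
    # map step: pair each word with the count 1
    pairs = ((word, 1) for word in words)
    # reduceByKey step: add up the counts that share the same word
    totals = defaultdict(int)
    for word, n in pairs:
        totals[word] += n
    return dict(totals)


# Hypothetical input standing in for the lines of the text file
sample_lines = ["agile data science", "data science with spark"]
counts = word_counts(sample_lines)
print(counts)
```

In the Spark version, each of these steps is a lazy transformation on an RDD, and no work is done until an action such as saveAsTextFile is called.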
Using PySpark, a user can work with RDDs (Resilient Distributed Datasets) in the Python programming language. The built-in library, which covers the basics of data-driven documents and components, helps in this.