Big Data Analytics - Data Collection
Data collection plays the most important role in the Big Data cycle. The Internet provides almost unlimited sources of data for a variety of topics. The importance of this area depends on the type of business, but traditional industries can acquire diverse sources of external data and combine them with their transactional data.
For example, let’s assume we would like to build a system that recommends restaurants. The first step would be to gather data, in this case reviews of restaurants from different websites, and store them in a database. As we are interested in raw text and would use that for analytics, it is not that relevant where the data for developing the model is stored. This may sound contradictory to the main big data technologies, but in order to implement a big data application, we simply need to make it work in real time.
Twitter Mini Project
Once the problem is defined, the next stage is to collect the data. The following mini project idea is to work on collecting data from the web and structuring it to be used in a machine learning model. We will collect some tweets from the Twitter REST API using the R programming language.
First, create a Twitter account, and then follow the instructions in the twitteR package documentation to create a Twitter developer account. This is a summary of those instructions:
- Go to the Twitter developer site and log in.
- After filling in the basic info, go to the "Settings" tab and select "Read, Write and Access direct messages".
- Make sure to click the save button after doing this.
- In the "Details" tab, take note of your consumer key and consumer secret.
- In your R session, you’ll be using the API key and API secret values.
Finally, run the following script. This will install the twitteR package from its repository on GitHub.
    install.packages(c("devtools", "rjson", "bit64", "httr"))

    # Make sure to restart your R session at this point
    library(devtools)
    install_github("geoffjentry/twitteR")
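Before connecting, the credential values noted in the steps above have to be available in the R session. The following is a minimal sketch, assuming you prefer to keep them in environment variables rather than in the script itself. The variable names (TWITTER_CONSUMER_KEY and so on) and the helper read_twitter_credentials are hypothetical and are not part of the twitteR package.

```r
# Hypothetical helper: collect Twitter credentials from environment variables.
# The environment variable names are an assumption; any names would work.
read_twitter_credentials <- function() {
  vars <- c(
    consumer_key        = "TWITTER_CONSUMER_KEY",
    consumer_secret     = "TWITTER_CONSUMER_SECRET",
    access_token        = "TWITTER_ACCESS_TOKEN",
    access_token_secret = "TWITTER_ACCESS_TOKEN_SECRET"
  )

  # Unset variables come back as NA so they can be detected below
  values <- Sys.getenv(vars, unset = NA)
  names(values) <- names(vars)

  # Fail early with a readable message if anything is missing or empty
  absent <- vars[is.na(values) | values == ""]
  if (length(absent) > 0) {
    stop("Missing environment variables: ", paste(absent, collapse = ", "))
  }

  as.list(values)
}
```

The returned list has the elements consumer_key, consumer_secret, access_token and access_token_secret, which can be passed in that order to setup_twitter_oauth(). Keeping the values out of the script makes it less likely that secrets end up in version control.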
We are interested in getting data where the string "big mac" is included and finding out which topics stand out about this. In order to do this, the first step is collecting the data from Twitter. Below is our R script to collect the required data from Twitter. This code is also available in the bda/part1/collect_data/collect_data_twitter.R file.
    rm(list = ls(all = TRUE)); gc() # Clears the global environment
    library(twitteR)
    Sys.setlocale(category = "LC_ALL", locale = "C")

    ### Replace the xxx's with the values you got from the previous instructions
    consumer_key = "xxxxxxxxxxxxxxxxxxxx"
    consumer_secret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
    access_token = "xxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
    access_token_secret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

    # Connect to twitter rest API
    setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_token_secret)

    # Get tweets related to big mac
    tweets <- searchTwitter('big mac', n = 200, lang = 'en')
    df <- twListToDF(tweets)

    # Take a look at the data
    head(df)

    # Check which device is most used
    sources <- sapply(tweets, function(x) x$getStatusSource())
    sources <- gsub("</a>", "", sources)              # drop the closing HTML tag
    sources <- strsplit(sources, ">")                 # split the opening tag from the label
    sources <- sapply(sources, function(x) ifelse(length(x) > 1, x[2], x[1]))
    source_table = table(sources)
    source_table = source_table[source_table > 1]     # keep sources seen more than once
    freq = source_table[order(source_table, decreasing = TRUE)]
    as.data.frame(freq)

    #                       Frequency
    # Twitter for iPhone           71
    # Twitter for Android          29
    # Twitter Web Client           25
    # recognia                     20
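The device-counting step at the end of the script can be tried without API access. The sketch below applies the same gsub/strsplit logic to a few made-up source strings. The HTML snippets are illustrative assumptions about the format returned by getStatusSource(), not real API output.

```r
# Made-up examples of the HTML-wrapped source field (assumed format)
sources <- c(
  '<a href="http://example.com/iphone" rel="nofollow">Twitter for iPhone</a>',
  '<a href="http://example.com/iphone" rel="nofollow">Twitter for iPhone</a>',
  '<a href="http://example.com/android" rel="nofollow">Twitter for Android</a>',
  'web'
)

# Drop the closing tag, then split on ">" to separate the opening tag from the label
sources <- gsub("</a>", "", sources, fixed = TRUE)
parts <- strsplit(sources, ">", fixed = TRUE)

# Keep the label (second piece) when a tag was present, otherwise the raw string
labels <- vapply(parts, function(x) if (length(x) > 1) x[2] else x[1], character(1))

# Count the labels and sort from most to least frequent
source_table <- sort(table(labels), decreasing = TRUE)
print(source_table)
```

Using vapply with character(1) instead of sapply with ifelse guarantees a character vector even for empty input, which makes the step easier to reuse on other text fields.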