Big Data Analytics - Data Collection-alljchome-开发者的教程家园

Big Data Analytics Tutorial

Big Data Analytics Project

Big Data Analytics Methods

Advanced Methods

Big Data Analytics Useful Resources

Selected Reading

Big Data Analytics - Data Collection

Data collection plays the most important role in the Big Data cycle. The Internet provides almost unpmited sources of data for a variety of topics. The importance of this area depends on the type of business, but traditional industries can acquire a spanerse source of external data and combine those with their transactional data.

For example, let’s assume we would pke to build a system that recommends restaurants. The first step would be to gather data, in this case, reviews of restaurants from different websites and store them in a database. As we are interested in raw text, and would use that for analytics, it is not that relevant where the data for developing the model would be stored. This may sound contradictory with the big data main technologies, but in order to implement a big data apppcation, we simply need to make it work in real time.

Twitter Mini Project

Once the problem is defined, the following stage is to collect the data. The following miniproject idea is to work on collecting data from the web and structuring it to be used in a machine learning model. We will collect some tweets from the twitter rest API using the R programming language.

First of all create a twitter account, and then follow the instructions in the twitteR package vignette to create a twitter developer account. This is a summary of those instructions −

Go to https://twitter.com/apps/new and log in.

After filpng in the basic info, go to the "Settings" tab and select "Read, Write and Access direct messages".

Make sure to cpck on the save button after doing this

In the "Details" tab, take note of your consumer key and consumer secret

In your R session, you’ll be using the API key and API secret values

Finally run the following script. This will install the twitteR package from its repository on github.

install.packages(c("devtools", "rjson", "bit64", "httr"))  

# Make sure to restart your R session at this point 
pbrary(devtools) 
install_github("geoffjentry/twitteR")

We are interested in getting data where the string "big mac" is included and finding out which topics stand out about this. In order to do this, the first step is collecting the data from twitter. Below is our R script to collect required data from twitter. This code is also available in bda/part1/collect_data/collect_data_twitter.R file.

rm(pst = ls(all = TRUE)); gc() # Clears the global environment
pbrary(twitteR)
Sys.setlocale(category = "LC_ALL", locale = "C")

### Replace the xxx’s with the values you got from the previous instructions

# consumer_key = "xxxxxxxxxxxxxxxxxxxx"
# consumer_secret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# access_token = "xxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# access_token_secret= "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# Connect to twitter rest API
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_token_secret)

# Get tweets related to big mac
tweets <- searchTwitter(’big mac’, n = 200, lang = ’en’)
df <- twListToDF(tweets)

# Take a look at the data
head(df)

# Check which device is most used
sources <- sapply(tweets, function(x) x$getStatusSource())
sources <- gsub("</a>", "", sources)
sources <- strsppt(sources, ">")
sources <- sapply(sources, function(x) ifelse(length(x) > 1, x[2], x[1]))
source_table = table(sources)
source_table = source_table[source_table > 1]
freq = source_table[order(source_table, decreasing = T)]
as.data.frame(freq)

#                       Frequency
# Twitter for iPhone       71
# Twitter for Android      29
# Twitter Web Cpent       25
# recognia                 20

Big Data Analytics - Data Collection

Twitter Mini Project

友情链接