Big Data Analytics - Cleansing Data
Once the data is collected, we normally have diverse data sources with different characteristics. The most immediate step would be to make these data sources homogeneous and continue developing our data product. However, this depends on the type of data; we should ask ourselves whether it is practical to homogenize it.
Maybe the data sources are completely different, and the information loss would be large if they were homogenized. In this case, we can think of alternatives. Can one data source help me build a regression model and the other one a classification model? Is it possible to work with the heterogeneity to our advantage rather than just lose information? Making these decisions is what makes analytics interesting and challenging.
In the case of reviews, for example, each data source may be in a different language. Again, we have two choices −
Homogenization − It involves translating the different languages into the language for which we have the most data. The quality of translation services is acceptable, but if we wanted to translate massive amounts of data with an API, the cost would be significant. There are software tools available for this task, but that would be costly too.
Heterogenization − Would it be possible to develop a solution for each language? As it is simple to detect the language of a corpus, we could develop a recommender for each language. This would involve more work in terms of tuning each recommender according to the number of languages available, but it is definitely a viable option if only a few languages are available. A minimal sketch of this idea follows.
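As a quick illustration of the heterogenization approach, here is a minimal sketch in R. It assumes the textcat package (not part of this tutorial) for language detection; on very short texts its guesses can be unreliable, so treat this as a starting point rather than a finished router.

# A minimal sketch of the heterogenization approach, assuming the
# textcat package is installed: install.packages("textcat")
library(textcat)

reviews <- c("This product is great",
   "Este producto es excelente",
   "Ce produit est excellent")

# Guess the language of each review
langs <- textcat(reviews)

# Split the corpus by detected language; each group can then be used
# to train its own language-specific recommender or model
reviews_by_lang <- split(reviews, langs)
print(names(reviews_by_lang))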
Twitter Mini Project
In the present case we need to first clean the unstructured data and then convert it to a data matrix in order to apply topic modeling to it. In general, when getting data from Twitter, there are several characters we are not interested in keeping, at least in the first stage of the data cleansing process.
For example, after downloading the tweets we get strange characters such as: "<ed><U+00A0><U+00BD><ed><U+00B8><U+008B>". These are probably emoticons, so in order to clean the data, we will simply remove them using the following script. This code is also available in the bda/part1/collect_data/cleaning_data.R file.
rm(list = ls(all = TRUE)); gc()  # Clear the global environment
source("collect_data_twitter.R")

# Some tweets
head(df$text)
[1] "I’m not a big fan of turkey but baked Mac & cheese <ed><U+00A0><U+00BD><ed><U+00B8><U+008B>"
[2] "@Jayoh30 Like no special sauce on a big mac. HOW"

### We are interested in the text - let's clean it!

# First convert the encoding of the text from latin1 to ASCII,
# dropping any character (such as an emoticon) that cannot be converted
df$text <- sapply(df$text, function(row) iconv(row, "latin1", "ASCII", sub = ""))

# Create a function to clean tweets
clean.text <- function(tx) {
   tx <- gsub("htt.{1,20}", " ", tx, ignore.case = TRUE)  # Remove URLs
   # Remove punctuation (except #, to keep hashtags), @ mentions and RT markers
   tx <- gsub("[^#[:^punct:]]|@|RT", " ", tx, perl = TRUE, ignore.case = TRUE)
   tx <- gsub("[[:digit:]]", " ", tx, ignore.case = TRUE)  # Remove digits
   tx <- gsub(" {1,}", " ", tx, ignore.case = TRUE)        # Collapse repeated spaces
   tx <- gsub("^\\s+|\\s+$", " ", tx, ignore.case = TRUE)  # Normalize leading/trailing whitespace
   return(tx)
}

clean_tweets <- lapply(df$text, clean.text)

# Cleaned tweets
head(clean_tweets)
[1] " WeNeedFeminlsm MAC s new make up line features men woc and big girls "
[1] " TravelsPhoto What Happens To Your Body One Hour After A Big Mac "
The final step of the data cleansing mini project is to end up with cleaned text that we can convert to a matrix and apply an algorithm to. From the text stored in the clean_tweets vector, we can easily build a bag-of-words matrix and apply an unsupervised learning algorithm.
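As a sketch of that last step, the following code builds a document-term (bag of words) matrix with the tm package and fits an LDA topic model with the topicmodels package. Both packages and the choice of five topics are assumptions for illustration; they are not part of the original script.

# A minimal sketch, assuming the tm and topicmodels packages are installed
library(tm)
library(topicmodels)

# Build a corpus from the cleaned tweets and compute a bag-of-words matrix
corpus <- Corpus(VectorSource(unlist(clean_tweets)))
dtm <- DocumentTermMatrix(corpus,
   control = list(stopwords = TRUE, wordLengths = c(3, Inf)))

# Drop tweets that became empty after cleaning and term filtering
dtm <- dtm[rowSums(as.matrix(dtm)) > 0, ]

# Fit an unsupervised topic model with, for example, 5 topics
lda_model <- LDA(dtm, k = 5)
terms(lda_model, 5)  # Top 5 terms per topic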