- Big Data Analytics - Data Scientist
- Big Data Analytics - Data Analyst
- Key Stakeholders
- Core Deliverables
- Big Data Analytics - Methodology
- Big Data Analytics - Data Life Cycle
- Big Data Analytics - Overview
- Big Data Analytics - Home
Big Data Analytics Project
- Data Visualization
- Big Data Analytics - Data Exploration
- Big Data Analytics - Summarizing
- Big Data Analytics - Cleansing data
- Big Data Analytics - Data Collection
- Data Analytics - Problem Definition
Big Data Analytics Methods
- Data Analytics - Statistical Methods
- Big Data Analytics - Data Tools
- Big Data Analytics - Charts & Graphs
- Data Analytics - Introduction to SQL
- Big Data Analytics - Introduction to R
Advanced Methods
- Big Data Analytics - Online Learning
- Big Data Analytics - Text Analytics
- Big Data Analytics - Time Series
- Logistic Regression
- Big Data Analytics - Decision Trees
- Association Rules
- K-Means Clustering
- Naive Bayes Classifier
- Machine Learning for Data Analysis
Big Data Analytics Useful Resources
Selected Reading
- Who is Who
- Computer Glossary
- HR Interview Questions
- Effective Resume Writing
- Questions and Answers
- UPSC IAS Exams Notes
Big Data Analytics - Text Analytics
In this chapter, we will be using the data scraped in the part 1 of the book. The data has text that describes profiles of freelancers, and the hourly rate they are charging in USD. The idea of the following section is to fit a model that given the skills of a freelancer, we are able to predict its hourly salary.
The following code shows how to convert the raw text that in this case has skills of a user in a bag of words matrix. For this we use an R pbrary called tm. This means that for each word in the corpus we create variable with the amount of occurrences of each variable.
pbrary(tm) pbrary(data.table) source( text_analytics/text_analytics_functions.R ) data = fread( text_analytics/data/profiles.txt ) rate = as.numeric(data$rate) keep = !is.na(rate) rate = rate[keep] ### Make bag of words of title and body X_all = bag_words(data$user_skills[keep]) X_all = removeSparseTerms(X_all, 0.999) X_all # <<DocumentTermMatrix (documents: 389, terms: 1422)>> # Non-/sparse entries: 4057/549101 # Sparsity : 99% # Maximal term length: 80 # Weighting : term frequency - inverse document frequency (normapzed) (tf-idf) ### Make a sparse matrix with all the data X_all <- as_sparseMatrix(X_all)
Now that we have the text represented as a sparse matrix we can fit a model that will give a sparse solution. A good alternative for this case is using the LASSO (least absolute shrinkage and selection operator). This is a regression model that is able to select the most relevant features to predict the target.
train_inx = 1:200 X_train = X_all[train_inx, ] y_train = rate[train_inx] X_test = X_all[-train_inx, ] y_test = rate[-train_inx] # Train a regression model pbrary(glmnet) fit <- cv.glmnet(x = X_train, y = y_train, family = gaussian , alpha = 1, nfolds = 3, type.measure = mae ) plot(fit) # Make predictions predictions = predict(fit, newx = X_test) predictions = as.vector(predictions[,1]) head(predictions) # 36.23598 36.43046 51.69786 26.06811 35.13185 37.66367 # We can compute the mean absolute error for the test data mean(abs(y_test - predictions)) # 15.02175
Now we have a model that given a set of skills is able to predict the hourly salary of a freelancer. If more data is collected, the performance of the model will improve, but the code to implement this pipepne would be the same.
Advertisements