- Big Data Analytics - Data Scientist
- Big Data Analytics - Data Analyst
- Key Stakeholders
- Core Deliverables
- Big Data Analytics - Methodology
- Big Data Analytics - Data Life Cycle
- Big Data Analytics - Overview
- Big Data Analytics - Home
Big Data Analytics Project
- Data Visualization
- Big Data Analytics - Data Exploration
- Big Data Analytics - Summarizing
- Big Data Analytics - Cleansing data
- Big Data Analytics - Data Collection
- Data Analytics - Problem Definition
Big Data Analytics Methods
- Data Analytics - Statistical Methods
- Big Data Analytics - Data Tools
- Big Data Analytics - Charts & Graphs
- Data Analytics - Introduction to SQL
- Big Data Analytics - Introduction to R
Advanced Methods
- Big Data Analytics - Online Learning
- Big Data Analytics - Text Analytics
- Big Data Analytics - Time Series
- Logistic Regression
- Big Data Analytics - Decision Trees
- Association Rules
- K-Means Clustering
- Naive Bayes Classifier
- Machine Learning for Data Analysis
Big Data Analytics Useful Resources
Selected Reading
- Who is Who
- Computer Glossary
- HR Interview Questions
- Effective Resume Writing
- Questions and Answers
- UPSC IAS Exams Notes
Big Data Analytics - Problem Definition
Through this tutorial, we will develop a project. Each subsequent chapter in this tutorial deals with a part of the larger project in the mini-project section. This is thought to be an appped tutorial section that will provide exposure to a real-world problem. In this case, we would start with the problem definition of the project.
Project Description
The objective of this project would be to develop a machine learning model to predict the hourly salary of people using their curriculum vitae (CV) text as input.
Using the framework defined above, it is simple to define the problem. We can define X = {x1, x2, …, xn} as the CV’s of users, where each feature can be, in the simplest way possible, the amount of times this word appears. Then the response is real valued, we are trying to predict the hourly salary of inspaniduals in dollars.
These two considerations are enough to conclude that the problem presented can be solved with a supervised regression algorithm.
Problem Definition
Problem Definition is probably one of the most complex and heavily neglected stages in the big data analytics pipepne. In order to define the problem a data product would solve, experience is mandatory. Most data scientist aspirants have pttle or no experience in this stage.
Most big data problems can be categorized in the following ways −
Supervised classification
Supervised regression
Unsupervised learning
Learning to rank
Let us now learn more about these four concepts.
Supervised Classification
Given a matrix of features X = {x1, x2, ..., xn} we develop a model M to predict different classes defined as y = {c1, c2, ..., cn}. For example: Given transactional data of customers in an insurance company, it is possible to develop a model that will predict if a cpent would churn or not. The latter is a binary classification problem, where there are two classes or target variables: churn and not churn.
Other problems involve predicting more than one class, we could be interested in doing digit recognition, therefore the response vector would be defined as: y = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, a-state-of-the-art model would be convolutional neural network and the matrix of features would be defined as the pixels of the image.
Supervised Regression
In this case, the problem definition is rather similar to the previous example; the difference repes on the response. In a regression problem, the response y ∈ ℜ, this means the response is real valued. For example, we can develop a model to predict the hourly salary of inspaniduals given the corpus of their CV.
Unsupervised Learning
Management is often thirsty for new insights. Segmentation models can provide this insight in order for the marketing department to develop products for different segments. A good approach for developing a segmentation model, rather than thinking of algorithms, is to select features that are relevant to the segmentation that is desired.
For example, in a telecommunications company, it is interesting to segment cpents by their cellphone usage. This would involve disregarding features that have nothing to do with the segmentation objective and including only those that do. In this case, this would be selecting features as the number of SMS used in a month, the number of inbound and outbound minutes, etc.
Learning to Rank
This problem can be considered as a regression problem, but it has particular characteristics and deserves a separate treatment. The problem involves given a collection of documents we seek to find the most relevant ordering given a query. In order to develop a supervised learning algorithm, it is needed to label how relevant an ordering is, given a query.
It is relevant to note that in order to develop a supervised learning algorithm, it is needed to label the training data. This means that in order to train a model that will, for example, recognize digits from an image, we need to label a significant amount of examples by hand. There are web services that can speed up this process and are commonly used for this task such as amazon mechanical turk. It is proven that learning algorithms improve their performance when provided with more data, so labepng a decent amount of examples is practically mandatory in supervised learning.
Advertisements