Big Data Analytics - Naive Bayes Classifier
Naive Bayes is a probabilistic technique for constructing classifiers. The characteristic assumption of the naive Bayes classifier is that the value of a particular feature is independent of the value of any other feature, given the class variable.
Despite this oversimplified assumption, naive Bayes classifiers often perform well in complex real-world situations. An advantage of naive Bayes is that it requires only a small amount of training data to estimate the parameters necessary for classification, and that the classifier can be trained incrementally.
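As a rough illustration of what incremental training can look like, the sketch below assumes a single categorical feature, for which the model's parameters are simply relative frequencies. New batches only add to running counts, so nothing has to be refitted from scratch. The helper `update_counts`, the feature values and the data are all hypothetical and chosen only for illustration.

```r
classes <- c("ham", "spam")
values  <- c("no", "yes")   # e.g. "does the message contain the word 'offer'?"

# Start with empty counts: one count per class and one value-by-class table.
state <- list(
  class_counts   = setNames(c(0, 0), classes),
  feature_counts = matrix(0, nrow = 2, ncol = 2, dimnames = list(values, classes))
)

# Hypothetical helper: fold one batch of (feature value, label) pairs into the counts.
update_counts <- function(state, x, y) {
  x <- factor(x, levels = values)
  y <- factor(y, levels = classes)
  state$class_counts   <- state$class_counts + as.vector(table(y))
  state$feature_counts <- state$feature_counts + unclass(table(x, y))
  state
}

# Two made-up batches arriving one after the other.
state <- update_counts(state, x = c("yes", "no", "yes"), y = c("spam", "ham", "spam"))
state <- update_counts(state, x = c("no", "yes"),        y = c("ham", "ham"))

# Parameter estimates are normalised counts and can be refreshed at any time.
p_class             <- state$class_counts / sum(state$class_counts)
p_value_given_class <- sweep(state$feature_counts, 2, state$class_counts, "/")

print(p_class)
print(p_value_given_class)
```

Fixing the factor levels up front keeps every batch's table the same shape, which is what makes the simple addition of counts possible.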
Naive Bayes is a conditional probability model: given a problem instance to be classified, represented by a vector x = (x1, …, xn) of n features (independent variables), it assigns to this instance a probability for each of K possible outcomes or classes C_k −
$$p(C_k \mid x_1, \ldots, x_n)$$
The problem with the above formulation is that if the number of features n is large, or if a feature can take on a large number of values, then basing such a model on probability tables is infeasible. We therefore reformulate the model to make it more tractable. Using Bayes' theorem, the conditional probability can be decomposed as −
$$p(C_k \mid x) = \frac{p(C_k)\, p(x \mid C_k)}{p(x)}$$
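To make the table-size problem concrete, the following back-of-the-envelope sketch counts how many probabilities are needed to describe p(x | C_k) for a single class. It assumes, purely for illustration, that every feature is binary; the values of n are arbitrary.

```r
# Illustrative assumption: every feature is binary (two possible values).
n <- c(5, 10, 20, 30)

# Full table for p(x | C_k): one probability per combination of feature
# values, minus one because the probabilities must sum to 1.
full_table <- 2^n - 1

# With features treated as independent given the class:
# one probability per feature.
independent <- n

print(data.frame(n, full_table, independent))
```

The full table grows exponentially with n, while the count under the independence assumption grows only linearly.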
Under the naive independence assumption, the likelihood factorizes as $p(x \mid C_k) = \prod_{i=1}^{n} p(x_i \mid C_k)$, so the conditional distribution over the class variable C is −
$$p(C_k \mid x_1, \ldots, x_n) = \frac{1}{Z}\, p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k)$$
where the evidence Z = p(x) is a scaling factor that depends only on x1, …, xn, and is therefore a constant if the values of the feature variables are known. One common rule is to pick the hypothesis that is most probable; this is known as the maximum a posteriori or MAP decision rule. The corresponding classifier, a Bayes classifier, is the function that assigns a class label $\hat{y} = C_k$ for some k as follows −
$$\hat{y} = \underset{k \in \{1, \ldots, K\}}{\operatorname{argmax}}\; p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k)$$
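The MAP rule can be traced by hand on a very small example. The sketch below uses base R only and a made-up data set with two categorical features. The column names, the helper `score` and the add-one (Laplace) smoothing are assumptions introduced for illustration and are not part of the formulas above.

```r
# Toy, made-up training data: two word-presence features and a class label.
train_toy <- data.frame(
  offer   = c("yes", "yes", "no",  "no",  "yes", "no"),
  meeting = c("no",  "no",  "yes", "yes", "no",  "no"),
  class   = c("spam", "spam", "ham", "ham", "spam", "ham"),
  stringsAsFactors = TRUE
)
features <- c("offer", "meeting")

# Prior p(C_k): relative class frequencies.
prior <- prop.table(table(train_toy$class))

# Likelihoods p(x_i | C_k): one table per feature, each column sums to 1.
# Adding 1 to every count is an assumed smoothing step that avoids zeros.
likelihood <- lapply(features, function(f) {
  counts <- table(train_toy[[f]], train_toy$class) + 1
  prop.table(counts, margin = 2)
})
names(likelihood) <- features

# Hypothetical helper: p(C_k) * prod_i p(x_i | C_k) for one instance,
# returned as one unnormalised score per class.
score <- function(x) {
  s <- prior
  for (f in features) s <- s * likelihood[[f]][x[[f]], ]
  s
}

x_new     <- list(offer = "yes", meeting = "no")
scores    <- score(x_new)
posterior <- scores / sum(scores)             # dividing by Z = p(x)
y_hat     <- names(scores)[which.max(scores)] # MAP decision

print(posterior)
print(y_hat)
```

Because Z is the same for every class, the argmax can be taken over the unnormalised scores; normalising is only needed when the posterior probabilities themselves are of interest.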
Implementing the algorithm in R is a straightforward process. The following example demonstrates how to train a Naive Bayes classifier and use it for prediction in a spam filtering problem.
The following script is available in the bda/part3/naive_bayes/naive_bayes.R file.
    # Install these packages
    pkgs = c("klaR", "caret", "ElemStatLearn")
    install.packages(pkgs)

    library("ElemStatLearn")
    library("klaR")
    library("caret")

    # Split the data into training and testing sets
    inx = sample(nrow(spam), round(nrow(spam) * 0.9))
    train = spam[inx,]
    test = spam[-inx,]

    # Define a matrix with features, X_train
    # And a vector with class labels, y_train
    X_train = train[,-58]
    y_train = train$spam
    X_test = test[,-58]
    y_test = test$spam

    # Train the model using 3-fold cross-validation
    nb_model = train(X_train, y_train, method = "nb",
       trControl = trainControl(method = "cv", number = 3))

    # Compute the accuracy on the test set
    preds = predict(nb_model$finalModel, X_test)$class
    tbl = table(y_test, yhat = preds)
    sum(diag(tbl)) / sum(tbl)
    # 0.7217391
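As we can see from the result, the accuracy of the Naive Bayes model is about 72%. This means the model correctly classifies about 72% of the instances in the test set. Because the split into training and testing data is random, the exact figure varies from run to run.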
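The accuracy figure is read off the confusion table: correct predictions sit on the diagonal, so accuracy is the diagonal sum divided by the total. The sketch below repeats that calculation on a handful of made-up labels and predictions (not output from the model above), and also breaks the result down per class.

```r
# Made-up labels and predictions, only to illustrate the calculation.
lv     <- c("email", "spam")
y_true <- factor(c("spam", "spam", "email", "email", "email", "spam", "email", "email"),
                 levels = lv)
y_pred <- factor(c("spam", "email", "email", "email", "spam", "spam", "email", "email"),
                 levels = lv)

# Rows are the true classes, columns the predicted classes.
tbl <- table(y_true, yhat = y_pred)

# Overall accuracy: share of instances on the diagonal.
accuracy <- sum(diag(tbl)) / sum(tbl)

# Share of each true class that was predicted correctly.
per_class <- diag(tbl) / rowSums(tbl)

print(tbl)
print(accuracy)
print(per_class)
```

A per-class view can be useful when the classes are unbalanced, since a single accuracy number does not show which class the errors come from.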