Big Data Analytics - Decision Trees
A decision tree is an algorithm used for supervised learning problems such as classification or regression. In a decision tree (here, a classification tree), each internal (non-leaf) node is labeled with an input feature. The arcs leaving a node are labeled with the possible values of that feature. Each leaf of the tree is labeled with a class or with a probability distribution over the classes.
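The structure described above can be sketched in base R as a nested list. This is only an illustration: the tree, the feature names (`contains_link`, `known_sender`) and the helper `predict_toy` are hypothetical and are not part of any package.

```r
# A hand-built toy classification tree (hypothetical, for illustration only).
# Internal nodes name a feature, branches are keyed by feature value,
# and leaves hold a class label.
toy_tree <- list(
  feature = "contains_link",
  branches = list(
    yes = list(
      feature = "known_sender",
      branches = list(
        yes = list(label = "not spam"),
        no  = list(label = "spam")
      )
    ),
    no = list(label = "not spam")
  )
)

# Walk from the root to a leaf by following the arc that matches
# the observation's value for the feature at each internal node.
predict_toy <- function(node, obs) {
  if (!is.null(node$label)) {
    return(node$label)  # reached a leaf
  }
  value <- obs[[node$feature]]
  predict_toy(node$branches[[value]], obs)
}

email <- list(contains_link = "yes", known_sender = "no")
predict_toy(toy_tree, email)  # "spam"
```

Prediction is just a walk from the root to a leaf, which is why a fitted tree is easy to read and explain.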
A tree can be "learned" by splitting the source set into subsets based on an attribute value test. This process is repeated on each derived subset, a procedure called recursive partitioning. The recursion stops when all observations in the subset at a node have the same value of the target variable, or when splitting no longer improves the predictions. This top-down induction of decision trees is an example of a greedy algorithm, and it is the most common strategy for learning decision trees.
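A single greedy step of this procedure can be sketched in base R as follows. The text does not name a splitting criterion, so Gini impurity is an assumption here, and the helpers `gini` and `best_split` and the toy data are made up for illustration.

```r
# Gini impurity of a vector of class labels: 1 - sum(p_k^2).
# (Assumed criterion; the text does not prescribe one.)
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Evaluate every candidate threshold on one numeric feature and return
# the one with the lowest weighted impurity of the two child subsets.
best_split <- function(x, y) {
  values <- sort(unique(x))
  # Candidate thresholds: midpoints between consecutive distinct values.
  thresholds <- (head(values, -1) + tail(values, -1)) / 2
  scores <- sapply(thresholds, function(t) {
    left  <- y[x <= t]
    right <- y[x > t]
    (length(left) * gini(left) + length(right) * gini(right)) / length(y)
  })
  list(threshold = thresholds[which.min(scores)], impurity = min(scores))
}

# Toy data (made up): the two classes separate cleanly between 3 and 4.
x <- c(1, 2, 3, 4, 5, 6)
y <- c("a", "a", "a", "b", "b", "b")
best_split(x, y)  # threshold 3.5, impurity 0
```

A full learner would repeat this search over all features and then recurse into each child subset. It would stop when a subset is pure or when no split lowers the impurity.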
Decision trees used in data mining are of two main types −
Classification tree − when the response is a nominal variable, for example, whether an email is spam or not.
Regression tree − when the predicted outcome can be considered a real number (e.g. the salary of a worker).
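The two types can be contrasted with a short sketch. It uses the `rpart` package rather than `party`, purely because `rpart` ships with most R installations. The choice of the built-in `iris` and `mtcars` datasets is also an assumption made for illustration.

```r
library(rpart)  # assumed to be installed; it ships with most R distributions

# Classification tree: the response (Species) is a factor.
class_tree <- rpart(Species ~ ., data = iris, method = "class")
class_pred <- predict(class_tree, newdata = iris, type = "class")

# Regression tree: the response (mpg) is numeric.
reg_tree <- rpart(mpg ~ ., data = mtcars, method = "anova")
reg_pred <- predict(reg_tree, newdata = mtcars)

head(class_pred)  # predicted class labels (a factor)
head(reg_pred)    # predicted numeric values (the mean of each leaf)
```

The fitting call is almost identical in both cases. What changes is the type of the response and, with it, what a leaf stores: a class for classification, a mean value for regression.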
Decision trees are a simple method and, as such, have some problems. One of these issues is the high variance of the resulting models: small changes in the training data can produce a very different tree. To alleviate this problem, ensemble methods of decision trees were developed. Two groups of ensemble methods are currently used extensively −
Bagging decision trees − This method builds multiple decision trees by repeatedly resampling the training data with replacement, and then lets the trees vote for a consensus prediction. Random forest is a well-known variant of this approach that also considers only a random subset of features at each split.
Boosting decision trees − Gradient boosting combines weak learners, in this case decision trees, into a single strong learner in an iterative fashion. It fits a weak tree to the data and then keeps fitting further weak learners, each one correcting the errors of the model built so far.
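Both ideas can be sketched in a few lines of R. This is a minimal illustration under stated assumptions, not a production implementation. It uses `rpart` for the individual trees. The number of trees, the learning rate and the seed are arbitrary. The boosting loop handles only the squared-error case, where the quantity each new tree fits is simply the current residual. The bagging part does not sample features at each split, so it is plain bagging and not a random forest.

```r
library(rpart)  # assumed to be installed
set.seed(42)    # arbitrary seed so the resampling is repeatable

# --- Bagging: bootstrap resamples plus majority vote ---------------------
n_trees <- 25
votes <- sapply(seq_len(n_trees), function(i) {
  idx  <- sample(nrow(iris), replace = TRUE)  # resample with replacement
  tree <- rpart(Species ~ ., data = iris[idx, ], method = "class")
  as.character(predict(tree, newdata = iris, type = "class"))
})
# votes is a 150 x 25 character matrix; take the most common label per row.
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(bagged_pred == iris$Species)  # training accuracy of the ensemble

# --- Boosting: each shallow tree fits the previous residuals -------------
learning_rate <- 0.1
boost_pred  <- rep(mean(mtcars$mpg), nrow(mtcars))  # start from the mean
initial_mse <- mean((mtcars$mpg - boost_pred)^2)
train <- mtcars
for (i in seq_len(50)) {
  train$residual <- mtcars$mpg - boost_pred  # errors of the current model
  stump <- rpart(residual ~ . - mpg, data = train,
                 control = rpart.control(maxdepth = 1))
  boost_pred <- boost_pred + learning_rate * predict(stump, newdata = train)
}
final_mse <- mean((mtcars$mpg - boost_pred)^2)
c(initial = initial_mse, final = final_mse)
```

Bagging trains its trees independently and averages away their variance, whereas boosting trains them in sequence so that each tree corrects the model built so far. Both figures here are computed on the training data, so they say nothing about performance on new observations.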
    # Install the party package
    # install.packages("party")
    library(party)
    library(ggplot2)  # provides the diamonds dataset

    head(diamonds)

    # We will predict the cut of diamonds using the features available in the
    # diamonds dataset.
    ct = ctree(cut ~ ., data = diamonds)
    # plot(ct, main = "Conditional Inference Tree")

    # Example output
    # Response:  cut
    # Inputs:  carat, color, clarity, depth, table, price, x, y, z
    # Number of observations:  53940
    #
    # 1) table <= 57; criterion = 1, statistic = 10131.878
    #   2) depth <= 63; criterion = 1, statistic = 8377.279
    #     3) table <= 56.4; criterion = 1, statistic = 226.423
    #       4) z <= 2.64; criterion = 1, statistic = 70.393
    #         5) clarity <= VS1; criterion = 0.989, statistic = 10.48
    #           6) color <= E; criterion = 0.997, statistic = 12.829
    #             7)* weights = 82
    #           6) color > E

    # Table of prediction errors (rows: predicted, columns: actual)
    table(predict(ct), diamonds$cut)
    #            Fair  Good Very Good Premium Ideal
    # Fair       1388   171        17       0    14
    # Good        102  2912       499      26    27
    # Very Good    54   998      3334     249   355
    # Premium      44   711      5054   11915  1167
    # Ideal        22   114      3178    1601 19988

    # Estimated class probabilities
    probs = predict(ct, newdata = diamonds, type = "prob")
    probs = do.call(rbind, probs)
    head(probs)
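The table of prediction errors can be reduced to a few summary numbers. The sketch below re-enters the counts shown in the example output, so it runs without fitting the tree. The variable names are illustrative.

```r
# Confusion matrix copied from the example output above:
# rows are predicted classes, columns are actual classes.
classes <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
confusion <- matrix(
  c(1388,  171,   17,     0,    14,
     102, 2912,  499,    26,    27,
      54,  998, 3334,   249,   355,
      44,  711, 5054, 11915,  1167,
      22,  114, 3178,  1601, 19988),
  nrow = 5, byrow = TRUE,
  dimnames = list(predicted = classes, actual = classes)
)

# Overall accuracy: correctly classified (diagonal) over all observations.
accuracy <- sum(diag(confusion)) / sum(confusion)

# Per-class recall: correct predictions over the actual count of each class.
recall <- diag(confusion) / colSums(confusion)

round(accuracy, 3)  # 0.733
round(recall, 3)
```

Because `predict(ct)` was evaluated on the same data used to fit the tree, this is a training accuracy. An estimate on held-out data would likely be lower.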