Scikit Learn - Decision Trees
In this chapter, we will learn about the learning method in Sklearn which is termed as decision trees.
Decision trees (DTs) are a powerful non-parametric supervised learning method. They can be used for both classification and regression tasks. The main goal of DTs is to create a model that predicts the value of a target variable by learning simple decision rules deduced from the data features. A decision tree has two main entities: the root node, where the data splits, and the decision nodes or leaves, where we get the final output.
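For instance, the decision rules learned by a fitted tree can be printed. Below is a minimal sketch (not part of this chapter's example; the toy data and feature names are made up) using scikit-learn's export_text helper −
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: [height, hair length] with made-up labels
X = [[165, 19], [175, 32], [136, 35], [174, 65]]
y = ['Man', 'Woman', 'Woman', 'Man']

clf = DecisionTreeClassifier().fit(X, y)

# Prints the root node and the decision/leaf nodes as IF-THEN style rules
print(export_text(clf, feature_names=['height', 'hair_length']))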
Decision Tree Algorithms
Different Decision Tree algorithms are explained below −
ID3
It was developed by Ross Quinlan in 1986 and is also called Iterative Dichotomiser 3. The main goal of this algorithm is to find, for every node, the categorical feature that will yield the largest information gain for categorical targets.
It lets the tree grow to its maximum size and then, to improve the tree's ability on unseen data, applies a pruning step. The output of this algorithm is a multiway tree.
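ID3 itself is not implemented in Scikit-learn, but the information gain it maximizes is easy to illustrate. The following is a minimal plain-Python sketch (the feature and target values are made up) −
from collections import Counter
from math import log2

def entropy(labels):
   # Shannon entropy of a label distribution
   n = len(labels)
   return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
   # Parent entropy minus the weighted entropy of the children
   # produced by splitting on each value of a categorical feature
   n = len(labels)
   children = {}
   for v, label in zip(feature_values, labels):
      children.setdefault(v, []).append(label)
   remainder = sum(len(ys) / n * entropy(ys) for ys in children.values())
   return entropy(labels) - remainder

outlook = ['sunny', 'sunny', 'overcast', 'rain', 'rain']
play = ['no', 'no', 'yes', 'yes', 'no']
print(information_gain(outlook, play))   # ~0.571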
C4.5
It is the successor to ID3 and dynamically defines a discrete attribute that partitions the continuous attribute values into a discrete set of intervals. That is why it removes ID3's restriction to categorical features. It converts the ID3-trained tree into sets of ‘IF-THEN’ rules.
In order to determine the sequence in which these rules should be applied, the accuracy of each rule is evaluated first.
C5.0
It works similarly to C4.5 but uses less memory and builds smaller rulesets. It is also more accurate than C4.5.
CART
It is called the Classification and Regression Trees algorithm. It generates binary splits by choosing, at each node, the feature and threshold that yield the largest information gain, measured for classification by the Gini index.
Homogeneity depends upon the Gini index: the lower the value of the Gini index, the higher the homogeneity (a Gini index of 0 means a perfectly pure node). CART is like the C4.5 algorithm, but it differs in that it supports numerical target variables (regression) and does not compute rule sets.
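To make the split criterion concrete, here is a minimal plain-Python sketch (toy data assumed) that computes the Gini impurity of a node and the weighted impurity of the two children produced by a candidate binary split −
def gini(labels):
   # Gini impurity: 0.0 for a perfectly homogeneous (pure) node
   n = len(labels)
   return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_impurity(xs, ys, threshold):
   # Weighted impurity of the two children of a binary split at 'threshold'
   left = [y for x, y in zip(xs, ys) if x <= threshold]
   right = [y for x, y in zip(xs, ys) if x > threshold]
   n = len(ys)
   return len(left) / n * gini(left) + len(right) / n * gini(right)

heights = [165, 175, 136, 174, 141]
labels = ['Man', 'Man', 'Woman', 'Man', 'Woman']
print(gini(labels))                          # 0.48 before the split
print(split_impurity(heights, labels, 150))  # 0.0 - both children are pure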
Classification with decision trees
In this case, the decision variables are categorical.
Sklearn Module − The Scikit-learn library provides the module named DecisionTreeClassifier for performing multiclass classification on a dataset.
Parameters
The following table lists the parameters used by the sklearn.tree.DecisionTreeClassifier module −
Sr.No | Parameter & Description |
---|---|
1 | criterion − string, optional, default="gini" It represents the function to measure the quality of a split. Supported criteria are "gini" and "entropy". The default "gini" is for Gini impurity, while "entropy" is for information gain. |
2 | splitter − string, optional, default="best" It tells the model which strategy, "best" or "random", to use to choose the split at each node. |
3 | max_depth − int or None, optional, default=None This parameter decides the maximum depth of the tree. The default value None means the nodes will expand until all leaves are pure or until all leaves contain fewer than min_samples_split samples. |
4 | min_samples_split − int or float, optional, default=2 This parameter provides the minimum number of samples required to split an internal node. |
5 | min_samples_leaf − int or float, optional, default=1 This parameter provides the minimum number of samples required to be at a leaf node. |
6 | min_weight_fraction_leaf − float, optional, default=0. With this parameter, the model requires the minimum weighted fraction of the sum of sample weights to be at a leaf node. |
7 | max_features − int, float, string or None, optional, default=None It gives the model the number of features to consider when looking for the best split. |
8 | random_state − int, RandomState instance or None, optional, default=None This parameter represents the seed of the pseudo-random number generator used while shuffling the data. The options are − int − In this case, random_state is the seed used by the random number generator. RandomState instance − In this case, random_state is the random number generator. None − In this case, the random number generator is the RandomState instance used by np.random. |
9 | max_leaf_nodes − int or None, optional, default=None This parameter grows a tree with max_leaf_nodes in best-first fashion. The default None means there will be an unlimited number of leaf nodes. |
10 | min_impurity_decrease − float, optional, default=0. This value works as a criterion for a node to split, because the model will split a node if the split induces a decrease of the impurity greater than or equal to the min_impurity_decrease value. |
11 | min_impurity_split − float, default=1e-7 It represents the threshold for early stopping in tree growth. |
12 | class_weight − dict, list of dicts, "balanced" or None, default=None It represents the weights associated with classes, in the form {class_label: weight}. With the default option, all classes are supposed to have weight one. On the other hand, if you choose class_weight: balanced, it will use the values of y to automatically adjust the weights. |
13 | presort − bool, optional, default=False It tells the model whether to presort the data to speed up the finding of best splits in fitting. The default is False, but if set to True, it may slow down the training process. |
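To see several of these parameters together, here is a short sketch (the parameter values are arbitrary illustrations, not recommendations) −
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
   criterion='entropy',    # information gain instead of the default Gini impurity
   splitter='best',        # evaluate all candidate splits at each node
   max_depth=3,            # stop growing below depth 3
   min_samples_split=4,    # an internal node needs at least 4 samples to split
   min_samples_leaf=2,     # every leaf must keep at least 2 samples
   random_state=0          # fixed seed for reproducible results
)
print(clf)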
Attributes
The following table lists the attributes of the sklearn.tree.DecisionTreeClassifier module −
Sr.No | Attribute & Description |
---|---|
1 | feature_importances_ − array of shape = [n_features] This attribute returns the feature importances. |
2 | classes_ − array of shape = [n_classes] or a list of such arrays It represents the class labels for a single-output problem, or a list of arrays of class labels for a multi-output problem. |
3 | max_features_ − int It represents the deduced value of the max_features parameter. |
4 | n_classes_ − int or list It represents the number of classes for a single-output problem, or a list of the number of classes for every output for a multi-output problem. |
5 | n_features_ − int It gives the number of features when the fit() method is performed. |
6 | n_outputs_ − int It gives the number of outputs when the fit() method is performed. |
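A short sketch (toy data assumed) of reading these attributes from a fitted classifier −
from sklearn.tree import DecisionTreeClassifier

X = [[165, 19], [175, 32], [136, 35], [174, 65], [141, 28]]
y = ['Man', 'Woman', 'Woman', 'Man', 'Woman']

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.classes_)              # ['Man' 'Woman']
print(clf.n_classes_)            # 2
print(clf.feature_importances_)  # importance of each of the two features
print(clf.max_features_)         # deduced value of the max_features parameter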
Methods
The following table lists the methods of the sklearn.tree.DecisionTreeClassifier module −
Sr.No | Method & Description |
---|---|
1 | apply(self, X[, check_input]) This method will return the index of the leaf. |
2 | decision_path(self, X[, check_input]) As the name suggests, this method will return the decision path in the tree. |
3 | fit(self, X, y[, sample_weight, …]) The fit() method will build a decision tree classifier from the given training set (X, y). |
4 | get_depth(self) As the name suggests, this method will return the depth of the decision tree. |
5 | get_n_leaves(self) As the name suggests, this method will return the number of leaves of the decision tree. |
6 | get_params(self[, deep]) We can use this method to get the parameters of the estimator. |
7 | predict(self, X[, check_input]) It will predict the class value for X. |
8 | predict_log_proba(self, X) It will predict the class log-probabilities of the input samples, X, provided by us. |
9 | predict_proba(self, X[, check_input]) It will predict the class probabilities of the input samples, X, provided by us. |
10 | score(self, X, y[, sample_weight]) As the name implies, the score() method will return the mean accuracy on the given test data and labels. |
11 | set_params(self, **params) We can set the parameters of the estimator with this method. |
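A short sketch (toy data assumed) exercising a few of these methods on a fitted classifier −
from sklearn.tree import DecisionTreeClassifier

X = [[165, 19], [175, 32], [136, 35], [174, 65], [141, 28]]
y = ['Man', 'Woman', 'Woman', 'Man', 'Woman']

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.get_depth())     # depth of the fitted tree
print(clf.get_n_leaves())  # number of leaves
print(clf.apply(X))        # index of the leaf each sample falls in
print(clf.score(X, y))     # mean accuracy on the given data, here 1.0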
Implementation Example
The Python script below will use the sklearn.tree.DecisionTreeClassifier module to construct a classifier for predicting male or female from our data set having 25 samples and two features, namely ‘height’ and ‘length of hair’ −
from sklearn import tree
from sklearn.model_selection import train_test_split

X = [[165,19],[175,32],[136,35],[174,65],[141,28],[176,15],[131,32],[166,6],[128,32],[179,10],[136,34],[186,2],[126,25],[176,28],[112,38],[169,9],[171,36],[116,25],[196,25],[196,38],[126,40],[197,20],[150,25],[140,32],[136,35]]
Y = ['Man','Woman','Woman','Man','Woman','Man','Woman','Man','Woman','Man','Woman','Man','Woman','Woman','Woman','Man','Woman','Woman','Man','Woman','Woman','Man','Man','Woman','Woman']
data_feature_names = ['height','length of hair']

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1)
DTclf = tree.DecisionTreeClassifier()
DTclf = DTclf.fit(X, Y)
prediction = DTclf.predict([[135,29]])
print(prediction)
Output
['Woman']
We can also predict the probability of each class by using the predict_proba() method as follows −
Example
prediction = DTclf.predict_proba([[135,29]])
print(prediction)
Output
[[0. 1.]]
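The columns of the returned probabilities follow the order of the classes_ attribute, so the output above reads as a probability of 0 for the first class and 1 for the second. Continuing the example −
print(DTclf.classes_)   # e.g. ['Man' 'Woman'], so [[0. 1.]] means P(Woman) = 1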
Regression with decision trees
In this case the decision variables are continuous.
Sklearn Module − The Scikit-learn library provides the module named DecisionTreeRegressor for applying decision trees on regression problems.
Parameters
The parameters used by DecisionTreeRegressor are almost the same as those used in the DecisionTreeClassifier module. The difference lies in the ‘criterion’ parameter. For the DecisionTreeRegressor module, the ‘criterion: string, optional default= “mse”’ parameter has the following values −
mse − It stands for the mean squared error. It is equal to variance reduction as a feature selection criterion. It minimises the L2 loss using the mean of each terminal node.
friedman_mse − It also uses mean squared error but with Friedman’s improvement score.
mae − It stands for the mean absolute error. It minimises the L1 loss using the median of each terminal node.
Another difference is that it does not have the ‘class_weight’ parameter.
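A short sketch of fitting a regressor with one of these criteria (toy data assumed; note that newer scikit-learn versions renamed ‘mse’ and ‘mae’ to ‘squared_error’ and ‘absolute_error’, while ‘friedman_mse’ is accepted in both old and new releases) −
from sklearn.tree import DecisionTreeRegressor

X = [[1], [2], [3], [4], [5]]
y = [1.2, 1.9, 3.1, 3.9, 5.2]

reg = DecisionTreeRegressor(criterion='friedman_mse', max_depth=2, random_state=0)
reg.fit(X, y)
print(reg.predict([[2.5]]))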
Attributes
The attributes of DecisionTreeRegressor are also the same as those of the DecisionTreeClassifier module. The difference is that it does not have the ‘classes_’ and ‘n_classes_’ attributes.
Methods
The methods of DecisionTreeRegressor are also the same as those of the DecisionTreeClassifier module. The difference is that it does not have the ‘predict_log_proba()’ and ‘predict_proba()’ methods.
Implementation Example
The fit() method in the decision tree regression model takes floating point values of y. Let’s see a simple implementation example using sklearn.tree.DecisionTreeRegressor −
from sklearn import tree

X = [[1, 1], [5, 5]]
y = [0.1, 1.5]
DTreg = tree.DecisionTreeRegressor()
DTreg = DTreg.fit(X, y)
Once fitted, we can use this regression model to make predictions as follows −
DTreg.predict([[4, 5]])
Output
array([1.5])
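Continuing the same example, the structure and training fit of the regressor can also be inspected −
print(DTreg.get_depth())     # 1 - a single split separates the two samples
print(DTreg.get_n_leaves())  # 2
print(DTreg.score(X, y))     # 1.0 - R^2 on the training data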