Scikit Learn - Modelling Process
This chapter deals with the modelling process involved in Sklearn. Let us understand it in detail, beginning with dataset loading.
Dataset Loading
A collection of data is called a dataset. A dataset has the following two components −
Features − The variables of data are called its features. They are also known as predictors, inputs or attributes.
Feature matrix − It is the collection of features, in case there are more than one.
Feature Names − It is the list of all the names of the features.
Response − It is the output variable that basically depends upon the feature variables. They are also known as target, label or output.
Response Vector − It is used to represent the response column. Generally, we have just one response column.
Target Names − These represent the possible values taken by a response vector.
Scikit-learn has a few example datasets like iris and digits for classification and the Boston house prices for regression.
Example
Following is an example to load iris dataset −
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names

print("Feature names:", feature_names)
print("Target names:", target_names)
print("\nFirst 10 rows of X:\n", X[:10])
Output
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target names: ['setosa' 'versicolor' 'virginica']

First 10 rows of X:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]]
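To connect this output back to the terms defined above, the following short sketch (an illustration, not part of the original example) prints the shapes of the feature matrix and the response vector for the iris dataset −
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data     # feature matrix: one row per sample, one column per feature
y = iris.target   # response vector: one label per sample

print("Feature matrix shape:", X.shape)    # (150, 4)
print("Response vector shape:", y.shape)   # (150,)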
Splitting the Dataset
To check the accuracy of our model, we can split the dataset into two pieces − a training set and a testing set. Use the training set to train the model and the testing set to test the model. After that, we can evaluate how well our model did.
Example
The following example will split the data into a 70:30 ratio, i.e. 70% of the data will be used as training data and 30% will be used as testing data. The dataset is the iris dataset, as in the above example.
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
   X, y, test_size = 0.3, random_state = 1
)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
Output
(105, 4)
(45, 4)
(105,)
(45,)
As seen in the example above, it uses the train_test_split() function of scikit-learn to split the dataset. This function has the following arguments −
X, y − Here, X is the feature matrix and y is the response vector, which need to be split.
test_size − This represents the ratio of test data to the total given data. As in the above example, we are setting test_size = 0.3 for 150 rows of X. It will produce test data of 150*0.3 = 45 rows.
random_state − It is used to guarantee that the split will always be the same. This is useful in situations where you want reproducible results (see the sketch below).
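As a brief illustration of random_state (an added sketch, not part of the original example), splitting twice with the same seed produces identical partitions −
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y = True)

# Two splits with the same random_state yield exactly the same partition.
X_train_a, X_test_a, _, _ = train_test_split(X, y, test_size = 0.3, random_state = 1)
X_train_b, X_test_b, _, _ = train_test_split(X, y, test_size = 0.3, random_state = 1)

print(np.array_equal(X_train_a, X_train_b))   # True
print(np.array_equal(X_test_a, X_test_b))     # True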
Train the Model
Next, we can use our dataset to train a prediction model. As discussed, scikit-learn has a wide range of Machine Learning (ML) algorithms which have a consistent interface for fitting, predicting and evaluating measures such as accuracy and recall.
Example
In the example below, we are going to use the KNN (K nearest neighbors) classifier. Do not worry about the details of the KNN algorithm, as there is a separate chapter for that. This example is only meant to help you understand the implementation.
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
   X, y, test_size = 0.4, random_state = 1
)

from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
classifier_knn = KNeighborsClassifier(n_neighbors = 3)
classifier_knn.fit(X_train, y_train)
y_pred = classifier_knn.predict(X_test)

# Finding accuracy by comparing actual response values (y_test) with predicted response values (y_pred)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

# Providing sample data and the model will make a prediction out of that data
sample = [[5, 5, 3, 2], [2, 4, 3, 5]]
preds = classifier_knn.predict(sample)
pred_species = [iris.target_names[p] for p in preds]
print("Predictions:", pred_species)
Output
Accuracy: 0.9833333333333333
Predictions: ['versicolor', 'virginica']
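Accuracy is only one of the measures mentioned above; precision, recall and a confusion matrix can be obtained from the same metrics module. A minimal sketch, repeating the training so that it is self-contained, might look like this −
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
   iris.data, iris.target, test_size = 0.4, random_state = 1
)

classifier_knn = KNeighborsClassifier(n_neighbors = 3).fit(X_train, y_train)
y_pred = classifier_knn.predict(X_test)

# Per-class precision, recall and F1-score, followed by the confusion matrix.
print(metrics.classification_report(y_test, y_pred, target_names = iris.target_names))
print(metrics.confusion_matrix(y_test, y_pred))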
Model Persistence
Once you train the model, it is desirable to persist the model for future use so that we do not need to retrain it again and again. It can be done with the help of the dump and load features of the joblib package.
Consider the example below in which we will be saving the above trained model (classifier_knn) for future use −
# joblib is installed as a dependency of scikit-learn
import joblib
joblib.dump(classifier_knn, 'iris_classifier_knn.joblib')
The above code will save the model into a file named iris_classifier_knn.joblib. Now, the object can be reloaded from the file with the help of the following code −
joblib.load('iris_classifier_knn.joblib')
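For completeness, a small sketch (assuming the file saved above exists in the current working directory) shows that the reloaded object behaves just like the original classifier −
import joblib

# Reload the persisted classifier and use it exactly like the original object.
loaded_knn = joblib.load('iris_classifier_knn.joblib')
sample = [[5, 5, 3, 2]]
print(loaded_knn.predict(sample))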
Preprocessing the Data
As we are dealing with lots of data and that data is in raw form, we need to convert it into meaningful data before inputting it to machine learning algorithms. This process is called preprocessing the data. Scikit-learn has a package named preprocessing for this purpose. The preprocessing package has the following techniques −
Binarisation
This preprocessing technique is used when we need to convert our numerical values into Boolean values.
Example
import numpy as np
from sklearn import preprocessing

input_data = np.array(
   [[2.1, -1.9, 5.5],
    [-1.5, 2.4, 3.5],
    [0.5, -7.9, 5.6],
    [5.9, 2.3, -5.8]]
)
data_binarized = preprocessing.Binarizer(threshold = 0.5).transform(input_data)
print("\nBinarized data:\n", data_binarized)
In the above example, we used a threshold value of 0.5, which is why all the values above 0.5 are converted to 1, and all the values of 0.5 or below are converted to 0.
Output
Binarized data:
[[1. 0. 1.]
 [0. 1. 1.]
 [0. 0. 1.]
 [1. 1. 0.]]
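For intuition, the same result can be reproduced with plain NumPy; this is only a rough equivalent of the thresholding rule, not the library's internal implementation −
import numpy as np

input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])

# Values strictly greater than the threshold become 1, everything else becomes 0.
manual_binarized = (input_data > 0.5).astype(float)
print(manual_binarized)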
Mean Removal
This technique is used to eliminate the mean from the feature vector so that every feature is centered on zero.
Example
import numpy as np
from sklearn import preprocessing

input_data = np.array(
   [[2.1, -1.9, 5.5],
    [-1.5, 2.4, 3.5],
    [0.5, -7.9, 5.6],
    [5.9, 2.3, -5.8]]
)

# Displaying the mean and the standard deviation of the input data
print("Mean =", input_data.mean(axis = 0))
print("Stddeviation =", input_data.std(axis = 0))

# Removing the mean and scaling to unit standard deviation
data_scaled = preprocessing.scale(input_data)
print("Mean_removed =", data_scaled.mean(axis = 0))
print("Stddeviation_removed =", data_scaled.std(axis = 0))
Output
Mean = [ 1.75  -1.275  2.2  ]
Stddeviation = [2.71431391 4.20022321 4.69414529]
Mean_removed = [1.11022302e-16 0.00000000e+00 0.00000000e+00]
Stddeviation_removed = [1. 1. 1.]
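The same centering and unit-variance scaling is often done with the StandardScaler estimator, which can be fitted on training data and then reused on new data; a minimal sketch with the same input −
import numpy as np
from sklearn.preprocessing import StandardScaler

input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])

# fit() learns the per-column mean and standard deviation; transform() applies them.
scaler = StandardScaler()
data_scaled = scaler.fit_transform(input_data)
print("Mean_removed =", data_scaled.mean(axis = 0))
print("Stddeviation_removed =", data_scaled.std(axis = 0))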
Scaling
We use this preprocessing technique for scaling the feature vectors. Scaling of feature vectors is important, because the features should not be synthetically large or small.
Example
import numpy as np
from sklearn import preprocessing

input_data = np.array(
   [[2.1, -1.9, 5.5],
    [-1.5, 2.4, 3.5],
    [0.5, -7.9, 5.6],
    [5.9, 2.3, -5.8]]
)
data_scaler_minmax = preprocessing.MinMaxScaler(feature_range = (0, 1))
data_scaled_minmax = data_scaler_minmax.fit_transform(input_data)
print("\nMin max scaled data:\n", data_scaled_minmax)
Output
Min max scaled data:
[[0.48648649 0.58252427 0.99122807]
 [0.         1.         0.81578947]
 [0.27027027 0.         1.        ]
 [1.         0.99029126 0.        ]]
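MinMaxScaler rescales each column to the chosen range using (x − column minimum) / (column maximum − column minimum). As a rough check, and not the library's internal code, the same numbers can be computed directly −
import numpy as np

input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])

# Per-column min-max scaling to the range [0, 1].
col_min = input_data.min(axis = 0)
col_max = input_data.max(axis = 0)
manual_minmax = (input_data - col_min) / (col_max - col_min)
print(manual_minmax)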
Normalisation
We use this preprocessing technique for modifying the feature vectors. Normalisation of feature vectors is necessary so that the feature vectors can be measured at a common scale. There are two types of normalisation, as follows −
L1 Normalisation
It is also called Least Absolute Deviations. It modifies the values in such a manner that the sum of the absolute values always remains 1 in each row. The following example shows the implementation of L1 normalisation on input data.
Example
import numpy as np
from sklearn import preprocessing

input_data = np.array(
   [[2.1, -1.9, 5.5],
    [-1.5, 2.4, 3.5],
    [0.5, -7.9, 5.6],
    [5.9, 2.3, -5.8]]
)
data_normalized_l1 = preprocessing.normalize(input_data, norm = 'l1')
print("\nL1 normalized data:\n", data_normalized_l1)
Output
L1 normalized data:
[[ 0.22105263 -0.2         0.57894737]
 [-0.2027027   0.32432432  0.47297297]
 [ 0.03571429 -0.56428571  0.4       ]
 [ 0.42142857  0.16428571 -0.41428571]]
L2 Normalisation
It is also called Least Squares. It modifies the values in such a manner that the sum of the squares always remains 1 in each row. The following example shows the implementation of L2 normalisation on input data.
Example
import numpy as np
from sklearn import preprocessing

input_data = np.array(
   [[2.1, -1.9, 5.5],
    [-1.5, 2.4, 3.5],
    [0.5, -7.9, 5.6],
    [5.9, 2.3, -5.8]]
)
data_normalized_l2 = preprocessing.normalize(input_data, norm = 'l2')
print("\nL2 normalized data:\n", data_normalized_l2)
Output
L2 normalized data:
[[ 0.33946114 -0.30713151  0.88906489]
 [-0.33325106  0.53320169  0.7775858 ]
 [ 0.05156558 -0.81473612  0.57753446]
 [ 0.68706914  0.26784051 -0.6754239 ]]
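To confirm the two properties described above, a short check (reusing the same input data) verifies that the L1-normalised rows have absolute values summing to 1 and the L2-normalised rows have unit Euclidean norm −
import numpy as np
from sklearn import preprocessing

input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])

data_normalized_l1 = preprocessing.normalize(input_data, norm = 'l1')
data_normalized_l2 = preprocessing.normalize(input_data, norm = 'l2')

# Each L1-normalised row has absolute values summing to 1.
print(np.abs(data_normalized_l1).sum(axis = 1))       # [1. 1. 1. 1.]
# Each L2-normalised row has a Euclidean (L2) norm of 1.
print(np.linalg.norm(data_normalized_l2, axis = 1))   # [1. 1. 1. 1.]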