Scikit Learn - Modelling Process
  • Date: 2024-12-22




This chapter covers the modelling process in scikit-learn. Let us go through it in detail, beginning with dataset loading.

Dataset Loading

A collection of data is called a dataset. It has the following two components −

Features − The variables of data are called its features. They are also known as predictors, inputs or attributes.

    Feature matrix − It is the collection of features, when there is more than one.

    Feature Names − It is the list of all the names of the features.

Response − It is the output variable that depends upon the feature variables. It is also known as the target, label or output.

    Response Vector − It is used to represent the response column. Generally, we have just one response column.

    Target Names − They represent the possible values taken by a response vector.

Scikit-learn ships with a few example datasets, such as iris and digits for classification. The Boston house prices dataset for regression, used in older tutorials, was removed in scikit-learn 1.2.

Example

The following example loads the iris dataset −


from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names
print("Feature names:", feature_names)
print("Target names:", target_names)
print("\nFirst 10 rows of X:\n", X[:10])

Output


Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target names: ['setosa' 'versicolor' 'virginica']
First 10 rows of X:
[
   [5.1 3.5 1.4 0.2]
   [4.9 3. 1.4 0.2]
   [4.7 3.2 1.3 0.2]
   [4.6 3.1 1.5 0.2]
   [5. 3.6 1.4 0.2]
   [5.4 3.9 1.7 0.4]
   [4.6 3.4 1.4 0.3]
   [5. 3.4 1.5 0.2]
   [4.4 2.9 1.4 0.2]
   [4.9 3.1 1.5 0.1]
]
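The feature matrix and response vector described above can also be inspected by shape. The following is a minimal sketch; the names n_samples, n_features and class_counts are illustrative and not part of the original example.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# The feature matrix has one row per sample and one column per feature
n_samples, n_features = X.shape
print("Feature matrix shape:", X.shape)

# The response vector has one entry per sample
print("Response vector shape:", y.shape)

# Count how many samples belong to each target name
class_counts = {
    str(name): int(count)
    for name, count in zip(iris.target_names, np.bincount(y))
}
print("Samples per class:", class_counts)
```

Each column of X corresponds to one entry of feature_names, and each integer in y indexes into target_names.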

Splitting the dataset

To check the accuracy of our model, we can split the dataset into two pieces: a training set and a testing set. We use the training set to train the model and the testing set to test it. After that, we can evaluate how well the model did.

Example

The following example splits the data in a 70:30 ratio, i.e. 70% of the data is used for training and 30% for testing. The dataset is the iris dataset from the example above.


from sklearn.datasets import load_iris
iris = load_iris()

X = iris.data
y = iris.target

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
   X, y, test_size = 0.3, random_state = 1
)

print(X_train.shape)
print(X_test.shape)

print(y_train.shape)
print(y_test.shape)

Output


(105, 4)
(45, 4)
(105,)
(45,)

As seen in the example above, the train_test_split() function of scikit-learn splits the dataset. This function has the following arguments −

    X, y − Here, X is the feature matrix and y is the response vector, which are to be split.

    test_size − This is the fraction of the given data to use as test data. In the above example, we set test_size = 0.3 for 150 rows of X, which produces test data of 150*0.3 = 45 rows.

    random_state − It seeds the shuffling so that the split is always the same. This is useful when you want reproducible results.
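The effect of random_state can be checked directly. The sketch below splits the iris data twice with the same seed and compares the results. It also uses the stratify parameter of train_test_split, which the text above does not cover, to keep class proportions equal in both pieces; the names split_a, split_b and same_split are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Two calls with the same random_state produce identical splits
split_a = train_test_split(X, y, test_size=0.3, random_state=1)
split_b = train_test_split(X, y, test_size=0.3, random_state=1)
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("Same split both times:", same_split)

# stratify=y (not used in the text above) preserves the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
test_class_counts = np.bincount(y_test)
print("Test rows:", X_test.shape[0])
print("Test samples per class:", test_class_counts)
```

Stratifying is optional, but it avoids a test set in which one class happens to be under-represented.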

Train the Model

Next, we can use our dataset to train a prediction model. As discussed, scikit-learn has a wide range of Machine Learning (ML) algorithms with a consistent interface for fitting models, making predictions, and computing metrics such as accuracy and recall.

Example

In the example below, we use the KNN (K nearest neighbors) classifier. Do not worry about the details of the KNN algorithm here, as there is a separate chapter for that. This example only illustrates the implementation.


from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
   X, y, test_size = 0.4, random_state=1
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
classifier_knn = KNeighborsClassifier(n_neighbors = 3)
classifier_knn.fit(X_train, y_train)
y_pred = classifier_knn.predict(X_test)
# Find accuracy by comparing actual response values (y_test) with predicted response values (y_pred)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

# Provide sample data and let the model make predictions from it
sample = [[5, 5, 3, 2], [2, 4, 3, 5]]
preds = classifier_knn.predict(sample)
pred_species = [iris.target_names[p] for p in preds]
print("Predictions:", pred_species)

Output


Accuracy: 0.9833333333333333
Predictions: ['versicolor', 'virginica']
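The consistent interface mentioned above means another estimator can be swapped in without changing the surrounding code. The sketch below trains the KNN classifier and, as an illustrative choice not taken from the original, a DecisionTreeClassifier with the same fit and predict calls, and reports accuracy and macro-averaged recall for each. The names models and scores are illustrative.

```python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1
)

# Two different algorithms behind the same estimator interface
models = {
    "knn": KNeighborsClassifier(n_neighbors=3),
    "tree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)       # same call for every estimator
    y_pred = model.predict(X_test)    # same call for every estimator
    scores[name] = {
        "accuracy": metrics.accuracy_score(y_test, y_pred),
        "recall": metrics.recall_score(y_test, y_pred, average="macro"),
    }
    print(name, scores[name])
```

Only the entries of the models dictionary change from one algorithm to the next; the training and evaluation loop stays the same.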

Model Persistence

Once a model is trained, it is desirable to persist it for future use so that it does not need to be retrained each time. This can be done with the dump and load functions of the joblib package.

Consider the example below in which we will be saving the above trained model (classifier_knn) for future use −


# joblib is a standalone package; sklearn.externals.joblib was removed in scikit-learn 0.23
import joblib
joblib.dump(classifier_knn, 'iris_classifier_knn.joblib')

The above code saves the model to a file named iris_classifier_knn.joblib. The object can then be reloaded from the file with the following code −


joblib.load('iris_classifier_knn.joblib')
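A full save-and-reload round trip can be sketched as follows. This version writes to a temporary directory so that it leaves no file behind, and trains on the whole iris dataset for brevity; the names model_path, restored and same_predictions are illustrative.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
classifier_knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "iris_classifier_knn.joblib")
    joblib.dump(classifier_knn, model_path)   # save the fitted model
    restored = joblib.load(model_path)        # load it back into memory

# The restored model should predict exactly like the original one
same_predictions = bool((restored.predict(X) == classifier_knn.predict(X)).all())
print("Predictions match after reload:", same_predictions)
```

Files written by joblib should only be loaded from trusted sources, since loading can execute arbitrary code.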

Preprocessing the Data

As we are dealing with lots of data in raw form, we need to convert it into meaningful data before feeding it to machine learning algorithms. This process is called preprocessing the data. Scikit-learn has a package named preprocessing for this purpose. The preprocessing package offers the following techniques −

Binarisation

This preprocessing technique is used when we need to convert our numerical values into Boolean values.

Example


import numpy as np
from sklearn import preprocessing
input_data = np.array(
   [
      [2.1, -1.9, 5.5],
      [-1.5, 2.4, 3.5],
      [0.5, -7.9, 5.6],
      [5.9, 2.3, -5.8]
   ]
)
data_binarized = preprocessing.Binarizer(threshold=0.5).transform(input_data)
print("\nBinarized data:\n", data_binarized)

In the above example, the threshold is 0.5, so every value greater than 0.5 is converted to 1 and every value less than or equal to 0.5 is converted to 0.

Output


Binarized data:
[
   [ 1. 0. 1.]
   [ 0. 1. 1.]
   [ 0. 0. 1.]
   [ 1. 1. 0.]
]
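The thresholding rule can be reproduced with a plain NumPy comparison, which makes the treatment of values equal to the threshold explicit. This is a sketch for illustration; the names manual and from_sklearn are not part of the original example.

```python
import numpy as np
from sklearn import preprocessing

input_data = np.array(
    [
        [2.1, -1.9, 5.5],
        [-1.5, 2.4, 3.5],
        [0.5, -7.9, 5.6],
        [5.9, 2.3, -5.8],
    ]
)
threshold = 0.5

# Strictly greater than the threshold maps to 1, everything else to 0
manual = (input_data > threshold).astype(float)
from_sklearn = preprocessing.Binarizer(threshold=threshold).transform(input_data)

print("Manual result matches Binarizer:", np.array_equal(manual, from_sklearn))
print("Value equal to the threshold maps to:", manual[2, 0])
```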

Mean Removal

This technique is used to eliminate the mean from each feature so that every feature is centered on zero. The example below also scales each feature to unit standard deviation.

Example


import numpy as np
from sklearn import preprocessing
input_data = np.array(
   [
      [2.1, -1.9, 5.5],
      [-1.5, 2.4, 3.5],
      [0.5, -7.9, 5.6],
      [5.9, 2.3, -5.8]
   ]
)

# Display the mean and the standard deviation of the input data
print("Mean =", input_data.mean(axis=0))
print("Stddeviation = ", input_data.std(axis=0))

# Remove the mean and scale to unit standard deviation

data_scaled = preprocessing.scale(input_data)
print("Mean_removed =", data_scaled.mean(axis=0))
print("Stddeviation_removed =", data_scaled.std(axis=0))

Output


Mean = [ 1.75 -1.275 2.2 ]
Stddeviation = [ 2.71431391 4.20022321 4.69414529]
Mean_removed = [ 1.11022302e-16 0.00000000e+00 0.00000000e+00]
Stddeviation_removed = [ 1. 1. 1.]
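What preprocessing.scale computes can be written out per column: subtract the column mean, then divide by the column standard deviation. The sketch below checks that formula against scale and against StandardScaler, a class from the same package that the text above does not use. The variable names are illustrative.

```python
import numpy as np
from sklearn import preprocessing

input_data = np.array(
    [
        [2.1, -1.9, 5.5],
        [-1.5, 2.4, 3.5],
        [0.5, -7.9, 5.6],
        [5.9, 2.3, -5.8],
    ]
)

# Per-column statistics; np.std uses the population formula (ddof=0) by default
col_mean = input_data.mean(axis=0)
col_std = input_data.std(axis=0)
manual_scaled = (input_data - col_mean) / col_std

sklearn_scaled = preprocessing.scale(input_data)

# StandardScaler stores the statistics so they can be reused on new data
scaler = preprocessing.StandardScaler().fit(input_data)
scaler_scaled = scaler.transform(input_data)

print("Manual matches scale():", np.allclose(manual_scaled, sklearn_scaled))
print("StandardScaler matches scale():", np.allclose(scaler_scaled, sklearn_scaled))
```

A fitted StandardScaler is the usual choice when the training-set statistics must later be applied to a separate test set.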

Scaling

We use this preprocessing technique for scaling the feature vectors. Scaling matters because no feature should dominate merely because its values are artificially large or small.

Example


import numpy as np
from sklearn import preprocessing
input_data = np.array(
   [
      [2.1, -1.9, 5.5],
      [-1.5, 2.4, 3.5],
      [0.5, -7.9, 5.6],
      [5.9, 2.3, -5.8]
   ]
)
data_scaler_minmax = preprocessing.MinMaxScaler(feature_range=(0,1))
data_scaled_minmax = data_scaler_minmax.fit_transform(input_data)
print("\nMin max scaled data:\n", data_scaled_minmax)

Output


Min max scaled data:
[
   [ 0.48648649 0.58252427 0.99122807]
   [ 0. 1. 0.81578947]
   [ 0.27027027 0. 1. ]
   [ 1. 0.99029126 0. ]
]
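For the feature range (0, 1), min-max scaling maps each column so that its minimum becomes 0 and its maximum becomes 1. The sketch below writes that formula in NumPy and compares it with MinMaxScaler; the names col_min, col_max and manual_minmax are illustrative.

```python
import numpy as np
from sklearn import preprocessing

input_data = np.array(
    [
        [2.1, -1.9, 5.5],
        [-1.5, 2.4, 3.5],
        [0.5, -7.9, 5.6],
        [5.9, 2.3, -5.8],
    ]
)

# Per-column minimum and maximum
col_min = input_data.min(axis=0)
col_max = input_data.max(axis=0)

# (x - min) / (max - min) for every value, column by column
manual_minmax = (input_data - col_min) / (col_max - col_min)

sklearn_minmax = preprocessing.MinMaxScaler(feature_range=(0, 1)).fit_transform(input_data)
print("Manual matches MinMaxScaler:", np.allclose(manual_minmax, sklearn_minmax))
```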

Normalisation

We use this preprocessing technique for modifying the feature vectors. Normalisation is necessary so that the feature vectors can be measured on a common scale. There are two types of normalisation, as follows −

L1 Normalisation

It is also called Least Absolute Deviations. It rescales each row so that the sum of the absolute values in the row equals 1. The following example applies L1 normalisation to the input data.

Example


import numpy as np
from sklearn import preprocessing
input_data = np.array(
   [
      [2.1, -1.9, 5.5],
      [-1.5, 2.4, 3.5],
      [0.5, -7.9, 5.6],
      [5.9, 2.3, -5.8]
   ]
)
data_normalized_l1 = preprocessing.normalize(input_data, norm='l1')
print("\nL1 normalized data:\n", data_normalized_l1)

Output


L1 normalized data:
[
   [ 0.22105263 -0.2 0.57894737]
   [-0.2027027 0.32432432 0.47297297]
   [ 0.03571429 -0.56428571 0.4 ]
   [ 0.42142857 0.16428571 -0.41428571]
]
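L1 normalisation divides every row by the sum of the absolute values in that row. The sketch below does this by hand and compares the result with preprocessing.normalize; the names row_abs_sum and manual_l1 are illustrative.

```python
import numpy as np
from sklearn import preprocessing

input_data = np.array(
    [
        [2.1, -1.9, 5.5],
        [-1.5, 2.4, 3.5],
        [0.5, -7.9, 5.6],
        [5.9, 2.3, -5.8],
    ]
)

# Sum of absolute values per row, kept as a column for broadcasting
row_abs_sum = np.abs(input_data).sum(axis=1, keepdims=True)
manual_l1 = input_data / row_abs_sum

sklearn_l1 = preprocessing.normalize(input_data, norm="l1")
print("Manual matches normalize():", np.allclose(manual_l1, sklearn_l1))
print("Absolute row sums:", np.abs(manual_l1).sum(axis=1))
```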

L2 Normalisation

It is also called Least Squares. It rescales each row so that the sum of the squares of the values in the row equals 1. The following example applies L2 normalisation to the input data.

Example


import numpy as np
from sklearn import preprocessing
input_data = np.array(
   [
      [2.1, -1.9, 5.5],
      [-1.5, 2.4, 3.5],
      [0.5, -7.9, 5.6],
      [5.9, 2.3, -5.8]
   ]
)
data_normalized_l2 = preprocessing.normalize(input_data, norm='l2')
print("\nL2 normalized data:\n", data_normalized_l2)

Output


L2 normalized data:
[
   [ 0.33946114 -0.30713151 0.88906489]
   [-0.33325106 0.53320169 0.7775858 ]
   [ 0.05156558 -0.81473612 0.57753446]
   [ 0.68706914 0.26784051 -0.6754239 ]
]
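L2 normalisation divides every row by its Euclidean length, so each row ends up as a unit vector. The sketch below computes this with np.linalg.norm and compares it with preprocessing.normalize; the names row_length and manual_l2 are illustrative.

```python
import numpy as np
from sklearn import preprocessing

input_data = np.array(
    [
        [2.1, -1.9, 5.5],
        [-1.5, 2.4, 3.5],
        [0.5, -7.9, 5.6],
        [5.9, 2.3, -5.8],
    ]
)

# Euclidean length of each row, kept as a column for broadcasting
row_length = np.linalg.norm(input_data, axis=1, keepdims=True)
manual_l2 = input_data / row_length

sklearn_l2 = preprocessing.normalize(input_data, norm="l2")
print("Manual matches normalize():", np.allclose(manual_l2, sklearn_l2))
print("Sum of squares per row:", (manual_l2 ** 2).sum(axis=1))
```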