  • Date: 2024-09-17

Logistic Regression in Python - Getting Data



The steps involved in getting data for performing logistic regression in Python are discussed in detail in this chapter.

Downloading Dataset

If you have not already downloaded the UCI dataset mentioned earlier, download it now from here. Click on the Data Folder. You will see the following screen −

Machine Learning Databases

Download the bank.zip file by clicking on the given link. The zip file contains the following files −

Bank

We will use the bank.csv file for our model development. The bank-names.txt file contains the description of the database that you are going to need later. The bank-full.csv contains a much larger dataset that you may use for more advanced developments.

We have included the bank.csv file in the downloadable source zip. This file contains comma-delimited fields. We have also made a few modifications to the file, so it is recommended that you use the copy included in the project source zip for your learning.

Loading Data

To load the data from the csv file that you copied just now, type the following statement and run the code.

In [2]: df = pd.read_csv('bank.csv', header=0)

You will also be able to examine the loaded data by running the following code statement −

In [3]: df.head()

Once the command is run, you will see the following output −

Loaded Data

The command prints the first five rows of the loaded data. Examine the 21 columns present. We will be using only a few of these columns for our model development.
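The load-and-inspect steps above can be sketched end to end as follows. This is only an illustration: the few rows in `sample_csv` are made up so that the snippet runs without the real file, and the column names are placeholders rather than the full set of 21 columns. With the actual dataset, you would pass the path 'bank.csv' to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for bank.csv: a few made-up, comma-delimited rows.
# With the real file, pass the path 'bank.csv' instead of `sample_csv`.
sample_csv = io.StringIO(
    "age,job,marital,y\n"
    "30,admin.,married,0\n"
    "45,technician,single,1\n"
    "52,services,divorced,0\n"
)

# header=0 tells pandas that the first line holds the column names.
df = pd.read_csv(sample_csv, header=0)

# head() returns the first five rows by default (here, all three rows).
print(df.head())

# List the column names to see which fields are available.
print(df.columns.tolist())
```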

Next, we need to clean the data. The data may contain some rows with NaN. To eliminate such rows, use the following command −

In [4]: df = df.dropna()

Fortunately, the bank.csv does not contain any rows with NaN, so this step is not truly required in our case. However, in general it is difficult to discover such rows in a huge database. So it is always safer to run the above statement to clean the data.
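If you want to confirm whether any rows contain NaN before dropping them, one possible approach is to count them first. The sketch below uses a small made-up DataFrame (not the bank data) so that the effect of `dropna()` is visible; the variable names `rows_with_nan` and `cleaned` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data with two incomplete rows, for illustration only.
df = pd.DataFrame(
    {
        "age": [30, np.nan, 52, 41],
        "job": ["admin.", "technician", None, "services"],
        "y": [0, 1, 0, 1],
    }
)

# isna() flags missing cells; any(axis=1) marks rows with at least one.
rows_with_nan = int(df.isna().any(axis=1).sum())
print("Rows containing NaN:", rows_with_nan)

# dropna() returns a new DataFrame without those rows.
cleaned = df.dropna()
print("Rows before:", len(df), "after:", len(cleaned))
```

On a file with no missing values, the count would be zero and `dropna()` would leave the data unchanged.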

Note − You can easily examine the data size at any point of time by using the following statement −

In [5]: print(df.shape)
(41188, 21)

The number of rows and columns is printed in the output, as shown in the second line above.
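Since `df.shape` is a plain tuple of the form (rows, columns), it can also be unpacked into two variables when you need the counts separately. A minimal sketch, again using a made-up DataFrame rather than the bank data:

```python
import pandas as pd

# Made-up data: three rows and two columns, for illustration only.
df = pd.DataFrame({"age": [30, 45, 52], "y": [0, 1, 0]})

# shape is a (number_of_rows, number_of_columns) tuple.
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")
```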

The next thing to do is to examine the suitability of each column for the model that we are trying to build.
