Logistic Regression in Python - Restructuring Data
Whenever an organization conducts a survey, it tries to collect as much information as possible from the customer, with the idea that this information will be useful to the organization in one way or another at a later point in time. To solve the current problem, we have to pick out the information that is directly relevant to it.
Displaying All Fields
Now, let us see how to select the data fields useful to us. Run the following statement in the code editor.
In [6]: print(list(df.columns))
You will see the following output −
['age', 'job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays', 'previous', 'poutcome', 'emp_var_rate', 'cons_price_idx', 'cons_conf_idx', 'euribor3m', 'nr_employed', 'y']
The output shows the names of all the columns in the database. The last column, “y”, is a Boolean value indicating whether this customer has a term deposit with the bank. The values of this field are either “y” or “n”; the df.head() output later in this chapter shows them as 1 and 0. You can read the description and purpose of each column in the banks-name.txt file that was downloaded as part of the data.
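The column positions matter in the next step, where columns are dropped by index. As a minimal sketch, the names can be printed together with their zero-based positions. The empty DataFrame `demo` below is only a stand-in built from the column names in the output above; in the tutorial session you would use the real `df` instead.

```python
import pandas as pd

# Column names copied from the output shown above. `demo` is a hypothetical
# stand-in for the real `df` loaded in the previous chapter.
columns = ["age", "job", "marital", "education", "default", "housing", "loan",
           "contact", "month", "day_of_week", "duration", "campaign", "pdays",
           "previous", "poutcome", "emp_var_rate", "cons_price_idx",
           "cons_conf_idx", "euribor3m", "nr_employed", "y"]
demo = pd.DataFrame(columns=columns)

# Pair every column name with its zero-based position.
positions = {name: i for i, name in enumerate(demo.columns)}
for name, i in positions.items():
    print(i, name)
```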
Eliminating Unwanted Fields
Examining the column names, you will see that some of the fields have no significance to the problem at hand. For example, fields such as month, day_of_week, campaign, etc. are of no use to us. We will eliminate these fields from our database. To drop a column, we use the drop command as shown below −
In [8]: # drop columns which are not needed.
df.drop(df.columns[[0, 3, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19]], axis = 1, inplace = True)
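The command drops the columns at positions 0, 3, 7, 8, and so on. Positions are counted from zero. To ensure that an index is properly selected, run the following statement before dropping, since the positions shift once columns have been removed −
In [7]: df.columns[9]
Out[7]: 'day_of_week'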
This prints the column name for the given index.
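The index check above can be extended to all the positions at once. The sketch below is an alternative to the positional drop, not the tutorial's own statement: it resolves the positions to names first and then drops by name, so the result does not depend on column order. The one-row DataFrame `demo` and the variable names are assumptions made for illustration; the positions and column names are the ones used in this chapter.

```python
import pandas as pd

# Hypothetical one-row stand-in with the same column layout as `df`.
columns = ["age", "job", "marital", "education", "default", "housing", "loan",
           "contact", "month", "day_of_week", "duration", "campaign", "pdays",
           "previous", "poutcome", "emp_var_rate", "cons_price_idx",
           "cons_conf_idx", "euribor3m", "nr_employed", "y"]
demo = pd.DataFrame([[0] * len(columns)], columns=columns)

# Positions taken from the drop statement in this chapter.
drop_positions = [0, 3, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19]

# Resolve positions to names so the selection can be reviewed before dropping.
drop_names = list(demo.columns[drop_positions])
print(drop_names)

# Drop by name; this returns a new DataFrame and leaves `demo` unchanged.
reduced = demo.drop(columns=drop_names)
print(list(reduced.columns))
```

Unlike `inplace = True`, this form keeps the original frame intact, which makes it easy to re-run the cell if the selection turns out to be wrong.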
After dropping the columns which are not required, examine the data with the head statement. The screen output is shown here −
In [9]: df.head()
Out[9]:
           job  marital  default housing loan     poutcome  y
0  blue-collar  married  unknown     yes   no  nonexistent  0
1   technician  married       no      no   no  nonexistent  0
2   management   single       no     yes   no      success  1
3     services  married       no      no   no  nonexistent  0
4      retired  married       no     yes   no      success  1
Now we have only the fields that we feel are important for our data analysis and prediction. This is the step where the data scientist's judgment matters: the data scientist has to select the appropriate columns for model building.
For example, the type of job may not, at first glance, seem worth including in the database, yet it is a very useful field. Not all types of customers will open a term deposit (TD). Lower-income people may not open TDs, while higher-income people will usually park their excess money in TDs. So the type of job becomes significantly relevant in this scenario. Likewise, carefully select the columns which you feel will be relevant for your analysis.
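One rough way to check whether a column such as job carries signal is to compare the share of term-deposit holders across its categories. The sketch below uses only the five rows shown in the df.head() output, so the numbers themselves are not meaningful; `sample` and `rate_by_job` are illustrative names, and on the real data you would run the same grouping on `df`.

```python
import pandas as pd

# The five rows from the df.head() output in this chapter (job and y only).
sample = pd.DataFrame({
    "job": ["blue-collar", "technician", "management", "services", "retired"],
    "y": [0, 0, 1, 0, 1],
})

# Mean of the 0/1 target per job category = share of customers with a TD.
rate_by_job = sample.groupby("job")["y"].mean()
print(rate_by_job)
```

If the rates differ clearly between categories on the full data, that is a hint, not proof, that the column is worth keeping.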
In the next chapter, we will prepare our data for building the model.