Agile Data Science Tutorial
Fixing Prediction Problem
In this chapter, we will focus on fixing a prediction problem with the help of a specific scenario.
Consider a company that wants to automate its loan eligibility check based on the customer details provided through an online application form. The details include the customer's name, gender, marital status, loan amount and other mandatory fields.
The details are recorded in a CSV file, train.csv, which the code below reads with Loan_ID as its index column.
Execute the following code to evaluate the prediction problem −
    import pandas as pd
    from sklearn import ensemble, model_selection
    from sklearn.preprocessing import LabelEncoder

    # Load the dataset, using Loan_ID as the index
    data = pd.read_csv('train.csv', index_col='Loan_ID')

    def num_missing(x):
        # Number of missing values in a column
        return x.isnull().sum()

    # Impute missing categorical values with the most frequent value (mode)
    for col in ['Gender', 'Married', 'Self_Employed']:
        data[col] = data[col].fillna(data[col].mode()[0])

    # Impute the mean for missing loan amounts
    data['LoanAmount'] = data['LoanAmount'].fillna(data['LoanAmount'].mean())

    # Map Dependents to numbers, then impute the mean
    mapping = {'0': 0, '1': 1, '2': 2, '3+': 3}
    data['Dependents'] = data['Dependents'].map(mapping)
    data['Dependents'] = data['Dependents'].fillna(data['Dependents'].mean())

    # Forward-fill the remaining gaps
    data['Loan_Amount_Term'] = data['Loan_Amount_Term'].ffill()
    data['Credit_History'] = data['Credit_History'].ffill()
    print(data.apply(num_missing, axis=0))

    # Convert the categorical data to numbers using the label encoder
    var_mod = ['Gender', 'Married', 'Education', 'Self_Employed',
               'Property_Area', 'Loan_Status']
    le = LabelEncoder()
    for i in var_mod:
        data[i] = le.fit_transform(data[i])

    # Train/test split
    x = ['Gender', 'Married', 'Education', 'Self_Employed', 'Property_Area',
         'LoanAmount', 'Loan_Amount_Term', 'Credit_History', 'Dependents']
    y = 'Loan_Status'
    print(data[x])
    X_train, X_test, y_train, y_test = model_selection.train_test_split(
        data[x], data[y], test_size=0.2)

    # Random forest classifier
    # clf = ensemble.RandomForestClassifier(n_estimators=100, criterion='gini',
    #                                       max_depth=3, max_features='sqrt', n_jobs=-1)
    clf = ensemble.RandomForestClassifier(n_estimators=200, max_features=3,
                                          min_samples_split=5, oob_score=True,
                                          n_jobs=-1, criterion='entropy')
    clf.fit(X_train, y_train)
    accuracy = clf.score(X_test, y_test)
    print(accuracy)
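The three imputation strategies used in the loan script (most frequent value, column mean, forward fill) are easier to follow on a few rows. The sketch below is only an illustration: the `toy` DataFrame and its values are hypothetical and are not taken from train.csv.

```python
import pandas as pd

# Hypothetical toy records; these values are made up for illustration
toy = pd.DataFrame({
    "Gender": ["Male", None, "Male", "Female"],
    "LoanAmount": [100.0, None, 140.0, 120.0],
    "Loan_Amount_Term": [360.0, None, 180.0, None],
})

# Categorical column: fill gaps with the most frequent value (the mode)
toy["Gender"] = toy["Gender"].fillna(toy["Gender"].mode()[0])

# Numeric column: fill gaps with the column mean (missing values are skipped)
toy["LoanAmount"] = toy["LoanAmount"].fillna(toy["LoanAmount"].mean())

# Forward fill: copy the value from the previous row into each gap
toy["Loan_Amount_Term"] = toy["Loan_Amount_Term"].ffill()

# Total number of missing cells left after imputation
missing_after = int(toy.isnull().sum().sum())
print(toy)
print("missing values left:", missing_after)
```

Note that a forward fill cannot repair a gap in the very first row, because there is no earlier value to copy; checking the per-column missing counts afterwards catches that case.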
Output
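Running the loan eligibility script prints the number of missing values remaining in each column, then the selected feature columns, and finally the accuracy on the 20% test split. Because the split is random, the accuracy varies slightly between runs.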
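The classifier in this chapter is created with oob_score=True, yet the script only prints the test accuracy. As a hedged sketch of what else could be inspected, the example below reads the out-of-bag estimate and the feature importances from a forest with the same settings. It runs on synthetic data from `make_classification` as a stand-in for the loan table, so the numbers it prints say nothing about the real dataset; the `random_state` values are added here only to make the run repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan table: 9 features, binary target (assumption)
X, y = make_classification(n_samples=500, n_features=9, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Same forest settings as in the chapter's script
clf = RandomForestClassifier(n_estimators=200, max_features=3,
                             min_samples_split=5, oob_score=True,
                             criterion="entropy", random_state=0)
clf.fit(X_train, y_train)

# Accuracy on the held-out test split
test_accuracy = clf.score(X_test, y_test)
# Accuracy estimated from the training samples each tree did not see
oob_accuracy = clf.oob_score_
# Relative contribution of each feature; the values sum to 1
importances = clf.feature_importances_

print("test accuracy:", test_accuracy)
print("out-of-bag accuracy:", oob_accuracy)
print("feature importances:", importances)
```

A large gap between the out-of-bag estimate and the test accuracy is a hint that the test split is too small or unrepresentative, which is useful to know before tuning the model further.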