In: Computer Science
You will utilize a large dataset to create a
predictive analytics algorithm in Python.
For this assignment, complete the following:
Utilize one of the following websites to identify a dataset to use (preferably over 500K records): Google's dataset search, Kaggle, or the .gov data website.
Utilize a machine learning algorithm to create a
prediction. K-nearest neighbors is recommended as an introductory
algorithm. Your algorithm should read in the dataset, segmenting
the data with 70% used for training and 30% used for testing.
Illustrate the results as output.
Create a Microsoft Word document that describes how
this algorithm could be used to predict new records that are
generated. This is often called scoring new records.
Sample Dataset Downloaded and used is: https://bit.ly/2IWDUi0
Below is the code for a KNN (k-nearest neighbors) machine-learning algorithm that predicts the target. The code reads the dataset using pandas, separates the independent variables as X and the dependent variable as y, and then splits the whole dataset 70%/30% for training and testing, respectively.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Reading the dataset:
df = pd.read_csv("diabetes.csv")
# Separating the independent and dependent variables as X and y
X = df.iloc[:,:-1]
y = df.iloc[:,-1]
# Splitting the dataset with a 30% test size:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.30, random_state=0)
#Fitting the data to the model:
knn = KNeighborsClassifier(n_neighbors = 15).fit(X_train,y_train)
# Functions to review the performance of the model:
def acc(a, b):
    print('Accuracy (test data): ', accuracy_score(a, b))
    print('Confusion Matrix: \n', pd.DataFrame(confusion_matrix(a, b)))
    print('Classification Report:\n', classification_report(a, b, digits=3))
# Checking the performance:
acc(y_test,knn.predict(X_test))
Running the above code with n_neighbors=15 prints the model's accuracy, confusion matrix, and classification report on the 30% test split.
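The Word-document portion of the assignment asks how the fitted model would score new records as they are generated. A minimal sketch of that step is below; since diabetes.csv may not be at hand, synthetic data stands in for the real features (the shapes and column count are assumptions, not the actual dataset):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stand-in for the dataset: 8 numeric feature columns and a
# binary outcome. In the real assignment you would reuse X_train/y_train
# from the code above instead of generating data here.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 8))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

knn = KNeighborsClassifier(n_neighbors=15).fit(X_train, y_train)

# "Scoring" a new record: pass a row the model has never seen to the
# fitted estimator to get a predicted class and class probabilities.
new_record = rng.normal(size=(1, 8))
label = knn.predict(new_record)[0]
proba = knn.predict_proba(new_record)[0]
print(label, proba)
```

In production, each newly generated record would be fed through the same preprocessing as the training data and then passed to `predict` (or `predict_proba` when a confidence score is needed).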
Now let's talk a little about improving the model (suggestions):
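One simple lever is the choice of k itself (15 above was picked by hand). A minimal sketch of selecting k by cross-validated accuracy, again on synthetic stand-in data since the real CSV may not be available:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption, not the diabetes dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

# 5-fold cross-validated accuracy for a few candidate k values;
# keep the k with the highest mean score.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X, y, cv=5).mean()
          for k in (3, 5, 9, 15, 25)}
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

Cross-validation on the training split avoids tuning k against the held-out 30% test set, which should only be touched once for the final evaluation.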
A good prediction model requires proper processing of the data before fitting. The dataset above is somewhat imbalanced: it has almost double the number of 0's as 1's. Whenever a dataset is imbalanced, we either remove samples so that both categories are equally represented, or apply a method that generates additional similar samples for the minority class (SMOTE can be used).
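SMOTE lives in the separate imbalanced-learn package; a sketch of the simpler alternative, upsampling the minority class with scikit-learn's `resample`, is below. The toy DataFrame mimics the roughly 2:1 class ratio described above and is an assumption, not the real data:

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced frame: 20 records of class 0 vs. 10 of class 1.
df = pd.DataFrame({"x": range(30), "Outcome": [0] * 20 + [1] * 10})

majority = df[df["Outcome"] == 0]
minority = df[df["Outcome"] == 1]

# Sample the minority class with replacement until both classes
# have the same number of rows.
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_up])
print(balanced["Outcome"].value_counts())
```

Either approach (upsampling or SMOTE) should be applied only to the training split, so the test set still reflects the real class distribution.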
Next, we should check the columns of the dataset for collinearity and treat any that are highly correlated. We should also apply a normalization method so that all features have a similar range of values, which matters especially for KNN because it is distance-based.
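Both checks can be sketched in a few lines: flag strongly correlated column pairs via the correlation matrix, then standardize the features. The small synthetic frame below (with column "b" deliberately built to be nearly collinear with "a") is an illustration, not the assignment's data:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=100)})
df["b"] = df["a"] * 2 + rng.normal(scale=0.01, size=100)  # nearly collinear
df["c"] = rng.normal(size=100)

# Flag column pairs with |correlation| > 0.9 as candidates for removal.
corr = df.corr().abs()
high = [(i, j) for i in corr.columns for j in corr.columns
        if i < j and corr.loc[i, j] > 0.9]
print(high)

# Standardize every feature to mean 0 and unit variance before KNN.
scaled = StandardScaler().fit_transform(df)
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

As with resampling, the scaler should be fit on the training split and then applied to the test split, to avoid leaking test statistics into training.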
With this, I hope you have an idea of how to proceed.
Thanks, and have a nice day!