Question

In: Computer Science

You will utilize a large dataset to create a predictive analytics algorithm in Python. For this...

You will utilize a large dataset to create a predictive analytics algorithm in Python.

For this assignment, complete the following:


Utilize one of the following Web sites to identify a dataset to use, preferably over 500K from Google databases, kaggle, or the .gov data website

Utilize a machine learning algorithm to create a prediction. K-nearest neighbors is recommended as an introductory algorithm. Your algorithm should read in the dataset, segmenting the data with 70% used for training and 30% used for testing. Illustrate the results as output.


Create a Microsoft Word document that describes how this algorithm could be used to predict new records that are generated. This is often called scoring new records


Solutions

Expert Solution

Sample Dataset Downloaded and used is: https://bit.ly/2IWDUi0

Below is the code for KNN (K-nearest neighbors), ML algorithm to predict the target. In the code below, I have read the dataset using pandas and then separated the independent variables as X and dependent as y and further segmented the whole dataset in 70% and 30% for training and testing respectively.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Reading the dataset:
df = pd.read_csv("diabetes.csv")

# Separating the independent and dependent variables as X and y 
X = df.iloc[:,:-1]
y = df.iloc[:,-1]

#Spliting the dataset with 30% of test size:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.30, random_state=0)

#Fitting the data to the model:
knn = KNeighborsClassifier(n_neighbors = 15).fit(X_train,y_train)

# Functions to review the performance of the model:
def acc(a,b):
    print('Accuracy(test Data): ',accuracy_score(a,b))
    print('Confusion Matrix: \n',pd.DataFrame(confusion_matrix(a,b)))
    print('Classification Repost:\n',classification_report(a,b,digits=3))

# Checking the performance:
acc(y_test,knn.predict(X_test))

With the above code and n-neighbors=15, we got the below performance of the model:

Now let's talk a little above improving the model (Suggestions):

To improve or have a good prediction model requires the proper processing of the data before fitting it to the model. Above dataset is little-bit imbalanced because it has almost double the number of 0's as of 1's. So, whenever a dataset is imbalanced, we need to either remove some samples to have equal samples of both categories or apply some methods to generate some similar samples (SMOTE method can be used.)

Next we should check for collinearity of columns in the dataset and apply methods for its treatment. We should also apply some normalization method to have similar range of values.

With this I hope, You got the idea to go ahead on this.

Thanks & Have a nice day !

Below is the screenshot of the codes to avoid any indentation issues:


Related Solutions

Question 1: Using Python 3 Create an algorithm The goal is to create an algorithm that...
Question 1: Using Python 3 Create an algorithm The goal is to create an algorithm that can sort a singly-linked-list with Merge-sort. The program should read integers from file (hw-extra.txt) and create an unsorted singly-linked list. Then, the list should be sorted using merge sort algorithm. The merge-sort function should take the head of a linked list, and the size of the linked list as parameters. hw-extra.txt provided: 37 32 96 2 25 71 432 132 76 243 6 32...
Part A: You have been hired by a company to build a predictive analytics model to...
Part A: You have been hired by a company to build a predictive analytics model to increase their sales. Before building the model, your manager asked you to start with exploratory data analysis and report the findings. Which visualization tools would you use to display important properties of data such as outliers, range of data, mean, IQR, and distribution of data, etc.? Be specific about the tools and methods that display the skewness, similarity of distributions, and whether data comes...
Part 2: What is the difference between historical analytics or predictive analytics in detail and provide...
Part 2: What is the difference between historical analytics or predictive analytics in detail and provide some example?
Explain the relationship between data mining and predictive analytics.
Explain the relationship between data mining and predictive analytics.
Explain the use of neural networking modeling in predictive analytics.
Explain the use of neural networking modeling in predictive analytics.
Do some research into workforce analytics and predictive analytics. What can we expect for the future?...
Do some research into workforce analytics and predictive analytics. What can we expect for the future? Be creative, think possibilities.
Write a Python program that: Create the algorithm in both flowchart and pseudocode forms for the...
Write a Python program that: Create the algorithm in both flowchart and pseudocode forms for the following requirements: Reads in a series of positive integers,  one number at a time;  and Calculate the product (multiplication) of all the integers less than 25,  and Calculate the sum (addition) of all the integers greater than or equal to 25. Use 0 as a sentinel value, which stops the input loop. [ If the input is 0 that means the end of the input list. ]...
Utilize Python to create a key-value program similar to MapReduce on Hadoop. For this assignment, do...
Utilize Python to create a key-value program similar to MapReduce on Hadoop. For this assignment, do the following: Create a dictionary (also called an associative array) that contains at least 50 values. An example of a key may be state and value as capital. Another example may be number in an alphabet and letter in an alphabet. Write a command that enumerates the contents of key-values in the dictionary. Write a command that lists all keys. Write a command that...
How is data analytics different from statistics? Analytics tools fall into 3 categories:descriptive, predictive, and prescriptive....
How is data analytics different from statistics? Analytics tools fall into 3 categories:descriptive, predictive, and prescriptive. What are the main differences among these categories? Explain how businesses use analytics to convert raw operational data into actionable information. Provide at least 1 example
As a business owner, consider how you could use regression analysis and predictive analytics to increase...
As a business owner, consider how you could use regression analysis and predictive analytics to increase your company's sales. If you were creating a study, what would be the dependent variable, and then what independent variables could be included in the study, and why?
ADVERTISEMENT
ADVERTISEMENT
ADVERTISEMENT