Question

In: Computer Science

You will utilize a large dataset to create a predictive analytics algorithm in Python. For this...

You will utilize a large dataset to create a predictive analytics algorithm in Python.

For this assignment, complete the following:


Utilize one of the following Web sites to identify a dataset to use, preferably over 500K from Google databases, kaggle, or the .gov data website

Utilize a machine learning algorithm to create a prediction. K-nearest neighbors is recommended as an introductory algorithm. Your algorithm should read in the dataset, segmenting the data with 70% used for training and 30% used for testing. Illustrate the results as output.


Create a Microsoft Word document that describes how this algorithm could be used to predict new records that are generated. This is often called scoring new records


Solutions

Expert Solution

Sample Dataset Downloaded and used is: https://bit.ly/2IWDUi0

Below is the code for KNN (K-nearest neighbors), ML algorithm to predict the target. In the code below, I have read the dataset using pandas and then separated the independent variables as X and dependent as y and further segmented the whole dataset in 70% and 30% for training and testing respectively.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Reading the dataset:
df = pd.read_csv("diabetes.csv")

# Separating the independent and dependent variables as X and y 
X = df.iloc[:,:-1]
y = df.iloc[:,-1]

#Spliting the dataset with 30% of test size:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.30, random_state=0)

#Fitting the data to the model:
knn = KNeighborsClassifier(n_neighbors = 15).fit(X_train,y_train)

# Functions to review the performance of the model:
def acc(a,b):
    print('Accuracy(test Data): ',accuracy_score(a,b))
    print('Confusion Matrix: \n',pd.DataFrame(confusion_matrix(a,b)))
    print('Classification Repost:\n',classification_report(a,b,digits=3))

# Checking the performance:
acc(y_test,knn.predict(X_test))

With the above code and n-neighbors=15, we got the below performance of the model:

Now let's talk a little above improving the model (Suggestions):

To improve or have a good prediction model requires the proper processing of the data before fitting it to the model. Above dataset is little-bit imbalanced because it has almost double the number of 0's as of 1's. So, whenever a dataset is imbalanced, we need to either remove some samples to have equal samples of both categories or apply some methods to generate some similar samples (SMOTE method can be used.)

Next we should check for collinearity of columns in the dataset and apply methods for its treatment. We should also apply some normalization method to have similar range of values.

With this I hope, You got the idea to go ahead on this.

Thanks & Have a nice day !

Below is the screenshot of the codes to avoid any indentation issues:


Related Solutions

Question 1: Using Python 3 Create an algorithm The goal is to create an algorithm that...
Question 1: Using Python 3 Create an algorithm The goal is to create an algorithm that can sort a singly-linked-list with Merge-sort. The program should read integers from file (hw-extra.txt) and create an unsorted singly-linked list. Then, the list should be sorted using merge sort algorithm. The merge-sort function should take the head of a linked list, and the size of the linked list as parameters. hw-extra.txt provided: 37 32 96 2 25 71 432 132 76 243 6 32...
1. Create a flow chart from the algorithm then create python code. Algorithm: Tell the user...
1. Create a flow chart from the algorithm then create python code. Algorithm: Tell the user to enter their name, how many cars the user has, and the average number of cars per family in the users state. Create a variable to show the difference between the users number of cars and the average number of cars per family. Print the values of name, number of cars, average number of cars, and the difference in 2 decimal places.
Part A: You have been hired by a company to build a predictive analytics model to...
Part A: You have been hired by a company to build a predictive analytics model to increase their sales. Before building the model, your manager asked you to start with exploratory data analysis and report the findings. Which visualization tools would you use to display important properties of data such as outliers, range of data, mean, IQR, and distribution of data, etc.? Be specific about the tools and methods that display the skewness, similarity of distributions, and whether data comes...
Part 2: What is the difference between historical analytics or predictive analytics in detail and provide...
Part 2: What is the difference between historical analytics or predictive analytics in detail and provide some example?
Explain the relationship between data mining and predictive analytics.
Explain the relationship between data mining and predictive analytics.
Explain the use of neural networking modeling in predictive analytics.
Explain the use of neural networking modeling in predictive analytics.
Do some research into workforce analytics and predictive analytics. What can we expect for the future?...
Do some research into workforce analytics and predictive analytics. What can we expect for the future? Be creative, think possibilities.
You are an analytics developer, and you need to write the searching algorithm to find the...
You are an analytics developer, and you need to write the searching algorithm to find the element. Your program should perform the following: Implement the Binary Search function. Write a random number generator that creates 1,000 elements, and store them in the array. Write a random number generator that generates a single element called searched value. Pass the searched value and array into the Binary Search function. If the searched value is available in the array, then the output is...
Write a Python program that: Create the algorithm in both flowchart and pseudocode forms for the...
Write a Python program that: Create the algorithm in both flowchart and pseudocode forms for the following requirements: Reads in a series of positive integers,  one number at a time;  and Calculate the product (multiplication) of all the integers less than 25,  and Calculate the sum (addition) of all the integers greater than or equal to 25. Use 0 as a sentinel value, which stops the input loop. [ If the input is 0 that means the end of the input list. ]...
How is data analytics different from statistics? Analytics tools fall into 3 categories:descriptive, predictive, and prescriptive....
How is data analytics different from statistics? Analytics tools fall into 3 categories:descriptive, predictive, and prescriptive. What are the main differences among these categories? Explain how businesses use analytics to convert raw operational data into actionable information. Provide at least 1 example
ADVERTISEMENT
ADVERTISEMENT
ADVERTISEMENT