Question

Python 3

Rewrite the KNN sample code using KNeighborsClassifier.

● Repeat KNN Steps 1 – 5 at least five times and calculate the average accuracy to be your result.

● If you use the latest version of scikit-learn, you need to program with Python >= 3.5.

● Use the same dataset: “iris.data”

● Split your data: 67% for training and 33% for testing

● Draw a line chart: Use a “for loop” to change k from 1 to 10 and check your model accuracy.

Requirements:

scipy
numpy
pandas
matplotlib
seaborn
sklearn

Code:

import csv
import random
import math
import operator

def loadDataset(filename, split, trainingSet, testSet):
    # Read the CSV file and randomly assign each row to the training or test set.
    with open(filename, 'r', newline='') as csvfile:
        lines = csv.reader(csvfile)
        dataset = list(lines)
        # len(dataset)-1 skips the final row (assumed to be the blank line at the end of iris.data).
        for x in range(len(dataset)-1):
            for y in range(4):
                dataset[x][y] = float(dataset[x][y])
            if random.random() < split:
                trainingSet.append(dataset[x])
            else:
                testSet.append(dataset[x])


def euclideanDistance(instance1, instance2, length):
    # Euclidean distance over the first `length` features.
    distance = 0
    for x in range(length):
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)

def getNeighbors(trainingSet, testInstance, k):
    # Return the k training rows closest to testInstance.
    distances = []
    length = len(testInstance)-1  # the last column is the class label
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors

def getResponse(neighbors):
    # Majority vote over the class labels of the neighbors.
    classVotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]

def getAccuracy(testSet, predictions):
    # Percentage of test rows whose label matches the prediction.
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct/float(len(testSet))) * 100.0

def main():
    # prepare data
    trainingSet = []
    testSet = []
    split = 0.67
    loadDataset('iris.data', split, trainingSet, testSet)
    print('Train set: ' + repr(len(trainingSet)))
    print('Test set: ' + repr(len(testSet)))
    # generate predictions for k = 1..10
    k_values = range(10)
    for k in k_values:
        predictions = []  # reset for every k so accuracies do not mix
        for x in range(len(testSet)):
            neighbors = getNeighbors(trainingSet, testSet[x], k+1)
            result = getResponse(neighbors)
            predictions.append(result)
        accuracy = getAccuracy(testSet, predictions)
        print('k=' + repr(k+1) + ' Accuracy: ' + repr(accuracy) + '%')


main()

Solutions

Expert Solution

KNN (K-Nearest Neighbors) is a simple supervised classification algorithm that assigns a class to a new data point. It can be used for regression as well. KNN makes no assumptions about the data distribution, so it is non-parametric. It keeps all the training data and makes predictions by computing the similarity between an input sample and each training instance.

KNN can be summarized as below:

  1. Compute the distance between the new data point and every training example.
  2. Use a distance measure such as Euclidean, Hamming or Manhattan distance for this computation.
  3. Pick the K entries in the training data that are closest to the new data point.
  4. Take a majority vote: the most common class/label among those K entries becomes the class of the new data point.
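
The four steps above can be sketched in a few lines of plain Python. This is only an illustration, not the scikit-learn implementation: the function name `knn_predict` and the toy points are made up for the example, and `math.dist` assumes Python 3.8 or newer.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Steps 1-2: Euclidean distance from the query to every training example.
    distances = [math.dist(point, query) for point in train_points]
    # Step 3: indices of the k closest training examples.
    nearest = sorted(range(len(train_points)), key=distances.__getitem__)[:k]
    # Step 4: majority vote among the labels of those k examples.
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data, invented for illustration only.
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
                (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_points, train_labels, (1.1, 1.0), k=3))
```

Ties in the vote are resolved here by whichever label was counted first, which is an arbitrary choice for the sketch.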

Figure: with K=3 the new point is assigned Class B; with K=6 it is assigned Class A.

Detailed documentation on KNN is available here.

The example below shows an implementation of KNN on the iris dataset using the scikit-learn library. The iris dataset has 50 samples for each of three species of Iris flower (150 in total). For each sample we have the sepal length and width, the petal length and width, and a species name (class/label).

Iris flower: sepal length, sepal width, petal length and width

  • 150 observations
  • 4 features(sepal length, sepal width, petal length, petal width)
  • Response variable is the iris species
  • Classification problem since response is categorical.

Our task is to build a KNN model that classifies a new flower's species based on its sepal and petal measurements. The iris dataset is available in scikit-learn, and we can use it to build our KNN model.

Complete code can be found in the Git Repo.

Step1: Import the required data and check the features.

Import the load_iris function from the scikit-learn datasets module and create an iris Bunch object (Bunch is scikit-learn’s special object type for storing datasets and their attributes).

Each observation represents one flower, and the 4 columns represent the 4 measurements. The features (measurements) are stored under the ‘data’ attribute and their column names under ‘feature_names’, whereas the labels/responses are stored under ‘target’ and are encoded as 0, 1 and 2. This is because the features and response should be numeric (NumPy arrays) for scikit-learn models, and they should have a specific shape.
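
Since the original screenshots are not reproduced here, the following is a minimal sketch of what Step 1 might look like. The variable names `iris`, `X` and `y` are my own choice.

```python
from sklearn.datasets import load_iris

iris = load_iris()   # Bunch object holding the data and its metadata
X = iris.data        # feature matrix: one row per flower, one column per measurement
y = iris.target      # response vector: species encoded as integers

print(iris.feature_names)  # names of the four measurement columns
print(iris.target_names)   # species names that the integer codes stand for
print(X.shape, y.shape)
```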

Step2: Split the data and Train the Model.

Training and testing on the same data is not an optimal approach, so we split the data into two pieces: a training set and a testing set. We use the ‘train_test_split’ function to split the data. The optional parameter ‘test_size’ determines the split percentage. The ‘random_state’ parameter makes the data split the same way every time you run the code. Since we are training and testing on different sets of data, the resulting testing accuracy is a better estimate of how well the model is likely to perform on unseen data.
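
A possible version of the split is sketched below. The value `test_size=0.33` follows the 67%/33% split requested in the question, and `random_state=4` is an arbitrary seed chosen for the example; neither value is confirmed by the original answer.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# test_size=0.33 holds out about a third of the rows for testing (assumed value);
# random_state fixes the shuffle so the split is reproducible (arbitrary seed).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=4
)

print(X_train.shape, X_test.shape)
```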

Scikit-learn is carefully organized into modules so that we can import the relevant classes easily. Import the class ‘KNeighborsClassifier’ from the ‘neighbors’ module and instantiate the estimator (‘estimator’ is scikit-learn’s term for a model). Models are called estimators because their primary role is to estimate unknown quantities.

In our example we create an instance (‘knn’) of the class ‘KNeighborsClassifier’. In other words, we have created an object called ‘knn’ that knows how to do KNN classification once we provide the data. The parameter ‘n_neighbors’ is the tuning parameter/hyperparameter (K). All other parameters are left at their default values.
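
As a rough illustration of the instantiation step, the snippet below creates the estimator with `n_neighbors=5`; the value 5 is used here only as an example.

```python
from sklearn.neighbors import KNeighborsClassifier

# n_neighbors is K; every other parameter keeps its default value.
knn = KNeighborsClassifier(n_neighbors=5)

# get_params() lists the settings the estimator was created with.
print(knn.get_params())
```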

The ‘fit’ method trains the model on the training data (X_train, y_train), and the ‘predict’ method does the testing on the testing data (X_test). Choosing the optimal value of K is critical, so we fit and test the model for different values of K (from 1 to 25) using a for loop and record the KNN testing accuracy in a variable (scores).
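
The loop described above could look roughly like this. The split parameters are the same assumed values as in the question (33% test data, arbitrary seed), and `accuracy_score` from `sklearn.metrics` is one way to compute the testing accuracy; the original answer may have used a different call.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=4  # assumed split settings
)

k_range = range(1, 26)  # K from 1 to 25
scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)        # train on the training set
    y_pred = knn.predict(X_test)     # predict labels for the test set
    scores.append(accuracy_score(y_test, y_pred))

for k, score in zip(k_range, scores):
    print(k, round(score, 3))
```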

Plot the relationship between the values of K and the corresponding testing accuracy using the matplotlib library. As we can see, there is a rise and fall in the accuracy, which is quite typical when examining model complexity against accuracy. In general, accuracy first rises as K increases and then falls again.
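
A line chart of this kind can be drawn as sketched below. The `Agg` backend and the output file name `knn_accuracy.png` are assumptions made so the example runs without a display; in an interactive session you would more likely call `plt.show()`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here for headless runs
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=4  # assumed split settings
)

k_range = list(range(1, 26))
# score() returns the mean accuracy on the given test data.
scores = [
    KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_test, y_test)
    for k in k_range
]

fig, ax = plt.subplots()
ax.plot(k_range, scores, marker="o")
ax.set_xlabel("Value of K for KNN")
ax.set_ylabel("Testing accuracy")
fig.savefig("knn_accuracy.png")  # hypothetical output file name
```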

In general, the training accuracy rises as the model complexity increases. For KNN the model complexity is determined by the value of K: a larger K leads to a smoother decision boundary (a less complex model), while a smaller K leads to a more complex model (which may overfit). Testing accuracy penalizes models that are too complex (overfitting) or not complex enough (underfitting). We get the maximum testing accuracy when the model has the right level of complexity; in our case, for K values from 3 to 19 the model accuracy is 96.6%.
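
The accuracy from a single split depends on which rows happen to land in the test set, which is why the question asks for at least five repetitions and an averaged result, with K going from 1 to 10. One way to do that is sketched below. It uses `load_iris` in place of reading “iris.data” from disk, and it uses the repetition index as the random seed; both are my assumptions rather than part of the original answer.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

k_values = range(1, 11)  # K from 1 to 10, as the question asks
n_repeats = 5            # at least five repetitions

# accuracies[r, j] holds the test accuracy of repetition r for the j-th K.
accuracies = np.zeros((n_repeats, len(k_values)))
for r in range(n_repeats):
    # A different seed per repetition gives a different 67%/33% split.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=r
    )
    for j, k in enumerate(k_values):
        knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        accuracies[r, j] = knn.score(X_test, y_test)

mean_accuracy = accuracies.mean(axis=0)  # average over the repetitions
for k, acc in zip(k_values, mean_accuracy):
    print(f"k={k:2d}  mean accuracy={acc:.3f}")
```

Plotting `mean_accuracy` against `k_values` with matplotlib then gives the line chart the question asks for.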

For our final model we can choose an optimal value of K such as 5 (which falls between 3 and 19) and retrain the model with all the available data. That will be our final model, ready to make predictions.
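
A minimal sketch of that final step follows. The measurements in `new_flower` are example values picked for illustration, not data from the original answer.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# Retrain with the chosen K on all 150 observations.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)

# Illustrative measurements: sepal length, sepal width, petal length, petal width (cm).
new_flower = [[5.1, 3.5, 1.4, 0.2]]
pred = knn.predict(new_flower)
print(iris.target_names[pred[0]])
```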

