Question

In: Computer Science

Project 4 – The Nearest Neighbors Classification Algorithm This project will require you to implement a...

Project 4 – The Nearest Neighbors Classification Algorithm This project will require you to implement a version of the Nearest Neighbors classification algorithm. This version, the Three Nearest Neighbors (or 3NN for short), is one of the more intuitive classification algorithms, and one of the easier ones to code. This Nearest Neighbors family of algorithms is useful because it does not require any special “training” to work. You simply need previous data to compare to. What is classification? With classification, the goal is to use previously collected data to train an algorithm to correctly classify future data. If I have a data set on the measurements of the sepal (the part of a flower that encloses a petal) length and width of two kinds of irises, setosa and versicolor, I can use that data to predict what kind of iris another flower is from sepal measurements alone. Our classification algorithm would be trained to draw a line between the two regions we can see in the scatterplot, as so: Figure 1 Figure 2 Deciding that future data on one side of the line should be setosa, and the other side should be versicolor. Note that we would not expect such a classifier to be perfect: You can see that one of the data points for setosa is on the versicolor side of the boundary in this example. Rather, we would expect the classification algorithm to work an acceptable amount of the time. What is acceptable depends on the application. Definitions Training Data – Data used when implementing the classification algorithm to draw boundaries. In the above example, the x and o data points are the training data. Class – Different categories that training data has been divided into. In the above example, setosa and versicolor are classes. Test Data – Data for which we do not know the class. We will use the classification algorithm to help us guess which class it might come from. Data Space – The data involved with training and testing, in all of its demensions Nearest Neighbors Classifier The nearest neighbor classifier is relatively straightforward. First, it takes all of the dimensions of training data available and normalizes each dimension. An incoming piece of test data is also normalized based on the means and standard deviations of the training data. Once both test and training data are normalized, we wind up with a data space that might look like this: In the nearest neighbors algorithm, we look at the Euclidian distance between the test data and the training data. We then look at the closest K neighbors, where K is determined before training. In this example, and in this project, K = 3. Those closest neighbors then “vote,” with the class that has the largest number of closest neighbors being assigned as the class of the test data. Figure 3 In the above example (test data was sepal length 6 cm, sepal width 4cm), the 3NN algorithm classifies the test data as setosa: Because the three closest neighbors are all of the class setosa. In another test data case, with sepal length 5.75 cm and sepal width 3.5 cm, we also find that the class is setosa: As two of the three closest neighbors are of that class. In a third test data case, with sepal length as 5.2 cm and sepal width as 2.9 cm, we find the class is versicolor: Figure 4 Figure 5 As two of the three closest neighbors are of that class. Inputs and Outputs In this project, you will implement the 3NN algorithm in a MATLAB function. This algorithm will only work with data that is in 2 dimensions with 2 classes. This function will have two inputs: dataSet – A structure with the following fields: - data1 – A numerical array of doubles, the first dimension of a data set, of length N - data2 – A numerical array of doubles, the second dimension of a data set, also of length N - label – A cell array, the class labels of your data set. Each element is a vector string. Also of length N - name1 – A string with the first possible class label - name2 – A string with the second possible class label testData – A 1 x 2 array of doubles, where the first element is the value of your test data in dimension 1, and the second element is the value of your test data in dimension 2. This function will have one output: winner – A string with the label of the class assigned to testData. Your function should also produce a plot of the normalized Data space similar to figure 3. To create a plot of points instead of lines, use the following inputs to plot: plot(x,y,'Marker','o','LineStyle','none') Where the string after the input “Marker” determines the shape of each point, and “ ‘LineStyle’, ‘none’ “ prevents any lines from being drawn between points. Methods Figure 6 Please use the following methods to implement the 3NN algorithm: Step 1 Normalize both dimensions of your training data set. Then normalize your test data using the mean and standard deviation of your training data. You now have enough data to generate the required plot. Step 2 Find the Euclidian distance between your test data and all training data using the Pythagorean theorem with your normalized data. Step 3 Find the three points in the training dataset that are closest to your test data. Step 4 Have the three nearest neighbors “vote” to see which class has more representation in the nearest neighbors. Hint: This is where you will want to use the label, name1, and name2 fields. Use the result of voting to set the output. Testing Code I have prepared three test datasets for you, that you can find in the files: - iris.mat, which is the data set in the above examples - diabetes.mat, which is blood glucose and ages of patients 5 years prior for people with and without diabetes - wheat.mat, which is the area and compactness of the kernals of two varieties of wheat seed You should be able to use your visualizations to check whether your nearest neighbors algorithms are working properly. Hint: Try choosing test data in some of the less “crowded” locations of your data space.

Solutions

Expert Solution


Related Solutions

Describe how and why you should partition your data when using classification techniques like k-nearest neighbors...
Describe how and why you should partition your data when using classification techniques like k-nearest neighbors and logistic regression.
In this lab, you will implement Heap Sort algorithm for the same inputs. For each algorithm,...
In this lab, you will implement Heap Sort algorithm for the same inputs. For each algorithm, and for each n = 100, 200, 300, 400, 500, 1000, 4000, 10000, measure its running time and number of steps when the input is (1) already sort, i.e. n, n-1, …, 3, 2,1; (2) reversely sorted 1, 2, 3, … n; (3) random permutation of 1, 2, …, n; (4) 50 instances of n random numbers generated in the range of [1..n]. Note:...
4) Implement the Selection Sort algorithm discussed in class to sort a list of a certain...
4) Implement the Selection Sort algorithm discussed in class to sort a list of a certain size. The list is to be implemented using a dynamic array as well as a singly linked list. You are required to compare the performance (sorting time) for the dynamic array and singly-linked list-based implementations. You are given the startup codes for the dynamic array and singly-linked list based implementations. You are required to implement the Selection Sort function in each of these codes....
Project Description: In groups of four students, implement (New Encryption Algorithm) character oriented and apply it...
Project Description: In groups of four students, implement (New Encryption Algorithm) character oriented and apply it using any programming Language that you are familiar with to encrypt and decrypt your name. Your work should cover: 1. Create the algorithm and explain (encryption-decryption) using block diagram. 2. Mention whether will you use the keys or not in your algorithm. 3- If you will use keys, explain how to generate the keys. 2. Display the generated keys 3. Encrypt your name 4....
The project requires you to implement 4 functions of your choice in a file called Utilities.java....
The project requires you to implement 4 functions of your choice in a file called Utilities.java. The functions may relate to statistics, graphics/animation, audio, text, or image applications. Deliverables (files that need to be submitted): 1. A word document that briefly describes the functions that were implemented and the API. 2. Utilities.java file with the 4 functions and a Test.java file that provides examples of using the functions in Utilities.java.
In this lab, you will implement Heap Sort algorithm in C++ and Report the number of...
In this lab, you will implement Heap Sort algorithm in C++ and Report the number of steps and the CPU running time in a table, please show the code and output Approximation the constant c in the complexity of heap sort (cnlgn) by inspecting the results For each algorithm, and for each n = 100, 200, 300, 400, 500, 1000, 4000, 10000, measure its running time and number of steps when the input is (1) already sort, i.e. n, n-1,...
Suppose you are using a new black-box classification algorithm and you have computed training error and...
Suppose you are using a new black-box classification algorithm and you have computed training error and test error using this algorithm. (a) When will you say overfitting is taking place? (b) When will you say underfitting is taking place?
The Down Towner is considering a project with a life of 4 years that will require...
The Down Towner is considering a project with a life of 4 years that will require $164,800 for fixed assets and $42,400 for net working capital. The fixed assets will be depreciated using the Year 2018 bonus depreciation method. At the end of the project, the fixed assets can be sold for $37,500 cash and the net working capital will return to its original level. The project is expected to generate annual sales of $195,000 and costs of $117,500. The...
(50’) Implement Quick-Sort algorithm in quickSort.cpp, where you are expected to implement three functions, swap(), partition()...
(50’) Implement Quick-Sort algorithm in quickSort.cpp, where you are expected to implement three functions, swap(), partition() and quickSort(). You are expected to call swap() within partition(), to call partition() within quickSort(), and you are not expected to declare/ implement other additional functions nor change the main() function. OPTIONAL: If you don’t need/ want to use swap() in your implementation, that is fine. Just delete/ comment it. quickSort.cpp #include <iostream> using namespace std;    // A helper function to facilitate swapping...
For this assignment, you will implement the minimax algorithm with alpha-beta pruning in order to find...
For this assignment, you will implement the minimax algorithm with alpha-beta pruning in order to find the optimal move for a game of generalized tic-tac-toe. In this version of the game, the players can choose different board sizes, for example 4x4 or 5x5, instead of the normal 3x3. The game proceeds with the usual rules for tic-tac-toe (see https://en.wikipedia.org/wiki/Tic-tac-toe). Requirements You are to modify the mp2basecode (below) program to implement the alpha-beta search for making the computer’s move. This will...
ADVERTISEMENT
ADVERTISEMENT
ADVERTISEMENT