In: Computer Science
Project 4 – The Nearest Neighbors Classification Algorithm This project will require you to implement a version of the Nearest Neighbors classification algorithm. This version, the Three Nearest Neighbors (or 3NN for short), is one of the more intuitive classification algorithms, and one of the easier ones to code. This Nearest Neighbors family of algorithms is useful because it does not require any special “training” to work. You simply need previous data to compare to. What is classification? With classification, the goal is to use previously collected data to train an algorithm to correctly classify future data. If I have a data set on the measurements of the sepal (the part of a flower that encloses a petal) length and width of two kinds of irises, setosa and versicolor, I can use that data to predict what kind of iris another flower is from sepal measurements alone. Our classification algorithm would be trained to draw a line between the two regions we can see in the scatterplot, as so: Figure 1 Figure 2 Deciding that future data on one side of the line should be setosa, and the other side should be versicolor. Note that we would not expect such a classifier to be perfect: You can see that one of the data points for setosa is on the versicolor side of the boundary in this example. Rather, we would expect the classification algorithm to work an acceptable amount of the time. What is acceptable depends on the application. Definitions Training Data – Data used when implementing the classification algorithm to draw boundaries. In the above example, the x and o data points are the training data. Class – Different categories that training data has been divided into. In the above example, setosa and versicolor are classes. Test Data – Data for which we do not know the class. We will use the classification algorithm to help us guess which class it might come from. Data Space – The data involved with training and testing, in all of its demensions Nearest Neighbors Classifier The nearest neighbor classifier is relatively straightforward. First, it takes all of the dimensions of training data available and normalizes each dimension. An incoming piece of test data is also normalized based on the means and standard deviations of the training data. Once both test and training data are normalized, we wind up with a data space that might look like this: In the nearest neighbors algorithm, we look at the Euclidian distance between the test data and the training data. We then look at the closest K neighbors, where K is determined before training. In this example, and in this project, K = 3. Those closest neighbors then “vote,” with the class that has the largest number of closest neighbors being assigned as the class of the test data. Figure 3 In the above example (test data was sepal length 6 cm, sepal width 4cm), the 3NN algorithm classifies the test data as setosa: Because the three closest neighbors are all of the class setosa. In another test data case, with sepal length 5.75 cm and sepal width 3.5 cm, we also find that the class is setosa: As two of the three closest neighbors are of that class. In a third test data case, with sepal length as 5.2 cm and sepal width as 2.9 cm, we find the class is versicolor: Figure 4 Figure 5 As two of the three closest neighbors are of that class. Inputs and Outputs In this project, you will implement the 3NN algorithm in a MATLAB function. This algorithm will only work with data that is in 2 dimensions with 2 classes. This function will have two inputs: dataSet – A structure with the following fields: - data1 – A numerical array of doubles, the first dimension of a data set, of length N - data2 – A numerical array of doubles, the second dimension of a data set, also of length N - label – A cell array, the class labels of your data set. Each element is a vector string. Also of length N - name1 – A string with the first possible class label - name2 – A string with the second possible class label testData – A 1 x 2 array of doubles, where the first element is the value of your test data in dimension 1, and the second element is the value of your test data in dimension 2. This function will have one output: winner – A string with the label of the class assigned to testData. Your function should also produce a plot of the normalized Data space similar to figure 3. To create a plot of points instead of lines, use the following inputs to plot: plot(x,y,'Marker','o','LineStyle','none') Where the string after the input “Marker” determines the shape of each point, and “ ‘LineStyle’, ‘none’ “ prevents any lines from being drawn between points. Methods Figure 6 Please use the following methods to implement the 3NN algorithm: Step 1 Normalize both dimensions of your training data set. Then normalize your test data using the mean and standard deviation of your training data. You now have enough data to generate the required plot. Step 2 Find the Euclidian distance between your test data and all training data using the Pythagorean theorem with your normalized data. Step 3 Find the three points in the training dataset that are closest to your test data. Step 4 Have the three nearest neighbors “vote” to see which class has more representation in the nearest neighbors. Hint: This is where you will want to use the label, name1, and name2 fields. Use the result of voting to set the output. Testing Code I have prepared three test datasets for you, that you can find in the files: - iris.mat, which is the data set in the above examples - diabetes.mat, which is blood glucose and ages of patients 5 years prior for people with and without diabetes - wheat.mat, which is the area and compactness of the kernals of two varieties of wheat seed You should be able to use your visualizations to check whether your nearest neighbors algorithms are working properly. Hint: Try choosing test data in some of the less “crowded” locations of your data space.