In: Statistics and Probability
Suppose that you take a data set, divide it into equally-sized training and test sets, and
then try out two different classification procedures. First you use logistic regression and
get an error rate of 20% on the training data and 30% on the test data. Next, you use
1-nearest neighbors (i.e. K = 1) and get an average error rate (averaged over both test
and training data sets) of 18%. Based on these results, which method should we prefer
to use for classification of new observations? Why?
Solution:
In this case, it is better to use logistic regression for classification of new observations.
Explanation:
With logistic regression on equally sized training and test sets, we get an error rate of 20% on the training data and 30% on the test data. The average error rate (averaged over both sets) is therefore (20 + 30)/2 = 25%, and the test error rate is 30%.
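The averaging above can be checked with a few lines of arithmetic (the error rates are the ones stated in the problem; the variable names are just for illustration):

```python
# Error rates given in the problem for logistic regression.
logit_train_err = 0.20  # 20% training error
logit_test_err = 0.30   # 30% test error

# Because the training and test sets are equally sized, the average
# over both sets is a simple mean of the two rates.
logit_avg_err = (logit_train_err + logit_test_err) / 2
print(logit_avg_err)  # 0.25, i.e. 25%
```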
KNN is one of the simplest non-parametric, lazy models. When K = 1, each object is simply assigned the class of its single nearest neighbor. On the training data, every training sample's nearest neighbor is itself (distance zero), so it is never misclassified; the training error rate of 1-NN is therefore 0% regardless of the dataset. Given the reported average error rate of 18%, the test error rate of 1-NN must be 2 × 18% = 36%, which is greater than the 30% test error rate of logistic regression. Since the test error rate, not the training error rate, reflects performance on new observations, the model with the lower test error is the better classifier, and here that is logistic regression.
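The zero-training-error property of 1-NN can be demonstrated with a minimal sketch (pure Python, using a hypothetical toy dataset; any dataset without duplicate points of different classes gives the same result):

```python
# Minimal 1-nearest-neighbor classifier, to illustrate why the training
# error of 1-NN is always 0%: each training point's nearest neighbor in
# the training set is itself, at distance zero.

def predict_1nn(train_X, train_y, x):
    # Squared Euclidean distance from x to every training point.
    dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in train_X]
    # Label of the closest training point.
    return train_y[dists.index(min(dists))]

# Toy training set (assumed purely for illustration).
train_X = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (2.0, 2.0)]
train_y = ["A", "B", "A", "B"]

# Evaluating on the training set itself: every point picks itself.
train_errors = sum(predict_1nn(train_X, train_y, x) != y
                   for x, y in zip(train_X, train_y))
print(train_errors / len(train_X))  # 0.0 training error rate

# With a 0% training error, the 18% average implies the test error:
knn_test_err = 2 * 0.18 - 0.0
print(knn_test_err)  # 0.36, i.e. 36%, worse than logistic regression's 30%
```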