In: Statistics and Probability
Suppose that you take a data set, divide it into equally-sized training and test sets, and
then try out two different classification procedures. First you use logistic regression and
get an error rate of 20% on the training data and 30% on the test data. Next, you use
1-nearest neighbors (i.e. K = 1) and get an average error rate (averaged over both test
and training data sets) of 18%. Based on these results, which method should we prefer
to use for classification of new observations? Why?
Solution:
In this case, it is better to use logistic regression for classification of new observations.
Explanation:
With logistic regression on equally sized training and test sets, we get an error rate of 20% on the training data and 30% on the test data. The average error rate (averaged over both sets) is therefore (20 + 30)/2 = 25%, and the test error rate is 30%.
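The averaging above can be checked with a few lines of arithmetic (the error rates are the ones stated in the problem; the variable names are just for illustration):

```python
# Error rates given in the problem for logistic regression.
logit_train_err = 0.20  # 20% training error
logit_test_err = 0.30   # 30% test error

# Because the training and test sets are equally sized, the average
# over both sets is a simple mean of the two rates.
logit_avg_err = (logit_train_err + logit_test_err) / 2
print(logit_avg_err)  # 0.25, i.e. 25%
```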
KNN is one of the simplest non-parametric, lazy models. When K = 1, each object is simply assigned the class of its single nearest neighbor. On the training data, every training sample's nearest neighbor is itself (distance zero), so it is never misclassified; the training error rate of 1-NN is therefore 0% regardless of the dataset. Given the reported average error rate of 18%, the test error rate of 1-NN must be 2 × 18% = 36%, which is greater than the 30% test error rate of logistic regression. Since the test error rate, not the training error rate, reflects performance on new observations, the model with the lower test error is the better classifier, and here that is logistic regression.
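The zero-training-error property of 1-NN can be demonstrated with a minimal sketch (pure Python, using a hypothetical toy dataset; any dataset without duplicate points of different classes gives the same result):

```python
# Minimal 1-nearest-neighbor classifier, to illustrate why the training
# error of 1-NN is always 0%: each training point's nearest neighbor in
# the training set is itself, at distance zero.

def predict_1nn(train_X, train_y, x):
    # Squared Euclidean distance from x to every training point.
    dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in train_X]
    # Label of the closest training point.
    return train_y[dists.index(min(dists))]

# Toy training set (assumed purely for illustration).
train_X = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (2.0, 2.0)]
train_y = ["A", "B", "A", "B"]

# Evaluating on the training set itself: every point picks itself.
train_errors = sum(predict_1nn(train_X, train_y, x) != y
                   for x, y in zip(train_X, train_y))
print(train_errors / len(train_X))  # 0.0 training error rate

# With a 0% training error, the 18% average implies the test error:
knn_test_err = 2 * 0.18 - 0.0
print(knn_test_err)  # 0.36, i.e. 36%, worse than logistic regression's 30%
```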