A large data set is separated into a training set and a test set.
(a) Is it necessary to do this randomly? Why or why not?
(b) In R how might this separation be done in a reproducible way?
(c) The statistician chooses 20% of the data for training and 80% for testing. Comment briefly on this—2 or 3 lines would be plenty.
(a) Yes, the data should be separated into a training set and a test set randomly.
If the split is not random and, say, the first 80% of the rows are taken as training data and the remaining 20% as test data, the two sets may differ systematically. Data sets are often stored in some order (by date, by group, by value of the response), so certain kinds of observations may appear only in the last part of the data. If none of these observations are in the training set, the model never sees them while it is being fitted, and its predictions for observations of that kind in the test set can have very large errors.
Hence, it is necessary to separate the data randomly, so that both the training set and the test set contain most of the kinds of observations present in the data.
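To illustrate the point in (a), here is a small sketch in R. The data frame dat, its columns, and the seed value are made up for illustration; the only assumption that matters is that the rows are stored in group order.

```r
# Hypothetical data stored in group order: 90 rows of group "A", then 10 rows of group "B".
dat <- data.frame(id = 1:100, group = rep(c("A", "B"), times = c(90, 10)))

# Non-random split: first 80% of the rows for training, last 20% for testing.
train_seq <- dat[1:80, ]
test_seq  <- dat[81:100, ]
table(train_seq$group)  # only group "A" appears in the training set
table(test_seq$group)   # all 10 group "B" rows land in the test set

# Random split: draw 80 row indices at random, without replacement.
set.seed(2024)          # arbitrary seed, chosen only for illustration
idx <- sample(nrow(dat), size = 80)
train_rand <- dat[idx, ]
test_rand  <- dat[-idx, ]
table(train_rand$group) # exact counts depend on the seed
table(test_rand$group)
```

With the sequential split the model would never see a group "B" row during fitting. With the random split, both groups would usually be represented in both sets.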
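(b) Fix the random seed with set.seed() before drawing the random split with sample(). With the same seed and the same data, the same rows are selected every time the code is run, so the split can be reproduced exactly.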
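A minimal sketch of a reproducible split in base R. The helper split_data(), the example data frame dat, the 80/20 proportion, and the seed values are assumptions made for illustration; set.seed() and sample() are the standard base R functions.

```r
# Hypothetical helper: split a data frame into training and test sets reproducibly.
split_data <- function(dat, train_frac = 0.8, seed = 123) {
  set.seed(seed)                                  # fix the random number generator state
  n <- nrow(dat)
  train_idx <- sample(n, size = round(train_frac * n))  # row indices, without replacement
  list(train = dat[train_idx, ], test = dat[-train_idx, ])
}

# Made-up example data: 100 rows with a predictor x and a response y.
set.seed(1)
dat <- data.frame(x = rnorm(100))
dat$y <- 2 * dat$x + rnorm(100)

parts <- split_data(dat)
nrow(parts$train)  # 80 rows
nrow(parts$test)   # 20 rows
```

Calling split_data(dat) again with the same seed returns exactly the same rows. Changing the seed gives a different, but still reproducible, split.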
(c) Suppose there are 100 observations in the whole data set. Choosing 20% for training and 80% for testing puts 20 of the 100 observations in the training set and 80 in the test set. This ratio is not a good idea: the model is built on the basis of very few observations, and the major part of the data is not used in fitting it. The fitted model is therefore likely to be poorly estimated and to have a large prediction error. A more common choice is the reverse, roughly 80% for training and 20% for testing.
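The effect described in (c) can be checked with a small simulation. This is only a sketch under stated assumptions: the function simulate_test_mse(), the simulated linear model with 10 predictors, the number of repetitions, and the seed are all hypothetical choices, not part of the original question.

```r
# Hypothetical simulation: average test mean squared error of a linear model
# for a given training fraction, over repeated simulated data sets.
simulate_test_mse <- function(train_frac, n = 100, p = 10, reps = 200, seed = 42) {
  set.seed(seed)
  mse <- numeric(reps)
  for (r in seq_len(reps)) {
    X <- matrix(rnorm(n * p), nrow = n)             # p standard normal predictors
    y <- as.vector(X %*% rep(1, p)) + rnorm(n)      # linear signal plus unit-variance noise
    dat <- data.frame(y = y, X)                     # predictor columns are named X1, ..., Xp
    idx <- sample(n, size = round(train_frac * n))  # random training rows
    fit <- lm(y ~ ., data = dat[idx, ])             # fit on the training set only
    pred <- predict(fit, newdata = dat[-idx, ])     # predict the held-out rows
    mse[r] <- mean((dat$y[-idx] - pred)^2)
  }
  mean(mse)
}

mse_20 <- simulate_test_mse(0.2)  # 20 training rows, 80 test rows
mse_80 <- simulate_test_mse(0.8)  # 80 training rows, 20 test rows
c(train_20 = mse_20, train_80 = mse_80)
```

In a setup like this, the model fitted on 20 observations would be expected to show a noticeably larger average test error than the one fitted on 80, because 11 coefficients are being estimated from very little data.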
Hope this was helpful. Please leave a comment with any feedback.