Question

A large data set is separated into a training set and a test set.

(a) Is it necessary to do this randomly? Why or why not?

(b) In R how might this separation be done in a reproducible way?

(c) The statistician chooses 20% of the data for training and 80% for testing. Comment briefly on this—2 or 3 lines would be plenty.

Solutions

Expert Solution

(a) Yes, it is necessary to separate the data into a training set and a test set randomly.

If the split is not random, say the first 80% of rows go to training and the remaining 20% to testing, some kinds of observations may appear only in the early part of the data, and some peculiar observations may appear only in the last part. If none of the peculiar observations that end up in the test set were used when fitting the model, we would get large errors when predicting observations of that kind.

Hence, it is necessary to separate the data randomly so that both the training data and the test data contain most kinds of observations.
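
To illustrate the point above, here is a small sketch using R's built-in iris data, which is stored sorted by species (50 rows each). The choice of iris, the 80/20 share, and the seed are assumptions made for illustration only; they are not part of the original question.

```r
# iris is ordered by Species, so a sequential split is lopsided.
n <- nrow(iris)
n_train <- floor(0.8 * n)

# Sequential split: first 80% of rows for training, the rest for testing.
seq_train <- iris[1:n_train, ]
seq_test  <- iris[(n_train + 1):n, ]
table(seq_test$Species)    # the test set contains only virginica

# Random split: rows are drawn at random, so species are mixed in both sets.
set.seed(1)                # arbitrary seed, fixed only so the sketch is repeatable
idx <- sample(n, n_train)
rand_train <- iris[idx, ]
rand_test  <- iris[-idx, ]
table(rand_test$Species)
```

With the sequential split, setosa and versicolor never appear in the test set, and virginica is under-represented in training, so the test error says little about how the model does on the data as a whole.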

(b)
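
One common way, sketched here under assumptions, is to fix the random seed with set.seed() before drawing the training rows with sample(). Running the code again with the same seed then gives exactly the same split. The data frame name dat, the seed value 123, and the 80% training share are placeholders, not values given in the question.

```r
# `dat` stands in for the real data set; iris is used only for illustration.
dat <- iris

set.seed(123)                                   # any fixed integer works
n <- nrow(dat)
train_idx <- sample(n, size = floor(0.8 * n))   # row numbers for the training set

train <- dat[train_idx, ]                       # rows that were drawn
test  <- dat[-train_idx, ]                      # all remaining rows
```

Saving train_idx (or the seed) alongside the analysis lets anyone recreate the same training and test sets later.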

(c) Suppose there are 100 observations in the whole data set. Choosing 20% for training and 80% for testing sends 20 of the 100 observations to the training set and 80 to the test set. This ratio is not a good idea: the model is built on very few observations, most of the data set plays no part in fitting it, and the prediction error is therefore likely to be large.

Hope this was helpful. Please leave any comments.

