The library would like to compare the regression and exponential smoothing models to determine which is the better predictor, using the mean absolute error, Σ |(books borrowed) – (model’s estimate)| / n, as the measure of prediction quality.
Select the best of the following four options for splitting the data:
A. 15% for training, 15% for validation, 70% for test
B. 15% for training, 70% for validation, 15% for test
C. 70% for training, 15% for validation, 15% for test
D. 55% for training, 15% for cross-validation, 15% for validation, 15% for test
The person who built these models discovered that although the regression model performed much better on the training set, the two models performed about the same on the validation set:
Model                          Mean absolute error (training set)   Mean absolute error (validation set)
Regression model               110                                  140
Exponential smoothing model    140                                  150
Select all reasonable suggestions below:
A. To choose between the models, we should see which one does better on the training set.
B. The regression model is clearly better, because it does better on the training set and about the same on the validation set.
C. The regression model is probably fit too much to random patterns (i.e., it is overfit), because it performs much worse on the validation set than on the training set.
D. If there had been 20 models, the one that performed best on the validation set would probably not perform as well on the test set as it did on the validation set.
The best splitting strategy is to use most of the data for training and equal, smaller amounts for validation and testing.
Hence choice C is the most appropriate:
C. 70% for training, 15% for validation, 15% for test
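As a rough illustration of choice C, here is a minimal sketch of a 70% / 15% / 15% split. The function name `split_70_15_15`, the seed, and the random shuffle are assumptions for illustration only; they are not part of the original question. For time-ordered borrowing data, a chronological split may be more appropriate than shuffling.

```python
import random


def split_70_15_15(records, seed=0):
    """Shuffle a copy of the records and split it 70% / 15% / 15%."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # assumption: order does not matter

    n = len(shuffled)
    n_train = n * 70 // 100  # integer arithmetic avoids float rounding
    n_val = n * 15 // 100

    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains (about 15%)
    return train, validation, test


# Hypothetical data: 200 record IDs standing in for the library's observations.
train, validation, test = split_70_15_15(range(200))
print(len(train), len(validation), len(test))
```

Each record lands in exactly one of the three sets, so the test set stays untouched until the final model has been chosen.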
A. To choose between the models, we should see which one does better on the training set.
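No. We should choose the model that performs better on the validation set, then use the test set to estimate how well the chosen model really performs. A model can be overfit to the training set, so training error is not a reliable basis for comparison.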
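A minimal sketch of that selection step, assuming each model's estimates on the validation set are already available. The numbers below are made up for illustration and are not the library's data; the names `mean_absolute_error` and `validation_mae` are likewise hypothetical.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all observations."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical validation-set values (books borrowed and each model's estimates).
borrowed = [500, 620, 580, 700]
regression_estimates = [520, 600, 610, 690]
smoothing_estimates = [480, 650, 560, 730]

validation_mae = {
    "regression": mean_absolute_error(borrowed, regression_estimates),
    "exponential smoothing": mean_absolute_error(borrowed, smoothing_estimates),
}

# Pick the model with the lowest validation error, not the lowest training error.
best_model = min(validation_mae, key=validation_mae.get)
print(validation_mae, best_model)
```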
B. The regression model is clearly better, because it does better
on the training set and about the same on the validation set.
No. On the validation set the two models are close (140 for regression vs. 150 for exponential smoothing), so the regression model's large advantage on the training set (110 vs. 140) does not make it clearly better.
C. The regression model is probably fit too much to random patterns
(i.e., it is overfit), because it performs much worse on the
validation set than on the training set.
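Yes, this could be the case. The regression model's error rises from 110 on the training set to 140 on the validation set, a much larger jump than for exponential smoothing (140 to 150). Low training error combined with noticeably higher validation error is a classic sign of overfitting.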
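The size of the training-to-validation gap can be worked out directly from the table in the question. This is only a sketch: the helper `relative_gap` is a hypothetical name, and a large gap is a warning sign rather than proof of overfitting.

```python
# Mean absolute errors taken from the table in the question.
mae = {
    "regression": {"training": 110, "validation": 140},
    "exponential smoothing": {"training": 140, "validation": 150},
}


def relative_gap(errors):
    """Increase in error from training to validation, relative to training."""
    return (errors["validation"] - errors["training"]) / errors["training"]


gaps = {name: relative_gap(errors) for name, errors in mae.items()}

for name, gap in gaps.items():
    print(f"{name}: validation error is {gap:.1%} higher than training error")
```

The regression model's error grows by roughly 27% on new data, compared with roughly 7% for exponential smoothing.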
D. If there had been 20 models, the one that performed best on the validation set would probably not perform as well on the test set as it did on the validation set.
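Yes, this is a reasonable suggestion. Each model's validation error reflects both its real quality and random chance. When the best of 20 models is picked by validation error, the winner is likely to be one that got lucky on the validation set, so its validation error is an optimistic estimate. On the test set that luck is unlikely to repeat, so the model would probably perform somewhat worse there. This is why a separate test set is needed to estimate the chosen model's true performance.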
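Option D can be checked with a small simulation. The sketch below assumes an extreme case in which all 20 models are equally good (true mean absolute error of 150) and each measured error is the true value plus random noise. The function name `simulate`, the noise level, and the number of trials are all assumptions chosen for illustration.

```python
import random


def simulate(n_models=20, n_trials=2000, true_mae=150.0, noise_sd=10.0, seed=0):
    """Average validation and test error of the model picked on validation."""
    rng = random.Random(seed)
    total_val = 0.0
    total_test = 0.0
    for _ in range(n_trials):
        # Assumption: every model has the same true error; only noise differs.
        val = [rng.gauss(true_mae, noise_sd) for _ in range(n_models)]
        test = [rng.gauss(true_mae, noise_sd) for _ in range(n_models)]

        # Select the model with the lowest validation error.
        best = min(range(n_models), key=val.__getitem__)
        total_val += val[best]
        total_test += test[best]
    return total_val / n_trials, total_test / n_trials


avg_val, avg_test = simulate()
print(f"selected model, validation MAE: {avg_val:.1f}")
print(f"selected model, test MAE:       {avg_test:.1f}")
```

Under these assumptions the selected model's validation error comes out clearly below its test error, even though no model is actually better than the others. The gap is produced entirely by selecting the minimum of 20 noisy measurements.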