Question

Show that as the number of bootstrap samples B gets large, the oob error estimate for a random forest approaches its N-fold CV error estimate, and that in the limit, the identity is exact.

Solutions

Expert Solution

  • Training error (as in predict(model, data=train)) is typically useless. Unless you do (non-standard) pruning of the trees, it cannot be much above 0 by design of the algorithm: a random forest uses bootstrap aggregation of decision trees, which are known to overfit badly. This is like the training error of a 1-nearest-neighbour classifier.

  • However, the algorithm offers a very elegant way of computing the out-of-bag error estimate, which is essentially an out-of-bootstrap estimate of the aggregated model's error. For each case, the out-of-bag error is the estimated error from aggregating the predictions of the ≈1/e fraction of the trees that were trained without that particular case.
    The models aggregated for the out-of-bag error will only be independent if there is no dependence between the input data rows, i.e. each row = one independent case, with no hierarchical data structure, no clustering and no repeated measurements.

    So the out-of-bag error is not exactly the same as a cross-validation error (fewer trees for aggregating, more copies of each training case), but for practical purposes it is close enough.

  • To detect overfitting, it makes sense to compare the out-of-bag error with an external validation. However, unless you know about clustering in your data, a "simple" cross-validation error will be prone to the same optimistic bias as the out-of-bag error, because the splitting is done according to very similar principles.
    You would need to compare the out-of-bag or cross-validation error with the error from a well-designed test experiment to detect this.
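The "≈1/e fraction" quoted above follows from bootstrap sampling: a given case is missed by one draw with probability 1 − 1/N, so it is missing from a whole bootstrap sample of N draws with probability (1 − 1/N)^N, which tends to 1/e ≈ 0.368 as N grows. A minimal simulation sketch of this, using only the Python standard library; the function name oob_fraction and the sizes 200 and 2000 are illustrative assumptions, not part of the original answer:

```python
import math
import random


def oob_fraction(n_cases: int, n_trees: int, seed: int = 0) -> float:
    """Fraction of bootstrap samples (one per tree) that leave out case 0."""
    rng = random.Random(seed)
    left_out = 0
    for _ in range(n_trees):
        # One bootstrap sample: n_cases draws with replacement.
        drawn = {rng.randrange(n_cases) for _ in range(n_cases)}
        if 0 not in drawn:
            left_out += 1
    return left_out / n_trees


# Exact probability of being left out of one bootstrap sample for N = 200.
exact = (1 - 1 / 200) ** 200
# Simulated share of 2000 trees for which case 0 is out-of-bag.
simulated = oob_fraction(n_cases=200, n_trees=2000)
print(f"exact: {exact:.3f}, simulated: {simulated:.3f}, 1/e: {math.exp(-1):.3f}")
```

With B trees in the forest, each case is therefore predicted by roughly B/e of them, which is why the out-of-bag aggregate uses fewer trees than the full model.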

Out-of-bag error is useful, and may replace other performance estimation protocols (like cross-validation), but should be used with care.

Like cross-validation, performance estimation using out-of-bag samples is computed using data that were not used for learning. If the data have been processed in a way that transfers information across samples, the estimate will (probably) be biased. Simple examples that come to mind are performing feature selection or missing value imputation. In both cases (and especially for feature selection) the data are transformed using information from the whole data set, biasing the estimate.
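The feature-selection leak can be sketched on pure noise, where the true accuracy of any classifier is 0.5. The example below is a stand-in under stated assumptions rather than a random forest: it uses leave-one-out validation with a nearest-centroid classifier, and all names (top_features, predict, loo_accuracy) and sizes are hypothetical. The mechanism is the same for out-of-bag estimates: features chosen with the labels of the held-out cases carry information about those cases.

```python
import random


def top_features(X, y, rows, k):
    """Indices of the k features with the largest class-mean gap on `rows`."""
    def gap(j):
        a = [X[i][j] for i in rows if y[i] == 0]
        b = [X[i][j] for i in rows if y[i] == 1]
        return abs(sum(a) / len(a) - sum(b) / len(b))
    return sorted(range(len(X[0])), key=gap, reverse=True)[:k]


def predict(X, y, rows, feats, i):
    """Nearest-centroid prediction for case i, with centroids from `rows`."""
    dist = []
    for label in (0, 1):
        members = [r for r in rows if y[r] == label]
        d = 0.0
        for j in feats:
            centre = sum(X[r][j] for r in members) / len(members)
            d += (X[i][j] - centre) ** 2
        dist.append(d)
    return 0 if dist[0] <= dist[1] else 1


def loo_accuracy(X, y, k, select_inside_fold):
    """Leave-one-out accuracy, selecting features inside or outside the folds."""
    n = len(X)
    all_rows = list(range(n))
    leaked = top_features(X, y, all_rows, k)  # uses every case, held-out included
    hits = 0
    for i in all_rows:
        rows = [r for r in all_rows if r != i]
        feats = top_features(X, y, rows, k) if select_inside_fold else leaked
        hits += predict(X, y, rows, feats, i) == y[i]
    return hits / n


rng = random.Random(1)
n_cases, n_features = 40, 500
X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_cases)]
y = [i % 2 for i in range(n_cases)]  # labels unrelated to X: true accuracy is 0.5

leaky = loo_accuracy(X, y, k=10, select_inside_fold=False)
honest = loo_accuracy(X, y, k=10, select_inside_fold=True)
print(f"selection on all data: {leaky:.2f}; selection inside each fold: {honest:.2f}")
```

Selecting on the whole data set should report an accuracy well above chance even though there is no signal, while selecting inside each fold should stay near 0.5.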

