Show that as the number of bootstrap samples B gets large, the oob error estimate for a random forest approaches its N-fold CV error estimate, and that in the limit, the identity is exact.
Training error (as in predict(model, data=train)) is typically useless. Unless you do (non-standard) pruning of the trees, it cannot be much above 0 by design of the algorithm: random forest uses bootstrap aggregation of unpruned decision trees, which are known to overfit badly. This is like the training error of a 1-nearest-neighbour classifier.
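To make that concrete, here is a minimal R sketch (using the randomForest package and the built-in iris data, which are my choice of illustration, not part of the question): the resubstitution error is essentially zero, while the out-of-bag error is a usable estimate.

    library(randomForest)

    rf <- randomForest(Species ~ ., data = iris)        # default: 500 trees
    mean(predict(rf, newdata = iris) != iris$Species)   # training error: ~0 by construction
    mean(predict(rf) != iris$Species)                   # without newdata, predict() returns OOB predictions
    rf$err.rate[rf$ntree, "OOB"]                        # the same OOB error, stored in the fit object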
However, the algorithm offers a very elegant way of computing the out-of-bag error estimate (which is essentially an out-of-bootstrap estimate of the aggregated model's error). The out-of-bag error is the estimated error for aggregating the predictions of the ≈1/e fraction of the trees that were trained without that particular case.
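As a quick check on that ≈1/e figure: a given case is absent from a single bootstrap sample of size n with probability (1 - 1/n)^n, which converges to 1/e ≈ 0.368 as n grows. A couple of lines of base R make the convergence visible:

    # P(a given case is absent from one bootstrap sample of size n) = (1 - 1/n)^n,
    # which approaches exp(-1) ~ 0.368 as n grows.
    n <- c(10, 100, 1000, 10000)
    cbind(n, p_out_of_bag = (1 - 1/n)^n, limit = exp(-1))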
The models aggregated for the out-of-bag error are only independent of the held-out case if there is no dependence between the input data rows, i.e. each row is one independent case: no hierarchical data structure, no clustering, no repeated measurements.
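A small simulation sketch of what goes wrong with repeated measurements (a pure-noise setup of my own construction, again with randomForest): the bag / out-of-bag split separates rows rather than subjects, so the in-bag twin of an out-of-bag row leaks its label.

    library(randomForest)

    set.seed(1)
    x <- matrix(rnorm(100 * 10), ncol = 10)                # 100 independent "subjects"
    y <- factor(sample(c("a", "b"), 100, replace = TRUE))  # labels carry no signal
    dup <- rep(1:100, each = 2)                            # two identical rows per subject
    rf <- randomForest(x[dup, ], y[dup])
    rf$err.rate[rf$ntree, "OOB"]                           # optimistic: clearly below the 0.5 expected for noise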
So the out-of-bag error is not exactly the same as a cross-validation error (fewer trees aggregated per prediction, duplicated copies of training cases within each bootstrap sample), but for practical purposes it is close enough.
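For data without such structure, the two estimates can be compared directly; a rough sketch with a hand-rolled 5-fold split (iris again, purely illustrative):

    library(randomForest)

    set.seed(2)
    folds <- sample(rep(1:5, length.out = nrow(iris)))
    cv_pred <- factor(rep(NA, nrow(iris)), levels = levels(iris$Species))
    for (k in 1:5) {
      fit <- randomForest(Species ~ ., data = iris[folds != k, ])
      cv_pred[folds == k] <- predict(fit, newdata = iris[folds == k, ])
    }
    mean(cv_pred != iris$Species)                          # 5-fold CV error
    rf <- randomForest(Species ~ ., data = iris)
    rf$err.rate[rf$ntree, "OOB"]                           # OOB error, typically very close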
What would make sense to look at in order to detect overfitting is comparing the out-of-bag error with an external validation. However, unless you know about the clustering in your data, a "simple" cross-validation error will be prone to the same optimistic bias as the out-of-bag error: the splitting is done according to very similar principles.
You'd need to compare the out-of-bag or cross-validation error with the error from a well-designed, independent test experiment to detect this.
Out-of-bag error is useful and may replace other performance estimation protocols (like cross-validation), but it should be used with care.
Like cross-validation, performance estimation using out-of-bag samples is computed using data that were not used for learning. If the data have been processed in a way that transfers information across samples, the estimate will (probably) be biased. Simple examples that come to mind are performing feature selection or missing value imputation. In both cases (and especially for feature selection) the data are transformed using information from the whole data set, biasing the estimate.
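A rough sketch of the feature-selection case (a pure-noise simulation with a simple t-test filter, my own illustrative setup): the honest error rate is 50 %, but selecting features on the full data set before fitting makes the out-of-bag error look far better than that.

    library(randomForest)

    set.seed(3)
    x <- matrix(rnorm(50 * 1000), nrow = 50)               # 50 cases, 1000 uninformative features
    y <- factor(rep(c("a", "b"), each = 25))               # labels carry no signal
    scores <- apply(x, 2, function(col) abs(t.test(col ~ y)$statistic))
    top <- order(scores, decreasing = TRUE)[1:10]          # selection uses *all* cases, including future OOB ones
    rf <- randomForest(x[, top], y)
    rf$err.rate[rf$ntree, "OOB"]                           # optimistically low despite pure noise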