Some data mining algorithms work so “well” that they have a tendency to overfit the training data. What does the term "overfit" mean, and what difficulties does overlooking it cause for the data scientist?
The training set is the data used to fit the algorithm. Overfitting means the model suffers from high variance: it may perform well on the training set yet poorly on the test set, because it fits the training data too closely and even chases outliers and noise. In short, the model fails to generalize its results, which makes it less robust.
Overfitting can be detected by comparing the training-set error with the test-set error (training error is generally lower than test error). A large gap between the two indicates that overfitting exists.
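A toy sketch of that diagnostic, using only the standard library: a "memorizing" model (predict the label of the nearest training point, an extreme overfitter) is compared against a plain linear fit on noisy linear data. The data, the memorizer, and the closed-form line fit are all illustrative choices, not any specific algorithm from the question.

```python
import random

random.seed(0)

def f(x):
    return 2 * x + 1          # true underlying relationship

# noisy training and test samples of the same relationship
train = [(x, f(x) + random.gauss(0, 1)) for x in [i / 10 for i in range(50)]]
test = [(x, f(x) + random.gauss(0, 1)) for x in [i / 10 + 0.05 for i in range(50)]]

def memorize(x):
    # extreme overfitter: return the training label of the nearest training x
    return min(train, key=lambda p: abs(p[0] - x))[1]

# simple linear model y = a*x + b fit by least squares (closed form)
n = len(train)
sx = sum(x for x, _ in train); sy = sum(y for _, y in train)
sxx = sum(x * x for x, _ in train); sxy = sum(x * y for x, y in train)
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - a * sx) / n

def linear(x):
    return a * x + b

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# the train/test error gap is the overfitting diagnostic
mem_gap = mse(memorize, test) - mse(memorize, train)
lin_gap = mse(linear, test) - mse(linear, train)

print("memorizer: train MSE", mse(memorize, train), "test MSE", mse(memorize, test))
print("linear:    train MSE", mse(linear, train), "test MSE", mse(linear, test))
```

The memorizer has zero training error but a large test error (it has fit the noise), so its gap is large; the linear model's train and test errors are close, which is the signature of a model that generalizes.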
If you overlook this problem, your algorithm may appear to work well on the training set while performing much worse on the test set, so the accuracy you report will be misleading and the model will disappoint on new data.
Overfitting can be reduced by applying regularization or by increasing the size of the training dataset.
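To illustrate the regularization point, here is a minimal ridge-regression sketch in pure Python (no external libraries): a degree-7 polynomial is fit to 12 noisy points via the normal equations, with and without an L2 penalty. The data, the degree, and the penalty strength λ = 0.1 are all arbitrary choices for the demo.

```python
import math
import random

random.seed(1)

# noisy samples of a smooth function; few points + high degree invites overfitting
xs = [i / 11 for i in range(12)]
ys = [math.sin(2 * math.pi * x) + random.gauss(0, 0.3) for x in xs]

DEGREE = 7

def features(x):
    return [x ** d for d in range(DEGREE + 1)]

def solve(A, b):
    """Gaussian elimination with partial pivoting for A w = b."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (M[r][n] - sum(M[r][c] * w[c] for c in range(r + 1, n))) / M[r][r]
    return w

def ridge_fit(lam):
    """Minimize sum (w.phi(x) - y)^2 + lam*||w||^2 via (X'X + lam*I) w = X'y."""
    n = DEGREE + 1
    X = [features(x) for x in xs]
    XtX = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
            for j in range(n)] for i in range(n)]
    Xty = [sum(row[i] * y for row, y in zip(X, ys)) for i in range(n)]
    return solve(XtX, Xty)

def train_mse(w):
    return sum((sum(wi * fi for wi, fi in zip(w, features(x))) - y) ** 2
               for x, y in zip(xs, ys)) / len(xs)

w_plain = ridge_fit(0.0)
w_ridge = ridge_fit(0.1)

# regularization trades a slightly worse training fit for smaller, smoother weights
print("train MSE  plain:", train_mse(w_plain), " ridge:", train_mse(w_ridge))
print("||w||^2    plain:", sum(wi ** 2 for wi in w_plain),
      " ridge:", sum(wi ** 2 for wi in w_ridge))
```

The penalized fit accepts a slightly higher training error in exchange for much smaller coefficients, which is exactly the variance reduction that combats overfitting; on held-out data this kind of fit typically scores better than the unregularized one.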