Some data mining algorithms work so "well" that they have a tendency to overfit the training data. What does the term overfit mean, and what difficulties does overlooking it cause for the data scientist?
Overfitting refers to a model that models the training data too well.
Overfitting occurs when a model learns the detail and noise in the training data to the degree that it adversely impacts the model's performance on new data. This means that the model picks up the noise or natural variations in the training data and learns them as concepts. The problem is that these concepts do not extend to new data, so the model's ability to generalize is negatively impacted. Overfitting is more likely with nonparametric and nonlinear models, which have more flexibility when learning a target function. As such, many nonparametric machine learning algorithms also include parameters or techniques to limit and restrict how much detail the model learns, as the sketch below illustrates.
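As a rough illustration (a minimal sketch assuming scikit-learn and NumPy, which the question does not mention), a decision tree is a flexible nonparametric model, and capping its depth is one such parameter that restricts how much detail, including noise, it can memorize:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Synthetic data: only the first feature matters, and the labels are noisy.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unrestricted tree can grow until it memorizes the training set;
# limiting max_depth restricts how much detail (and noise) it learns.
unrestricted = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
restricted = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# The unrestricted tree typically fits the training data almost perfectly
# but generalizes worse than the depth-limited tree.
print("unrestricted: train", unrestricted.score(X_train, y_train),
      "test", unrestricted.score(X_test, y_test))
print("restricted:   train", restricted.score(X_train, y_train),
      "test", restricted.score(X_test, y_test))
```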
If we overfit the model, it will not work well on the future data our software receives. If we underfit, we leave information on the table: there is still a lot of signal in the data that could be captured, but only by developing a more complex model.
The best way to assess overfitting is to compare performance on training and testing data: if our model performs very well on the training data but very poorly on the testing data, that is a clear indication that we have overfit the model. The sketch below shows one way to make that comparison.
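Here is a minimal sketch of that check (again assuming scikit-learn; the dataset and model choice are for illustration only). A 1-nearest-neighbour classifier memorizes the training set, so its training accuracy is perfect while its testing accuracy exposes the overfit:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Deliberately noisy synthetic labels (flip_y adds label noise).
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1-NN simply memorizes the training points, noise included.
model = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# A large gap between these two numbers is the signature of overfitting.
print("training accuracy:", model.score(X_train, y_train))  # perfect by construction
print("testing accuracy: ", model.score(X_test, y_test))    # noticeably lower
```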