Some data mining algorithms work so "well" that they have a tendency to overfit the training data. What does the term overfit mean, and what difficulties does overlooking it cause for the data scientist?
Overfitting refers to a model that models the training data too well.
Overfitting occurs when a model learns the detail and noise in the training data to the degree that it adversely impacts the model's performance on new data. This means that the model picks up the noise or natural variations in the training data and learns them as concepts. The problem is that these concepts do not extend to new data, so the model's ability to generalize is negatively impacted. Overfitting is more likely with nonparametric and nonlinear models, which have more flexibility when learning a target function. As such, many nonparametric machine learning algorithms also include parameters or techniques to limit and restrict how much detail the model learns, as the sketch below illustrates.
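As a rough illustration (a minimal sketch assuming scikit-learn and NumPy, which the question does not mention), a decision tree is a flexible nonparametric model, and capping its depth is one such parameter that restricts how much detail, including noise, it can memorize:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Synthetic data: only the first feature matters, and the labels are noisy.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unrestricted tree can grow until it memorizes the training set;
# limiting max_depth restricts how much detail (and noise) it learns.
unrestricted = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
restricted = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# The unrestricted tree typically fits the training data almost perfectly
# but generalizes worse than the depth-limited tree.
print("unrestricted: train", unrestricted.score(X_train, y_train),
      "test", unrestricted.score(X_test, y_test))
print("restricted:   train", restricted.score(X_train, y_train),
      "test", restricted.score(X_test, y_test))
```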
If we overfit the model, it will not work well on the future data our software receives. If we underfit, we leave information on the table: there is still a lot of signal in the data that could be captured, but only by developing a more complex model.
The best way to assess overfitting is to compare performance on training and testing data: if our model performs very well on the training data but very poorly on the testing data, that is a clear indication that we have overfit the model. The sketch below shows one way to make that comparison.
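Here is a minimal sketch of that check (again assuming scikit-learn; the dataset and model choice are for illustration only). A 1-nearest-neighbour classifier memorizes the training set, so its training accuracy is perfect while its testing accuracy exposes the overfit:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Deliberately noisy synthetic labels (flip_y adds label noise).
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1-NN simply memorizes the training points, noise included.
model = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# A large gap between these two numbers is the signature of overfitting.
print("training accuracy:", model.score(X_train, y_train))  # perfect by construction
print("testing accuracy: ", model.score(X_test, y_test))    # noticeably lower
```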