Some data mining algorithms work so “well” that they have a tendency to overfit the training data. What does the term "overfit" mean, and what difficulties does overlooking it cause for the data scientist?
The training set is the data used to fit the algorithm. Overfitting means the model suffers from high variance: it may perform well on the training set yet poorly on the test set, because it fits the training data too closely and even chases outliers and noise. In short, the model fails to generalize its results, which makes it less robust.
Overfitting can be detected by comparing the training-set error with the test-set error (training error is generally lower than test error). A large gap between the two indicates that overfitting exists.
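A toy sketch of that diagnostic, using only the standard library: a "memorizing" model (predict the label of the nearest training point, an extreme overfitter) is compared against a plain linear fit on noisy linear data. The data, the memorizer, and the closed-form line fit are all illustrative choices, not any specific algorithm from the question.

```python
import random

random.seed(0)

def f(x):
    return 2 * x + 1          # true underlying relationship

# noisy training and test samples of the same relationship
train = [(x, f(x) + random.gauss(0, 1)) for x in [i / 10 for i in range(50)]]
test = [(x, f(x) + random.gauss(0, 1)) for x in [i / 10 + 0.05 for i in range(50)]]

def memorize(x):
    # extreme overfitter: return the training label of the nearest training x
    return min(train, key=lambda p: abs(p[0] - x))[1]

# simple linear model y = a*x + b fit by least squares (closed form)
n = len(train)
sx = sum(x for x, _ in train); sy = sum(y for _, y in train)
sxx = sum(x * x for x, _ in train); sxy = sum(x * y for x, y in train)
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - a * sx) / n

def linear(x):
    return a * x + b

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# the train/test error gap is the overfitting diagnostic
mem_gap = mse(memorize, test) - mse(memorize, train)
lin_gap = mse(linear, test) - mse(linear, train)

print("memorizer: train MSE", mse(memorize, train), "test MSE", mse(memorize, test))
print("linear:    train MSE", mse(linear, train), "test MSE", mse(linear, test))
```

The memorizer has zero training error but a large test error (it has fit the noise), so its gap is large; the linear model's train and test errors are close, which is the signature of a model that generalizes.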
If you overlook this problem, your algorithm may appear to work well on the training set while performing much worse on the test set, so the accuracy you report will be misleading and the model will disappoint on new data.
Overfitting can be reduced by applying regularization or by increasing the size of the training dataset.
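To illustrate the regularization point, here is a minimal ridge-regression sketch in pure Python (no external libraries): a degree-7 polynomial is fit to 12 noisy points via the normal equations, with and without an L2 penalty. The data, the degree, and the penalty strength λ = 0.1 are all arbitrary choices for the demo.

```python
import math
import random

random.seed(1)

# noisy samples of a smooth function; few points + high degree invites overfitting
xs = [i / 11 for i in range(12)]
ys = [math.sin(2 * math.pi * x) + random.gauss(0, 0.3) for x in xs]

DEGREE = 7

def features(x):
    return [x ** d for d in range(DEGREE + 1)]

def solve(A, b):
    """Gaussian elimination with partial pivoting for A w = b."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (M[r][n] - sum(M[r][c] * w[c] for c in range(r + 1, n))) / M[r][r]
    return w

def ridge_fit(lam):
    """Minimize sum (w.phi(x) - y)^2 + lam*||w||^2 via (X'X + lam*I) w = X'y."""
    n = DEGREE + 1
    X = [features(x) for x in xs]
    XtX = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
            for j in range(n)] for i in range(n)]
    Xty = [sum(row[i] * y for row, y in zip(X, ys)) for i in range(n)]
    return solve(XtX, Xty)

def train_mse(w):
    return sum((sum(wi * fi for wi, fi in zip(w, features(x))) - y) ** 2
               for x, y in zip(xs, ys)) / len(xs)

w_plain = ridge_fit(0.0)
w_ridge = ridge_fit(0.1)

# regularization trades a slightly worse training fit for smaller, smoother weights
print("train MSE  plain:", train_mse(w_plain), " ridge:", train_mse(w_ridge))
print("||w||^2    plain:", sum(wi ** 2 for wi in w_plain),
      " ridge:", sum(wi ** 2 for wi in w_ridge))
```

The penalized fit accepts a slightly higher training error in exchange for much smaller coefficients, which is exactly the variance reduction that combats overfitting; on held-out data this kind of fit typically scores better than the unregularized one.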