2. ___ is an unsupervised data mining technique requiring
no a priori hypothesis or model about initial patterns or
relationships that may exist within the data.
A) Regression Analysis
B) Clustering Analysis
C) Neural Networks
D) Decision Trees
3. Which of the following are potential data quality
concerns?
A) Dirty data
B) Missing values
C) Inconsistent data
D) Data not integrated
E) All of the above
6. An analyst investigates how customers' yearly incomes
influence their average spending on health insurance.
The analyst received the following information:
SST = 19.35 & SSE = 3.66
What is the R square value of this model?
2. B) Clustering Analysis
Clustering is an unsupervised machine learning task that automatically divides the data into clusters, or groups of similar items. It does this without having been told how the groups should look ahead of time.
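To make the idea concrete, here is a minimal sketch of k-means, one common clustering method, written for one-dimensional data. The function name `kmeans_1d`, the sample `spending` values, and the evenly spaced starting centroids are assumptions chosen for illustration; they are not part of the question.

```python
def kmeans_1d(values, k, iterations=100):
    """Group 1-D values into k clusters with a basic k-means loop."""
    # Assumption: start centroids at evenly spaced positions in the sorted data.
    ordered = sorted(values)
    step = (len(ordered) - 1) / max(k - 1, 1)
    centroids = [ordered[round(i * step)] for i in range(k)]
    labels = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centroids.append(sum(members) / len(members) if members else centroids[c])
        if new_centroids == centroids:
            break  # assignments are stable, so the loop has converged
        centroids = new_centroids
    return labels, centroids


# Hypothetical data: no labels are supplied, only the raw values.
spending = [1.0, 1.2, 0.9, 8.0, 8.3, 7.9]
labels, centroids = kmeans_1d(spending, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The only input besides the data is the number of groups; the grouping itself is found from the values, which is what makes the technique unsupervised.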
3. E) All of the above
Dirty data, missing values, inconsistent data, and data that has not been integrated are all data quality concerns. When dealing with multiple data sources, inconsistency is a strong indicator of a data quality problem, and the same record may exist multiple times in a database. Duplicate data is a common problem for data-driven businesses and can hurt revenue.
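As a rough illustration of how some of the concerns in question 3 might be detected, the sketch below counts missing values, exact duplicate records, and inconsistently formatted codes. The `records` list, its field names, and the rule that state codes should be upper case are all hypothetical.

```python
# Hypothetical records; field names and values are made up for illustration.
records = [
    {"id": 1, "state": "CA", "income": 52000},
    {"id": 2, "state": "ca", "income": None},   # missing value, inconsistent case
    {"id": 1, "state": "CA", "income": 52000},  # exact duplicate of the first record
]


def quality_report(rows):
    """Count a few common data quality problems in a list of dict records."""
    # Missing values: any field whose value is None.
    missing = sum(1 for row in rows for value in row.values() if value is None)

    # Duplicates: records whose full contents were already seen.
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            duplicates += 1
        seen.add(key)

    # Inconsistency (assumed rule): state codes should be trimmed and upper case.
    inconsistent = sum(
        1
        for row in rows
        if isinstance(row.get("state"), str)
        and row["state"] != row["state"].strip().upper()
    )
    return {"missing": missing, "duplicates": duplicates, "inconsistent": inconsistent}


report = quality_report(records)
print(report)  # {'missing': 1, 'duplicates': 1, 'inconsistent': 1}
```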
6. SST = 19.35 & SSE = 3.66
R square value of this model:
R-square is the square of the correlation between the response values and the predicted response values. It is also called the square of the multiple correlation coefficient and the coefficient of multiple determination.
R-square is defined as
R-square = 1 - SSE/SST = 1 - 3.66/19.35 = 1 - 0.189147 = 0.810853, or about 0.8109
For a least-squares fit with an intercept, R-square lies between 0 and 1, with a value closer to 1 indicating that a greater proportion of variance is accounted for by the model. Here, an R-square value of 0.8109 means that the fit explains about 81.09% of the total variation in the data about the average.
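The arithmetic can be checked with a few lines of Python. This is a minimal sketch; the helper name `r_squared` is an assumption, and the inputs are the SST and SSE values given in the question.

```python
def r_squared(sst: float, sse: float) -> float:
    """Return R-square = 1 - SSE/SST for a fitted regression model."""
    if sst <= 0:
        raise ValueError("SST must be positive")
    return 1.0 - sse / sst


r2 = r_squared(sst=19.35, sse=3.66)
print(f"R-square = {r2:.4f}")  # R-square = 0.8109
```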