Question

Jalali-Heravi and Knouz (2002) give “four criteria of correlation coefficient (r), standard deviation (s), F value for the statistical significance of the model and the ratio of the number of observations to the number of descriptors in the equation” for choosing between competing regression models. Provide a detailed critique of this suggestion.

Solutions

Expert Solution

Model selection is the task of choosing a statistical model from a set of candidate models, given data. It is an important part of any statistical analysis and is central to the pursuit of science in general. Standard measures for comparing regression results include bias, slope, standard error of calibration, standard error of prediction, standard error of cross-validation, standard error of validation, correlation, coefficient of determination, the r-to-z transform (that is, Fisher's z-transform), prediction errors of various types, and the coefficient of variation. Here, however, we critically evaluate only the four criteria that Jalali-Heravi and Knouz (2002) give for choosing among competing regression models:

  1. Criterion of the correlation coefficient (r) – The correlation coefficient r measures the strength and direction of the linear relationship between two variables (dependent and independent) on a scatterplot. The value of r always lies between –1 and +1.

A value near +1 denotes a strong positive correlation, and a value near –1 a strong negative correlation, between the dependent and independent variables. Using this criterion for regression model selection raises the following problems:

a) It captures only linear relationships and is not suitable for nonlinear regression.

b) Each independent variable has its own r with the dependent variable, so several r values have to be compared, depending on how many independent variables a particular model uses. Even the single multiple correlation coefficient of a fitted equation never decreases when a descriptor is added, so r is not a reliable or comparable criterion when the number of independent variables varies across the competing models.

c) The coefficient of determination (r squared) may be used to state how much of the variation in the dependent variable is explained by the independent variables. But this does not imply or ensure a causal relationship between the variables: a third variable that is not part of the model may actually be responsible for the relationship observed between the variables in the equation.

Because of these general problems, the correlation coefficient may not be an effective criterion for regression model selection, although it may be used in support of other model selection criteria such as AIC and BIC, which Jalali-Heravi and Knouz (2002) do not mention.
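Point b) can be illustrated with a minimal sketch on synthetic data. Everything here is an assumption made for illustration (the data, the helper name `fit_ols`, the choice of ten junk descriptors), and the AIC shown is the usual Gaussian form with its additive constant dropped. The sketch suggests that R squared keeps creeping upward as unrelated descriptors are added, whereas adjusted R squared and AIC charge a penalty for each extra coefficient.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns (R^2, adjusted R^2, AIC)."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])        # design matrix with intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = float(np.sum((y - A @ beta) ** 2))    # residual sum of squares
    tss = float(np.sum((y - y.mean()) ** 2))    # total sum of squares
    k = A.shape[1]                              # fitted coefficients, intercept included
    r2 = 1.0 - rss / tss
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k)
    aic = n * np.log(rss / n) + 2 * k           # Gaussian AIC, additive constant dropped
    return r2, adj_r2, aic

rng = np.random.default_rng(0)                  # synthetic data, purely illustrative
n = 30
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)                # one real descriptor plus noise
junk = rng.normal(size=(n, 10))                 # ten descriptors unrelated to y

r2_values = []
for extra in range(11):
    X = np.column_stack([x, junk[:, :extra]])   # real descriptor + `extra` junk ones
    r2, adj_r2, aic = fit_ols(X, y)
    r2_values.append(r2)
    print(f"{1 + extra:2d} descriptors: R^2={r2:.3f}  adj R^2={adj_r2:.3f}  AIC={aic:.1f}")
```

Because each model in the loop contains the previous one, the printed R squared values cannot go down, which is exactly why r alone cannot rank models with different numbers of descriptors.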

  2. Criterion of the standard deviation (s) – This criterion may be used to compare the variation between a model's predicted values and the observed historical values, selecting the model with the minimum variation. But this criterion of regression model selection faces the following problems:

a) Standard deviation is a measure of uncertainty. It is useful for incorporating model uncertainty into the estimation of regression coefficients and their standard deviations. But the accuracy of a model is best assessed by analyzing the standard error of estimate (SEE) and the percentage that the SEE represents of the predicted mean (SEE %). The SEE represents the degree to which the predicted scores vary from the observed scores on the criterion measure, similar to the standard deviation used in other statistical procedures. Lower values of the SEE indicate greater accuracy in prediction. Comparing the SEE of different models fitted to the same sample allows the most accurate model for prediction to be determined.

b) The standard error of the regression tells how far the observations tend to fall from the fitted values. It is essentially the standard deviation of the population of residuals, and it can also be obtained from prediction intervals. The standard error of the slope is the standard deviation of the sampling distribution of the slope, and the standard error of the constant is the standard deviation of the sampling distribution of the constant; the confidence intervals for the slopes and the constant are better at providing this information. The RMSE is the square root of the variance of the residuals, so it can be interpreted as the standard deviation of the unexplained variance, that is, of what the model does not explain. Lower values of RMSE indicate a better fit; if the RMSE is close to the standard deviation of Y, the model is not explaining very much. The statistics discussed above apply to regression models that use OLS estimation. Many types of regression models, however, such as mixed models, generalized linear models, and event history models, use maximum likelihood estimation, and these statistics are not available for such models.

c) The accuracy of the estimated mean is measured by the standard error of the mean. The accuracy of a forecast is measured by the standard error of the forecast, which (for both the mean model and a regression model) is the square root of the sum of squares of the standard error of the model and the standard error of the mean. Confidence intervals for the mean and for the forecast are equal to the point estimate plus or minus the appropriate standard error multiplied by the appropriate two-tailed critical value of the t distribution.
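The quantities named in a) to c) can be computed directly for a simple one-descriptor regression. The following is only a sketch on made-up data: the variable names, the forecast point `x0 = 12.0`, and the data-generating line are all assumptions, and the t critical value needed for the intervals is left out to keep the example dependency-free.

```python
import numpy as np

rng = np.random.default_rng(1)                     # synthetic data, purely illustrative
n = 25
x = rng.uniform(0.0, 10.0, size=n)
y = 3.0 + 1.5 * x + rng.normal(scale=2.0, size=n)

A = np.column_stack([np.ones(n), x])               # intercept + one descriptor
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ beta
resid = y - fitted

k = A.shape[1]                                     # number of fitted coefficients
see = np.sqrt(np.sum(resid ** 2) / (n - k))        # standard error of estimate (regression)
see_pct = 100.0 * see / np.mean(fitted)            # SEE as a percentage of the predicted mean
sd_y = np.std(y, ddof=1)                           # baseline: spread of Y itself

# Standard errors at a hypothetical new point x0 (simple-regression formulas).
x0 = 12.0
sxx = np.sum((x - x.mean()) ** 2)
se_mean = see * np.sqrt(1.0 / n + (x0 - x.mean()) ** 2 / sxx)
se_forecast = np.sqrt(see ** 2 + se_mean ** 2)     # root of the sum of squares, as in c)

print(f"SEE = {see:.3f} ({see_pct:.1f}% of predicted mean), sd(Y) = {sd_y:.3f}")
print(f"at x0 = {x0}: se(mean) = {se_mean:.3f}, se(forecast) = {se_forecast:.3f}")
```

Comparing `see` with `sd_y` is the check described in b): when the two are close, the model explains little of the variation in Y.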

  3. Criterion of the F value for the statistical significance of the model:

The F-test evaluates the null hypothesis that all regression coefficients are equal to zero versus the alternative that at least one is not. An equivalent null hypothesis is that R-squared equals zero. A significant F-test indicates that the observed R-squared is reliable and is not a spurious result of oddities in the data set. Thus the F-test determines whether the proposed relationship between the response variable and the set of predictors is statistically reliable and can be useful when the research objective is either prediction or explanation.
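As a rough sketch of the overall F-test, the statistic can be computed either from the sums of squares or from R squared, which makes the equivalence of the two null hypotheses concrete. The data and the helper name `f_statistic` are assumptions for illustration. To avoid depending on a statistics library, the p-value below is approximated by a permutation test rather than taken from the F distribution, which is a swapped-in technique and not the textbook calculation.

```python
import numpy as np

def f_statistic(X, y):
    """Overall F for H0: all slopes are zero, computed two equivalent ways."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - rss / tss
    f_from_ss = ((tss - rss) / p) / (rss / (n - p - 1))
    f_from_r2 = (r2 / p) / ((1.0 - r2) / (n - p - 1))
    return f_from_ss, f_from_r2

rng = np.random.default_rng(2)                     # synthetic data, purely illustrative
n, p = 40, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=n)

f_obs, f_check = f_statistic(X, y)

# Permutation approximation to the p-value: shuffling y breaks any real
# relationship, so the shuffled F values mimic the null distribution.
n_perm = 999
exceed = sum(f_statistic(X, rng.permutation(y))[0] >= f_obs for _ in range(n_perm))
p_value = (1 + exceed) / (1 + n_perm)

print(f"F = {f_obs:.2f} on ({p}, {n - p - 1}) degrees of freedom")
print(f"permutation p-value = {p_value:.3f}")
```

Note that the degrees of freedom depend on both n and p, so raw F values from models with different numbers of descriptors are not on a common scale.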

  4. Criterion of the ratio of the number of observations to the number of descriptors in the equation:

Linear regression equations suffer from the curse of dimensionality that leads to overfitting and accidental correlation, particularly for small data sets and when many variables are present. This can lead to cases where descriptors based on random numbers exhibit higher correlations than actual descriptors.
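The accidental-correlation effect is easy to reproduce in a small simulation. In this sketch both the response and all descriptors are pure random numbers, so any correlation found is spurious by construction; the sample sizes, the pool of 500 descriptors, and the function name `max_abs_correlation` are assumptions chosen for illustration.

```python
import numpy as np

def max_abs_correlation(n_obs, n_descriptors, rng):
    """Largest |r| between a random response and a pool of random descriptors."""
    y = rng.normal(size=n_obs)
    D = rng.normal(size=(n_obs, n_descriptors))
    yc = y - y.mean()                               # center the response
    Dc = D - D.mean(axis=0)                         # center every descriptor column
    r = (Dc.T @ yc) / (np.linalg.norm(Dc, axis=0) * np.linalg.norm(yc))
    return float(np.max(np.abs(r)))

rng = np.random.default_rng(3)
small = max_abs_correlation(10, 500, rng)           # few observations, many descriptors
large = max_abs_correlation(1000, 500, rng)         # many observations, same pool size

print(f"best |r| among 500 random descriptors, n = 10:   {small:.2f}")
print(f"best |r| among 500 random descriptors, n = 1000: {large:.2f}")
```

With only 10 observations, picking the best of 500 meaningless descriptors typically yields a correlation that would look impressive in isolation, while with 1000 observations the best spurious correlation stays small.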

The multiple linear regression (MLR) method fits a single equation containing several independent variables by standard multivariable regression. MLR assumes that the independent variables are independent of one another (not correlated), so the descriptors present in an MLR model should not be strongly intercorrelated. To minimize the possibility of chance correlations, the number of independent variables initially considered should not be more than one-fifth of the number of compounds in the training set (software may issue a warning when this limit is exceeded). When the number of independent variables is greater than the number of observations (rows), multiple linear regression cannot be applied. For a statistically reliable model, the number of observations and the number of descriptors should bear a ratio of at least 5:1. An MLR model that fits the given data well yields a scatter plot of observed versus calculated values in which the points deviate minimally from the line of fit.
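A short sketch of the 5:1 rule and of what goes wrong when descriptors outnumber observations. The helper names and the 8-by-12 example are hypothetical. The response here is random noise, yet least squares reproduces it exactly, which shows that the fit is not uniquely determined and carries no information.

```python
import numpy as np

def observations_per_descriptor(n_obs, n_descriptors):
    """Ratio of observations to descriptors in the equation."""
    return n_obs / n_descriptors

def meets_five_to_one(n_obs, n_descriptors):
    """True when the ratio satisfies the 5:1 rule of thumb."""
    return observations_per_descriptor(n_obs, n_descriptors) >= 5.0

rng = np.random.default_rng(4)                     # synthetic data, purely illustrative
n, p = 8, 12                                       # more descriptors than observations
X = rng.normal(size=(n, p))
y = rng.normal(size=n)                             # response unrelated to any descriptor

A = np.column_stack([np.ones(n), X])               # 8 rows, 13 columns
beta, _, rank, _ = np.linalg.lstsq(A, y, rcond=None)
rss = float(np.sum((y - A @ beta) ** 2))

print(f"ratio = {observations_per_descriptor(n, p):.2f}, 5:1 rule met: {meets_five_to_one(n, p)}")
print(f"rank of design matrix = {rank} (columns = {A.shape[1]}), residual SS = {rss:.2e}")
```

`np.linalg.lstsq` still returns coefficients in this case (the minimum-norm solution), but the rank is below the number of columns, so infinitely many coefficient vectors fit equally well.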

The partial least squares (PLS) regression method carries out regression using latent variables from the independent and dependent data that are along their axes of greatest variation and are most highly correlated. PLS can be used with more than one dependent variable. It is typically applied when the independent variables are correlated or the number of independent variables exceeds the number of observations (rows).

From the above discussion it is clear that the ratio of the number of observations to the number of descriptors in the equation is an important criterion in model selection for particular types of regression.

