The weather company you work for is performing an analysis on predicting sandstorms accurately in Phoenix. You have been given a large dataset and are asked to determine which factors could be used to accurately predict them.
The dataset you are provided with has over 50 variables. With your vast knowledge of Excel, you determine that 4 variables can be used as predictors to accurately predict sandstorms. Your boss is not convinced and decides to ask you some questions.
A. Predictor variables can be excluded from the analysis on the following basis:
First identify outliers and influential points, and consider excluding them, at least temporarily, so that they do not distort the choice of predictors.
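One common way to look for influential points is Cook's distance. The following is a minimal sketch for a regression with a single predictor, not the method used in the original analysis. The variable names, the data values, and the 4/n cutoff (a widely used rule of thumb) are all assumptions made for illustration.

```python
def cooks_distances(x, y):
    """Cook's distance of each point in a simple linear regression y = a + b*x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    p = 2  # number of fitted parameters: intercept and slope
    mse = sum(e ** 2 for e in residuals) / (n - p)
    # Leverage of each point in simple linear regression
    leverage = [1 / n + (xi - x_mean) ** 2 / sxx for xi in x]
    return [e ** 2 / (p * mse) * h / (1 - h) ** 2
            for e, h in zip(residuals, leverage)]


# Hypothetical data: the last observation breaks an otherwise linear trend.
wind_speed = [1, 2, 3, 4, 5, 6]
dust_index = [1, 2, 3, 4, 5, 12]

distances = cooks_distances(wind_speed, dust_index)
cutoff = 4 / len(wind_speed)  # rule-of-thumb cutoff (an assumption)
flagged = [i for i, d in enumerate(distances) if d > cutoff]
print(flagged)  # indices of points worth a closer look
```

A flagged point is a candidate for closer inspection, not automatic deletion. Refitting the model with and without it shows how much it actually changes the conclusions.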
Only the required predictor variables should be kept in the regression analysis, for the following reasons:
1) Unnecessary predictors add noise to the estimates of the other quantities we are interested in, and they waste degrees of freedom.
2) Collinearity is caused by having too many variables trying to do the same job.
3) If the model is to be used for prediction, we can save time and/or money by not measuring redundant predictors.
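The cost of wasted degrees of freedom in reason 1 can be illustrated with adjusted R-squared, which penalizes each extra predictor. This is a small sketch. The R-squared value, the sample size, and the predictor counts below are made-up numbers, not results from the sandstorm dataset.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical case: both models reach the same raw R-squared of 0.80
# on 30 observations, but one of them uses 10 predictors instead of 4.
lean = adjusted_r2(0.80, n_obs=30, n_predictors=4)
bloated = adjusted_r2(0.80, n_obs=30, n_predictors=10)

print(round(lean, 3), round(bloated, 3))
```

With the same raw fit, the model with more predictors scores lower after the adjustment, because the extra predictors use up degrees of freedom without explaining anything more.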
B. Because “Relative Humidity” and “Number of Sunny Hours” are highly correlated, including both would introduce collinearity into the model, since the two variables do the same job.
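To make the collinearity concern concrete, the sketch below computes the Pearson correlation between two predictors and the resulting variance inflation factor (VIF). With exactly two predictors, the VIF reduces to 1 / (1 - r^2). The data values are invented for illustration and are not taken from the Phoenix dataset.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    cov = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    x_ss = sum((xi - x_mean) ** 2 for xi in x)
    y_ss = sum((yi - y_mean) ** 2 for yi in y)
    return cov / math.sqrt(x_ss * y_ss)


# Hypothetical measurements: humidity rises as sunny hours fall.
relative_humidity = [10, 20, 30, 40, 50]
sunny_hours = [12, 10.5, 8, 6.5, 4]

r = pearson(relative_humidity, sunny_hours)
vif = 1 / (1 - r ** 2)  # VIF in the two-predictor case
print(round(r, 3), round(vif, 1))
```

A VIF above roughly 10 is often treated as a sign of problematic collinearity, though that threshold is a convention, not a strict rule. When two predictors are this strongly related, keeping only one of them usually loses very little information.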
C. Which variables to keep or discard depends on the output you are interested in. We cannot filter on the correlation values alone. If an independent variable is correlated with the dependent variable, it is included in the analysis. If an independent variable has no correlation with the dependent variable, it is excluded from the analysis. Hence, variables that are correlated with the dependent variable, whether positively or negatively, are included in the analysis, and
variables with no correlation are excluded from the analysis.
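The rule above can be sketched as a simple screening step: compute each predictor's correlation with the response and keep those whose absolute correlation is large enough. The predictor names, the data, and the 0.5 threshold are all hypothetical. In practice the threshold is a judgment call.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    cov = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    x_ss = sum((xi - x_mean) ** 2 for xi in x)
    y_ss = sum((yi - y_mean) ** 2 for yi in y)
    return cov / math.sqrt(x_ss * y_ss)


# Hypothetical response and candidate predictors.
sandstorm_index = [1, 2, 3, 4, 5]
candidates = {
    "wind_speed": [2, 4, 5, 8, 10],            # strong positive relationship
    "relative_humidity": [50, 40, 35, 20, 10],  # strong negative relationship
    "day_of_month": [3, 1, 4, 1, 5],            # weak relationship
}

threshold = 0.5  # assumed cutoff on |r|
kept = [name for name, values in candidates.items()
        if abs(pearson(values, sandstorm_index)) >= threshold]
print(kept)
```

Using the absolute value matters here: a strongly negative correlation is just as useful for prediction as a strongly positive one.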