The weather company you work for is performing an analysis on
predicting sandstorms accurately in Phoenix. You have been given a
large dataset and are asked to determine which factors could be
used to accurately predict them.
The dataset you are provided with has over 50 variables. Using
your vast knowledge of Excel, you determine that 4 variables can be
used as predictors to forecast sandstorms accurately. Your boss is
not convinced and decides to ask you some questions.
- You mentioned that before you even began this “regression
analysis” of yours, you discarded about half of the variables. How
did you choose which variables to discard, and more importantly,
why does it matter?
- You asked a friend and he told you that in his company they use
both “Relative Humidity” and “Number of Sunny Hours” to predict
sandstorms. In your report, you mention that you use only “Relative
Humidity”, because it is highly correlated with “Number of
Sunny Hours”. However, each is a good predictor by itself, so
why didn’t you use both? Wouldn’t that make the model
better?
- Looking at the correlation matrix you created between the
independent variables and the dependent variable, which independent
variables did you discard or keep for each type of correlation
listed below, and why?
  - Strong positive correlation
  - Weak positive correlation
  - No correlation
  - Weak negative correlation
  - Strong negative correlation
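
The screening the boss is asking about can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used in the report: the data is synthetic and generated in the script, the columns `wind_speed`, `noise` and `storm_score` are hypothetical, and the cutoffs `weak=0.3` and `collinear=0.8` are arbitrary choices. Step 1 drops predictors whose correlation with the dependent variable is close to zero, whether positive or negative. Step 2 keeps only one predictor out of any pair that is highly correlated with each other, which is the situation described for “Relative Humidity” and “Number of Sunny Hours”.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def screen_predictors(data, target, weak=0.3, collinear=0.8):
    """Return (kept predictor names, correlation of each predictor with target).

    The thresholds are illustrative assumptions, not standard values.
    """
    # Step 1: correlation of every predictor with the dependent variable.
    r_target = {
        name: pearson(col, data[target])
        for name, col in data.items()
        if name != target
    }
    # Keep strong correlations of either sign; discard those near zero.
    candidates = [name for name, r in r_target.items() if abs(r) >= weak]

    # Step 2: visit the strongest candidates first and skip any candidate
    # that is highly correlated with a predictor that was already kept.
    candidates.sort(key=lambda name: abs(r_target[name]), reverse=True)
    kept = []
    for name in candidates:
        if all(abs(pearson(data[name], data[k])) < collinear for k in kept):
            kept.append(name)
    return kept, r_target


# Synthetic data, made up purely for illustration.
random.seed(0)
n = 500
humidity = [random.gauss(30, 10) for _ in range(n)]
# Sunny hours are built from humidity, so the two carry nearly the same information.
sunny_hours = [12 - 0.2 * h + random.gauss(0, 1) for h in humidity]
wind_speed = [random.gauss(15, 5) for _ in range(n)]
noise = [random.gauss(0, 1) for _ in range(n)]  # unrelated to the target
storm_score = [
    -0.5 * h + 1.0 * w + random.gauss(0, 3)
    for h, w in zip(humidity, wind_speed)
]

data = {
    "relative_humidity": humidity,
    "sunny_hours": sunny_hours,
    "wind_speed": wind_speed,
    "noise": noise,
    "storm_score": storm_score,
}

kept, r_target = screen_predictors(data, target="storm_score")
for name, r in r_target.items():
    print(f"{name:18s} r = {r:+.2f}")
print("kept:", kept)
```

In this sketch a strong negative correlation with the target is treated as just as useful as a strong positive one, because only the magnitude of r is compared against the cutoff. The redundant member of the humidity and sunny-hours pair is dropped because it adds little new information while making the regression coefficients unstable.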