The BaseballTimes data contain four quantitative variables that might be useful for predicting game "Time":
Runs, Margin, Pitchers, and Attendance.
Game | League | Runs | Margin | Pitchers | Attendance | Time |
CLE-DET | AL | 14 | 6 | 6 | 38774 | 168 |
CHI-BAL | AL | 11 | 5 | 5 | 15398 | 164 |
BOS-NYY | AL | 10 | 4 | 11 | 55058 | 202 |
TOR-TAM | AL | 8 | 4 | 10 | 13478 | 172 |
TEX-KC | AL | 3 | 1 | 4 | 17004 | 151 |
OAK-LAA | AL | 6 | 4 | 4 | 37431 | 133 |
MIN-SEA | AL | 5 | 1 | 5 | 26292 | 151 |
CHI-PIT | NL | 23 | 5 | 14 | 17929 | 239 |
LAD-WAS | NL | 3 | 1 | 6 | 26110 | 156 |
FLA-ATL | NL | 19 | 1 | 12 | 17539 | 211 |
CIN-HOU | NL | 3 | 1 | 4 | 30395 | 147 |
MIL-STL | NL | 12 | 12 | 9 | 41121 | 185 |
ARI-SD | NL | 11 | 7 | 10 | 32104 | 164 |
COL-SF | NL | 9 | 5 | 7 | 32695 | 180 |
NYM-PHI | NL | 15 | 1 | 16 | 45204 | 317 |
From among the four predictors, choose a model for each of the following goals:
a. Maximize the coefficient of determination R^2
b. Maximize the adjusted R^2
c. Minimize Mallows' Cp
d. Considering the models from a-c, which model would you choose to predict game Time, and why?
e. Using a stepwise procedure (forward selection or backward elimination), find the "best" model
Using the above data, we run a multiple linear regression
to predict game Time from the predictors Runs, Margin, Pitchers, and Attendance.
The model is:
Time = B_0 + B_1(Runs) + B_2(Margin) + B_3(Pitchers) + B_4(Attendance)
where B_0 is the intercept and B_1, B_2, B_3, and B_4 are the regression coefficients of the four predictors.
Hypothesis:
Ho : B_1 = B_2 = B_3 = B_4 = 0
i.e. none of the predictors is significant
v/s
H1 : at least one of B_1, B_2, B_3, B_4 is not zero,
i.e. at least one predictor is significant.
Calculation table:
Coefficient | Estimate | t value | p value |
B_0 | 88.0151 | 4.952 | 0.00057 |
B_1 | 1.5614 | 0.931 | 0.3736 |
B_2 | -3.7278 | -1.793 | 0.1032 |
B_3 | 8.7322 | 3.514 | 0.005594 |
B_4 | 0.000726 | 1.424 | 0.1848 |
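The estimates and t values in the calculation table can be recomputed with ordinary least squares. The following is a minimal sketch, assuming numpy is available; the variable names (`runs`, `game_time`, and so on) are illustrative and not part of the original answer. The p values are left out because they need the t distribution with n - 5 = 10 degrees of freedom, which numpy alone does not provide.

```python
import numpy as np

# Data copied from the BaseballTimes table above (15 games).
runs = [14, 11, 10, 8, 3, 6, 5, 23, 3, 19, 3, 12, 11, 9, 15]
margin = [6, 5, 4, 4, 1, 4, 1, 5, 1, 1, 1, 12, 7, 5, 1]
pitchers = [6, 5, 11, 10, 4, 4, 5, 14, 6, 12, 4, 9, 10, 7, 16]
attendance = [38774, 15398, 55058, 13478, 17004, 37431, 26292, 17929,
              26110, 17539, 30395, 41121, 32104, 32695, 45204]
game_time = [168, 164, 202, 172, 151, 133, 151, 239, 156, 211, 147, 185,
             164, 180, 317]

# Design matrix: a column of ones for B_0, then the four predictors.
X = np.column_stack([np.ones(len(game_time)), runs, margin, pitchers, attendance])
y = np.array(game_time, dtype=float)
n, p = X.shape  # 15 observations, 5 coefficients

# Least-squares estimates of B_0..B_4.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residual mean square and standard errors of the estimates.
resid = y - X @ beta
s2 = resid @ resid / (n - p)
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
t_values = beta / se

for name, b, t in zip(["B_0", "B_1", "B_2", "B_3", "B_4"], beta, t_values):
    print(f"{name}: estimate = {b:.6f}, t = {t:.3f}")
```

Each t value is the estimate divided by its standard error, so a coefficient is judged significant when its |t| is large relative to the t distribution with 10 degrees of freedom.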
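At the 5% level of significance, Pitchers (B_3, p = 0.005594) is the only predictor whose p value is below 0.05, so we reject Ho and conclude that Pitchers is significant. Runs, Margin, and Attendance are not individually significant.
Therefore the model is:
Time = B_0 + B_3(Pitchers)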
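a) Maximize the coefficient of determination (R^2):
R-squared: 0.8557
i.e. the full model fits well: all four predictors together explain 85.57% of the variation in the dependent variable Time. Because R^2 never decreases when a predictor is added, the full model is the one that maximizes R^2.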
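b) Maximize the adjusted R^2:
Adjusted R-squared: 0.798
i.e. for the full model, 79.8% of the variation in Time is explained after R^2 is penalized for the number of predictors. Unlike R^2, the adjusted R^2 can fall when a predictor that adds little (here, only Pitchers is statistically significant) is included, so it can be compared across models of different sizes.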
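To compare R^2 and adjusted R^2 across candidate models, one option is to fit every subset of the four predictors. The sketch below assumes numpy and uses illustrative names (`predictors`, `fit_stats`) that are not from the original answer; it uses adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1) for a model with k predictors.

```python
from itertools import combinations

import numpy as np

# Data copied from the BaseballTimes table above.
predictors = {
    "Runs": [14, 11, 10, 8, 3, 6, 5, 23, 3, 19, 3, 12, 11, 9, 15],
    "Margin": [6, 5, 4, 4, 1, 4, 1, 5, 1, 1, 1, 12, 7, 5, 1],
    "Pitchers": [6, 5, 11, 10, 4, 4, 5, 14, 6, 12, 4, 9, 10, 7, 16],
    "Attendance": [38774, 15398, 55058, 13478, 17004, 37431, 26292, 17929,
                   26110, 17539, 30395, 41121, 32104, 32695, 45204],
}
y = np.array([168, 164, 202, 172, 151, 133, 151, 239, 156, 211, 147, 185,
              164, 180, 317], dtype=float)
n = len(y)


def fit_stats(names):
    """Return (R^2, adjusted R^2) for an OLS fit of Time on the named predictors."""
    X = np.column_stack([np.ones(n)] + [predictors[v] for v in names])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = float(np.sum((y - X @ beta) ** 2))
    sst = float(np.sum((y - y.mean()) ** 2))
    r2 = 1 - sse / sst
    adj = 1 - (1 - r2) * (n - 1) / (n - len(names) - 1)
    return r2, adj


# All 15 non-empty subsets of the four predictors.
subsets = [c for k in range(1, 5) for c in combinations(predictors, k)]
results = {s: fit_stats(s) for s in subsets}

best_r2 = max(results, key=lambda s: results[s][0])
best_adj = max(results, key=lambda s: results[s][1])
print("Largest R^2:         ", best_r2, round(results[best_r2][0], 4))
print("Largest adjusted R^2:", best_adj, round(results[best_adj][1], 4))
```

The subset with the largest R^2 is always the full model, while the subset with the largest adjusted R^2 may be smaller.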
c) Minimize Mallows' Cp:
Cp = SSE_p / S^2 - N + 2(P + 1)
where SSE_p is the error sum of squares for the model with P regressors,
S^2 is the residual mean square after regression on the complete set of K regressors (estimated by the mean square error, MSE, of the full model),
N is the sample size,
and P is the number of regressors in the model.
Using the above formula we calculate Mallows' Cp for the Pitchers-only model (P = 1):
Cp = 3 (approximately)
A Cp close to P + 1 indicates that little bias is introduced into the predicted responses by leaving predictors out; a Cp far above P + 1 signals an underspecified model. For the Pitchers-only model, P + 1 = 2.
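Mallows' Cp can be evaluated for every subset of predictors with the formula Cp = SSE_p / S^2 - N + 2(P + 1). This is a minimal sketch under the assumption that numpy is available; the helper names (`sse`, `cp`) are illustrative only.

```python
from itertools import combinations

import numpy as np

# Data copied from the BaseballTimes table above.
predictors = {
    "Runs": [14, 11, 10, 8, 3, 6, 5, 23, 3, 19, 3, 12, 11, 9, 15],
    "Margin": [6, 5, 4, 4, 1, 4, 1, 5, 1, 1, 1, 12, 7, 5, 1],
    "Pitchers": [6, 5, 11, 10, 4, 4, 5, 14, 6, 12, 4, 9, 10, 7, 16],
    "Attendance": [38774, 15398, 55058, 13478, 17004, 37431, 26292, 17929,
                   26110, 17539, 30395, 41121, 32104, 32695, 45204],
}
y = np.array([168, 164, 202, 172, 151, 133, 151, 239, 156, 211, 147, 185,
              164, 180, 317], dtype=float)
n = len(y)


def sse(names):
    """Error sum of squares for an OLS fit of Time on the named predictors."""
    X = np.column_stack([np.ones(n)] + [predictors[v] for v in names])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2))


full = tuple(predictors)            # all K = 4 regressors
s2 = sse(full) / (n - len(full) - 1)  # residual mean square of the full model

subsets = [c for k in range(1, 5) for c in combinations(predictors, k)]
cp = {s: sse(s) / s2 - n + 2 * (len(s) + 1) for s in subsets}

# List the subsets from smallest to largest Cp.
for s in sorted(cp, key=cp.get):
    print(f"Cp = {cp[s]:6.2f}  P + 1 = {len(s) + 1}  {', '.join(s)}")
```

By construction the full model always has Cp = K + 1 = 5, so the comparison of interest is among the smaller subsets.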
d) The chosen model is:
Time = B_0 + B_3(Pitchers)
because B_0 and B_3 are the only coefficients that significantly affect the prediction of Time,
and this model has the smallest AIC, i.e. 93.83.
e) Using the stepwise procedure:
Using the stepwise procedure, the best model is
Time = B_0 + B_3(Pitchers)
because it has the smallest AIC, 93.83. Its adjusted R^2 is about 0.784, slightly below the 0.798 of the full model.
Here B_0 = 94.84 and B_3 = 10.71, so
the fitted model is
Time = 94.84 + 10.71(Pitchers)
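The stepwise search can be sketched as forward selection by AIC. The original answer does not say which software or AIC convention was used; the sketch below assumes AIC = n*log(SSE/n) + 2*(number of coefficients), which reproduces the 93.83 quoted above for the Pitchers-only model. It assumes numpy, and the function name `aic` is illustrative.

```python
import math

import numpy as np

# Data copied from the BaseballTimes table above.
predictors = {
    "Runs": [14, 11, 10, 8, 3, 6, 5, 23, 3, 19, 3, 12, 11, 9, 15],
    "Margin": [6, 5, 4, 4, 1, 4, 1, 5, 1, 1, 1, 12, 7, 5, 1],
    "Pitchers": [6, 5, 11, 10, 4, 4, 5, 14, 6, 12, 4, 9, 10, 7, 16],
    "Attendance": [38774, 15398, 55058, 13478, 17004, 37431, 26292, 17929,
                   26110, 17539, 30395, 41121, 32104, 32695, 45204],
}
y = np.array([168, 164, 202, 172, 151, 133, 151, 239, 156, 211, 147, 185,
              164, 180, 317], dtype=float)
n = len(y)


def aic(names):
    """Assumed convention: n*log(SSE/n) + 2*(number of coefficients)."""
    X = np.column_stack([np.ones(n)] + [predictors[v] for v in names])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = float(np.sum((y - X @ beta) ** 2))
    return n * math.log(sse / n) + 2 * (len(names) + 1)


# Forward selection: start from the intercept-only model and add the
# predictor that lowers AIC the most, until no addition lowers it.
selected = []
current = aic(selected)
while len(selected) < len(predictors):
    remaining = [v for v in predictors if v not in selected]
    scores = {v: aic(selected + [v]) for v in remaining}
    best = min(scores, key=scores.get)
    if scores[best] >= current:
        break
    selected.append(best)
    current = scores[best]
    print(f"add {best}: AIC = {current:.2f}")

print("Selected predictors:", selected)
print(f"AIC of the Pitchers-only model: {aic(['Pitchers']):.2f}")
```

Backward elimination works the same way in reverse: start from the full model and drop the predictor whose removal lowers AIC the most.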