Please use R
"Team","WINS","HR","BA","ERA"
"Anaheim Angels",99,152,.282,3.69
"Baltimore Orioles",67,165,.246,4.46
"Boston Red Sox",93,177,.277,3.75
"Chicago White Sox",81,217,.268,4.53
"Cleveland Indians",74,192,.249,4.91
"Detroit Tigers",55,124,.248,4.93
"Kansas City Royals",62,140,.256,5.21
"Minnesota Twins",94,167,.272,4.12
"New York Yankees",103,223,.275,3.87
"Oakland Athletics",103,205,.261,3.68
"Seattle Mariners",93,152,.275,4.07
"Tampa Bay Devil Rays",55,133,.253,5.29
"Texas Rangers",72,230,.269,5.15
"Toronto Blue Jays",78,187,.261,4.8
"Arizona Diamondbacks",98,165,.267,3.92
"Atlanta Braves",101,164,.26,3.13
"Chicago Cubs",67,200,.246,4.29
"Cincinnati Reds",78,169,.253,4.27
"Colorado Rockies",73,152,.274,5.2
"Florida Marlins",79,146,.261,4.36
"Houston Astros",84,167,.262,4
"Los Angeles Dodgers",92,155,.264,3.69
"Milwaukee Brewers",56,139,.253,4.73
"Montreal Expos",83,162,.261,3.97
"New York Mets",75,160,.256,3.89
"Philadelphia Phillies",80,165,.259,4.17
"Pittsburgh Pirates",72,142,.244,4.23
"St. Louis Cardinals",97,175,.268,3.7
"San Diego Padres",66,136,.253,4.62
"San Francisco Giants",95,198,.267,3.54
The file above contains data on the following variables for the 30 major league baseball teams during the 2002 season:
• WINS: number of games won
• HR: number of home runs hit
• BA: average batting average
• ERA: earned run average
(a) Using WINS as the dependent variable, run the regression relating the three predictor variables to WINS. Report the fitted regression line.
(b) Construct the ANOVA table of the above model.
(c) Plot the residuals e_i against the fitted values ŷ_i. What departures from the regression model assumptions can be studied from this plot? What are your findings? (Note: If you are not sure about the validity of any of the assumptions, perform a formal test to verify your answer.)
(d) Prepare a normal probability plot (QQ plot) of the residuals. Which assumption can be tested from this plot and what do you conclude? (Note: You can also use the formal test to reinforce your conclusion).
(e) If there is no problem with any of the assumptions, you can safely continue on making inference. Test for the significance of the regression using a 0.05 significance level.
(f) What percentage of the variability in y is explained by the regression?
(g) Using the individual t-tests, comment on the significance of each predictor variable, using a 0.05 significance level.
Hint:
data=read.table('hmw6_prob2.txt', header=T, sep=',')
y=data$WINS
x1=data$HR
x2=data$BA
x3=data$ERA
Solution:
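R code:
# Read the comma-separated data file (the path is specific to the author's machine)
data <- read.csv("C:/Users/Newfolder/Downloads/hmw6_prob.txt",
                 comment.char = "#")
View(data)   # open the data viewer (interactive sessions only)
head(data)   # first six rows
# (a) Fit the multiple regression of WINS on HR, BA and ERA
linmod <- lm(WINS ~ HR + BA + ERA, data = data)
coefficients(linmod)
# (b) Sequential (Type I) ANOVA table
anova(linmod)
# (e)-(g) Coefficient t-tests, R-squared and overall F-test
summary(linmod)
# (c) Residuals against fitted values
plot(linmod, which = 1)
# (d) Normal Q-Q plot of the residuals
plot(linmod, which = 2)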
Output:
Analysis of Variance Table
Response: WINS
Df Sum Sq Mean Sq F value Pr(>F)
HR 1 1126.63 1126.63 45.024 4.046e-07 ***
BA 1 2223.89 2223.89 88.875 7.133e-10 ***
ERA 1 2311.06 2311.06 92.359 4.819e-10 ***
Residuals 26 650.59 25.02
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(linmod)
Call:
lm(formula = WINS ~ HR + BA + ERA, data = data)
Residuals:
Min 1Q Median 3Q Max
-9.1555 -2.5054 0.4665 2.1392 9.1389
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -21.88065 28.92622 -0.756 0.4562
HR 0.09759 0.03572 2.732 0.0112 *
BA 606.30599 100.76891 6.017 2.36e-06 ***
ERA -16.89736 1.75825 -9.610 4.82e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.002 on 26 degrees of freedom
Multiple R-squared: 0.8969, Adjusted R-squared: 0.885
F-statistic: 75.42 on 3 and 26 DF, p-value: 5.895e-13
ANSWER(A)
The fitted regression line is
WINS = -21.88065235 + 0.09759143*HR + 606.30598979*BA - 16.89735835*ERA
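As a quick sanity check on the fitted line, the equation can be applied by hand to one row of the data. The sketch below is only illustrative: the coefficients are copied from the answer above, the team row is the Anaheim Angels line of the table, and the variable names (b, x, fitted_wins) are hypothetical.

```r
# Coefficients copied from the fitted line reported in ANSWER(A)
b <- c(intercept = -21.88065235, HR = 0.09759143,
       BA = 606.30598979, ERA = -16.89735835)

# Anaheim Angels row from the data table (WINS = 99)
x <- c(HR = 152, BA = 0.282, ERA = 3.69)
actual_wins <- 99

# Fitted value = intercept + sum of coefficient * predictor
fitted_wins <- unname(b["intercept"] + sum(b[names(x)] * x))

# Residual = observed - fitted
residual <- actual_wins - fitted_wins

cat("fitted:", round(fitted_wins, 2), " residual:", round(residual, 2), "\n")
```

For this team the line gives a fitted value of roughly 101.6 wins against 99 actual wins, so the residual is about -2.6.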
ANSWER(B)
Analysis of Variance Table
Response: WINS
Df Sum Sq Mean Sq F value Pr(>F)
HR 1 1126.63 1126.63 45.024 4.046e-07 ***
BA 1 2223.89 2223.89 88.875 7.133e-10 ***
ERA 1 2311.06 2311.06 92.359 4.819e-10 ***
Residuals 26 650.59 25.02
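Note that anova(linmod) reports sequential sums of squares, one line per predictor in the order entered. Many textbooks instead ask for a single Regression line with 3 degrees of freedom. The sketch below shows one way to build that pooled table; it assumes the data are typed in from the table in the problem statement, and the names teams, full, null and anova_tab are illustrative, not part of the original solution.

```r
# Data transcribed from the table in the problem statement (30 teams, 2002)
WINS <- c(99, 67, 93, 81, 74, 55, 62, 94, 103, 103, 93, 55, 72, 78, 98,
          101, 67, 78, 73, 79, 84, 92, 56, 83, 75, 80, 72, 97, 66, 95)
HR   <- c(152, 165, 177, 217, 192, 124, 140, 167, 223, 205, 152, 133, 230,
          187, 165, 164, 200, 169, 152, 146, 167, 155, 139, 162, 160, 165,
          142, 175, 136, 198)
BA   <- c(.282, .246, .277, .268, .249, .248, .256, .272, .275, .261, .275,
          .253, .269, .261, .267, .260, .246, .253, .274, .261, .262, .264,
          .253, .261, .256, .259, .244, .268, .253, .267)
ERA  <- c(3.69, 4.46, 3.75, 4.53, 4.91, 4.93, 5.21, 4.12, 3.87, 3.68, 4.07,
          5.29, 5.15, 4.80, 3.92, 3.13, 4.29, 4.27, 5.20, 4.36, 4.00, 3.69,
          4.73, 3.97, 3.89, 4.17, 4.23, 3.70, 4.62, 3.54)
teams <- data.frame(WINS, HR, BA, ERA)

full <- lm(WINS ~ HR + BA + ERA, data = teams)  # model from part (a)
null <- lm(WINS ~ 1, data = teams)              # intercept-only model

sse <- sum(residuals(full)^2)   # error sum of squares
sst <- sum(residuals(null)^2)   # total sum of squares about the mean
ssr <- sst - sse                # regression sum of squares (pooled)

df_reg <- length(coef(full)) - 1   # number of predictors
df_err <- df.residual(full)        # n - p - 1

anova_tab <- data.frame(
  Source = c("Regression", "Error", "Total"),
  Df     = c(df_reg, df_err, df_reg + df_err),
  SumSq  = c(ssr, sse, sst),
  MeanSq = c(ssr / df_reg, sse / df_err, NA)
)

# Overall F statistic and its p-value
f_stat  <- (ssr / df_reg) / (sse / df_err)
p_value <- pf(f_stat, df_reg, df_err, lower.tail = FALSE)

print(anova_tab)
cat("F =", round(f_stat, 2), " p-value =", signif(p_value, 4), "\n")
```

The pooled regression sum of squares is the sum of the three sequential lines (1126.63 + 2223.89 + 2311.06 = 5661.58), and the resulting F statistic should agree with the F-statistic line in the summary output.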
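ANSWER(C)
The plot of residuals against fitted values can reveal nonlinearity of the regression function, nonconstant error variance, and outliers.
We don't see any pattern in the residual plot.
The linearity and homogeneity of variance assumptions appear to be satisfied.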
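The problem's note suggests backing up the visual check with a formal test, which the solution does not run. One common choice for constant variance is the Breusch-Pagan test. The sketch below computes its studentized (Koenker) form using base R only: regress the squared residuals on the predictors and compare n times the auxiliary R-squared to a chi-square distribution. The choice of this test is an assumption, as are the names teams, fit, aux and bp_stat; the lmtest package's bptest function offers the same test if that package is installed.

```r
# Data transcribed from the table in the problem statement (30 teams, 2002)
WINS <- c(99, 67, 93, 81, 74, 55, 62, 94, 103, 103, 93, 55, 72, 78, 98,
          101, 67, 78, 73, 79, 84, 92, 56, 83, 75, 80, 72, 97, 66, 95)
HR   <- c(152, 165, 177, 217, 192, 124, 140, 167, 223, 205, 152, 133, 230,
          187, 165, 164, 200, 169, 152, 146, 167, 155, 139, 162, 160, 165,
          142, 175, 136, 198)
BA   <- c(.282, .246, .277, .268, .249, .248, .256, .272, .275, .261, .275,
          .253, .269, .261, .267, .260, .246, .253, .274, .261, .262, .264,
          .253, .261, .256, .259, .244, .268, .253, .267)
ERA  <- c(3.69, 4.46, 3.75, 4.53, 4.91, 4.93, 5.21, 4.12, 3.87, 3.68, 4.07,
          5.29, 5.15, 4.80, 3.92, 3.13, 4.29, 4.27, 5.20, 4.36, 4.00, 3.69,
          4.73, 3.97, 3.89, 4.17, 4.23, 3.70, 4.62, 3.54)
teams <- data.frame(WINS, HR, BA, ERA)

fit <- lm(WINS ~ HR + BA + ERA, data = teams)

# Auxiliary regression: squared residuals on the same predictors
aux_data <- cbind(teams, e2 = residuals(fit)^2)
aux <- lm(e2 ~ HR + BA + ERA, data = aux_data)

# Studentized Breusch-Pagan statistic: n * R^2 of the auxiliary regression
bp_stat <- nrow(teams) * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1          # one df per predictor
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

cat("BP =", round(bp_stat, 3), " df =", bp_df,
    " p-value =", round(bp_p, 4), "\n")
```

The null hypothesis is constant error variance, so a p-value above 0.05 would support the conclusion drawn from the residual plot.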
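ANSWER(D)
The normal probability plot is used to check the assumption that the errors are normally distributed.
The points fall approximately on a straight line, so the residuals are approximately normal.
R labels observations 5, 10 and 23 as the most extreme residuals (plot.lm labels the three largest by default); they are worth checking as possible outliers.
The assumption that the errors follow a normal distribution is satisfied.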
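For part (d), the problem's note allows a formal test to reinforce the Q-Q plot. A minimal sketch, assuming the Shapiro-Wilk test is an acceptable choice, is given below. It also lists standardized residuals larger than 2 in absolute value, which is only a common rule of thumb for screening outliers, not something the original solution uses. The names teams, fit, sw and flagged are illustrative.

```r
# Data transcribed from the table in the problem statement (30 teams, 2002)
WINS <- c(99, 67, 93, 81, 74, 55, 62, 94, 103, 103, 93, 55, 72, 78, 98,
          101, 67, 78, 73, 79, 84, 92, 56, 83, 75, 80, 72, 97, 66, 95)
HR   <- c(152, 165, 177, 217, 192, 124, 140, 167, 223, 205, 152, 133, 230,
          187, 165, 164, 200, 169, 152, 146, 167, 155, 139, 162, 160, 165,
          142, 175, 136, 198)
BA   <- c(.282, .246, .277, .268, .249, .248, .256, .272, .275, .261, .275,
          .253, .269, .261, .267, .260, .246, .253, .274, .261, .262, .264,
          .253, .261, .256, .259, .244, .268, .253, .267)
ERA  <- c(3.69, 4.46, 3.75, 4.53, 4.91, 4.93, 5.21, 4.12, 3.87, 3.68, 4.07,
          5.29, 5.15, 4.80, 3.92, 3.13, 4.29, 4.27, 5.20, 4.36, 4.00, 3.69,
          4.73, 3.97, 3.89, 4.17, 4.23, 3.70, 4.62, 3.54)
teams <- data.frame(WINS, HR, BA, ERA)

fit <- lm(WINS ~ HR + BA + ERA, data = teams)

# Shapiro-Wilk test; H0: the residuals come from a normal distribution
sw <- shapiro.test(residuals(fit))
print(sw)

# Standardized residuals; |value| > 2 is a rough screening threshold
rs <- rstandard(fit)
flagged <- which(abs(rs) > 2)
print(flagged)
```

A Shapiro-Wilk p-value above 0.05 means normality is not rejected, which would agree with the reading of the Q-Q plot.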
ANSWER(E)
H0: the coefficients of HR, BA and ERA are all zero. H1: at least one of them is not zero.
From the summary output, F = 75.42 on 3 and 26 degrees of freedom, with p-value 5.895e-13.
The p-value is less than 0.05, so we reject H0. The regression is significant.
ANSWER(F)
Multiple R-squared is 0.8969, so about 89.7% of the variability in WINS is explained by the regression.
ANSWER(G)
HR: t = 2.732, p-value = 0.0112 < 0.05, significant.
BA: t = 6.017, p-value = 2.36e-06 < 0.05, significant.
ERA: t = -9.610, p-value = 4.82e-10 < 0.05, significant.
Each predictor is significant at the 0.05 level, given the other two predictors in the model.