In this assignment you will use the baseball salary data found in the Data Sets link on the menu to your left. Under R Instructions, see the document "Some R commands for the baseball salary data" in order to learn how to (a) read the data into R, and (b) use the command lm when you have a large number of independent variables. Please do the following:
(1) Fit a linear regression model with salary as the response and the other 16 variables (excluding names) as the independent variables.
(2) Test the null hypothesis (using level of significance 0.05) that the variables batting average, on base percentage, hits, doubles and triples are not needed in the same model with the other 11 independent variables. Is the result surprising? Give a possible explanation for the result.
(3) What percentage of the variation in salaries is explained by the linear model containing the 11 variables not named in problem (2)?
(4) Obtain residuals from the linear model fitted in (3), and produce the following three plots: (i) the residuals versus the predicted values, (ii) a kernel density estimate of the residuals, and (iii) a normal probability plot of the standardized residuals. Comment on the plots.
You have not provided the data, so I am giving you the R commands to run for each part.
Please let me know if you run into any issues.
### Fit a linear regression model
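#### data <- the baseball salary data, read in as a data frame with the player-name column removed
fit <- lm(salary ~ ., data = data) #### "." stands for all other columns in data; to list the independent variables by hand, join them with a + sign, e.g. salary ~ number + income
summary(fit)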
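As a quick illustration of how the "." shorthand behaves, here is a minimal sketch on made-up data. The data frame `toy` and all of its column names are hypothetical stand-ins, not the real baseball columns.

```r
# Synthetic stand-in for the baseball data (all names here are hypothetical)
set.seed(1)
n <- 100
toy <- data.frame(
  name   = paste0("player", seq_len(n)),
  hits   = rnorm(n, 100, 20),
  runs   = rnorm(n, 50, 10),
  errors = rnorm(n, 10, 3)
)
toy$salary <- 200 + 5 * toy$runs + rnorm(n, 0, 30)

# Drop the name column so that "." expands to the numeric predictors only
toy_model <- toy[, names(toy) != "name"]

# "." means "every column of toy_model other than the response"
fit_toy <- lm(salary ~ ., data = toy_model)
summary(fit_toy)
```

Passing the data frame through the `data` argument is what allows "." to be expanded; a formula written as `toy$salary ~ .` with no `data` argument would fail.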
Test the null hypothesis
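The p-values in summary(fit) test one coefficient at a time (H0: beta_j = 0, with all other variables kept in the model), so they do not answer this question directly. Problem (2) asks whether five variables can be dropped together, which calls for a partial F-test comparing the full model with the reduced model that omits batting average, on base percentage, hits, doubles and triples.
fit_reduced <- lm(salary ~ ., data = data11) #### data11 = data with the five columns named in (2) removed
anova(fit_reduced, fit) #### partial F-test; the p-value is in the Pr(>F) column
If the p-value < 0.05, we reject H0 (all five coefficients are 0) and conclude that at least one of the five variables is needed. Otherwise we fail to reject H0.
Failing to reject H0 does not prove that the coefficients are exactly 0. It only means the data give no evidence that these five variables improve the model once the other 11 are included. If that happens, it may look surprising for batting statistics; a possible explanation is that the five variables are strongly correlated with variables that remain in the model, so they carry little additional information.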
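To test whether a group of variables can be dropped together, a common approach is a partial F-test that compares the full model with the reduced model. The sketch below uses simulated data with hypothetical columns `x1` to `x4`; it is not the baseball data.

```r
# Simulated data: x3 and x4 have no effect on y in this simulation
set.seed(2)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 - 1.5 * toy$x2 + rnorm(n)

full    <- lm(y ~ ., data = toy)            # all four predictors
reduced <- lm(y ~ . - x3 - x4, data = toy)  # x3 and x4 removed

# Partial F-test of H0: the coefficients of x3 and x4 are both zero
f_test <- anova(reduced, full)
print(f_test)

# The p-value sits in the second row of the Pr(>F) column
p_value <- f_test[["Pr(>F)"]][2]
p_value < 0.05  # TRUE would mean rejecting H0 at the 0.05 level
```

For the assignment, the same pattern would apply with the five named variables removed from the full 16-variable model.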
What percentage of the variation is explained?
Problem (3) asks for the model containing the 11 variables not named in part (2), so refit the model with only those variables:
fit1 <- lm(salary ~ ., data = data11) #### data11 = data with the five columns named in part (2) removed, so "." covers the 11 remaining variables
summary(fit1)
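From summary(fit1), read off the Multiple R-squared value:
% of variation explained = 100 x Multiple R-squared
The Adjusted R-squared shown next to it is penalized for the number of independent variables and is not the proportion of variation explained.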
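The R-squared values can also be pulled out of the summary object directly. The sketch below, on simulated data with hypothetical names, checks the reported value against the definition 1 - RSS/TSS.

```r
# Simulated data (hypothetical columns x1, x2, y)
set.seed(3)
n <- 150
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 3 + 2 * toy$x1 + rnorm(n)

fit_toy <- lm(y ~ ., data = toy)
s <- summary(fit_toy)

r2     <- s$r.squared      # Multiple R-squared
adj_r2 <- s$adj.r.squared  # Adjusted R-squared

# Check against the definition: 1 - (residual SS) / (total SS)
rss <- sum(residuals(fit_toy)^2)
tss <- sum((toy$y - mean(toy$y))^2)
all.equal(r2, 1 - rss / tss)

cat(sprintf("Percent of variation explained: %.1f%%\n", 100 * r2))
```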
4) Plots
Getting the residuals and predicted values:
residuals = fit1$residuals #### residuals from the model
pred = fit1$fitted.values #### predicted (fitted) values from the model
The residuals versus the predicted values
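plot(pred, residuals, xlab = "Predicted values", ylab = "Residuals") #### predicted values on the x-axis, residuals on the y-axis
abline(h = 0, lty = 2) #### reference line at zero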
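Part (ii) of the question asks for a kernel density estimate of the residuals. One way to get it is base R's `density()`, sketched here on a simulated fit; `toy`, `fit_toy` and `res_toy` are hypothetical names, and for the assignment the residuals of fit1 would take their place.

```r
# Simulated fit standing in for fit1
set.seed(4)
toy <- data.frame(x = rnorm(120))
toy$y <- 1 + 2 * toy$x + rnorm(120)
fit_toy <- lm(y ~ x, data = toy)
res_toy <- resid(fit_toy)

# Kernel density estimate (Gaussian kernel and default bandwidth)
dens <- density(res_toy)
plot(dens, main = "Kernel density estimate of residuals", xlab = "Residuals")
rug(res_toy)  # tick marks showing the individual residuals
```

A roughly symmetric, bell-shaped curve centered near zero is consistent with normal errors; strong skewness or a long tail suggests otherwise.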
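A normal probability plot of the standardized residuals
std_res <- rstandard(fit1) #### standardized residuals, as the question asks
qqnorm(std_res, main = "Q-Q Plot of standardized residuals vs. Gaussian Distribution", xlab = "Quantiles for Gaussian Distribution", ylab = "Standardized residuals")
qqline(std_res)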
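For reference, the standardized residuals returned by `rstandard()` for a linear model are the raw residuals divided by their estimated standard deviation. The sketch below checks this on a simulated fit; all names are hypothetical.

```r
# Simulated fit standing in for fit1
set.seed(5)
toy <- data.frame(x = rnorm(80))
toy$y <- 2 - toy$x + rnorm(80)
fit_toy <- lm(y ~ x, data = toy)

# Residual / (residual standard error * sqrt(1 - leverage))
sigma_hat  <- summary(fit_toy)$sigma
std_manual <- resid(fit_toy) / (sigma_hat * sqrt(1 - hatvalues(fit_toy)))

# Compare with the built-in function
all.equal(std_manual, rstandard(fit_toy))
```

When commenting on the Q-Q plot, points lying close to the reference line suggest approximately normal residuals, while systematic curvature at the ends points to skewness or heavy tails.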