Question

In this assignment you will use the baseball salary data found in the Data Sets link on the menu to your left. Under R Instructions, see the document "Some R commands for the baseball salary data" in order to learn how to (a) read the data into R, and (b) use the command lm when you have a large number of independent variables. Please do the following:

(1) Fit a linear regression model with salary as the response and the other 16 variables (excluding names) as the independent variables.

(2) Test the null hypothesis (using level of significance 0.05) that the variables batting average, on base percentage, hits, doubles and triples are not needed in the same model with the other 11 independent variables. Is the result surprising? Give a possible explanation for the result.

(3) What percentage of the variation in salaries is explained by the linear model containing the 11 variables not named in problem (2)?

(4) Obtain residuals from the linear model fitted in (3), and produce the following three plots: (i) the residuals versus the predicted values, (ii) a kernel density estimate of the residuals, and (iii) a normal probability plot of the standardized residuals. Comment on the plots.

Solutions

Expert Solution

The data were not provided, so the R commands below are given without output; run them on your own copy of the data.

Please let me know if you have any issues.

### Fit a linear regression model

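data <- read.table(file.choose(), header = TRUE)  #### pick the baseball salary file; header = TRUE is an assumption, use the read command from "Some R commands for the baseball salary data" if it differs

data <- data[, names(data) != "name"]  #### drop the player names; "name" is an assumed column name, check names(data)

fit <- lm(salary ~ ., data = data)  #### "." stands for all remaining columns of data (the 16 independent variables); you can also list them explicitly with + signs, e.g. salary ~ x1 + x2

summary(fit)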
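
The "." shorthand only works when lm is given a data argument, and it picks up every remaining column, including identifiers. A minimal sketch of the idea, using the built-in mtcars data as a stand-in because the salary file is not available here; cars, cars_num and fit_demo are illustrative names, not part of the assignment:

```r
# Stand-in data: mtcars ships with R. A "name" column is added to mimic player names.
cars <- data.frame(name = rownames(mtcars), mtcars, row.names = NULL)

# Drop the identifier column so that "." expands to the numeric predictors only.
cars_num <- cars[, names(cars) != "name"]

# "." means "every other column of cars_num"; it needs the data argument to be resolved.
fit_demo <- lm(mpg ~ ., data = cars_num)

# One intercept plus one coefficient per remaining predictor.
length(coef(fit_demo))
```

With the salary data the same pattern applies: remove the names column, then use salary as the response.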

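Test the null hypothesis

The p-values in summary(fit) come from t-tests that each check one coefficient at a time, so they do not answer question (2). The null hypothesis here is joint: H0: the coefficients of batting average, on base percentage, hits, doubles and triples are all 0, given that the other 11 variables are in the model. Test it with a partial F-test: fit the reduced model with only the other 11 variables and compare it with the full model using anova(reduced, full).

If the p-value of the F-test is < 0.05, we reject H0 and conclude that at least one of the five variables is needed; otherwise we fail to reject H0.

Failing to reject H0 does not prove that the coefficients are 0; it means the data give no evidence that the five variables improve the model once the other 11 are included. If that happens it may look surprising, since these are standard measures of hitting performance. A possible explanation is multicollinearity: the five variables are strongly related to other variables in the data set, so they add little information beyond what is already in the model.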
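
The comparison of a full and a reduced model can be sketched with anova(). The example below uses the built-in mtcars data as a stand-in, since the salary data are not available here; the variable choices (wt, hp, disp, drat, qsec) and the names full, reduced and cmp are illustrative assumptions only:

```r
# Full model: all predictors under consideration (stand-in for the 16-variable model).
full <- lm(mpg ~ wt + hp + disp + drat + qsec, data = mtcars)

# Reduced model: leaves out the group being tested (stand-in for the 11-variable model).
reduced <- lm(mpg ~ wt + hp, data = mtcars)

# Partial F-test of H0: the coefficients of disp, drat and qsec are all 0.
cmp <- anova(reduced, full)
print(cmp)

# The second row holds the test; Df is the number of coefficients tested.
p_value <- cmp[["Pr(>F)"]][2]
if (p_value < 0.05) {
  cat("Reject H0: at least one of the dropped variables is needed.\n")
} else {
  cat("Fail to reject H0: no evidence the dropped variables are needed.\n")
}
```

For the assignment, the full model is the 16-variable fit and the reduced model omits the five variables named in (2), so the Df entry should be 5.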

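What percentage of the variation is explained?

Question (3) asks about the model with the 11 variables not named in (2), whatever the outcome of the test. Fit it by removing those five variables from the full formula:

fit1 <- lm(salary ~ . - avg - obp - hits - doubles - triples, data = data)  #### the five names after "-" are placeholders; replace them with the actual column names for batting average, on base percentage, hits, doubles and triples. data must not contain the names column

summary(fit1)

From the summary, read the Multiple R-squared value (not the Adjusted R-squared, which is penalized for the number of predictors):

% of variation explained = 100 * Multiple R-squared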
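
The value can also be pulled out of the summary object instead of being read off the printout. A minimal sketch on the built-in mtcars data (a stand-in, since the salary data are not available here; fit_demo, s and pct are illustrative names), including a check against the definition 1 - SSE/SST:

```r
# Stand-in model; with the salary data this would be the 11-variable fit.
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(fit_demo)

# Percentage of variation in the response explained by the model.
pct <- 100 * s$r.squared

# Same quantity from its definition: 1 - (residual sum of squares / total sum of squares).
sse <- sum(residuals(fit_demo)^2)
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
pct_by_hand <- 100 * (1 - sse / sst)

all.equal(pct, pct_by_hand)

# The adjusted value is smaller because it penalizes the number of predictors.
s$adj.r.squared < s$r.squared
```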

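4) Plots

Getting the residuals and predicted values:

res <- residuals(fit1)  #### to get the residuals from the model

pred <- fitted(fit1)   #### to get the predicted values from the model

The residuals versus the predicted values

plot(pred, res, xlab = "Predicted values", ylab = "Residuals")  #### predicted values go on the x-axis, residuals on the y-axis
abline(h = 0, lty = 2)

A normal probability plot of the standardized residuals

std_res <- rstandard(fit1)  #### standardized residuals, as the question asks
qqnorm(
  std_res,
  main = "Q-Q Plot of standardized residuals vs. Gaussian Distribution",
  xlab = "Quantiles for Gaussian Distribution",
  ylab = "Standardized residuals"
)
qqline(std_res)

When commenting, look for a funnel shape or curvature in the first plot (non-constant variance or a missing nonlinear term) and for points bending away from the line in the Q-Q plot (skewness or heavy tails).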
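
Part (ii) of question (4) asks for a kernel density estimate of the residuals. A minimal sketch using base R's density(), shown on a stand-in model fitted to the built-in mtcars data because the salary data are not available here; fit_demo, res_demo and kde are illustrative names:

```r
# Stand-in model on mtcars; substitute the fitted 11-variable salary model.
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)
res_demo <- residuals(fit_demo)

# density() computes a Gaussian-kernel estimate with a default bandwidth.
kde <- density(res_demo)

# Plot the estimated density; rug() adds a tick mark for each residual.
plot(kde, main = "Kernel density estimate of residuals", xlab = "Residuals")
rug(res_demo)
```

A roughly symmetric, bell-shaped curve centered at 0 is consistent with normal errors; a long tail on one side suggests skewness.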
