a. Develop a scatter plot with income as the dependent variable and age as the independent variable. Include the estimated regression equation and the coefficient of determination on your scatter plot. Briefly comment on the relationship between the two variables, and fully interpret the coefficient of determination.
b. Using Excel's Regression Tool, develop the estimated regression equation to show how income (y, annual income in $1000s) is related to the independent variables education (x_1, level of education attained in number of years), age (x_2, in years), and gender (x_3, dummy variable, 1 = female, 0 = male). Develop the dummy variable for the gender variable first.
c. Test whether the coefficients obtained in part (b) are significant at 5%. What is your conclusion? [5 points]
d. Fully interpret the meaning of the coefficient on the gender dummy, x_3.
e. Predict the annual income for a female aged 45 with 10 years of education. How much would the predicted income have changed for a male?
f. Plot the standardized residuals against predicted income, ŷ, from the regression in part (b). Check for outliers and explain whether the residual plot supports the assumptions about ε. What is your conclusion? Submit the graph to earn full points.
ATTENTION - COULD ONLY FIT HALF OF DATA. OTHER HALF IS POSTED BY ME AS ANOTHER QUESTION
GENDER | EDUCATION | X3 DUMMY | AGE | Income ($1000) |
male | 12 | 0 | 42 | 120 |
female | 17 | 1 | 28 | 32.5 |
female | 16 | 1 | 36 | 6.5 |
male | 4 | 0 | 52 | 16.25 |
female | 13 | 1 | 35 | 55 |
male | 12 | 0 | 36 | 55 |
female | 13 | 1 | 47 | 45 |
male | 12 | 0 | 55 | 67.5 |
male | 14 | 0 | 54 | 67.5 |
female | 16 | 1 | 45 | 100 |
male | 15 | 0 | 22 | 18.75 |
female | 13 | 1 | 44 | 9 |
female | 14 | 1 | 63 | 55 |
male | 16 | 0 | 40 | 67.5 |
female | 14 | 1 | 42 | 45 |
male | 18 | 0 | 62 | 67.5 |
male | 11 | 0 | 52 | 67.5 |
female | 12 | 1 | 49 | 45 |
female | 17 | 1 | 27 | 32.5 |
female | 14 | 1 | 30 | 45 |
female | 18 | 1 | 29 | 100 |
female | 18 | 1 | 51 | 175 |
female | 16 | 1 | 57 | 175 |
male | 16 | 0 | 44 | 175 |
male | 16 | 0 | 68 | 175 |
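a) The scatter plot is given below.
The linear relationship between age and income is positive but weak: R-square is only 18.1%. R-square is the proportion of the variation in the dependent variable that is explained by the regression equation, so age alone explains about 18.1% of the variation in income and the remaining 81.9% is left unexplained.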
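As a rough check on part (a), the simple regression of income on age can be recomputed from the 25 rows posted above. This is only a sketch using the standard library, not the Excel chart itself; the variable names are my own.

```python
# Age (years) and income ($1000s) copied from the 25 rows of the data table.
age = [42, 28, 36, 52, 35, 36, 47, 55, 54, 45, 22, 44, 63,
       40, 42, 62, 52, 49, 27, 30, 29, 51, 57, 44, 68]
income = [120, 32.5, 6.5, 16.25, 55, 55, 45, 67.5, 67.5, 100,
          18.75, 9, 55, 67.5, 45, 67.5, 67.5, 45, 32.5, 45,
          100, 175, 175, 175, 175]

n = len(age)
mean_age = sum(age) / n
mean_income = sum(income) / n

# Sums of squares and cross-products around the means.
sxy = sum((a - mean_age) * (y - mean_income) for a, y in zip(age, income))
sxx = sum((a - mean_age) ** 2 for a in age)
sst = sum((y - mean_income) ** 2 for y in income)

# Least-squares line income = intercept + slope * age.
slope = sxy / sxx
intercept = mean_income - slope * mean_age

# With one predictor, R-square equals the squared correlation.
r_squared = sxy ** 2 / (sxx * sst)

print(f"income = {intercept:.2f} + {slope:.3f} * age, R-square = {r_squared:.3f}")
```

The R-square from this calculation should agree with the 18.1% quoted above, and the positive slope confirms that income tends to rise with age in this sample.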
b) The regression equation is
Income ($1000) Y = -131.77 + 8.85 EDUCATION X1 - 19.21 DUMMY X3 + 2.00 AGE X2
The Excel output is given below
SUMMARY OUTPUT
Regression Statistics
Multiple R | 0.639925889 |
R Square | 0.409505143 |
Adjusted R Square | 0.325148735 |
Standard Error | 43.567122 |
Observations | 25 |
ANOVA
 | df | SS | MS | F | p.value |
Regression | 3 | 27642.68849 | 9214.229498 | 4.854463961 | 0.010164928 |
Residual | 21 | 39859.97651 | 1898.094119 |
Total | 24 | 67502.665 |
 | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Lower 95.0% | Upper 95.0% |
Intercept | -131.7712464 | 59.02824031 | -2.232342446 | 0.036613045 | -254.5271917 | -9.015301033 | -254.5271917 | -9.015301033 |
EDUCATION (X1) | 8.848443265 | 3.115728165 | 2.839927875 | 0.009809024 | 2.368931861 | 15.32795467 | 2.368931861 | 15.32795467 |
DUMMY (X3) | -19.21093781 | 18.90789047 | -1.01602756 | 0.321180021 | -58.53204845 | 20.11017283 | -58.53204845 | 20.11017283 |
AGE (X2) | 2.002108147 | 0.761463744 | 2.629288868 | 0.015675596 | 0.418557607 | 3.585658686 | 0.418557607 | 3.585658686 |
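For anyone who wants to reproduce the coefficients outside Excel, here is a minimal sketch that solves the normal equations (X'X)b = X'y on the 25 posted rows. It uses only the standard library; the `solve` helper and the variable names are hypothetical and written for this illustration, and this is not how Excel computes the fit internally.

```python
# Rows copied from the data table: (education, age, female dummy, income in $1000s).
rows = [
    (12, 42, 0, 120), (17, 28, 1, 32.5), (16, 36, 1, 6.5), (4, 52, 0, 16.25),
    (13, 35, 1, 55), (12, 36, 0, 55), (13, 47, 1, 45), (12, 55, 0, 67.5),
    (14, 54, 0, 67.5), (16, 45, 1, 100), (15, 22, 0, 18.75), (13, 44, 1, 9),
    (14, 63, 1, 55), (16, 40, 0, 67.5), (14, 42, 1, 45), (18, 62, 0, 67.5),
    (11, 52, 0, 67.5), (12, 49, 1, 45), (17, 27, 1, 32.5), (14, 30, 1, 45),
    (18, 29, 1, 100), (18, 51, 1, 175), (16, 57, 1, 175), (16, 44, 0, 175),
    (16, 68, 0, 175),
]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


# Design matrix with a leading 1 for the intercept: [1, education, age, female].
X = [[1.0, edu, age, female] for edu, age, female, _ in rows]
y = [row[3] for row in rows]

# Build and solve the normal equations.
k = len(X[0])
xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
b0, b_edu, b_age, b_female = solve(xtx, xty)

# R-square = 1 - SSE / SST.
y_bar = sum(y) / len(y)
fitted = [b0 + b_edu * r[1] + b_age * r[2] + b_female * r[3] for r in X]
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
sst = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1 - sse / sst

print(f"intercept={b0:.2f}, education={b_edu:.2f}, age={b_age:.2f}, "
      f"female={b_female:.2f}, R-square={r_squared:.4f}")
```

The printed values can be compared against the Coefficients column and the R Square entry in the Excel output.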
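c) The p-value of the overall F test is 0.01016, which is less than 0.05, so the regression model as a whole is significant at the 5% level. For the individual coefficients, the t tests give p-values of 0.0098 for education and 0.0157 for age, both less than 0.05, so these two coefficients are significant. The p-value for the gender dummy is 0.3212, which is greater than 0.05, so the coefficient on gender is not significant at 5%.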
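Part (c) asks about the individual coefficients, so the t tests can also be checked by hand from the Excel output. The sketch below divides each coefficient by its standard error and compares the result with 2.080, the two-tailed 5% critical value of the t distribution with 21 degrees of freedom (25 observations minus 4 estimated parameters). The dictionary layout and names are my own.

```python
# Coefficients and standard errors copied from the Excel output: (coefficient, standard error).
estimates = {
    "Intercept": (-131.7712464, 59.02824031),
    "EDUCATION (X1)": (8.848443265, 3.115728165),
    "DUMMY (X3)": (-19.21093781, 18.90789047),
    "AGE (X2)": (2.002108147, 0.761463744),
}

# Two-tailed 5% critical value for t with 21 residual degrees of freedom.
T_CRITICAL = 2.080

# t statistic = coefficient / standard error; significant if |t| exceeds the critical value.
t_stats = {name: coef / se for name, (coef, se) in estimates.items()}
significant = {name: abs(t) > T_CRITICAL for name, t in t_stats.items()}

for name, t in t_stats.items():
    verdict = "significant" if significant[name] else "not significant"
    print(f"{name}: t = {t:.3f} -> {verdict} at 5%")
```

This reaches the same decisions as comparing each P-value in the output with 0.05: education and age pass the test, the gender dummy does not.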
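d) The coefficient on the gender dummy (X3) is -19.21. Because X3 = 1 for females and 0 for males, this means that, holding education and age constant, the predicted annual income of a female is about 19.21 thousand dollars ($19,210) lower than that of a male. Since this coefficient is not significant at 5% (p-value = 0.3212), the data do not provide strong evidence of an income difference by gender.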
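e) The regression equation is
Income ($1000) Y = -131.77 + 8.85 EDUCATION X1 - 19.21 DUMMY X3 + 2.00 AGE X2
For a female (X3 = 1) aged 45 with 10 years of education:
Income ($1000) Y = -131.77 + 8.85*10 - 19.21*1 + 2.00*45 = 27.52, i.e. about $27,520.
For a male (X3 = 0) with the same age and education:
Income ($1000) Y = -131.77 + 8.85*10 - 19.21*0 + 2.00*45 = 46.73, i.e. about $46,730.
The predicted income for a male is therefore 19.21 higher (about $19,210), which is exactly the size of the gender coefficient.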
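The prediction in part (e) is just the fitted equation evaluated twice, once with the dummy set to 1 (female) and once with it set to 0 (male). A small sketch using the rounded coefficients from the equation above; the function name `predict_income` is hypothetical.

```python
def predict_income(education, age, female):
    """Predicted annual income in $1000s from the rounded fitted equation."""
    return -131.77 + 8.85 * education + 2.00 * age - 19.21 * female


# Female (dummy = 1) and male (dummy = 0), both aged 45 with 10 years of education.
female_income = predict_income(education=10, age=45, female=1)
male_income = predict_income(education=10, age=45, female=0)

print(f"female: {female_income:.2f}, male: {male_income:.2f}, "
      f"difference: {male_income - female_income:.2f}")
```

With the rounded coefficients this gives 27.52 for the female and 46.73 for the male; using the unrounded coefficients from the Excel output would shift both predictions slightly, but the gap between them stays equal to the gender coefficient.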
f)
The residual plot does not support the assumptions about the error term ε: the residuals are not randomly scattered around zero but are concentrated in the center of the plot. The constant-variance (homoscedasticity) assumption is also not met.
No outliers were detected in the data, as the Q-Q plot given below shows.
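The values behind the residual plot in part (f) can be recomputed from the posted rows and the fitted coefficients. The sketch below divides each residual by the standard error of the estimate (43.567), which is one common way to standardize; this is an assumption on my part, and Excel's own "Standard Residuals" column may use a slightly different divisor, so borderline points can differ. The cutoff of 2 for flagging possible outliers is a rule of thumb.

```python
# Rows copied from the data table: (education, age, female dummy, income in $1000s).
rows = [
    (12, 42, 0, 120), (17, 28, 1, 32.5), (16, 36, 1, 6.5), (4, 52, 0, 16.25),
    (13, 35, 1, 55), (12, 36, 0, 55), (13, 47, 1, 45), (12, 55, 0, 67.5),
    (14, 54, 0, 67.5), (16, 45, 1, 100), (15, 22, 0, 18.75), (13, 44, 1, 9),
    (14, 63, 1, 55), (16, 40, 0, 67.5), (14, 42, 1, 45), (18, 62, 0, 67.5),
    (11, 52, 0, 67.5), (12, 49, 1, 45), (17, 27, 1, 32.5), (14, 30, 1, 45),
    (18, 29, 1, 100), (18, 51, 1, 175), (16, 57, 1, 175), (16, 44, 0, 175),
    (16, 68, 0, 175),
]

# Coefficients and standard error of the estimate copied from the Excel output.
B0, B_EDU, B_FEMALE, B_AGE = -131.7712464, 8.848443265, -19.21093781, 2.002108147
STD_ERROR = 43.567122

# Predicted income, raw residual, and standardized residual for every row.
predicted = [B0 + B_EDU * edu + B_AGE * age + B_FEMALE * female
             for edu, age, female, _ in rows]
residuals = [row[3] - p for row, p in zip(rows, predicted)]
standardized = [e / STD_ERROR for e in residuals]

# Flag points whose standardized residual is larger than 2 in absolute value.
flagged = [(round(p, 1), round(z, 2))
           for p, z in zip(predicted, standardized) if abs(z) > 2]

for p, z in zip(predicted, standardized):
    print(f"predicted = {p:7.2f}, standardized residual = {z:6.2f}")
print("possible outliers:", flagged)
```

Plotting `standardized` against `predicted` gives the chart the question asks for; a random band around zero with roughly constant spread would support the assumptions about ε, while a funnel or curve would not.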