In: Math
ID Year
CornYield SoyBeanYield
1 1957
48.3 23.2
2 1958
52.8 24.2
3 1959
53.1 23.5
4 1960
54.7 23.5
5 1961
62.4 25.1
6 1962
64.7 24.2
7 1963
67.9 24.4
8 1964
62.9 22.8
9 1965
74.1 24.5
10 1966
73.1 25.4
11 1967
80.1 24.5
12 1968
79.5 26.7
13 1969
85.9 27.4
14 1970
72.4 26.7
15 1971
88.1 27.5
16 1972
97 27.8
17 1973
91.3 27.8
18 1974
71.9 23.7
19 1975
86.4 28.9
20 1976
88 26.1
21 1977
90.8 30.6
22 1978
101 29.4
23 1979
109.5 32.1
24 1980
91 26.5
25 1981
108.9 30.1
26 1982
113.2 31.5
27 1983
81.1 26.2
28 1984
106.7 28.1
29 1985
118 34.1
30 1986
119.4 33.3
31 1987
119.8 33.9
32 1988
84.6 27.0
33 1989
116.3 32.3
34 1990
118.5 34.1
35 1991
108.6 34.2
36 1992
131.5 37.6
37 1993
100.7 32.6
38 1994
138.6 41.4
39 1995
113.5 35.3
40 1996
127.1 37.6
41 1997
126.7 38.9
42 1998
134.4 38.9
43 1999
133.8 36.6
44 2000
136.9 38.1
45 2001
138.2 39.6
46 2002
129.3 38.0
47 2003
142.2 33.9
48 2004
160.3 42.2
49 2005
147.9 43.1
50 2006
149.1 42.9
51 2007
150.7 41.7
Use both predictors. From the previous two exercises, we conclude that year and soybean may be useful together in a model for predicting corn yield. Run this multiple regression.
a) Explain the results of the ANOVA F test. Give the null and alternate hypothesis, test statistic with degrees of freedom, and p-value. What do you conclude?
b) What percent of the variation in corn yield in explained by these two variables? Compare it with the percent explained in the previous simple linear regression models.
c) State the regression model. Why do the coefficients for year and soybean differ from those in the previous exercises?
d) Summarize the significance test results for the regression coefficients for year and soybean yield.
e) Give a 95% confidence interval for each of these coefficients.
f) Plot the residual versus year and soybean yield. What do you conclude?
Note that the variable year needs to be scaled properly before being used for regression as its scale is very large compared to the other variables.
Let's assume that time t=0 corresponds to the year 1957, t=1 corresponds to year 1958 and so on. Then use Excel Data Analysis to run the multiple linear regression with 'Corn Yield' as the 'Y-Range' and 'Soyabean Yield' and 'Time' as X - Range.
You will have following output -
SUMMARY OUTPUT | |||||||||
Regression Statistics | |||||||||
Multiple R | 0.969258235 | ||||||||
R Square | 0.939461527 | ||||||||
Adjusted R Square | 0.93693909 | ||||||||
Standard Error | 7.471509358 | ||||||||
Observations | 51 | ||||||||
ANOVA | |||||||||
df | SS | MS | F | Significance F | |||||
Regression | 2 | 41582.00842 | 20791.00421 | 372.4421086 | 5.87158E-30 | ||||
Residual | 48 | 2679.5257 | 55.82345209 | ||||||
Total | 50 | 44261.53412 | |||||||
Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Lower 95.0% | Upper 95.0% | ||
Intercept | -1.604119694 | 9.859131867 | -0.16270395 | 0.871434589 | -21.42723291 | 18.21899352 | -21.42723291 | 18.21899352 | |
SoyBeanYield | 2.557739532 | 0.453729415 | 5.637147267 | 8.9397E-07 | 1.645455399 | 3.470023664 | 1.645455399 | 3.470023664 | |
Time | 0.9190585 | 0.187647498 | 4.897792453 | 1.14082E-05 | 0.541767918 | 1.296349082 | 0.541767918 | 1.296349082 | |
-------------------------------------
a)
If the multiple linear regression model is : Y = + (SoyBean Yield) + (Time), then -
Null Hypothesis - H0:
Alternate Hypothesis - At least one of the for i = 1,2 is not =0.
The test statistic = F = 372.4421086
Degree of freedom of numerator and denominator are 2 and 48 respectively.
p-value is less than 0.0001.
As p-value is less than significance level of 0.05, so we reject the null hypothesis and conclude that the regression model is significant. That means at least one of the coefficient is significant.
----------------------------------------------------------------
b)
The variation in dependent variable as predicted by the independent variables is given by the coefficient of variation i.e. R-squared value.
From the output, we can see that R-Squared value is = 0.9395
So, about 93.95% of the variation in corn yield is explained by those two variabes.
----------------------------------------
c)
The regression model is -
Corn Yield = -1.6041 + 2.5577(SoyBean Yield) + 0.9191(Time)
The coefficient are different here because of the interaction of the two predictors. Earlier you used the two variables separately to get two different models so there was no interaction of predictors. But in this case there is interaction of predictors. That causes change in coefficient value.
-------------------------------------------------------------------
d)
As the p-value for both the coefficients of 'SoyBean' and 'Time' are much smaller than the significance level of 0.05, so in both cases we would reject the null hypothesis and conclude that the coefficient values are statistically significant. Hence the two predictors are significant.
----------------------------------------
e)
The result gives 95% confidence interval for each coefficients as -
For SoyBean - (1.6455, 3.470)
For Time - (0.5418, 1.2963)
---------------------------------------------------------
f)
Following are the plots -
Note that the residuals are all within a band and do not follow a particular pattern. So, the assumption of normality of residuals is valid for both the variables. Hence the model is good to be used.
-------------------------------------
-----------------------------------------------
Note that I have scaled the year data to time with reference to year 1957. If you need the coefficients for the year itself, you don't need to scale it. You can directly use the year values.