Consider the following set of observations:
Obs. | input | result
1 | 1 | 1
2 | 2 | 2
3 | 3 | 3
4 | 4 | 5
5 | 5 | 8
6 | 6 | 13
7 | 7 | 21
8 | 8 | 34
9 | 9 | 55
10 | 10 | 89
11 | 11 | 144
12 | 12 | 233
13 | 13 | 377
14 | 14 | 610
(a) Enter the data in L1 and L2 in your TI calculator, find the regression line, and construct a scatterplot with the regression line included. Does a line appear to be a good model for these data? Be sure to check your residuals plot. (7 points: 2 points for regression line, 2 points for scatterplot, 2 points for residual plot, 1 point for comment)
(b) What is r²?
(c) What type of relationship do the data appear to have (linear, logarithmic, exponential, etc.)?
(d) What type of re-expression would work in this case? (1 point)
(e) Find the natural logarithm of the y-values.
(f) Draw a scatterplot of x vs. ln y. Find the regression equation of ln y on x and include it on the graph. Does it appear to be a better fit than the fit in part (a)? Be sure to check your residuals plot. (7 points: 2 points for regression line, 2 points for scatterplot, 2 points for residual plot, 1 point for comment)
(g) Write a prediction (regression) equation for your re-expressed data.
(h) Use the regression equation you found in part (f) to predict the value of y when x = 10.5.
(i) Does your answer for part (h) seem reasonable? Why or why not?
(j) Explain the importance of checking the residuals plot before re-expressing data and then again after re-expressing data.
We can use simple R code (or Excel) to solve this question. It is no different from using the TI calculator, in that we do not have to do the calculations by hand. We will use RStudio for this.
The data inputs would be as below.
> input <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14)
> result <- c(1,2,3,5,8,13,21,34,55,89,144,233,377,610)
The regression equation would be as below.
> lm(result ~ input)
Call:
lm(formula = result ~ input)
Coefficients:
(Intercept)        input
    -143.69        34.35
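As a cross-check on the lm() output, the slope and intercept can also be computed from the usual least-squares formulas. This is only a sketch. The names sxy, sxx, slope, intercept and fit are chosen here for illustration and are not part of the original solution.

```r
# Data from the question
input  <- 1:14
result <- c(1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610)

# Least-squares slope and intercept from the textbook formulas
sxy <- sum((input - mean(input)) * (result - mean(result)))
sxx <- sum((input - mean(input))^2)
slope     <- sxy / sxx
intercept <- mean(result) - slope * mean(input)

# Compare with lm(); the two should agree up to rounding
fit <- lm(result ~ input)
print(c(intercept = intercept, slope = slope))
print(coef(fit))
```

The hand-computed values should round to the same -143.69 and 34.35 reported by lm().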
The scatter plot with regression line and the residual plot would be as below.
> plot(x=input,y=result)
> abline(lm(result ~ input))
> plot(x=input,y=resid(lm(result~input)))
Looking at the scatter plot with the regression line, the line does not appear to be a good model for these data. The line does not match the pattern of the data: the data points look parabolic/exponential, while the regression line is linear. Moreover, the first four data points all lie above the regression line, the next eight lie below it, and the last two lie above it again. Also, the points in the residual plot are not randomly scattered but follow a clear pattern.
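The pattern in the residuals can also be checked numerically rather than by eye. The sketch below looks at the sign of each residual and the lengths of the runs of equal signs. The names fit, signs and runs are illustrative, not from the original solution.

```r
# Data from the question
input  <- 1:14
result <- c(1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610)

fit <- lm(result ~ input)

# +1 means the point lies above the fitted line, -1 means below
signs <- sign(resid(fit))
print(unname(signs))

# Lengths of consecutive runs of same-sign residuals
runs <- rle(as.vector(signs))
print(runs$lengths)   # 4 8 2
print(runs$values)    # 1 -1 1
```

Long runs of same-sign residuals like these are what a curved relationship fitted with a straight line typically produces; randomly scattered residuals would change sign much more often.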
The r-squared of the regression would be as below.
> summary(lm(result~input))$r.squared
[1] 0.6385676
Hence, the r-squared is 0.6385676.
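The data appear to have an exponential relationship. The slope increases as the input grows, and each result is roughly a constant multiple (about 1.6 times) of the previous one, which is the signature of exponential growth.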
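One quick way to support the exponential reading is to look at the ratios of successive results: for exponential growth with equally spaced inputs, these ratios should be roughly constant. A minimal sketch (the name ratios is illustrative):

```r
# Results from the question
result <- c(1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610)

# Ratio of each value to the one before it
ratios <- result[-1] / result[-length(result)]
print(round(ratios, 3))
```

After the first few observations the ratios settle near 1.618, so a constant-ratio (exponential) model is plausible.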
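Exponential data would have a relationship of the form $y = a e^{bx}$, which can be transformed to a straight line by taking natural logarithms: $\ln y = \ln a + bx$. A suitable re-expression is therefore to take the natural logarithm of the result values.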
The natural log of result values would be as below.
> log(result)
 [1] 0.0000000 0.6931472 1.0986123 1.6094379 2.0794415
 [6] 2.5649494 3.0445224 3.5263605 4.0073332 4.4886364
[11] 4.9698133 5.4510385 5.9322452 6.4134590
The regression for the log of y and x would be as below.
> lm(log(result) ~ input)
Call:
lm(formula = log(result) ~ input)
Coefficients:
(Intercept)        input
    -0.3584       0.4847
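The regression equation would be $\ln \hat{y} = -0.3584 + 0.4847x$, or $\hat{y} = e^{-0.3584 + 0.4847x}$, or $\hat{y} = e^{-0.3584} \cdot e^{0.4847x}$, or approximately $\hat{y} = 0.6988 \cdot (1.6237)^{x}$.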
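The coefficients of the log-scale fit can be turned back into the original scale by exponentiating them. The sketch below assumes the model form y = a * b^x, with a and b as illustrative names that are not in the original solution.

```r
# Data from the question
input  <- 1:14
result <- c(1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610)

# Fit a straight line to ln(result) against input
log_fit <- lm(log(result) ~ input)

# Back-transform: a = exp(intercept), b = exp(slope), so y is roughly a * b^x
a <- exp(unname(coef(log_fit)[1]))
b <- exp(unname(coef(log_fit)[2]))
print(c(a = a, b = b))
```

Here b is the estimated growth factor per unit of input, which is why it comes out close to the ratio between successive results.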
The scatter plot with the regression line and the residual plot would be as below.
> plot(x=input,y=log(result))
> abline(lm(log(result)~input))
> plot(x=input, y=resid(lm(log(result)~input)))
Looking at the scatter plot, this does appear to be a much better fit than the original linear regression. The residual plot also looks better than before (although there is still a pattern after the 3rd data point).
The regression equation would be $\ln \hat{y} = -0.3584 + 0.4847x$, or $\hat{y} = e^{-0.3584 + 0.4847x}$.
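For an input of 10.5, we have $\ln \hat{y} = -0.3584 + 0.4847(10.5)$, or $\ln \hat{y} \approx 4.731$. Using $\hat{y} = e^{\ln \hat{y}}$, we have $\hat{y} \approx e^{4.731} \approx 113.4$.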
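The same prediction can be obtained directly in R with predict(), followed by exp() to undo the log re-expression. This is a sketch; the names log_fit, ln_pred and y_pred are illustrative.

```r
# Data from the question
input  <- 1:14
result <- c(1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610)

# Regression of ln(result) on input
log_fit <- lm(log(result) ~ input)

# Predicted ln(y) at x = 10.5
ln_pred <- predict(log_fit, newdata = data.frame(input = 10.5))

# Back-transform to the original scale
y_pred <- exp(unname(ln_pred))
print(y_pred)
```

Because R keeps the unrounded coefficients, the result may differ slightly in the last digits from a hand calculation with the rounded values.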
As the predicted value lies between the results for inputs 10 and 11 (89 and 144), and is close to their midpoint, 116.5, the predicted result seems reasonable.
Residual plots matter because they show how well a model fits and whether a re-expression is needed. Before re-expressing, the residual plot has a clear pattern, which suggests a non-linear relationship; without that check, an r-squared of about 64% might not look bad at all. After re-expressing the data, the residual plot is more stable and closer to random than before, suggesting that the fit is much better. The new r-squared is 99.95416%, which is quite high, suggesting that the re-expression gives a better model.
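The two r-squared values can be put side by side with a few lines of R. This is a sketch; r2_linear and r2_log are illustrative names.

```r
# Data from the question
input  <- 1:14
result <- c(1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610)

# r-squared before and after the log re-expression
r2_linear <- summary(lm(result ~ input))$r.squared
r2_log    <- summary(lm(log(result) ~ input))$r.squared
print(c(linear = r2_linear, log = r2_log))
```

Note that the two values are measured on different scales (result versus ln(result)), so they are best read together with the residual plots rather than as a strict like-for-like comparison.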