Regression
Is there a relationship between the number of stories a building has and its height? Some statisticians compiled data on a set of n = 60 buildings reported in the World Almanac. You will use the data set to decide whether height (in feet) can be predicted from the number of stories.
(a) Load the data from buildings.txt.
(Note that this is a text file, so use the appropriate instruction. If you are having trouble uploading the data, open it to see its contents and type the data in: one vector for heights and one vector for stories. Ignore the year data.)
buildings.txt
YEAR Height Stories
1990 770 54
1980 677 47
1990 428 28
1989 410 38
1966 371 29
1976 504 38
1974 1136 80
1991 695 52
1982 551 45
1986 550 40
1931 568 49
1979 504 33
1988 560 50
1973 512 40
1981 448 31
1983 538 40
1968 410 27
1927 409 31
1969 504 35
1988 777 57
1987 496 31
1960 386 26
1984 530 39
1976 360 25
1920 355 23
1931 1250 102
1989 802 72
1907 741 57
1988 739 54
1990 650 56
1973 592 45
1983 577 42
1971 500 36
1969 469 30
1971 320 22
1988 441 31
1989 845 52
1973 435 29
1987 435 34
1931 375 20
1931 364 33
1924 340 18
1931 375 23
1991 450 30
1973 529 38
1976 412 31
1990 722 62
1983 574 48
1984 498 29
1986 493 40
1986 379 30
1992 579 42
1973 458 36
1988 454 33
1979 952 72
1972 784 57
1930 476 34
1978 453 46
1978 440 30
1977 428 21
(b) Draw a scatterplot with stories on the x-axis and height on the y-axis. Describe the trend, strength and shape of the relationship between stories and height.
(c) Find the linear correlation coefficient between these variables. How does it support the description you gave in (b)?
(d) Obtain the linear model and summary. Write down the regression equation that relates height with stories. Add the line to the scatterplot.
(e) Test for significance of the regression at $\alpha = 0.05$. State the null and alternative hypotheses. Can the model be used for predictions? Justify your conclusion using the summary in (d).
(f) State the coefficient of determination. What percentage of variation in height is explained by the number of stories?
(g) Draw diagnostic plots (a plot of stories vs. residuals, and a normal probability plot for the residuals). Do assumptions appear to be satisfied?
(h) Obtain a 95% confidence interval for the true value of the slope. How does the interval support your conclusion in (e)?
(i) What is the estimated height of a building that is 45 stories high? Write a concluding sentence supported by your results above.
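(a) The extracted data are given below: one vector of heights (H, in feet) and one vector of stories (S), each with all 60 values in the same order as buildings.txt.
H <- c(770, 677, 428, 410, 371, 504, 1136, 695, 551, 550, 568, 504, 560, 512, 448, 538, 410, 409, 504, 777, 496, 386, 530, 360, 355, 1250, 802, 741, 739, 650, 592, 577, 500, 469, 320, 441, 845, 435, 435, 375, 364, 340, 375, 450, 529, 412, 722, 574, 498, 493, 379, 579, 458, 454, 952, 784, 476, 453, 440, 428)
S <- c(54, 47, 28, 38, 29, 38, 80, 52, 45, 40, 49, 33, 50, 40, 31, 40, 27, 31, 35, 57, 31, 26, 39, 25, 23, 102, 72, 57, 54, 56, 45, 42, 36, 30, 22, 31, 52, 29, 34, 20, 33, 18, 23, 30, 38, 31, 62, 48, 29, 40, 30, 42, 36, 33, 72, 57, 34, 46, 30, 21)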
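Instead of typing the vectors by hand, the file can be read directly. The following is a minimal sketch, assuming the file has the header row `YEAR Height Stories` shown in the question. So that it runs without the file on disk, it writes the first three rows to a temporary file (`demo_path` is a name made up for this illustration); with the real file, pass "buildings.txt" to `read.table()` instead.

```r
# Write the header and the first three rows of the data to a temporary file.
# This stands in for buildings.txt so the sketch is runnable on its own.
demo_path <- tempfile(fileext = ".txt")
writeLines(c("YEAR Height Stories",
             "1990 770 54",
             "1980 677 47",
             "1990 428 28"), demo_path)

# header = TRUE uses the first line as column names;
# the default separator (whitespace) matches the file layout.
buildings <- read.table(demo_path, header = TRUE)

H_demo <- buildings$Height   # heights in feet
S_demo <- buildings$Stories  # number of stories (YEAR is ignored)
str(buildings)
```

With the full file, `nrow(buildings)` should be 60, which is a quick check that every row was read.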
(b) SCATTER PLOT:
A scatterplot shows how two quantitative variables vary together. The points rise from left to right, which indicates a positive relationship between Stories and Height: as the value of "Stories" increases (moving right), the value of "Height" tends to increase (moving up). The points also follow a straight-line pattern with no visible curvature, so there is a positive linear relationship between Stories and Height.
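The scatterplot described above can be drawn with base R graphics. This is a sketch using the 60 values from the data table; the axis labels, title and plotting symbol are illustrative choices, not taken from the original solution.

```r
# Heights (feet) and stories for the 60 buildings listed in buildings.txt
H <- c(770, 677, 428, 410, 371, 504, 1136, 695, 551, 550, 568, 504, 560, 512, 448,
       538, 410, 409, 504, 777, 496, 386, 530, 360, 355, 1250, 802, 741, 739, 650,
       592, 577, 500, 469, 320, 441, 845, 435, 435, 375, 364, 340, 375, 450, 529,
       412, 722, 574, 498, 493, 379, 579, 458, 454, 952, 784, 476, 453, 440, 428)
S <- c(54, 47, 28, 38, 29, 38, 80, 52, 45, 40, 49, 33, 50, 40, 31,
       40, 27, 31, 35, 57, 31, 26, 39, 25, 23, 102, 72, 57, 54, 56,
       45, 42, 36, 30, 22, 31, 52, 29, 34, 20, 33, 18, 23, 30, 38,
       31, 62, 48, 29, 40, 30, 42, 36, 33, 72, 57, 34, 46, 30, 21)

# When run as a script, draw to a null device instead of creating Rplots.pdf
if (!interactive()) pdf(NULL)

# Stories on the x-axis, height on the y-axis, as part (b) asks
plot(S, H,
     xlab = "Stories", ylab = "Height (feet)",
     main = "Height vs. number of stories", pch = 19)
```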
(c) LINEAR CORRELATION COEFFICIENT:
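The sample correlation coefficient $r$ between Height and Stories is positive and close to 1 (about 0.95, consistent with the $R^2$ of roughly 0.91 reported in (f), since $r^2 = R^2$ in simple linear regression). It is significant because its p-value is less than the significance level $\alpha = 0.05$. The positive sign and the magnitude near 1 indicate a strong positive linear relationship between Stories and Height, which supports the description in (b).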
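The coefficient and its significance test can be obtained with `cor()` and `cor.test()`. The sketch below assumes the Pearson correlation (the `cor.test()` default) and the two-sided test of $H_0: \rho = 0$; the original solution does not say which call it used.

```r
# Heights (feet) and stories for the 60 buildings listed in buildings.txt
H <- c(770, 677, 428, 410, 371, 504, 1136, 695, 551, 550, 568, 504, 560, 512, 448,
       538, 410, 409, 504, 777, 496, 386, 530, 360, 355, 1250, 802, 741, 739, 650,
       592, 577, 500, 469, 320, 441, 845, 435, 435, 375, 364, 340, 375, 450, 529,
       412, 722, 574, 498, 493, 379, 579, 458, 454, 952, 784, 476, 453, 440, 428)
S <- c(54, 47, 28, 38, 29, 38, 80, 52, 45, 40, 49, 33, 50, 40, 31,
       40, 27, 31, 35, 57, 31, 26, 39, 25, 23, 102, 72, 57, 54, 56,
       45, 42, 36, 30, 22, 31, 52, 29, 34, 20, 33, 18, 23, 30, 38,
       31, 62, 48, 29, 40, 30, 42, 36, 33, 72, 57, 34, 46, 30, 21)

r  <- cor(S, H)        # Pearson linear correlation coefficient
ct <- cor.test(S, H)   # two-sided test of H0: rho = 0

print(r)
print(ct$p.value)
print(ct$p.value < 0.05)   # TRUE means the correlation is significant at alpha = 0.05
```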
(d) SIMPLE LINEAR REGRESSION MODEL:
ESTIMATED REGRESSION EQUATION:
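From the regression output, the estimated regression equation has the form
$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$
where $\hat{Y}$ is the predicted value of the dependent variable "Height",
$\hat{\beta}_0$ is the estimated intercept,
$\hat{\beta}_1$ is the estimated slope coefficient of the variable "Stories", and
$X$ is the independent variable "Stories".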
SCATTER PLOT WITH TREND LINE:
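One way to fit the model and draw the trend line is `lm()` followed by `abline()`. This is a sketch under the assumption that the model is Height regressed on Stories with an intercept; the line colour and width are illustrative.

```r
# Heights (feet) and stories for the 60 buildings listed in buildings.txt
H <- c(770, 677, 428, 410, 371, 504, 1136, 695, 551, 550, 568, 504, 560, 512, 448,
       538, 410, 409, 504, 777, 496, 386, 530, 360, 355, 1250, 802, 741, 739, 650,
       592, 577, 500, 469, 320, 441, 845, 435, 435, 375, 364, 340, 375, 450, 529,
       412, 722, 574, 498, 493, 379, 579, 458, 454, 952, 784, 476, 453, 440, 428)
S <- c(54, 47, 28, 38, 29, 38, 80, 52, 45, 40, 49, 33, 50, 40, 31,
       40, 27, 31, 35, 57, 31, 26, 39, 25, 23, 102, 72, 57, 54, 56,
       45, 42, 36, 30, 22, 31, 52, 29, 34, 20, 33, 18, 23, 30, 38,
       31, 62, 48, 29, 40, 30, 42, 36, 33, 72, 57, 34, 46, 30, 21)

fit <- lm(H ~ S)     # simple linear regression of height on stories
print(coef(fit))     # "(Intercept)" is the intercept estimate, "S" is the slope estimate
print(summary(fit))  # coefficient table, R-squared and F-test

# When run as a script, draw to a null device instead of creating Rplots.pdf
if (!interactive()) pdf(NULL)

plot(S, H, xlab = "Stories", ylab = "Height (feet)", pch = 19)
abline(fit, col = "red", lwd = 2)   # adds the fitted least-squares line
```

The two numbers printed by `coef(fit)` are the values that go into the regression equation for (d).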
(e) SIGNIFICANCE OF INDIVIDUAL PREDICTOR:
We use a t-test to test the significance of individual predictors.
HYPOTHESIS:
The hypotheses for the t-test on the slope are
$H_0: \beta_1 = 0$ versus $H_1: \beta_1 \neq 0$.
From the regression output, the t-test p-value for the slope coefficient of the variable "Stories" is less than the significance level $\alpha = 0.05$, so we reject $H_0$ and conclude that "Stories" is a significant predictor of "Height". The intercept term is also significant, since its p-value is less than $\alpha = 0.05$.
INTERCEPT:
The intercept estimate $\hat{\beta}_0$ is the estimated mean height when the variable "Stories" equals 0; its value is read from the intercept row of the regression output.
SLOPE COEFFICIENT:
The slope estimate $\hat{\beta}_1$ can be interpreted as follows: as the number of stories increases by 1, the mean height increases by $\hat{\beta}_1$ feet.
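The slope interpretation can be checked numerically by predicting the height at 45 stories (the value asked for in part (i)) and at 46 stories: the two predictions should differ by exactly the slope estimate. This is a sketch; the data frame and its column names are chosen for the illustration.

```r
# Heights (feet) and stories for the 60 buildings listed in buildings.txt
H <- c(770, 677, 428, 410, 371, 504, 1136, 695, 551, 550, 568, 504, 560, 512, 448,
       538, 410, 409, 504, 777, 496, 386, 530, 360, 355, 1250, 802, 741, 739, 650,
       592, 577, 500, 469, 320, 441, 845, 435, 435, 375, 364, 340, 375, 450, 529,
       412, 722, 574, 498, 493, 379, 579, 458, 454, 952, 784, 476, 453, 440, 428)
S <- c(54, 47, 28, 38, 29, 38, 80, 52, 45, 40, 49, 33, 50, 40, 31,
       40, 27, 31, 35, 57, 31, 26, 39, 25, 23, 102, 72, 57, 54, 56,
       45, 42, 36, 30, 22, 31, 52, 29, 34, 20, 33, 18, 23, 30, 38,
       31, 62, 48, 29, 40, 30, 42, 36, 33, 72, 57, 34, 46, 30, 21)

buildings <- data.frame(Height = H, Stories = S)
fit <- lm(Height ~ Stories, data = buildings)
b   <- coef(fit)

# Predicted mean height at 45 and 46 stories
pred <- predict(fit, newdata = data.frame(Stories = c(45, 46)))

# The same prediction at 45 stories, computed from the regression equation
by_hand <- b[["(Intercept)"]] + b[["Stories"]] * 45

print(pred)
print(by_hand)
print(diff(pred))   # one extra story: equals the slope estimate
```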
TEST FOR OVERALL SIGNIFICANCE OF MODEL:
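We use an F-test to determine the overall significance of the model.
The hypotheses are
$H_0: \beta_1 = 0$ (the model explains none of the variation in Height) versus $H_1: \beta_1 \neq 0$.
From the regression output, the F-test p-value is less than the significance level $\alpha = 0.05$, so we reject $H_0$ and conclude that the overall model is significant and can be used for predictions. With a single predictor, this F-test is equivalent to the t-test on the slope.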
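The t-test, the F-test and the confidence interval for the slope asked for in part (h) can all be read from the fitted model. The sketch below assumes the model is `lm(H ~ S)`; it also illustrates that with one predictor the F statistic is the square of the slope's t statistic.

```r
# Heights (feet) and stories for the 60 buildings listed in buildings.txt
H <- c(770, 677, 428, 410, 371, 504, 1136, 695, 551, 550, 568, 504, 560, 512, 448,
       538, 410, 409, 504, 777, 496, 386, 530, 360, 355, 1250, 802, 741, 739, 650,
       592, 577, 500, 469, 320, 441, 845, 435, 435, 375, 364, 340, 375, 450, 529,
       412, 722, 574, 498, 493, 379, 579, 458, 454, 952, 784, 476, 453, 440, 428)
S <- c(54, 47, 28, 38, 29, 38, 80, 52, 45, 40, 49, 33, 50, 40, 31,
       40, 27, 31, 35, 57, 31, 26, 39, 25, 23, 102, 72, 57, 54, 56,
       45, 42, 36, 30, 22, 31, 52, 29, 34, 20, 33, 18, 23, 30, 38,
       31, 62, 48, 29, 40, 30, 42, 36, 33, 72, 57, 34, 46, 30, 21)

fit <- lm(H ~ S)
sm  <- summary(fit)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coefs   <- sm$coefficients
t_slope <- coefs["S", "t value"]
p_slope <- coefs["S", "Pr(>|t|)"]

# Overall F-test: statistic plus numerator and denominator degrees of freedom
f   <- sm$fstatistic
p_f <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

# 95% confidence interval for the true slope
ci <- confint(fit, "S", level = 0.95)

print(c(t = t_slope, p = p_slope))
print(c(F = f[["value"]], p = p_f))
print(ci)
```

If the interval printed by `confint()` does not contain 0, that agrees with rejecting $H_0: \beta_1 = 0$ at the 5% level.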
(f) COEFFICIENT OF DETERMINATION ($R^2$):
The coefficient of determination is $R^2 \approx 0.91$; the adjusted $R^2$ is reported on the same line of the regression summary.
The adjusted $R^2$ is a version of $R^2$ that is adjusted for the number of predictors in the model. It increases only if a new term improves the model more than would be expected by chance, and it decreases when a predictor improves the model by less than expected by chance, whereas the ordinary $R^2$ never decreases when predictors are added. For that reason the adjusted $R^2$ is usually preferred when comparing models with different numbers of predictors. With a single predictor, as here, the two values are close.
The coefficient of determination is the proportion of the total variability in Y that is explained by the independent variable X.
Thus about 91% of the total variability in the dependent variable "Height" is explained by the independent variable "Stories".
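Both quantities can be extracted from the model summary and checked against their definitions. This sketch assumes the model `lm(H ~ S)`; the names `sse` and `sst` are introduced here for the error and total sums of squares.

```r
# Heights (feet) and stories for the 60 buildings listed in buildings.txt
H <- c(770, 677, 428, 410, 371, 504, 1136, 695, 551, 550, 568, 504, 560, 512, 448,
       538, 410, 409, 504, 777, 496, 386, 530, 360, 355, 1250, 802, 741, 739, 650,
       592, 577, 500, 469, 320, 441, 845, 435, 435, 375, 364, 340, 375, 450, 529,
       412, 722, 574, 498, 493, 379, 579, 458, 454, 952, 784, 476, 453, 440, 428)
S <- c(54, 47, 28, 38, 29, 38, 80, 52, 45, 40, 49, 33, 50, 40, 31,
       40, 27, 31, 35, 57, 31, 26, 39, 25, 23, 102, 72, 57, 54, 56,
       45, 42, 36, 30, 22, 31, 52, 29, 34, 20, 33, 18, 23, 30, 38,
       31, 62, 48, 29, 40, 30, 42, 36, 33, 72, 57, 34, 46, 30, 21)

fit <- lm(H ~ S)
sm  <- summary(fit)

r2     <- sm$r.squared       # coefficient of determination
adj_r2 <- sm$adj.r.squared   # adjusted for the number of predictors

# Definition: R^2 = 1 - SSE / SST
sse <- sum(resid(fit)^2)        # variation left unexplained by the line
sst <- sum((H - mean(H))^2)     # total variation in height
r2_by_hand <- 1 - sse / sst

# Adjusted R^2 with n observations and p = 1 predictor
n <- length(H)
adj_by_hand <- 1 - (1 - r2) * (n - 1) / (n - 1 - 1)

print(c(R2 = r2, adjusted = adj_r2))
print(cor(S, H)^2)   # equals R^2 in simple linear regression
```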
(g) DIAGNOSTIC PLOTS:
RESIDUAL PLOT:
In the residual plot, the standardized residuals appear on the y-axis and the fitted values appear on the x-axis.
From the above plot, we can see that
NORMAL PROBABILITY PLOT:
A normal probability plot of the residuals is a scatter plot with the theoretical percentiles of the normal distribution on the x-axis and the sample percentiles of the residuals on the y-axis. The relationship between the theoretical percentiles and the sample percentiles is approximately linear. Therefore, the normal probability plot of the residuals suggests that the error terms are approximately normally distributed.
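Both diagnostic plots can be produced from the fitted model. The sketch below assumes standardized residuals from `rstandard()` plotted against fitted values, as described above. With a single predictor the fitted values are a linear function of stories, so plotting against `S` instead shows the same pattern.

```r
# Heights (feet) and stories for the 60 buildings listed in buildings.txt
H <- c(770, 677, 428, 410, 371, 504, 1136, 695, 551, 550, 568, 504, 560, 512, 448,
       538, 410, 409, 504, 777, 496, 386, 530, 360, 355, 1250, 802, 741, 739, 650,
       592, 577, 500, 469, 320, 441, 845, 435, 435, 375, 364, 340, 375, 450, 529,
       412, 722, 574, 498, 493, 379, 579, 458, 454, 952, 784, 476, 453, 440, 428)
S <- c(54, 47, 28, 38, 29, 38, 80, 52, 45, 40, 49, 33, 50, 40, 31,
       40, 27, 31, 35, 57, 31, 26, 39, 25, 23, 102, 72, 57, 54, 56,
       45, 42, 36, 30, 22, 31, 52, 29, 34, 20, 33, 18, 23, 30, 38,
       31, 62, 48, 29, 40, 30, 42, 36, 33, 72, 57, 34, 46, 30, 21)

fit <- lm(H ~ S)
res <- rstandard(fit)   # standardized residuals

# When run as a script, draw to a null device instead of creating Rplots.pdf
if (!interactive()) pdf(NULL)

par(mfrow = c(1, 2))    # two plots side by side

# Residual plot: look for random scatter around 0 with roughly constant spread
plot(fitted(fit), res,
     xlab = "Fitted height (feet)", ylab = "Standardized residuals",
     main = "Residuals vs. fitted")
abline(h = 0, lty = 2)

# Normal probability plot: points near the reference line support normality
qqnorm(res, main = "Normal probability plot")
qqline(res)
```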