Is there a relationship between the number of stories a building has and its height? Some statisticians compiled data on a set of n = 52 buildings reported in the 1994 World Almanac. You will use the data set to decide whether height can be predicted from the number of stories.
(a) Load the data from buildings.txt. (Note that this is a text file, so use the appropriate instruction. If you are having trouble uploading the data, open it to see its contents and type the data in: one vector for heights and one vector for stories. Ignore the year data.)
(b) Draw a scatterplot with stories on the x-axis and height on the y-axis. Does there seem to be a linear relationship between the two variables?
(c) Find the linear correlation coefficient between these variables. What does it tell you about the linear relationship?
(d) Obtain the linear model and summary. Write down the regression equation that relates height with stories. Add the line to the scatterplot.
(e) Test for significance of the regression at $\alpha = 0.05$. State the null and alternative hypotheses. Can the model be used for predictions? Justify your conclusion using the summary in (d).
(f) State the coefficient of determination. What percentage of variation in height is explained by the number of stories?
(g) Draw diagnostic plots (a plot of stories vs. residuals, and a normal probability plot for the residuals). Do assumptions appear to be satisfied?
YEAR  Height  Stories
1990  770     54
1980  677     47
1990  428     28
1989  410     38
1966  371     29
1976  504     38
1974  1136    80
1991  695     52
1982  551     45
1986  550     40
1931  568     49
1979  504     33
1988  560     50
1973  512     40
1981  448     31
1983  538     40
1968  410     27
1927  409     31
1969  504     35
1988  777     57
1987  496     31
1960  386     26
1984  530     39
1976  360     25
1920  355     23
1931  1250    102
1989  802     72
1907  741     57
1988  739     54
1990  650     56
1973  592     45
1983  577     42
1971  500     36
1969  469     30
1971  320     22
1988  441     31
1989  845     52
1973  435     29
1987  435     34
1931  375     20
1931  364     33
1924  340     18
1931  375     23
1991  450     30
1973  529     38
1976  412     31
1990  722     62
1983  574     48
1984  498     29
1986  493     40
1986  379     30
1992  579     42
*********************************
Need R console code
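(a) LOADING THE DATA:
A minimal sketch for part (a), assuming buildings.txt sits in the working directory and is whitespace-delimited with the header row YEAR Height Stories (the file layout is an assumption based on the listing above). If the file is not found, the two vectors are typed in by hand, as the question allows. The names `buildings`, `height` and `stories` are chosen here for illustration.

```r
# Part (a): read the text file if available, otherwise type the data in.
if (file.exists("buildings.txt")) {
  # Assumed layout: whitespace-delimited, header row "YEAR Height Stories"
  buildings <- read.table("buildings.txt", header = TRUE)
  height  <- buildings$Height
  stories <- buildings$Stories
} else {
  height <- c(770, 677, 428, 410, 371, 504, 1136, 695, 551, 550, 568, 504, 560,
              512, 448, 538, 410, 409, 504, 777, 496, 386, 530, 360, 355, 1250,
              802, 741, 739, 650, 592, 577, 500, 469, 320, 441, 845, 435, 435,
              375, 364, 340, 375, 450, 529, 412, 722, 574, 498, 493, 379, 579)
  stories <- c(54, 47, 28, 38, 29, 38, 80, 52, 45, 40, 49, 33, 50,
               40, 31, 40, 27, 31, 35, 57, 31, 26, 39, 25, 23, 102,
               72, 57, 54, 56, 45, 42, 36, 30, 22, 31, 52, 29, 34,
               20, 33, 18, 23, 30, 38, 31, 62, 48, 29, 40, 30, 42)
}

# Quick check: both vectors should hold one value per building
length(height)
length(stories)
```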
(b) SCATTER PLOT:
Scatterplots are useful for interpreting trends in statistical data. The data show an uphill pattern as we move from left to right, which indicates a positive relationship between Stories and Height. That is, as the value of the variable "Stories" increases (moves right), the value of "Height" tends to increase (moves up). The points also follow a roughly straight line, so there is a positive linear relationship between Stories and Height.
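The scatterplot for part (b) could be drawn roughly as follows. This is a sketch with the data typed in as two vectors; the vector names and the axis labels are illustrative choices.

```r
# Data typed in from the listing (Height and Stories columns; year ignored)
height <- c(770, 677, 428, 410, 371, 504, 1136, 695, 551, 550, 568, 504, 560,
            512, 448, 538, 410, 409, 504, 777, 496, 386, 530, 360, 355, 1250,
            802, 741, 739, 650, 592, 577, 500, 469, 320, 441, 845, 435, 435,
            375, 364, 340, 375, 450, 529, 412, 722, 574, 498, 493, 379, 579)
stories <- c(54, 47, 28, 38, 29, 38, 80, 52, 45, 40, 49, 33, 50,
             40, 31, 40, 27, 31, 35, 57, 31, 26, 39, 25, 23, 102,
             72, 57, 54, 56, 45, 42, 36, 30, 22, 31, 52, 29, 34,
             20, 33, 18, 23, 30, 38, 31, 62, 48, 29, 40, 30, 42)

# Part (b): stories on the x-axis, height on the y-axis
plot(stories, height,
     xlab = "Stories", ylab = "Height",
     main = "Height vs. Stories")
```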
(c) LINEAR CORRELATION COEFFICIENT:
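The correlation coefficient $r$ between Height and Stories is positive and close to 1 (in simple linear regression $r^2 = R^2$, so the $R^2$ of about 0.91 in part (f) corresponds to $r \approx 0.95$), and it is significant because its p-value is less than the significance level $\alpha = 0.05$. Since the sign is positive and the value is close to 1, there is a strong positive linear relationship between Stories and Height.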
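One way to obtain the correlation coefficient and its p-value in R is `cor.test()`, which by default computes the Pearson correlation and tests whether it is zero. The sketch below assumes the data have been typed in as two vectors; `ct` and `r` are illustrative names.

```r
height <- c(770, 677, 428, 410, 371, 504, 1136, 695, 551, 550, 568, 504, 560,
            512, 448, 538, 410, 409, 504, 777, 496, 386, 530, 360, 355, 1250,
            802, 741, 739, 650, 592, 577, 500, 469, 320, 441, 845, 435, 435,
            375, 364, 340, 375, 450, 529, 412, 722, 574, 498, 493, 379, 579)
stories <- c(54, 47, 28, 38, 29, 38, 80, 52, 45, 40, 49, 33, 50,
             40, 31, 40, 27, 31, 35, 57, 31, 26, 39, 25, 23, 102,
             72, 57, 54, 56, 45, 42, 36, 30, 22, 31, 52, 29, 34,
             20, 33, 18, 23, 30, 38, 31, 62, 48, 29, 40, 30, 42)

# Part (c): Pearson correlation and the test of H0: rho = 0
ct <- cor.test(stories, height)
r  <- unname(ct$estimate)   # sample correlation coefficient
r
ct$p.value                  # compare with alpha = 0.05
```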
(d) SIMPLE LINEAR REGRESSION MODEL:
ESTIMATED REGRESSION EQUATION:
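Thus from the regression output, the estimated regression equation has the form
$\hat{Y} = b_0 + b_1 X$
where $\hat{Y}$ is the predicted value of the dependent variable "Height",
$b_0$ is the estimated intercept,
$b_1$ is the estimated slope coefficient of the variable "Stories", and
$X$ is the independent variable "Stories".
The values of $b_0$ and $b_1$ are read from the Estimate column of the coefficient table in the summary.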
SCATTER PLOT WITH TREND LINE:
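A sketch of part (d) in R, assuming the two vectors have been typed in: `lm()` fits the model, `summary()` prints the coefficient table, and `abline()` adds the fitted line to the scatterplot. The object name `fit` is illustrative.

```r
height <- c(770, 677, 428, 410, 371, 504, 1136, 695, 551, 550, 568, 504, 560,
            512, 448, 538, 410, 409, 504, 777, 496, 386, 530, 360, 355, 1250,
            802, 741, 739, 650, 592, 577, 500, 469, 320, 441, 845, 435, 435,
            375, 364, 340, 375, 450, 529, 412, 722, 574, 498, 493, 379, 579)
stories <- c(54, 47, 28, 38, 29, 38, 80, 52, 45, 40, 49, 33, 50,
             40, 31, 40, 27, 31, 35, 57, 31, 26, 39, 25, 23, 102,
             72, 57, 54, 56, 45, 42, 36, 30, 22, 31, 52, 29, 34,
             20, 33, 18, 23, 30, 38, 31, 62, 48, 29, 40, 30, 42)

# Part (d): fit Height = b0 + b1 * Stories and print the summary
fit <- lm(height ~ stories)
summary(fit)
coef(fit)   # b0 is "(Intercept)", b1 is "stories"

# Scatterplot with the fitted regression line added
plot(stories, height, xlab = "Stories", ylab = "Height")
abline(fit, col = "red")
```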
(e) SIGNIFICANCE OF INDIVIDUAL PREDICTOR:
We use a t-test to test for the significance of individual predictors.
HYPOTHESIS:
The hypotheses for the t-test on the slope are
$H_0: \beta_1 = 0$ (Stories is not a significant predictor of Height) versus $H_1: \beta_1 \neq 0$.
From the regression output, the t-test p-value for the slope coefficient of the variable "Stories" is less than the significance level $\alpha = 0.05$, so we reject $H_0$ and conclude that "Stories" is a significant variable. The intercept term is also significant, since its p-value is less than the significance level $\alpha = 0.05$.
INTERCEPT:
The intercept $b_0$ is the estimated mean value of Height when the variable "Stories" equals 0, that is, the part of the predicted height that does not depend on the number of stories.
SLOPE COEFFICIENT:
The slope coefficient $b_1$ can be interpreted as follows: as the number of stories increases by 1, the mean value of height increases by $b_1$ units.
TEST FOR OVERALL SIGNIFICANCE OF MODEL:
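We use an F-test to determine the overall significance of the model.
The hypotheses are
$H_0: \beta_1 = 0$ (the model explains no more than the mean of Height alone) versus $H_1: \beta_1 \neq 0$.
From the regression output, the F-test p-value is less than the significance level $\alpha = 0.05$, thus we reject $H_0$ and conclude that the overall model is significant and can be used for predictions. With a single predictor, the F statistic is the square of the slope's t statistic, so the two tests agree.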
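The p-values used in the two tests can be pulled out of the summary object rather than read off the printout. The sketch below assumes the data are typed in as vectors; names such as `coefs`, `t_p` and `f_p` are illustrative.

```r
height <- c(770, 677, 428, 410, 371, 504, 1136, 695, 551, 550, 568, 504, 560,
            512, 448, 538, 410, 409, 504, 777, 496, 386, 530, 360, 355, 1250,
            802, 741, 739, 650, 592, 577, 500, 469, 320, 441, 845, 435, 435,
            375, 364, 340, 375, 450, 529, 412, 722, 574, 498, 493, 379, 579)
stories <- c(54, 47, 28, 38, 29, 38, 80, 52, 45, 40, 49, 33, 50,
             40, 31, 40, 27, 31, 35, 57, 31, 26, 39, 25, 23, 102,
             72, 57, 54, 56, 45, 42, 36, 30, 22, 31, 52, 29, 34,
             20, 33, 18, 23, 30, 38, 31, 62, 48, 29, 40, 30, 42)

fit   <- lm(height ~ stories)
s     <- summary(fit)
coefs <- coef(s)   # matrix: Estimate, Std. Error, t value, Pr(>|t|)

# t-test for the slope (H0: beta1 = 0)
t_stat <- coefs["stories", "t value"]
t_p    <- coefs["stories", "Pr(>|t|)"]

# Overall F-test: statistic and degrees of freedom are stored in the summary
f   <- s$fstatistic
f_p <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

alpha <- 0.05
c(slope_significant = t_p < alpha, model_significant = f_p < alpha)
```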
(f) COEFFICIENT OF DETERMINATION $R^2$:
The coefficient of determination $R^2$ is about 0.91; the adjusted $R^2$ is reported next to it in the regression summary.
The adjusted R-squared is a modified version of R-squared that accounts for the number of predictors in the model. It increases only if a new term improves the model more than would be expected by chance, and it decreases when a predictor improves the model by less than expected by chance, whereas R-squared never decreases when predictors are added. Thus we usually prefer the adjusted R-squared when comparing models; with a single predictor and n = 52, the two values are nearly identical.
The coefficient of determination is the proportion of the total variability in Y that is explained by the independent variable X.
Thus about 91% of the total variability in the dependent variable "Height" is explained by the independent variable "Stories".
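As a check on the definition, $R^2$ can be read from the summary and also recomputed as $1 - SSE/SST$. This is a sketch with the data typed in as vectors; `r2`, `adj_r2`, `sse` and `sst` are illustrative names.

```r
height <- c(770, 677, 428, 410, 371, 504, 1136, 695, 551, 550, 568, 504, 560,
            512, 448, 538, 410, 409, 504, 777, 496, 386, 530, 360, 355, 1250,
            802, 741, 739, 650, 592, 577, 500, 469, 320, 441, 845, 435, 435,
            375, 364, 340, 375, 450, 529, 412, 722, 574, 498, 493, 379, 579)
stories <- c(54, 47, 28, 38, 29, 38, 80, 52, 45, 40, 49, 33, 50,
             40, 31, 40, 27, 31, 35, 57, 31, 26, 39, 25, 23, 102,
             72, 57, 54, 56, 45, 42, 36, 30, 22, 31, 52, 29, 34,
             20, 33, 18, 23, 30, 38, 31, 62, 48, 29, 40, 30, 42)

fit <- lm(height ~ stories)

# Part (f): R-squared and adjusted R-squared from the summary
r2     <- summary(fit)$r.squared
adj_r2 <- summary(fit)$adj.r.squared

# Same quantity from its definition: 1 - SSE / SST
sse <- sum(resid(fit)^2)                 # unexplained variation
sst <- sum((height - mean(height))^2)    # total variation in height
r2_manual <- 1 - sse / sst

round(100 * c(R2 = r2, adjusted = adj_r2, manual = r2_manual), 1)  # percent
```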
(g) DIAGNOSTIC PLOTS:
RESIDUAL PLOT:
In the residual plot, the standardized residuals appear on the y-axis and the fitted values appear on the x-axis.
From the above plot, we can see that
NORMAL PROBABILITY PLOT:
A normal probability plot of the residuals is a scatter plot with the theoretical percentiles of the normal distribution on the x axis and the sample percentiles of the residuals on the y axis. We can see that the relationship between the theoretical percentiles and the sample percentiles is approximately linear. Therefore, the normal probability plot of the residuals suggests that the error terms are normally distributed.
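The two diagnostic plots requested in part (g) could be produced along these lines. The sketch assumes the data are typed in as vectors and plots raw residuals against stories, as the question asks; `rstandard(fit)` against `fitted(fit)` would give the standardized version described above. The name `res` is illustrative.

```r
height <- c(770, 677, 428, 410, 371, 504, 1136, 695, 551, 550, 568, 504, 560,
            512, 448, 538, 410, 409, 504, 777, 496, 386, 530, 360, 355, 1250,
            802, 741, 739, 650, 592, 577, 500, 469, 320, 441, 845, 435, 435,
            375, 364, 340, 375, 450, 529, 412, 722, 574, 498, 493, 379, 579)
stories <- c(54, 47, 28, 38, 29, 38, 80, 52, 45, 40, 49, 33, 50,
             40, 31, 40, 27, 31, 35, 57, 31, 26, 39, 25, 23, 102,
             72, 57, 54, 56, 45, 42, 36, 30, 22, 31, 52, 29, 34,
             20, 33, 18, 23, 30, 38, 31, 62, 48, 29, 40, 30, 42)

fit <- lm(height ~ stories)
res <- resid(fit)

# Residual plot: look for random scatter around zero with constant spread
plot(stories, res, xlab = "Stories", ylab = "Residuals",
     main = "Residuals vs. Stories")
abline(h = 0, lty = 2)

# Normal probability (Q-Q) plot: points near the line suggest normal errors
qqnorm(res)
qqline(res)
```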