House Prices:
Notes on variables:
Price: In thousands of dollars
Ac: air conditioner (1 if yes, 0 if no)
Size: In square feet
Age: In years
Pool: (1 if yes, 0 if no)
Bedrms: Number of Bedrooms
Baths: Number of Bathrooms
4) a. Make a Scatter plot where x=size and y=price
b. Calculate the Least Squares Regression (LSR) equation:
c. Does the intercept make sense? Why or why not?
c. Find and interpret the R² (R-squared).
d. Predict the average price of a home that is 1000 square feet.
e. Better interpretation of the slope: If the square footage increased by 40 square feet, how much more would you expect the house to cost?
f. Plot the residuals vs. X-values.
g. Use the residuals plot or the regression line to identify any outliers. (If there are any.)
5) Looking at Heights again. Remember, women have a mean height μ = 65 in. and a standard deviation σ = 2.5 in., and men have a mean height μ = 70 in. and a standard deviation σ = 3 in.
a. What proportion of women are taller than 69 in?
b. What height for men is the 65th percentile (larger than 65% of the data)?
c. What proportion/percentage of men have height between 68 to 75?
4. These objectives can be completed with a few lines of R. First, copy the data above into a plain text file (for example, with Notepad) and save it.
The R code to read that tab-delimited file into a data frame is below.
> library(readr)
> dat <- read_delim("<location of the text file>", "\t", escape_double = FALSE, trim_ws = TRUE)
(a) The R command for the scatter plot of price against size is below.
> plot(x=dat$size,y=dat$price)
(b) The R command and regression result would be as below.
> lm(price ~ size, data = dat)

Call:
lm(formula = price ~ size, data = dat)

Coefficients:
(Intercept)         size
    39.9621       0.1376
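The regression equation is price = 39.9621 + 0.1376 × size, with price in thousands of dollars and size in square feet.
The intercept represents the average price when size is zero, i.e. price = 39.9621 (about $39,962). The intercept does not make sense here, because the size of a house cannot be zero.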
(c) The command to find the R-squared and its result are below.
> summary(lm(price ~ size, data = dat))$r.squared
[1] 0.7940702
Hence, the R-squared is 0.7940702, or approximately 0.79. That is, about 79% of the variation in price is explained by its linear relationship with size.
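(d) For size = 1000, we have price = 39.9621 + 0.1376 × 1000 = 177.5621. Since price is measured in thousands of dollars, the predicted average price of a 1000-square-foot house is about 177.56 thousand dollars, i.e. roughly $177,560.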
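As a quick check of this arithmetic, the prediction can be reproduced in R from the coefficients reported in (b). This is only a sketch: the names b0, b1 and predicted are illustrative and not part of the original answer, and the coefficients are the rounded values printed by lm(), so the result is approximate.

```r
# Rounded coefficients copied from the lm() output in part (b)
b0 <- 39.9621   # intercept, in thousands of dollars
b1 <- 0.1376    # slope, in thousands of dollars per square foot

# Predicted average price for a 1000-square-foot house
predicted <- b0 + b1 * 1000

predicted          # in thousands of dollars
predicted * 1000   # the same prediction in dollars
```

With the fitted model object available, predict(lm(price ~ size, data = dat), newdata = data.frame(size = 1000)) should give essentially the same value, computed from the unrounded coefficients.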
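(e) The slope is the ratio of the change in average price to the change in size: slope = (change in price) / (change in size) = 0.1376. If the change in size is 40, then (change in price) / 40 = 0.1376, or change in price = 0.1376 × 40 = 5.504. Hence, for an increase in size of 40 square feet, the expected price rises by about 5.504 thousand dollars, i.e. roughly $5,504.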
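The same calculation can be sketched in R, which also shows that the expected change does not depend on the starting size. The helper predict_price and the variable names are hypothetical, introduced only for this illustration, and the coefficients are the rounded values from part (b).

```r
# Rounded coefficients copied from the lm() output in part (b)
b0 <- 39.9621
b1 <- 0.1376

# Hypothetical helper: predicted price (thousands of dollars) for a given size
predict_price <- function(size) b0 + b1 * size

# Expected change in price for 40 extra square feet, computed from the slope
change_direct <- b1 * 40

# The same change computed as a difference of predictions at two starting sizes
change_from_1000 <- predict_price(1040) - predict_price(1000)
change_from_2000 <- predict_price(2040) - predict_price(2000)

change_direct
isTRUE(all.equal(change_direct, change_from_1000))
isTRUE(all.equal(change_direct, change_from_2000))
```

Because the model is a straight line, the intercept cancels in the difference, so only the slope matters.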
(f) The R command for the plot of residuals against size is below.
> plot(x=dat$size,y=resid(lm(price ~ size, data = dat)))
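The commands in this answer refit the model each time it is needed. A possible alternative, sketched below, is to fit once, store the result, and reuse it. The data here are simulated purely so that the example runs on its own; toy, fit and res are hypothetical names, and the simulated values are not the house-price data from the question.

```r
# Simulated toy data (hypothetical), standing in for the real data frame `dat`
set.seed(1)
toy <- data.frame(size = seq(1000, 3000, by = 100))
toy$price <- 40 + 0.14 * toy$size + rnorm(nrow(toy), sd = 20)

# Fit the regression once and keep the model object for later steps
fit <- lm(price ~ size, data = toy)
res <- resid(fit)

# Residuals against the x-values, with a dashed reference line at zero
plot(x = toy$size, y = res, xlab = "size", ylab = "residual")
abline(h = 0, lty = 2)
```

With the real data, replacing toy by dat would leave the rest unchanged, and fit could then be reused for summary(fit)$r.squared and for picking out large residuals.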
(g) The outliers appear in the residual plot as the data points whose residuals are below -50. Apart from those 4 points, all residuals lie between -50 and +50.
The corresponding data points can be found with the R command below.
> dat[resid(lm(price ~ size, data = dat))< -50,c(1,3)]
# A tibble: 4 x 2
  price  size
  <int> <int>
1   260  2109
2   175  1661
3   191  1834
4   242  1974
The expression "resid(lm(price ~ size, data = dat))< -50" compares each residual with -50 and returns a logical vector, with TRUE where the residual is below -50 and FALSE otherwise. Inside [ , ], the part before the comma selects the rows for which that vector is TRUE, and the part after the comma, "c(1,3)", selects the 1st and 3rd columns (price and size).
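The indexing step can be seen in isolation on a very small example. Everything below is hypothetical: the data frame toy and the residual values in res are made up by hand for illustration and have nothing to do with the actual house data.

```r
# Hypothetical toy data frame with the same first three columns as the question
toy <- data.frame(
  price = c(100, 150, 90, 210),
  ac    = c(1, 0, 1, 0),
  size  = c(800, 1200, 1500, 1600)
)

# Hypothetical residuals, typed in by hand (not computed from a model)
res <- c(5, -10, -60, 20)

# Logical vector: TRUE where the residual is below -50
keep <- res < -50
keep

# Rows where keep is TRUE; columns 1 and 3 (price and size)
flagged <- toy[keep, c(1, 3)]
flagged
```

Here only the third residual is below -50, so only the third row is returned, with the ac column dropped.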