Suppose the following data were collected from a sample of 15 houses relating selling price to square footage and the architectural style of the house. Which of the following is the best equation to use relating the selling price of a house to square footage and the style of the house?
Selling Price | Square Footage | Colonial (1 if house is Colonial style, 0 otherwise) | Ranch (1 if house is Ranch style, 0 otherwise) | Victorian (1 if house is Victorian style, 0 otherwise) |
---|---|---|---|---|
391430 | 2303 | 0 | 1 | 0 |
381002 | 2053 | 1 | 0 | 0 |
403539 | 2013 | 0 | 0 | 1 |
405271 | 2552 | 0 | 0 | 1 |
406578 | 3131 | 0 | 0 | 1 |
471858 | 3659 | 0 | 1 | 0 |
392188 | 2332 | 0 | 1 | 0 |
475616 | 3588 | 1 | 0 | 0 |
401742 | 1843 | 0 | 0 | 1 |
404836 | 2656 | 1 | 0 | 0 |
333709 | 1337 | 1 | 0 | 0 |
393618 | 2389 | 1 | 0 | 0 |
365651 | 1799 | 0 | 1 | 0 |
404239 | 2321 | 0 | 0 | 1 |
375624 | 1946 | 0 | 1 | 0 |
Regression analysis is a basic statistical method for estimating the relationships among variables. The first step is to identify the dependent variable, whose value changes with the independent variable. For example, the value of a house (dependent variable) varies with its square footage (independent variable). Regression analysis is a very useful tool in predictive analytics.
E(Y | X) = f(X, β)
Y = f(X) = β0 + β1 * X
β0 is the intercept of the line
β1 is the slope of the line
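As a rough illustration of the intercept and slope, the sketch below fits a straight line to the square footage and selling price columns of the table with NumPy's `np.polyfit`. It assumes the table values are read as, for example, 2303 square feet and a selling price of 391430 for the first house; the variable and function names are illustrative and not part of the original question.

```python
import numpy as np

# Square footage (X) and selling price (Y) for the 15 houses in the table,
# assuming the first row reads 2303 sq ft and a price of 391430.
sqft = np.array([2303, 2053, 2013, 2552, 3131, 3659, 2332, 3588,
                 1843, 2656, 1337, 2389, 1799, 2321, 1946], dtype=float)
price = np.array([391430, 381002, 403539, 405271, 406578, 471858, 392188, 475616,
                  401742, 404836, 333709, 393618, 365651, 404239, 375624], dtype=float)

# A degree-1 polynomial fit returns coefficients highest power first:
# the slope (β1), then the intercept (β0).
slope, intercept = np.polyfit(sqft, price, 1)


def predict(x):
    """Predicted selling price for square footage x under the fitted line."""
    return intercept + slope * x


print(f"intercept (β0) = {intercept:.2f}")
print(f"slope (β1)     = {slope:.2f}")
print(f"predicted price for 2500 sq ft = {predict(2500.0):.2f}")
```

This line uses square footage only; the style of the house is ignored here.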
A linear regression algorithm estimates the line that best describes the relationship among the data points. The relationship can be defined in many different ways, linear or nonlinear; in the linear model it is defined by the intercept and the slope. To find the best values of these two parameters, we fit (train) the model on the data.
Before applying the linear regression model, we should determine whether there is a relationship between the variables of interest. A scatterplot is a good starting point for judging the strength of the relationship between two variables. The correlation coefficient is a valuable measure of linear association between variables. Its value ranges from -1 (perfect negative linear relationship) through 0 (no linear relationship) to 1 (perfect positive linear relationship).
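To make the range of the correlation coefficient concrete, here is a minimal sketch using `np.corrcoef`. It assumes the table values are read as 2303 square feet and a price of 391430 for the first house; the second example uses made-up points on a decreasing straight line to show that a value of -1 indicates a perfect negative linear relationship.

```python
import numpy as np

# House data from the table (assumed reading: first row is 2303 sq ft, 391430).
sqft = np.array([2303, 2053, 2013, 2552, 3131, 3659, 2332, 3588,
                 1843, 2656, 1337, 2389, 1799, 2321, 1946], dtype=float)
price = np.array([391430, 381002, 403539, 405271, 406578, 471858, 392188, 475616,
                  401742, 404836, 333709, 393618, 365651, 404239, 375624], dtype=float)

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r.
r = np.corrcoef(sqft, price)[0, 1]
print(f"square footage vs. selling price: r = {r:.3f}")

# Illustrative points lying exactly on a decreasing line (not from the table).
x = np.array([1.0, 2.0, 3.0, 4.0])
r_negative = np.corrcoef(x, 10.0 - 2.0 * x)[0, 1]
print(f"perfectly decreasing line: r = {r_negative:.3f}")
```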
Once we determine that there is a relationship between the variables, the next step is to identify the best-fitting relationship (line) between them. The most common method is least squares, which chooses the line that minimizes the Residual Sum of Squares (RSS). For each data point, the residual is the vertical distance between the observed (actual) value and the value predicted by the proposed line. The RSS squares each residual and adds them all up.
The MSE (Mean Squared Error) measures the quality of the estimator and is obtained by dividing the RSS by the total number of observed data points. It is always non-negative, and values closer to zero represent a smaller error. The RMSE (Root Mean Squared Error) is the square root of the MSE and measures the typical deviation of the estimates from the observed values. Because it is in the same units as the dependent variable, it is easier to interpret than the MSE, which can be a very large number.
RMSE = √(MSE)
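The three error measures can be sketched in a few lines. The helper name `error_metrics` and the sample numbers are made up for illustration; the MSE follows the definition in the text (RSS divided by the number of data points).

```python
import numpy as np


def error_metrics(observed, predicted):
    """Return (RSS, MSE, RMSE), with MSE defined as RSS / number of points."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    residuals = observed - predicted      # actual value minus predicted value
    rss = float(np.sum(residuals ** 2))   # square each residual and add them up
    mse = rss / observed.size             # average squared error
    rmse = float(np.sqrt(mse))            # back in the units of the observations
    return rss, mse, rmse


# Made-up numbers purely for illustration (not taken from the house data).
observed = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]
rss, mse, rmse = error_metrics(observed, predicted)
print(f"RSS = {rss}, MSE = {mse:.4f}, RMSE = {rmse:.4f}")
```

Note that the Analysis of Variance output further down divides the error sum of squares by its degrees of freedom (11) rather than by the number of data points, so its mean square is not the same quantity as this MSE.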
Each additional independent variable adds another dimension to the model. With three independent variables, for example:
Y = f(X) = β0 + β1 * X1 + β2 * X2 + β3 * X3
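A minimal sketch of such a model for the house data, using `np.linalg.lstsq`, might look like the following. It assumes the table values are read as 2303 square feet and a price of 391430 for the first house, and it uses only the Colonial and Ranch indicators, which mirrors the terms listed in the regression output below; treating Victorian as the baseline style is an assumption of this sketch. The coefficients it prints are not claimed to match that output.

```python
import numpy as np

# House data from the table (assumed reading: first row is 2303 sq ft, 391430).
sqft = np.array([2303, 2053, 2013, 2552, 3131, 3659, 2332, 3588,
                 1843, 2656, 1337, 2389, 1799, 2321, 1946], dtype=float)
price = np.array([391430, 381002, 403539, 405271, 406578, 471858, 392188, 475616,
                  401742, 404836, 333709, 393618, 365651, 404239, 375624], dtype=float)
colonial = np.array([0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0], dtype=float)
ranch = np.array([1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1], dtype=float)

# Design matrix: intercept, square footage, and two style indicators.
# Victorian is the baseline style (both indicators equal to 0).
X = np.column_stack([np.ones_like(sqft), sqft, colonial, ranch])

# Least-squares solution of X @ beta ≈ price.
beta, _, rank, _ = np.linalg.lstsq(X, price, rcond=None)
b0, b_sqft, b_colonial, b_ranch = beta
print(f"Selling Price = {b0:.2f} + {b_sqft:.2f} Square Footage "
      f"+ {b_colonial:.2f} Colonial + {b_ranch:.2f} Ranch")

# Adding the Victorian indicator as a fifth column makes the columns linearly
# dependent, because the three indicators always sum to the intercept column.
victorian = 1.0 - colonial - ranch
X_all = np.column_stack([X, victorian])
print("rank with two indicators:  ", rank)
print("rank with three indicators:", np.linalg.matrix_rank(X_all))
```

With an intercept in the model, three styles need only two indicator variables; the coefficient of each indicator is then the estimated price difference from the baseline style at the same square footage.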
Analysis of Variance
Source DF Adj SS Adj MS F-Value P-Value
Regression 3 1.48677E+22 4.95589E+21 19.17 0.000
Square Footage 1 1.47522E+22 1.47522E+22 57.08 0.000
Colonial (1 if house is Colonia 1 1.65815E+20 1.65815E+20 0.64 0.440
Ranch (1 if house is Ranch styl 1 1.12558E+20 1.12558E+20 0.44 0.523
Error 11 2.84317E+21 2.58470E+20
Total 14 1.77108E+22
Model Summary
S R-sq R-sq(adj) R-sq(pred)
1.60770E+10 83.95% 79.57% 65.77%
Coefficients
Term Coef SE Coef T-Value P-Value VIF
Constant 2.85798E+11 17251859972 16.57 0.000
Square Footage 4994 661 7.55 0.000 1.00
Colonial (1 if house is Colonia -740537218 924569907 -0.80 0.440 1.33
Ranch (1 if house is Ranch styl -610158292 924612666 -0.66 0.523 1.33
Regression Equation
Selling Price = 285797688046 + 4994 Square Footage
- 740537218 Colonial (1 if house is Colonia
- 610158292 Ranch (1 if house is Ranch styl
Fits and Diagnostics for Unusual Observations
Obs Selling Price Fit Resid Std Resid
5 4.06578E+11 4.42185E+11 -3.56063E+10 -2.64 R
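The summary statistics in the output above can be checked by hand from the Analysis of Variance and Coefficients tables. The sketch below copies the printed values and recomputes R-sq, R-sq(adj), the F-value, S, and the T-value for Square Footage; the variable names are illustrative, and small differences from the printed figures come from rounding in the output.

```python
import math

# Values copied from the Analysis of Variance table above.
ss_regression = 1.48677e22
ss_error = 2.84317e21
ss_total = 1.77108e22
df_regression, df_error, df_total = 3, 11, 14

ms_regression = ss_regression / df_regression   # Adj MS for Regression
ms_error = ss_error / df_error                  # Adj MS for Error

r_sq = ss_regression / ss_total                     # share of variation explained
r_sq_adj = 1.0 - ms_error / (ss_total / df_total)   # penalized for model size
f_value = ms_regression / ms_error                  # overall F statistic
s = math.sqrt(ms_error)                             # standard error of the regression

# T-value for Square Footage: coefficient divided by its standard error.
t_sqft = 4994 / 661

print(f"R-sq      = {100 * r_sq:.2f}%")
print(f"R-sq(adj) = {100 * r_sq_adj:.2f}%")
print(f"F-Value   = {f_value:.2f}")
print(f"S         = {s:.5e}")
print(f"T-Value (Square Footage) = {t_sqft:.2f}")
```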