2M2_IND3. Prices of diamond jewelry are based on the “4Cs” of diamonds: cut, color, clarity, and carat. A jeweler is trying to estimate the price of diamond earrings based on color, carats, and clarity. The jeweler has collected some data on 22 diamond pieces and the data is shown in Worksheet IND3. The jeweler would like to build a multiple regression model to estimate the price of the pieces based on color, carats, and clarity.
a)Prepare a scatter plot showing the relationship between the price and each of the independent variables.
b)If the jeweler wanted to build a regression model using only one independent variable to predict price, which variable should be used?
c)Why?
d)How do you use the value of Significance F in the model with only one independent variable?
e)If the jeweler wanted to build a regression model using two independent variables to predict price, which variable should be added to the variable selected in the one independent variable model?
f)Why?
g)If the jeweler wanted to build a regression model using three independent variables to predict price, which variable should be added to the variables selected for the two variable model?
h)Why?
i)Based on your best model, how should the jeweler price a diamond with a color of 2.75, a clarity of 3.00, and a weight of 0.85 carats?
j)How do you use the value of Significance F in the multiple regression model?
k)Does there appear to be any multicollinearity among the independent variables?
l)How can you tell if you have multicollinearity?
Color | Clarity | Carats | Price |
2.50 | 1.50 | 0.50 | 474.99 |
3.50 | 4.00 | 0.50 | 539.99 |
3.50 | 4.50 | 0.70 | 549.99 |
3.00 | 3.50 | 0.75 | 523.99 |
3.00 | 3.50 | 0.75 | 523.99 |
3.50 | 4.00 | 0.75 | 539.99 |
1.50 | 3.50 | 0.75 | 664.99 |
1.50 | 2.00 | 0.75 | 699.99 |
2.50 | 3.50 | 0.75 | 902.99 |
2.50 | 1.50 | 0.75 | 1,128.99 |
2.50 | 1.50 | 0.75 | 1,139.99 |
3.00 | 2.00 | 0.75 | 1,125.00 |
3.50 | 4.00 | 1.00 | 799.99 |
3.50 | 4.50 | 1.00 | 899.99 |
2.50 | 3.50 | 1.00 | 999.99 |
3.00 | 3.50 | 1.00 | 1,082.99 |
3.00 | 3.50 | 1.00 | 1,082.99 |
1.50 | 3.50 | 1.00 | 1,329.99 |
2.50 | 1.50 | 1.00 | 1,329.99 |
1.50 | 3.50 | 1.00 | 1,399.99 |
2.50 | 1.50 | 1.00 | 1,624.99 |
3.50 | 3.00 | 1.00 | 1,625.00 |
a)Prepare a scatter plot showing the relationship between the price and each of the independent variables.
Price is the dependent variable, so it goes on the vertical axis; in R's plot(x, y) the independent variable is the first argument.
Scatter plot of "Price" against "Color" in R: "plot(data$Color, data$Price)"
Scatter plot of "Price" against "Clarity" in R: "plot(data$Clarity, data$Price)"
Scatter plot of "Price" against "Carats" in R: "plot(data$Carats, data$Price)"
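The three plots can be drawn side by side from the table above. This is a minimal sketch that assumes the worksheet is typed into a data frame named `data` (the name used in the answers); the panel layout and axis labels are illustrative choices, not part of the original worksheet.

```r
# Data typed in from the table above (22 rows).
data <- data.frame(
  Color   = c(2.5, 3.5, 3.5, 3, 3, 3.5, 1.5, 1.5, 2.5, 2.5, 2.5,
              3, 3.5, 3.5, 2.5, 3, 3, 1.5, 2.5, 1.5, 2.5, 3.5),
  Clarity = c(1.5, 4, 4.5, 3.5, 3.5, 4, 3.5, 2, 3.5, 1.5, 1.5,
              2, 4, 4.5, 3.5, 3.5, 3.5, 3.5, 1.5, 3.5, 1.5, 3),
  Carats  = c(0.50, 0.50, 0.70, rep(0.75, 9), rep(1.00, 10)),
  Price   = c(474.99, 539.99, 549.99, 523.99, 523.99, 539.99, 664.99,
              699.99, 902.99, 1128.99, 1139.99, 1125.00, 799.99, 899.99,
              999.99, 1082.99, 1082.99, 1329.99, 1329.99, 1399.99,
              1624.99, 1625.00)
)

# One panel per independent variable, with Price on the vertical axis.
op <- par(mfrow = c(1, 3))
for (v in c("Color", "Clarity", "Carats")) {
  plot(data[[v]], data$Price,
       xlab = v, ylab = "Price", main = paste("Price vs", v))
}
par(op)  # restore the previous plotting layout
```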
b)If the jeweler wanted to build a regression model using only one independent variable to predict price, which variable should be used?
Regression model using Price as the dependent variable and Color as the independent variable.
> model1= lm(data$Price ~ data$Color, data = data)
> summary(model1)

Call:
lm(formula = data$Price ~ data$Color, data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-8.1859 -3.7140  0.4714  4.1933  9.1287

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   7.2981     4.6340   1.575    0.131
data$Color    0.6293     1.6609   0.379    0.709

Residual standard error: 5.338 on 20 degrees of freedom
Multiple R-squared:  0.007126, Adjusted R-squared:  -0.04252
F-statistic: 0.1435 on 1 and 20 DF,  p-value: 0.7088
Regression model using Price as the dependent variable and Clarity as the independent variable.
> model2= lm(data$Price ~ data$Clarity, data = data)
> summary(model2)

Call:
lm(formula = data$Price ~ data$Clarity, data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-9.0396 -2.2153 -0.1832  3.3911  7.9604

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)    2.0347     3.1941   0.637   0.5313
data$Clarity   2.2871     0.9944   2.300   0.0323 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.764 on 20 degrees of freedom
Multiple R-squared:  0.2092, Adjusted R-squared:  0.1696
F-statistic: 5.29 on 1 and 20 DF,  p-value: 0.03234
Regression model by using Price as a Dependent variable and Carats as an Independent Variable.
> model3= lm(data$Price ~ data$Carats, data = data)
> summary(model3)

Call:
lm(formula = data$Price ~ data$Carats, data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-7.4052 -3.2623  0.0948  3.2876  9.7377

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   12.834      5.947   2.158   0.0433 *
data$Carats   -4.572      6.962  -0.657   0.5189
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.3 on 20 degrees of freedom
Multiple R-squared:  0.02111, Adjusted R-squared:  -0.02784
F-statistic: 0.4312 on 1 and 20 DF,  p-value: 0.5189
c)Why?
We fit three models, one for each independent variable. The second model, with "Clarity" as the independent variable, has the highest R-squared (0.2092) and the lowest residual standard error (4.764), so Clarity is the variable to use.
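The comparison can be tabulated rather than read off three separate summaries. The sketch below assumes the worksheet is held in a data frame named `data`; the helper name `fit_stats` is hypothetical. The printed values depend on the data actually loaded, so no output is shown here.

```r
# Data typed in from the table above (22 rows).
data <- data.frame(
  Color   = c(2.5, 3.5, 3.5, 3, 3, 3.5, 1.5, 1.5, 2.5, 2.5, 2.5,
              3, 3.5, 3.5, 2.5, 3, 3, 1.5, 2.5, 1.5, 2.5, 3.5),
  Clarity = c(1.5, 4, 4.5, 3.5, 3.5, 4, 3.5, 2, 3.5, 1.5, 1.5,
              2, 4, 4.5, 3.5, 3.5, 3.5, 3.5, 1.5, 3.5, 1.5, 3),
  Carats  = c(0.50, 0.50, 0.70, rep(0.75, 9), rep(1.00, 10)),
  Price   = c(474.99, 539.99, 549.99, 523.99, 523.99, 539.99, 664.99,
              699.99, 902.99, 1128.99, 1139.99, 1125.00, 799.99, 899.99,
              999.99, 1082.99, 1082.99, 1329.99, 1329.99, 1399.99,
              1624.99, 1625.00)
)

predictors <- c("Color", "Clarity", "Carats")

# Fit Price ~ <predictor> for each candidate and collect the fit statistics.
fit_stats <- t(sapply(predictors, function(v) {
  s <- summary(lm(reformulate(v, response = "Price"), data = data))
  c(r_squared     = s$r.squared,
    adj_r_squared = s$adj.r.squared,
    resid_se      = s$sigma)
}))

# One row per predictor; choose the row with the highest R-squared.
print(round(fit_stats, 4))
```

With a single predictor, ranking by R-squared and ranking by residual standard error always agree, because every one-variable model has the same residual degrees of freedom.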
d)How do you use the value of Significance F in the model with only one independent variable?
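Significance F is the p-value of the overall F-test, reported on the last line of the summary. For the Clarity model the p-value is 0.03234. Because it is smaller than 0.05, the model is statistically significant; with only one independent variable, this is the same as saying that the variable is a significant predictor of price.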
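R does not label anything "Significance F" (that is the Excel name), but the same number can be recovered from the F-statistic stored in the model summary. The sketch below assumes the worksheet is held in a data frame named `data`; the variable names `sig_f` and `slope_p` are illustrative.

```r
# Data typed in from the table above (22 rows).
data <- data.frame(
  Color   = c(2.5, 3.5, 3.5, 3, 3, 3.5, 1.5, 1.5, 2.5, 2.5, 2.5,
              3, 3.5, 3.5, 2.5, 3, 3, 1.5, 2.5, 1.5, 2.5, 3.5),
  Clarity = c(1.5, 4, 4.5, 3.5, 3.5, 4, 3.5, 2, 3.5, 1.5, 1.5,
              2, 4, 4.5, 3.5, 3.5, 3.5, 3.5, 1.5, 3.5, 1.5, 3),
  Carats  = c(0.50, 0.50, 0.70, rep(0.75, 9), rep(1.00, 10)),
  Price   = c(474.99, 539.99, 549.99, 523.99, 523.99, 539.99, 664.99,
              699.99, 902.99, 1128.99, 1139.99, 1125.00, 799.99, 899.99,
              999.99, 1082.99, 1082.99, 1329.99, 1329.99, 1399.99,
              1624.99, 1625.00)
)

s <- summary(lm(Price ~ Clarity, data = data))

# fstatistic is a named vector: value, numdf, dendf.
f <- s$fstatistic

# Significance F = upper-tail probability of the F distribution.
sig_f <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

# In a one-variable model the F-test and the slope's t-test coincide.
slope_p <- coef(s)["Clarity", "Pr(>|t|)"]

cat("Significance F:", sig_f, "\n")
cat("Slope p-value: ", slope_p, "\n")
```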
e)If the jeweler wanted to build a regression model using two independent variables to predict price, which variable should be added to the variable selected in the one independent variable model?
Regression model by using Price as a Dependent variable, Color+Clarity as an Independent Variable.
> model4= lm(data$Price ~ Color+Clarity, data = data)
> summary(model4)

Call:
lm(formula = data$Price ~ Color + Clarity, data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-8.9019 -2.0739  0.0981  3.2459  7.7171

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.5094     4.5386   0.773   0.4489
Color        -0.7619     1.6322  -0.467   0.6460
Clarity       2.4795     1.0949   2.265   0.0354 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.86 on 19 degrees of freedom
Multiple R-squared:  0.2182, Adjusted R-squared:  0.1359
F-statistic: 2.651 on 2 and 19 DF,  p-value: 0.09653
Regression model by using Price as a Dependent variable, Color+Carats as an Independent Variable.
> model5= lm(data$Price ~ Color+Carats, data = data)
> summary(model5)

Call:
lm(formula = data$Price ~ Color + Carats, data = data)

Residuals:
   Min     1Q Median     3Q    Max
-7.550 -3.045 -0.322  3.690  9.819

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  11.2036     7.9176   1.415    0.173
Color         0.5448     1.6930   0.322    0.751
Carats       -4.3847     7.1469  -0.614    0.547

Residual standard error: 5.423 on 19 degrees of freedom
Multiple R-squared:  0.02641, Adjusted R-squared:  -0.07607
F-statistic: 0.2577 on 2 and 19 DF,  p-value: 0.7755
Regression model by using Price as a Dependent variable, Clarity+Carats as an Independent Variable.
> model6= lm(data$Price ~ Clarity+Carats, data = data)
> summary(model6)

Call:
lm(formula = data$Price ~ Clarity + Carats, data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-8.0507 -2.7564 -0.6747  2.6698  8.9493

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    7.085      5.843   1.212   0.2402
Clarity        2.418      1.001   2.416   0.0259 *
Carats        -6.496      6.298  -1.031   0.3153
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.756 on 19 degrees of freedom
Multiple R-squared:  0.2511, Adjusted R-squared:  0.1723
F-statistic: 3.186 on 2 and 19 DF,  p-value: 0.06411
We fit three models, one for each pair of independent variables, and select the "Carats" variable to add to the "Clarity" variable.
f)Why?
Because the combination of Clarity and Carats gives the highest R-squared (0.2511) and the lowest residual standard error (4.756) of the three two-variable models.
g)If the jeweler wanted to build a regression model using three independent variables to predict price, which variable should be added to the variables selected for the two-variable model?
The "Color" variable.
h)Why?
The two-variable model already contains the "Carats" and "Clarity" variables, so "Color" is the only variable left to add for the three-variable model.
i)Based on your best model, how should the jeweler price a diamond with a color of 2.75, a clarity of 3.00, and a weight of 0.85 carats?
Final model is
> model7= lm(data$Price ~ Clarity+Carats+Color, data = data)
> summary(model7)

Call:
lm(formula = data$Price ~ Clarity + Carats + Color, data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-7.7825 -2.5493 -0.4644  2.6984  8.7111

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    9.481      7.095   1.336   0.1982
Clarity        2.685      1.106   2.428   0.0259 *
Carats        -7.056      6.467  -1.091   0.2896
Color         -1.013      1.640  -0.617   0.5447
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.836 on 18 degrees of freedom
Multiple R-squared:  0.2667, Adjusted R-squared:  0.1444
F-statistic: 2.182 on 3 and 18 DF,  p-value: 0.1255
So the equation of the model is
Price = 9.481 + ( -1.013) * Color + (2.685) * Clarity + ( -7.056) * Carats
Price = 9.481 + ( -1.013) * 2.75 + (2.685) * 3.00 + ( -7.056) * 0.85
Price = 8.75265
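The substitution above can be checked in a couple of lines of R. This sketch only redoes the arithmetic with the coefficients reported in the model7 summary; the vector names `coefs` and `x` are illustrative.

```r
# Coefficients as reported in the summary of model7.
coefs <- c(Intercept = 9.481, Color = -1.013, Clarity = 2.685, Carats = -7.056)

# Values for the new diamond; the leading 1 multiplies the intercept.
x <- c(Intercept = 1, Color = 2.75, Clarity = 3.00, Carats = 0.85)

# Fitted value = sum of coefficient * input, term by term.
predicted_price <- sum(coefs * x[names(coefs)])
print(predicted_price)
```

Indexing `x` by `names(coefs)` keeps each input paired with its own coefficient even if the two vectors are written in a different order.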
j)How do you use the value of Significance F in the multiple regression model?
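In a multiple regression model, Significance F is the p-value of the overall F-test, which tests whether the independent variables taken together help predict price. If it is smaller than 0.05, the model as a whole is significant. For the three-variable model the p-value is 0.1255, which is larger than 0.05, so the model as a whole is not significant at the 5% level.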
k)Does there appear to be any multicollinearity among the independent variables?
> cor(data)
              Color   Clarity      Carats       Price
Color    1.00000000 0.3763679 -0.08126966  0.08441445
Clarity  0.37636788 1.0000000  0.12648024  0.45737140
Carats  -0.08126966 0.1264802  1.00000000 -0.14527925
Price    0.08441445 0.4573714 -0.14527925  1.00000000
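No multicollinearity appears in this data set. Multicollinearity shows up as a strong correlation between two independent variables, and here the largest such correlation is only 0.376 (Color and Clarity).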
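Pairwise correlations can miss collinearity that involves more than two variables, so a common complementary check is the variance inflation factor (VIF), 1 / (1 - R-squared) from regressing each independent variable on the others. The sketch below computes it in base R, assuming the worksheet is held in a data frame named `data`; the cut-off of roughly 5 to 10 mentioned in the comment is a rule of thumb, not something taken from the worksheet.

```r
# Data typed in from the table above (22 rows).
data <- data.frame(
  Color   = c(2.5, 3.5, 3.5, 3, 3, 3.5, 1.5, 1.5, 2.5, 2.5, 2.5,
              3, 3.5, 3.5, 2.5, 3, 3, 1.5, 2.5, 1.5, 2.5, 3.5),
  Clarity = c(1.5, 4, 4.5, 3.5, 3.5, 4, 3.5, 2, 3.5, 1.5, 1.5,
              2, 4, 4.5, 3.5, 3.5, 3.5, 3.5, 1.5, 3.5, 1.5, 3),
  Carats  = c(0.50, 0.50, 0.70, rep(0.75, 9), rep(1.00, 10)),
  Price   = c(474.99, 539.99, 549.99, 523.99, 523.99, 539.99, 664.99,
              699.99, 902.99, 1128.99, 1139.99, 1125.00, 799.99, 899.99,
              999.99, 1082.99, 1082.99, 1329.99, 1329.99, 1399.99,
              1624.99, 1625.00)
)

predictors <- c("Color", "Clarity", "Carats")

# Regress each predictor on the remaining ones and convert R-squared to a VIF.
vif <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  r2 <- summary(lm(reformulate(others, response = v), data = data))$r.squared
  1 / (1 - r2)
})

# A VIF near 1 means little collinearity; values above about 5 to 10
# are often treated as a warning sign (rule of thumb).
print(round(vif, 3))
```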