#Install and load the ggplot2 package.
#install.packages() normally installs missing dependencies automatically; if it does not,
#first install these packages: gtable,scales,munsell,lazyeval,plyr,withr,fansi,utf8,cli,assertthat
install.packages("ggplot2")
library(ggplot2)
diamonds
# A tibble: 53,940 x 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ... with 53,930 more rows
?diamonds ##Opens the help page, which contains the variable descriptions provided in the question.
##Summary of each variable is as follows:
> summary(diamonds)
carat
Min. :0.2000
1st Qu.:0.4000
Median :0.7000
Mean :0.7979
3rd Qu.:1.0400
Max. :5.0100
cut
Fair : 1610
Good : 4906
Very Good:12082
Premium :13791
Ideal :21551
color
D: 6775
E: 9797
F: 9542
G:11292
H: 8304
I: 5422
J: 2808
clarity
SI1 :13065
VS2 :12258
SI2 : 9194
VS1 : 8171
VVS2 : 5066
VVS1 : 3655
(Other): 2531
depth
Min. :43.00
1st Qu.:61.00
Median :61.80
Mean :61.75
3rd Qu.:62.50
Max. :79.00
table
Min. :43.00
1st Qu.:56.00
Median :57.00
Mean :57.46
3rd Qu.:59.00
Max. :95.00
price
Min. : 326
1st Qu.: 950
Median : 2401
Mean : 3933
3rd Qu.: 5324
Max. :18823
x
Min. : 0.000
1st Qu.: 4.710
Median : 5.700
Mean : 5.731
3rd Qu.: 6.540
Max. :10.740
y
Min. : 0.000
1st Qu.: 4.720
Median : 5.710
Mean : 5.735
3rd Qu.: 6.540
Max. :58.900
z
Min. : 0.000
1st Qu.: 2.910
Median : 3.530
Mean : 3.539
3rd Qu.: 4.040
Max. :31.800
d=diamonds
pairs(d)
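##A numeric companion to the scatterplot matrix: a minimal sketch, assuming ggplot2 is installed.
##The names num_vars and cor_mat are illustrative and not part of the original answer.

```r
library(ggplot2)   # provides the diamonds data

d <- diamonds

# Keep only the numeric columns; cut, color and clarity are ordered factors
num_vars <- d[, sapply(d, is.numeric)]

# Pairwise Pearson correlations, rounded for display
cor_mat <- round(cor(num_vars), 3)
cor_mat
```

##Large correlations among carat, x, y and z would suggest that these predictors carry overlapping information.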
##Extract variables for performing regression analysis:
price=d$price ##Dependent (Response) variable
head(price)
carat=d$carat
cut=d$cut
color=d$color
clarity=d$clarity
depth=d$depth
table=d$table
x=d$x
y=d$y
z=d$z
##color, cut and clarity are ordered categorical variables (ordered factors).
> unique(color)
[1] E I J H F G D
Levels: D < E < F < G < H < I < J
> unique(cut)
[1] Ideal Premium Good Very Good Fair
Levels: Fair < Good < Very Good < Premium < Ideal
> unique(clarity)
[1] SI2 SI1 VS1 VS2 VVS2 VVS1 I1 IF
Levels: I1 < SI2 < SI1 < VS2 < VS1 < VVS2 < VVS1 < IF
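##By default, lm() codes ordered factors such as these with orthogonal polynomial contrasts.
##That is where coefficient names such as cut.L, cut.Q, cut.C and cut^4 in a regression summary come from.
##A minimal base-R sketch for cut; the names cut_levels, cp and ortho_check are illustrative only.

```r
# Levels of cut in their stored order (an ordered factor with 5 levels)
cut_levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")

# Default contrast for ordered factors: orthogonal polynomials.
# A factor with 5 levels gets 4 columns: linear, quadratic, cubic, 4th degree.
cp <- contr.poly(length(cut_levels))
rownames(cp) <- cut_levels
round(cp, 3)

# The columns are orthonormal, so their cross-product is the identity matrix
ortho_check <- round(crossprod(cp), 10)
ortho_check
```

##Each coefficient therefore measures one polynomial trend across the ordered levels, not the effect of one level.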
> plot(carat,price) ##Response (price) on the vertical axis
> ##Similar plots can be drawn for other variables
> plot(x,price)
fit=lm(price~carat+cut+color+clarity+depth+table+x+y+z)
s=summary(fit)
s
Call:
lm(formula = price ~ carat + cut + color + clarity + depth + table + x + y + z)
Residuals:
Min 1Q Median 3Q Max
-21376.0 -592.4 -183.5 376.4 10694.2
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5753.762 396.630 14.507 < 2e-16 ***
carat 11256.978 48.628 231.494 < 2e-16 ***
cut.L 584.457 22.478 26.001 < 2e-16 ***
cut.Q -301.908 17.994 -16.778 < 2e-16 ***
cut.C 148.035 15.483 9.561 < 2e-16 ***
cut^4 -20.794 12.377 -1.680 0.09294 .
color.L -1952.160 17.342 -112.570 < 2e-16 ***
color.Q -672.054 15.777 -42.597 < 2e-16 ***
color.C -165.283 14.725 -11.225 < 2e-16 ***
color^4 38.195 13.527 2.824 0.00475 **
color^5 -95.793 12.776 -7.498 6.59e-14 ***
color^6 -48.466 11.614 -4.173 3.01e-05 ***
clarity.L 4097.431 30.259 135.414 < 2e-16 ***
clarity.Q -1925.004 28.227 -68.197 < 2e-16 ***
clarity.C 982.205 24.152 40.668 < 2e-16 ***
clarity^4 -364.918 19.285 -18.922 < 2e-16 ***
clarity^5 233.563 15.752 14.828 < 2e-16 ***
clarity^6 6.883 13.715 0.502 0.61575
clarity^7 90.640 12.103 7.489 7.06e-14 ***
depth -63.806 4.535 -14.071 < 2e-16 ***
table -26.474 2.912 -9.092 < 2e-16 ***
x -1008.261 32.898 -30.648 < 2e-16 ***
y 9.609 19.333 0.497 0.61918
z -50.119 33.486 -1.497 0.13448
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1130 on 53916 degrees of freedom
Multiple R-squared: 0.9198, Adjusted R-squared: 0.9198
F-statistic: 2.688e+04 on 23 and 53916 DF, p-value: < 2.2e-16
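###From the summary above, the model with all variables has Adjusted R-squared=0.9198.
###For each coefficient, H0: the coefficient is 0 (the term has no effect given the other terms in the model).
###p-values greater than alpha=0.05: cut^4=0.09294 ; clarity^6=0.61575 ; y=0.61918 ; z=0.13448.
###Thus H0 is not rejected for these four terms; all remaining terms are significant.
###Note that cut^4 and clarity^6 are single contrast terms of the factors cut and clarity, not separate variables;
###the other contrasts of cut and clarity are significant. Only y and z are non-significant as whole variables.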
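##The t-tests above look at one coefficient at a time. For a factor such as cut or clarity,
##one F-test per term checks all of its contrasts together. A minimal sketch using drop1(),
##assuming ggplot2 is installed; the names full and term_tests are illustrative only.

```r
library(ggplot2)   # provides the diamonds data

d <- diamonds

# Same formula as the full model, but with the data passed explicitly
full <- lm(price ~ carat + cut + color + clarity + depth + table + x + y + z,
           data = d)

# Drop each term in turn and compare with the full model by an F-test.
# A factor is dropped as a whole, so Df equals its number of contrasts.
term_tests <- drop1(full, test = "F")
term_tests
```

##A factor would normally be kept or dropped as a whole based on this test, not on a single contrast.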
par(mfrow=c(2,2))
plot(fit)
##The Normal Q-Q plot shows that the residuals are not normally distributed: the points depart from the reference line in the tails.
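##The tail behaviour seen in the Q-Q plot can also be checked numerically. A rough sketch,
##assuming ggplot2 is installed; the cut-off of 3 and the names full, std_res, frac_obs and
##frac_norm are illustrative choices, not part of the original answer.

```r
library(ggplot2)   # provides the diamonds data

d <- diamonds
full <- lm(price ~ carat + cut + color + clarity + depth + table + x + y + z,
           data = d)

# Residuals scaled by the residual standard error
std_res <- resid(full) / summary(full)$sigma

# Share of scaled residuals larger than 3 in absolute value
frac_obs <- mean(abs(std_res) > 3)

# Share expected if the residuals were exactly normal
frac_norm <- 2 * pnorm(-3)

c(observed = frac_obs, normal = frac_norm)
```

##An observed share well above the normal share indicates heavier tails than a normal distribution.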
####
##Eliminated variables y,z
fit1=lm(price~carat+cut+color+clarity+depth+table+x)
s1=summary(fit1)
s1
Call:
lm(formula = price ~ carat + cut + color + clarity + depth + table + x)
Residuals:
Min 1Q Median 3Q Max
-21385.0 -592.4 -183.7 376.5 10694.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5935.107 378.328 15.688 < 2e-16 ***
carat 11256.968 48.600 231.626 < 2e-16 ***
cut.L 584.717 22.476 26.015 < 2e-16 ***
cut.Q -302.037 17.983 -16.795 < 2e-16 ***
cut.C 148.065 15.459 9.578 < 2e-16 ***
cut^4 -21.253 12.364 -1.719 0.08562 .
color.L -1952.128 17.342 -112.568 < 2e-16 ***
color.Q -672.207 15.777 -42.608 < 2e-16 ***
color.C -165.451 14.724 -11.236 < 2e-16 ***
color^4 38.261 13.526 2.829 0.00468 **
color^5 -95.816 12.776 -7.500 6.50e-14 ***
color^6 -48.441 11.614 -4.171 3.04e-05 ***
clarity.L 4096.912 30.253 135.423 < 2e-16 ***
clarity.Q -1924.681 28.224 -68.192 < 2e-16 ***
clarity.C 982.004 24.149 40.664 < 2e-16 ***
clarity^4 -364.870 19.285 -18.920 < 2e-16 ***
clarity^5 233.449 15.751 14.822 < 2e-16 ***
clarity^6 6.973 13.715 0.508 0.61114
clarity^7 90.738 12.103 7.497 6.63e-14 ***
depth -66.769 4.091 -16.322 < 2e-16 ***
table -26.457 2.911 -9.089 < 2e-16 ***
x -1029.478 20.549 -50.098 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1130 on 53918 degrees of freedom
Multiple R-squared: 0.9198, Adjusted R-squared: 0.9198
F-statistic: 2.944e+04 on 21 and 53918 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(fit1)
###The diagnostic plots and their interpretation are essentially the same as for the full model.
> s$adj.r.squared
[1] 0.9197573
> s1$adj.r.squared
[1] 0.9197568
> s$sigma
[1] 1130.094
> s1$sigma
[1] 1130.098
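##The two models are nested, so the removal of y and z can also be tested jointly with a partial F-test.
##A minimal sketch, assuming ggplot2 is installed; the names full, reduced and yz_test are illustrative only.

```r
library(ggplot2)   # provides the diamonds data

d <- diamonds

# Full model, with the data passed explicitly
full <- lm(price ~ carat + cut + color + clarity + depth + table + x + y + z,
           data = d)

# Same model without y and z
reduced <- update(full, . ~ . - y - z)

# Partial F-test; H0: the coefficients of y and z are both 0
yz_test <- anova(reduced, full)
yz_test
```

##A p-value above alpha=0.05 here would support dropping y and z together.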
##Eliminated variables cut,y,z
fit2=lm(price~carat+color+clarity+depth+table+x)
s2=summary(fit2)
s2
Call:
lm(formula = price ~ carat + color + clarity + depth + table + x)
Residuals:
Min 1Q Median 3Q Max
-21828.7 -591.3 -184.1 381.3 10610.4
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10428.768 325.184 32.070 < 2e-16 ***
carat 11286.547 48.877 230.916 < 2e-16 ***
color.L -1949.727 17.448 -111.745 < 2e-16 ***
color.Q -671.705 15.871 -42.323 < 2e-16 ***
color.C -171.515 14.812 -11.580 < 2e-16 ***
color^4 35.575 13.607 2.614 0.00894 **
color^5 -93.948 12.854 -7.309 2.73e-13 ***
color^6 -52.346 11.685 -4.480 7.48e-06 ***
clarity.L 4193.474 30.160 139.039 < 2e-16 ***
clarity.Q -2002.530 28.155 -71.125 < 2e-16 ***
clarity.C 1036.495 24.168 42.888 < 2e-16 ***
clarity^4 -399.156 19.325 -20.655 < 2e-16 ***
clarity^5 245.525 15.837 15.503 < 2e-16 ***
clarity^6 -0.855 13.793 -0.062 0.95057
clarity^7 95.949 12.175 7.881 3.31e-15 ***
depth -110.281 3.730 -29.564 < 2e-16 ***
table -54.258 2.363 -22.960 < 2e-16 ***
x -1044.128 20.665 -50.525 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1137 on 53922 degrees of freedom
Multiple R-squared: 0.9188, Adjusted R-squared: 0.9188
F-statistic: 3.588e+04 on 17 and 53922 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(fit2)
##########
##Eliminated variables clarity,y,z
fit3=lm(price~carat+cut+color+depth+table+x)
s3=summary(fit3)
s3
Call:
lm(formula = price ~ carat + cut + color + depth + table + x)
Residuals:
Min 1Q Median 3Q Max
-23496.1 -588.9 -105.7 391.8 12452.3
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11586.776 462.998 25.026 < 2e-16 ***
carat 11330.866 59.371 190.847 < 2e-16 ***
cut.L 1019.277 27.415 37.179 < 2e-16 ***
cut.Q -480.919 21.934 -21.926 < 2e-16 ***
cut.C 321.039 18.962 16.930 < 2e-16 ***
cut^4 43.433 15.205 2.857 0.00428 **
color.L -1646.134 21.181 -77.716 < 2e-16 ***
color.Q -772.264 19.329 -39.953 < 2e-16 ***
color.C -104.514 18.125 -5.766 8.15e-09 ***
color^4 98.782 16.648 5.934 2.98e-09 ***
color^5 -147.328 15.736 -9.362 < 2e-16 ***
color^6 -151.867 14.274 -10.639 < 2e-16 ***
depth -115.554 5.015 -23.040 < 2e-16 ***
table -40.388 3.584 -11.267 < 2e-16 ***
x -1349.739 24.916 -54.171 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1393 on 53925 degrees of freedom
Multiple R-squared: 0.8782, Adjusted R-squared: 0.8782
F-statistic: 2.777e+04 on 14 and 53925 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(fit3)
##Eliminated variables cut,clarity,y,z
fit4=lm(price~carat+color+depth+table+x)
s4=summary(fit4)
s4
Call:
lm(formula = price ~ carat + color + depth + table + x)
Residuals:
Min 1Q Median 3Q Max
-24411.9 -582.4 -97.2 387.0 12343.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20389.826 395.039 51.615 < 2e-16 ***
carat 11373.151 60.145 189.096 < 2e-16 ***
color.L -1636.388 21.462 -76.245 < 2e-16 ***
color.Q -769.320 19.583 -39.285 < 2e-16 ***
color.C -113.409 18.363 -6.176 6.62e-10 ***
color^4 92.702 16.867 5.496 3.90e-08 ***
color^5 -146.797 15.944 -9.207 < 2e-16 ***
color^6 -161.379 14.462 -11.158 < 2e-16 ***
depth -193.960 4.575 -42.399 < 2e-16 ***
table -101.608 2.907 -34.948 < 2e-16 ***
x -1382.453 25.231 -54.792 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1411 on 53929 degrees of freedom
Multiple R-squared: 0.8749, Adjusted R-squared: 0.8749
F-statistic: 3.772e+04 on 10 and 53929 DF, p-value: < 2.2e-16
##In this model, with only carat, color, depth, table and x present, the Adjusted R-squared=0.8749, which is less than in all the previous models.
##Also, the residual standard error=1411 is greater than in the previous models.
##The previous models fit better than this model.
par(mfrow=c(2,2))
plot(fit4)
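##By this criterion, the model with the greatest Adjusted R-squared value and the least residual standard error is the best model.
##Here that is the model with all variables:
##price ~ carat + cut + color + clarity + depth + table + x + y + z
##(Adjusted R-squared=0.9197573, residual standard error=1130.094).
##The model without y and z is practically equivalent (0.9197568 and 1130.098).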
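##The comparison of the five models can be collected in one table. A minimal sketch, assuming ggplot2
##is installed; the names formulas, fits and comparison are illustrative and not part of the original answer.

```r
library(ggplot2)   # provides the diamonds data

d <- diamonds

# The five model formulas used above, keyed by the object names of the fits
formulas <- list(
  fit  = price ~ carat + cut + color + clarity + depth + table + x + y + z,
  fit1 = price ~ carat + cut + color + clarity + depth + table + x,
  fit2 = price ~ carat + color + clarity + depth + table + x,
  fit3 = price ~ carat + cut + color + depth + table + x,
  fit4 = price ~ carat + color + depth + table + x
)

# Fit every model on the same data
fits <- lapply(formulas, lm, data = d)

# Adjusted R-squared and residual standard error for each model
comparison <- data.frame(
  adj_r2 = sapply(fits, function(m) summary(m)$adj.r.squared),
  sigma  = sapply(fits, function(m) summary(m)$sigma)
)

# Best model by Adjusted R-squared first
comparison[order(-comparison$adj_r2), ]
```

##Adjusted R-squared and residual standard error are only one possible criterion; the table simply restates the values reported above side by side.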