Question

Assignment:

Install and load the ggplot2 package.

Load the "diamonds" dataset.

RCode:

install.packages("ggplot2")
library(ggplot2)
?diamonds

1. Explore the dataset & state insights

2. Create plots for dataset

3. Provide a summary of descriptive statistics.

4. Run the regressions, research, investigate & comment on R^2 & on the regression plots - 1 line each.

#===========================================
# DV = Price, IV or IVs = your choice
# Can we create and compare models to predict "Price"?
# Question- Investigate & comment on R^2 & on plots
#Compare regression models & discuss R^2 -any improvement?
# Based on your understanding of regression models, select the best model
#to predict the price of diamonds based on the dataset

#Name your R file as LastNameFirstInitial.R and include your full name in the first line of the script.

diamonds {ggplot2} R Documentation
Prices of over 50,000 round cut diamonds

Description

A dataset containing the prices and other attributes of almost 54,000 diamonds. The variables are as follows:

Usage

diamonds

Format

A data frame with 53940 rows and 10 variables:

price
price in US dollars ($326–$18,823)

carat
weight of the diamond (0.2–5.01)

cut
quality of the cut (Fair, Good, Very Good, Premium, Ideal)

color
diamond colour, from J (worst) to D (best)

clarity
a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))

x
length in mm (0–10.74)

y
width in mm (0–58.9)

z
depth in mm (0–31.8)

depth
total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)

table
width of top of diamond relative to widest point (43–95)
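
The depth formula above can be checked against the data. The following is a minimal sketch, assuming the percentage is the ratio multiplied by 100 and skipping rows where x + y is zero; the names `ok`, `d_ok` and `depth_check` are hypothetical and not part of the documentation.

```r
library(ggplot2)  # provides the diamonds dataset

# Keep rows with a usable length and width (some rows record 0 mm).
ok <- (diamonds$x + diamonds$y) > 0
d_ok <- diamonds[ok, ]

# Recompute total depth percentage as 100 * 2 * z / (x + y).
depth_check <- 100 * 2 * d_ok$z / (d_ok$x + d_ok$y)

# Compare with the recorded depth column; the typical gap should be small.
summary(abs(depth_check - d_ok$depth))
```

Note that `mean(x, y)` in the formula is mathematical shorthand for (x + y) / 2; calling `mean(x, y)` in R would treat `y` as the `trim` argument, which is why the sketch writes the average out explicitly.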

Solutions

Expert Solution

#install.packages("ggplot2") normally installs its dependencies automatically.
#If that fails, first install these packages: gtable,scales,munsell,lazyeval,plyr,withr,fansi,utf8,cli,assertthat

#Then Install and load ggplot2 package
install.packages("ggplot2")
library(ggplot2)
diamonds
# A tibble: 53,940 x 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ... with 53,930 more rows

?diamonds ##This gives output which is provided in the question.

##Summary of each variable is as follows:
> summary(diamonds)
     carat               cut        color        clarity          depth
 Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00
 1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00
 Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80
 Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75
 3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50
 Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00
                                    J: 2808   (Other): 2531
     table           price             x                y                z
 Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000
 1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720   1st Qu.: 2.910
 Median :57.00   Median : 2401   Median : 5.700   Median : 5.710   Median : 3.530
 Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735   Mean   : 3.539
 3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540   3rd Qu.: 4.040
 Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900   Max.   :31.800
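
One insight from the summary is that x, y and z each have a minimum of 0 mm, and that the maxima of y (58.9) and z (31.8) lie far above their third quartiles. The sketch below counts such rows; the 20 mm cutoff and the names `zero_dim`, `too_big` and `d_clean` are illustrative assumptions, not part of the original solution.

```r
library(ggplot2)  # provides the diamonds dataset

d <- diamonds

# A physical dimension of 0 mm is impossible, so flag those rows.
zero_dim <- d$x == 0 | d$y == 0 | d$z == 0
n_zero <- sum(zero_dim)

# Flag implausibly large y or z; the 20 mm cutoff is an illustrative assumption.
too_big <- d$y > 20 | d$z > 20
n_big <- sum(too_big)

c(zero_dimension_rows = n_zero, oversized_rows = n_big)

# Cleaned copy, in case the models are to be refit without these rows.
d_clean <- d[!(zero_dim | too_big), ]
```

The regressions in this solution use the full dataset; `d_clean` is only offered as a starting point for a sensitivity check.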

d=diamonds
pairs(d)

##Extract variables for performing regression analysis:
price=d$price ##Dependent (Response) variable
head(price)
carat=d$carat
cut=d$cut
color=d$color
clarity=d$clarity
depth=d$depth
table=d$table
x=d$x
y=d$y
z=d$z

##color, cut and clarity are categorical variables, stored as ordered factors.

> unique(color)
[1] E I J H F G D
Levels: D < E < F < G < H < I < J
> unique(cut)
[1] Ideal Premium Good Very Good Fair   
Levels: Fair < Good < Very Good < Premium < Ideal
> unique(clarity)
[1] SI2 SI1 VS1 VS2 VVS2 VVS1 I1 IF
Levels: I1 < SI2 < SI1 < VS2 < VS1 < VVS2 < VVS1 < IF
> plot(carat,price) ##price (the response) goes on the vertical axis

> ##Similar plots can be drawn for other variables

> plot(x,price)
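
Since the assignment loads ggplot2, the same plots can also be drawn with it. This is a minimal sketch rather than the original author's code; the object names `p_scatter` and `p_box` and the alpha value are assumptions chosen for illustration.

```r
library(ggplot2)  # provides ggplot() and the diamonds dataset

# Scatter plot of price against carat; a low alpha reduces overplotting
# with roughly 54,000 points.
p_scatter <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1) +
  labs(x = "Carat", y = "Price (US dollars)")

# Box plot of price by cut, one box per level of the ordered factor.
p_box <- ggplot(diamonds, aes(x = cut, y = price)) +
  geom_boxplot() +
  labs(x = "Cut", y = "Price (US dollars)")

print(p_scatter)
print(p_box)
```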

fit=lm(price~carat+cut+color+clarity+depth+table+x+y+z)
s=summary(fit)
s
Call:
lm(formula = price ~ carat + cut + color + clarity + depth + table + x + y + z)
Residuals:
Min 1Q Median 3Q Max
-21376.0 -592.4 -183.5 376.4 10694.2
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5753.762 396.630 14.507 < 2e-16 ***
carat 11256.978 48.628 231.494 < 2e-16 ***
cut.L 584.457 22.478 26.001 < 2e-16 ***
cut.Q -301.908 17.994 -16.778 < 2e-16 ***
cut.C 148.035 15.483 9.561 < 2e-16 ***
cut^4 -20.794 12.377 -1.680 0.09294 .
color.L -1952.160 17.342 -112.570 < 2e-16 ***
color.Q -672.054 15.777 -42.597 < 2e-16 ***
color.C -165.283 14.725 -11.225 < 2e-16 ***
color^4 38.195 13.527 2.824 0.00475 **
color^5 -95.793 12.776 -7.498 6.59e-14 ***
color^6 -48.466 11.614 -4.173 3.01e-05 ***
clarity.L 4097.431 30.259 135.414 < 2e-16 ***
clarity.Q -1925.004 28.227 -68.197 < 2e-16 ***
clarity.C 982.205 24.152 40.668 < 2e-16 ***
clarity^4 -364.918 19.285 -18.922 < 2e-16 ***
clarity^5 233.563 15.752 14.828 < 2e-16 ***
clarity^6 6.883 13.715 0.502 0.61575
clarity^7 90.640 12.103 7.489 7.06e-14 ***
depth -63.806 4.535 -14.071 < 2e-16 ***
table -26.474 2.912 -9.092 < 2e-16 ***
x -1008.261 32.898 -30.648 < 2e-16 ***
y 9.609 19.333 0.497 0.61918
z -50.119 33.486 -1.497 0.13448
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1130 on 53916 degrees of freedom
Multiple R-squared: 0.9198, Adjusted R-squared: 0.9198
F-statistic: 2.688e+04 on 23 and 53916 DF, p-value: < 2.2e-16

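###From the summary above, the model with all variables has Adjusted R-squared=0.9198.
###For each coefficient, H0: the coefficient is zero (the term has no effect on price given the other terms).

###The p-values for cut^4 (0.09294), clarity^6 (0.61575), y (0.61918) and z (0.13448) are greater than alpha=0.05.
###Thus these four terms are not significant at the 5% level; all remaining terms are significant.
###Note that cut^4 and clarity^6 are single polynomial-contrast terms of the ordered factors cut and clarity, not separate variables; only y and z are whole predictors that fail the test.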
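
Because cut is an ordered factor, its coefficients (.L, .Q, .C, ^4) are polynomial contrasts, so the p-value of a single term such as cut^4 does not say whether cut as a whole matters. A partial F-test on all cut terms at once is one way to check this. The sketch below is an addition under that assumption; `fit_with_cut`, `fit_no_cut` and `cmp` are hypothetical names.

```r
library(ggplot2)  # provides the diamonds dataset

# Fit the model with and without the ordered factor cut.
fit_with_cut <- lm(price ~ carat + cut + color + clarity + depth + table + x,
                   data = diamonds)
fit_no_cut <- lm(price ~ carat + color + clarity + depth + table + x,
                 data = diamonds)

# Ordered factors are coded with polynomial contrasts, which is where
# the .L, .Q, .C and ^4 coefficient names come from.
contrasts(diamonds$cut)

# Partial F-test: does dropping all four cut terms at once worsen the fit?
cmp <- anova(fit_no_cut, fit_with_cut)
cmp
```

A small p-value in the second row of `cmp` would indicate that cut contributes to the model even though one of its contrast terms is individually non-significant.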

par(mfrow=c(2,2))
plot(fit)

##The Normal Q-Q plot shows that the residuals are not normally distributed: the points depart from the reference line in both tails.
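
One common response to heavy-tailed residuals is to model the logarithm of price. This is not part of the original solution; the sketch below is offered only as an assumption-laden alternative, and `fit_log` is a hypothetical name. Its R-squared is measured on the log scale, so it is not directly comparable with the R-squared values of the models for price itself.

```r
library(ggplot2)  # provides the diamonds dataset

# Log-log model: log(price) on log(carat) plus the three ordered factors.
fit_log <- lm(log(price) ~ log(carat) + cut + color + clarity, data = diamonds)

# The same four diagnostic plots, now for the transformed model.
par(mfrow = c(2, 2))
plot(fit_log)

# Adjusted R-squared on the log scale.
summary(fit_log)$adj.r.squared
```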

##Eliminated variables y,z (not significant in the full model)
fit1=lm(price~carat+cut+color+clarity+depth+table+x)
s1=summary(fit1)
s1
Call:
lm(formula = price ~ carat + cut + color + clarity + depth + table + x)
Residuals:
Min 1Q Median 3Q Max
-21385.0 -592.4 -183.7 376.5 10694.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5935.107 378.328 15.688 < 2e-16 ***
carat 11256.968 48.600 231.626 < 2e-16 ***
cut.L 584.717 22.476 26.015 < 2e-16 ***
cut.Q -302.037 17.983 -16.795 < 2e-16 ***
cut.C 148.065 15.459 9.578 < 2e-16 ***
cut^4 -21.253 12.364 -1.719 0.08562 .
color.L -1952.128 17.342 -112.568 < 2e-16 ***
color.Q -672.207 15.777 -42.608 < 2e-16 ***
color.C -165.451 14.724 -11.236 < 2e-16 ***
color^4 38.261 13.526 2.829 0.00468 **
color^5 -95.816 12.776 -7.500 6.50e-14 ***
color^6 -48.441 11.614 -4.171 3.04e-05 ***
clarity.L 4096.912 30.253 135.423 < 2e-16 ***
clarity.Q -1924.681 28.224 -68.192 < 2e-16 ***
clarity.C 982.004 24.149 40.664 < 2e-16 ***
clarity^4 -364.870 19.285 -18.920 < 2e-16 ***
clarity^5 233.449 15.751 14.822 < 2e-16 ***
clarity^6 6.973 13.715 0.508 0.61114
clarity^7 90.738 12.103 7.497 6.63e-14 ***
depth -66.769 4.091 -16.322 < 2e-16 ***
table -26.457 2.911 -9.089 < 2e-16 ***
x -1029.478 20.549 -50.098 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1130 on 53918 degrees of freedom
Multiple R-squared: 0.9198, Adjusted R-squared: 0.9198
F-statistic: 2.944e+04 on 21 and 53918 DF, p-value: < 2.2e-16

par(mfrow=c(2,2))
plot(fit1)

###Dropping y and z changes almost nothing: the Adjusted R-squared and the residual standard error are essentially the same.
> s$adj.r.squared
[1] 0.9197573
> s1$adj.r.squared
[1] 0.9197568
> s$sigma
[1] 1130.094
> s1$sigma
[1] 1130.098

##Eliminated variables cut,y,z
fit2=lm(price~carat+color+clarity+depth+table+x)   
s2=summary(fit2)
s2
Call:
lm(formula = price ~ carat + color + clarity + depth + table + x)
Residuals:
Min 1Q Median 3Q Max
-21828.7 -591.3 -184.1 381.3 10610.4
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10428.768 325.184 32.070 < 2e-16 ***
carat 11286.547 48.877 230.916 < 2e-16 ***
color.L -1949.727 17.448 -111.745 < 2e-16 ***
color.Q -671.705 15.871 -42.323 < 2e-16 ***
color.C -171.515 14.812 -11.580 < 2e-16 ***
color^4 35.575 13.607 2.614 0.00894 **
color^5 -93.948 12.854 -7.309 2.73e-13 ***
color^6 -52.346 11.685 -4.480 7.48e-06 ***
clarity.L 4193.474 30.160 139.039 < 2e-16 ***
clarity.Q -2002.530 28.155 -71.125 < 2e-16 ***
clarity.C 1036.495 24.168 42.888 < 2e-16 ***
clarity^4 -399.156 19.325 -20.655 < 2e-16 ***
clarity^5 245.525 15.837 15.503 < 2e-16 ***
clarity^6 -0.855 13.793 -0.062 0.95057
clarity^7 95.949 12.175 7.881 3.31e-15 ***
depth -110.281 3.730 -29.564 < 2e-16 ***
table -54.258 2.363 -22.960 < 2e-16 ***
x -1044.128 20.665 -50.525 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1137 on 53922 degrees of freedom
Multiple R-squared: 0.9188, Adjusted R-squared: 0.9188
F-statistic: 3.588e+04 on 17 and 53922 DF, p-value: < 2.2e-16

par(mfrow=c(2,2))
plot(fit2)

##########
##Eliminated variables clarity,y,z
fit3=lm(price~carat+cut+color+depth+table+x)
s3=summary(fit3)
s3
Call:
lm(formula = price ~ carat + cut + color + depth + table + x)
Residuals:
Min 1Q Median 3Q Max
-23496.1 -588.9 -105.7 391.8 12452.3
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11586.776 462.998 25.026 < 2e-16 ***
carat 11330.866 59.371 190.847 < 2e-16 ***
cut.L 1019.277 27.415 37.179 < 2e-16 ***
cut.Q -480.919 21.934 -21.926 < 2e-16 ***
cut.C 321.039 18.962 16.930 < 2e-16 ***
cut^4 43.433 15.205 2.857 0.00428 **
color.L -1646.134 21.181 -77.716 < 2e-16 ***
color.Q -772.264 19.329 -39.953 < 2e-16 ***
color.C -104.514 18.125 -5.766 8.15e-09 ***
color^4 98.782 16.648 5.934 2.98e-09 ***
color^5 -147.328 15.736 -9.362 < 2e-16 ***
color^6 -151.867 14.274 -10.639 < 2e-16 ***
depth -115.554 5.015 -23.040 < 2e-16 ***
table -40.388 3.584 -11.267 < 2e-16 ***
x -1349.739 24.916 -54.171 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1393 on 53925 degrees of freedom
Multiple R-squared: 0.8782, Adjusted R-squared: 0.8782
F-statistic: 2.777e+04 on 14 and 53925 DF, p-value: < 2.2e-16

par(mfrow=c(2,2))
plot(fit3)

##Eliminated variables cut,clarity,y,z
fit4=lm(price~carat+color+depth+table+x)

s4=summary(fit4)
s4
Call:
lm(formula = price ~ carat + color + depth + table + x)
Residuals:
Min 1Q Median 3Q Max
-24411.9 -582.4 -97.2 387.0 12343.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20389.826 395.039 51.615 < 2e-16 ***
carat 11373.151 60.145 189.096 < 2e-16 ***
color.L -1636.388 21.462 -76.245 < 2e-16 ***
color.Q -769.320 19.583 -39.285 < 2e-16 ***
color.C -113.409 18.363 -6.176 6.62e-10 ***
color^4 92.702 16.867 5.496 3.90e-08 ***
color^5 -146.797 15.944 -9.207 < 2e-16 ***
color^6 -161.379 14.462 -11.158 < 2e-16 ***
depth -193.960 4.575 -42.399 < 2e-16 ***
table -101.608 2.907 -34.948 < 2e-16 ***
x -1382.453 25.231 -54.792 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1411 on 53929 degrees of freedom
Multiple R-squared: 0.8749, Adjusted R-squared: 0.8749
F-statistic: 3.772e+04 on 10 and 53929 DF, p-value: < 2.2e-16

##In this model, with only carat, color, depth, table and x present, the Adjusted R-squared=0.8749, which is lower than in the previous models.
##The residual standard error=1411 is also larger than in the previous models.
##The previous models fit better than this model.

par(mfrow=c(2,2))
plot(fit4)

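##The model with the greatest Adjusted R-squared value and the smallest residual standard error is taken as the BEST model.
##Here that is the model with all variables (Adjusted R-squared=0.9197573, residual standard error=1130.094), although the model without y and z is practically identical (0.9197568, 1130.098).
##Best model: price ~ carat + cut + color + clarity + depth + table + x + y + z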
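
The five models can also be compared side by side in one table. The sketch below refits them with `data = diamonds` instead of free-standing vectors; the names `formulas`, `models` and `comparison` are hypothetical, and the AIC column is an extra criterion that the original solution does not use.

```r
library(ggplot2)  # provides the diamonds dataset

# The five model formulas used in the solution above.
formulas <- list(
  fit  = price ~ carat + cut + color + clarity + depth + table + x + y + z,
  fit1 = price ~ carat + cut + color + clarity + depth + table + x,
  fit2 = price ~ carat + color + clarity + depth + table + x,
  fit3 = price ~ carat + cut + color + depth + table + x,
  fit4 = price ~ carat + color + depth + table + x
)

# Fit each formula on the diamonds data.
models <- lapply(formulas, function(f) lm(f, data = diamonds))

# Collect adjusted R-squared, residual standard error and AIC per model.
comparison <- data.frame(
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  sigma  = sapply(models, function(m) summary(m)$sigma),
  aic    = sapply(models, AIC),
  row.names = names(models)
)

# Sort from best to worst adjusted R-squared.
comparison[order(-comparison$adj_r2), ]
```

Reading the rows together makes the pattern in the summaries easier to see: removing y and z costs almost nothing, removing cut costs a little, and removing clarity costs the most.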

