Code in R:
We will use the **Auto{ISLR}** dataset to develop a binomial classification model that predicts the likelihood of an automobile having high gas mileage. First, load the **{ISLR}** library. Since the dataset has no dummy variable that separates high from low gas mileage vehicles, use the quantitative miles-per-gallon variable **mpg** to create a binary variable called **mpg.hi** that flags vehicles whose **mpg** is above the **median mpg**. Calculate the **median mpg** with the `median()` function and store the result in an object named **med.mpg**. Then create a column in the **Auto** dataset named **mpg.hi** that equals 1 if `mpg > med.mpg` and 0 otherwise, using the `ifelse()` function.
For a quick visual inspection, display the **med.mpg** value and then use `cbind()` to build a 2-column data frame of the first 20 values of **mpg** and **mpg.hi**. Please label the columns as shown below. You don't need to answer this, but quickly verify that your **mpg.hi** variable was created correctly.
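The steps described above can be sketched as follows. This is a minimal sketch, assuming the ISLR package is installed; the column labels `mpg` and `mpg.hi` passed to `cbind()` and the object name `check` are assumptions, since the labels the assignment refers to are not reproduced here.

```r
# Minimal sketch of the steps in the question; assumes ISLR is installed.
library(ISLR)

# Median miles per gallon across all vehicles in Auto
med.mpg <- median(Auto$mpg)

# Binary indicator: 1 if mpg is strictly above the median, 0 otherwise
Auto$mpg.hi <- ifelse(Auto$mpg > med.mpg, 1, 0)

# Quick visual check: the median, then the first 20 rows side by side.
# The column labels are assumed, not taken from the assignment.
med.mpg
check <- data.frame(cbind(mpg = Auto$mpg[1:20], mpg.hi = Auto$mpg.hi[1:20]))
check
```

Reading down `check`, every row with **mpg** above the printed median should show 1 in **mpg.hi** and every other row should show 0.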
> library(ISLR)
> head(Auto)
  mpg cylinders displacement horsepower weight acceleration year origin
1  18         8          307        130   3504         12.0   70      1
2  15         8          350        165   3693         11.5   70      1
3  18         8          318        150   3436         11.0   70      1
4  16         8          304        150   3433         12.0   70      1
5  17         8          302        140   3449         10.5   70      1
6  15         8          429        198   4341         10.0   70      1
                       name bin_mpg mpg_hi
1 chevrolet chevelle malibu       0      0
2         buick skylark 320       0      0
3        plymouth satellite       0      0
4             amc rebel sst       0      0
5               ford torino       0      0
6          ford galaxie 500       0      0
> med_mpg = median(Auto$mpg)
> med_mpg
[1] 22.75
> mpg_hi = ifelse(Auto$mpg > med_mpg, "mpg_hi", "mpg_low")
> Auto$mpg_hi = ifelse(Auto$mpg > med_mpg, 1, 0)
> head(data.frame(Auto$mpg,Auto$mpg_hi))
Auto.mpg Auto.mpg_hi
1 18 0
2 15 0
3 18 0
4 16 0
5 17 0
6 15 0
> summary(glm(mpg_hi ~ weight + year + cylinders + horsepower +
+     displacement + acceleration, family = "binomial", data = Auto))
Call:
glm(formula = mpg_hi ~ weight + year + cylinders + horsepower +
    displacement + acceleration, family = "binomial", data = Auto)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.1999  -0.1126   0.0115   0.2249   3.3019
Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)  -15.828787   5.653426  -2.800  0.00511 **
weight        -0.003986   0.001085  -3.673  0.00024 ***
year           0.414204   0.072700   5.697 1.22e-08 ***
cylinders     -0.015009   0.405220  -0.037  0.97045
horsepower    -0.035608   0.023543  -1.512  0.13042
displacement  -0.006745   0.009961  -0.677  0.49831
acceleration   0.007983   0.141357   0.056  0.95497
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 543.43 on 391 degrees of freedom
Residual deviance: 159.34 on 385 degrees of freedom
AIC: 173.34
Number of Fisher Scoring iterations: 8
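The coefficients in the summary above are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. The sketch below is only an illustration: it copies the estimates from the printed summary by hand, and the names `coefs` and `odds_ratios` are hypothetical. With a fitted model object one would apply `exp(coef(...))` to it instead.

```r
# Estimates copied by hand from the printed summary (intercept omitted)
coefs <- c(weight = -0.003986, year = 0.414204, cylinders = -0.015009,
           horsepower = -0.035608, displacement = -0.006745,
           acceleration = 0.007983)

# Odds ratio for a one-unit increase in each predictor, others held fixed
odds_ratios <- exp(coefs)
round(odds_ratios, 4)

# A one-pound change is tiny, so rescale weight to a 100 lb increase
weight_per_100lb <- exp(100 * coefs[["weight"]])
weight_per_100lb
```

On this reading, each additional model year multiplies the odds of high mileage by roughly 1.5, and each extra 100 lb of weight multiplies them by roughly 0.67, holding the other predictors fixed. Only **weight** and **year** are significant at the 0.05 level in the table, so the remaining odds ratios should be interpreted with caution.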
> fit1 = glm(as.factor(mpg_hi) ~ weight + year + cylinders + horsepower +
+     displacement + acceleration, family = "binomial", data = Auto)
> pred = predict(fit1,type="response")
> pred_mpg_hi = ifelse(pred>=0.5,1,0)
> table(mpg_hi,pred_mpg_hi)
         pred_mpg_hi
mpg_hi      0   1
  mpg_hi   16 180
  mpg_low 173  23
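The confusion matrix above can be summarized with a few standard rates. This is a minimal sketch that re-enters the four counts by hand, so it runs without the fitted model; the names `conf_mat`, `accuracy`, `sensitivity`, and `specificity` are illustrative, not part of the original answer.

```r
# Counts copied by hand from the table above (rows = actual, cols = predicted)
conf_mat <- matrix(c(16, 173, 180, 23), nrow = 2,
                   dimnames = list(actual = c("mpg_hi", "mpg_low"),
                                   predicted = c("0", "1")))

# Correct predictions: high-mpg cars predicted 1, low-mpg cars predicted 0
correct <- conf_mat["mpg_hi", "1"] + conf_mat["mpg_low", "0"]

accuracy    <- correct / sum(conf_mat)
error_rate  <- 1 - accuracy
sensitivity <- conf_mat["mpg_hi", "1"] / sum(conf_mat["mpg_hi", ])
specificity <- conf_mat["mpg_low", "0"] / sum(conf_mat["mpg_low", ])

round(c(accuracy = accuracy, error_rate = error_rate,
        sensitivity = sensitivity, specificity = specificity), 3)
```

With these counts the model classifies about 90% of the 392 vehicles correctly. Because `predict()` was called on the same data used to fit the model, this is a training-set figure and is likely to be optimistic compared with performance on held-out data.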