Question

** Number 2 implemented in R (R Studio) **

  1. Set up the Auto data:
    1. Load the ISLR package and the Auto data
    2. Determine the median value for mpg
    3. Use the median to create a new column in the data set named mpglevel, which is 1 if mpg>median and otherwise is 0. Make sure this variable is a factor. We will use mpglevel as the target (response) variable for the algorithms.
    4. Use the names() function to verify that your new column is in Auto
    5. Create a 75-25 train/test split using seed 1234, but do not include the columns ‘name’ or ‘mpg’ in either train or test
  2. Plots
    1. Set up a 2x2 graph grid and plot the following pairs of plots
    2. Plot pair 1: plot mpg~horsepower and mpg~weight, setting colors according to the factor mpglevel, ex: col=(Auto$mpglevel)
    3. Plot pair 2: make conditional density plots, plotting mpglevel~horsepower and mpglevel~weight. For the cdplots, use parameter col=c('red', 'black')

Solutions

Expert Solution

code

# install.packages("ISLR") # run once if the package is not installed yet
library(ISLR) # a) loading the ISLR package

data(Auto) # a) loading the Auto data set that ships with ISLR
auto_data=as.data.frame(Auto)

head(auto_data)

median_mpg = median(auto_data$mpg) # b) finding median of mpg
median_mpg

# c) mpglevel is 1 if mpg > median and otherwise 0, stored as a factor
auto_data$mpglevel <- factor(ifelse(auto_data$mpg>median_mpg, 1, 0))

head(auto_data[auto_data$mpg>median_mpg,]) # rows with mpglevel 1
head(auto_data[auto_data$mpg<=median_mpg,]) # rows with mpglevel 0

names(auto_data) # d) verifying with names()

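# e) creating test and train data (75-25 split, without 'name' and 'mpg')
sample_size <- floor(0.75 * nrow(auto_data)) # choosing 75%
set.seed(1234)
train_ind = sample(seq_len(nrow(auto_data)),size = sample_size)

keep_cols = setdiff(names(auto_data), c("name", "mpg")) # drop 'name' and 'mpg'
train_data = auto_data[train_ind, keep_cols]
test_data = auto_data[-train_ind, keep_cols]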

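par(mfrow=c(2,2)) # 2x2 graph grid

# Plot pair 1: mpg~horsepower and mpg~weight, colored by mpglevel
# as.factor() keeps the color codes at 1 and 2; a numeric 0 would draw invisible points
plot(mpg~horsepower, data=auto_data, col=as.factor(auto_data$mpglevel))

plot(mpg~weight, data=auto_data, col=as.factor(auto_data$mpglevel))

# Plot pair 2: conditional density plots; the response must be a factor
cdplot(as.factor(mpglevel)~horsepower, data=auto_data, col=c('red', 'black'))

cdplot(as.factor(mpglevel)~weight, data=auto_data, col=c('red','black'))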
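
The setup steps from part 1 of the question can also be wrapped in one function so the result is easy to check. This is only a minimal sketch: the helper name prepare_split and its arguments frac and seed are hypothetical and not part of the original answer, and it assumes the data frame has an mpg column and possibly a name column.

```r
# Hypothetical helper (not from the original answer): adds mpglevel and
# returns a train/test split without the 'name' and 'mpg' columns.
prepare_split <- function(df, frac = 0.75, seed = 1234) {
  med <- median(df$mpg)
  # 1 if mpg is above the median, otherwise 0, stored as a factor
  df$mpglevel <- factor(ifelse(df$mpg > med, 1, 0), levels = c(0, 1))
  set.seed(seed)
  train_ind <- sample(seq_len(nrow(df)), size = floor(frac * nrow(df)))
  # keep every column except 'name' and 'mpg'
  keep_cols <- setdiff(names(df), c("name", "mpg"))
  list(train = df[train_ind, keep_cols, drop = FALSE],
       test  = df[-train_ind, keep_cols, drop = FALSE])
}

# Usage with the Auto data, assuming the ISLR package is installed
if (requireNamespace("ISLR", quietly = TRUE)) {
  parts <- prepare_split(ISLR::Auto)
  stopifnot(is.factor(parts$train$mpglevel))
  stopifnot(!any(c("name", "mpg") %in% names(parts$train)))
  print(dim(parts$train))             # rows and columns in the training set
  print(dim(parts$test))              # rows and columns in the test set
  print(table(parts$train$mpglevel))  # class counts in the training set
}
```

Because mpglevel uses a strict mpg > median comparison, rows that tie with the median fall into level 0, so the two classes need not be exactly the same size; the table() call shows the actual counts.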

