An endocrinologist was interested in exploring the relationship between the level of a steroid (Y) and age (X) in healthy subjects whose ages ranged from 8 to 25 years. She collected a sample of 27 healthy subjects in this age range. The data is located in the file problem01.txt, where the first column represents X = age and the second column represents Y = steroid level. For all R programming, print input and output codes and values.
(a) Read the file problem01.txt into R using the read.table() function. You’ll need to set the working directory to the file location. Make a scatterplot of steroid (Y) versus age (X). Include the plot.
(b) Use R to fit a simple linear regression. Write down the fitted equation and multiple $R^2$ from the summary() output. Also comment on the p-value for the $\beta_1$ coefficient.
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
(c) Make a scatterplot of the fitted values versus the standardized residuals for the model in part (b). Are there any violations of assumptions? Include a copy of your plot.
(d) Create a quadratic regression in R. Write down the fitted equation, multiple $R^2$, and the p-value for $\beta_1$ from the summary() output. Compare to part (b).
$Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \varepsilon_i$
problem01.txt
"age" "steroid"
15 14.1
10 8.5
13 10.8
16 18.4
10 4.7
18 23.3
16 16.4
10 9.4
16 17.7
23 35.8
19 25.4
18 24.9
24 42.1
19 26.5
24 40
12 10.7
13 11.6
10 3.6
23 37.9
17 16.8
19 24
23 37.7
20 29.6
14 13.7
19 23.1
11 8.3
17 19.6
9 7.8
11 7.1
13 13.3
18 20.8
25 44.4
9 9.7
12 12.5
22 34.9
8 4.3
9 5.9
8 6
22 36.2
15 11.7
10 5.3
15 15.6
9 6.6
14 15.7
13 10.5
17 20.7
23 36.8
23 37.2
8 5
16 19.6
16 18.9
15 16.1
10 7.7
14 11.9
12 9
8 4.4
8 2.7
8 5.2
16 19.3
20 27.5
20 27.8
13 12.9
12 12.8
13 9.3
15 16.1
19 25
13 10.5
13 9
18 22.3
22 33.6
9 4.9
19 28.4
15 14
21 30.6
19 24.8
R Outline Sample
########################
####### Part (a) #######
########################
# First save the file 'problem01.txt' on your computer.
# Next, set the working directory to the file location by doing the following:
# 1) Click on 'Session' on the top menu
# 2) Select 'Set Working Directory' > 'Choose Directory'
# 3) Select the folder where 'problem01.txt' is saved
# Read in data using the read.table() function.
dat <-
attach(dat)
# Create a scatterplot of age (X) vs steroid (Y)
# Write code here
########################
####### Part (b) #######
########################
# Fit a simple linear regression, then display the summary
fit <- # Enter code for simple linear regression
summary(fit)
########################
####### Part (c) #######
########################
# Plot the fitted values versus the standardized residuals for the fitted
# equation in part (b). Use the functions: sigma(), resid(), and predict()
y.hat <-
e.std <-
plot(y.hat, e.std, main = "Standardized Residuals vs. Fitted Value")
########################
####### Part (d) #######
########################
rm(list = ls())
# Reading data (the file header names the columns "age" and "steroid")
df <- read.table(file.choose(), header = TRUE)  # or pass the path to problem01.txt
head(df)
# Scatterplot of steroid (Y) versus age (X)
plot(df$age, df$steroid, xlab = "Age (years)", ylab = "Steroid level")
# Fitting the simple linear model
model.linear <- lm(steroid ~ age, data = df)
# Summary of the model with coefficient estimates, p-values, and R-squared
summary(model.linear)
# Adding age^2 to the data
df$age2 <- df$age^2
# Fitting the quadratic regression
model.quadratic <- lm(steroid ~ age + age2, data = df)
# Summary of the quadratic model
summary(model.quadratic)
Part (a)
Part (b)
Age is highly correlated with steroid level, and the R^2 is very high, so the linear model fits the data quite well.
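The fitted equation, R^2, and slope p-value asked for in part (b) can also be pulled out of the fitted object rather than read off the printed summary. The following is a minimal sketch, assuming the column names `age` and `steroid` from the file header. To stay self-contained it embeds only the first ten rows of problem01.txt, so its numbers will differ from those of the full-data fit.

```r
# First ten rows of problem01.txt, embedded so the example runs on its own
dat <- read.table(header = TRUE, text = '"age" "steroid"
15 14.1
10 8.5
13 10.8
16 18.4
10 4.7
18 23.3
16 16.4
10 9.4
16 17.7
23 35.8')

fit <- lm(steroid ~ age, data = dat)
s <- summary(fit)

b <- coef(fit)                        # named vector: (Intercept), age
r2 <- s$r.squared                     # multiple R-squared
p.age <- coef(s)["age", "Pr(>|t|)"]   # p-value for the slope (beta_1)

cat(sprintf("steroid.hat = %.3f + %.3f * age\n", b[1], b[2]))
cat(sprintf("R-squared = %.4f, p-value for slope = %.3g\n", r2, p.age))
```

Replacing the embedded subset with the data frame read from the full file gives the values to report for part (b).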
Part (c)
Residuals vs. fitted
The residuals show a non-linear pattern, which suggests trying a quadratic model.
Standardized residuals vs. fitted
This plot is useful for checking for heteroscedasticity, but it shows no clear evidence of it.
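The plot for part (c) can be produced with the functions named in the outline: each residual is divided by the residual standard error, resid(fit) / sigma(fit), and plotted against the fitted values. The sketch below assumes the column names `age` and `steroid` and again uses only the first ten rows of problem01.txt for self-containment. Note that R's built-in rstandard() additionally adjusts for leverage, so its values differ slightly from this simpler version.

```r
# First ten rows of problem01.txt, embedded so the example runs on its own
dat <- read.table(header = TRUE, text = '"age" "steroid"
15 14.1
10 8.5
13 10.8
16 18.4
10 4.7
18 23.3
16 16.4
10 9.4
16 17.7
23 35.8')

fit <- lm(steroid ~ age, data = dat)

y.hat <- predict(fit)              # fitted values for the observed ages
e.std <- resid(fit) / sigma(fit)   # residuals scaled by the residual standard error

plot(y.hat, e.std,
     xlab = "Fitted value", ylab = "Standardized residual",
     main = "Standardized Residuals vs. Fitted Value")
abline(h = 0, lty = 2)             # reference line at zero
```

A curved band of points around the zero line points to a missing non-linear term, while a funnel shape would point to non-constant variance.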
Part (d)
model.quadratic
The quadratic model has a high R-squared. Once the squared term is included, the linear (unsquared) term no longer contributes significantly to the model. Hence the relationship between the variables is quadratic.
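One way to make the comparison with part (b) concrete is to put the two R^2 values side by side and run a nested-model F-test with anova(). This is a sketch rather than the method used above: it writes the squared term as I(age^2) inside the formula instead of adding a column, assumes the column names `age` and `steroid`, and uses only the first ten rows of problem01.txt, so its output will not match the full-data results.

```r
# First ten rows of problem01.txt, embedded so the example runs on its own
dat <- read.table(header = TRUE, text = '"age" "steroid"
15 14.1
10 8.5
13 10.8
16 18.4
10 4.7
18 23.3
16 16.4
10 9.4
16 17.7
23 35.8')

fit.lin  <- lm(steroid ~ age, data = dat)
fit.quad <- lm(steroid ~ age + I(age^2), data = dat)  # I() protects the power

# Multiple R-squared of both models
r2 <- c(linear    = summary(fit.lin)$r.squared,
        quadratic = summary(fit.quad)$r.squared)
print(r2)

# F-test of the linear model against the quadratic one
cmp <- anova(fit.lin, fit.quad)
print(cmp)
p.quad <- cmp[["Pr(>F)"]][2]   # p-value for adding the squared term
```

Because only one term is added, this F-test p-value matches the t-test p-value for the squared term in summary(fit.quad). R^2 can never decrease when a term is added, so the test is the more informative of the two comparisons.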