In: Statistics and Probability
Question 1. In this exercise you will simulate a data set and run a simple regression. To ensure reproducible results, make sure you use set.seed(1).
a) Using the rnorm() function, create a vector X that contains 200 observations from the N(0,1) distribution.
Similarly, create a 200-element vector, epsilon (ϵ), drawn from the N(0,0.25) distribution. This is the irreducible error.
b) Create the response data using the following relationship:
Y = −1 + 0.5X + ϵ
Fit a linear regression of Y on X. Display the summary statistics and interpret the diagnostic plots. In a single graph, draw a scatter diagram of the Y and X values together with the fitted line.
c) Now, fit a quadratic model by adding an X^2 (X squared) term to the model. Discuss whether or not there is an improvement in the fit. Draw a scatter diagram and fitted line similar to the previous part. Interpret the diagnostic plots.
d) Using the sample() function create train and test sets just like we did in class. Obtain predicted values (from the test set) for the linear and quadratic fits. Compare their MSEs. Which model is better in terms of predictions?
#PART A
#Creating x and epsilon vectors
```{r setup}
set.seed(1)
x=rnorm(200,mean=0,sd=1)
e=rnorm(200,mean=0,sd=0.25)
```
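One caveat worth flagging: `rnorm()` takes a standard deviation, not a variance. If the N(0, 0.25) in the question follows the usual N(mean, variance) notation, the appropriate call would use sd = sqrt(0.25) = 0.5 instead; this is an assumption about the intended notation, so the alternative is sketched here rather than swapped in.

```{r}
# Alternative draw, assuming 0.25 is the VARIANCE of epsilon (so sd = 0.5)
e_alt = rnorm(200, mean = 0, sd = sqrt(0.25))
```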
#PART B
#Calculating y and putting x and y in a dataframe
```{r}
y = -1 + 0.5*x + e
data = data.frame(X = x, Y = y)
head(data)  # preview the data frame instead of printing all 200 rows
```
#Running linear regression
```{r}
model <- lm(formula = Y ~ X, data = data)
summary(model)
par(mfrow = c(2, 2))  # show all four diagnostic plots in one panel
plot(model)
par(mfrow = c(1, 1))
```
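Rather than reading the coefficient estimates off the summary and retyping them by hand, they can be pulled directly from the fitted object:

```{r}
coef(model)    # estimated intercept and slope
confint(model) # 95% confidence intervals for both coefficients
```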
#Scatter plot with fitted line
```{r}
plot(x, y, pch = 16, cex = 1.3, col = "blue",
     main = "Scatter Plot and Regression Line", xlab = "X", ylab = "Y")
# draw the fitted line from the model object instead of hardcoding coefficients
abline(model, col = "red", lwd = 2)
```
#PART C
```{r}
# Fit a quadratic model by adding an X^2 term to the predictors
# (the response stays the same Y from part b)
quadratic <- lm(Y ~ X + I(X^2), data = data)
summary(quadratic)
par(mfrow = c(2, 2))
plot(quadratic)
par(mfrow = c(1, 1))
# Because the data were generated from a linear relationship, the I(X^2)
# coefficient should be near zero and insignificant, and R-squared should
# barely change: the quadratic term does not improve the fit.
```
#Scatter plot with fitted curve for quadratic model
```{r}
plot(x, y, pch = 16, cex = 1.3, col = "blue",
     main = "Scatter Plot and Quadratic Fit", xlab = "X", ylab = "Y")
# a quadratic fit is a curve, so plot the fitted values in X order
# rather than using abline(), which can only draw a straight line
ord <- order(x)
lines(x[ord], fitted(quadratic)[ord], col = "red", lwd = 2)
```
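A formal way to compare the linear and quadratic fits is a nested-model F-test. This is a sketch that refits both models on the same response so they are directly comparable:

```{r}
lin_fit  <- lm(Y ~ X, data = data)
quad_fit <- lm(Y ~ X + I(X^2), data = data)
anova(lin_fit, quad_fit)  # F-test: does the X^2 term significantly improve the fit?
```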
#PART D
```{r}
#For the linear model
# Random 80/20 train/test split
set.seed(1)
train_index <- sample(1:nrow(data), 0.8 * nrow(data))
test_index <- setdiff(1:nrow(data), train_index)
train <- data[train_index, ]
test <- data[test_index, ]
linear_fit <- lm(Y ~ X, data = train)
# predict() needs a data frame whose columns match the model formula
pred_linear <- predict(linear_fit, newdata = test)
# Test MSE: mean of the squared prediction errors over the test set
mse_linear <- mean((test$Y - pred_linear)^2)
print(mse_linear)
```
```{r}
#For the quadratic model
# Recreate the same split; both models must predict the same response Y
set.seed(1)
train_index <- sample(1:nrow(data), 0.8 * nrow(data))
test_index <- setdiff(1:nrow(data), train_index)
train <- data[train_index, ]
test <- data[test_index, ]
quad_fit <- lm(Y ~ X + I(X^2), data = train)
pred_quad <- predict(quad_fit, newdata = test)
mse_quad <- mean((test$Y - pred_quad)^2)
print(mse_quad)
# The two test MSEs should be very close; the simpler linear model is
# preferred for prediction, since the X^2 term adds nothing.
```
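A single random split can be noisy, so a useful extension is to average over many splits with k-fold cross-validation. A sketch using `cv.glm()` from the boot package (assumed installed; `glm()` with its default gaussian family fits the same model as `lm()`):

```{r}
library(boot)
set.seed(1)
glm_lin  <- glm(Y ~ X, data = data)
glm_quad <- glm(Y ~ X + I(X^2), data = data)
cv_lin  <- cv.glm(data, glm_lin,  K = 10)$delta[1]  # 10-fold CV estimate of test MSE
cv_quad <- cv.glm(data, glm_quad, K = 10)$delta[1]
c(linear = cv_lin, quadratic = cv_quad)
```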