Question

Question 1. In this exercise you will simulate a data set and run a simple regression. To ensure reproducible results, make sure you use set.seed(1).

a)   Using the rnorm() function, create a vector X that contains 200 observations from an N(0,1) distribution.

Similarly, create a 200-element vector, epsilon (ϵ), drawn from an N(0,0.25) distribution. This is the irreducible error.

b)   Create the response data using the following relationship:
Y=−1+0.5X+ϵ

Fit a linear regression of Y on X. Display the summary statistics and interpret the diagnostic plots. In a single graph, draw a scatter diagram of the Y and X values together with the fitted line.

c)   Now fit a quadratic model by adding X² (X squared) to the model. Discuss whether the fit improves. Draw a scatter diagram and fitted curve as in the previous part. Interpret the diagnostic plots.

d)   Using the sample() function create train and test sets just like we did in class. Obtain predicted values (from the test set) for the linear and quadratic fits. Compare their MSEs. Which model is better in terms of predictions?

Solutions

Expert Solution

#PART A
#Creating x and epsilon vectors
```{r setup}
set.seed(1)
x=rnorm(200,mean=0,sd=1)
#Here 0.25 is read as the standard deviation; if it denotes the variance, use sd=sqrt(0.25)
e=rnorm(200,mean=0,sd=0.25)
```

#PART B
#Calculating y and putting x and y in a dataframe
```{r}
y=-1+0.5*x+e
data = data.frame(X=x,Y=y,stringsAsFactors = FALSE)
head(data)  #preview the first rows instead of printing all 200
```

#Running linear regression
```{r}
model <- lm(formula=Y~X,data=data)
print(model)
summary(model)
plot(model)
```
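
Retyping estimates from the printed summary invites transcription errors. The chunk below is a minimal sketch, assuming the same simulation as above, that reads the coefficients and R² straight from the fitted object and arranges the four diagnostic plots in one 2x2 panel. The names `dat`, `fit`, `est` and `r2` are illustrative and not part of the original solution.

```{r}
# Re-create the simulated data so this chunk runs on its own
set.seed(1)
x <- rnorm(200, mean = 0, sd = 1)
e <- rnorm(200, mean = 0, sd = 0.25)
dat <- data.frame(X = x, Y = -1 + 0.5 * x + e)

fit <- lm(Y ~ X, data = dat)

# Named vector of estimates: "(Intercept)" and "X"
est <- coef(fit)
# Proportion of the variance in Y explained by X
r2 <- summary(fit)$r.squared
print(est)
print(r2)

# Show all four diagnostic plots in one 2x2 panel, then restore the layout
old_par <- par(mfrow = c(2, 2))
plot(fit)
par(old_par)
```

Because the data were simulated with a linear mean and constant-variance normal errors, the residuals-vs-fitted plot would be expected to show no systematic pattern and the normal Q-Q plot to lie close to the reference line.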

#Scatter plot with fitted line
```{r}
plot(x, y, pch = 16, cex = 1.3, col = "blue", main = "Scatter Plot and Regression Line", xlab = "X", ylab ="Y")
#Draw the fitted line directly from the model object instead of retyping the estimates
abline(model)
```

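#PART C
#Fitting a quadratic model to the same data by adding the squared term
```{r}
quadratic = lm(Y ~ X + I(X^2), data=data)
summary(quadratic)
plot(quadratic)

#Compare with the linear fit. R square cannot decrease when a term is added,
#so judge improvement by adjusted R square and by the p-value of the I(X^2) term.
#The data were generated from a linear relationship, so no real improvement is expected.
anova(model, quadratic)
```

#Scatter plot with fitted curve for quadratic model
```{r}
plot(x, y, pch = 16, cex = 1.3, col = "blue", main = "Scatter Plot and Quadratic Fit", xlab = "X", ylab ="Y")
#Predict on a sorted grid of X values so the curve is drawn from left to right
x_grid = data.frame(X=seq(min(x), max(x), length.out=200))
lines(x_grid$X, predict(quadratic, newdata=x_grid), lwd=2)
```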
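
Whether the squared term improves the fit can be checked with a nested-model F-test rather than by comparing R² alone. The following is a minimal sketch under the same simulation settings as Part A. The names `dat`, `fit_lin`, `fit_quad`, `cmp` and `adj` are illustrative, not from the original solution.

```{r}
# Re-create the simulated data so this chunk runs on its own
set.seed(1)
x <- rnorm(200, mean = 0, sd = 1)
e <- rnorm(200, mean = 0, sd = 0.25)
dat <- data.frame(X = x, Y = -1 + 0.5 * x + e)

fit_lin <- lm(Y ~ X, data = dat)
fit_quad <- lm(Y ~ X + I(X^2), data = dat)

# F-test of the null hypothesis that the coefficient on X^2 is zero
cmp <- anova(fit_lin, fit_quad)
print(cmp)

# Adjusted R^2 penalizes the extra parameter, unlike plain R^2
adj <- c(linear = summary(fit_lin)$adj.r.squared,
         quadratic = summary(fit_quad)$adj.r.squared)
print(adj)
```

A large p-value in the second row of the `anova()` table would indicate that the quadratic term adds nothing beyond the linear fit, which is what the data-generating equation suggests.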

#PART D
```{r}
#Split once so both models use the same train and test sets
set.seed(1)
train_index <- sample(1:nrow(data), 0.8 * nrow(data))
test_index <- setdiff(1:nrow(data), train_index)
train <- data[train_index, ]
test <- data[test_index, ]

#Linear fit on the training set, MSE on the test set
model_lin = lm(Y ~ X, data=train)
prediction_lin = predict(model_lin, newdata=test)
#MSE is the mean of the squared errors: square each error first, then average over the test set
mse_lin = mean((test$Y - prediction_lin)^2)
print(mse_lin)
```

```{r}
#Quadratic fit on the same training set, MSE on the same test set
model_quad = lm(Y ~ X + I(X^2), data=train)
prediction_quad = predict(model_quad, newdata=test)
mse_quad = mean((test$Y - prediction_quad)^2)
print(mse_quad)

#The model with the smaller test MSE predicts better.
#Since Y was generated from a linear relationship, the two MSEs are expected to be close.
```
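
A single train/test split can favor either model by chance. As a hedged sketch that goes slightly beyond what the exercise asks, the comparison can be repeated over many random 80/20 splits and the test MSEs averaged. The helper `test_mse()` and the choice of 50 repetitions are assumptions made for illustration only.

```{r}
# Re-create the simulated data so this chunk runs on its own
set.seed(1)
x <- rnorm(200, mean = 0, sd = 1)
e <- rnorm(200, mean = 0, sd = 0.25)
dat <- data.frame(X = x, Y = -1 + 0.5 * x + e)

# Hypothetical helper: fit `form` on the training rows, return the MSE on the rest
test_mse <- function(form, dat, train_index) {
  fit <- lm(form, data = dat[train_index, ])
  test <- dat[-train_index, ]
  mean((test$Y - predict(fit, newdata = test))^2)
}

# Repeat the 80/20 split 50 times; each row holds both models' MSE for one split
res <- t(replicate(50, {
  idx <- sample(nrow(dat), 0.8 * nrow(dat))
  c(linear = test_mse(Y ~ X, dat, idx),
    quadratic = test_mse(Y ~ X + I(X^2), dat, idx))
}))
print(colMeans(res))
```

Both models are scored on the same held-out rows within each split, so the two columns of `res` are directly comparable row by row.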

